Data Science Libraries in a Docker Container Hosted on AWS

In this post, we install data science libraries in a Docker container running on AWS. This makes the libraries accessible from anywhere in the world without downloading and installing Python, runtime libraries, and the Jupyter package on the local machine. The walk-through is for the Windows platform.

AWS

We need an AWS account to set up an EC2 instance. If you do not have an account, create one before you begin. To connect securely to a running EC2 instance, we need a key pair. We will create the key pair in the EC2 console and download its private key. From the EC2 Dashboard, access Key Pairs and then click “Create key pair”.

Fill in the details and download the .ppk file for use with PuTTY.

We also need to create a new Security Group. From the EC2 Dashboard, access the Security Group pane and click “Create Security Group”. Give the security group a name like “docker-jupyter-nginx” and a description. Access the Inbound tab and configure the following security rules:
• SSH: Port Range: 22, Source: Anywhere. This allows us to connect to the server via SSH.
• HTTP: Port Range: 80, Source: Anywhere. This allows web traffic to reach the instance over HTTP.
• HTTPS: Port Range: 443, Source: Anywhere. This allows web traffic to reach the instance over HTTPS.
• Custom TCP Rule: Port Range: 2376, Source: Anywhere. This is the conventional TLS port of the Docker daemon's remote API, used by tools such as Docker Machine to manage Docker on the instance. It is not required for pulling images from Docker Hub, which only needs outbound access.
• Custom TCP Rule: Port Range: 8888, Source: Anywhere. This is the port on which Jupyter Notebook will run. Setting the source to Anywhere lets us access Jupyter Notebook from any IP address. However, we can also limit access by entering a custom IP into the source.
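
The same inbound rules can also be scripted. The following is a minimal sketch, assuming the AWS CLI is installed and configured and that the group is created in the account's default VPC. The SOURCE_CIDR and DRY_RUN variables and the run and open_ports helpers are hypothetical names introduced for illustration. By default the script only prints the commands it would issue.

```shell
#!/bin/sh
# Sketch: mirror the inbound rules listed above with the AWS CLI.
# Assumptions (not from the original post): AWS CLI configured, default VPC.

GROUP_NAME="docker-jupyter-nginx"
# "Anywhere" is 0.0.0.0/0; a single address would be written like 203.0.113.7/32.
SOURCE_CIDR="${SOURCE_CIDR:-0.0.0.0/0}"
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, otherwise execute it.
# The printed form drops quoting, so it is for reading, not for copy-paste.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# open_ports: create the group, then add one TCP ingress rule per port.
open_ports() {
  run aws ec2 create-security-group \
    --group-name "$GROUP_NAME" \
    --description "Docker and Jupyter access"
  for port in 22 80 443 2376 8888; do
    run aws ec2 authorize-security-group-ingress \
      --group-name "$GROUP_NAME" \
      --protocol tcp --port "$port" --cidr "$SOURCE_CIDR"
  done
}

open_ports
```

Running the script with DRY_RUN=0 would execute the commands instead of printing them. Setting SOURCE_CIDR to a /32 address restricts every rule to a single IP, which corresponds to entering a custom IP as the source.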

To create a new instance, start from the EC2 Dashboard and click “Launch Instance”.

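In the launch wizard, choose an Ubuntu Server image (the later steps log in as the ubuntu user) and assign the security group and key pair created above. Then click “Review and Launch”.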

Click the “View Instances” tab to see the new instance running. Take note of the instance's public DNS name or public IP address. The SSH login has the format user_name@public_dns_name.

Windows

To access the EC2 instance from Windows, download and install PuTTY, an SSH and Telnet client for Windows. To configure the new EC2 instance for Docker, we SSH into the instance with PuTTY. Enter the instance's public DNS name or IP address into the Host Name field.

Click the “+” button next to SSH in the Category tree to expand it. Then click “Auth” and select the private key file (i.e. the .ppk file downloaded from AWS).

Click “Open” and, when prompted, enter the user name “ubuntu”.

Docker

Next, we install and configure Docker using the install script (get.docker.com) provided by the Docker team. Execute this statement:

curl -sSL https://get.docker.com/ | sh 

Add the ubuntu user to the docker group so that the ubuntu user can issue commands to docker without sudo. Execute this statement:

sudo usermod -aG docker ubuntu

Next, reboot to let the changes take effect. Execute this statement:

sudo reboot

Reconnect to the system and check the docker version by executing this statement:

docker -v
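
Before moving on, it can help to confirm that the group change took effect. The following is a minimal sketch; has_group is a hypothetical helper introduced here, and it only inspects the group list that id reports for the current user.

```shell
#!/bin/sh
# has_group LIST NAME: succeed if NAME is one of the space-separated words in LIST.
has_group() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# id -nG prints the current user's group names on one line.
if has_group "$(id -nG)" docker; then
  echo "docker group: yes (docker should work without sudo)"
else
  echo "docker group: no (reconnect after the reboot, or use sudo)"
fi
```

Padding both the list and the name with spaces keeps the match to whole words, so a group with a longer name that merely contains "docker" is not mistaken for the docker group.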

Data Science Libraries

We create a containerized data science environment using Jupyter Docker Stacks. Jupyter Docker Stacks are a set of ready-to-run Docker images containing Jupyter applications and interactive computing tools.  We begin by pulling jupyter/scipy-notebook from Project Jupyter’s public Docker Hub account.

The jupyter/scipy-notebook image contains:

  • Minimally-functional Jupyter Notebook server
  • Miniconda Python 3.6
  • Pandoc and TeX Live for notebook document conversion
  • git, emacs, jed, nano, and unzip
  • pandas, numexpr, matplotlib, scipy, seaborn, scikit-learn, scikit-image, sympy, cython, patsy, statsmodels, cloudpickle, dill, numba, bokeh, sqlalchemy, hdf5, vincent, beautifulsoup, protobuf, and xlrd packages
  • ipywidgets for interactive visualizations in Python notebooks
  • Facets for visualizing machine learning datasets

Execute this statement:

docker pull jupyter/scipy-notebook

Once we have pulled the image, it is available in the local Docker image cache. To run the Jupyter container, execute this statement:

docker run -d --name jupyter -p 8888:8888 jupyter/scipy-notebook

The -d flag runs the container in the background, and --name jupyter gives it the name that the docker logs command further down refers to. The -p flag links port 8888 on the EC2 instance to port 8888 in the Docker container, on which the Jupyter Notebook server is running.

The Jupyter Notebook server writes an authentication token to its log output. We need this token to access the Notebook server through a browser.

We can access the Jupyter Notebook in a browser at the instance's public address and port 8888, in the format http://public_dns_name:8888/

The Jupyter server prompts for a token so that not just anyone can access the environment. To find the token, print the container's logs. Execute this statement:

docker logs jupyter
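
The log output can be long, so it may be convenient to filter out just the token. The following is a minimal sketch, assuming the log contains a URL ending in ?token= followed by a hexadecimal string, as the Jupyter Notebook server prints at startup. The extract_token helper is a hypothetical name introduced here, and the sample token is made up.

```shell
#!/bin/sh
# extract_token: read log text on stdin and print the first token value found.
extract_token() {
  sed -n 's/.*token=\([0-9a-f][0-9a-f]*\).*/\1/p' | head -n 1
}

# Demonstration with a made-up log line (not a real token).
printf '%s\n' '    http://127.0.0.1:8888/?token=0123abcd' | extract_token

# On the EC2 instance, the helper would be fed the real logs.
# Jupyter typically logs to stderr, so 2>&1 merges it into the pipe:
#   docker logs jupyter 2>&1 | extract_token
```

The startup log usually repeats the URL for more than one host name, which is why the helper keeps only the first match.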

We can copy the token from PuTTY with the mouse. Click the left mouse button in the terminal window and drag to select the text. When we let go of the button, the text is automatically copied to the clipboard. Paste the token into the Jupyter login page in the browser to log in. We have successfully installed data science libraries in a Docker container on AWS. This allows us to access our stack from anywhere.

Example

The following is an example of a notebook running on AWS.
