Data science in the Google Cloud
Anyone analysing big data (a buzzword, used here to mean data too big to load into memory) will soon come to realize that processing such data requires a lot of computational resources. During my PhD I mainly worked with the local high-performance computer (HPC) at the University of Sussex. A couple of years into my PhD, I increasingly realized that our little HPC suffers from the tragedy of the commons, with more and more people requesting computation time on a few available nodes. That, together with the limited flexibility for running customized code (no root access, outdated modules and libraries, little space on the home drive to set up virtual environments, etc.), made me quite frustrated and willing to switch to the “Cloud” for accessing computing resources. \
Cloud computing is well established these days, but mainly concentrated in the hands of three leading US firms. As far as I am aware, one basically has to choose between Amazon AWS, Microsoft Azure and Google Cloud. Each has its own benefits, and I leave it to the reader to search elsewhere for information on which one to choose. \
I picked the Google Cloud free trial offer for the following reasons:
They give away $300 of credit. (I think Microsoft and Amazon offer something similar, though.)
The free trial period lasts 12 months, after which it simply runs out without incurring further costs. Furthermore, a free-use contingent remains available afterwards. For instance, you can run an f1-micro VM for some time each month at no charge.
I am increasingly using Google’s Earth Engine platform and plan to use Google Cloud Storage to enhance my workflow.
Private Git hosting with 1 GB of storage (now especially useful since competitor Microsoft has acquired GitHub).
That being said, I have also heard great things about AWS and Azure and might try them out at a later point.
So here is how I started. My goal was to first get familiar with computing in the cloud and to try installing some standard tools.
First, I fired up a micro-instance virtual machine (which, in the Google Cloud, you can run for over 700 hours each month for free).
Via the SSH button you can log directly into your cloud instance, either in the browser or in another SSH client of your choosing.
Each VM can also be selected and started, stopped or completely reset on this screen (also via the "…" button!).
I’m going to install some basic data-science tools. Here is the entire thing as a bash script, to be executed on the next, bigger VM at a later stage ;-)
    # Update the package lists and upgrade everything, then clean up
    sudo apt-get update
    sudo apt-get -y upgrade
    sudo apt-get -y autoremove

    # Install some necessary tools
    sudo apt-get -y install bzip2
    sudo apt-get -y install screen

    # Make download folder
    mkdir -p downloads
    cd downloads

    # Download Anaconda
    wget https://repo.continuum.io/archive/Anaconda2-5.2.0-Linux-x86_64.sh

    # Install non-interactively (-b accepts the license, -u updates any previous installation)
    bash Anaconda2-5.2.0-Linux-x86_64.sh -b -u -p "$HOME/anaconda2"
    # Single quotes, so that $HOME and $PATH are expanded when .bashrc is read, not now
    echo 'export PATH="$HOME/anaconda2/bin:$PATH"' >> ~/.bashrc

    # Reload configuration
    source ~/.bashrc

    # Install R
    # Add the CRAN repository for Debian stretch and its key, then install
    echo "deb http://cran.rstudio.com/bin/linux/debian stretch-cran35/" | sudo tee -a /etc/apt/sources.list
    sudo apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'
    sudo apt-get update
    sudo apt-get install -y r-base r-base-core r-base-dev
    sudo apt-get install -y libatlas3-base

    # Also install RStudio Server
    sudo apt-get -y install psmisc libssl-dev libcurl4-openssl-dev libssh2-1-dev
    wget https://download2.rstudio.org/rstudio-server-stretch-1.1.453-amd64.deb
    sudo dpkg -i rstudio-server-stretch-1.1.453-amd64.deb

    # Also install Julia for later
    sudo apt-get -y install julia
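After a long install script like the one above, it is handy to check quickly which tools actually ended up on the `PATH`. The following is only a minimal sketch: `check_tools` is a hypothetical helper name, and the list of commands is my assumption about what the script provides (note that `conda` and `jupyter` only show up once the new `PATH` from `~/.bashrc` is active).

```shell
# Report which of the given commands are available on the PATH.
# Prints one line per tool and returns non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "missing  $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed list of commands provided by the install script; adjust as needed.
check_tools conda jupyter R julia screen \
    || echo "Some tools are missing; check the install output."
```

On a machine where the script ran through, every line should start with `ok`; anything marked `missing` points to the step that needs another look.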
Note to myself: in the future it might be easier to configure an analysis-ready Docker image. Something to do later… \
Now we create a new configuration for a Jupyter notebook and start it on the VM.
    # Create config
    jupyter notebook --generate-config

    # Add these settings to the config
    echo "c = get_config()" >> ~/.jupyter/jupyter_notebook_config.py
    echo "c.NotebookApp.ip = '*'" >> ~/.jupyter/jupyter_notebook_config.py
    echo "c.NotebookApp.open_browser = False" >> ~/.jupyter/jupyter_notebook_config.py
    echo "c.NotebookApp.port = 8177" >> ~/.jupyter/jupyter_notebook_config.py

    # Set a password
    jupyter notebook password

    # Start up
    jupyter-notebook --no-browser --port=8177
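One small caveat with the `echo … >>` lines above: they append to the config file every time they are run, so running the setup twice leaves duplicate settings behind. A minimal sketch of an alternative, assuming it is fine to overwrite the generated file, is to write all four settings in one go with a here-document. `write_jupyter_config` is a hypothetical helper name, not part of Jupyter; the directory and port arguments default to the values used above.

```shell
# Write the notebook settings in a single step (overwrites the file).
# Arguments (both optional): config directory, port.
write_jupyter_config() {
    config_dir="${1:-$HOME/.jupyter}"
    port="${2:-8177}"
    mkdir -p "$config_dir"
    # The here-document body must start in column 0, because the result is
    # a Python file and leading whitespace would be an indentation error.
    cat > "$config_dir/jupyter_notebook_config.py" <<EOF
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = $port
EOF
}

# Usage (commented out so that nothing is overwritten by accident):
# write_jupyter_config                      # ~/.jupyter, port 8177
# write_jupyter_config "$HOME/.jupyter" 9000
```

Running it again simply rewrites the same four lines instead of piling up copies.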
In theory, the Jupyter notebook can now be viewed in a browser. However, we first have to get access to the VM's network inside the Google Cloud. For this we will use the Google Cloud SDK, which you need to install on your local computer.
Then, with the Google Cloud SDK installed, execute:
    # After installation: authenticate
    gcloud init

    # Then open an SSH tunnel that forwards local port 8177 to port 8177 on the VM
    # (-L is a plain port forward; -D would open a SOCKS proxy instead). For me that is:
    gcloud compute ssh --zone=us-central1-c --ssh-flag="-L 8177:localhost:8177" --ssh-flag="-N" --ssh-flag="-n" wolkentest

    # If you have never done this before, you will need to create a public/private SSH key
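For the browser to reach the notebook directly at localhost:8177, the tunnel has to forward that local port to the same port on the VM. That is what ssh's `-L` option does, whereas `-D` opens a SOCKS proxy that the browser would first have to be configured to use. The sketch below only prints the corresponding `gcloud` command, so that instance name, zone and port can be swapped easily. `tunnel_command` is a hypothetical helper, and the values are the example ones from this post.

```shell
# Print (rather than run) the gcloud command for a local port forward.
# Arguments: instance name, zone, port (the port defaults to 8177).
tunnel_command() {
    instance="$1"
    zone="$2"
    port="${3:-8177}"
    printf 'gcloud compute ssh %s --zone=%s --ssh-flag="-N" --ssh-flag="-L %s:localhost:%s"\n' \
        "$instance" "$zone" "$port" "$port"
}

# Example values from this post; replace them with your own.
tunnel_command wolkentest us-central1-c 8177
```

To open the tunnel for real, paste the printed line into a terminal, or run `eval "$(tunnel_command wolkentest us-central1-c)"`.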
Now that you have created an SSH tunnel, you can just open your local browser (e.g. Chrome or similar), navigate to localhost:8177, and you should see your Jupyter notebook. Happy computing!
At the end, make sure that the VM is turned off; otherwise it will keep generating costs!
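Stopping the VM does not require the web console; it can also be done from the local machine with the Cloud SDK. A minimal sketch, assuming the instance name and zone used earlier: `stop_vm` and the `DRY_RUN` switch are hypothetical helpers of mine, while `gcloud compute instances stop` is the actual SDK command.

```shell
# Stop an instance, or only print the command when DRY_RUN=1 is set.
# Arguments: instance name, zone.
stop_vm() {
    instance="$1"
    zone="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "gcloud compute instances stop $instance --zone=$zone"
    else
        gcloud compute instances stop "$instance" --zone="$zone"
    fi
}

# Dry run with the example values from this post; drop DRY_RUN=1 to really stop the VM.
DRY_RUN=1 stop_vm wolkentest us-central1-c
```

Afterwards, `gcloud compute instances list` should show the instance with the status `TERMINATED`.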