Category Archives: Computer Servicing

Docker – VirtualBox Defaults on OS X

I noticed a discrepancy between the system info detected natively on Roadrunner (an Apple Xserve) and what was shown when I started a Docker container.

Here’s what Roadrunner’s system info looks like outside of a Docker container:

[Screenshot: Roadrunner system info]

However, here’s what is seen when running a Docker container:

[Screenshot: system info as seen from within the Docker container]

It’s important to notice that the Docker container only sees 2 CPUs. Ideally, the Docker container would see that this system has 8 cores available. By default, however, it does not. In order to remedy this, the user has to adjust settings in VirtualBox. VirtualBox is the virtualization software that gets installed with the Docker Toolbox for OS X. As it turns out, Docker runs within a VirtualBox virtual machine, but this is not really transparent to a beginner Docker user on OS X.

To change the way VirtualBox (and, in turn, Docker) can access the full system hardware, you must launch the VirtualBox application (if you installed Docker using Docker Toolbox, you should be able to find this in your Applications folder). Once you’ve launched VirtualBox, you’ll have to turn off the virtual machine that’s currently running. Once that’s been accomplished, you can make changes and then restart the virtual machine.
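
For what it’s worth, the same change can be made from the command line instead of the VirtualBox GUI. A minimal sketch, assuming your Docker Toolbox VM has the default name (“default”); adjust the CPU/RAM values to your hardware:

    # stop the Docker Toolbox VM before changing its settings
    docker-machine stop default

    # give the VM 8 CPUs and 24GB of RAM (24576MB)
    VBoxManage modifyvm default --cpus 8 --memory 24576

    # restart the VM and point the shell's Docker client back at it
    docker-machine start default
    eval "$(docker-machine env default)"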


Shutdown VirtualBox machine before you can make changes:

[Screenshot: shutting down the VirtualBox machine]

Here are the default CPU settings that VirtualBox is using:

[Screenshot: default VirtualBox CPU settings]

Maxed out the CPU slider:

[Screenshot: CPU slider maxed out]

Here are the default RAM settings that VirtualBox is using:

[Screenshot: default VirtualBox RAM settings]

Changed RAM slider to 24GB:

[Screenshot: RAM slider set to 24GB]

Now, let’s see what the Docker container reports for system info after making these changes:

[Screenshot: Docker container system info after the changes]

Looking at the CPUs now, we see it has 8 listed (as opposed to only 2 initially). I think this means that Docker now has full access to the hardware on this machine.
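
A quick way to double-check what a container sees from the command line (a sketch; any Linux image should work in place of ubuntu):

    # report the number of CPUs visible inside a throwaway container
    docker run --rm ubuntu nproc

    # report the total memory visible inside a throwaway container
    docker run --rm ubuntu grep MemTotal /proc/meminfo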

This situation is a weird shortcoming of Docker (and/or VirtualBox). Additionally, I think this issue might only exist on the OS X and Windows versions of Docker, since they require the installation of the Docker Toolbox (which installs VirtualBox). I don’t think Linux installations suffer from this issue.

Docker – Improving Roberts Lab Reproducibility

In an attempt at furthering our lab’s ability to maximize reproducibility, I’ve been working on developing an all-encompassing Docker image. Docker is a containerization platform; a container behaves like a self-contained computer running within your computer. For the Roberts Lab, the advantage of using Docker is that images can be customized to run a specific suite of software, and those images can then be used by anyone in the lab (assuming they can run Docker on their particular operating system). In turn, if everyone is using the same Docker image (i.e. the same environment with all the same software), then we should be able to reproduce data analyses more reliably, because there won’t be differences between the software versions people are using. Additionally, using Docker greatly simplifies the setup of new computers with the requisite software.

I’ve put together a Dockerfile (a text file/script containing the instructions Docker uses to retrieve software and build an image) which will automatically build a Docker image (i.e. virtual computer) containing all of the bioinformatics software our lab normally uses. This has been a side project while I wait for Stacks analyses to complete (or fail, depending on the day), and it’s finally usable! The image built from this Dockerfile will even let the user run RStudio and/or Jupyter Notebooks in their browser (I’m excited about this part)!

Here’s the current list of software that will be installed:

  • bedtools 2.25.0
  • bismark 0.15.0
  • blast 2.3.0+
  • bowtie2 2.2.8
  • bsmap 2.90
  • cufflinks 2.1.1
  • fastqc 0.11.5
  • fastx_toolkit 0.0.13
  • R 3.2.5
  • RStudio Server 0.99
  • pyrad 3.0.66
  • samtools 0.1.19
  • stacks 1.40
  • tophat 2.1.1
  • trimmomatic 0.36
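
To give a sense of the format, here’s a minimal sketch of what a Dockerfile looks like (this is not Dockerfile.bio itself, just an illustration using one of the packages listed above):

    # start from a base Linux image
    FROM ubuntu:14.04

    # install the tools needed to download and compile software
    RUN apt-get update && apt-get install -y build-essential wget

    # download, build, and install bedtools 2.25.0
    RUN wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz \
        && tar -xzf bedtools-2.25.0.tar.gz \
        && cd bedtools2 \
        && make \
        && cp bin/* /usr/local/bin/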

In order to set this up, you need to install Docker and download the Dockerfile (Dockerfile.bio) I’ve created.
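
From there, building the image is a single command. A sketch (the image name “bio” is just a placeholder; run this in the directory containing the Dockerfile):

    # build an image from the downloaded Dockerfile and tag it for easy reference
    docker build --file Dockerfile.bio --tag bio .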

I’ve written a bit of a user guide (specific to this Dockerfile) here to get people started: docker.md

The user guide explains a bit about how all of this works and progresses from a “basic” this-is-how-to-get-started-with-Docker introduction to an “advanced” description of how to map ports, mount local volumes in your containers, and start/attach previously used containers.
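
For flavor, here’s roughly what those “advanced” pieces look like on the command line (a sketch; the image name “bio”, the container name “mycontainer”, and the local path are placeholders, and 8787 is RStudio Server’s default port):

    # map RStudio Server's port inside the container to the same port on the host
    docker run -p 8787:8787 -it bio

    # mount a local data directory inside the container at /data
    docker run -v /Users/sam/data:/data -it bio

    # restart and reattach a previously used container
    docker start mycontainer
    docker attach mycontainer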

The next major goal I have with this Docker project is to get the R kernel installed for Jupyter Notebooks. Currently, the Jupyter Notebook installation is restricted to the default Python 2 kernel.

Additionally, I’d like to improve the usability of the Docker image by setting up aliases in the image. Meaning, a user who wants to use the bowtie2 program could just type “bowtie2”. Currently, the user has to type “bowtie2_2.2.8” (although, with this being in the system PATH and tab-completion available, it’s not that big of a deal), which is a bit ugly.
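
Setting that up should amount to a line like the following in the Dockerfile (hypothetical; the aliases could also be written into the image’s .bashrc by other means):

    # hypothetical Dockerfile line: let users type "bowtie2" instead of the versioned name
    RUN echo "alias bowtie2='bowtie2_2.2.8'" >> /root/.bashrc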

For some next-level stuff, I’d also like to set up all Roberts Lab computers to automatically launch the Docker image when the user opens a terminal. This would greatly simplify things for new lab members: they wouldn’t have to deal with the various Docker commands needed to start a Docker container. Instead, their terminal would put them directly into the container and the user would be none the wiser. They’d be reproducibly conducting data analysis without even having to think about it.
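
The sketch for that is simple, though the details will need testing (hypothetical; assumes the image is named “bio” and Docker is already running on the machine):

    # hypothetical addition to ~/.bash_profile on each lab computer:
    # opening a terminal drops the user straight into the lab's Docker container
    docker run -it bio /bin/bash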

Computer Setup – Cluster Node003 Conversion

Here’s an overview of some of the struggles getting node003 converted/upgraded to function as an independent computer (as opposed to a slave node in the Apple computer cluster).

  • 6TB HDD
  • Only 2.2TB recognized when connected to Hummingbird via FireWire (internet suggests that is the max for the Xserve; USB might recognize the full drive). Hummingbird is a converted Xserve running Mavericks
  • Reformatted on different Mac and full drive size recognized
  • Connected to Hummingbird (via USB) and full 6TB recognized
  • Connected to Mac Mini to install OS X
  • Tried installing OS X 10.8.5 (Mountain Lion) via CMD+r at boot, but failed partway through installation
  • Tried and couldn’t reformat drive through CMD+r at boot with Disk Utility
  • Broken partition tables identified on Linux; used GParted to establish a new partition table (command-line equivalent sketched after this list). Back on the Mac Mini, the OS X (Mountain Lion) install worked
  • Upgraded to OS X 10.11.5 (El Capitan)
  • Inserted drive to Mac cluster node003 – wouldn’t boot all the way – Apple icon, progress bar > Do Not Enter symbol
  • Removed drive, put original back in, connected 6TB HDD via USB, but booting from USB not an option (when booting and holding Option key)
  • Probably due to node003 being part of cluster – reformatted original node003 drive with clean install of OS X Server.
  • Booting from USB now an option and worked with 6TB HDD!
  • Put 6TB HDD w/El Capitan in internal sled and it wouldn’t boot! Apple icon, progress bar > Do Not Enter symbol
  • Installed OS X 10.11.5 (El Capitan) on old 1TB drive and inserted into node003 – worked perfectly!
  • Will just use 1TB boot drive and figure out another use for 6TB HDD
  • Renamed node003 to roadrunner
  • Current plan is to upgrade from 12GB to 48GB of RAM and then automate moving data off this drive to long-term storage on Owl (Synology server).
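
For reference, the partition table fix above was done in GParted’s GUI, but the command-line equivalent looks something like this (a sketch; /dev/sdX stands in for the actual device, and a GPT label is what allows drives larger than 2TB to be fully addressed):

    # WARNING: this destroys the existing partition table (and, effectively, all data on the drive)
    # create a fresh GPT partition table on the 6TB drive
    sudo parted /dev/sdX mklabel gpt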

Software Install – samtools-0.1.19 and stacks-1.37

Getting ready to analyze our Ostrea lurida genotype-by-sequencing data and wanted to use the Stacks software.

We have an existing version of Stacks on Hummingbird (the Apple server blade I will be running this analysis on), but I figured I might as well install the latest version (stacks-1.37).

Additionally, Stacks requires samtools-0.1.19 to run, which we did NOT have installed.

I tracked all of this in the Jupyter (IPython) notebook below.

Due to permissions issues during installation, I frequently had to leave the Jupyter notebook to run “sudo” in bash. As such, the notebook is messy, but does outline the necessary steps to get these two programs installed.
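
The notebook below has the full details, but the broad strokes of the two installs look like this (a sketch with hypothetical paths; samtools-0.1.19 has no install target in its Makefile, hence the manual copy):

    # build samtools-0.1.19 and copy the binary somewhere on the PATH
    cd samtools-0.1.19
    make
    sudo cp samtools /usr/local/bin/

    # build and install stacks-1.37
    cd ../stacks-1.37
    ./configure
    make
    sudo make install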

Jupyter notebook: 20160406_STACKS_install.ipynb

NBviewer: 20160406_STACKS_install.ipynb

Data Analysis – Identification of duplicate files on Eagle

Recently, we’ve been bumping into our storage limit on Eagle (our Synology DS413):

[Screenshot: Eagle storage usage]

Being fairly certain that a significant number of large datasets are duplicated throughout Eagle, I ran a Linux program called “fslint”. It searches for duplicate files based on a few parameters and is smart enough to compare files with different filenames that share the same contents!

I decided to check for duplicate files in the Eagle/archive folder and the Eagle/web folder. Initially, I tried searching for duplicates across all of Eagle, but after a week of running I got tired of waiting for results and ran the analysis on those two directories independently. As such, there is a possibility that there are more duplicates (consuming even more space) across the remainder of Eagle that have not been identified. However, this is a good starting point.
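
For anyone wanting to repeat this, fslint’s duplicate finder is also available as a command-line tool called findup, which makes long-running scans like this scriptable (a sketch; the path is the default Ubuntu install location and the target directory is a stand-in):

    # find duplicate files (matched by content checksums, not filenames) in a directory tree
    /usr/share/fslint/fslint/findup /path/to/Eagle/archive > archive_duplicates.txt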

Here are the two output files from the fslint analysis:

[Links to the two fslint output files]

To get a summary of the fslint output, I tallied the total number of duplicate files that were >100MB in size. This was performed in a Jupyter notebook (see below):
Notebook Viewer: 20160114_wasted_space_synologies.ipynb
Jupyter (IPython) Notebook File: 20160114_wasted_space_synologies.ipynb
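
The notebook is the authoritative record, but as a rough sketch of the tally: given a text file listing one duplicate file path per line, something like this sums the space consumed by files over 100MB:

    # sum the sizes of listed files larger than 100MB (104857600 bytes)
    while read -r f; do
        stat -c %s "$f"
    done < duplicates.txt | awk '$1 > 104857600 { total += $1 } END { printf "%.1f GB\n", total / 1024^3 }'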


Here are the cleaned output files from the fslint analysis:

[Links to the cleaned fslint output files]

Summary

Duplicate files (each >100MB in size) are consuming at least 730GB!

Since the majority of these files exist in the Eagle/web folder, careful consideration will be needed in determining which duplicates (if any) can be deleted, since it’s highly possible that notebooks link to some of the files. Regardless, this analysis shows just how much space is being consumed by large, duplicate files; something to consider for future data handling/storage/analysis with Eagle.

Data Storage – Synology DX513

We’re running a bit low on storage on Owl (Synology DS1812+) and will be receiving a ton of data in the next few months, so we purchased a Synology DX513. It’s an expansion unit designed specifically for seamlessly expanding the existing storage volume on Owl.

Installed 5 x 8TB Seagate HDDs and connected to Owl with the supplied eSATA cable.

Now, we just need to wait (possibly days) for the full expansion to be completed.

Uninterruptible Power Supplies (UPS)

A new UPS we installed this week for our qPCR machine (Opticon2 – BioRad) to handle power surges and power outages doesn’t seem to be working properly. With the qPCR machine (and computer and NanoDrop1000) plugged into the “battery” outlets on the UPS, this is what happens when the Opticon goes through a heating cycle:

The UPS becomes overloaded when the Opticon is in a heating cycle.


And, sometimes, that results in triggering a fault, shutting everything off in the middle of a qPCR run:

Fault message indicating unit overload.


This is supremely lame because having a battery backup is a great way to prevent the qPCR machine from shutting off when a power outage occurs!


I switched the Opticon (and computer and NanoDrop1000) to the outlets that are solely for surge protection. Check out what happens when I run the qPCR machine now:

Opticon plugged in to surge protection outlet while in heating cycle. Notice that output load is 0%.


So, I guess we’ll settle for at least having the surge protection aspect of things.


While handling this UPS issue, I realized that the two Synology servers we have possess a built-in UPS monitor. So, I connected a USB cable from each UPS to the server it powers and enabled UPS shutdown in the Synology DiskStation Manager (DSM):

[Screenshot: Eagle DSM UPS settings]

[Screenshot: Owl DSM UPS settings]

Now, both Synology units will enter Safe Mode when the UPS they’re connected to reaches a low battery status. This will help minimize data loss/corruption during the next extended power outage we experience.

Server Email Notifications Fix – Eagle

The system was previously set to use Steven’s Comcast SMTP server. Sending a test email from Eagle failed, indicating authentication failure. I changed this to use the University of Washington’s email server for outgoing messages. Here’s how…

In the Synology DiskStation Manager (DSM):

Control Panel > Notifications

  • Service provider: Custom SMTP Server
  • SMTP server: smtp.washington.edu
  • SMTP port: 587
  • Username: myUWnetID@uw.edu
  • Password: myUWpassword
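
Before relying on DSM’s test email, the connection can be sanity-checked from any machine with OpenSSL (this just confirms the server answers on the submission port and offers STARTTLS):

    # verify that the UW SMTP server accepts STARTTLS connections on port 587
    openssl s_client -starttls smtp -connect smtp.washington.edu:587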