R Studio | Sam's Notebook

    # Server Configuration File
    # Use Microsoft R Open instead of default R version.
    # Comment out and restart R Studio Server (sudo rstudio-server restart)
    # to restore default R version.
    rsession-which-r=/opt/microsoft/ropen/3.4.3/lib64/R/bin/R

0000-0002-2747-368X

Previously took the analysis just through the mapping, but didn’t realize Steven wanted me to fully process the data.

So, as an exercise, I followed through with deduplication and sorting of the BAM files.
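For anyone following along, the deduplication and sorting steps can be sketched roughly as below. This is only an illustration, not the exact commands from the notebook: it assumes Bismark's deduplicate_bismark and samtools 1.x are on the PATH, that the reads are single-end unless stated otherwise, and the file names are hypothetical. The helper names (dedup_command, sort_command) are made up for this example.

```python
from pathlib import Path


def dedup_command(bam, paired=False):
    """Build a deduplicate_bismark call for one Bismark BAM file."""
    layout = "-p" if paired else "-s"  # paired-end vs. single-end reads
    return ["deduplicate_bismark", "--bam", layout, str(bam)]


def sort_command(bam, threads=1):
    """Build a samtools sort call (samtools 1.x syntax) for one BAM file."""
    bam = Path(bam)
    # Write the sorted file next to the input, e.g. x.bam -> x.sorted.bam
    out = bam.with_name(bam.stem + ".sorted.bam")
    return ["samtools", "sort", "-@", str(threads), "-o", str(out), str(bam)]


if __name__ == "__main__":
    # Hypothetical file names; the commands are printed, not executed.
    print(" ".join(dedup_command("sample_bismark_bt2.bam")))
    print(" ".join(sort_command("sample_bismark_bt2.deduplicated.bam", threads=4)))
```

Each list can be handed to subprocess.run(cmd, check=True) once the file names match what is actually on disk.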

Then, I ran a quick analysis using methylKit in R. The analysis simply copies what Steven had done with another data set. I haven’t examined it very thoroughly, so I’m not well-versed in what it’s doing and/or why.

Jupyter Notebook (GitHub):

20180530_emu_oly_methylation_mapping_deduplication.ipynb

R Studio Project (download the folder, load project in R Studio, and then run the script in the scripts subdirectory to run the analysis):

20180531_oly_methylkit/

Will take the full data sets through this whole pipeline.

In an attempt to maximize our lab’s reproducibility, I’ve been working on developing an all-encompassing Docker image. Docker runs software in containers, which behave much like lightweight virtual machines (i.e. a self-contained computer that runs within your computer). For the Roberts Lab, the advantage of using Docker is that a Docker image can be customized to run a specific suite of software, and that image can then be used by anyone else in the lab, regardless of operating system (assuming they can run Docker on it). In turn, if everyone is using the same Docker image (i.e. the same environment with all the same software), then we should be able to reproduce data analyses more reliably, because there won’t be differences between the software versions that people are using. Additionally, using Docker greatly simplifies the setup of new computers with the requisite software.

I’ve put together a Dockerfile (a Dockerfile is a text file/script that Docker uses to retrieve software and build a computer image with those specific instructions) which will automatically build a Docker image (i.e. virtual computer) that contains all of the normal bioinformatics software our lab uses. This has been a side project while I wait for Stacks analysis to complete (or, fail, depending on the day) and it’s finally usable! The image that is built from this Dockerfile will even let the user run R Studio and/or Jupyter Notebooks in their browser (I’m excited about this part)!

Here’s the current list of software that will be installed:

bedtools 2.25.0

bismark 0.15.0

blast 2.3.0+

bowtie2 2.2.8

bsmap 2.90

cufflinks 2.1.1

fastqc 0.11.5

fastx_toolkit 0.0.13

R 3.2.5

RStudio Server 0.99

pyrad 3.0.66

samtools 0.1.19

stacks 1.40

tophat 2.1.1

trimmomatic 0.36

In order to set this up, you need to install Docker and download the Dockerfile (Dockerfile.bio) I’ve created.
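As a rough sketch of the build step, the snippet below assembles a docker build command for the downloaded Dockerfile. The image tag (bioinformatics:latest) and the helper name build_command are assumptions made for illustration; the user guide may use different names.

```python
def build_command(dockerfile="Dockerfile.bio", tag="bioinformatics:latest", context="."):
    """Build the argument list for `docker build`.

    `tag` is a hypothetical image name; `context` is the directory
    Docker sends to the build (here, the current directory).
    """
    return ["docker", "build", "--file", dockerfile, "--tag", tag, context]


if __name__ == "__main__":
    # Print the command so it can be checked before running it by hand.
    print(" ".join(build_command()))
```

Run from the directory holding Dockerfile.bio, the printed command builds the image under the chosen tag.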

I’ve written a bit of a user guide (specific to this Dockerfile) here to get people started: docker.md

The user guide explains a bit how all of this works and tries to progress from a “basic” this-is-how-to-get-started-with-Docker to an “advanced” description of how to map ports, mount local volumes in your containers, and how to start/attach previously used containers.
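To give a flavor of the "advanced" end of that guide, here is a minimal sketch of the port mapping, volume mounting, and start/attach commands. The image and container names and the host directory are hypothetical. Ports 8787 and 8888 are the usual defaults for RStudio Server and Jupyter Notebook, but whether this particular image uses them is an assumption.

```python
def run_command(image, name, ports=(), volumes=()):
    """Build a `docker run` call with published ports and mounted volumes."""
    cmd = ["docker", "run", "--interactive", "--tty", "--name", name]
    for host_port, container_port in ports:
        # Make a container port reachable from the host's browser.
        cmd += ["--publish", f"{host_port}:{container_port}"]
    for host_dir, container_dir in volumes:
        # Make a host directory visible inside the container.
        cmd += ["--volume", f"{host_dir}:{container_dir}"]
    return cmd + [image]


def resume_commands(name):
    """Commands to restart a stopped container and attach to it again."""
    return [["docker", "start", name], ["docker", "attach", name]]


if __name__ == "__main__":
    first_run = run_command(
        "bioinformatics:latest",                  # hypothetical image tag
        "analysis",                               # hypothetical container name
        ports=[(8787, 8787), (8888, 8888)],       # RStudio Server, Jupyter
        volumes=[("/home/user/data", "/data")],   # hypothetical directories
    )
    print(" ".join(first_run))
    for cmd in resume_commands("analysis"):
        print(" ".join(cmd))
```

Naming the container on the first run is what makes it easy to start and attach to the same one later, rather than creating a fresh container each time.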

The next major goal I have with this Docker project is to get the R kernel installed for Jupyter Notebooks. Currently, the Jupyter Notebook installation is restricted to the default Python 2 kernel.

Additionally, I’d like to improve the usability of the Docker image by setting up aliases in the image, meaning a user who wants to use the bowtie program can just type “bowtie”. Currently, the user has to type “bowtie2_2.2.8” (although, with this being in the system PATH and with tab-completion, it’s not that big of a deal), which is a bit ugly.
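One possible way to generate such aliases is sketched below. It is only an idea, not something already in the image: the mapping of short names to versioned executables is supplied by hand, and the helper name alias_lines is made up for this example.

```python
import shlex


def alias_lines(aliases):
    """Turn {short_name: versioned_executable} into shell alias definitions."""
    return [
        f"alias {name}={shlex.quote(target)}"
        for name, target in sorted(aliases.items())
    ]


if __name__ == "__main__":
    # The short name is what the user types; the target is the versioned name.
    for line in alias_lines({"bowtie": "bowtie2_2.2.8"}):
        print(line)
```

The printed lines could be appended to a shell startup file inside the image. Since bash does not expand aliases in non-interactive shells by default, symlinks with the short names might be the more robust choice for scripts.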

For some next-level stuff, I’d also like to set up all Roberts Lab computers to automatically launch the Docker image when the user opens a terminal. This would greatly simplify things for new lab members. They wouldn’t have to deal with going through the various Docker commands to start a Docker container. Instead, their terminal would just put them directly into the container and the user would be none the wiser. They’d be reproducibly conducting data analysis without even having to think about it.

Sam's Notebook

University of Washington – Fishery Sciences – Roberts Lab

Tag Archives: R Studio

Installation – Microsoft Machine Learning Server (Microsoft R Open) on Emu/Roadrunner R Studio Server

BS-seq Mapping – Olympia oyster bisulfite sequencing: Bismark Continued

Docker – Improving Roberts Lab Reproducibility