Category Archives: Genotype-by-sequencing at BGI

Data Analysis – Continued O.lurida Fst Analysis from GBS Data

Continued the analysis I started the other day. Still following Katherine Silliman’s notebook for guidance.

Quick summary of this analysis:

  • Mean Fst comparing all populations = 0.139539326187024
  • Mean Fst HL vs NF = 0.143075552548742
  • Mean Fst HL vs SN = 0.155234939276722
  • Mean Fst NF vs SN = 0.117889300124951

NOTE: Mean Fst values were calculated after replacing negative Fst values with 0. Thus, the means are higher than they would be had the raw data been used. I followed Katherine’s notebook and she doesn’t explicitly explain why she does this, nor what the potential implications are for interpreting the data. Will have to discus her rationale behind this with her.

Jupyter notebook: 20161201_docker_oly_vcf_analysis_R.ipynb

Data Analysis – Initial O.lurida Fst Determination from GBS Data

Finally running some analysis on the output from my PyRad analysison 20160727.

I’m following Katherine Silliman’s Jupyter notebook (2bRAD Subset Population Structure Analysis.ipynb) as a guide.

The initial analysis (which isn’t much) is in the Jupyter notebook below. The analysis will be continued on a later date.

Jupyter notebook: 20161117_docker_oly_vcf_analysis.ipynb

I’ve embedded the notebook below, but it’s much easier to view (there are many lengthy commands/filenames that wrap lines in the embedded version below) the actual file linked above.

Computing – Retrieve data from Amazon EC2 Instance

I had an existing instance that still had data on it from my PyRad analysis on 20160727 that I needed to retrieve.

Logged into Amazon AWS via the web interface and started my existing instance (via the Actions > Instance State > Start menu). After the instance started and generated a new public IP address, I SSH’d into the instance:

ssh -i "/full/path/to/bioinformatics.pem" ubuntu@instance.public.ip.address

NOTE: I needed the full path to the PEM file! Tried multiple times using a relative path (e.g. ~/Documents/bionformatics.pem) and received error messages that the file did not exist and “Permission denied (public key)”.

Changed to the directory with the PyRAD analysis and created a tarball to speed up eventual download from the EC2 instance to my local computer:

tar -cvzf 20160715_pyrad_analysis.tar.gz /home/ubuntu/data/analysis/

After compression, I used secure copy to copy the file from the EC2 instance to my local computer:

scp -i "/full/path/to/bioinformatics.pem" ubuntu@instance.public.ip.address:/home/ubuntu/data/20160715_pyrad_analysis.tar.gz /Volumes/toaster/sam/

This didn’t work initially because I attempted to transfer the file using Hummingbird (instead of my computer). The SSH connection kept timing out. The reason for this was that I hadn’t previously used Hummingbird to connect to the EC2 instance and Hummingbird’s IP address wasn’t listed in the Security Groups table as being allowed to connect. I made that change using the Amazon AWS web interface:

Once transfer was complete, I terminated the EC2 instance and the corresponding data volume.

Data Analysis – fastStructure Population Analysis of Oly GBS PyRAD Output

After some basal readings about what Fst is (see notebook below for a definition and reference), I decided to try to use fastStructure to analyze the PyRAD output from 20160727.

The quick, TL;DR: after spending a bunch of time installing the program, it doesn’t handle the default Structure file (.str); requires some companion file types that PyRAD doesn’t output.

I’ve put this here for posterity and background reference on Fst…

Will proceed with using the full blown Structure program to try to glean some info from these three populations.

 

Jupyter Notebook: 20160816_oly_gbs_fst_calcs.ipynb

 

Data Analysis – PyRad Analysis of Olympia Oyster GBS Data

Previously, I ran a PyRad analysis on just a subset of these samples in an attempt to have some data for a grant pre-proposal.

I’ve now completed a PyRad analysis on the full set. Now, I just need to figure out what to do with the output from this…

Jupyter Notebook: 20160715_ec2_oly_gbs_pyrad.ipynb

Computing – Amazon EC2 Instance Out of Space?

Running PyRad analysis on the Olympia oyster GBS data. PyRad exited with warnings about running out of space. However, looking at free disk space on the EC2 Instance suggests that there’s still space left on the disk. Possibly PyRad monitors the expected disk space usage during analysis to verify there will be sufficient disk space to write to? Regardless, will expand EC2 volume instance to a larger size…

 

 

Data Analysis – Oly GBS Data Using Stacks 1.37

This analysis ran (or, more properly, was attempted) for a couple of weeks and failed a few times. The failures seemed to be linked to the external hard drive I was reading/writing data to. It continually locked up, leading to “Segmentation fault” errors.

We’ve replaced the external with a different one in hopes that it’ll be able to handle the workload. Will be attempting to re-run Stacks with the new external hard drive.

I’m posting the Jupyter notebook here for posterity.

Jupyter notebook: 20160428_Oly_GBS_STACKS.ipynb

Data Management – O.lurida Raw BGI GBS FASTQ Data

BGI had previously supplied us with demultiplexed GBS FASTQ files. However, they had not provided us with the information/data on how those files were created. I contacted them and they’ve given us the two original FASTQ files, as well as the library index file and corresponding script they used for demultiplexing all of the files. The Jupyter (iPython) notebook below updates our checksum and readme files in our server directory that’s hosting the files: http://owl.fish.washington.edu/nightingales/O_lurida/20160223_gbs/

See Jupyter Notebook below for processing details.

Jupyter Notebook: 20160427_Oly_GBS_data_management.ipynb

Data Analysis – Subset Olympia Oyster GBS Data from BGI as Single Population Using PyRAD

Attempting to get some sort of analysis of the Ostrea lurida GBS data from BGI, particularly since the last run at it using Stacks crashed (literally) and burned (not literally).

Katherine Silliman at UIC recommended using PyRAD.

I’ve taken the example Jupyter notebook from the PyRAD website and passed a subset of the 96 individuals through it.

In this instance, the subset of individuals were all analyzed as a single population. I have another Jupyter notebook running on a different computer that will separate the three populations that are present in this subset.

Overall, I don’t fully understand the results. However, this seems to be the quickest assessment of the data (from the *.snps file generated):

28 individuals, 36424 loci, 72251 snps

Additionally, I did run into an issue when I tried to visualize the data (using the *.vcf file generated) in IGV (see screen cap below). I’ve posted the issue to the pyrad GitHub repo in hopes of getting it resolved.

 

One last thing. This might be obvious to most, but I discovered that trying to do all this computation over the network (via a mounted server share) is significantly slower than performing these operations on th efiles when they’re stored locally. Somewhere in the notebook you’ll notice that I copy all of the working directory from the server (Owl) to the local machine (Hummingbird). Things proceeded very quickly after doing that. Didn’t realize this would have so much impact on speed!!

Jupyter Notebook: 20160418_pyrad_oly_PE-GBS.ipynb

NBviewer: 20160418_pyrad_oly_PE-GBS

Data Analysis – Oly GBS Data from BGI Using Stacks

UPDATE (20160418) : I’m posting this more for posterity, as Stacks continually locked up at both the “ustacks” and “cstacks” stages. These processes would take days to run (on the full 96 samples) and then the processes would become “stuck” (viewed via the top command in OS X).

Have moved on to trying PyRAD in the meantime.

Need to get the GBS from BGI data analyzed.

Installed Stacks (and its dependencies on Hummingbird earlier today).

Below is the Jupyter (iPython) notebook I ran to perform this analysis.

Jupyter (iPython) Notebook: 20160406_Oly_GBS_STACKS.ipynb

Jupyter Notebook Viewer: 20160406_Oly_GBS_STACKS