Tag Archives: ipython notebook

Data Received – Ostrea lurida MBD-enriched BS-seq

Received the Olympia oyster MBD-enriched BS-seq files (50bp, single read) from ZymoResearch (submitted 20151208). Here’s the sample list:

E1_hc1_2B
E1_hc1_4B
E1_hc2_15B
E1_hc2_17
E1_hc3_1
E1_hc3_5
E1_hc3_7
E1_hc3_10
E1_hc3_11
E1_ss2_9B
E1_ss2_14B
E1_ss2_18B
E1_ss3_3B
E1_ss3_14B
E1_ss3_15B
E1_ss3_16B
E1_ss3_20
E1_ss5_18

The 18 samples listed above had previously been MBD-enriched and then sent to ZymoResearch for bisulfite conversion, multiplex library construction, and subsequent sequencing. The library (multiplex of all samples) was sequenced in a single lane, three times. Thus, we would expect 54 FASTQ files. However, ZymoResearch was dissatisfied with the QC of the initial sequencing run (completed on 20160129), so they re-ran the samples (completed on 20160202). This created two sets of data, resulting in a total of 108 FASTQ files.

The ZymoResearch data portal does not allow bulk download of files. However, I ended up using the Chrono Download Manager extension for Google Chrome to automate downloading each file (per ZymoResearch's recommendation).

After download, the files were moved to their permanent storage location on Owl: http://owl.fish.washington.edu/nightingales/O_lurida/20160203_mbdseq

The readme.md file was updated to include project/file information.

The file manipulations were performed in a Jupyter notebook (see below).

Total reads generated for this project: 1,481,836,875

Jupyter Notebook file: 20160203_Olurida_Zymo_Data_Handling.ipynb

Notebook Viewer: 20160203_Olurida_Zymo_Data_Handling.ipynb

Data Received – Ostrea lurida genome sequencing files from BGI

0000-0002-2747-368X

Downloaded data from the BGI project portal to our server, Owl, using the Synology Download Station. Although the BGI portal is aesthetically nice, it’s set up poorly for bulk downloads and took a few tries to download all of the files.

Data integrity was assessed and read counts for each file were generated. The files were moved to their permanent storage location on Owl: http://owl.fish.washington.edu/nightingales/O_lurida
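For readers who want to reproduce this kind of check, here is a minimal Python sketch of the two steps mentioned above: computing a checksum for each file and counting reads. It is not the code from the notebook. It assumes gzipped FASTQ files with four lines per record, and the function names and the `*.fastq.gz` pattern are my own choices for illustration.

```python
import gzip
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4


def summarize(directory, pattern="*.fastq.gz"):
    """Map each matching file name to (md5, read count) and total the reads."""
    per_file = {}
    for fastq in sorted(Path(directory).glob(pattern)):
        per_file[fastq.name] = (md5_of_file(fastq), count_fastq_reads(fastq))
    total = sum(reads for _, reads in per_file.values())
    return per_file, total
```

The digests returned by `summarize()` can be compared against the checksums supplied by the sequencing provider (if any), and the total gives a project-wide read count.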

The readme.md file was updated to include project/file information.

The file manipulations were performed in a Jupyter notebook (see below).

Total reads generated for this project: 1,225,964,680

BGI provided us with the raw data files to play around with, but they are also currently performing the genome assembly.

Jupyter Notebook file: 20160126_Olurida_BGI_data_handling.ipynb

Notebook Viewer: 20160126_Olurida_BGI_data_handling.ipynb

Data Received – Panopea generosa genome sequencing files from BGI

Data integrity was assessed and read counts for each file were generated. The files were moved to their permanent storage location on Owl: http://owl.fish.washington.edu/nightingales/P_generosa/

The readme.md file was updated to include project/file information.

The file manipulations were performed in a Jupyter notebook (see below).

Total reads generated for this project: 1,208,635,950

BGI provided us with the raw data files to play around with, but they are also currently performing the genome assembly.

Jupyter Notebook file: 20160126_Olurida_BGI_data_handling.ipynb

Notebook Viewer: 20160126_Olurida_BGI_data_handling.ipynb

Data Analysis – Identification of duplicate files on Eagle

Recently, we’ve been bumping into our storage limit on Eagle (our Synology DS413):

Being fairly certain that a significant number of large datasets are duplicated throughout Eagle, I ran a Linux program called “fslint”. It searches for duplicate files based on a few parameters and is smart enough to compare files that have different filenames but share the same file contents!
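To illustrate the general idea of content-based duplicate detection (this is not how fslint itself is implemented, just a sketch of the same concept), the Python below groups files by size first and then hashes only the files whose sizes collide. The function names and the choice of SHA-1 are my own assumptions.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1024 * 1024):
    """Hash a file's contents in chunks so large files do not fill memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, min_size=0):
    """Return (size, [paths]) groups of files under root with identical contents."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size >= min_size:
                by_size[size].append(path)

    duplicates = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        for group in by_hash.values():
            if len(group) > 1:
                duplicates.append((size, sorted(group)))
    return duplicates
```

Comparing sizes before hashing avoids reading most files at all, which matters on a multi-terabyte NAS. File names never enter the comparison, so renamed copies are still caught.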

I decided to check for duplicate files in the Eagle/archive folder and the Eagle/web folder. Initially, I tried searching for duplicates across all of Eagle, but after a week of running I got tired of waiting for results and ran the analysis on those two directories independently. As such, there is a possibility that there are more duplicates (consuming even more space) across the remainder of Eagle that have not been identified. However, this is a good starting point.

Here are the two output files from the fslint analysis:

To get a summary of the fslint output, I tallied the duplicate files that were >100MB in size. This was performed in a Jupyter notebook (see below):
Notebook Viewer: 20160114_wasted_space_synologies.ipynb
Jupyter (IPython) Notebook File: 20160114_wasted_space_synologies.ipynb
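As a rough illustration of this kind of tally (not the notebook's actual code), the sketch below sums the space taken by redundant copies. It assumes the duplicates have already been parsed into hypothetical `(size_in_bytes, [paths])` groups, that one copy per group counts as the original, and that 100MB means 100,000,000 bytes.

```python
def wasted_bytes(duplicate_groups, min_size=100 * 1000 * 1000):
    """Sum the space used by redundant copies of files larger than min_size.

    Each group is (size_in_bytes, [paths]). One copy per group is treated as
    the original, so a group of n files wastes (n - 1) * size bytes.
    """
    total = 0
    for size, paths in duplicate_groups:
        if size > min_size and len(paths) > 1:
            total += size * (len(paths) - 1)
    return total


# Hypothetical example groups, for illustration only.
example_groups = [
    (200_000_000, ["archive/a.fastq.gz", "web/a.fastq.gz", "web/copy_of_a.fastq.gz"]),
    (50_000_000, ["archive/small.fa", "web/small.fa"]),
]
print(wasted_bytes(example_groups) / 1e9, "GB wasted")
```

Counting every copy instead of only the redundant ones would give a larger number, so it is worth stating which convention a reported total uses.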

Here are the cleaned output files from the fslint analysis:

Summary

There are duplicates of files (>100MB in size) that are consuming at least 730GB!

Since the majority of these files exist in the Eagle/web folder, we will have to carefully consider which duplicates (if any) can be deleted, since it’s highly possible that there are notebooks that link to some of the files. Regardless, this analysis shows just how much space is being consumed by large, duplicate files; something to consider for future data handling/storage/analysis with Eagle.

Bioinformatics – Trimmomatic/FASTQC on C.gigas Larvae OA NGS Data

Previously trimmed the first 39 bases of sequence from reads from the BS-Seq data in an attempt to improve our ability to map the reads back to the C.gigas genome. However, Mac (and Steven) noticed that the last ~10 bases of all the reads showed a steady increase in the %G, suggesting some sort of bias (maybe adaptor??):

Although I didn’t mention this previously, the figure above also shows an odd “waves” pattern that repeats in all bases except for G. Not sure what to think of that…

Quick summary of actions taken (specifics are available in Jupyter notebook below):

Trim first 39 bases from all reads in all raw sequencing files.
Trim last 10 bases from all reads in raw sequencing files.
Concatenate the two sets of reads (400ppm and 1000ppm treatments) into single FASTQ files for Steven to work with.
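The actual trimming was done with Trimmomatic (see the notebook), but the first two steps above can be sketched in plain Python to show what happens to each read. This is an illustration only: the function names are mine, and the sketch assumes gzipped FASTQ files with four lines per record.

```python
import gzip


def trim_record(seq, qual, head=39, tail=10):
    """Drop `head` bases from the start and `tail` bases from the end of a read."""
    end = len(seq) - tail
    if end <= head:
        return "", ""  # nothing would be left after trimming
    return seq[head:end], qual[head:end]


def trim_fastq(in_path, out_path, head=39, tail=10):
    """Write a trimmed copy of a gzipped FASTQ file and return the reads kept."""
    kept = 0
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        while True:
            header = src.readline()
            if not header:
                break  # end of file
            seq = src.readline().rstrip("\n")
            plus = src.readline()
            qual = src.readline().rstrip("\n")
            new_seq, new_qual = trim_record(seq, qual, head, tail)
            if new_seq:  # skip reads too short to survive trimming
                dst.write(f"{header}{new_seq}\n{plus}{new_qual}\n")
                kept += 1
    return kept
```

The sequence and quality strings are sliced with the same indices so they stay the same length, which downstream tools require.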

Raw sequencing files:

Notebook Viewer: 20150521_Cgigas_larvae_OA_Trimmomatic_FASTQC

Jupyter (IPython) notebook: 20150521_Cgigas_larvae_OA_Trimmomatic_FASTQC.ipynb

Output files

Trimmed, concatenated FASTQ files
20150521_trimmed_2212_lane2_400ppm_GCCAAT.fastq.gz
20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz

FASTQC files
20150521_trimmed_2212_lane2_400ppm_GCCAAT_fastqc.html
20150521_trimmed_2212_lane2_400ppm_GCCAAT_fastqc.zip

20150521_trimmed_2212_lane2_1000ppm_CTTGTA_fastqc.html
20150521_trimmed_2212_lane2_1000ppm_CTTGTA_fastqc.zip

Example of FASTQC analysis pre-trim:

Example FASTQC post-trim (from 400ppm data):

Trimming has removed the intended bad stuff (inconsistent sequence in the first 39 bases and rise in %G in the last 10 bases). Sequences are ready for further analysis by Steven.

However, we still see the “waves” pattern with the T, A and C. Additionally, we still don’t know what caused the weird inconsistencies, nor what sequence is contained therein that might be leading to that. Will contact the sequencing facility to see if they have any insight.

Bioinformatics – Trimmomatic/FASTQC on C.gigas Larvae OA NGS Data

In another troubleshooting attempt for this problematic BS-seq Illumina data, I’m going to use Trimmomatic to remove the first 39 bases of each read. This is due to the fact that even after the previous quality trimming with Trimmomatic, the first 39 bases still showed inconsistent quality:

Ran Trimmomatic on just a single data set to try things out: 2212_lane2_CTTGTA_L002_R1_001.fastq.gz
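For reference, a head-trimming run like this could be assembled as below. This is a sketch and not necessarily the exact command in the notebook: `SE` mode and the `HEADCROP` step are standard Trimmomatic options, but the jar path (`trimmomatic.jar`) and the helper name are assumptions for illustration. The command is only built and printed here, not executed.

```python
import shlex


def trimmomatic_headcrop_command(fastq_in, fastq_out, bases=39, jar="trimmomatic.jar"):
    """Build a single-end Trimmomatic command that removes leading bases.

    The jar path is a placeholder; point it at the local Trimmomatic install.
    """
    return ["java", "-jar", jar, "SE", fastq_in, fastq_out, f"HEADCROP:{bases}"]


cmd = trimmomatic_headcrop_command(
    "2212_lane2_CTTGTA_L002_R1_001.fastq.gz",
    "20150506_trimmed_2212_lane2_CTTGTA_L002_R1_001.fastq.gz",
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run it, assuming Java and Trimmomatic are installed.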

Notebook Viewer: 20150506_Cgigas_larvae_OA_trimmomatic_FASTQC

Jupyter (IPython) notebook: 20150506_Cgigas_larvae_OA_trimmomatic_FASTQC.ipynb

Results:

Trimmed FASTQ: 20150506_trimmed_2212_lane2_CTTGTA_L002_R1_001.fastq.gz

FASTQC Report: 20150506_trimmed_2212_lane2_CTTGTA_L002_R1_001_fastqc.html

You can see how flat the newly trimmed data is (which is what one would expect).

Steven will take this trimmed dataset and try additional mapping with it to see if removal of the first 39 bases will improve the mapping.

BLAST – C.gigas Larvae OA Illumina Data Against GenBank nt DB

In an attempt to figure out what’s going on with the Illumina data we recently received for these samples, I BLASTed the 400ppm data set that had previously been de novo assembled by Steven: EmmaBS400.fa.

Jupyter (IPython) Notebook : 20150501_Cgigas_larvae_OA_BLASTn_nt.ipynb

Notebook Viewer : 20150501_Cgigas_larvae_OA_BLASTn_nt

Results:

BLASTn Output File: 20150501_nt_blastn.tab

BLAST e-vals <= 0.001: 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt

Unique BLAST Species: 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt

Firstly, since this library was bisulfite converted, we know that matching won’t be as robust as we’d normally see.

However, the BLAST matches for this are terrible.

Only 0.65% of the BLAST matches (e-value <0.001) are to Crassostrea gigas. Yep, you read that correctly: 0.65%.

It’s nearly 40-fold less than the top species: Dictyostelium discoideum (a slime mold)

It’s 30-fold less than the next species: Danio rerio (zebrafish)

Then it’s followed up by human and mouse.
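A species breakdown like this can be computed from tab-delimited BLAST output with a few lines of Python. The sketch below is not the notebook's code. It assumes the e-value sits in the 11th column (as in the standard tabular format) and that a species name was written to a 13th column; that second position is a guess for illustration, so adjust `species_col` to match the real file.

```python
from collections import Counter


def species_percentages(lines, evalue_col=10, species_col=12, max_evalue=0.001):
    """Return {species: percent of hits} for hits with e-value <= max_evalue.

    Column indices are zero-based. Comment lines and short lines are skipped.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(evalue_col, species_col):
            continue
        if float(fields[evalue_col]) <= max_evalue:
            counts[fields[species_col]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {species: 100.0 * n / total for species, n in counts.most_common()}
```

Opening the output file and passing the handle straight in (`species_percentages(open(path))`) works because the function only iterates over lines.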

I think I will need to contact the Univ. of Oregon sequencing facility to see what their thoughts on this data are, because it’s not even remotely close to what we should be seeing, even with the bisulfite conversion…

BLASTN – C.gigas OA Larvae to C.gigas Ensembl 1.24 BLAST DB

I also created a nucleotide BLAST database (DB) from the Crassostrea_gigas.GCA_000297895.1.24.fa file.
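As a sketch of what building and querying such a database generally looks like with NCBI BLAST+ (the exact commands used are in the notebook), the Python below assembles `makeblastdb` and `blastn` command lines. The database name, output file name, and e-value cutoff are placeholders I chose for illustration. The commands are only built here, not run.

```python
def makeblastdb_command(fasta, db_name):
    """Build a makeblastdb command for a nucleotide database."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "nucl", "-out", db_name]


def blastn_command(query, db_name, out_path, evalue=0.001):
    """Build a blastn command that writes tab-delimited output (format 6)."""
    return [
        "blastn",
        "-query", query,
        "-db", db_name,
        "-evalue", str(evalue),
        "-outfmt", "6",
        "-out", out_path,
    ]


# Placeholder names: "Cgigas_ensembl_db" and "blastn_out.tab" are hypothetical.
db_cmd = makeblastdb_command("Crassostrea_gigas.GCA_000297895.1.24.fa", "Cgigas_ensembl_db")
query_cmd = blastn_command("EmmaBS400.fa", "Cgigas_ensembl_db", "blastn_out.tab")
print(" ".join(db_cmd))
print(" ".join(query_cmd))
```

Each list can be passed to `subprocess.run(..., check=True)` on a machine with BLAST+ installed.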

Jupyter (IPython) Notebook: 20150429_Gigas_larvae_OA_BLASTn.ipynb

Notebook Viewer: 20150429_Gigas_larvae_OA_BLASTn

Results:

The results are not great.

All query contigs successfully BLAST to sequences in the C.gigas Ensembl BLAST DB. However, only 33 of the sequences (out of ~37,000) have an e-value of 0.0. The next best e-value for any matches is 0.001. For the uninitiated, that value is not very good, especially when you’re BLASTing against the same exact species DB.

Will BLASTn the C.gigas contigs against the entire GenBank nt (all nucleotides) DB to see what the taxonomic breakdown of these sequences looks like.

Quality Trimming – C.gigas Larvae OA BS-Seq Data

Jupyter (IPython) Notebook: 20150414_C_gigas_Larvae_OA_Trimmomatic_FASTQC.ipynb

NBviewer: 20150414_C_gigas_Larvae_OA_Trimmomatic_FASTQC.ipynb

Trimmed FASTQC

400ppm Index – GCCAAT

20150414_trimmed_2212_lane2_GCCAAT_L002_R1_001_fastqc.html
20150414_trimmed_2212_lane2_GCCAAT_L002_R1_002_fastqc.html
20150414_trimmed_2212_lane2_GCCAAT_L002_R1_003_fastqc.html
20150414_trimmed_2212_lane2_GCCAAT_L002_R1_004_fastqc.html
20150414_trimmed_2212_lane2_GCCAAT_L002_R1_005_fastqc.html
20150414_trimmed_2212_lane2_GCCAAT_L002_R1_006_fastqc.html

1000ppm Index – CTTGTA

20150414_trimmed_2212_lane2_CTTGTA_L002_R1_001_fastqc.html
20150414_trimmed_2212_lane2_CTTGTA_L002_R1_002_fastqc.html
20150414_trimmed_2212_lane2_CTTGTA_L002_R1_003_fastqc.html
20150414_trimmed_2212_lane2_CTTGTA_L002_R1_004_fastqc.html

Quality Trimming – LSU C.virginica Oil Spill MBD BS-Seq Data

Jupyter (IPython) Notebook: 20150414_C_virginica_LSU_Oil_Spill_Trimmomatic_FASTQC.ipynb

NBviewer: 20150414_C_virginica_LSU_Oil_Spill_Trimmomatic_FASTQC.ipynb

Trimmed FASTQC

NB3 No oil Index – ACAGTG

NB6 No oil Index – GCCAAT

NB11 No oil Index – CAGATC

HB2 25,000ppm oil Index – ATCACG

HB16 25,000ppm oil Index – TTAGGC

HB30 25,000ppm oil Index – TGACCA

Sam's Notebook

University of Washington – Fishery Sciences – Roberts Lab