SRA Submission – Olympia oyster Whole Genome BS-seq Data

kubu4 — Wed, 03 Oct 2018 17:47:35 +0000

Submitted our whole genome bisulfite sequencing data to NCBI Sequence Read Archive (SRA).

Relevant SRA info is below.

Have updated nightingales Google Sheet with SRA info.

SAMPLE	SRA (Study)	BioProject	BioSample
1NF11	SRP163248	PRJNA494552	SAMN10172233
1NF15	SRP163248	PRJNA494552	SAMN10172234
1NF16	SRP163248	PRJNA494552	SAMN10172235
1NF17	SRP163248	PRJNA494552	SAMN10172236
2NF5	SRP163248	PRJNA494552	SAMN10172237
2NF6	SRP163248	PRJNA494552	SAMN10172238
2NF7	SRP163248	PRJNA494552	SAMN10172239
2NF8	SRP163248	PRJNA494552	SAMN10172240

BS-seq Mapping – Olympia oyster bisulfite sequencing: TrimGalore > FastQC > Bismark

kubu4 — Tue, 08 May 2018 17:40:06 +0000

0000-0002-2747-368X

Steven asked me to evaluate our methylation sequencing data sets for Olympia oyster.

According to our Olympia oyster genome wiki, we have the following two sets of BS-seq data:

Whole genome BS-seq (2015)
MBD BS-seq (2015)

All computing was conducted on our Apple Xserve: emu.

All steps were documented in this Jupyter Notebook (GitHub): 20180503_emu_oly_methylation_mapping.ipynb

NOTE: The Jupyter Notebook linked above is very large, so it will not render on GitHub. Download it and open it on a computer that can run Jupyter Notebooks.

Here’s a brief overview of what was done.

Samples were trimmed with TrimGalore and then evaluated with FastQC. MultiQC was used to generate a nice visual summary report of all samples.
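
The trimming and QC commands themselves are only in the notebook. As a rough sketch of that step, the function below prints (it does not run) one plausible set of calls. The directory layout, the *.fastq.gz input pattern and the choice of options are assumptions for illustration, not what was actually run:

```shell
# Print the trimming/QC commands for every raw FASTQ file in a directory.
# raw_dir and out_dir are placeholders; nothing is executed here.
print_qc_commands() {
  local raw_dir="$1" out_dir="$2" fq
  for fq in "$raw_dir"/*.fastq.gz; do
    [ -e "$fq" ] || continue   # the glob matched nothing
    echo "trim_galore --output_dir $out_dir $fq"
  done
  # FastQC on the trimmed reads, then one MultiQC summary of the FastQC output.
  echo "fastqc --outdir $out_dir/fastqc $out_dir/*_trimmed.fq.gz"
  echo "multiqc --outdir $out_dir/fastqc $out_dir/fastqc"
}
```

Printing first makes it easy to review the commands before piping them to a shell.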

The Olympia oyster genome assembly, pbjelly_sjw_01, was used as the reference genome and was prepared for use in Bismark:


/home/shared/Bismark-0.19.1/bismark_genome_preparation \
--path_to_bowtie /home/shared/bowtie2-2.3.4.1-linux-x86_64/ \
--verbose /home/sam/data/oly_methylseq/oly_genome/ \
2> 20180507_bismark_genome_prep.err

Bismark was run on trimmed samples with the following command:


/home/shared/Bismark-0.19.1/bismark \
--path_to_bowtie /home/shared/bowtie2-2.3.4.1-linux-x86_64/ \
--genome /home/sam/data/oly_methylseq/oly_genome/ \
-u 1000000 \
-p 16 \
--non_directional \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/1_ATCACG_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/2_CGATGT_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/3_TTAGGC_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/4_TGACCA_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/5_ACAGTG_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/6_GCCAAT_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/7_CAGATC_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/8_ACTTGA_L001_R1_001_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_10_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_11_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_12_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_13_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_14_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_15_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_16_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_17_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_18_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_1_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_2_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_3_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_4_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_5_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_6_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_7_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_8_s456_trimmed.fq.gz \
/home/sam/analyses/20180503_oly_methylseq_trimgalore/zr1394_9_s456_trimmed.fq.gz \
2> 20180507_bismark_02.err
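
For reference, -u 1000000 tells Bismark to stop after the first 1,000,000 reads of each file, -p 16 sets the number of Bowtie 2 threads, and --non_directional aligns against all four bisulfite strands. The long file list could also be built from a glob. The sketch below only prints the resulting command; the function name and the bismark.err log name are made up for the example:

```shell
# Build the Bismark call from a glob instead of listing each file by hand.
# Paths and options are copied from the command above; the command is printed, not run.
print_bismark_command() {
  local trim_dir="$1" files=() fq
  for fq in "$trim_dir"/*_trimmed.fq.gz; do
    [ -e "$fq" ] && files+=("$fq")
  done
  # One argument per line, each followed by a line-continuation backslash.
  printf '%s \\\n' \
    "/home/shared/Bismark-0.19.1/bismark" \
    "--path_to_bowtie /home/shared/bowtie2-2.3.4.1-linux-x86_64/" \
    "--genome /home/sam/data/oly_methylseq/oly_genome/" \
    "-u 1000000" \
    "-p 16" \
    "--non_directional" \
    "${files[@]}"
  echo "2> bismark.err"
}
```

Example: print_bismark_command /home/sam/analyses/20180503_oly_methylseq_trimgalore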

Results:

TrimGalore output folder:

20180503_oly_methylseq_trimgalore

FastQC output folder:

20180503_oly_methylseq_trimgalore/20180503_trim_fastqc/

MultiQC output folder:

20180503_oly_methylseq_trimgalore/20180503_trim_fastqc/multiqc_data/

MultiQC Report (HTML):

20180503_oly_methylseq_trimgalore/20180503_trim_fastqc/multiqc_data/multiqc_report.html

Bismark genome folder:

20180503_oly_genome_pbjelly_sjw_01_bismark/

Bismark output folder:

20180507_oly_methylseq_bismark

Whole genome BS-seq (2015)

Prep overview

Library prep: Roberts Lab
Sequencing: Genewiz

Bismark Report	Mapping Percentage
1_ATCACG_L001_R1_001_trimmed_bismark_bt2_SE_report.txt	40.3%
2_CGATGT_L001_R1_001_trimmed_bismark_bt2_SE_report.txt	39.9%
3_TTAGGC_L001_R1_001_trimmed_bismark_bt2_SE_report.txt	40.2%
4_TGACCA_L001_R1_001_trimmed_bismark_bt2_SE_report.txt	40.4%
5_ACAGTG_L001_R1_001_trimmed_bismark_bt2_SE_report.txt	39.9%
6_GCCAAT_L001_R1_001_trimmed_bismark_bt2_SE_report.txt	39.6%
7_CAGATC_L001_R1_001_trimmed_bismark_bt2_SE_report.txt	39.9%
8_ACTTGA_L001_R1_001_trimmed_bismark_bt2_SE_report.txt	39.7%

MBD BS-seq (2015)

Prep overview

MBD: Roberts Lab
Library prep: ZymoResearch
Sequencing: ZymoResearch

Bismark Report	Mapping Percentage
zr1394_1_s456_trimmed_bismark_bt2_SE_report.txt	33.0%
zr1394_2_s456_trimmed_bismark_bt2_SE_report.txt	34.1%
zr1394_3_s456_trimmed_bismark_bt2_SE_report.txt	32.5%
zr1394_4_s456_trimmed_bismark_bt2_SE_report.txt	32.8%
zr1394_5_s456_trimmed_bismark_bt2_SE_report.txt	35.2%
zr1394_6_s456_trimmed_bismark_bt2_SE_report.txt	35.5%
zr1394_7_s456_trimmed_bismark_bt2_SE_report.txt	32.8%
zr1394_8_s456_trimmed_bismark_bt2_SE_report.txt	33.0%
zr1394_9_s456_trimmed_bismark_bt2_SE_report.txt	34.7%
zr1394_10_s456_trimmed_bismark_bt2_SE_report.txt	34.9%
zr1394_11_s456_trimmed_bismark_bt2_SE_report.txt	30.5%
zr1394_12_s456_trimmed_bismark_bt2_SE_report.txt	35.8%
zr1394_13_s456_trimmed_bismark_bt2_SE_report.txt	32.5%
zr1394_14_s456_trimmed_bismark_bt2_SE_report.txt	30.8%
zr1394_15_s456_trimmed_bismark_bt2_SE_report.txt	31.3%
zr1394_16_s456_trimmed_bismark_bt2_SE_report.txt	30.7%
zr1394_17_s456_trimmed_bismark_bt2_SE_report.txt	32.4%
zr1394_18_s456_trimmed_bismark_bt2_SE_report.txt	34.9%
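
The percentages in the tables above come from the Bismark report files. A minimal sketch for pulling them out in one pass, assuming each report contains a line starting with "Mapping efficiency:" (the function name is made up for this example):

```shell
# Print "report name<TAB>mapping efficiency" for each Bismark report given.
mapping_efficiency() {
  local report
  for report in "$@"; do
    awk -v name="$(basename "$report")" \
      '/^Mapping efficiency:/ { print name "\t" $NF }' "$report"
  done
}
```

Example: mapping_efficiency 20180507_oly_methylseq_bismark/*_SE_report.txt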

Manuscript Writing – Submitted!

kubu4 — Wed, 19 Apr 2017 17:17:00 +0000

Boom!

Here are some useful links:

data records repo-URL: https://osf.io/j8rc2/
draft repo-URL: https://github.com/kubu4/paper_oly_gbs
draft: https://www.authorea.com/users/4974/articles/149442
preprint (Overleaf): https://www.overleaf.com/read/mqbbvmwxhncg
preprint (PDF): https://osf.io/cdj7m/

Manuscript – Oly GBS 14 Day Plan

kubu4 — Tue, 18 Apr 2017 14:55:02 +0000

For Pub-a-thon 2017, Steven has asked us to put together a 14 day plan for our manuscripts.

My manuscript is accessible in three locations:

Current: Overleaf for final editing/formatting before submission to Scientific Data.
Archival: Authorea for initial writing/comments.
Archival: GitHub for initial writing/issues.

Additionally, I have established a data repository with a Digital Object Identifier (DOI) at Open Science Framework.

Here’s what I have going on:

Day 1

Convert .xls data records to .csv to see if they will render in OSF repo.
Assemble figure: phylogenetic tree.
Add figure to manuscript.
Deal with any minor edits.

Day 2

Assemble figure: Puget Sound map.
Add figure to manuscript.
Deal with any minor edits.

Day 3

Submit? Depends on Steven’s availability to finish off the Background & Summary and write up the Abstract.

Data Management – SRA Submission Oly GBS Batch Submission

kubu4 — Tue, 21 Mar 2017 19:05:24 +0000

An earlier attempt at submitting these files failed.

I re-uploaded the failed files (indicated in my previous notebook entry linked above) and tried again.

It failed again, despite having successfully uploaded just minutes before.

I re-uploaded that “missing” file and tried again.

This time, it succeeded (and no end-of-stream error for the 1SN_1A file!)!

Will post here with the SRA accession number once it goes live!

Data Management – SRA Submission Oly GBS Batch Submission Fail

kubu4 — Mon, 20 Mar 2017 13:53:02 +0000

I had previously submitted the two non-demultiplexed genotype-by-sequencing (GBS) files provided by BGI to the NCBI Sequence Read Archive (SRA).

Recently, Jay responded to an issue I had posted on the GitHub repo for the manuscript we’re writing about this data.

He noticed that the SRA no longer wants “raw data dumps” (i.e. the type of submission I made before). So, that means I had to prepare the demultiplexed files provided by BGI to actually submit to the SRA.

Last week, I uploaded all 192 of the files via FTP. It took over 10hrs.

FTP tips:

Use ftp -i to initiate FTP.
Then use open ftp.address.IP to connect.
Then use mput with wildcards (e.g. *.fq.gz) to upload multiple files.
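
Those FTP steps can also be scripted rather than typed interactively. A minimal sketch, where the host, remote directory and file pattern are placeholders rather than the real SRA values:

```shell
# Write the FTP commands for a batch upload to stdout.
# All three arguments are placeholders supplied by the caller.
ftp_upload_script() {
  local host="$1" remote_dir="$2" pattern="$3"
  cat <<EOF
open $host
binary
cd $remote_dir
mput $pattern
bye
EOF
}
```

Example: ftp_upload_script ftp.example.org uploads/my_dir '*.fq.gz' | ftp -i (this assumes the login is handled elsewhere, e.g. in ~/.netrc; -i turns off the per-file prompt so mput runs unattended).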

Today, I created a batch BioSample submission:

Initiated the submission process (Ummm, this looks like it’s going to take a while…):

Aaaaaand, it failed:

It seems like the FTP failed at some point, as there’s nothing about those seven files that would separate them from the remaining 185 files. Additional support for FTP failure is that the 1SN_1A_1.fq.gz error message makes it sound like only part of the file got transferred.
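
One cheap check before re-uploading is to confirm the local copies are intact, since a truncated gzip file fails its integrity test. This sketch only tests files on the local side; it says nothing about what arrived at the SRA, and the function name is made up:

```shell
# Print the name of every gzipped file that fails gzip's integrity test.
list_corrupt_gz() {
  local f
  for f in "$@"; do
    gzip -t "$f" 2>/dev/null || echo "$f"
  done
}
```

Example: list_corrupt_gz *.fq.gz (no output means every file decompressed cleanly).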

I’ll retrieve those files from our UW Google Drive (since their original home on Owl is still down) and get them transferred to the SRA.

Computing – Oly BGI GBS Reproducibility; fail?

kubu4 — Fri, 17 Mar 2017 21:41:55 +0000

OK, so things have improved since the last attempt at getting this BGI script to run and demultiplex the raw data.

I played around with the index.lst file format (based on the error I received last time, it seemed like a good possibility that the file formatting was incorrect) and actually got the script to run to completion! Granted, it took over 16hrs (!!), but it completed!

See the Jupyter notebook link below.

Results:

Well, although the script finished and kicked out all the demultiplexed FASTQ files, the contents of the FASTQ files don’t match the original set of demultiplexed files: the read counts differ between these results and the BGI files. I’m not entirely sure if this is to be expected or not, since the script allows for a single nucleotide mismatch when demultiplexing. Is it possible that the mismatch could be interpreted slightly differently each time this is run? I’m not certain.

Theoretically, you should get the same results every time…

Maybe I’ll re-run this again over the weekend and see how the results compare to this run and the original BGI demultiplexing…
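
For the comparison between runs, a minimal sketch that reports per-file read-count differences between two directories of demultiplexed files. It assumes both directories use the same *.fq.gz file names; the function names are made up:

```shell
# Number of reads in a gzipped FASTQ file (four lines per read).
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Print "name<TAB>count_a<TAB>count_b" for every file whose read count differs.
compare_read_counts() {
  local dir_a="$1" dir_b="$2" fq name a b
  for fq in "$dir_a"/*.fq.gz; do
    [ -e "$fq" ] || continue
    name=$(basename "$fq")
    [ -e "$dir_b/$name" ] || { printf '%s\tmissing\n' "$name"; continue; }
    a=$(count_reads "$fq")
    b=$(count_reads "$dir_b/$name")
    [ "$a" = "$b" ] || printf '%s\t%s\t%s\n' "$name" "$a" "$b"
  done
}
```

Matching counts would not prove the files are identical, but differing counts pinpoint which samples to look at first.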

Jupyter notebook (GitHub): 20170314_docker_Oly_BGI_GBS_demultiplexing_reproducibility.ipynb

Computing – Oly BGI GBS Reproducibility Fail (but, less so than last time)…

kubu4 — Tue, 14 Mar 2017 22:56:21 +0000

Well, my previous attempt at reproducing the demultiplexing that BGI performed was an exercise in futility. BGI got back to me with the following message:

Hi Sam,

We downloaded it and it seems fine when compiling. You can compile it with the below command under Linux system.

tar -zxvf ReSeqTools_XXX.tar.gz ; cd iTools_Code; chmod 775 iTools ; ./ iTools -h

I gave that a whirl and got the following message:

Error opening terminal: xterm

Some internet searching got me sucked into a useless black hole about 64 bit systems running 32 bit programs and enabling the 64 bit kernel on Mac OS X 10.7.5 (Lion) since it’s not enabled by default and on and on. In the end, I can’t seem to enable the 64 bit kernel on my Mac Pro, likely due to hardware limitations related to the graphics card and/or displays that are connected.

Anyway, I decided to try getting this program installed again, using a Docker container (instead of trying to install locally on my Mac).

Results:

It didn’t work again, but for a different reason! Despite the instructions in the readme file provided with iTools, you don’t actually need to run make! All that has to be done is unzipping the tarball!! However, despite figuring this out, the program fails with the following error message: “Warming : sample double in this INDEX Files. Sample ID: OYSzenG1AAD96FAAPEI-109; please renamed it diff” (note: this is copied/pasted – the spelling errors are not mine). So, I think there’s something wrong with the formatting of the index file that BGI provided me with.
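
The error message complains about a repeated sample ID, so one quick check is to list values that occur more than once in the index file. The layout of index.lst is not documented here, so this sketch assumes a whitespace-delimited file with the sample ID in the first column; pass a different column number if that guess is wrong:

```shell
# Print each value that occurs more than once in one column of a
# whitespace-delimited file. The default column (1) is a guess at
# where index.lst keeps the sample ID.
duplicate_ids() {
  local file="$1" column="${2:-1}"
  awk -v c="$column" 'NF >= c { print $c }' "$file" | sort | uniq -d
}
```

Example: duplicate_ids index.lst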

I’ve contacted them for more info.

See the Jupyter notebook linked below to see what I tried.

Jupyter notebook (GitHub): 20170314_docker_Oly_BGI_GBS_demultiplexing_reproducibility.ipynb

Computing – Oly BGI GBS Reproducibility Fail

kubu4 — Tue, 07 Mar 2017 23:36:53 +0000

Since we’re preparing a manuscript that relies on BGI’s manipulation/handling of the genotype-by-sequencing data, I attempted to reproduce the demultiplexing steps that BGI used in order to perform the SNP/genotyping on these samples.

The key word in the above sentence is “attempted.” Ugh, what a massive waste of time it turned out to be. I’ve contacted BGI to get some help on this.

In the meantime, here’s a brief (actually, not as brief as I’d like) rundown of my struggles.

The demultiplexing software that BGI used is something called “iTools” which is bundled in this GitHub repo: Reseqtools

To demultiplex, they ran a script called: split.sh

The script seems fairly straightforward. Here is what it contains:

iTools Fqtools splitpool \
-InFq1 160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_1.fq.gz \
-InFq2 160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_2.fq.gz \
-Index index.lst \
-Flag enzyme.txt \
-MisMatch \
-OutDir split

It tells the iTools program to use the Fqtools tool “splitpool” to operate on a pair of gzipped FASTQ files. It also utilizes an index file (index.lst) which contains all the barcodes needed to identify, and separate, the individual samples that were combined prior to sequencing.

The first bump in the road is the -Flag enzyme.txt portion of the code. BGI did not provide me with this file. I recently asked them to send it to me (or its contents, since I suspected it was only a single line text file). They sent me the contents of the file:

CAGC
CTGC

The next problem is that neither of those two sequences is the recognition site for the enzyme that was (supposedly) used: ApeKI. The recognition site for ApeKI is: GCWGC
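
In the recognition site, W is the IUPAC code for A or T, so GCWGC stands for GCAGC and GCTGC. ApeKI cuts after the first G (G^CWGC), which means digested fragments begin with CAGC or CTGC. That may be why enzyme.txt lists those two sequences, although BGI has not confirmed this. A small sketch of the check, with made-up function names:

```shell
# Expand the IUPAC code W (A or T) in a sequence containing a single W.
expand_w() {
  echo "${1/W/A} ${1/W/T}"
}

# Succeed if the second argument is the end of the first.
ends_with() {
  case "$1" in
    *"$2") return 0 ;;
    *) return 1 ;;
  esac
}
```

Example: expand_w GCWGC prints "GCAGC GCTGC", and both ends_with GCAGC CAGC and ends_with GCTGC CTGC succeed.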

Regardless, I decided to see if I could reproduce the demultiplexing using the info they’d provided me.

I cloned the Reseqtools repo, changed into the Reseqtools/iTools directory and typed make.

This resulted in an error informing me that it could not find boost/spirit/core.hpp

I tracked down the Boost library junk, downloaded the newest version and untarred it in /usr/local/bin.

Tried to run make in the Reseqtools/iTools directory and got the same error. Realized iTools might not be searching the system $PATH (this turned out to be correct), so I moved the contents of the Boost folder to the iTools directory, ran make and got the same error. Turns out, the newest version of Boost doesn’t have that core.hpp file any more. Looking at the iTools documentation, iTools was built around Boost 1.44. OMG…

Downloaded Boost 1.44 and went through the same steps as above. This eliminated the missing core.hpp error!
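
A note on why $PATH does not help here: compilers look for headers on their include path, not on $PATH. GCC also reads extra include directories from the CPATH environment variable, which avoids copying Boost into the source tree. The sketch below checks that a header exists under a given root before building; whether the iTools Makefile works with this approach is untested, and /path/to/boost_1_44_0 is a placeholder:

```shell
# Succeed if a header file exists beneath an include root.
has_header() {
  local include_root="$1" header="$2"
  [ -f "$include_root/$header" ]
}
```

Example: has_header /path/to/boost_1_44_0 boost/spirit/core.hpp && CPATH=/path/to/boost_1_44_0 make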

But, of course, led to another error. The error:

"Threading support unavaliable: it has been explicitly disabled with BOOST_DISABLE_THREADS"

That was related to something with newer versions of the GCC compiler (this is, essentially, built into the computer; it’s not worth trying to install/use old versions of GCC) trying to work with old versions of Boost. Found a patch for a config file here: libstdcpp3.hpp.patch

I made the appropriate edits to the file as shown in that link and ran make and it almost worked!

The current error is:

./src/Variants/soapsv-v1.02/include.h:15:16: fatal error: gd.h: No such file or directory

I gave up and contacted BGI to see if they can get me a functional version of iTools…

FASTQC – Oly BGI GBS Raw Illumina Data Demultiplexed

kubu4 — Tue, 07 Mar 2017 17:46:56 +0000

Last week, I ran the two raw FASTQ files through FastQC. As expected, FastQC detected “errors”. These errors are due to the presence of adapter sequences, barcodes, and the use of a restriction enzyme (ApeKI) in library preparation. In summary, it’s not surprising that FastQC was not pleased with the data because it’s expecting a “standard” library prep that’s already been trimmed and demultiplexed.

However, just for comparison, I ran the demultiplexed files through FastQC. The Jupyter notebook is linked below (GitHub).

Results:

Pretty much the same, but with slight improvements due to removal of adapter and barcode sequences. The restriction site still leads FastQC to report errors, which is expected.

All of the FastQC output files are linked at the bottom of the notebook.
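
To see which FastQC modules fail most often across many files, the summary.txt inside each unzipped FastQC output folder can be tallied. A minimal sketch, assuming the usual three tab-separated columns (status, module name, file name); the function name is made up:

```shell
# Count FAIL results per FastQC module across summary.txt files.
# Output is "count<TAB>module", most frequent first.
count_fastqc_fails() {
  awk -F '\t' '$1 == "FAIL" { n[$2]++ } END { for (m in n) print n[m] "\t" m }' "$@" |
    sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

Example: count_fastqc_fails */summary.txt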

Jupyter notebook (GitHub): 20170306_docker_fastqc_demultiplexed_bgi_oly_gbs.ipynb