Tag Archives: md5

Data Received – Chionoecetes bairdi RNAseq & FastQC Analysis

We received Grace’s 100bp PE NovaSeq (Illumina) RNAseq data from the Northwest Genomics Center today.

Data was downloaded via their Aspera browser plugin and rsynced to:

Owl/nightingales/C_bairdi

MD5 checksums were generated (md5sum on Ubuntu):

checksums.md5


321ec408ba7e0f0be1929ca44871f963  304428_S1_L001_R1_001.fastq.gz
b95c69f755c9c42d9203429119d4234d  304428_S1_L001_R2_001.fastq.gz
a0fd8db312057dedd480231d4d125fd3  304428_S1_L002_R1_001.fastq.gz
c6e70ef7f3c8a866851a1b9453aef36a  304428_S1_L002_R2_001.fastq.gz
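A checksum file like the one above can be produced with a short helper. This is only a sketch, assuming GNU `md5sum` and file names ending in `.fastq.gz`; the function name and the example path are illustrative, not the exact commands used here.

```shell
# Sketch: write MD5 checksums for every gzipped FASTQ file in a directory.
# The function name and default output file name are illustrative.
make_checksums() {
  local dir="$1"
  local out="${2:-checksums.md5}"
  # Run inside the directory so the checksum file stores bare file names,
  # which lets `md5sum -c` be run from that same directory later.
  (
    cd "$dir" || exit 1
    md5sum -- *.fastq.gz > "$out"
  )
}

# Example (placeholder path):
# make_checksums /path/to/C_bairdi
```

Storing bare file names rather than full paths keeps the checksum file usable after the directory is copied or rsynced elsewhere.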

FastQC analysis was run, followed by MultiQC.
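For reference, a FastQC run followed by MultiQC might look like the sketch below. It assumes `fastqc` and `multiqc` are on the PATH; the thread count, the function name, and the `DRY_RUN` switch are assumptions for illustration and not the settings used for this run.

```shell
# Sketch: run FastQC on a set of FASTQ files, then summarize with MultiQC.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_qc() {
  local outdir="$1"
  shift
  local run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi
  # $run is left unquoted on purpose: when empty it expands to nothing.
  $run mkdir -p "$outdir"
  $run fastqc --threads 4 --outdir "$outdir" "$@"
  $run multiqc "$outdir" --outdir "$outdir"
}

# Example (placeholder names):
# run_qc 20181015_Cbairdi_fastqc *.fastq.gz
```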

Output folder (gannet/Atumefaciens):

20181015_Cbairdi_fastqc/

MultiQC Report (HTML):

20181015_Cbairdi_fastqc/multiqc_report.html

Nightingales spreadsheet was updated with file info and FastQC info:

Nightingales (Google Sheet)

Data Management – Replacement of Corrupt BGI Oly Genome FASTQ Files

0000-0002-2747-368X

Previously, Sean and Steven identified two potentially corrupt FASTQ files. I contacted BGI about getting replacement files, and they informed me that the FASTQ files they delivered on three separate occasions are all the same files, despite having different file names. As such, I could use a copy from one of the other deliveries to replace the corrupt FASTQ files. So, that’s what I did!

See the Jupyter Notebook below for the deets!

Jupyter Notebook (GitHub): 20170104_docker_oly_BGI_genome_corruption_solved.ipynb
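The claim that differently named deliveries hold the same content can be spot-checked by comparing digests of the file contents. A minimal sketch, assuming `md5sum` and `awk` are available; the function name is hypothetical and the notebook may do this differently.

```shell
# Sketch: succeed only if two files have identical MD5 digests,
# regardless of what the files are named.
same_md5() {
  # Bail out with status 2 if either file cannot be read.
  [ -r "$1" ] && [ -r "$2" ] || return 2
  local a b
  # Reading from stdin keeps the file name out of the md5sum output.
  a="$(md5sum < "$1" | awk '{print $1}')"
  b="$(md5sum < "$2" | awk '{print $1}')"
  [ "$a" = "$b" ]
}

# Example (placeholder names):
# same_md5 delivery1/sample_1.fq.gz delivery2/renamed_1.fq.gz && echo "identical"
```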

Data Management – Geoduck RRBS Data Integrity Verification

Yesterday, I downloaded the Illumina FASTQ files provided by Genewiz for Hollie Putnam’s reduced representation bisulfite geoduck libraries. However, Genewiz had not provided a checksum file at the time.

I received the checksum file from Genewiz and have verified that the data is intact. Verification is described in the Jupyter notebook below.

Data files are located here: owl/web/nightingales/P_generosa

Jupyter notebook (GitHub): 20161230_docker_geoduck_RRBS_md5_checks.ipynb
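Checking downloaded files against a provider-supplied checksum file usually comes down to `md5sum -c`. The sketch below assumes the checksum file is in md5sum format (hash, two spaces, file name) with names relative to the data directory; the function name and paths are illustrative.

```shell
# Sketch: verify files in a directory against a checksum file.
# Exit status is non-zero if any listed file is missing or does not match.
verify_checksums() {
  local dir="$1"
  local sums="$2"   # path relative to $dir, or an absolute path
  ( cd "$dir" && md5sum -c "$sums" )
}

# Example (placeholder names):
# verify_checksums /path/to/P_generosa checksums.md5
```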

Data Management – Integrity Check of Final BGI Olympia Oyster & Geoduck Data

After completing the downloads of these files from BGI, I needed to verify that the downloaded copies matched the originals. Below is a Jupyter Notebook detailing how I verified file integrity via MD5 checksums. It also highlights the importance of doing this check when working with large sequencing files (or just large files in general), as a few of them had mismatched MD5 checksums!

Although the notebook is embedded below, it might be easier viewing via the notebook link (hosted on GitHub).

At the end of the day, I had to re-download some files, but all the MD5 checksums match and these data are ready for analysis:

Final Ostrea lurida genome files

Final Panopea generosa genome files

Jupyter Notebook: 20161214_docker_BGI_data_integrity_check.ipynb
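When only a few files out of many fail, it helps to list exactly which ones need to be downloaded again. A sketch of one way to do that, by comparing the provider's checksum list with a locally generated one. It assumes both lists are in md5sum text format, that file names contain no spaces, and that the first list is not empty; the function name is hypothetical.

```shell
# Sketch: print the names of files whose local MD5 differs from, or is
# absent from, the expected list.
list_mismatches() {
  local expected="$1"
  local local_sums="$2"
  awk '
    NR == FNR { want[$2] = $1; next }             # first file: expected hashes
    !($2 in want) || want[$2] != $1 { print $2 }  # second file: compare
  ' "$expected" "$local_sums"
}

# Example (placeholder names):
# list_mismatches bgi_provided.md5 local.md5
```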

Data Management – Tracking O.lurida FASTQ File Corruption

UPDATE 20170104 – These two corrupt files have been replaced with non-corrupt files.

Sean identified an issue with one of the original FASTQ files provided to us by BGI. Additionally, Steven had (unknowingly) identified the same corrupt file, as well as a second corrupt file in the set of FASTQ files. The issue is discussed here: https://github.com/sr320/LabDocs/issues/334

Steven noticed the two files when he ran FastQC and they generated no output (but no error message, either!).

The two files in question are:

151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz
151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz
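Since FastQC failed silently here, a quick independent check on gzipped FASTQ files is `gzip -t`, which tests the compressed stream without writing any output. This is a sketch only: it catches truncated or CRC-damaged gzip files, and whether that is the kind of corruption affecting these two files is not established in this post. The function name is illustrative.

```shell
# Sketch: report which gzipped files fail a gzip integrity test.
# Returns non-zero if at least one file fails.
check_gzip() {
  local f status=0
  for f in "$@"; do
    if gzip -t -- "$f" 2>/dev/null; then
      echo "OK      $f"
    else
      echo "CORRUPT $f"
      status=1
    fi
  done
  return "$status"
}

# Example:
# check_gzip *.fq.gz
```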

This post is an attempt to document where things went wrong, but having glanced through this data a bit already, it won’t provide any answers.

I originally downloaded the data on 20160127 to my home folder on Owl (this is detailed in the Jupyter notebook in that post) and generated/compared MD5 checksum values. The values matched at that time.

So, let’s investigate a bit further…

Launch Docker container

docker run -p 8888:8888 -v /Users/sam/data/:/data -v /Users/sam/owl_home/:/owl_home -v /Users/sam/owl_web/:/owl_web -v /Users/sam/gitrepos/LabDocs/jupyter_nbs/sam/:/jupyter_nbs -it 0ba43904567e

The command allows access to Jupyter Notebook over port 8888 and makes my Jupyter Notebook GitHub repo and my data files accessible to the Docker container.

Once the container was running, I started Jupyter Notebook with the following command inside the Docker container:

jupyter notebook

This command is configured in the Docker container to launch a Jupyter Notebook without a browser on port 8888.

Jupyter notebook file: 20161117_docker_oly_genome_fastq_corruption.ipynb

I’ve embedded the notebook below, but the actual file linked above is much easier to view, since many lengthy commands and file names wrap lines in the embedded version.

Sequencing Data – C.gigas Larvae OA

Our sequencing data (Illumina HiSeq2500, 100SE) for this project has been completed by the Univ. of Oregon Genomics Core Facility (order number 2212).

Samples sequenced/pooled for this run:

Sample	Treatment	Barcode
400ppm	400ppm	GCCAAT
1000ppm	1000ppm	CTTGTA

All code listed below was run on OS X 10.9.5

Ran a bash script called “download.sh” to download all the files. The script contents were:

#!/bin/bash
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_001.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_002.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_003.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_004.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_005.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_006.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_007.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_008.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_009.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_010.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_011.fastq.gz
curl -O http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_012.fastq.gz
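Because the twelve URLs differ only in a zero-padded counter, the same list can be generated in a loop. The sketch below reuses the base URL, order number, and file-name pattern from the script above; the function name is hypothetical, and the download line is left commented out.

```shell
# Sketch: print the 12 download URLs, one per line.
list_urls() {
  local i=1
  while [ "$i" -le 12 ]; do
    # %03d zero-pads the counter to three digits (001 ... 012).
    printf 'http://gcf.uoregon.edu:8080/job/download/2212?fileName=lane2_NoIndex_L002_R1_%03d.fastq.gz\n' "$i"
    i=$((i + 1))
  done
}

# To download each file (quoting the URL keeps the shell from treating
# the "?" as a glob character):
# list_urls | while read -r url; do curl -O "$url"; done
```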

Downloaded all 12 fastq.gz files to Owl/web/nightingales/C_gigas

Renamed all files by removing the beginning of each file name (2212?fileName=) and replacing that with 2212_:

$ for file in 2212*lane2_NoIndex_L002_R1_0*; do mv "$file" "${file/#2212?fileName=/2212_}"; done
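Before a bulk rename it can be useful to preview the old and new names. The sketch below computes the new name with POSIX prefix removal, a portable alternative to the bash-only `${file/#pattern/replacement}` form used in the command above. The function name is hypothetical.

```shell
# Sketch: print the name a file would get after the rename.
# Names without the "2212?fileName=" prefix are printed unchanged.
new_name() {
  case "$1" in
    '2212?fileName='*)
      # Quoting the pattern makes the "?" literal instead of a wildcard.
      printf '2212_%s\n' "${1#'2212?fileName='}"
      ;;
    *)
      printf '%s\n' "$1"
      ;;
  esac
}

# Preview only, nothing is moved:
# for file in 2212*; do echo "$file -> $(new_name "$file")"; done
```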

Created a directory readme.md (markdown) file to list & describe directory contents: readme.md

$ ls *.gz >> readme.md

Note: In order for the readme file to appear in the web directory listing, the file name cannot be all upper-case.

Created MD5 checksums for each of the files: checksums.md5

$ md5 2212* >> checksums.md5
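Note that the OS X `md5` command prints lines in the form `MD5 (file) = hash`, which `md5sum -c` on Linux generally will not accept as is. A sketch of a filter that rewrites such lines into md5sum format (hash, two spaces, file name), assuming lowercase hex digests; the function name is hypothetical. Running `md5 -r` instead should print the hash first and avoid the conversion.

```shell
# Sketch: convert BSD/OS X style md5 output on stdin to md5sum format.
# Lines that do not match the BSD layout pass through unchanged.
bsd_to_gnu_md5() {
  sed -E 's/^MD5 \((.*)\) = ([0-9a-f]{32})$/\2  \1/'
}

# Example:
# bsd_to_gnu_md5 < checksums.md5 > checksums_gnu.md5
```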

Sequencing Data – LSU C.virginica MBD BS-Seq

Our sequencing data (Illumina HiSeq2500, 100SE) for this project has been completed by the Univ. of Oregon Genomics Core Facility (order number 2112).

Samples sequenced/pooled for this run:

Sample	Treatment	Barcode
HB2	25,000ppm oil	ATCACG
HB16	25,000ppm oil	TTAGGC
HB30	25,000ppm oil	TGACCA
NB3	No oil	ACAGTG
NB6	No oil	GCCAAT
NB11	No oil	CAGATC

All code listed below was run on OS X 10.9.5

Downloaded all 15 fastq.gz files to Owl/web/nightingales/C_virginica:

$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_001.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_002.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_003.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_004.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_005.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_006.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_007.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_008.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_009.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_010.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_011.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_012.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_013.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_014.fastq.gz
$ curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_015.fastq.gz

Renamed all files by removing the beginning of each file name (2112?fileName=) and replacing that with 2112_:

$ for file in 2112*lane1_NoIndex_L001_R1_0*; do mv "$file" "${file/#2112?fileName=/2112_}"; done

Created a directory readme.md (markdown) file to list & describe directory contents: readme.md

$ ls *.gz >> readme.md

Note: In order for the readme file to appear in the web directory listing, the file name cannot be all upper-case.

Created MD5 checksums for each fastq.gz file: checksums.md5

$ md5 *.gz >> checksums.md5

Sam's Notebook

University of Washington – Fishery Sciences – Roberts Lab
