Tag Archives: FASTQC

Bioinformatics – Trimmomatic/FASTQC on C.gigas Larvae OA NGS Data

Previously trimmed the first 39 bases of sequence from reads from the BS-Seq data in an attempt to improve our ability to map the reads back to the C.gigas genome. However, Mac (and Steven) noticed that the last ~10 bases of all the reads showed a steady increase in the %G, suggesting some sort of bias (maybe adaptor??):

Although I didn’t mention this previously, the figure above also shows an odd “waves” pattern that repeats in all bases except for G. Not sure what to think of that…

Quick summary of actions taken (specifics are available in Jupyter notebook below):

  • Trim first 39 bases from all reads in all raw sequencing files.
  • Trim last 10 bases from all reads in raw sequencing files
  • Concatenate the two sets of reads (400ppm and 1000ppm treatments) into single FASTQ files for Steven to work with.

Raw sequencing files:

Notebook Viewer: 20150521_Cgigas_larvae_OA_Trimmomatic_FASTQC

Jupyter (IPython) notebook: 20150521_Cgigas_larvae_OA_Trimmomatic_FASTQC.ipynb

 

 

Output files

Trimmed, concatenated FASTQ files
20150521_trimmed_2212_lane2_400ppm_GCCAAT.fastq.gz
20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz

 

FASTQC files
20150521_trimmed_2212_lane2_400ppm_GCCAAT_fastqc.html
20150521_trimmed_2212_lane2_400ppm_GCCAAT_fastqc.zip

20150521_trimmed_2212_lane2_1000ppm_CTTGTA_fastqc.html
20150521_trimmed_2212_lane2_1000ppm_CTTGTA_fastqc.zip

 

Example of FASTQC analysis pre-trim:

 

 

Example FASTQC post-trim (from 400ppm data):

 

Trimming has removed the intended bad stuff (inconsistent sequence in the first 39 bases and rise in %G in the last 10 bases). Sequences are ready for further analysis for Steven.

However, we still see the “waves” pattern with the T, A and C. Additionally, we still don’t know what caused the weird inconsistencies, nor what sequence is contained therein that might be leading to that. Will contact the sequencing facility to see if they have any insight.

Bioinformatics – Trimmomatic/FASTQC on C.gigas Larvae OA NGS Data

In another troubleshooting attempt for this problematic BS-seq Illumina data, I’m going to use Trimmomatic to remove the first 39 bases of each read. This is due to the fact that even after the previous quality trimming with Trimmomatic, the first 39 bases still showed inconsistent quality:

 

Ran Trimmomatic on just a single data set to try things out: 2212_lane2_CTTGTA_L002_R1_001.fastq.gz

Notebook Viewer: 20150506_Cgigas_larvae_OA_trimmomatic_FASTQC

Jupyter (IPython) notebook: 20150506_Cgigas_larvae_OA_trimmomatic_FASTQC.ipynb

Results:

Trimmed FASTQ: 20150506_trimmed_2212_lane2_CTTGTA_L002_R1_001.fastq.gz

FASTQC Report: 20150506_trimmed_2212_lane2_CTTGTA_L002_R1_001_fastqc.html

You can see how flat the newly trimmed data is (which is what one would expect).

Steven will take this trimmed dataset and try additional mapping with it to see if removal of the first 39 bases will improve the mapping.

 

Quality Trimming – C.gigas Larvae OA BS-Seq Data

Jupyter (IPython) Notebook: 20150414_C_gigas_Larvae_OA_Trimmomatic_FASTQC.ipynb

NBviewer: 20150414_C_gigas_Larvae_OA_Trimmomatic_FASTQC.ipynb

 

Trimmed FASTQC

400ppm Index – GCCAAT

20150414_trimmed_2212_lane2_GCCAAT_L002_R1_001_fastqc.html
20150414_trimmed_2212_lane2_GCCAAT_L002_R1_002_fastqc.html
20150414_trimmed_2212_lane2_GCCAAT_L002_R1_003_fastqc.html
20150414_trimmed_2212_lane2_GCCAAT_L002_R1_004_fastqc.html
20150414_trimmed_2212_lane2_GCCAAT_L002_R1_005_fastqc.html
20150414_trimmed_2212_lane2_GCCAAT_L002_R1_006_fastqc.html

1000ppm Index – CTTGTA

20150414_trimmed_2212_lane2_CTTGTA_L002_R1_001_fastqc.html
20150414_trimmed_2212_lane2_CTTGTA_L002_R1_002_fastqc.html
20150414_trimmed_2212_lane2_CTTGTA_L002_R1_003_fastqc.html
20150414_trimmed_2212_lane2_CTTGTA_L002_R1_004_fastqc.html

 

 

Quality Trimming – LSU C.virginica Oil Spill MBD BS-Seq Data

Jupyter (IPython) Notebook: 20150414_C_virginica_LSU_Oil_Spill_Trimmomatic_FASTQC.ipynb

NBviewer: 20150414_C_virginica_LSU_Oil_Spill_Trimmomatic_FASTQC.ipynb

Trimmed FASTQC

NB3 No oil Index – ACAGTG

20150414_trimmed_2112_lane1_ACAGTG_L001_R1_001_fastqc.html
20150414_trimmed_2112_lane1_ACAGTG_L001_R1_002_fastqc.html

NB6 No oil Index – GCCAAT

20150414_trimmed_2112_lane1_GCCAAT_L001_R1_001_fastqc.html
20150414_trimmed_2112_lane1_GCCAAT_L001_R1_002_fastqc.html

NB11 No oil Index – CAGATC

20150414_trimmed_2112_lane1_CAGATC_L001_R1_001_fastqc.html
20150414_trimmed_2112_lane1_CAGATC_L001_R1_002_fastqc.html
20150414_trimmed_2112_lane1_CAGATC_L001_R1_003_fastqc.html

HB2 25,000ppm oil Index – ATCACG

20150414_trimmed_2112_lane1_ATCACG_L001_R1_001_fastqc.html
20150414_trimmed_2112_lane1_ATCACG_L001_R1_002_fastqc.html
20150414_trimmed_2112_lane1_ATCACG_L001_R1_003_fastqc.html

HB16 25,000ppm oil Index – TTAGGC

20150414_trimmed_2112_lane1_TTAGGC_L001_R1_001_fastqc.html
20150414_trimmed_2112_lane1_TTAGGC_L001_R1_002_fastqc.html

HB30 25,000ppm oil Index – TGACCA

20150414_trimmed_2112_lane1_TGACCA_L001_R1_001_fastqc.html

Sequence Data Analysis – LSU C.virginica Oil Spill MBD BS-Seq Data

Performed some rudimentary data analysis on the new, demultiplexed data downloaded earlier today:

2112_lane1_ACAGTG_L001_R1_001.fastq.gz
2112_lane1_ACAGTG_L001_R1_002.fastq.gz
2112_lane1_ATCACG_L001_R1_001.fastq.gz
2112_lane1_ATCACG_L001_R1_002.fastq.gz
2112_lane1_ATCACG_L001_R1_003.fastq.gz
2112_lane1_CAGATC_L001_R1_001.fastq.gz
2112_lane1_CAGATC_L001_R1_002.fastq.gz
2112_lane1_CAGATC_L001_R1_003.fastq.gz
2112_lane1_GCCAAT_L001_R1_001.fastq.gz
2112_lane1_GCCAAT_L001_R1_002.fastq.gz
2112_lane1_TGACCA_L001_R1_001.fastq.gz
2112_lane1_TTAGGC_L001_R1_001.fastq.gz
2112_lane1_TTAGGC_L001_R1_002.fastq.gz

 

Compared total amount of data (in gigabytes) generated from each index. The commands below send the output of the ‘ls -l’ command to awk. Awk sums the file sizes, found in the 5th field ($5) of the ‘ls -l’ command, then prints the sum, divided by 1024^3 to convert from bytes to gigabytes.

Index: ACAGTG

$ls -l 2112_lane1_AC* | awk '{sum += $5} END {print sum/1024/1024/1024}'
1.49652

 

Index: ATCACG

$ls -l 2112_lane1_AT* | awk '{sum += $5} END {print sum/1024/1024/1024}'
3.02269

 

Index: CAGATC

$ls -l 2112_lane1_CA* | awk '{sum += $5} END {print sum/1024/1024/1024}'
3.49797

 

Index: GCCAAT

$ls -l 2112_lane1_GC* | awk '{sum += $5} END {print sum/1024/1024/1024}'
2.21379

 

Index: TGACCA

$ls -l 2112_lane1_TG* | awk '{sum += $5} END {print sum/1024/1024/1024}'
0.687374

 

Index: TTAGGC

$ls -l 2112_lane1_TT* | awk '{sum += $5} END {print sum/1024/1024/1024}'
2.28902

 

Ran FASTQC on the following files downloaded earlier today. The FASTQC command is below. This command runs FASTQC in a for loop over any files that begin with “2212_lane2_C” or “2212_lane2_G” and outputs the analyses to the Arabidopsis folder on Eagle:

$for file in /Volumes/nightingales/C_virginica/2112_lane1_[ATCG]*; do fastqc "$file" --outdir=/Volumes/Eagle/Arabidopsis/; done

 

From within the Eagle/Arabidopsis folder, I renamed the FASTQC output files to prepend today’s date:

$for file in 2112_lane1_[ATCG]*; do mv "$file" "20150413_$file"; done

 

Then, I unzipped the .zip files generated by FASTQC in order to have access to the images, to eliminate the need for screen shots for display in this notebook entry:

$for file in 20150413_2112_lane1_[ATCG]*.zip; do unzip "$file"; done

 

The unzip output retained the old naming scheme, so I renamed the unzipped folders:

$for file in 2112_lane1_[ATCG]*; do mv "$file" "20150413_$file"; done

 

The FASTQC results are linked below:

20150413_2112_lane1_ACAGTG_L001_R1_001_fastqc.html
20150413_2112_lane1_ACAGTG_L001_R1_002_fastqc.html
20150413_2112_lane1_ATCACG_L001_R1_001_fastqc.html
20150413_2112_lane1_ATCACG_L001_R1_002_fastqc.html
20150413_2112_lane1_ATCACG_L001_R1_003_fastqc.html
20150413_2112_lane1_CAGATC_L001_R1_001_fastqc.html
20150413_2112_lane1_CAGATC_L001_R1_002_fastqc.html
20150413_2112_lane1_CAGATC_L001_R1_003_fastqc.html
20150413_2112_lane1_GCCAAT_L001_R1_001_fastqc.html
20150413_2112_lane1_GCCAAT_L001_R1_002_fastqc.html
20150413_2112_lane1_TGACCA_L001_R1_001_fastqc.html
20150413_2112_lane1_TTAGGC_L001_R1_001_fastqc.html
20150413_2112_lane1_TTAGGC_L001_R1_002_fastqc.html

 

Sequence Data Analysis – C.gigas Larvae OA BS-Seq Data

Compared total amount of data generated from each index. The commands below send the output of the ‘ls -l’ command to awk. Awk sums the file sizes, found in the 5th field ($5) of the ‘ls -l’ command, then prints the sum, divided by 1024^3 to convert from bytes to gigabytes.

Index: CTTGTA

$ ls -l 2212_lane2_[C]* | awk '{sum += $5} END {print sum/1024/1024/1024}'
5.33341

Index: GCCAAT
$ ls -l 2212_lane2_[G]* | awk '{sum += $5} END {print sum/1024/1024/1024}'
7.00596

There’s ~1.4x data in the GCCAAT files.

 

Ran FASTQC on the following files downloaded earlier today:

2212_lane2_CTTGTA_L002_R1_001.fastq.gz
2212_lane2_CTTGTA_L002_R1_002.fastq.gz
2212_lane2_CTTGTA_L002_R1_003.fastq.gz
2212_lane2_CTTGTA_L002_R1_004.fastq.gz
2212_lane2_GCCAAT_L002_R1_001.fastq.gz
2212_lane2_GCCAAT_L002_R1_002.fastq.gz
2212_lane2_GCCAAT_L002_R1_003.fastq.gz
2212_lane2_GCCAAT_L002_R1_004.fastq.gz
2212_lane2_GCCAAT_L002_R1_005.fastq.gz
2212_lane2_GCCAAT_L002_R1_006.fastq.gz

 

The FASTQC command is below. This command runs FASTQC in a for loop over any files that begin with “2212_lane2_C” or “2212_lane2_G” and outputs the analyses to the Arabidopsis folder on Eagle:

$for file in /Volumes/nightingales/C_gigas/2212_lane2_[CG]*; do fastqc "$file" --outdir=/Volumes/Eagle/Arabidopsis/; done

 

From within the Eagle/Arabidopsis folder, I renamed the FASTQC output files to prepend today’s date:

$for file in 2212_lane2_[GC]*; do mv "$file" "20150413_$file"; done

 

Then, I unzipped the .zip files generated by FASTQC in order to have access to the images, to eliminate the need for screen shots for display in this notebook entry:

$for file in 20150413_2212_lane2_[CG]*.zip; do unzip "$file"; done

 

The unzip output retained the old naming scheme, so I renamed the unzipped folders:

$for file in 2212_lane2_[GC]*; do mv “$file” “20150413_$file”; done

 

The FASTQC results are linked below:

20150413_2212_lane2_CTTGTA_L002_R1_001_fastqc.html

20150413_2212_lane2_CTTGTA_L002_R1_002_fastqc.html
20150413_2212_lane2_CTTGTA_L002_R1_003_fastqc.html
20150413_2212_lane2_CTTGTA_L002_R1_004_fastqc.html
20150413_2212_lane2_GCCAAT_L002_R1_001_fastqc.html
20150413_2212_lane2_GCCAAT_L002_R1_002_fastqc.html
20150413_2212_lane2_GCCAAT_L002_R1_003_fastqc.html
20150413_2212_lane2_GCCAAT_L002_R1_004_fastqc.html
20150413_2212_lane2_GCCAAT_L002_R1_005_fastqc.html
20150413_2212_lane2_GCCAAT_L002_R1_006_fastqc.html

 

TruSeq Adaptor Identification Method Comparison – LSU C.virginica Oil Spill Sequences

We recently received Illumina HiSeq2500 data back from this project. Initially looking at the data, something seems off.  Using FASTQC, the quality drops of drastically towards the last 20 bases of the reads. We also see a high degree of Illumina TruSeq adaptor/index sequences present in our data.

Since this sequencing run was multiplexed (i.e. multiple libraries were pooled and run together on the HiSeq), we need to demultiplex our sequences before performing any trimming. Otherwise, the trimming could remove the index (barcodes) sequences from the data and prevent us from separating out the different libraries from each other.

However, it turns out, demultiplexing is not a simple, straightforward task. There are a variety of programs available and they all have different options. I decided to compare TruSeq index identification using two programs:

-grep (grep is a built-in command line (bash) program that searches through files to find matches to user-provided information.)
-fastx_barcode_splitter.pl (fastx_barcode_splitter.pl is a component of the fastx_tookit that searches through FASTQ files to identify matches to user-provided index/barcode sequences.)

The advantage(s) of using grep is that it’s extremely fast, easy to use, and already exists on most Unix-based computers (Linux, OS X), thus not requiring any software installation. The disadvantage(s) of using grep for a situation like this is that it is not amenable to allowing for mismatches and/or partial matches to the user-provided information.

The advantage(s) of using fastx_barcode_splitter.pl is that it can accept a user-defined number of mismatches and/or partial matches to the user-defined index/barcode sequences. The disadvantage(s) of using fastx_barcode_splitter.pl is that it requires the user to specify the expected location of the index/barcode sequence in the target sequence: either the beginning of the line or the end of the line. It will not search beyond the length(s) of the provided index/barcode sequences. That means if you index/barcode exists in the middle of your sequences, this program will not find it. Additionally, since this program doesn’t exist natively on Unix-based machines, it must be downloaded and installed by the user.

So, I tested both of these programs to see how they compared at matching both long (the TruSeq adaptor/index sequences identified with FASTQC) and “short” (the actual 6bp index sequence) barcodes.

To simplify testing, only a single sequence file was used from the data set.

All analysis was done in a Jupyter (IPython) notebook.

FASTQC HTML file for easier viewing of FASTQC output.

NBViewer version of embedded notebook below.

 

Result:

grep

long barcodes: Found in ~12% of all reads

short barcodes: Found in ~25% of all reads

 

fastx_barcode_splitter

long barcodes, beginning of line: Found in ~15% of all reads

long barcodes, end of line: Found in < 0.008% of all reads (yes, that is actually percentage)

 

short barcodes, beginning of line: Found in ~1.3% of all reads

short barcodes, end of line: Found in ~2.7% of all reads

 

Overall, the comparison is interesting, however, the important take home from this is that in the best-case scenario (grep, short barcodes), we’re only able to identify 25% of the reads in our sequences!

It should also be noted that my analysis only used sequences in one orientation. It would be a good idea to also do this analysis by searching with the reverse and reverse complements of these sequences.