Tag Archives: Crassostrea virginica

TrimGalore/FastQC/MultiQC – C.virginica Oil Spill MBDseq Concatenated Sequences

Previously concatenated and analyzed our Crassostrea virginica oil spill MBDseq data with FastQC.

We decided to try improving things by running them through TrimGalore! to remove adapters and poor quality sequences.

Processed the samples on Roadrunner (Apple Xserve; Ubuntu 16.04) using default TrimGalore! settings.

After trimming, TrimGalore! output was summarized using MultiQC. Trimmed FastQ files were then analyzed with FastQC and followed up with MultiQC.

Documented in Jupyter Notebook (see below).

Jupyter Notebook (GitHub):


RESULTS

TrimGalore! output folder:

TrimGalore! MultiQC report (HTML):

TrimGalore! FastQC output folder:

FastQC MultiQC report (HTML:

Overall, things look a bit better, but there are still some issues. Will likely eliminate sample 2112_lane_1_TGACCA from analysis and apply some additional sequence filtering, based on sequence length.


SEQUENCE CONTENT PLOT


SHORT SEQUENCE CONTAMINATION


Transposable Element Mapping – Crassostrea virginica Genome, Cvirginica_v300, using RepeatMasker 4.07

Per this GitHub issue, I’m IDing transposable elements (TEs) in the Crassostrea virginica genome.

Genome used:

I ran RepeatMasker (v4.07) with RepBase-20170127 and RMBlast 2.6.0 four times:

  1. Species = all

  2. Species = Crassostrea gigas (Pacific oyster)

  3. Species = Crassostrea virginica (Eastern oyster)

  4. Default settings (i.e. no species select – will use human genome).

The idea with running this with four different settings was to get a sense of how the analyses would differ with species specifications.

All runs were performed on roadrunner.

All commands were documented in a Jupyter Notebook (GitHub):

NOTE: RepeatMasker writes the desired output files (*.out, *.cat.gz, and *.gff) to the same directory that the genome is located in! If you conduct multiple runs with the same genome in the same directory, it will overwrite those files, as they are named using the genome assembly filename. Be sure to move files out of the genome directory after each run!


RESULTS:
RUN 1 (species – all)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Cvirginica_v300.fa       
sequences:            11
total length:  684741128 bp  (684675328 bp excl N/X-runs)
GC level:         34.83 %
bases masked:  113771462 bp ( 16.62 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        97003     27946871 bp    4.08 %
   SINEs:            48145      9242559 bp    1.35 %
   Penelope           1429       256929 bp    0.04 %
   LINEs:            27022     10570154 bp    1.54 %
    CRE/SLACS           28         2219 bp    0.00 %
     L2/CR1/Rex       2160       316660 bp    0.05 %
     R1/LOA/Jockey    3058       386611 bp    0.06 %
     R2/R4/NeSL        511       226938 bp    0.03 %
     RTE/Bov-B        7377      3276312 bp    0.48 %
     L1/CIN4          1331        95476 bp    0.01 %
   LTR elements:     21836      8134158 bp    1.19 %
     BEL/Pao          1807       936488 bp    0.14 %
     Ty1/Copia        3046       296183 bp    0.04 %
     Gypsy/DIRS1     12789      6060883 bp    0.89 %
       Retroviral     2369       152228 bp    0.02 %

DNA transposons     180693     29492426 bp    4.31 %
   hobo-Activator    12869      1114188 bp    0.16 %
   Tc1-IS630-Pogo    17233      2485049 bp    0.36 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac           2388       405926 bp    0.06 %
   Tourist/Harbinger  9302       992476 bp    0.14 %
   Other (Mirage,      238        15946 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:       137707     45460608 bp    6.64 %

Total interspersed repeats:   102899905 bp   15.03 %


Small RNA:           45243      9057873 bp    1.32 %

Satellites:           3852       760316 bp    0.11 %
Simple repeats:     203542      8946510 bp    1.31 %
Low complexity:      26205      1281043 bp    0.19 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be root          
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+


RUN 2 (species – Crassostrea gigas)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Cvirginica_v300.fa       
sequences:            11
total length:  684741128 bp  (684675328 bp excl N/X-runs)
GC level:         34.83 %
bases masked:   93923386 bp ( 13.72 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        26397     15008601 bp    2.19 %
   SINEs:                4          722 bp    0.00 %
   Penelope            675       190160 bp    0.03 %
   LINEs:            17645      8922188 bp    1.30 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex         70        39188 bp    0.01 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          4         5110 bp    0.00 %
     RTE/Bov-B        6194      2718955 bp    0.40 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:      8748      6085691 bp    0.89 %
     BEL/Pao           933       788887 bp    0.12 %
     Ty1/Copia          47        82743 bp    0.01 %
     Gypsy/DIRS1      6819      4822734 bp    0.70 %
       Retroviral        0            0 bp    0.00 %

DNA transposons     163945     26422122 bp    3.86 %
   hobo-Activator     7742       720623 bp    0.11 %
   Tc1-IS630-Pogo    15615      2328538 bp    0.34 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac           2246       393498 bp    0.06 %
   Tourist/Harbinger  8431       876020 bp    0.13 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:       160681     41266796 bp    6.03 %

Total interspersed repeats:    82697519 bp   12.08 %


Small RNA:             214        40811 bp    0.01 %

Satellites:           1396       217317 bp    0.03 %
Simple repeats:     216869      9637447 bp    1.41 %
Low complexity:      27520      1418990 bp    0.21 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea gigas
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        

RUN 3 (species – Crassostrea virginica)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Cvirginica_v300.fa       
sequences:            11
total length:  684741128 bp  (684675328 bp excl N/X-runs)
GC level:         34.83 %
bases masked:   46637065 bp ( 6.81 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        43139      8952068 bp    1.31 %
   SINEs:            43139      8952068 bp    1.31 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       3538      1564942 bp    0.23 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        65151     23982146 bp    3.50 %

Total interspersed repeats:    34499156 bp    5.04 %


Small RNA:           43353      8992879 bp    1.31 %

Satellites:              1          222 bp    0.00 %
Simple repeats:     232627     10544162 bp    1.54 %
Low complexity:      29762      1561018 bp    0.23 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 4 (default settings – human genome)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Cvirginica_v300.fa       
sequences:            11
total length:  684741128 bp  (684675328 bp excl N/X-runs)
GC level:         34.83 %
bases masked:   13461422 bp ( 1.97 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:             2056       120820 bp    0.02 %
      ALUs            0            0 bp    0.00 %
      MIRs          240        14635 bp    0.00 %

LINEs:             3408       331585 bp    0.05 %
      LINE1         240        16835 bp    0.00 %
      LINE2         728        69177 bp    0.01 %
      L3/CR1       1369       135234 bp    0.02 %

LTR elements:       704       236625 bp    0.03 %
      ERVL           14          944 bp    0.00 %
      ERVL-MaLRs     12          892 bp    0.00 %
      ERV_classI    272        36695 bp    0.01 %
      ERV_classII     4          206 bp    0.00 %

DNA elements:      1088       100026 bp    0.01 %
     hAT-Charlie     27         1543 bp    0.00 %
     TcMar-Tigger   142         9891 bp    0.00 %

Unclassified:        57         6096 bp    0.00 %

Total interspersed repeats:   795152 bp    0.12 %


Small RNA:         3698       279669 bp    0.04 %

Satellites:          73         5524 bp    0.00 %
Simple repeats:  247957     10848509 bp    1.58 %
Low complexity:   30084      1536314 bp    0.22 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be homo sapiens  
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

Transposable Element Mapping – Crassostrea virginica NCBI Genome Assembly using RepeatMasker 4.07

Genome used: NCBI GCA_002022765.4_C_virginica-3.0

I ran RepeatMasker (v4.07) with RepBase-20170127 and RMBlast 2.6.0 with species set to Crassotrea virginica.

All commands were documented in a Jupyter Notebook (GitHub):


RESULTS:

Output folder:

Output table (GFF):

Summary table (text):

==================================================
file name: GCF_002022765.2_C_virginica-3.0_genomic.fasta
sequences:            11
total length:  684741128 bp  (684675328 bp excl N/X-runs)
GC level:         34.83 %
bases masked:   46637065 bp ( 6.81 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        43139      8952068 bp    1.31 %
   SINEs:            43139      8952068 bp    1.31 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       3538      1564942 bp    0.23 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        65151     23982146 bp    3.50 %

Total interspersed repeats:    34499156 bp    5.04 %


Small RNA:           43353      8992879 bp    1.31 %

Satellites:              1          222 bp    0.00 %
Simple repeats:     232627     10544162 bp    1.54 %
Low complexity:      29762      1561018 bp    0.23 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

Data Management – SRA Submission LSU C.virginica Oil Spill MBD BS-seq Data

Submitted the Crassostrea virginica (Eastern oyster) MBD BS-seq data we received on 20150413 to NCBI Sequence Read Archive.

Data was uploaded via the web browser interface, as the FTP method was not functioning properly.

SRA deets are below (assigned FASTQ files to new BioProject and created new BioSamples).

SRA Study: SRP139854
BioProject: PRJNA449904

BioSamples Table

Sample Treatment BioSample
HB2 oil 25,000ppm SAMN08919868
HB16 oil 25,000ppm SAMN08919921
HB30 oil 25,000ppm SAMN08919953
NB3 unexposed SAMN08919461
NB6 unexposed SAMN08919577
NB11 unexposed SAMN08919772

TrimGalore/FastQC/MultiQC – Trim 10bp 5’/3′ ends C.virginica MBD BS-seq FASTQ data

Steven found out that the Bismarck documentation (Bismarck is the bisulfite aligner we use in our BS-seq pipeline) suggests trimming 10bp from both the 5′ and 3′ ends. Since this is the next step in our pipeline, we figured we should probably just follow their recommendations!

TrimGalore job script:

Standard error was redirected on the command line to this file:

MD5 checksums were generated on the resulting trimmed FASTQ files:

All data was copied to my folder on Owl.

Checksums for FASTQ files were verified post-data transfer (data not shown).

Results:

Output folder:

FastQC output folder:

MultiQC output folder:

MultiQC HTML report:

Hey! Look at that! Everything is much better! Thanks for the excellent documentation and suggestions, Bismarck!

TrimGalore/FastQC/MultiQC – 2bp 3′ end Read 1s Trim C.virginica MBD BS-seq FASTQ data

Earlier today, I ran TrimGalore/FastQC/MultiQC on the Crassostrea virginica MBD BS-seq data from ZymoResearch and hard trimmed the first 14bp from each read. Things looked better at the 5′ end, but the 3′ end of each of the READ1 seqs showed a wonky 2bp blip, so decided to trim that off.

I ran TrimGalore (using the built-in FastQC option), with a hard trim of the last 2bp of each first read set that had previously had the 14bp hard trim and followed up with MultiQC for a summary of the FastQC reports.

TrimGalore job script:

Standard error was redirected on the command line to this file:

MD5 checksums were generated on the resulting trimmed FASTQ files:

All data was copied to my folder on Owl.

Checksums for FASTQ files were verified post-data transfer (data not shown).

Results:

Output folder:

FastQC output folder:

MultiQC output folder:

MultiQC HTML report:

Well, this is a bit strange, but the 2bp trimming on the read 1s looks fine, but now the read 2s are weird in the same region!

Regardless, while this was running, Steven found out that the Bismarck documentation (Bismarck is the bisulfite aligner we use in our BS-seq pipeline) suggests trimming 10bp from both the 5′ and 3′ ends. So, maybe this was all moot. I’ll go ahead and re-run this following the Bismark recommendations.

TrimGalore/FastQC/MultiQC – 14bp Trim C.virginica MBD BS-seq FASTQ data

Yesterday, I ran TrimGalore/FastQC/MultiQC on the Crassostrea virginica MBD BS-seq data from ZymoResearch with the default settings (i.e. “auto-trim”). There was still some variability in the first ~15bp of the reads and Steven wanted to see how a hard trim would change things.

I ran TrimGalore (using the built-in FastQC option), with a hard trim of the first 14bp of each read and followed up with MultiQC for a summary of the FastQC reports.

TrimGalore job script:

Standard error was redirected on the command line to this file:

MD5 checksums were generated on the resulting trimmed FASTQ files:

All data was copied to my folder on Owl.

Checksums for FASTQ files were verified post-data transfer (data not shown).

Results:

Output folder:

FastQC output folder:

MultiQC output folder:

MultiQC HTML report:

OK, this trimming definitely took care of the variability seen in the first ~15bp of all the reads.

However, I noticed that the last 2bp of each of the Read 1 seqs all have some wonky stuff going on. I’m guessing I should probably trim that stuff off, too…

TrimGalore/FastQC/MultiQC – Auto-trim C.virginica MBD BS-seq FASTQ data

Yesterday, I ran FastQC/MultiQC on the Crassostrea virginica MBD BS-seq data from ZymoResearch. Steven wanted to trim it and see how things turned out.

I ran TrimGalore (using the built-in FastQC option) and followed up with MultiQC for a summary of the FastQC reports.

TrimGalore job script:

Standard error was redirected on the command line to this file:

MD5 checksums were generated on the resulting trimmed FASTQ files:

All data was copied to my folder on Owl.

Checksums for FASTQ files were verified post-data transfer.

Results:

Output folder:

FastQC output folder:

MultiQC output folder:

MultiQC HTML report:

Overall, the auto-trim didn’t alter things too much. Specifically, Steven is concerned about the variability in the first 15bp (seen in the Per Base Sequence Content section of the MultiQC output). It was reduced, but not greatly. Will perform an independent run of TrimGalore and employ a hard trim of the first 14bp of each read and see how that looks.

FastQC/MultiQC – C. virginica MBD BS-seq Data

Per Steven’s GitHub Issues request, I ran FastQC on the Eastern oyster MBD bisulfite sequencing data we recently got back from ZymoResearch.

Ran FastQC locally with the following script: 20180409_fastqc_Cvirginica_MBD.sh


#!/bin/bash
/home/sam/software/FastQC/fastqc \
--threads 18 \
--outdir /home/sam/20180409_fastqc_Cvirginica_MBD \
/mnt/owl/nightingales/C_virginica/zr2096_10_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_10_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_1_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_1_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_2_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_2_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_3_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_3_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_4_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_4_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_5_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_5_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_6_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_6_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_7_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_7_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_8_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_8_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_9_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_9_s1_R2.fastq.gz

MultiQC was then run on the FastQC output files.

All files were moved to Owl after the jobs completed.

Results:

FastQC Output folder: 20180409_fastqc_Cvirginica_MBD/

MultiQC Output folder: 20180409_fastqc_Cvirginica_MBD/multiqc_data/

MultiQC report (HTML): 20180409_fastqc_Cvirginica_MBD/multiqc_data/multiqc_report.html

Everything looks good to me.

Steven’s interested in seeing what the trimmed output would look like (and, how it would impact mapping efficiencies). Will initiate trimming.

See the GitHub issue linked above for the full discussion.