BS-seq Mapping – Olympia oyster bisulfite sequencing: Bismark Continued

Previously took the analysis just through the mapping, but didn’t realize Steven wanted me to fully process the data.

So, as an exercise, I followed through with deduplication and sorting of the BAM files.
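For orientation, here's a minimal dry-run sketch of what the deduplication and sorting steps can look like for a single Bismark BAM. It only prints the commands. The BAM file name is hypothetical, and the `-p` (paired-end) flag is an assumption; single-end data would use `-s` instead. The commands actually used are in the notebook.

```shell
# Dry-run sketch: print (do not run) dedup/sort/index commands for one Bismark BAM.
# The BAM name is hypothetical; -p (paired-end) is an assumption, use -s for single-end.
bam="sample_bismark_bt2_pe.bam"
dedup_bam="${bam%.bam}.deduplicated.bam"     # name Bismark gives the deduplicated file
sorted_bam="${dedup_bam%.bam}.sorted.bam"    # coordinate-sorted output from samtools

commands=$(cat <<EOF
deduplicate_bismark --bam -p $bam
samtools sort -o $sorted_bam $dedup_bam
samtools index $sorted_bam
EOF
)

printf '%s\n' "$commands"
```

Printing the commands first makes it easy to eyeball file names before committing to a long run.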

Then, ran a quick analysis using methylKit in R. The analysis simply copies what Steven had done with another data set. I haven’t examined it very thoroughly, so I’m not well-versed in what it’s doing and/or why.

Jupyter Notebook (GitHub):

RStudio Project (download the folder, load the project in RStudio, and then run the script in the scripts subdirectory to run the analysis):

Will take the full data sets through this whole pipeline.

Transposable Element Mapping – Crassostrea virginica NCBI Genome Assembly using RepeatMasker 4.0.7

Genome used: NCBI GCA_002022765.4_C_virginica-3.0

I ran RepeatMasker (v4.0.7) with RepBase-20170127 and RMBlast 2.6.0, with species set to Crassostrea virginica.

All commands were documented in a Jupyter Notebook (GitHub):


RESULTS:

Output folder:

Output table (GFF):

Summary table (text):

==================================================
file name: GCF_002022765.2_C_virginica-3.0_genomic.fasta
sequences:            11
total length:  684741128 bp  (684675328 bp excl N/X-runs)
GC level:         34.83 %
bases masked:   46637065 bp ( 6.81 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        43139      8952068 bp    1.31 %
   SINEs:            43139      8952068 bp    1.31 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       3538      1564942 bp    0.23 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        65151     23982146 bp    3.50 %

Total interspersed repeats:    34499156 bp    5.04 %


Small RNA:           43353      8992879 bp    1.31 %

Satellites:              1          222 bp    0.00 %
Simple repeats:     232627     10544162 bp    1.54 %
Low complexity:      29762      1561018 bp    0.23 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

Read Mapping – Olympia oyster 2bRAD Data with Bowtie2 (on Mox)

Per Steven’s request, mapped our Olympia oyster 2bRAD data.

Mapped to:

This was run on our Mox computing node.

Slurm script: 20180515_oly_2bRAD_bowtie2_mapping.sh

The script is far too long to paste here, due to the sheer number of input files. However, here’s a snippet to show the command and options that were used:


/gscratch/srlab/programs/bowtie2-2.3.4.1-linux-x86_64/bowtie2 \
--threads 24 \
--no-unal \
--score-min L,16,1 \
--local \
-L 16 \
-S /gscratch/srlab/sam/outputs/20180515_oly_2bRAD_bowtie2_mapping/20180515_oly_2bRAD_bowtie2_mapping.sam \
-x /gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/20180515_oly_bowtie2_pbjelly_sjw_01_genome_index/pbjelly_sjw_01_ref \
-U 

See the linked Slurm script above for the entire thing.
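Since bowtie2 accepts a comma-separated list of files for `-U`, a long input list like this can also be generated rather than typed out. A minimal sketch, assuming the FASTQ files sit in one directory; the directory and file names below are placeholders, not the real 2bRAD files:

```shell
# Stand-in directory holding empty placeholder files (hypothetical names).
fastq_dir=$(mktemp -d)
trap 'rm -rf "$fastq_dir"' EXIT
touch "$fastq_dir/oly_01.fastq.gz" "$fastq_dir/oly_02.fastq.gz" "$fastq_dir/oly_03.fastq.gz"

# Join every FASTQ path with commas, the form bowtie2 accepts for -U.
reads_list=$(find "$fastq_dir" -type f -name '*.fastq.gz' | sort | paste -sd, -)
file_count=$(find "$fastq_dir" -type f -name '*.fastq.gz' | wc -l | tr -d ' ')

echo "bowtie2 would receive $file_count files via: -U $reads_list"
```

Sorting the paths keeps the order reproducible between runs.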


RESULTS:

Output folder:

SAM file (104GB)

Mapping summary:


20180515_oly_2bRAD_bowtie2_mapping$ cat slurm-180337.out 
729797535 reads; of these:
  729797535 (100.00%) were unpaired; of these:
    273989476 (37.54%) aligned 0 times
    310581308 (42.56%) aligned exactly 1 time
    145226751 (19.90%) aligned >1 times
62.46% overall alignment rate
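
As a sanity check, the overall alignment rate is just the reads that aligned once plus the reads that aligned more than once, divided by the total. A small sketch that recomputes it from the summary text above (the parsing patterns are my own, not part of bowtie2):

```shell
# bowtie2 summary copied from the Slurm output above.
summary='729797535 reads; of these:
  729797535 (100.00%) were unpaired; of these:
    273989476 (37.54%) aligned 0 times
    310581308 (42.56%) aligned exactly 1 time
    145226751 (19.90%) aligned >1 times'

# Overall rate = (aligned exactly once + aligned more than once) / total reads.
rate=$(printf '%s\n' "$summary" | awk '
  /reads; of these:/       { total = $1 }
  /aligned exactly 1 time/ { once  = $1 }
  /aligned >1 times/       { multi = $1 }
  END { printf "%.2f", 100 * (once + multi) / total }')

echo "overall alignment rate: ${rate}%"
```

This agrees with the 62.46% that bowtie2 reported.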

Transposable Element Mapping – Olympia Oyster Genome Assembly using RepeatMasker 4.0.7

Steven wanted transposable elements (TEs) in the Olympia oyster genome identified.

After some minor struggles, I was able to get RepeatMasker installed on both of our Apple Xserves (emu & roadrunner; running Ubuntu 16.04 LTS).

Genome used: pbjelly_sjw_01

I ran RepeatMasker (v4.0.7) with RepBase-20170127 and RMBlast 2.6.0 four times:

  1. Default settings (i.e. no species selected, so RepeatMasker defaults to human).

  2. Species = Crassostrea gigas (Pacific oyster)

  3. Species = Crassostrea virginica (Eastern oyster)

  4. Species = Ostrea lurida (Olympia oyster)

The idea was to get a sense of how the analyses would differ with species specifications. However, it’s likely that the only species setting that will make any difference will be Run #2 (Crassostrea gigas).

The reason I say this is that RepeatMasker has a built-in tool to query which species are available in the RepBase database (e.g.):

RepeatMasker-4.0.7/util/queryRepeatDatabase.pl -species "crassostrea virginica" -stat

Here’s a very brief overview of what that yields:

  • Crassostrea gigas: 792 specific repeats

  • Crassostrea virginica: 4 Crassostrea virginica specific repeats

  • Ostrea lurida: 0 Ostrea lurida specific repeats

All runs were performed on roadrunner.

All commands were documented in a Jupyter Notebook (GitHub):

NOTE: RepeatMasker writes the desired output files (*.out, *.cat.gz, and *.gff) to the same directory that the genome is located in! If you conduct multiple runs with the same genome in the same directory, it will overwrite those files, as they are named using the genome assembly filename.
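
One way to avoid the overwriting is RepeatMasker's `-dir` option, which sends the output to a directory of your choosing. Below is a dry-run sketch that sets up one directory per run and prints the matching command. The output root is a throwaway temp directory, the directory names are made up, and including `-gff` is an assumption based on the GFF outputs listed in the results:

```shell
# Dry-run sketch: one output directory per RepeatMasker run so results are not overwritten.
genome="jelly.out.fasta"     # genome file name as shown in the summary tables
out_root=$(mktemp -d)        # hypothetical output root; substitute a real path
trap 'rm -rf "$out_root"' EXIT
planned=""

for species in "default" "crassostrea gigas" "crassostrea virginica" "ostrea lurida"; do
  # Replace spaces so the species name can double as a directory name.
  run_dir="$out_root/$(printf '%s' "$species" | tr ' ' '_')"
  mkdir -p "$run_dir"
  if [ "$species" = "default" ]; then
    cmd="RepeatMasker -gff -dir $run_dir $genome"
  else
    cmd="RepeatMasker -species \"$species\" -gff -dir $run_dir $genome"
  fi
  planned="$planned$cmd
"
done

# Print the planned commands instead of running them.
printf '%s' "$planned"
run_count=$(find "$out_root" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')
echo "prepared $run_count run directories"
```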


RESULTS:
RUN 1 (default settings – human genome)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:   20002806 bp ( 1.71 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:            17794      1061170 bp    0.09 %
      ALUs          363        31340 bp    0.00 %
      MIRs         1166        92129 bp    0.01 %

LINEs:             4456       888114 bp    0.08 %
      LINE1         976       103929 bp    0.01 %
      LINE2         813        82891 bp    0.01 %
      L3/CR1        699        63627 bp    0.01 %

LTR elements:      1187       199118 bp    0.02 %
      ERVL          155        15828 bp    0.00 %
      ERVL-MaLRs    200        20737 bp    0.00 %
      ERV_classI    379        42833 bp    0.00 %
      ERV_classII    66         6896 bp    0.00 %

DNA elements:      2290       196866 bp    0.02 %
     hAT-Charlie    190        15468 bp    0.00 %
     TcMar-Tigger   732        37473 bp    0.00 %

Unclassified:       101        12946 bp    0.00 %

Total interspersed repeats:  2358214 bp    0.20 %


Small RNA:         5954       433422 bp    0.04 %

Satellites:         366        55705 bp    0.00 %
Simple repeats:  310641     14322152 bp    1.22 %
Low complexity:   47381      2844279 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be homo sapiens  
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 2 (species – Crassostrea gigas)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:  160759267 bp ( 13.71 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements       213132     69887654 bp    5.96 %
   SINEs:             2374       311974 bp    0.03 %
   Penelope         171792     57862186 bp    4.94 %
   LINEs:           195605     63430615 bp    5.41 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex        731       357995 bp    0.03 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL         13        11377 bp    0.00 %
     RTE/Bov-B        8085      1948581 bp    0.17 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:     15153      6145065 bp    0.52 %
     BEL/Pao          2119       955773 bp    0.08 %
     Ty1/Copia         101        75372 bp    0.01 %
     Gypsy/DIRS1     11776      4815361 bp    0.41 %
       Retroviral        0            0 bp    0.00 %

DNA transposons     256292     35689117 bp    3.04 %
   hobo-Activator    19847      2059651 bp    0.18 %
   Tc1-IS630-Pogo    43269      6806311 bp    0.58 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac           7935      1060296 bp    0.09 %
   Tourist/Harbinger  9503       887332 bp    0.08 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:       174943     38299211 bp    3.27 %

Total interspersed repeats:   143875982 bp   12.27 %


Small RNA:             280        78768 bp    0.01 %

Satellites:           7383      1362194 bp    0.12 %
Simple repeats:     278809     12982714 bp    1.11 %
Low complexity:      44078      2622506 bp    0.22 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea gigas
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 3 (species – Crassostrea virginica)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:   39598953 bp ( 3.38 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        63882     10327611 bp    0.88 %
   SINEs:            63882     10327611 bp    0.88 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       9433      2307292 bp    0.20 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        51558      9836468 bp    0.84 %

Total interspersed repeats:    22471371 bp    1.92 %


Small RNA:           64164     10406776 bp    0.89 %

Satellites:             10         5985 bp    0.00 %
Simple repeats:     298612     14185090 bp    1.21 %
Low complexity:      47510      2866522 bp    0.24 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 4 (species – Ostrea lurida)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: jelly.out.fasta          
sequences:        696946
total length: 1253001795 bp  (1172226648 bp excl N/X-runs)
GC level:         36.51 %
bases masked:   17617763 bp ( 1.50 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements            0            0 bp    0.00 %
   SINEs:                0            0 bp    0.00 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons          0            0 bp    0.00 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:            3          189 bp    0.00 %

Total interspersed repeats:         189 bp    0.00 %


Small RNA:             282        79165 bp    0.01 %

Satellites:             10         5985 bp    0.00 %
Simple repeats:     313082     14662647 bp    1.25 %
Low complexity:      47785      2878201 bp    0.25 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be ostrea lurida 
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+
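
To compare the four runs side by side, the "bases masked" percentages can be recomputed from the tables. Per the table footnote, the denominator is the length excluding N/X-runs (1172226648 bp), not the total length. A small sketch using the numbers above (the run labels are just shorthand):

```shell
# Length excluding N/X-runs, shared by all four summary tables.
non_n_length=1172226648
report=""

# Each entry is label:bases_masked, copied from the four summary tables.
for entry in "run1:20002806" "run2:160759267" "run3:39598953" "run4:17617763"; do
  run=${entry%%:*}
  masked=${entry##*:}
  pct=$(awk -v m="$masked" -v t="$non_n_length" 'BEGIN { printf "%.2f", 100 * m / t }')
  report="$report$run=$pct
"
done

printf '%s' "$report"
```

The output matches the reported values: 1.71, 13.71, 3.38, and 1.50 percent for runs 1 through 4.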

Software Installation – RepeatMasker v4.0.7 on Emu/Roadrunner Continued

After yesterday’s difficulties getting RMblast to compile, I deleted the folder and went through the build process again.

This time it worked, but it did not put rmblastn in the specified location (/home/shared/rmblast).

This fact took me a fair amount of time to figure out. Finally, after a couple of different re-builds, I ran find to see if rmblastn existed somewhere I wasn’t looking:
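
A sketch of that kind of search, run against a throwaway directory tree so it works anywhere. The layout below is a stand-in I made up for illustration; on the real machine the search root would be /home/shared:

```shell
# Stand-in for /home/shared with a fake build tree (layout is an assumption).
search_root=$(mktemp -d)
trap 'rm -rf "$search_root"' EXIT
mkdir -p "$search_root/ncbi-blast-2.6.0+-src/c++/ReleaseMT/bin"
touch "$search_root/ncbi-blast-2.6.0+-src/c++/ReleaseMT/bin/rmblastn"

# Search the whole tree for a regular file named exactly rmblastn.
found=$(find "$search_root" -type f -name 'rmblastn')
echo "$found"
```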

Additionally, I couldn’t find the location of the various BLAST executables. Some internet sleuthing led me to the NCBI page on installing BLAST+ from source, which indicates that the executables are stored in:

ncbi-blast-VERSION+-src/c++/ReleaseMT/bin/

How intuitive! /s

In order to improve readability and usability of the /home/shared/ directory, I renamed the /home/shared/rmblast directory to reflect the BLAST version and created a symbolic link in that directory to the rmblastn executable:

Symbolic link to RMBLAST
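
A minimal sketch of the symlink step, done in a temp directory so it can be tried safely. The directory name `rmblast-2.6.0` is hypothetical (the renamed directory isn't spelled out above), and the rmblastn file here is an empty stand-in:

```shell
# Stand-in for /home/shared; all names below are placeholders.
shared=$(mktemp -d)
trap 'rm -rf "$shared"' EXIT
build_bin="$shared/ncbi-blast-2.6.0+-src/c++/ReleaseMT/bin"
link_dir="$shared/rmblast-2.6.0"          # hypothetical versioned directory name
mkdir -p "$build_bin" "$link_dir"
touch "$build_bin/rmblastn"               # empty stand-in for the real executable

# Create the symbolic link and show where it points.
ln -s "$build_bin/rmblastn" "$link_dir/rmblastn"
link_target=$(readlink "$link_dir/rmblastn")
echo "$link_dir/rmblastn -> $link_target"
```

Using an absolute path as the link target means the link keeps working regardless of the directory it's called from.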

Initiate RepeatMasker configuration


Confirm perl install location:


Confirm RepeatMasker install location:


Specify TRF install location:


Hmmm, TRF error. Looking for file called trf:


Renamed TRF file to trf and now it’s automatically found:


Set RMBlast as search engine:


Set RMBlast install location:


Set RMBlast as default search engine:


Confirmation of RMBlast as default search engine and successful installation of RepeatMasker:


Software Installation – RepeatMasker v4.0.7 on Emu/Roadrunner

Steven asked that I re-run the Olympia oyster transposable element analysis using RepeatMasker and a newer version of our Olympia oyster genome assembly.

Installed the software on both of the Apple Xserves (Emu and Roadrunner) running Ubuntu 16.04.

Followed the instructions outlined here:

Starting with the prerequisites:

1. Download and install RMBlast

  • NCBI Blast 2.6.0 source

  • isb 2.6.0 patch

Unfortunately, the make command continually failed:

cd /home/shared/ncbi-blast-2.6.0+-src/c++
make

While trying to troubleshoot this issue, continued with the other prerequisites:

2. Downloaded Tandem Repeats Finder v4.09

  • Saved file (trf409.linux64) to /home/shared/bin. NOTE: /home/shared/bin is part of the system PATH. See the /etc/environment file.

  • Changed permissions to be executable:

sudo chmod 775 trf409.linux64

3. Downloaded RepBase RepeatMasker Edition 20170127 (NOTE: This requires registration in order to obtain a username/password to download the file).

Installed RepeatMasker:

4. Downloaded RepeatMasker 4.0.7

  • Saved to /home/shared/RepeatMasker-4.0.7

5. Installed RepBase RepeatMasker Edition 20170127 in /home/shared/RepeatMasker-4.0.7/Libraries

Currently re-building RMBlast and it takes forever… Will report back when I have it running.

TrimGalore/FastQC/MultiQC – TrimGalore! RRBS Geoduck BS-seq FASTQ data (directional)

Earlier this week, I ran TrimGalore!, but due to a copy/paste mistake I incorrectly set the trimming to --non_directional, so I re-ran with the correct settings.

Steven requested that I trim the Geoduck RRBS libraries that we have, in preparation to run them through Bismark.

These libraries were originally created by Hollie Putnam using the TruSeq DNA Methylation Kit (Illumina):

All analysis is documented in a Jupyter Notebook; see link below.

Overview of process:

  1. Run TrimGalore! with --paired and --rrbs settings.

  2. Run FastQC and MultiQC on trimmed files.

  3. Copy all data to owl (see Results below for link).

  4. Confirm data integrity via MD5 checksums.
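
For the first step, here's a dry-run sketch of pairing R1/R2 files and building a TrimGalore! command for each pair. It only prints the commands. The `_R1`/`_R2` naming pattern and the sample names are assumptions for illustration; the real commands are in the notebook:

```shell
# Stand-in directory with empty placeholder FASTQ files (hypothetical names).
fastq_dir=$(mktemp -d)
trap 'rm -rf "$fastq_dir"' EXIT
out_dir="$fastq_dir/trimmed"
for sample in EPI-A EPI-B; do
  touch "$fastq_dir/${sample}_R1.fastq.gz" "$fastq_dir/${sample}_R2.fastq.gz"
done

pair_count=0
for r1 in "$fastq_dir"/*_R1.fastq.gz; do
  # Derive the mate's file name from the R1 file name.
  r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
  if [ ! -f "$r2" ]; then
    echo "Skipping $r1: no matching R2 file" >&2
    continue
  fi
  # Print the command rather than running it.
  echo "trim_galore --paired --rrbs --output_dir $out_dir $r1 $r2"
  pair_count=$((pair_count + 1))
done
echo "planned $pair_count paired runs"
```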

Jupyter Notebook:


Results:
TrimGalore! output folder:
FastQC output folder:
MultiQC output folder:
MultiQC report (HTML):

FastQC – RRBS Geoduck BS-seq FASTQ data

Earlier today I finished trimming Hollie’s RRBS BS-seq FastQ data.

However, the original files were never analyzed with FastQC, so I ran it on the original files.

These libraries were originally created by Hollie Putnam using the TruSeq DNA Methylation Kit (Illumina):

FastQC was run, followed by MultiQC. Analysis was run on Roadrunner.

All analysis is documented in a Jupyter Notebook; see link below.

Jupyter Notebook:

Results:
FastQC output folder:
MultiQC output folder:
MultiQC report (HTML):

TrimGalore/FastQC/MultiQC – TrimGalore! RRBS Geoduck BS-seq FASTQ data


20180516 – UPDATE!!

THIS WAS RUN WITH THE INCORRECT SETTING IN TRIMGALORE! --non_directional

WILL RE-RUN


Steven requested that I trim the Geoduck RRBS libraries that we have, in preparation to run them through Bismark.

These libraries were originally created by Hollie Putnam using the TruSeq DNA Methylation Kit (Illumina):

All analysis is documented in a Jupyter Notebook; see link below.

Overview of process:

  1. Copy EPI* FastQ files from owl/P_generosa to roadrunner.

  2. Confirm data integrity via MD5 checksums.

  3. Run TrimGalore! with --paired, --rrbs, and --non_directional settings.

  4. Run FastQC and MultiQC on trimmed files.

  5. Copy all data to owl (see Results below for link).

  6. Confirm data integrity via MD5 checksums.
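
The MD5 checks in steps 2 and 6 boil down to generating checksums at the source and verifying them at the destination. A self-contained sketch using temp directories and a dummy file, assuming GNU `md5sum` is available (on macOS the equivalent tool is `md5`):

```shell
# Stand-ins for the source and destination directories, plus a dummy file.
src_dir=$(mktemp -d)
dest_dir=$(mktemp -d)
trap 'rm -rf "$src_dir" "$dest_dir"' EXIT
printf 'placeholder reads\n' > "$src_dir/EPI-example_R1.fastq"

# Record checksums at the source, then copy the data and the checksum file.
(cd "$src_dir" && md5sum *.fastq > checksums.md5)
cp "$src_dir"/* "$dest_dir"/

# Verify the copies against the recorded checksums.
if (cd "$dest_dir" && md5sum -c checksums.md5 > /dev/null); then
  transfer_status="ok"
else
  transfer_status="mismatch"
fi
echo "transfer check: $transfer_status"
```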

Jupyter Notebook:


Results:
TrimGalore! output folder:
FastQC output folder:
MultiQC output folder:
MultiQC report (HTML):