BSMAP-methatio-methykit workflow

In [1]:
#Setting Variables
#file ID
fid="CgF_nov"
#where is bsmap
bsmap="/Users/Shared/Apps/bsmap-2.73/"
#fastq files location R1 location
R1="/Volumes/web/trilobite/Crassostrea_gigas_HTSdata/batterbox/FCC39EM/Sample_BS_CgF/filtered_BS_CgF_TTAGGC_L004_R1.fastq.gz"
#genome file 
genome="/Volumes/web/whale/ensembl/ftp.ensemblgenomes.org/pub/release-21/metazoa/fasta/crassostrea_gigas/dna/Crassostrea_gigas.GCA_000297895.1.21.dna_sm.genome.fa"
In [2]:
cd /Volumes/web/Mollusk/bs_larvae_exp/
/Volumes/web/Mollusk/bs_larvae_exp

In [6]:
mkdir {fid}
In [7]:
cd {fid}
/Volumes/web/Mollusk/bs_larvae_exp/CgF_nov

In [8]:
!{bsmap}bsmap -a {R1} -d {genome} -o bsmap_out.sam -p 1

BSMAP v2.73
Start at:  Wed Feb 12 16:25:40 2014

Input reference file: /Volumes/web/whale/ensembl/ftp.ensemblgenomes.org/pub/release-21/metazoa/fasta/crassostrea_gigas/dna/Crassostrea_gigas.GCA_000297895.1.21.dna_sm.genome.fa 	(format: FASTA)
Load in 7658 db seqs, total size 557717710 bp. 9 secs passed
total_kmers: 43046721
Create seed table. 25 secs passed
max number of mismatches: read_length * 8% 	max gap size: 0
kmer cut-off ratio:5e-07
max multi-hits: 100	max Ns: 5	seed size: 16	index interval: 4
quality cutoff: 0	base quality char: '!'
min fragment size:28	max fragemt size:500
start from read #1	end at read #4294967295
additional alignment: T in reads => C in reference
mapping strand: ++,-+
Single-end alignment(1 threads)
Input read file: /Volumes/web/trilobite/Crassostrea_gigas_HTSdata/batterbox/FCC39EM/Sample_BS_CgF/filtered_BS_CgF_TTAGGC_L004_R1.fastq.gz 	(format: gzipped FASTQ)
Output file: bsmap_out.sam	 (format: SAM)
Thread #0: 	50000 reads finished. 29 secs passed
Thread #0: 	100000 reads finished. 34 secs passed
Thread #0: 	150000 reads finished. 39 secs passed
Thread #0: 	200000 reads finished. 43 secs passed
Thread #0: 	250000 reads finished. 48 secs passed
Thread #0: 	300000 reads finished. 53 secs passed
Thread #0: 	345868 reads finished. 57 secs passed
Total number of aligned reads: 218893 (63%)
Done.
Finished at Wed Feb 12 16:26:38 2014
Total time consumed:  58 secs

In [9]:
!python {bsmap}methratio.py -d {genome} -u -z -g -o methratio_out.txt -s {bsmap}samtools bsmap_out.sam
@ Wed Feb 12 16:26:38 2014: reading reference /Volumes/web/whale/ensembl/ftp.ensemblgenomes.org/pub/release-21/metazoa/fasta/crassostrea_gigas/dna/Crassostrea_gigas.GCA_000297895.1.21.dna_sm.genome.fa ...
@ Wed Feb 12 16:27:11 2014: reading bsmap_out.sam ...
[samopen] SAM header is present: 7658 sequences.
@ Wed Feb 12 16:27:19 2014: combining CpG methylation from both strands ...
@ Wed Feb 12 16:27:48 2014: writing methratio_out.txt ...
@ Wed Feb 12 16:29:45 2014: done.
total 168921 valid mappings, 1626861 covered cytosines, average coverage: 1.11 fold.

In []:
#Now I'm going to filter the methratio output file so that I'm only obtaining the context '__CG_' and loci with at least 5x coverage
In [13]:
#Below is the command for only obtaining the context '__CG_'
In [39]:
!grep "[A-Z][A-Z]CG[A-Z]" </Volumes/web/Mollusk/bs_larvae_exp/CgF_nov/methratio_out.txt> /Volumes/web/Mollusk/bs_larvae_exp/CgF_nov/methratio_out_CG.txt
In [40]:
#Below is the command for obtaining a filtered file with at least 5x coverage
In [15]:
!awk '$8 >= 5' </Volumes/web/Mollusk/bs_larvae_exp/CgF_nov/methratio_out_CG.txt> /Volumes/web/Mollusk/bs_larvae_exp/CgF_nov/methratio_out_Fnov_filtered.txt
In [16]:
#Now I need to format my files to be read into the methylKit package in R
In [10]:
cd /Volumes/web/Mollusk/bs_larvae_exp/CgF_nov
/Volumes/web/Mollusk/bs_larvae_exp/CgF_nov

In [11]:
!chmod +x methratio.awk.sh
In [12]:
!/Volumes/web/Mollusk/bs_larvae_exp/CgF_nov/methratio.awk.sh /Volumes/web/Mollusk/bs_larvae_exp/CgF_nov/methratio_out_Fnov_filtered.txt > /Volumes/web/Mollusk/bs_larvae_exp/CgF_nov/F_nov_methylkit.txt
In [5]:
In []: