BSMAP-methatio-methykit workflow

Assumptions- as with most worflows in development - a working directory is used.

In [16]:
#SR edits - adding direct link to ipynb viewer and raw files
!date 
Thu Feb 27 12:43:52 PST 2014

In [17]:
filename='BSMAP2MK_workflow_F'

ipynb viewer format:

In [18]:
!echo 'http://nbviewer.ipython.org/github/sr320/ipython_nb/blob/master/''{filename}''.ipynb'
http://nbviewer.ipython.org/github/sr320/ipython_nb/blob/master/BSMAP2MK_workflow_F.ipynb

ipynb file:

In [19]:
!echo 'http://nbviewer.ipython.org/github/sr320/ipython_nb/blob/master/''{filename}''.ipynb'
http://nbviewer.ipython.org/github/sr320/ipython_nb/blob/master/BSMAP2MK_workflow_F.ipynb

Setting Variables

In [29]:
#file ID
fid="CgF"

#TIMESTAMP
date=!date +%m%d_%H%M
#working directory (parent)
wd="/Volumes/web-1/Mollusk/bs_larvae_exp"

#where is bsmap
bsmap="/Users/Shared/Apps/bsmap-2.73/"

#fastq files location R1 location
R1="/Volumes/web-1/Mollusk/bs_larvae_exp/Concatenated_Files_R1/F_R1.fastq"

#fastq files location R2 location
#comment out if SE
R2="/Volumes/web-1/Mollusk/bs_larvae_exp/Concatenated_Files_R2/F_R2.fastq"


#genome file 
genome="/Volumes/web-1/whale/ensembl/ftp.ensemblgenomes.org/pub/release-21/metazoa/fasta/crassostrea_gigas/dna/Crassostrea_gigas.GCA_000297895.1.21.dna_sm.genome.fa"
In [30]:
cd {wd}
/Volumes/web-1/Mollusk/bs_larvae_exp

In [31]:
mkdir {fid}_{date}
In [32]:
cd {fid}_{date}
/Volumes/web-1/Mollusk/bs_larvae_exp/CgF_[0227_1246]

In [36]:
#option - number of processes 
!{bsmap}bsmap -a {R1} -b {R2} -d {genome} -o bsmap_out.sam -p 1

BSMAP v2.73
Start at:  Thu Feb 27 12:54:05 2014

Input reference file: /Volumes/web-1/whale/ensembl/ftp.ensemblgenomes.org/pub/release-21/metazoa/fasta/crassostrea_gigas/dna/Crassostrea_gigas.GCA_000297895.1.21.dna_sm.genome.fa 	(format: FASTA)
Load in 7658 db seqs, total size 557717710 bp. 9 secs passed
total_kmers: 43046721
Create seed table. 24 secs passed
max number of mismatches: read_length * 8% 	max gap size: 0
kmer cut-off ratio:5e-07
max multi-hits: 100	max Ns: 5	seed size: 16	index interval: 4
quality cutoff: 0	base quality char: '!'
min fragment size:28	max fragemt size:500
start from read #1	end at read #4294967295
additional alignment: T in reads => C in reference
mapping strand (read_1): ++,-+
mapping strand (read_2): +-,--
Pair-end alignment(1 threads)
Input read file #1: /Volumes/web-1/Mollusk/bs_larvae_exp/Concatenated_Files_R1/F_R1.fastq 	(format: FASTQ)
Input read file #2: /Volumes/web-1/Mollusk/bs_larvae_exp/Concatenated_Files_R2/F_R2.fastq 	(format: FASTQ)
Output file: bsmap_out.sam	 (format: SAM)
Thread #0: 	50000 read pairs finished. 36 secs passed
Thread #0: 	100000 read pairs finished. 47 secs passed
Thread #0: 	150000 read pairs finished. 59 secs passed
Thread #0: 	200000 read pairs finished. 70 secs passed
Thread #0: 	250000 read pairs finished. 81 secs passed
Thread #0: 	300000 read pairs finished. 92 secs passed
Thread #0: 	350000 read pairs finished. 104 secs passed
Thread #0: 	400000 read pairs finished. 115 secs passed
Thread #0: 	450000 read pairs finished. 126 secs passed
Thread #0: 	500000 read pairs finished. 138 secs passed
Thread #0: 	550000 read pairs finished. 149 secs passed
Thread #0: 	600000 read pairs finished. 160 secs passed
Total number of aligned reads: 
pairs:       101 (0.016%)
single a:    408265 (63%)
single b:    395665 (61%)
Done.
Finished at Thu Feb 27 12:56:46 2014
Total time consumed:  161 secs

In [37]:
!python {bsmap}methratio.py -d {genome} -u -z -g -o methratio_out.txt -s {bsmap}samtools bsmap_out.sam
@ Thu Feb 27 12:58:57 2014: reading reference /Volumes/web-1/whale/ensembl/ftp.ensemblgenomes.org/pub/release-21/metazoa/fasta/crassostrea_gigas/dna/Crassostrea_gigas.GCA_000297895.1.21.dna_sm.genome.fa ...
@ Thu Feb 27 12:59:31 2014: reading bsmap_out.sam ...
[samopen] SAM header is present: 7658 sequences.
@ Thu Feb 27 12:59:58 2014: combining CpG methylation from both strands ...
@ Thu Feb 27 13:00:30 2014: writing methratio_out.txt ...
@ Thu Feb 27 13:05:07 2014: done.
total 622742 valid mappings, 4769720 covered cytosines, average coverage: 1.37 fold.

Filter the methratio output file so that I'm only obtaining the context '_CG' and loci with at least 5x coverage

In [38]:
#command for only obtaining the context '__CG_'
!grep "[A-Z][A-Z]CG[A-Z]" <methratio_out.txt> methratio_out_CG.txt
In [39]:
#obtaining a filtered file with at least 5x coverage
!awk '$8 >= 5' <methratio_out_CG.txt> methratio_out_CG5x.txt
In [13]:
#Now I need to format my files to be read into the methylKit package in R
In [14]:
#!/Volumes/web/Mollusk/bs_larvae_exp/methratio.awk.sh methratio_out_CG5x.txt > methratio_out_CG_mkit.txt
In [15]:
#!head methratio_out_CG_mkit.txt
In [15]: