{ "metadata": { "name": "BiGo_RNAseq" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": "#RNA-Seq data corresponding to BS-Seq Oyster Sperm sample \n\nUpdated July 23, 2013 14:30PDT - samtools \nUpdated July 23, 2013 14:00PDT" }, { "cell_type": "markdown", "metadata": {}, "source": "---\nData\n\nRNA-seq data from both the sperm and pooled gill samples were provided by the core facility. \nOriginally BAM files were provided, later followed up with fastq files.\n\n\"72975_GACTAAGA%20et%20al%20and%20(274)%20Discovery%20Environment\"\n\n\n\n\nSperm data = 72976/GTGTCTAC\n\n\n\n" }, { "cell_type": "markdown", "metadata": {}, "source": "---\n##CLC \nfastq import (s_1.GTGTCTAC_1 (paired)-1) \n51,084,360 sequences\n\n`\n Discard read names = Yes\n Discard quality scores = No\n Paired orientation = Paired reads (forward-reverse)\n Minimum distance = 180\n Maximum distance = 250\n Original sequence resource = /Volumes/NGS Drive/NGS Raw Data/RNA_complement_BS/C25GNACXX.oyster.fastqs/s_1.GTGTCTAC_1.fastq\n Original sequence resource = /Volumes/NGS Drive/NGS Raw Data/RNA_complement_BS/C25GNACXX.oyster.fastqs/s_1.GTGTCTAC_2.fastq\n Quality score = NCBI/Sanger or Illumina Pipeline 1.8 and later\n MiSeq de-multiplexing = No\n`\n\n####QC Report\n\n\n####Supp QC Report\n\n" }, { "cell_type": "markdown", "metadata": {}, "source": "---\nTrimming \n\n\"CLC%20Genomics%20Workbench%206.0.2\"\n\n\n\n\n\n\n\n\n---\n\n###RNA-seq \n\nInput file\n\n\"CLC%20Genomics%20Workbench%206.0.2\" \n\n\n\n*Need to modify GFF so that CLC recognizes to annotate*\n\n" }, { "cell_type": "code", "collapsed": false, "input": "!head /Volumes/web/cnidarian/oyster.v9.gene_mRNA.gff", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": 
"C16582\tGLEAN\tgene\t35\t385\t0.555898\t-\t.\tID=CGI_10000001;\r\nC16582\tGLEAN\tmrna\t35\t385\t.\t-\t0\tParent=CGI_10000001;\r\nC17212\tGLEAN\tgene\t31\t363\t0.999572\t+\t.\tID=CGI_10000002;\r\nC17212\tGLEAN\tmrna\t31\t363\t.\t+\t0\tParent=CGI_10000002;\r\nC17316\tGLEAN\tgene\t30\t257\t0.555898\t+\t.\tID=CGI_10000003;\r\nC17316\tGLEAN\tmrna\t30\t257\t.\t+\t0\tParent=CGI_10000003;\r\nC17476\tGLEAN\tgene\t34\t257\t0.998947\t-\t.\tID=CGI_10000004;\r\nC17476\tGLEAN\tmrna\t104\t257\t.\t-\t0\tParent=CGI_10000004;\r\nC17476\tGLEAN\tmrna\t34\t74\t.\t-\t2\tParent=CGI_10000004;\r\nC17998\tGLEAN\tgene\t196\t387\t1\t-\t.\tID=CGI_10000005;\r\n" } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": "!wc /Volumes/web/cnidarian/oyster.v9.gene_mRNA.gff", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": " 224718 2022462 14376214 /Volumes/web/cnidarian/oyster.v9.gene_mRNA.gff\r\n" } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": "Annotated version 9 of oyster genome then ran RNA-seq with Exon Discovery and Expression value as transcripts. Only using paired data.\n\n\n\"RNA-Seq%20Analysis\"\n\n\n##_Concerned about the number of mapped reads_ \n\n\n\"CLC_Genomics_Workbench_6.0.2_179F262E.png\"/\n\nFull Report \n\n\n" }, { "cell_type": "markdown", "metadata": {}, "source": "---\n\n##RNA-Seq on genes (for Comparison)\n\n\n" }, { "cell_type": "markdown", "metadata": {}, "source": "--- \n\n\n##Tophat Analysis \n\n\n--\n\n\"(274)%20Discovery%20Environment\"\n\n\nTopHat Output\n\nThe tophat script produces a number of files in the directory in which it was invoked. Most of these files are internal, intermediate files that are generated for use within the pipeline. The output files you will likely want to look at are:\n\naccepted_hits.bam. A list of read alignments in SAM format. SAM is a compact short read alignment format that is increasingly being adopted. 
The formal specification is here. \n\njunctions.bed. A UCSC BED track of junctions reported by TopHat. Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction. \n\ninsertions.bed and deletions.bed. UCSC BED tracks of insertions and deletions reported by TopHat. \n\nInsertions.bed - chromLeft refers to the last genomic base before the insertion. \n\nDeletions.bed - chromLeft refers to the first genomic base of the deletion. \n\n---\n\nData @ \n\n\n" }, { "cell_type": "code", "collapsed": false, "input": "ls /Volumes/web/cnidarian/tophat_071313", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "\u001b[31maccepted_hits.bam\u001b[m\u001b[m* \u001b[31mprep_reads.info\u001b[m\u001b[m*\r\n\u001b[31mdeletions.bed\u001b[m\u001b[m* \u001b[31ms_1.bam\u001b[m\u001b[m*\r\n\u001b[31minsertions.bed\u001b[m\u001b[m* \u001b[31ms_1.bam.bai\u001b[m\u001b[m*\r\n\u001b[31mjunctions.bed\u001b[m\u001b[m* \u001b[31munmapped.bam\u001b[m\u001b[m*\r\n\u001b[31mlookup_roberts_grc_oyster.xls\u001b[m\u001b[m*\r\n" } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": "\"IGV%20-%20Session:%20http://eagle.fish.washington.edu/cnidarian/oyster_v9_igv_session.xml%20and%20Home%20%7C%20TWiT.TV\"" }, { "cell_type": "markdown", "metadata": {}, "source": "URL to load IGV Session corresponding to screenshot above. 
\n\n`http://eagle.fish.washington.edu/cnidarian/BiGo_RNAseq_072213_igv_session.xml`" }, { "cell_type": "markdown", "metadata": {}, "source": "---\n## Using Samtools to interrogate TopHat BAM file\n\nreference: " }, { "cell_type": "code", "collapsed": false, "input": "cd /Volumes/Bay3/Software/BSMAP/bsmap-2.74/samtools", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "/Volumes/Bay3/Software/BSMAP/bsmap-2.74/samtools" }, { "output_type": "stream", "stream": "stdout", "text": "\n" } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": "ls", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "AUTHORS bam_index.o bedidx.o kstring.c\r\nCOPYING bam_lpileup.c bgzf.c kstring.h\r\nChangeLog bam_lpileup.o bgzf.h kstring.o\r\nINSTALL bam_mate.c bgzf.o libbam.a\r\nMakefile bam_mate.o bgzip.c \u001b[34mmisc\u001b[m\u001b[m/\r\nMakefile.mingw bam_md.c cut_target.c phase.c\r\nNEWS bam_md.o cut_target.o phase.o\r\nbam.c bam_pileup.c errmod.c razf.c\r\nbam.h bam_pileup.o errmod.h razf.h\r\nbam.o bam_plcmd.c errmod.o razf.o\r\nbam2bcf.c bam_plcmd.o \u001b[34mexamples\u001b[m\u001b[m/ razip.c\r\nbam2bcf.h bam_reheader.c faidx.c sam.c\r\nbam2bcf.o bam_reheader.o faidx.h sam.h\r\nbam2bcf_indel.c bam_rmdup.c faidx.o sam.o\r\nbam2bcf_indel.o bam_rmdup.o kaln.c sam_header.c\r\nbam2depth.c bam_rmdupse.c kaln.h sam_header.h\r\nbam2depth.o bam_rmdupse.o kaln.o sam_header.o\r\nbam_aux.c bam_sort.c khash.h sam_view.c\r\nbam_aux.o bam_sort.o klist.h sam_view.o\r\nbam_cat.c bam_stat.c knetfile.c sample.c\r\nbam_cat.o bam_stat.o knetfile.h sample.h\r\nbam_color.c bam_tview.c knetfile.o sample.o\r\nbam_color.o bam_tview.o kprobaln.c \u001b[31msamtools\u001b[m\u001b[m*\r\nbam_endian.h bamtk.c kprobaln.h samtools.1\r\nbam_import.c bamtk.o kprobaln.o \u001b[34mwin32\u001b[m\u001b[m/\r\nbam_import.o \u001b[34mbcftools\u001b[m\u001b[m/ kseq.h\r\nbam_index.c bedidx.c 
ksort.h\r\n" } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": "!samtools", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "/bin/sh: samtools: command not found\r\n" } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": "cp samtools /usr/local/bin", "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": "!samtools", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "\r\nProgram: samtools (Tools for alignments in the SAM format)\r\nVersion: 0.1.18 (r982:295)\r\n\r\nUsage: samtools [options]\r\n\r\nCommand: view SAM<->BAM conversion\r\n sort sort alignment file\r\n mpileup multi-way pileup\r\n depth compute the depth\r\n faidx index/extract FASTA\r\n index index alignment\r\n idxstats BAM index stats (r595 or later)\r\n fixmate fix mate information\r\n flagstat simple stats\r\n calmd recalculate MD/NM tags and '=' bases\r\n merge merge sorted alignments\r\n rmdup remove PCR duplicates\r\n reheader replace BAM header\r\n cat concatenate BAMs\r\n targetcut cut fosmid regions (for fosmid pool only)\r\n phase phase heterozygotes\r\n\r\n" } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": "!samtools view -c /Volumes/web/cnidarian/tophat_071313/s_1.bam", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "164718919\r\n" } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": "#only mapped reads\n!samtools view -c -F 4 /Volumes/web/cnidarian/tophat_071313/s_1.bam", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "164718919\r\n" } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": "#unmapped reads\n!samtools view -c -f 4 
/Volumes/web/cnidarian/tophat_071313/s_1.bam", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "0\r\n" } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": "!samtools view -c /Volumes/web/cnidarian/tophat_071313/accepted_hits.bam", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "164718919\r\n" } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": "!samtools view -c /Volumes/web/cnidarian/tophat_071313/unmapped.bam", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "7263587\r\n" } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": "!samtools flagstat /Volumes/web/cnidarian/tophat_071313/s_1.bam", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "164718919 + 0 in total (QC-passed reads + QC-failed reads)\r\n0 + 0 duplicates\r\n164718919 + 0 mapped (100.00%:nan%)\r\n164718919 + 0 paired in sequencing\r\n82181831 + 0 read1\r\n82537088 + 0 read2\r\n42418114 + 0 properly paired (25.75%:nan%)\r\n161646608 + 0 with itself and mate mapped\r\n3072311 + 0 singletons (1.87%:nan%)\r\n1264082 + 0 with mate mapped to a different chr\r\n395312 + 0 with mate mapped to a different chr (mapQ>=5)\r\n" } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": 
"!samtools flagstat /Volumes/web/cnidarian/tophat_071313/accepted_hits.bam", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "164718919 + 0 in total (QC-passed reads + QC-failed reads)\r\n0 + 0 duplicates\r\n164718919 + 0 mapped (100.00%:nan%)\r\n164718919 + 0 paired in sequencing\r\n82181831 + 0 read1\r\n82537088 + 0 read2\r\n42418114 + 0 properly paired (25.75%:nan%)\r\n161646608 + 0 with itself and mate mapped\r\n3072311 + 0 singletons (1.87%:nan%)\r\n1264082 + 0 with mate mapped to a different chr\r\n395312 + 0 with mate mapped to a different chr (mapQ>=5)\r\n" } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": "!samtools flagstat /Volumes/web/cnidarian/tophat_071313/unmapped.bam", "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": "7161175 + 102412 in total (QC-passed reads + QC-failed reads)\r\n0 + 0 duplicates\r\n0 + 0 mapped (0.00%:0.00%)\r\n7161175 + 102412 paired in sequencing\r\n3670234 + 76916 read1\r\n3490941 + 25496 read2\r\n0 + 0 properly paired (0.00%:0.00%)\r\n0 + 0 with itself and mate mapped\r\n0 + 0 singletons (0.00%:0.00%)\r\n0 + 0 with mate mapped to a different chr\r\n0 + 0 with mate mapped to a different chr (mapQ>=5)\r\n" } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": "", "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }