{
 "metadata": {
  "name": "OlyO_PacBio"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "#Olympia Oyster (Pat) Initial PacBio Data Analysis\n\n_Includes conversion, mapping RNA-seq data, and blast comparison of transcriptome_  \n_Focus is on scaffolds made from assembly of reads > 10k bp_\n\n-updated July 10 2013 14:00 - Tophat PE Male Gonad added"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "---\n_Data was provided by core facility with comments:_\n\nData are grouped into folders by SMRT cell. Folders are generically named by the position of the cell in the run (A01_1, B01_1, etc.). Reads are available in PacBio's HDF5 format as .bas.h5 files. Metadata for each SMRT cell is available in PacBio's XML format.\n\nTools for analysis of .bas.h5 files are available on PacBio's DevNet site (http://pacbiodevnet.com/) including the SMRT Analysis toolkit which can perform de novo assembly and resequencing.\n\n_Data was downloaded locally._  \n<http://eagle.fish.washington.edu/whale/index.php?dir=Pat%2F>\n    \nTwo files\n\n<img src=\"https://www.evernote.com/shard/s10/sh/f3666265-4768-4529-b16f-9c9d2458ab2a/18a16265a05c085c37230ba26509119d/deep/0/Screenshot%207/8/13%2012:20%20PM.png\" alt=\"Screenshot%207/8/13%2012:20%20PM\" />"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "##Conversion to fastq\n\n\n_via Giles._\n\nBasically I took the bas.h5 file and ran it through the pbh5tools, specifically bash5tools.py.\n\nThe exact command I ran was\n\nbash5tools.py --readType unrolled --outType fastq inputfile.bas.h5\n\nwhere inputfile.bas.h5 was the name of the file you got from them.\n\n**My only question was whether I did the --readType correctly**.  I found some info on that online as well as some examples of running this command and they all talked about using a \"Raw\" mode which was removed from this version of bash5tools so I assumed it was the unrolled option because the other two options (ccs and subreads) where the same in previous versions.\n\nTo Install,\n\nFirst you need numpy and h5py, both are python libraries.  I installed these via MacPorts.  Then you to install the pbcore library, which is a python library provided by PacBios (same website as the pbh5tools).\n\nBasically download pbcore, unpack, then run\npython setup.py build\nsudo python setup.py install\n\nThis will install it with all the other python libraries, then do same with pbh5tools.  You might need to add things to your PYTHONPATH and PATH environment variables.  If you want to, you can customize where it goes by adding a --prefix <LOCATION> to the install command but then you have to adjust PYTHONPATH and PATH (which is what I did).\n\nLinks:\n\nUnderstanding PacBios reads\nhttps://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Understanding-PacBio-transcriptome-data#wiki-readexplained\n\nExamples for bash5tools.py\nhttp://seqanswers.com/forums/archive/index.php/t-16895.html\n\nGithub for tools\nhttps://github.com/PacificBiosciences\n\n---\nresulting in   \n<http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_1.fastq>\n\n    \n    "
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "##Importing into CLC\n\n_info via CLC_  \n\nCLC bio tools are currently not optimized to handle the specific error profile of PacBio reads, and we therefore do not yet provide support for using this data type.\n\nIf you still wish to try using PacBio data in the Genomics Workbench, then if you have your data in standard fastq format, you should be able to succeed in importing it into the CLC bio Genomics Workbench using the Illumina import option. If you choose to do this, please choose the option \"NCBI/Sanger or Illumina Pipeline 1.8 or later\" under 'Quality scores' in the Import wizard .\n\nOur developers are currently evaluating methods for error correction of PacBio reads (using short read data), and for hybrid de novo assembly using PacBio and short read data combined as input. If the outcome of this work matches the performance of tools commonly used for PacBio error correction (e.g. Celera Assembler using the pacBioToCA utility; Koren et al. 2012. Nature Biotechnology), or hybrid de novo assembly (e.g. ALPATHS-LG; Ribeiro et al. 2012. Genome Research), we will be considering providing such functionality in the future.\n"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "_thus_\n\n<img src=\"https://www.evernote.com/shard/s10/sh/c650d9d8-d87c-4cd5-8cd5-5957fb31b515/1655f6ede0b68e282eef9e79002774a6/deep/0/Screenshot%207/8/13%2012:27%20PM.png\" alt=\"Screenshot%207/8/13%2012:27%20PM\" width = 50%/>\n\n###Stats\n47,475 Sequences  \n\nQC Report - <http://eagle.fish.washington.edu/cnidarian/OlyO_PacBio_QC.pdf>\n\n_also viewable below_\n\n<iframe src=\"https://docs.google.com/file/d/0B9V_gF766XZATTlWVzBDMkxxOGM/preview\" width=\"90%\" height=\"480\"></iframe>"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "---\nMore stats-  \n**3058 sequences > 10000 bp**  \n\n<http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k.fa>  \n    \n    Running denovo assembly on\n    Fast (simple contigs) \n<http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa>\n    \n<iframe src=\"https://docs.google.com/file/d/0B9V_gF766XZAOUg3ajNnbDZoMk0/preview\" width=\"90%\" height=\"480\"></iframe>\n    "
    },
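    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "A minimal sketch, not part of the original analysis, of how reads longer than 10,000 bp could be pulled out of the converted fastq. It assumes a plain four-line-per-record fastq with unwrapped sequence lines; the function name `long_reads` and the output file name in the usage comment are made up for illustration."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "def long_reads(fastq_lines, min_len=10000):\n    # Yield (name, sequence) for every read longer than min_len.\n    # Assumes 4 lines per record: @name, sequence, +, qualities.\n    lines = iter(fastq_lines)\n    for header in lines:\n        if not header.strip():\n            continue  # skip blank lines between or after records\n        seq = next(lines, '').strip()\n        next(lines, '')  # '+' separator line\n        next(lines, '')  # quality line\n        if len(seq) > min_len:\n            yield header.strip()[1:], seq\n\n# Hypothetical usage (the output file name is an assumption):\n# with open('OlyO_Pat_PacBio_1.fastq') as fq, open('reads_over_10k.fa', 'w') as fa:\n#     for name, seq in long_reads(fq):\n#         fa.write('>' + name + '\\n' + seq + '\\n')",
     "language": "python",
     "metadata": {},
     "outputs": []
    },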
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "!head /Volumes/web/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": ">OlyO_Pat_PacBio_10k_contig_1\r\nAAAAAAAGGGAGATGTTTTCCTCATGTTGAATTGAATTCTTCAACTCATTTAAATCAGGC\r\nAAGTGTTCAACACTAACGCTAGTTCCGGTTAGCCTGTAGGTCTAATAACTTTTGTCCAAA\r\nCTACCAGGATATACAATTATGCACTGTTTAGCAGGGAGACATCACAAAAGGTATTTAATT\r\nCCGATTACGCAAGAGCTTTTCTGCGTATTGATCAGGTTTTTTGAATCGAGGCAATGCTAC\r\nCTTACAGACAAAGTTTGTTGTTGTTCAGGGGTTTCAATAGTCACGTTTTAAAGGCAGCAT\r\nAATTTCGTTATAATTCTAATGGTCGTTATACAATTTAGTTTACCAATATTTGACCTATCA\r\nTTTGAGTCAAATACTTGGTCCTGATTGTGGTTTCATAGCCATTGTTAAGTGCCGTTTTAC\r\nACACTGTATTTGACTACGGACTGTTCCGTTACCTGATCAAGACTATGGGGCCTCCCACGG\r\nCGGGTGTGACGCGGTCAACAGGGGATGCCTACTCCTCTCTAGGCAGCCTGATCCCCACTT\r\n"
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "---\n###Ran cd-hit-est via webserver\n\n\nSummary: Nothing combined at 90% level (probably to big for cd-est hit)\n\n\n"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "---\n###Importing in iPlant\n\n<img src=\"https://www.evernote.com/shard/s10/sh/27343fdc-9473-4c4e-9736-bcac9d0ea81c/16a1cd3d7ed6a500b81bb7315bf43d70/deep/0/Screenshot%207/10/13%207:09%20AM.png\" alt=\"Screenshot%207/10/13%207:09%20AM\" />\n\n\n"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "TopHat2-PE  \n<img src=\"https://www.evernote.com/shard/s10/sh/459553be-e99a-4bd6-8c63-13d38cd7704a/b59aa6d62e89e9fd5b95f7298a69e767/deep/0/Screenshot%207/10/13%207:10%20AM.png\" alt=\"Screenshot%207/10/13%207:10%20AM\" />\n\n---\n<img src=\"https://www.evernote.com/shard/s10/sh/e6391b0d-fb8e-4fa6-9440-71fdf5104d74/51b686554935cb1efa7bcc4e1f0003dc/deep/0/Screenshot%207/10/13%207:11%20AM.png\" alt=\"Screenshot%207/10/13%207:11%20AM\" />\n\n---\nReference Genome - OlyO_Pat_PacBio_10k_contigs.fa\n\nOutput - OlyO_10kcontigs_TopHatm106\n"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "#finished",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 32
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "output file   \n`http://eagle.fish.washington.edu/cnidarian/OlyO10k_filtered_106A_Male_Mix_TAGCTT_L004_R1.bam`\n\n<img src=\"https://www.evernote.com/shard/s10/sh/1107e754-c96c-4a3b-aa96-1e75a072d6a0/0fd7cf33ccdd81ccccf958203a1f1134/deep/0/Screenshot%207/10/13%201:53%20PM.png\" alt=\"Screenshot%207/10/13%201:53%20PM\" />"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "---\n##Annotating 10kcontigs\n<http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa>\n    \nHave working copy of transcriptome <http://eagle.fish.washington.edu/cnidarian/OlyOv3_400bp.fasta> and will blastn against 10k contigs    "
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "!makeblastdb -in /Volumes/web/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa -dbtype nucl -out /Volumes/Bay3/Software/ncbi-blast-2.2.27\\+/db/OlyO_PacBio_10kcontigs",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\r\n\r\nBuilding a new DB, current time: 07/10/2013 07:40:30\r\nNew DB name:   /Volumes/Bay3/Software/ncbi-blast-2.2.27+/db/OlyO_PacBio_10kcontigs\r\nNew DB title:  /Volumes/web/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa\r\nSequence type: Nucleotide\r\nKeep Linkouts: T\r\nKeep MBits: T\r\nMaximum file size: 1000000000B\r\n"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "Adding sequences from FASTA; added 553 sequences in 0.575964 seconds.\r\n"
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "!blastn -query /Volumes/web/cnidarian/OlyOv3_400bp.fasta -db /Volumes/Bay3/Software/ncbi-blast-2.2.27\\+/db/OlyO_PacBio_10kcontigs -out /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs -outfmt 6 -evalue 1E-20 -num_threads 2 -task blastn",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "!head /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "4486232\tOlyO_Pat_PacBio_10k_contig_319\t89.39\t132\t4\t5\t142\t264\t21178\t21048\t1e-39\t 161\r\n4486232\tOlyO_Pat_PacBio_10k_contig_319\t90.00\t130\t4\t6\t268\t390\t13091\t12964\t1e-38\t 158\r\n4486232\tOlyO_Pat_PacBio_10k_contig_319\t90.00\t130\t4\t6\t268\t390\t20765\t20638\t1e-38\t 158\r\n4486232\tOlyO_Pat_PacBio_10k_contig_319\t87.88\t132\t6\t5\t142\t264\t13504\t13374\t5e-37\t 152\r\n4486232\tOlyO_Pat_PacBio_10k_contig_319\t84.50\t129\t9\t9\t144\t267\t11561\t11683\t4e-25\t 113\r\n4486232\tOlyO_Pat_PacBio_10k_contig_319\t84.50\t129\t9\t9\t144\t267\t19235\t19357\t4e-25\t 113\r\n4486232\tOlyO_Pat_PacBio_10k_contig_319\t82.14\t140\t9\t11\t142\t265\t5867\t5728\t7e-23\t 105\r\n4486255\tOlyO_Pat_PacBio_10k_contig_420\t77.51\t418\t31\t33\t36\t397\t9827\t10237\t4e-63\t 239\r\n4486256\tOlyO_Pat_PacBio_10k_contig_420\t77.51\t418\t31\t33\t5\t366\t10237\t9827\t4e-63\t 239\r\n4486297\tOlyO_Pat_PacBio_10k_contig_420\t78.89\t289\t34\t17\t2\t276\t24459\t24184\t2e-49\t 194\r\n"
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "!head /Volumes/web/cnidarian/OlyOv3_400bp.fasta",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": ">4485895 length 400 cvg_67.8_tip_0\r\nACCCAGAAAGGTTTAAAGAATGTATTTGATGAAGCCATTCTGGCTGCTCTGGAACCTCCTGAACCACCCAAAAAGAAGAAGTGTGTGTTGTTGTAATCTT\r\nTGAACTCTCGTCAGTTTCATGTGTAATCATAGAATGATTTCAACTTGTCATCTGTGGGAAAATCTTGTGCAAAATTAAAAATAAAAACCACTTTTATACA\r\nTGTCTGGATAAGTATTTTCACAGATGGAAGAGTGCGGGTTGAAATAGAGATTATTCCAACTTTCTGAAGAAAAGGAATATTTGAAGTTCCTGAGACGGAA\r\nAAGGCAGGTGTTATTTTCAAGCGAACCACTAGCACAGTGCTGTGGTTTTATTATCCCATATGGGTCCAATGAACATATGATTTGTAAATATATATATAAT\r\n\r\n>4485897 length 400 cvg_5.5_tip_1\r\nACACTGCACATCGCGGTCCTAGATTTTAACGACAACACGCCGTATTTCCTTAACAGTACATATAAATTTAGTGAAAATGAATCCACTTACAACAGAACAA\r\nGAATTGGAGCATTGTATGCCCATGATCTGGACTCGGGACAAAACGCCAATATAACGTTTTCTATCTCTGGAGGAAACAGTCAAGGACATTTTCAAATAGA\r\nTCCATACACGGGTGATTTATTCATAAATGGCCTTATCGACAGAGAAAATGTATCCTCATACAACCTGAGAGTTACTATAAGAGATAATCCGTCTAATCCG\r\n"
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "!grep -c \"OlyO_Pat_\" /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs ",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1730\r\n"
      }
     ],
     "prompt_number": 14
    },
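    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "The `grep -c` count above is the number of hit lines, not the number of transcriptome sequences or contigs involved. A minimal sketch for counting those separately, assuming the standard 12-column `-outfmt 6` layout (query id in column 1, subject id in column 2); the function name `summarize_hits` is made up for illustration."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "def summarize_hits(blast_lines):\n    # Return (hit lines, distinct queries, distinct subjects) for -outfmt 6 rows.\n    queries, subjects, n_hits = set(), set(), 0\n    for line in blast_lines:\n        fields = line.rstrip('\\r\\n').split('\\t')\n        if len(fields) < 12:\n            continue  # skip blank or malformed lines\n        n_hits += 1\n        queries.add(fields[0])   # transcriptome sequence id\n        subjects.add(fields[1])  # PacBio contig id\n    return n_hits, len(queries), len(subjects)\n\n# Hypothetical usage on the blastn output file written above:\n# with open('/Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs') as fh:\n#     print(summarize_hits(fh))",
     "language": "python",
     "metadata": {},
     "outputs": []
    },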
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "---\nWill try perl script I modified to take this \"reverse\" blast and make a gff file.  \n\nAnother option in SQLShare"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "!2_Blast2gff.pl -i /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs -o /Users/sr320/Desktop/OlyO10kcontigs_exon_a.gff -d \"OlyOv3_400\" -p EXON -s \"something\"",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 27
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "mv /Users/sr320/Desktop/OlyO10kcontigs_exon_a.gff /Volumes/web/cnidarian/",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "mv: /Volumes/web/cnidarian/OlyO10kcontigs_exon_a.gff: set owner/group (was: 501/20): Operation not permitted\r\n"
      }
     ],
     "prompt_number": 30
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "perl script would not write to eagle so had to move"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "<http://eagle.fish.washington.edu/cnidarian/OlyO10kcontigs_exon_a.gff>"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "!head /Volumes/web/cnidarian/OlyO10kcontigs_exon_a.gff",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "OlyO_Pat_PacBio_10k_contig_319\tblastn:OlyOv3_400\tblastn\t21178\t21048\t1e-39\t-\t.\t4486232\t\r\nOlyO_Pat_PacBio_10k_contig_319\tblastn:OlyOv3_400\tblastn\t13091\t12964\t1e-38\t-\t.\t4486232\t\r\nOlyO_Pat_PacBio_10k_contig_319\tblastn:OlyOv3_400\tblastn\t20765\t20638\t1e-38\t-\t.\t4486232\t\r\nOlyO_Pat_PacBio_10k_contig_319\tblastn:OlyOv3_400\tblastn\t13504\t13374\t5e-37\t-\t.\t4486232\t\r\nOlyO_Pat_PacBio_10k_contig_319\tblastn:OlyOv3_400\tblastn\t11561\t11683\t4e-25\t+\t.\t4486232\t\r\nOlyO_Pat_PacBio_10k_contig_319\tblastn:OlyOv3_400\tblastn\t19235\t19357\t4e-25\t+\t.\t4486232\t\r\nOlyO_Pat_PacBio_10k_contig_319\tblastn:OlyOv3_400\tblastn\t5867\t5728\t7e-23\t-\t.\t4486232\t\r\nOlyO_Pat_PacBio_10k_contig_420\tblastn:OlyOv3_400\tblastn\t9827\t10237\t4e-63\t+\t.\t4486255\t\r\nOlyO_Pat_PacBio_10k_contig_420\tblastn:OlyOv3_400\tblastn\t10237\t9827\t4e-63\t-\t.\t4486256\t\r\nOlyO_Pat_PacBio_10k_contig_420\tblastn:OlyOv3_400\tblastn\t24459\t24184\t2e-49\t-\t.\t4486297\t\r\n"
      }
     ],
     "prompt_number": 31
    },
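    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "In the output above, minus-strand hits have a start greater than the end (e.g. 21178 to 21048). GFF3 requires start <= end, with the strand column carrying the orientation, so some viewers may not handle these rows well. A minimal sketch of a conversion that swaps the coordinates, assuming 12-column `-outfmt 6` input. This is not the Perl script used above; `blast_row_to_gff`, the `match` feature type, and the `Name=` attribute are choices made here for illustration."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "def blast_row_to_gff(line, source='blastn:OlyOv3_400', feature='match'):\n    # Convert one -outfmt 6 row into a 9-column GFF row placed on the subject (contig).\n    f = line.rstrip('\\r\\n').split('\\t')\n    query, subject, evalue = f[0], f[1], f[10]\n    sstart, send = int(f[8]), int(f[9])\n    # Orientation comes from the subject coordinates; GFF wants start <= end.\n    strand = '+' if sstart <= send else '-'\n    start, end = min(sstart, send), max(sstart, send)\n    return '\\t'.join([subject, source, feature, str(start), str(end),\n                      evalue, strand, '.', 'Name=' + query])\n\n# Hypothetical usage (the output file name is an assumption):\n# with open('/Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs') as fh, \\\n#         open('OlyO10kcontigs_exon_b.gff', 'w') as out:\n#     for row in fh:\n#         if row.strip():\n#             out.write(blast_row_to_gff(row) + '\\n')",
     "language": "python",
     "metadata": {},
     "outputs": []
    },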
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": " <img src=\"https://www.evernote.com/shard/s10/sh/984ad64e-20c5-418a-91ae-a7a94f2d95c8/6a1ab77f10a0c0941d0f439286b51d99/deep/0/Screenshot%207/10/13%208:18%20AM.png\" alt=\"Screenshot%207/10/13%208:18%20AM\" />"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "",
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}