{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "##Count the total number of sequences in the FASTQ file and store in variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This cell uses bash to count the number of lines in the FASTQ file (```wc -l```) and\n", "divides that total by ```4``` (each read occupies 4 lines in an Illumina FASTQ file).\n", "The ```echo``` command prints the result to standard output, which IPython captures and stores in the variable:\n", "```TotalSeqs```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "TotalSeqs = !echo $((`wc -l < 2112_lane1_NoIndex_L001_R1_001.fastq` / 4))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['16000000']\n" ] } ], "source": [ "#Prints the value stored in TotalSeqs.\n", "#Notice that this is a list of strings, not an integer!\n", "print TotalSeqs" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Converts the value in the TotalSeqs string list at index 0 (TotalSeqs[0]) to \n", "#an integer value of base 10.\n", "#This conversion will be used repeatedly throughout this notebook to allow \n", "#mathematical calculations using the numbers generated by bash commands.\n", "TotalSeqs = int(TotalSeqs[0], 10)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "16000000\n" ] } ], "source": [ "print TotalSeqs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Use bash ```grep``` and ```wc -l``` to count all the instances of the TruSeq adaptor sequence:\n", "\n", "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC" ] }, { "cell_type": "code", "execution_count": 21, 
"metadata": { "collapsed": true }, "outputs": [], "source": [ "TruSeq_adaptor_grep = !grep -o 'GATCGGAAGAGCACACGTCTGAACTCCAGTCAC' 2112_lane1_NoIndex_L001_R1_001.fastq \\\n", "| wc -l" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Converts the value in the TruSeq_adaptor_grep string list at index 0 (TruSeq_adaptor_grep[0]) to \n", "#an integer value of base 10.\n", "TruSeq_adaptor_grep = int(TruSeq_adaptor_grep[0])" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2484611\n" ] } ], "source": [ "print TruSeq_adaptor_grep" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Percentage of Reads Containing TruSeq adaptor sequence" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15.52881875\n" ] } ], "source": [ "#Calculates percentage of reads having TruSeq adaptor sequences.\n", "#Uses \"float\" to convert integer values to floating point decimals. 
Necessary since \n", "#the calculation on integers would be < 1 & would result in an answer of '0'.\n", "print ((float(TruSeq_adaptor_grep)/TotalSeqs)*100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Use bash ```grep``` and ```wc -l``` to count all the instances of the Epinext universal primer sequence:\n", "\n", "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "EpinextUniversal_grep = !grep -o 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT' 2112_lane1_NoIndex_L001_R1_001.fastq \\\n", "| wc -l" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Converts the value in the EpinextUniversal_grep string list at index 0 (EpinextUniversal_grep[0]) to \n", "#an integer value of base 10.\n", "EpinextUniversal_grep = int(EpinextUniversal_grep[0])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "184\n" ] } ], "source": [ "print EpinextUniversal_grep" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Percentage of Reads Containing Epinext Universal Primer sequence" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.00115\n" ] } ], "source": [ "#Calculates percentage of reads having Epinext universal primer sequences.\n", "#Uses \"float\" to convert integer values to floating point decimals. 
Necessary since \n", "#the calculation on integers would be < 1 & would result in an answer of '0'.\n", "print ((float(EpinextUniversal_grep)/TotalSeqs)*100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Use ```fastx_barcode_splitter``` to identify TruSeq adaptor sequence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####The ```fastx_barcode_splitter``` is a component of fastx_toolkit-0.0.13.2" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TruSeqAdaptor\tGATCGGAAGAGCACACGTCTGAACTCCAGTCAC\r\n" ] } ], "source": [ "#The full-length barcode file used by fastx_barcode_splitter.\n", "!head TruSeqAdaptor.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Look for TruSeq adaptor at beginning of lines" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Barcode\tCount\tLocation\r\n", "TruSeqAdaptor\t2721791\t./bol_TruSeqAdaptor.fastq\r\n", "unmatched\t13278209\t./bol_unmatched.fastq\r\n", "total\t16000000\r\n" ] } ], "source": [ "#Gunzip the gzipped FASTQ file.\n", "#Pipe the output of that to fastx_barcode_splitter.pl\n", "#fastx_barcode_splitter uses a default mismatch value = 1\n", "#Specify barcode file (--bcfile TruSeqAdaptor.txt)\n", "#Specify to look for barcode at beginning of each line (--bol)\n", "#Specify output location and append a prefix to new file name (--prefix ./bol_)\n", "#Specify new file name suffix (--suffix \".fastq\")\n", "#Print data to screen and output file (tee bol_TruSeqAdaptor_stats.txt)\n", "!gunzip -c 2112_lane1_NoIndex_L001_R1_001.fastq.gz | \\\n", "fastx_barcode_splitter.pl \\\n", "--bcfile TruSeqAdaptor.txt \\\n", "--bol \\\n", "--prefix ./bol_ \\\n", "--suffix \".fastq\" | \\\n", "tee bol_TruSeqAdaptor_stats.txt" ] 
}, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Uses awk to capture the second field (the \"Count\" column; print $2) from\n", "#the second line (FNR == 2) of the bol_TruSeqAdaptor_stats.txt\n", "#Stores the value in the variable TruSeqAdaptor_fastx_bol as a Python string list.\n", "TruSeqAdaptor_fastx_bol = !awk 'FNR == 2 {print $2}' bol_TruSeqAdaptor_stats.txt" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['2721791']\n" ] } ], "source": [ "print TruSeqAdaptor_fastx_bol" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Converts the value in the TruSeqAdaptor_fastx_bol string list at index 0 (TruSeqAdaptor_fastx_bol[0]) to \n", "#an integer value of base 10.\n", "TruSeqAdaptor_fastx_bol = int(TruSeqAdaptor_fastx_bol[0])" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2721791\n" ] } ], "source": [ "print TruSeqAdaptor_fastx_bol" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Percentage of Reads Containing TruSeq adaptor sequence at Beginning of Lines" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "17.01119375\n" ] } ], "source": [ "#Calculates percentage of reads having TruSeq adaptor sequences.\n", "#Uses \"float\" to convert integer values to floating point decimals. 
Necessary since \n", "#the calculation on integers would be < 1 & would result in an answer of '0'.\n", "print ((float(TruSeqAdaptor_fastx_bol)/TotalSeqs)*100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Look for TruSeq adaptor at end of lines" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Barcode\tCount\tLocation\r\n", "TruSeqAdaptor\t9890\t./eol_TruSeqAdaptor.fastq\r\n", "unmatched\t15990110\t./eol_unmatched.fastq\r\n", "total\t16000000\r\n" ] } ], "source": [ "#Gunzip the gzipped FASTQ file.\n", "#Pipe the output of that to fastx_barcode_splitter.pl\n", "#fastx_barcode_splitter uses a default mismatch value = 1\n", "#Specify barcode file (--bcfile TruSeqAdaptor.txt)\n", "#Specify to look for barcode at end of each line (--eol)\n", "#Specify output location and append a prefix to new file name (--prefix ./eol_)\n", "#Specify new file name suffix (--suffix \".fastq\")\n", "#Print data to screen and output file (tee eol_TruSeqAdaptor_stats.txt)\n", "!gunzip -c 2112_lane1_NoIndex_L001_R1_001.fastq.gz | \\\n", "fastx_barcode_splitter.pl \\\n", "--bcfile TruSeqAdaptor.txt \\\n", "--eol \\\n", "--prefix ./eol_ \\\n", "--suffix \".fastq\" | \\\n", "tee eol_TruSeqAdaptor_stats.txt" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Uses awk to capture the second field (the \"Count\" column; print $2) from\n", "#the second line (FNR == 2) of the eol_TruSeqAdaptor_stats.txt\n", "#Stores the value in the variable TruSeqAdaptor_fastx_eol as a Python string list.\n", "TruSeqAdaptor_fastx_eol = !awk 'FNR == 2 {print $2}' eol_TruSeqAdaptor_stats.txt" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Converts the value in the TruSeqAdaptor_fastx_eol string 
list at index 0 (TruSeqAdaptor_fastx_eol[0]) to \n", "#an integer value of base 10.\n", "TruSeqAdaptor_fastx_eol = int(TruSeqAdaptor_fastx_eol[0])" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "9890\n" ] } ], "source": [ "print TruSeqAdaptor_fastx_eol" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Percentage of Reads Containing TruSeq adaptor sequence at End of Lines" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0618125\n" ] } ], "source": [ "#Calculates percentage of reads having TruSeq adaptor sequences.\n", "#Uses \"float\" to convert integer values to floating point decimals. Necessary since \n", "#the calculation on integers would be < 1 & would result in an answer of '0'.\n", "print ((float(TruSeqAdaptor_fastx_eol)/TotalSeqs)*100)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }