{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "##Count the total number of sequences in the FASTQ file and store it in a variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This cell uses bash to count the number of lines in the FASTQ file (```wc -l```)\n", "and divides that total by ```4``` (each read occupies 4 lines in an Illumina FASTQ file).\n", "The ```echo``` command prints the result, which IPython captures in the variable:\n", "```TotalSeqs```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "TotalSeqs = !echo $((`wc -l < 2112_lane1_NoIndex_L001_R1_001.fastq` / 4))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['16000000']\n" ] } ], "source": [ "#Prints the value stored in TotalSeqs.\n", "#Notice that this is a Python list of strings, not an integer!\n", "print TotalSeqs" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Converts the string at index 0 of the TotalSeqs list (TotalSeqs[0]) to \n", "#a base-10 integer.\n", "#This conversion is used repeatedly throughout this notebook to allow \n", "#mathematical calculations on the numbers generated by bash commands.\n", "TotalSeqs = int(TotalSeqs[0])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "16000000\n" ] } ], "source": [ "print TotalSeqs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Use bash ```grep``` and ```wc -l``` to count all the instances of the Epinext adaptor 1 sequence:\n", "\n", "ACACTCTTTCCCTACACGACGCTCTTCCGATCT" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ 
"TruSeq_adaptor1_grep = !grep -o 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT ' 2112_lane1_NoIndex_L001_R1_001.fastq \\\n", "| wc -l" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Converts the value in the TruSeq_adaptor1_grep string list at index 0 (TruSeq_adaptor1_grep[0]) to \n", "#an integer value of base 10.\n", "TruSeq_adaptor1_grep = int(TruSeq_adaptor1_grep[0])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n" ] } ], "source": [ "print TruSeq_adaptor1_grep" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Percentage of Reads Containing Epinext adaptor 1 sequence" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0\n" ] } ], "source": [ "#Calculates percentage of reads having TruSeq adaptor sequences.\n", "#Uses \"float\" to convert integer values to floating point decimals. Necessary since \n", "#the calculation on integers would be < 1 & would result in an answer of '0'.\n", "print ((float(TruSeq_adaptor1_grep)/TotalSeqs)*100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Use ```fastx_barcode_splitter``` to identify Epinext adaptor 1 sequence." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "####The ```fastx_barcode_splitter``` is a component of fastx_toolkit-0.0.13.2" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epinext_1\tACACTCTTTCCCTACACGACGCTCTTCCGATCT\r\n" ] } ], "source": [ "#The full-length barcode file used by fastx_barcode_splitter.\n", "!head EpinextAdaptor1.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Look for Epinext adaptor 1 at beginning of lines" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Barcode\tCount\tLocation\r\n", "Epinext_1\t5\t./bol_Epinext_1.fastq\r\n", "unmatched\t15999995\t./bol_unmatched.fastq\r\n", "total\t16000000\r\n" ] } ], "source": [ "#Gunzip the gzipped FASTQ file.\n", "#Pipe the output of that to fastx_barcode_splitter.pl\n", "#fastx_barcode_splitter uses a default mismatch value = 1\n", "#Specify barcode file (--bcfile EpinextAdaptor1.txt)\n", "#Specify to look for barcode at beginning of each line (--bol)\n", "#Specify output location and prepend a prefix to new file names (--prefix ./bol_)\n", "#Specify new file name suffix (--suffix \".fastq\")\n", "#Print data to screen and output file (tee bol_EpinextAdaptor1_stats.txt)\n", "!gunzip -c 2112_lane1_NoIndex_L001_R1_001.fastq.gz | \\\n", "fastx_barcode_splitter.pl \\\n", "--bcfile EpinextAdaptor1.txt \\\n", "--bol \\\n", "--prefix ./bol_ \\\n", "--suffix \".fastq\" | \\\n", "tee bol_EpinextAdaptor1_stats.txt" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Uses awk to capture the second field (the \"Count\" column; print $2) from\n", "#the second line (FNR == 2) of bol_EpinextAdaptor1_stats.txt\n", "#Stores the value in the variable EpinextAdaptor1_fastx_bol as a Python list of strings.\n", 
"EpinextAdaptor1_fastx_bol = !awk 'FNR == 2 {print $2}' bol_EpinextAdaptor1_stats.txt" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['5']\n" ] } ], "source": [ "print EpinextAdaptor1_fastx_bol" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Converts the value in the TruSeqAdaptor_fastx_bol string list at index 0 (TruSeqAdaptor_fastx_bol[0]) to \n", "#an integer value of base 10.\n", "EpinextAdaptor1_fastx_bol = int(EpinextAdaptor1_fastx_bol[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Percentage of Reads Containing Epinext adaptor 1 sequence at Beginning of Lines" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.125e-05\n" ] } ], "source": [ "#Calculates percentage of reads having Epinext adaptor 1 sequences.\n", "#Uses \"float\" to convert integer values to floating point decimals. Necessary since \n", "#the calculation on integers would be < 1 & would result in an answer of '0'.\n", "print ((float(EpinextAdaptor1_fastx_bol)/TotalSeqs)*100)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }