Below is a quick workflow I am using to help Drinan annotate ~1.5 million sequences from a targeted-amplicon NGS effort on sand.
head /Users/sr320/Dropbox/hummingbird-ipython-nbs/data/DanD/meiofauna_forward_sequences.fa
>M02215:33:000000000-AFA9E:1:1101:14961:2005 1:N:0:15
TGACTGTGCTAAGGTAGCATAATTAATTGTCTTTTAATTAGAGACTTGTTTGAAAGATTT
TTTGAATTTAATATAGTTTTAAAATTATAAAAATGAATTTTTATATATTGGTAAAAATAC
CATGATTTTTTAAAAAGACGATAAGACCCTATCAAGTTTTACTTAAATTTAAAGAAAATT
TAGGTTTTAATGGGGCATTATTATTTATTTTAAATAAATTTTGATCTTAAATTAAATTTT
AGGAAATTTAATAAAATTACTGTAGGGATAACAGTGTAATATTTTTTAAAGTTCATATTT
A
>M02215:33:000000000-AFA9E:1:1101:11050:2011 1:N:0:15
TAACTGTGCTAAGGTAGCATAATCACTTGTCTCCTAATTAGAGACTGGCATGAAAGGGTA
AACTCTTTATAACTTTATAAAGCATACACACTGAAATTTTTATTTAGACGAAGAAATCTA
Within a working directory, I then ran the following (in a Jupyter notebook):
cp meiofauna_forward_sequences.fa query.fasta
This simply copies the file under a shorter name.
!/Users/steven/Dropbox/hummingbird-ipython-nbs/script-box/fasta-splitter.pl \
--n-parts 20 \
query.fasta
This splits the query into 20 parts, so that if a failure occurs I can simply restart from the part that failed.
Then a little bash cell magic to BLAST each part in turn:
%%bash
for f in query.part*
do
blastn \
-query "$f" \
-db /Volumes/Data/blast_db/nt \
-evalue 1e-5 \
-max_target_seqs 1 \
-max_hsps 1 \
-outfmt "6 std sskingdoms stitle staxids sscinames scomnames sblastnames" \
-num_threads 14 \
-out blastout_"$f"_nt
done
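Because each part writes its own output file, a restart only has to cover the parts that have no output yet. Below is a minimal sketch of that check, assuming the `query.part*` inputs and the `blastout_<part>_nt` naming used in the loop above; the helper name `pending_parts` is hypothetical. It only looks at whether an output file exists, so a part that was interrupted midway and left a partial file would need to be deleted by hand first.

```python
from pathlib import Path


def pending_parts(workdir="."):
    """Return the names of query parts that have no BLAST output file yet.

    Assumes inputs match 'query.part*' and outputs are named
    'blastout_<part>_nt', as in the bash loop. Existence is the only
    check: a partial output from an interrupted run counts as done.
    """
    workdir = Path(workdir)
    pending = []
    for part in sorted(workdir.glob("query.part*")):
        output = workdir / f"blastout_{part.name}_nt"
        if not output.exists():
            pending.append(part.name)
    return pending
```

The returned names could then be fed back into the same `blastn` loop in place of the `query.part*` glob.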
I spiced it up with a custom output format. From the manual:
Options 6, 7, and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers. The supported format specifiers are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accesion
qaccver means Query accesion.version
qlen means Query sequence length
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
saccver means Subject accession.version
sallacc means All subject accessions
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a '/'
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
scomnames means unique Subject Common Name(s), separated by a ';'
sblastnames means unique Subject Blast Name(s), separated by a ';' (in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ';' (in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
When not provided, the default value is: 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore', which is equivalent to the keyword 'std'
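With the format string used here (`6 std sskingdoms stitle staxids sscinames scomnames sblastnames`), each output line should carry 18 tab-separated columns: the 12 `std` fields followed by the six extra specifiers in the order given. A minimal sketch of turning one such line into a labelled record follows; the function name `parse_hit` is hypothetical, and it assumes the file is the tab-delimited output of format 6.

```python
# The 12 columns behind the 'std' keyword, as listed in the manual text.
STD_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
# The extra specifiers, in the order they appear in the -outfmt string.
EXTRA_COLUMNS = [
    "sskingdoms", "stitle", "staxids", "sscinames", "scomnames", "sblastnames",
]
COLUMNS = STD_COLUMNS + EXTRA_COLUMNS

INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_hit(line):
    """Split one tab-delimited BLAST line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    hit = dict(zip(COLUMNS, fields))
    for name in INT_COLUMNS:
        hit[name] = int(hit[name])
    for name in FLOAT_COLUMNS:
        hit[name] = float(hit[name])
    return hit
```

Text fields such as `staxids` are left as strings, since they may hold several values separated by `;`.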
and sample output:
M02215:33:000000000-AFA9E:1:1103:16706:22078 gi|56550013|gb|AY803660.1| 85.62 160 17 6 145 299 30 188 7e-37 163 Eukaryota Hydrothelphusa madagascariensis 18S ribosomal RNA gene, partial sequence 168669 Hydrothelphusa madagascariensis Hydrothelphusa madagascariensis crustaceans
M02215:33:000000000-AFA9E:1:1103:16318:23659 gi|393186538|gb|JX083886.1| 82.50 280 37 11 25 297 1 275 2e-58 235 Eukaryota Microeuraphia sp. 1 MPL-2012 isolate KACb37 16S ribosomal RNA gene, partial sequence; mitochondrial 1204330 Microeuraphia sp. 1 MPL-2012 Microeuraphia sp. 1 MPL-2012 crustaceans
M02215:33:000000000-AFA9E:1:1103:23857:24453 gi|386786433|gb|JQ435298.1| 96.23 53 2 0 248 300 1545 1597 5e-14 87.9 Eukaryota Cf. Traiania sp. DNA106167 voucher MCZ DNA106167 28S ribosomal RNA gene, partial sequence 1183147 cf. Traiania sp. DNA106167 cf. Traiania sp. DNA106167 daddy longlegs
M02215:33:000000000-AFA9E:1:1103:12121:24652 gi|302140702|gb|GQ343306.1| 98.31 295 5 0 6 300 22 316 1e-143 518 Eukaryota Evadne nordmanni isolate E37/5 16S ribosomal RNA gene, partial sequence; mitochondrial 141403 Evadne nordmanni Evadne nordmanni crustaceans
M02215:33:000000000-AFA9E:1:1104:23888:3062 gi|94960351|gb|DQ467789.1| 98.25 57 1 0 1 57 60 116 6e-18 100 Eukaryota Lynceus macleyanus isolate 53 16S ribosomal RNA gene, partial sequence; mitochondrial 381959 Lynceus macleyanus Lynceus macleyanus crustaceans
M02215:33:000000000-AFA9E:1:1104:11382:3077 gi|343168999|gb|JN018352.1| 84.52 310 36 9 1 300 2729 3036 7e-77 296 Eukaryota Ischyropsalis pyrenaea voucher MNHN-JAA33 28S ribosomal RNA gene, partial sequence 1046795 Ischyropsalis pyrenaea Ischyropsalis pyrenaea daddy longlegs
M02215:33:000000000-AFA9E:1:1104:16849:3145 gi|472441035|gb|KC529449.1| 86.77 310 29 12 1 301 558 864 1e-88 335 Eukaryota Microdalyellia nanella isolate W43ss 18S ribosomal RNA gene, partial sequence 1311903 Microdalyellia nanella Microdalyellia nanella flatworms
M02215:33:000000000-AFA9E:1:1104:15899:3944 gi|349592295|gb|JN205453.1| 92.12 292 21 1 12 301 1 292 2e-111 411 N/A Uncultured organism clone KCON28S38 28S ribosomal RNA gene, partial sequence 155900 uncultured organism uncultured organism N/A
M02215:33:000000000-AFA9E:1:1104:14928:4021 gi|46812207|gb|AY569664.1| 87.10 310 23 15 1 301 574 875 1e-88 335 Eukaryota Arenicolides ecaudata 18S ribosomal RNA gene, partial sequence 273060 Arenicolides ecaudata Arenicolides ecaudata segmented worms
M02215:33:000000000-AFA9E:1:1104:19218:4051 gi|301178327|gb|HM799910.1| 96.47 85 2 1 1 85 550 633 1e-29 139 Eukaryota Uncultured marine metazoan clone PRTBE7499 small subunit ribosomal RNA gene, partial sequence 329654 uncultured marine metazoan uncultured marine metazoan animals
M02215:33:000000000-AFA9E:1:1104:19414:4162 gi|295805440|emb|FN389538.1| 95.24 84 3 1 1 84 21 103 2e-27 132 Eukaryota Uncultured Glomus 18S ribosomal RNA gene, isolate, partial sequence. 07ED8BM 231055 uncultured Glomus uncultured Glomus glomeromycetes
M02215:33:000000000-AFA9E:1:1104:18162:4373 gi|342210101|gb|JF277589.1| 90.52 306 21 7 1 299 67 371 2e-107 398 Eukaryota Nemertean sp. 2 SA-2011 voucher MCZ DNA106139 16S ribosomal RNA gene, partial sequence; mitochondrial 947588 Nemertean sp. 2 SA-2011 Nemertean sp. 2 SA-2011 ribbon worms
M02215:33:000000000-AFA9E:1:1104:25946:4764 gi|110535863|gb|DQ665998.1| 84.92 305 37 9 1 301 552 851 5e-78 300 Eukaryota Phagocata vitta 18S ribosomal RNA gene, complete sequence 391283 Phagocata vitta Phagocata vitta flatworms
M02215:33:000000000-AFA9E:1:1104:26451:5022 gi|307647639|gb|HM564573.1| 93.71 302 18 1 1 301 470 771 1e-123 451 Eukaryota Phanodermatidae sp. JCC52 18S ribosomal RNA gene, partial sequence 883396 Phanodermatidae sp. JCC52 Phanodermatidae sp. JCC52 nematodes
M02215:33:000000000-AFA9E:1:1104:14581:5493 gi|472441081|gb|KC529495.1| 86.36 308 33 9 1 301 559 864 2e-86 327 Eukaryota Dochmiotrema limicola isolate UH80.2 18S ribosomal RNA gene, partial sequence 1311982 Dochmiotrema limicola Dochmiotrema limicola flatworms
M02215:33:000000000-AFA9E:1:1104:9527:5655 gi|170672378|gb|EU376009.1| 87.06 309 32 6 1 301 2834 3142 9e-91 342 Eukaryota Craterostigmus tasmanianus 28S ribosomal RNA gene, partial sequence 60162 Craterostigmus tasmanianus Craterostigmus tasmanianus centipedes
M02215:33:000000000-AFA9E:1:1104:15515:5664 gi|373860124|gb|HQ865054.1| 86.75 302 34 6 2 301 7 304 2e-87 331 Eukaryota Uncultured eukaryote clone SGPX651 18S ribosomal RNA gene, partial sequence 100272 uncultured eukaryote uncultured eukaryote eukaryotes
M02215:33:000000000-AFA9E:1:1104:25827:5676 gi|154101573|gb|EF552055.1| 99.34 301 2 0 1 301 68 368 6e-152 545 Eukaryota Balanus glandula isolate 2 16S ribosomal RNA gene, partial sequence; mitochondrial 110520 Balanus glandula Balanus glandula crustaceans
M02215:33:000000000-AFA9E:1:1104:22156:5763 gi|157787506|gb|EF990727.1| 89.11 303 27 4 1 300 2446 2745 1e-99 372 Eukaryota Rhabditoides inermiformis strain SB328 28S large subunit ribosomal RNA gene, partial sequence 96653 Rhabditoides inermiformis Rhabditoides inermiformis nematodes
M02215:33:000000000-AFA9E:1:1104:14633:5939 gi|213959396|gb|FJ426630.1| 78.43 306 53 12 1 301 47 344 4e-44 187 Eukaryota Brachionus angularis voucher S. H. Cheng 001 16S ribosomal RNA gene, partial sequence; mitochondrial 396692 Brachionus angularis Brachionus angularis rotifers
M02215:33:000000000-AFA9E:1:1104:10324:6174 gi|386696111|gb|JQ000284.1| 79.22 308 52 11 2 301 537 840 4e-49 204 Eukaryota Xolalgidae gen. sp. AD1204 18S ribosomal RNA gene, partial sequence 1111400 Xolalgidae gen. sp. AD1204 Xolalgidae gen. sp. AD1204 mites & ticks
M02215:33:000000000-AFA9E:1:1104:13062:6188 gi|74100231|gb|DQ186202.1| 87.97 291 25 7 1 284 13348 13635 1e-88 335 Eukaryota Thalassiosira pseudonana mitochondrion, complete genome 35128 Thalassiosira pseudonana Thalassiosira pseudonana diatoms
M02215:33:000000000-AFA9E:1:1104:13172:6214 gi|109390598|emb|AM039747.1| 82.95 305 39 13 1 298 2492 2790 7e-67 263 Eukaryota Heligmosomoides polygyrus 28S rRNA gene 6339 Heligmosomoides polygyrus Heligmosomoides polygyrus nematodes
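The last column (`sblastnames`) gives a coarse group such as crustaceans, flatworms or nematodes, so a quick tally of it is one way to get a first look at what is in the sample. This is only a sketch of that idea, not a step from the workflow above: the function name `count_blast_names` is hypothetical, it assumes 18 tab-separated columns per line, and it counts the field as written without splitting values that contain `;`.

```python
from collections import Counter

# Zero-based position of 'sblastnames': 12 'std' columns plus five extras before it.
SBLASTNAMES_INDEX = 17


def count_blast_names(lines):
    """Tally the sblastnames column over an iterable of tab-delimited lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= SBLASTNAMES_INDEX:
            continue  # skip blank or truncated lines
        counts[fields[SBLASTNAMES_INDEX]] += 1
    return counts
```

Passing an open file handle works, since a file iterates line by line; `counts.most_common()` then lists the groups from most to least frequent.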
I will end up concatenating (cat-ing) the 20 output files once everything is done.
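A minimal sketch of that final step, assuming the outputs follow the `blastout_<part>_nt` naming from the loop and that each file ends with a newline. The function name `concatenate_outputs` and the combined file name `blastout_all_nt.tab` are hypothetical; the latter is chosen so that it does not match the input pattern.

```python
import shutil
from pathlib import Path


def concatenate_outputs(workdir=".", combined="blastout_all_nt.tab"):
    """Append every per-part BLAST output, in sorted order, to one file.

    Returns the names of the files that were combined so the count
    can be checked against the expected number of parts.
    """
    workdir = Path(workdir)
    # Collect the inputs first so the combined file is never read as an input.
    parts = sorted(workdir.glob("blastout_query.part*_nt"))
    with open(workdir / combined, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return [part.name for part in parts]
```

Checking that the returned list has 20 entries is a cheap way to confirm that no part was missed before moving on.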