
Below is a quick workflow I am using to help Drinan annotate ~1.5 million sequences from an amplicon-targeted NGS effort on sand samples. The input is a FASTA file of forward reads:

 head /Users/sr320/Dropbox/hummingbird-ipython-nbs/data/DanD/meiofauna_forward_sequences.fa
 >M02215:33:000000000-AFA9E:1:1101:14961:2005 1:N:0:15
 TGACTGTGCTAAGGTAGCATAATTAATTGTCTTTTAATTAGAGACTTGTTTGAAAGATTT
 TTTGAATTTAATATAGTTTTAAAATTATAAAAATGAATTTTTATATATTGGTAAAAATAC
 CATGATTTTTTAAAAAGACGATAAGACCCTATCAAGTTTTACTTAAATTTAAAGAAAATT
 TAGGTTTTAATGGGGCATTATTATTTATTTTAAATAAATTTTGATCTTAAATTAAATTTT
 AGGAAATTTAATAAAATTACTGTAGGGATAACAGTGTAATATTTTTTAAAGTTCATATTT
 A
 >M02215:33:000000000-AFA9E:1:1101:11050:2011 1:N:0:15
 TAACTGTGCTAAGGTAGCATAATCACTTGTCTCCTAATTAGAGACTGGCATGAAAGGGTA
 AACTCTTTATAACTTTATAAAGCATACACACTGAAATTTTTATTTAGACGAAGAAATCTA

 

 

 

Within a working directory, I ran the following steps in a Jupyter notebook:

cp meiofauna_forward_sequences.fa query.fasta
to give the file a simpler name (this makes a copy; the original is left in place).

!/Users/steven/Dropbox/hummingbird-ipython-nbs/script-box/fasta-splitter.pl \
--n-parts 20 \
query.fasta
to split the query into 20 parts. The point of splitting is that if a failure occurs, restarting is simple.
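
A quick sanity check after splitting is to confirm that no records were lost. The sketch below counts FASTA header lines (lines starting with `>`) in whatever files it is given; `count_records` is a hypothetical helper name, and the check assumes the split parts match the glob `query.part*`, as in the BLAST loop.

```shell
# Sketch: count FASTA records by counting header lines.
# count_records is a hypothetical helper, not part of fasta-splitter.pl.
count_records() {
  # Concatenate all given files and count lines that begin with ">".
  cat "$@" | grep -c '^>'
}
```

Running `count_records query.fasta` and `count_records query.part*` should print the same number if the split is complete.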

Then a little cell magic (%%bash) to loop blastn over the parts:

 %%bash
 for f in query.part*
 do
 blastn \
 -query "$f" \
 -db /Volumes/Data/blast_db/nt \
 -evalue 1e-5 \
 -max_target_seqs 1 \
 -max_hsps 1 \
 -outfmt "6 std sskingdoms stitle staxids sscinames scomnames sblastnames" \
 -num_threads 14 \
 -out blastout_"$f"_nt
 done
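
Since the reason for splitting is an easy restart, one possible variation of the loop skips parts that already have an output file. This is a minimal sketch, not the loop that was actually run: the function name `run_parts`, the `BLAST_CMD` and `DB` variables, and the `.tmp` suffix are all assumptions added for illustration. The blastn options are the same as above.

```shell
# Sketch of a restartable version of the loop.
# BLAST_CMD, DB, run_parts and the ".tmp" suffix are illustrative assumptions.
BLAST_CMD="${BLAST_CMD:-blastn}"
DB="${DB:-/Volumes/Data/blast_db/nt}"

run_parts() {
  local f out
  for f in query.part*; do
    [ -e "$f" ] || continue          # glob matched nothing
    out="blastout_${f}_nt"
    if [ -e "$out" ]; then
      echo "skipping $f (output exists)"
      continue
    fi
    # Write to a temporary file and rename only on success, so an
    # interrupted run never leaves a partial file that looks finished.
    if "$BLAST_CMD" \
        -query "$f" \
        -db "$DB" \
        -evalue 1e-5 \
        -max_target_seqs 1 \
        -max_hsps 1 \
        -outfmt "6 std sskingdoms stitle staxids sscinames scomnames sblastnames" \
        -num_threads 14 \
        -out "$out.tmp"; then
      mv "$out.tmp" "$out"
    else
      echo "blastn failed on $f" >&2
      rm -f "$out.tmp"
    fi
  done
}
```

Calling `run_parts` in the working directory after a crash would then rerun only the parts that have no `blastout_*_nt` file yet.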

 

I spiced it up with a custom output format that adds the subject title and taxonomy columns to the standard fields.

From the manual:

 

 Options 6, 7, and 10 can be additionally configured to produce
 a custom format specified by space delimited format specifiers.
 The supported format specifiers are:
 qseqid means Query Seq-id
 qgi means Query GI
 qacc means Query accesion
 qaccver means Query accesion.version
 qlen means Query sequence length
 sseqid means Subject Seq-id
 sallseqid means All subject Seq-id(s), separated by a ';'
 sgi means Subject GI
 sallgi means All subject GIs
 sacc means Subject accession
 saccver means Subject accession.version
 sallacc means All subject accessions
 slen means Subject sequence length
 qstart means Start of alignment in query
 qend means End of alignment in query
 sstart means Start of alignment in subject
 send means End of alignment in subject
 qseq means Aligned part of query sequence
 sseq means Aligned part of subject sequence
 evalue means Expect value
 bitscore means Bit score
 score means Raw score
 length means Alignment length
 pident means Percentage of identical matches
 nident means Number of identical matches
 mismatch means Number of mismatches
 positive means Number of positive-scoring matches
 gapopen means Number of gap openings
 gaps means Total number of gaps
 ppos means Percentage of positive-scoring matches
 frames means Query and subject frames separated by a '/'
 qframe means Query frame
 sframe means Subject frame
 btop means Blast traceback operations (BTOP)
 staxids means unique Subject Taxonomy ID(s), separated by a ';'
 (in numerical order)
 sscinames means unique Subject Scientific Name(s), separated by a ';'
 scomnames means unique Subject Common Name(s), separated by a ';'
 sblastnames means unique Subject Blast Name(s), separated by a ';'
 (in alphabetical order)
 sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
 (in alphabetical order)
 stitle means Subject Title
 salltitles means All Subject Title(s), separated by a '<>'
 sstrand means Subject Strand
 qcovs means Query Coverage Per Subject
 qcovhsp means Query Coverage Per HSP
 When not provided, the default value is:
 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
 evalue bitscore', which is equivalent to the keyword 'std'

 

 

And sample output (the columns are tab-separated in the actual files):

 

M02215:33:000000000-AFA9E:1:1103:16706:22078 gi|56550013|gb|AY803660.1| 85.62 160 17 6 145 299 30 188 7e-37 163 Eukaryota Hydrothelphusa madagascariensis 18S ribosomal RNA gene, partial sequence 168669 Hydrothelphusa madagascariensis Hydrothelphusa madagascariensis crustaceans
M02215:33:000000000-AFA9E:1:1103:16318:23659 gi|393186538|gb|JX083886.1| 82.50 280 37 11 25 297 1 275 2e-58 235 Eukaryota Microeuraphia sp. 1 MPL-2012 isolate KACb37 16S ribosomal RNA gene, partial sequence; mitochondrial 1204330 Microeuraphia sp. 1 MPL-2012 Microeuraphia sp. 1 MPL-2012 crustaceans
M02215:33:000000000-AFA9E:1:1103:23857:24453 gi|386786433|gb|JQ435298.1| 96.23 53 2 0 248 300 1545 1597 5e-14 87.9 Eukaryota Cf. Traiania sp. DNA106167 voucher MCZ DNA106167 28S ribosomal RNA gene, partial sequence 1183147 cf. Traiania sp. DNA106167 cf. Traiania sp. DNA106167 daddy longlegs
M02215:33:000000000-AFA9E:1:1103:12121:24652 gi|302140702|gb|GQ343306.1| 98.31 295 5 0 6 300 22 316 1e-143 518 Eukaryota Evadne nordmanni isolate E37/5 16S ribosomal RNA gene, partial sequence; mitochondrial 141403 Evadne nordmanni Evadne nordmanni crustaceans
M02215:33:000000000-AFA9E:1:1104:23888:3062 gi|94960351|gb|DQ467789.1| 98.25 57 1 0 1 57 60 116 6e-18 100 Eukaryota Lynceus macleyanus isolate 53 16S ribosomal RNA gene, partial sequence; mitochondrial 381959 Lynceus macleyanus Lynceus macleyanus crustaceans
M02215:33:000000000-AFA9E:1:1104:11382:3077 gi|343168999|gb|JN018352.1| 84.52 310 36 9 1 300 2729 3036 7e-77 296 Eukaryota Ischyropsalis pyrenaea voucher MNHN-JAA33 28S ribosomal RNA gene, partial sequence 1046795 Ischyropsalis pyrenaea Ischyropsalis pyrenaea daddy longlegs
M02215:33:000000000-AFA9E:1:1104:16849:3145 gi|472441035|gb|KC529449.1| 86.77 310 29 12 1 301 558 864 1e-88 335 Eukaryota Microdalyellia nanella isolate W43ss 18S ribosomal RNA gene, partial sequence 1311903 Microdalyellia nanella Microdalyellia nanella flatworms
M02215:33:000000000-AFA9E:1:1104:15899:3944 gi|349592295|gb|JN205453.1| 92.12 292 21 1 12 301 1 292 2e-111 411 N/A Uncultured organism clone KCON28S38 28S ribosomal RNA gene, partial sequence 155900 uncultured organism uncultured organism N/A
M02215:33:000000000-AFA9E:1:1104:14928:4021 gi|46812207|gb|AY569664.1| 87.10 310 23 15 1 301 574 875 1e-88 335 Eukaryota Arenicolides ecaudata 18S ribosomal RNA gene, partial sequence 273060 Arenicolides ecaudata Arenicolides ecaudata segmented worms
M02215:33:000000000-AFA9E:1:1104:19218:4051 gi|301178327|gb|HM799910.1| 96.47 85 2 1 1 85 550 633 1e-29 139 Eukaryota Uncultured marine metazoan clone PRTBE7499 small subunit ribosomal RNA gene, partial sequence 329654 uncultured marine metazoan uncultured marine metazoan animals
M02215:33:000000000-AFA9E:1:1104:19414:4162 gi|295805440|emb|FN389538.1| 95.24 84 3 1 1 84 21 103 2e-27 132 Eukaryota Uncultured Glomus 18S ribosomal RNA gene, isolate, partial sequence. 07ED8BM 231055 uncultured Glomus uncultured Glomus glomeromycetes
M02215:33:000000000-AFA9E:1:1104:18162:4373 gi|342210101|gb|JF277589.1| 90.52 306 21 7 1 299 67 371 2e-107 398 Eukaryota Nemertean sp. 2 SA-2011 voucher MCZ DNA106139 16S ribosomal RNA gene, partial sequence; mitochondrial 947588 Nemertean sp. 2 SA-2011 Nemertean sp. 2 SA-2011 ribbon worms
M02215:33:000000000-AFA9E:1:1104:25946:4764 gi|110535863|gb|DQ665998.1| 84.92 305 37 9 1 301 552 851 5e-78 300 Eukaryota Phagocata vitta 18S ribosomal RNA gene, complete sequence 391283 Phagocata vitta Phagocata vitta flatworms
M02215:33:000000000-AFA9E:1:1104:26451:5022 gi|307647639|gb|HM564573.1| 93.71 302 18 1 1 301 470 771 1e-123 451 Eukaryota Phanodermatidae sp. JCC52 18S ribosomal RNA gene, partial sequence 883396 Phanodermatidae sp. JCC52 Phanodermatidae sp. JCC52 nematodes
M02215:33:000000000-AFA9E:1:1104:14581:5493 gi|472441081|gb|KC529495.1| 86.36 308 33 9 1 301 559 864 2e-86 327 Eukaryota Dochmiotrema limicola isolate UH80.2 18S ribosomal RNA gene, partial sequence 1311982 Dochmiotrema limicola Dochmiotrema limicola flatworms
M02215:33:000000000-AFA9E:1:1104:9527:5655 gi|170672378|gb|EU376009.1| 87.06 309 32 6 1 301 2834 3142 9e-91 342 Eukaryota Craterostigmus tasmanianus 28S ribosomal RNA gene, partial sequence 60162 Craterostigmus tasmanianus Craterostigmus tasmanianus centipedes
M02215:33:000000000-AFA9E:1:1104:15515:5664 gi|373860124|gb|HQ865054.1| 86.75 302 34 6 2 301 7 304 2e-87 331 Eukaryota Uncultured eukaryote clone SGPX651 18S ribosomal RNA gene, partial sequence 100272 uncultured eukaryote uncultured eukaryote eukaryotes
M02215:33:000000000-AFA9E:1:1104:25827:5676 gi|154101573|gb|EF552055.1| 99.34 301 2 0 1 301 68 368 6e-152 545 Eukaryota Balanus glandula isolate 2 16S ribosomal RNA gene, partial sequence; mitochondrial 110520 Balanus glandula Balanus glandula crustaceans
M02215:33:000000000-AFA9E:1:1104:22156:5763 gi|157787506|gb|EF990727.1| 89.11 303 27 4 1 300 2446 2745 1e-99 372 Eukaryota Rhabditoides inermiformis strain SB328 28S large subunit ribosomal RNA gene, partial sequence 96653 Rhabditoides inermiformis Rhabditoides inermiformis nematodes
M02215:33:000000000-AFA9E:1:1104:14633:5939 gi|213959396|gb|FJ426630.1| 78.43 306 53 12 1 301 47 344 4e-44 187 Eukaryota Brachionus angularis voucher S. H. Cheng 001 16S ribosomal RNA gene, partial sequence; mitochondrial 396692 Brachionus angularis Brachionus angularis rotifers
M02215:33:000000000-AFA9E:1:1104:10324:6174 gi|386696111|gb|JQ000284.1| 79.22 308 52 11 2 301 537 840 4e-49 204 Eukaryota Xolalgidae gen. sp. AD1204 18S ribosomal RNA gene, partial sequence 1111400 Xolalgidae gen. sp. AD1204 Xolalgidae gen. sp. AD1204 mites & ticks
M02215:33:000000000-AFA9E:1:1104:13062:6188 gi|74100231|gb|DQ186202.1| 87.97 291 25 7 1 284 13348 13635 1e-88 335 Eukaryota Thalassiosira pseudonana mitochondrion, complete genome 35128 Thalassiosira pseudonana Thalassiosira pseudonana diatoms
M02215:33:000000000-AFA9E:1:1104:13172:6214 gi|109390598|emb|AM039747.1| 82.95 305 39 13 1 298 2492 2790 7e-67 263 Eukaryota Heligmosomoides polygyrus 28S rRNA gene 6339 Heligmosomoides polygyrus Heligmosomoides polygyrus nematodes
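
With the format string used here, each row has 18 tab-separated columns: the 12 `std` fields, then sskingdoms, stitle, staxids, sscinames, scomnames, and sblastnames. A rough way to summarize what turned up is to tally the last column. The sketch below assumes tab-delimited output; `tally_blastnames` is a hypothetical helper name.

```shell
# Sketch: count hits per BLAST name (column 18 in this output format).
# tally_blastnames is a hypothetical helper; input must be tab-delimited.
tally_blastnames() {
  awk -F'\t' 'NF >= 18 { n[$18]++ }
              END { for (k in n) printf "%d\t%s\n", n[k], k }' "$@" |
    sort -t "$(printf '\t')" -k1,1nr -k2,2   # most frequent first, ties by name
}
```

For example, `tally_blastnames blastout_query.part*_nt` would print one line per group (crustaceans, nematodes, flatworms, and so on) with its hit count.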

Once all the parts are done, I will concatenate (cat) the 20 output files into one.
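
The merge step could look something like the sketch below. The glob follows the `-out blastout_"$f"_nt` naming from the loop; the function name `merge_blast_parts` and the default merged filename are assumptions chosen for illustration.

```shell
# Sketch: merge the per-part BLAST outputs into a single table.
# merge_blast_parts and the default filename are illustrative assumptions.
merge_blast_parts() {
  local merged="${1:-blastout_query_nt.tab}"
  # The merged name must not match the glob, or cat would read its own output.
  cat blastout_query.part*_nt > "$merged"
  # Report how many hit lines ended up in the merged file.
  wc -l < "$merged"
}
```

Because `-max_target_seqs 1` and `-max_hsps 1` were used, the printed line count is roughly the number of query sequences that had a hit.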
