Let's say there was this person, we will call her Emma for now, who needed to download lots of data and wanted to make the process more robust and reliable. Here is a way to use the NCBI ESearch and EFetch tools to do so. Complete documentation is at http://www.ncbi.nlm.nih.gov/books/NBK25498/. The specific example used is below.
Example one: Download all Ilumatobacter protein sequences in FASTA format.
We will use ESearch to get GI numbers, post them to the History server, and then make multiple EFetch calls to retrieve the data.
Input: $query – ilumatobacter[orgn]
Output: A file named “ilumatobacter.fa” containing FASTA data.
Perl script
use LWP::Simple;
$query = 'ilumatobacter[orgn]';
#assemble the esearch URL
$base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "esearch.fcgi?db=protein&term=$query&usehistory=y";
#post the esearch URL
$output = get($url);
#parse WebEnv, QueryKey and Count (# records retrieved)
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
$count = $1 if ($output =~ /<Count>(\d+)<\/Count>/);
#open output file for writing
open(OUT, ">ilumatobacter.fa") || die "Can't open file!\n";
#retrieve data in batches of 500
$retmax = 500;
for ($retstart = 0; $retstart < $count; $retstart += $retmax) {
    $efetch_url = $base . "efetch.fcgi?db=protein&WebEnv=$web";
    $efetch_url .= "&query_key=$key&retstart=$retstart";
    $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
    $efetch_out = get($efetch_url);
    print OUT "$efetch_out";
}
close OUT;
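For reference, the three regular expressions above parse their values out of the ESearch XML response. With usehistory=y, the XML returned by the esearch.fcgi URL looks roughly like this (values abbreviated and purely illustrative):

<eSearchResult>
  <Count>1234</Count>
  <RetMax>20</RetMax>
  <RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv>NCID_1_...</WebEnv>
  ...
</eSearchResult>

The Count value drives the batch loop, while WebEnv and QueryKey point each EFetch call at the result set stored on the History server.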
So if you want to use this, simply paste the above code into a text file (TextWrangler is suggested) and save it as a .pl file (i.e., /Users/sr320/Desktop/ill-prot.pl). Then in Terminal, type perl /Users/sr320/Desktop/ill-prot.pl. The data will download to whatever directory you are currently in within Terminal.
In actuality, this still seems to fail randomly; this is common to see on the internets. The best guess is too many requests during a busy time of day, so it might take a couple of tries. See http://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Usage_Guidelines_and_Requiremen for usage recommendations.
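Since the failures appear transient, one workaround is to retry each EFetch call a few times with a pause between attempts. Below is a minimal sketch; the fetch_with_retry helper and the 3-attempt / 5-second values are illustrative, not part of the original script:

use LWP::Simple;

# Try a URL several times before giving up; returns undef on total failure.
sub fetch_with_retry {
    my ($url, $tries, $delay) = @_;
    for my $attempt (1 .. $tries) {
        my $result = get($url);
        return $result if defined $result;   # get() returns undef on failure
        warn "Attempt $attempt failed, retrying in $delay seconds...\n";
        sleep $delay;
    }
    return undef;
}

# In the batch loop above, replace the plain get() call with:
#   $efetch_out = fetch_with_retry($efetch_url, 3, 5);
#   die "EFetch failed after repeated retries\n" unless defined $efetch_out;

A short sleep between batches also helps keep the script within NCBI's request-rate guidelines.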