Tag Archives: wget

Data Received – Jay’s Coral RADseq and Hollie’s Geoduck Epi-RADseq

Jay received notice from UC Berkeley that the sequencing data from his coral RADseq was ready. In addition, the sequencing contains some epiRADseq data from samples provided by Hollie Putnam. See his notebook for multiple links that describe library preparation (indexing and barcodes), sample pooling, and species breakdown.

For quickest reference, here’s Jay’s spreadsheet with virtually all the sample/index/barcode/pooling info (Google Sheet): ddRAD/EpiRAD_Jan_16

I downloaded both the demultiplexed and non-demultiplexed data and verified data integrity by generating and comparing MD5 checksums. I then copied the files to the owl/nightingales folder for each of the three species that were sequenced (Panopea generosa, Anthopleura elegantissima, Porites astreoides), generated and compared MD5 checksums for the copied files, and created/updated the readme file in each folder.
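
As a rough illustration of the checksum comparison described above (the notebook linked below has the commands actually used), here is a minimal sketch. It assumes GNU md5sum is available; the function name verify_md5 and the temporary file are hypothetical and are not taken from the notebook.

```shell
# Compare a file's MD5 checksum against an expected value.
# Assumption: GNU coreutils md5sum is installed (macOS ships "md5" instead).
verify_md5() {
    file=$1
    expected=$2
    # md5sum prints "<checksum>  <filename>"; keep only the checksum.
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got $actual)" >&2
        return 1
    fi
}

# Demonstration on a throwaway file (hypothetical data).
demo_file=$(mktemp)
printf 'hello\n' > "$demo_file"
verify_md5 "$demo_file" "$(md5sum "$demo_file" | awk '{print $1}')"
rm -f "$demo_file"
```

In practice the expected value would come from the checksum file supplied by the sequencing facility, and the same check would be repeated after copying the files to their final directories.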

 

Data management is detailed in the Jupyter notebook below. The notebook is embedded in this post, but it may be easier to view on GitHub (linked below).

Readme files were updated outside of the notebook.

Jupyter notebook (GitHub): 20170227_docker_jay_ngs_data_retrieval.ipynb

Data Received – Bisulfite-treated Illumina Sequencing from Genewiz

Received notice from Genewiz that the sequencing data was ready for the samples submitted 20151222.

Download the FASTQ files from the Genewiz project directory:

wget -r -np -nc -A "*.gz" ftp://username:password@ftp2.genewiz.com/Project_BS1512183
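
For readers unfamiliar with the short flags, the same command can be written with wget's long option names. This is only a sketch: the fetch_fastq function, the DRY_RUN switch, and the example.invalid URL are hypothetical and stand in for the real FTP address and credentials.

```shell
# Long-option equivalent of: wget -r -np -nc -A "*.gz" <url>
#   --recursive   (-r)   follow links and download the directory tree
#   --no-parent   (-np)  never ascend above the starting directory
#   --no-clobber  (-nc)  skip files that already exist locally
#   --accept      (-A)   keep only files whose names match the pattern
fetch_fastq() {
    # $1 = URL of the project directory (hypothetical placeholder below)
    set -- wget --recursive --no-parent --no-clobber --accept '*.gz' "$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"      # print the command instead of running it
    else
        "$@"
    fi
}

# Dry run: show the command without contacting any server.
DRY_RUN=1 fetch_fastq "ftp://example.invalid/Project_X"
```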

Since two species were sequenced (C.gigas & O.lurida), the corresponding files are in the following locations:

http://owl.fish.washington.edu/nightingales/O_lurida/

http://owl.fish.washington.edu/nightingales/C_gigas/

 

In order to process the files, I needed to identify just the FASTQ files from this project and save the list of files to a bash variable called “bsseq”:

bsseq=$(ls | grep '^[0-9]\{1\}_*' | grep -v "2bRAD")

Explanation:

bsseq=
  • This creates a variable called “bsseq” and sets it to the output of the command following the equals sign.
$(ls | grep '^[0-9]\{1\}_*' | grep -v "2bRAD")
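  • This lists (ls) all files and pipes (|) the list to grep, which keeps the names that begin with (^) a single digit ([0-9]\{1\}) followed by zero or more underscores (_*). Because the underscore is optional in this pattern, it matches any filename that starts with a digit, including the two-digit prefixes (10_, 11_, 12_). Those results are piped (|) to a second grep command, which excludes (-v) any results containing the text “2bRAD”.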
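
If the intent is to match only names that start with one or two digits followed by an underscore, a stricter pattern is possible. The sketch below is an assumption about that intent, not the command that was run; select_bsseq and the non-project filenames fed to it are hypothetical.

```shell
# Keep names that start with one or two digits and then an underscore,
# and drop anything that mentions 2bRAD.
select_bsseq() {
    grep -E '^[0-9]{1,2}_' | grep -v '2bRAD'
}

# Demonstration on a made-up directory listing.
printf '%s\n' \
    1_ATCACG_L001_R1_001.fastq.gz \
    10_TAGCTT_L001_R1_001.fastq.gz \
    2bRAD_example.fastq.gz \
    3_2bRAD_example.fastq.gz \
    readme.md | select_bsseq
```

Only the first two names pass both filters: the third has no underscore directly after its leading digit, the fourth contains “2bRAD”, and the last does not start with a digit.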

 

FILENAME                        SAMPLE NAME  SPECIES
1_ATCACG_L001_R1_001.fastq.gz   1NF11        O.lurida
2_CGATGT_L001_R1_001.fastq.gz   1NF15        O.lurida
3_TTAGGC_L001_R1_001.fastq.gz   1NF16        O.lurida
4_TGACCA_L001_R1_001.fastq.gz   1NF17        O.lurida
5_ACAGTG_L001_R1_001.fastq.gz   2NF5         O.lurida
6_GCCAAT_L001_R1_001.fastq.gz   2NF6         O.lurida
7_CAGATC_L001_R1_001.fastq.gz   2NF7         O.lurida
8_ACTTGA_L001_R1_001.fastq.gz   2NF8         O.lurida
9_GATCAG_L001_R1_001.fastq.gz   M2           C.gigas
10_TAGCTT_L001_R1_001.fastq.gz  M3           C.gigas
11_GGCTAC_L001_R1_001.fastq.gz  NF2_6        O.lurida
12_CTTGTA_L001_R1_001.fastq.gz  NF_18        O.lurida

 

I wanted to add some information about the project to the readme file, like total number of sequencing reads generated and the number of reads in each FASTQ file.

Here’s how to count the total number of reads generated in this project:

totalreads=0; for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads

Total reads = 138,530,448

C.gigas reads: 22,249,631

O.lurida reads: 116,280,817

Code explanation:

totalreads=0;
  • Creates variable called “totalreads” and initializes value to 0.
for i in $bsseq;
  • Initiates a for loop to process the list of files stored in $bsseq variable. The FASTQ files have been compressed with gzip and end with the .gz extension.
do linecount=
  • Creates variable called “linecount” that stores the results of the following command:
`gunzip -c "$i" | wc -l`;
  • Decompresses the current file ($i) to stdout (-c), leaving the compressed file on disk unchanged. The output is piped to the word count command with the line flag (wc -l) to count the number of lines in the file.
readcount=$((linecount/4));
  • Divides the value stored in linecount by 4. This is because an entry for a single Illumina read comprises four lines. This value is stored in the “readcount” variable.
totalreads=$((readcount+totalreads));
  • Adds the readcount for the current file to the running total stored in totalreads.
done;
  • Ends the for loop.
echo $totalreads
  • Prints the value of totalreads to the screen.
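
The per-file part of the loop above can also be wrapped in a small function, which makes it easy to try on a tiny test file before running it on the real data. This is a minimal sketch: count_reads and the two-read demo file are hypothetical, and it assumes every FASTQ record occupies exactly four lines.

```shell
# Count the reads in one gzip-compressed FASTQ file (4 lines per read).
count_reads() {
    lines=$(gunzip -c "$1" | wc -l)
    echo $((lines / 4))
}

# Demonstration on a tiny, made-up FASTQ file containing two reads.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip -c > "$demo_dir/demo.fastq.gz"
count_reads "$demo_dir/demo.fastq.gz"   # prints 2
rm -rf "$demo_dir"
```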

Next, I wanted to generate a list of the FASTQ files and corresponding read counts, and append this information to the readme file.

for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4)); printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md; done

Code explanation:

for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4));
  • Same for loop as above that calculates the number of reads in each FASTQ file.
printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md;
  • This formats the printed output. The “%s\t%s\n\n” portion prints the value in $i as a string (%s), followed by a tab (\t), followed by the value in $readcount as a string (%s), followed by two consecutive newlines (\n\n) to provide an empty line between the entries. See the readme file linked above to see how the output looks.
>> readme.md; done
  • This appends the result from each loop to the readme.md file and ends the for loop (done).
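
To see what that printf format produces without touching the real readme, it can be tried on a scratch file first. A minimal sketch follows; format_entry, the scratch file, and the read counts 100 and 200 are hypothetical placeholders, not values from this project.

```shell
# Print one readme entry: filename, tab, read count, then a blank line.
format_entry() {
    printf '%s\t%s\n\n' "$1" "$2"
}

# Append two made-up entries to a scratch file and show the result.
demo_readme=$(mktemp)
format_entry "1_ATCACG_L001_R1_001.fastq.gz" 100 >> "$demo_readme"
format_entry "2_CGATGT_L001_R1_001.fastq.gz" 200 >> "$demo_readme"
cat "$demo_readme"
rm -f "$demo_readme"
```

Because >> appends rather than overwrites, rerunning the loop adds the entries a second time, so it is worth checking the scratch output before appending to readme.md itself.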

 

Automatic Notebook Backups – wget Script & Synology Task Scheduler

UPDATE 20150714 – READ ENTIRE POST

I’ve been tweaking a shell script (notebook_backups.sh) to use the shell program wget to retrieve fully functional HTML versions of our online notebooks for offline viewing. I had been planning on setting up a cron job to automatically run this script on our Synology server (Eagle) at a set day/time. However, I came across the Task Scheduler that’s built right into the Synology GUI! So, I set up the Task Scheduler to run the notebook_backups.sh script every Sunday. See screenshots below.
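
The contents of notebook_backups.sh are not shown in this post. As a hedged sketch only, a wget call for saving a browsable offline copy of a site often looks something like the following. The names mirror_notebook and DRY_RUN and the example.invalid URL are hypothetical, and the option set is an assumption rather than what the script actually uses.

```shell
# Sketch of a wget call for an offline, browsable copy of a site.
#   --mirror            recursive download with timestamping
#   --convert-links     rewrite links so pages work when opened locally
#   --page-requisites   also fetch images, CSS, and other page assets
#   --adjust-extension  save HTML pages with an .html suffix
#   --no-parent         stay within the starting directory
mirror_notebook() {
    # $1 = notebook URL (hypothetical placeholder below)
    set -- wget --mirror --convert-links --page-requisites --adjust-extension --no-parent "$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"      # print the command instead of running it
    else
        "$@"
    fi
}

# Dry run: show the command without downloading anything.
DRY_RUN=1 mirror_notebook "http://example.invalid/notebook/"
```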

 

 

 

UPDATE 20150714

The Task Scheduler was not running the script. Additionally, the Task Scheduler would not run the script even when I manually instructed it to run. Some internet searching revealed that the Task Scheduler requires you to indicate what type of task is being run (e.g. bash, shell, ash, php, etc.), even if your script contains the proper “shebang” or header that normally instructs the computer which program to use to run the script. See the image below for how the Task Scheduler is currently set up. The arrow indicates the addition of “sh” to the beginning of the Task Scheduler’s path to the script. This tells the Task Scheduler to use the shell to run the script.
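
The effect of prefixing the path with “sh” can be reproduced in a terminal: when the interpreter is named explicitly, the script's shebang line and execute permission are not relied upon. The sketch below only illustrates that behaviour and makes no claim about how the Task Scheduler works internally; run_with_sh and the demo script are hypothetical.

```shell
# Run a script by naming the interpreter explicitly.
run_with_sh() {
    sh "$1"
}

# Demonstration: a script that is deliberately NOT executable.
demo_dir=$(mktemp -d)
demo_script="$demo_dir/backup_demo.sh"
printf '#!/bin/sh\necho "backup script ran"\n' > "$demo_script"
chmod 644 "$demo_script"        # no execute bit set
run_with_sh "$demo_script"      # still runs, because sh reads the file directly
rm -rf "$demo_dir"
```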

 

RNA-Seq – Sea Star Data Download

Received RNA-seq data from Cornell. They provided a convenient download script for retrieving all the data files at once (a bash script containing a series of wget commands, one per file URL). This is faster and easier than running a separate wget command for each file, and faster and easier than using the Synology “Download Station” app when so many URLs are involved.

Here’s the script (download.sh) that was provided. The URLs are shown in double quotes so that the shell does not treat the & characters as command separators:

#!/bin/bash
# Flags: -q quiet, -c resume a partial download, -O write to the named file
wget -q -c -O 3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R2.fastq.gz "http://cbsuapps.tc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=1160641846&refid=17091"
wget -q -c -O 3291_5903_10007_H94MGADXX_V_CF71_ATCACG_R1.fastq.gz "http://cbsuapps.tc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=505010539&refid=17092"
wget -q -c -O 3291_5903_10008_H94MGADXX_V_CF34_CGATGT_R1.fastq.gz "http://cbsuapps.tc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=636513375&refid=17093"
wget -q -c -O 3291_5903_10008_H94MGADXX_V_CF34_CGATGT_R2.fastq.gz "http://cbsuapps.tc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=1472734408&refid=17094"
wget -q -c -O 3291_5903_10009_H94MGADXX_V_CF26_TTAGGC_R2.fastq.gz "http://cbsuapps.tc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=948605937&refid=17095"
wget -q -c -O 3291_5903_10009_H94MGADXX_V_CF26_TTAGGC_R1.fastq.gz "http://cbsuapps.tc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=1810346594&refid=17096"
wget -q -c -O 3291_5903_10010_H94MGADXX_HK_CF2_TGACCA_R2.fastq.gz "http://cbsuapps.tc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=424477466&refid=17097"
wget -q -c -O 3291_5903_10010_H94MGADXX_HK_CF2_TGACCA_R1.fastq.gz "http://cbsuapps.tc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=630586816&refid=17098"
wget -q -c -O 3291_5903_10011_H94MGADXX_HK_CF35_ACAGTG_R1.fastq.gz "http://cbsuapps.tc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=1392201335&refid=17099"
wget -q -c -O 3291_5903_10011_H94MGADXX_HK_CF35_ACAGTG_R2.fastq.gz "http://cbsuapps.tc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=1598310685&refid=17100"
wget -q -c -O 3291_5903_10012_H94MGADXX_HK_CF70_GCCAAT_R1.fastq.gz "http://cbsuapps.tc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=868072864&refid=17101"
wget -q -c -O 3291_5903_10012_H94MGADXX_HK_CF70_GCCAAT_R2.fastq.gz "http://cbsuapps.tc.cornell.edu/Sequencing/showseqfile.aspx?mode=http&cntrl=1074182214&refid=17102"

This is a bash script. However, the most direct way to download these files on our Synology server is to run it as an ash script, so I changed the first line of the script from “#!/bin/bash” to “#!/bin/ash”. I then placed the script in the target directory for our files, SSH’d into our Synology (Eagle), changed to that directory (Eagle/web/whale/SeaStarRNASeq), and ran the script (./download.sh).
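
The first-line change can be made in a text editor, or with sed as in the minimal sketch below. The to_ash function and the demo file are hypothetical. The sketch only rewrites the shebang, which is enough here because the script contains nothing but wget commands; a script that relies on bash-only features such as arrays would need further changes to run under ash.

```shell
# Print a script with a bash shebang on line 1 rewritten to ash.
# Other lines are passed through unchanged.
to_ash() {
    sed '1s|^#!/bin/bash|#!/bin/ash|' "$1"
}

# Demonstration on a made-up two-line script.
demo_dir=$(mktemp -d)
printf '#!/bin/bash\necho "downloading"\n' > "$demo_dir/download_bash.sh"
to_ash "$demo_dir/download_bash.sh" > "$demo_dir/download.sh"
head -n 1 "$demo_dir/download.sh"   # prints #!/bin/ash
rm -rf "$demo_dir"
```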