###Display system info

In [6]:
!system_profiler SPSoftwareDataType

Software:

    System Software Overview:

      System Version: OS X 10.9.5 (13F34)
      Kernel Version: Darwin 13.4.0
      Boot Volume: Hummingbird
      Boot Mode: Normal
      Computer Name: hummingbird
      User Name: Sam (Sam)
      Secure Virtual Memory: Enabled
      Time since boot: 121 days 1:26



In [7]:
cd /Volumes/Data/Sam/scratch/

/Volumes/Data/Sam/scratch


###Quality trim all fastq.gz files using [Trimmomatic (v0.30)](http://www.usadellab.org/cms/?page=trimmomatic)

####Code explanation of for loop below:
1. ```%%bash``` tells Jupyter to run this cell in bash.
2. ```for file in /Volumes/nightingales/C_gigas/2212_lane2_[^N]*``` starts a for loop over every file whose name begins with ```2212_lane2_``` and whose next character is <em>not</em> the letter "N".
3. ```do``` marks the start of the commands that run for each file.
4. ```newname=${file##*/}``` takes the value of the ```$file``` variable (the full path of the current file matched by ```/Volumes/nightingales/C_gigas/2212_lane2_[^N]*```) and removes the longest match of the pattern ```*/``` from the beginning of that value (```##``` is the bash parameter expansion operator that specifies this kind of trimming). The result, which is just the file name without the full path, is then stored in the ```newname``` variable.
5. The ```java -jar``` line runs Trimmomatic with the following arguments; the trimming steps are applied in the order listed:
    1. single end reads (```SE```),
    2. number of threads (```-threads 16```),
    3. type of quality score (```-phred33```),
    4. input file location (```"$file"```),
    5. output file name/location (```/Volumes/Data/Sam/scratch/20150521_trimmed_$newname```),
    6. single end Illumina TruSeq adaptor trimming (```ILLUMINACLIP:/usr/local/bioinformatics/Trimmomatic-0.30/adapters/TruSeq3-SE.fa:2:30:10```); uses the fasta file of adaptor sequences that came with the program,
    7. crop each read to a set length by trimming from the end of the read (```CROP:90```); removes the last 10 bases,
    8. remove a set number of bases from the beginning of each read (```HEADCROP:39```),
    9. remove bases from the beginning of the read while they are below the quality threshold (```LEADING:3```),
    10. remove bases from the end of the read while they are below the quality threshold (```TRAILING:3```),
    11. cut the read once the average quality within a 4 base window falls below 15 (```SLIDINGWINDOW:4:15```).
6. ```done``` closes for loop.
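
The file selection and renaming steps above can be tried without touching the real data. This is a minimal sketch, assuming a throwaway directory and made-up file names (the `NoIndex` and `GCCAAT` names are hypothetical; only the `CTTGTA` name appears in the log below). It uses `[!N]`, the POSIX spelling of the `[^N]` pattern, so it also runs in shells other than bash.

```shell
# Sketch: which files the glob selects, and what ${file##*/} leaves behind.
# demo_dir and the NoIndex/GCCAAT file names are invented for illustration.
demo_dir=$(mktemp -d)
touch "$demo_dir/2212_lane2_CTTGTA_L002_R1_001.fastq.gz" \
      "$demo_dir/2212_lane2_GCCAAT_L002_R1_001.fastq.gz" \
      "$demo_dir/2212_lane2_NoIndex_L002_R1_001.fastq.gz"

matched=""
# [!N] is equivalent to [^N] in bash: any single character except "N".
for file in "$demo_dir"/2212_lane2_[!N]*
do
  # ## removes the longest prefix matching */ (everything up to the last slash).
  newname=${file##*/}
  echo "$file -> 20150521_trimmed_$newname"
  matched="$matched $newname"
done
rm -r "$demo_dir"
```

Only the `CTTGTA` and `GCCAAT` files are listed; the hypothetical `NoIndex` file is skipped because the character after `2212_lane2_` is "N".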

In [33]:
%%bash
for file in /Volumes/nightingales/C_gigas/2212_lane2_[^N]*
do
newname=${file##*/} 
java -jar /usr/local/bioinformatics/Trimmomatic-0.30/trimmomatic-0.30.jar \
SE \
-threads 16 \
-phred33 "$file" \
/Volumes/Data/Sam/scratch/20150521_trimmed_$newname \
ILLUMINACLIP:/usr/local/bioinformatics/Trimmomatic-0.30/adapters/TruSeq3-SE.fa:2:30:10 \
CROP:90 \
HEADCROP:39 \
LEADING:3 \
TRAILING:3 \
SLIDINGWINDOW:4:15;
done

TrimmomaticSE: Started with arguments: -threads 16 -phred33 /Volumes/nightingales/C_gigas/2212_lane2_CTTGTA_L002_R1_001.fastq.gz /Volumes/Data/Sam/scratch/20150521_trimmed_2212_lane2_CTTGTA_L002_R1_001.fastq.gz ILLUMINACLIP:/usr/local/bioinformatics/Trimmomatic-0.30/adapters/TruSeq3-SE.fa:2:30:10 CROP:90 HEADCROP:39 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
ILLUMINACLIP: Using 0 prefix pairs, 2 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Reads: 16000000 Surviving: 15796545 (98.73%) Dropped: 203455 (1.27%)
TrimmomaticSE: Completed successfully
TrimmomaticSE: Started with arguments: -threads 16 -phred33 /Volumes/nightingales/C_gigas/2212_lane2_CTTGTA_L002_R1_002.fastq.gz /Volumes/Data/Sam/scratch/20150521_trimmed_2212_lane2_CTTGTA_L002_R1_002.fastq.gz ILLUMINACLIP:/usr/local/bioinformatics/Trimmomatic-0.30/adapters/Tru
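
The read counts in a Trimmomatic summary line can be sanity-checked with a little arithmetic. This is a small sketch, assuming the summary line is copied by hand from the log above into a variable (the variable names are placeholders); it confirms that surviving plus dropped reads equals the input and recomputes both percentages.

```shell
# Summary line copied from the first file's Trimmomatic log above.
summary="Input Reads: 16000000 Surviving: 15796545 (98.73%) Dropped: 203455 (1.27%)"

# Whitespace-separated fields: $3 = input reads, $5 = surviving, $8 = dropped.
check=$(echo "$summary" | awk '{
  printf "%d %.2f %.2f", $5 + $8, 100 * $5 / $3, 100 * $8 / $3
}')

# Prints: total of surviving + dropped, percent surviving, percent dropped.
echo "$check"
```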

###Concatenate two groups of sequences into single file

####400ppm (control) sequences - Index GCCAAT

In [34]:
%%bash
#gunzips all matching files in folder and appends the data to a single file:
#20150521_trimmed_2212_lane2_400ppm_GCCAAT.fastq
for file in 20150521_trimmed_2212_lane2_G*
do
gunzip -c "$file"  >> 20150521_trimmed_2212_lane2_400ppm_GCCAAT.fastq
done

In [35]:
%%bash
#Gzip file
gzip 20150521_trimmed_2212_lane2_400ppm_GCCAAT.fastq

####1000ppm (acidification) sequences - Index CTTGTA

In [36]:
%%bash
#gunzips all matching files in folder and appends the data to a single file:
#20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq
for file in 20150521_trimmed_2212_lane2_C*
do
gunzip -c "$file" >> 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq
done

In [37]:
%%bash
#Gzip file
gzip 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq
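
One thing to keep in mind with the loops above: because they append with ```>>```, running a cell a second time adds a second copy of every read to the combined file. The following is a sketch of two alternatives, using tiny placeholder files rather than the real data (the `part_*` and `combined_*` names are made up). The first redirects the whole loop once with ```>```, so a re-run overwrites the output; the second relies on the fact that gzip files can be joined directly with ```cat```.

```shell
# Sketch with throwaway files; all names below are placeholders.
work=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$work/part_001.fastq.gz"
printf '@r2\nTTGA\n+\nIIII\n' | gzip > "$work/part_002.fastq.gz"

# Option 1: decompress in a loop and redirect the whole loop once with >,
# so running it again overwrites the output instead of appending to it.
for file in "$work"/part_*.fastq.gz
do
  gunzip -c "$file"
done | gzip > "$work/combined_loop.fastq.gz"

# Option 2: join the gzip files directly; gunzip reads the result as one stream.
cat "$work"/part_*.fastq.gz > "$work/combined_cat.fastq.gz"

# Both combined files should hold the same number of lines (4 per read).
loop_lines=$(gunzip -c "$work/combined_loop.fastq.gz" | wc -l | tr -d ' ')
cat_lines=$(gunzip -c "$work/combined_cat.fastq.gz" | wc -l | tr -d ' ')
echo "loop: $loop_lines lines, cat: $cat_lines lines"
rm -r "$work"
```

Option 1 also skips the separate gzip step, since the combined reads are compressed as they are written.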

###FASTQC on concatenated files using [FASTQC (v0.11.2)](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

In [38]:
%%bash
for file in /Volumes/Data/Sam/scratch/20150521_*[e2]_[14]*.gz; do fastqc "$file" --outdir=/Volumes/Eagle/Arabidopsis/; done

Analysis complete for 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Analysis complete for 20150521_trimmed_2212_lane2_400ppm_GCCAAT.fastq.gz


Started analysis of 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Approx 5% complete for 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Approx 10% complete for 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Approx 15% complete for 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Approx 20% complete for 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Approx 25% complete for 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Approx 30% complete for 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Approx 35% complete for 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Approx 40% complete for 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Approx 45% complete for 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Approx 50% complete for 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Approx 55% complete for 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Approx 60% complete for 20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz
Approx 65% comple

###Copy files to Eagle for web-based access

In [39]:
%%bash
for file in 2015*e2_[14]*; do cp "$file" /Volumes/Eagle/Arabidopsis/; done
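
The glob ```2015*e2_[14]*``` is meant to pick up only the two concatenated files: it needs ```e2_``` followed directly by a "1" or a "4", which matches ```lane2_1000ppm``` and ```lane2_400ppm``` but not the per-lane files, where ```lane2_``` is followed by the index sequence. A dry run like the sketch below can confirm that before copying. It assumes a temporary directory of empty stand-in files; the per-lane `GCCAAT` name is a guess modelled on the `CTTGTA` name in the Trimmomatic log.

```shell
# Dry run: list what the glob would copy, using empty stand-in files.
demo=$(mktemp -d)
touch "$demo/20150521_trimmed_2212_lane2_1000ppm_CTTGTA.fastq.gz" \
      "$demo/20150521_trimmed_2212_lane2_400ppm_GCCAAT.fastq.gz" \
      "$demo/20150521_trimmed_2212_lane2_CTTGTA_L002_R1_001.fastq.gz" \
      "$demo/20150521_trimmed_2212_lane2_GCCAAT_L002_R1_001.fastq.gz"

selected=""
for file in "$demo"/2015*e2_[14]*
do
  name=${file##*/}
  # echo instead of cp: nothing is copied in this sketch.
  echo "would copy: $name"
  selected="$selected $name"
done
rm -r "$demo"
```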