Data Wrangling - CpG OE Calculations on C.virginica Genes

Steven tasked me with processing ~90 FastA files containing gene sequences from C.virginica in this GitHub Issue. He needed to determine the Observed/Expected (O/E) ratio of CpGs in each FastA. He provided this example code and this link to all the files. Additionally, today, he tasked Kaitlyn with merging all of the output CpG O/E values for each sample in to a single file, but I decided to tackle it anyway.

The CpG O/E determination was done in a Jupyter Notebook:

20190225_swoose_CpG_OE.ipynb (GitHub)

Interestingly, the processing (which relied on awk) required the use of gawk, due to the high number of output fields. The default implementation of awk on the version of Ubuntu I was using was not gawk.

The creation of a single file with all of the CpG O/E info is detailed in this bash script:

20190226_join_ID_CpGs.sh

#!/bin/bash

## Script to append sample-specific headers to each ID_CpG
## file and join all ID_CpG files.

## Run file from within this directory.

# Temp file placeholder
tmp=$(mktemp)

# Create array of subdirectories.
array=(*/)

# Create column headers for ID_CpG files using sample name from directory name.
# Expects directory names like this:
# Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_VA_3_GENE_analysis
# Command will remove the "_GENE_analysis" portion.
for file in "${array[@]}"
do
  gene=$(echo "${file}" \
  | awk -F[.] '{print $6}' \
  | rev \
  | cut -d "_" -f3- \
  | rev)
  sed "1iID\t${gene}" "${file}"ID_CpG \
  > "${file}"ID_CpG_labelled
done

# Create initial file for joining
cp "${array[0]}"ID_CpG_labelled ID_CpG_labelled_all.tab

# Loop through array and performs joins.
# Looping starts at array index 1, since index 0 was processed above.
# Assumes input files were already sorted by chromosome, in column 1.
# Outputs tab-delimited file
for file in "${array[@]:1}"
do
  join \
  --nocheck-order \
  ID_CpG_labelled_all.tab "${file}"ID_CpG_labelled \
  | tr ' ' '\t' \
  > "${tmp}" \
  && mv "${tmp}" ID_CpG_labelled_all.tab
done

All data was rsync‘d to my folder on Gannet.

RESULTS

Output folder:

20190225_cpg_oe/

The output folder contains a subdirectory for each of the ~90 genes that were processed. Each contains the following files (text):

C (count of Cs in sequences)
CG (count of CGs in sequences)
*_tab (FastA to tab file created with seqkit)
*_tab2 (Just sequences from each entry within the FastA)
G (count of Gs in sequences)
ID_CpG (initial CpG O/E determination - no header)
ID_CpG_labelled (CpG O/E determination - with header)
comb (Tabbed FastA with seq lengths, CG/C/G counts)

The combined CpG O/E file (text):

20190225_cpg_oe/ID_CpG_labelled_all.tab