Data Wrangling - CpG OE Calculations on C.virginica Genes

Steven tasked me with processing ~90 FastA files containing gene sequences from C.virginica in this GitHub Issue. He needed to determine the Observed/Expected (O/E) ratio of CpGs in each FastA. He provided this example code and this link to all the files. Additionally, today, he tasked Kaitlyn with merging all of the output CpG O/E values for each sample in to a single file, but I decided to tackle it anyway.

The CpG O/E determination was done in a Jupyter Notebook:

Interestingly, the processing (which relied on awk) required the use of gawk, due to the high number of output fields. The default implementation of awk on the version of Ubuntu I was using was not gawk.

The creation of a single file with all of the CpG O/E info is detailed in this bash script:

#!/bin/bash

## Script to append sample-specific headers to each ID_CpG
## file and join all ID_CpG files.

## Run file from within this directory.

# Temp file placeholder
tmp=$(mktemp)

# Create array of subdirectories.
array=(*/)

# Create column headers for ID_CpG files using sample name from directory name.
# Expects directory names like this:
# Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_VA_3_GENE_analysis
# Command will remove the "_GENE_analysis" portion.
for file in "${array[@]}"
do
  gene=$(echo "${file}" \
  | awk -F[.] '{print $6}' \
  | rev \
  | cut -d "_" -f3- \
  | rev)
  sed "1iID\t${gene}" "${file}"ID_CpG \
  > "${file}"ID_CpG_labelled
done

# Create initial file for joining
cp "${array[0]}"ID_CpG_labelled ID_CpG_labelled_all.tab

# Loop through array and performs joins.
# Looping starts at array index 1, since index 0 was processed above.
# Assumes input files were already sorted by chromosome, in column 1.
# Outputs tab-delimited file
for file in "${array[@]:1}"
do
  join \
  --nocheck-order \
  ID_CpG_labelled_all.tab "${file}"ID_CpG_labelled \
  | tr ' ' '\t' \
  > "${tmp}" \
  && mv "${tmp}" ID_CpG_labelled_all.tab
done

All data was rsync‘d to my folder on Gannet.

RESULTS

Output folder:

The output folder contains a subdirectory for each of the ~90 genes that were processed. Each contains the following files (text):

  • C (count of Cs in sequences)
  • CG (count of CGs in sequences)
  • *_tab (FastA to tab file created with seqkit)
  • *_tab2 (Just sequences from each entry within the FastA)
  • G (count of Gs in sequences)
  • ID_CpG (initial CpG O/E determination - no header)
  • ID_CpG_labelled (CpG O/E determination - with header)
  • comb (Tabbed FastA with seq lengths, CG/C/G counts)

The combined CpG O/E file (text):