Tag Archives: PacBio

Genome Assembly – Olympia oyster PacBio minimap/miniasm/racon

In this GitHub Issue, Steven suggested I try the minimap/miniasm/racon pipeline for assembling our Olympia oyster PacBio data.

I followed the pipeline described by this paper: http://matzlab.weebly.com/uploads/7/6/2/2/76229469/racon.pdf.

This notebook entry contains only the initial minimap execution, which is followed up with miniasm and then racon.

Jupyter Notebook (GitHub): 20170907_docker_pacbio_oly_minimap2.ipynb
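For orientation, here is a minimal sketch of how the three steps chain together (overlap, layout, polish). It is not the code from the notebook. The file names, the `prefix` argument, and the thread count are hypothetical, and the command lines assume minimap2's `ava-pb`/`map-pb` presets and the positional argument order used by recent racon releases (reads, overlaps, target). The functions only build the command strings and convert miniasm's GFA output to FASTA; nothing is executed.

```python
import shlex


def gfa_to_fasta(gfa_text):
    """Convert GFA segment ('S') lines into FASTA records."""
    records = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        # Segment lines look like: S <name> <sequence> [tags...]
        if len(fields) >= 3 and fields[0] == "S":
            records.append(f">{fields[1]}\n{fields[2]}")
    return "\n".join(records) + ("\n" if records else "")


def pipeline_commands(reads, prefix, threads=8):
    """Return shell command strings for overlap -> layout -> remap -> polish.

    All output names are hypothetical placeholders derived from `prefix`.
    """
    reads_q = shlex.quote(reads)
    paf = shlex.quote(prefix + ".overlaps.paf.gz")
    gfa = shlex.quote(prefix + ".gfa")
    draft = shlex.quote(prefix + ".draft.fasta")
    remap = shlex.quote(prefix + ".remap.paf")
    polished = shlex.quote(prefix + ".racon.fasta")
    return [
        # 1. All-vs-all read overlaps (assumed minimap2 PacBio overlap preset).
        f"minimap2 -x ava-pb -t {threads} {reads_q} {reads_q} | gzip -1 > {paf}",
        # 2. Lay the overlaps out into unitigs; miniasm writes GFA.
        f"miniasm -f {reads_q} {paf} > {gfa}",
        # (The draft FASTA is produced from the GFA, e.g. with gfa_to_fasta.)
        # 3. Map the raw reads back to the draft assembly.
        f"minimap2 -x map-pb -t {threads} {draft} {reads_q} > {remap}",
        # 4. Polish the draft with racon.
        f"racon -t {threads} {reads_q} {remap} {draft} > {polished}",
    ]
```

Calling `pipeline_commands("reads.fastq", "oly")` returns the four command strings in run order, which can be printed and reviewed before anything is launched.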

Project Progress – Olympia Oyster Genome Assemblies by Sean Bennett

Here’s a brief overview of what Sean has done on the Oly genome assembly front.

Metassembler

  • Assemble his BGI assembly and Platanus assembly? Confusing terms here; not sure what he means.
  • Failed due to 32-bit vs. 64-bit installation of MUMmer. He didn’t have the chance to re-compile MUMmer as 64-bit. However, a recent MUMmer announcement suggests that MUMmer can now handle genomes of unlimited size.
  • I believe he was planning on using (or was using?) GARM, which relies upon MUMmer and may also include a version of MUMmer (outdated version that led to Sean’s error message?).
  • Notebook entry

Canu

Redundans

Platanus

Data Management – SRA Submission Olympia Oyster UW PacBio Data from 20170323

Submitted the FASTQ files from the UW PacBio Data from 20170323 to the NCBI Sequence Read Archive (SRA).

FTP’d the data to NCBI’s servers, following their instructions. Briefly:

Change to the directory where the FASTQ files are (Owl/web/nightingales/O_lurida) and then initiate an FTP session:

ftp -i ftp-private.ncbi.nlm.nih.gov

Enter the provided username/password, change to my designated uploads directory, and create a new folder dedicated to this particular upload. Then, upload all the files using the mput command:

mput *filtered_subreads*
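The same upload can be scripted rather than typed into an interactive FTP session. The sketch below uses Python's standard `ftplib`; the function names, the `remote_dir` and `new_folder` arguments, and the credentials are hypothetical placeholders, and only the host name and the `*filtered_subreads*` pattern come from the commands above.

```python
import glob
import os
from ftplib import FTP


def upload_matching(ftp, local_dir, pattern="*filtered_subreads*"):
    """Upload every file in local_dir matching pattern (like mput).

    `ftp` is any object with an ftplib-style storbinary() method.
    Returns the uploaded file names in sorted order.
    """
    uploaded = []
    for path in sorted(glob.glob(os.path.join(local_dir, pattern))):
        if not os.path.isfile(path):
            continue
        name = os.path.basename(path)
        with open(path, "rb") as handle:
            ftp.storbinary(f"STOR {name}", handle)
        uploaded.append(name)
    return uploaded


def upload_to_ncbi(local_dir, user, password, remote_dir, new_folder):
    """Log in, create a folder for this upload, and send the files.

    remote_dir and new_folder are hypothetical; use the values NCBI provides.
    """
    with FTP("ftp-private.ncbi.nlm.nih.gov") as ftp:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        ftp.mkd(new_folder)
        ftp.cwd(new_folder)
        return upload_matching(ftp, local_dir)
```

Keeping the file loop separate from the connection handling makes it possible to dry-run the pattern match against a stand-in object before opening a real session.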

SRA deets are below (assigned FASTQ files to existing BioProject and created a new BioSample). Will update post with SRA number when processing is complete on the NCBI end.

SRA sample: SRS2339870
SRA run: SRR5809355
BioProject: PRJNA316624
BioSample: SAMN07326085

Data Management – Tarball of Olympia oyster UW PacBio Data from 20170323

I’d previously attempted to archive this data set on multiple occasions, across multiple days, but network dropouts kept killing my connection to the server (Owl) and, in turn, interrupting the tarball operation.

Today, I came in to find that the tarball of this PacBio data set had been created successfully (it only took 10 hrs)! And, it’s a big file: 162 GB!! Remember, that’s the compressed size!
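For reference, a gzipped tarball like this can also be built and sanity-checked with Python's standard `tarfile` module. This is only a sketch under assumed paths; `make_tarball`, `source_dir`, and `tarball_path` are hypothetical names, and it does nothing to address the network dropouts, so it would still need to run on the server itself or inside a session that survives a disconnect.

```python
import os
import tarfile


def make_tarball(source_dir, tarball_path):
    """Create a gzipped tarball of source_dir and return its member names.

    The archive is reopened after writing so a truncated or corrupt
    file is caught immediately rather than at restore time.
    """
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(tarball_path, "w:gz") as tar:
        # arcname keeps paths in the archive relative to the top folder.
        tar.add(source_dir, arcname=top_level)
    with tarfile.open(tarball_path, "r:gz") as tar:
        return tar.getnames()
```

Write the tarball outside of `source_dir`; otherwise the archive would try to include itself.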

Now, we’ll have to decide where we want to keep the tarball. I guess this’ll be part of our next data management plan discussions.

 

Data Management – Olympia oyster UW PacBio Data from 20170323

Due to other priorities, getting this PacBio data sorted and prepped for our next gen sequencing data management plan (DMP) was put on the back burner. I finally got around to this, but it wasn’t all successful.

The primary failure is the inability to get the original data file archived as a gzipped tarball. The problem lies in loss of connection to Owl during the operation. This connection issue was recently noticed by Sean in his dealings with Hyak (mox). According to Sean, the Hyak (mox) people or UW IT ran some tests of their own on this connection and their results suggested that the connection issue is related to a network problem in FTR, and is not related to Owl itself. Whatever the case is, we need to have this issue addressed sometime soon…

Anyway, below is the Jupyter notebook that demonstrates the file manipulations I used to find, copy, rename, and verify the data integrity of all the FASTQ files from this sequencing run.
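As a rough illustration of that find/copy/rename/verify sequence (not the notebook's actual code), the sketch below copies each SMRT cell's FASTQ into one folder, prefixes it with the cell's folder name so the identically named files do not collide, and compares MD5 checksums of source and copy. The directory layout (one subfolder per SMRT cell directly under `run_dir`), the renaming scheme, and the function names are all assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collect_fastqs(run_dir, dest_dir, name="filtered_subreads.fastq.gz"):
    """Copy <run_dir>/<cell>/<name> to <dest_dir>/<cell>_<name>.

    Returns {new file name: md5} and raises if a copy's checksum
    differs from its source. The layout and naming are hypothetical.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = {}
    for src in sorted(Path(run_dir).glob(f"*/{name}")):
        dest = dest_dir / f"{src.parent.name}_{name}"
        shutil.copy2(src, dest)
        src_md5, dest_md5 = md5sum(src), md5sum(dest)
        if src_md5 != dest_md5:
            raise IOError(f"checksum mismatch for {src}")
        copied[dest.name] = dest_md5
    return copied
```

The returned mapping of file names to checksums is the kind of record that could be pasted into a readme or a data management spreadsheet.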

Next up is to get these FASTQ files added to the DMP spreadsheets.

Jupyter notebook (GitHub): 20170622_oly_pacbio_data_management.ipynb

 

I’ve also embedded the notebook below, but it might be easier to view at the GitHub link provided above.

Data Received – Olympia oyster PacBio Data

Back in December 2016, we sent off Ostrea lurida DNA to the UW PacBio sequencing facility. This is an attempt to fill in the gaps left from the BGI genome sequencing project.

See the GitHub Wiki dedicated to this for an overview of this UW PacBio sequencing.

I downloaded the data to http://owl.fish.washington.edu/nightingales/O_lurida/20170323_pacbio/ using the required browser plugin, Aspera Connect. Technically, saving the data to a subfolder within a given species’ data folder goes against our data management plan (DMP) for high-throughput sequencing data, but the sequencing data output is far different from what we normally receive from an Illumina sequencing run. Instead of just FASTQ files, we received the following from each PacBio SMRT cell we had run (we had 10 SMRT cells run):

├── Analysis_Results
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.bax.h5
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.bax.h5
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.bax.h5
│   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.bas.h5
├── filter
│   ├── data
│   │   ├── control_reads.cmp.h5
│   │   ├── control_results_by_movie.csv
│   │   ├── data.items.json
│   │   ├── data.items.pickle
│   │   ├── filtered_regions
│   │   │   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.rgn.h5
│   │   │   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.rgn.h5
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.rgn.h5
│   │   ├── filtered_regions.fofn
│   │   ├── filtered_subread_summary.csv
│   │   ├── filtered_subreads.fasta
│   │   ├── filtered_subreads.fastq
│   │   ├── filtered_summary.csv
│   │   ├── nocontrol_filtered_subreads.fasta
│   │   ├── post_control_regions.chunk001of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.rgn.h5
│   │   ├── post_control_regions.chunk002of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.rgn.h5
│   │   ├── post_control_regions.chunk003of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.rgn.h5
│   │   ├── post_control_regions.fofn
│   │   └── slots.pickle
│   ├── index.html
│   ├── input.fofn
│   ├── input.xml
│   ├── log
│   │   ├── P_Control
│   │   │   ├── align.cmpH5.Gather.log
│   │   │   ├── align.plsFofn.Scatter.log
│   │   │   ├── align_001of003.log
│   │   │   ├── align_002of003.log
│   │   │   ├── align_003of003.log
│   │   │   ├── noControlSubreads.log
│   │   │   ├── summaryCSV.log
│   │   │   ├── updateRgn.noCtrlFofn.Gather.log
│   │   │   ├── updateRgn_001of003.log
│   │   │   ├── updateRgn_002of003.log
│   │   │   └── updateRgn_003of003.log
│   │   ├── P_ControlReports
│   │   │   └── statsJsonReport.log
│   │   ├── P_Fetch
│   │   │   ├── adapterRpt.log
│   │   │   ├── overviewRpt.log
│   │   │   └── toFofn.log
│   │   ├── P_Filter
│   │   │   ├── filter.rgnFofn.Gather.log
│   │   │   ├── filter.summary.Gather.log
│   │   │   ├── filter_001of003.log
│   │   │   ├── filter_002of003.log
│   │   │   ├── filter_003of003.log
│   │   │   ├── subreadSummary.log
│   │   │   ├── subreads.subreadFastq.Gather.log
│   │   │   ├── subreads.subreads.Gather.log
│   │   │   ├── subreads_001of003.log
│   │   │   ├── subreads_002of003.log
│   │   │   └── subreads_003of003.log
│   │   ├── P_FilterReports
│   │   │   ├── loadingRpt.log
│   │   │   ├── statsRpt.log
│   │   │   └── subreadRpt.log
│   │   ├── master.log
│   │   └── smrtpipe.log
│   ├── metadata.rdf
│   ├── results
│   │   ├── adapter_observed_insert_length_distribution.png
│   │   ├── adapter_observed_insert_length_distribution_thumb.png
│   │   ├── control_non-control_readlength.png
│   │   ├── control_non-control_readlength_thumb.png
│   │   ├── control_non-control_readquality.png
│   │   ├── control_non-control_readquality_thumb.png
│   │   ├── control_report.html
│   │   ├── control_report.json
│   │   ├── filter_reports_adapters.html
│   │   ├── filter_reports_adapters.json
│   │   ├── filter_reports_filter_stats.html
│   │   ├── filter_reports_filter_stats.json
│   │   ├── filter_reports_filter_subread_stats.html
│   │   ├── filter_reports_filter_subread_stats.json
│   │   ├── filter_reports_loading.html
│   │   ├── filter_reports_loading.json
│   │   ├── filtered_subread_report.png
│   │   ├── filtered_subread_report_thmb.png
│   │   ├── overview.html
│   │   ├── overview.json
│   │   ├── post_filter_readlength_histogram.png
│   │   ├── post_filter_readlength_histogram_thumb.png
│   │   ├── post_filterread_score_histogram.png
│   │   ├── post_filterread_score_histogram_thumb.png
│   │   ├── pre_filter_readlength_histogram.png
│   │   ├── pre_filter_readlength_histogram_thumb.png
│   │   ├── pre_filterread_score_histogram.png
│   │   └── pre_filterread_score_histogram_thumb.png
│   ├── toc.xml
│   └── workflow
│       ├── P_Control
│       │   ├── align.cmpH5.Gather.sh
│       │   ├── align.plsFofn.Scatter.sh
│       │   ├── align_001of003.sh
│       │   ├── align_002of003.sh
│       │   ├── align_003of003.sh
│       │   ├── noControlSubreads.sh
│       │   ├── summaryCSV.sh
│       │   ├── updateRgn.noCtrlFofn.Gather.sh
│       │   ├── updateRgn_001of003.sh
│       │   ├── updateRgn_002of003.sh
│       │   └── updateRgn_003of003.sh
│       ├── P_ControlReports
│       │   └── statsJsonReport.sh
│       ├── P_Fetch
│       │   ├── adapterRpt.sh
│       │   ├── overviewRpt.sh
│       │   └── toFofn.sh
│       ├── P_Filter
│       │   ├── filter.rgnFofn.Gather.sh
│       │   ├── filter.summary.Gather.sh
│       │   ├── filter_001of003.sh
│       │   ├── filter_002of003.sh
│       │   ├── filter_003of003.sh
│       │   ├── subreadSummary.sh
│       │   ├── subreads.subreadFastq.Gather.sh
│       │   ├── subreads.subreads.Gather.sh
│       │   ├── subreads_001of003.sh
│       │   ├── subreads_002of003.sh
│       │   └── subreads_003of003.sh
│       ├── P_FilterReports
│       │   ├── loadingRpt.sh
│       │   ├── statsRpt.sh
│       │   └── subreadRpt.sh
│       ├── Workflow.details.dot
│       ├── Workflow.details.html
│       ├── Workflow.details.svg
│       ├── Workflow.profile.html
│       ├── Workflow.rdf
│       ├── Workflow.summary.dot
│       ├── Workflow.summary.html
│       └── Workflow.summary.svg
├── filtered_subreads.fasta.gz
├── filtered_subreads.fastq.gz
├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.metadata.xml
└── nocontrol_filtered_subreads.fasta.gz

That’s 20 directories and 127 files – for a single SMRT cell!
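Counts like these can be reproduced for any SMRT cell folder with a short directory walk. This is a generic sketch; `count_tree` is a hypothetical helper name, and the root path passed to it is whatever folder holds a single SMRT cell's output.

```python
import os


def count_tree(root):
    """Return (directories, files) found beneath root, excluding root itself."""
    n_dirs = 0
    n_files = 0
    for _current, dirnames, filenames in os.walk(root):
        n_dirs += len(dirnames)
        n_files += len(filenames)
    return n_dirs, n_files
```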

Granted, there is the familiar FASTQ file (filtered_subreads.fastq), which is what will likely be used for downstream analysis, but it’s hard to make a decision on how we manage this data under the guidelines of our current DMP. It’s possible we might separate data files from the numerous other files (the other files are, essentially, metadata), but we need to decide which file type(s) (e.g. .h5 files, .fastq files) will serve as the data files people will rely on for analysis. So, for the time being, this will be how the data is stored.

I’ll update the readme file to reflect the addition of the top level folders (e.g. ../20170323_pacbio/170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1/).

I’ll also update the GitHub Wiki.