Tag Archives: PacBio

Genome Assembly – Olympia oyster PacBio minimap/miniasm/racon

In this GitHub Issue, Steven had suggested I try out the minimap/miniasm/racon pipeline for assembling our Olympia oyster PacBio data.

I followed the pipeline described by this paper: http://matzlab.weebly.com/uploads/7/6/2/2/76229469/racon.pdf.

This notebook entry contains only the initial minimap execution. The miniasm and racon steps follow it.

Jupyter Notebook (GitHub): 20170907_docker_pacbio_oly_minimap2.ipynb
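As a rough sketch of how the three tools chain together: all-vs-all read overlaps with minimap2, layout with miniasm, then one polishing round with racon. This follows the general usage shown in the tools' own READMEs, not the commands in the notebook above. The file names, thread count, and the `ava-pb`/`map-pb` presets are assumptions. racon's command-line form has also changed between releases (older ones took the output file as a fourth argument), so check `racon --help` for the installed version.

```shell
#!/bin/sh
# Sketch only: file names and thread count are placeholders, not values from the notebook.

# miniasm writes a GFA graph; pull the sequence ("S") records out as FASTA.
gfa_to_fasta() {
  awk -F'\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}

# Hypothetical wrapper; runs in a subshell so its variables stay local.
assemble_and_polish() (
  reads=$1          # e.g. the concatenated filtered_subreads FASTQ files
  threads=${2:-8}   # assumed default

  # 1. All-vs-all read overlaps (PacBio preset).
  minimap2 -x ava-pb -t "$threads" "$reads" "$reads" | gzip -1 > overlaps.paf.gz &&

  # 2. Assembly layout; the output is an unpolished GFA graph.
  miniasm -f "$reads" overlaps.paf.gz > assembly.gfa &&
  gfa_to_fasta assembly.gfa > assembly.fasta &&

  # 3. Map the reads back to the raw contigs, then build a consensus with racon.
  minimap2 -x map-pb -t "$threads" assembly.fasta "$reads" > mapped.paf &&
  racon -t "$threads" "$reads" mapped.paf assembly.fasta > polished.fasta
)
```

Nothing runs until `assemble_and_polish reads.fastq` is called. miniasm does no error correction of its own, which is why the racon step is part of the pipeline.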

Project Progress – Olympia Oyster Genome Assemblies by Sean Bennett

0000-0002-2747-368X

Here’s a brief overview of what Sean has done on the Oly genome assembly front.

Metassembler

Assemble his BGI assembly and Platanus assembly? Confusing terms here; not sure what he means.
Failed due to 32-bit vs. 64-bit installation of MUMmer. He didn’t have the chance to re-compile MUMmer as 64-bit. However, a recent MUMmer announcement suggests that MUMmer can now handle genomes of unlimited size.
I believe he was planning on using (or was using?) GARM, which relies upon MUMmer and may also include a version of MUMmer (outdated version that led to Sean’s error message?).
Notebook entry

Canu

Assemble UW PacBio data (filenames beginning with m170211, m170315, m170308, and m170301).
Files (including Mox scripts, Pilon contig polishing, & output FASTA files) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Canu_Output/
Notebook entry

Redundans

Assembled raw Illumina reads provided by BGI (filenames beginning with 151114 and 160103) & UW PacBio data (filenames beginning with m170211, m170315, m170308, and m170301).
Ran this two times.
First run
- Files (does NOT include Mox scripts!) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Redundans_Output/
- Notebook entry
Second run
- Files (including Mox scripts & output FASTA files) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Redundans_Output_Try_2/
- Notebook entry

Platanus

Assembled raw Illumina reads provided by BGI (beginning with 151114 and 160103).
Ran this two times.
First run
- Files (including Mox scripts & output FASTA files) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Illumina_Platanus_Assembly/
- Notebook entry
Second run
- Files (including Mox scripts & output FASTA files) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Platanus_Assembly_Kmer-22/
- Notebook entry

Data Management – SRA Submission Olympia Oyster UW PacBio Data from 20170323

Submitted the FASTQ files from the UW PacBio Data from 20170323 to the NCBI sequence read archive (SRA).

FTP’d the data to NCBI’s servers, following their instructions. Briefly:

Change to the directory where the FASTQ files are (Owl/web/nightingales/O_lurida) and then initiate an FTP session:

ftp -i ftp-private.ncbi.nlm.nih.gov

Enter the provided username/password, change to my designated uploads directory, and create a new folder dedicated to this particular upload. Then, upload all the files using the mput command:

mput *filtered_subreads*
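The same session can be scripted rather than typed. A minimal sketch, assuming a standard command-line ftp client: the function only prints the command script, and the user, password, and folder names passed to it are placeholders (the host and the file glob are the ones used above).

```shell
#!/bin/sh
# Print the commands for a non-interactive FTP session.
# All four arguments are placeholders supplied by the caller.
ftp_upload_script() (
  user=$1; pass=$2; uploads_dir=$3; new_folder=$4
  cat <<EOF
user $user $pass
cd $uploads_dir
mkdir $new_folder
cd $new_folder
binary
mput *filtered_subreads*
bye
EOF
)

# Usage (run from the directory holding the FASTQ files):
#   ftp_upload_script USER PASS UPLOADS_DIR NEW_FOLDER | ftp -i -n ftp-private.ncbi.nlm.nih.gov
```

`-n` stops the client from attempting auto-login so that the `user` line does the login instead, and `-i` suppresses the per-file prompt that `mput` would otherwise show. Passing a password as an argument leaves it in shell history, so this is only reasonable for a throwaway upload account.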

SRA deets are below (assigned FASTQ files to existing BioProject and created a new BioSample). Will update post with SRA number when processing is complete on the NCBI end.

SRA sample: SRS2339870
SRA run: SRR5809355
BioProject: PRJNA316624
BioSample: SAMN07326085

Data Management – Tarball of Olympia oyster UW PacBio Data from 20170323

I’d previously attempted to archive this data set on multiple occasions, across multiple days, but network dropouts kept killing my connection to the server (Owl) and, in turn, interrupting the tarball operation.

Today, I came in to a successful creation of the tarball of this PacBio data set (it only took 10hrs)! And, it’s a big file: 162GB!! Remember, that’s the compressed size!

Now, we’ll have to decide where we want to keep the tarball. I guess this’ll be part of our next data management plan discussions.
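For future archives of this size, one way to keep a multi-hour tar job from dying with a dropped connection is to start it on the server itself, detached from the login session. A minimal sketch, not the command actually used here; the function name and paths are hypothetical.

```shell
#!/bin/sh
# Create a gzipped tarball of a directory, then confirm the archive is readable.
make_tarball() (
  src_dir=$1    # directory to archive
  tarball=$2    # output .tar.gz path
  tar -czf "$tarball" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" &&
  # Listing makes tar read the whole archive, so a truncated file fails here.
  tar -tzf "$tarball" > /dev/null &&
  echo "archive OK: $tarball"
)
```

Saved in a script together with a call such as `make_tarball /path/to/20170323_pacbio /path/to/20170323_pacbio.tar.gz`, it can be launched with `nohup sh make_tarball.sh > tarball.log 2>&1 &` (or inside `screen`/`tmux`), so the job keeps running after the SSH session drops.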

Data Management – Olympia Oyster UW PacBio Data (from 20170323) to Nightingales

Added UW PacBio FASTQ files to our nightingales Google Sheet for keeping track of our high-throughput sequencing projects.

Data Management – Olympia oyster UW PacBio Data from 20170323

Due to other priorities, getting this PacBio data sorted and prepped for our next gen sequencing data management plan (DMP) was put on the back burner. I finally got around to this, but it wasn’t all successful.

The primary failure is the inability to get the original data file archived as a gzipped tarball. The problem lies in loss of connection to Owl during the operation. This connection issue was recently noticed by Sean in his dealings with Hyak (mox). According to Sean, the Hyak (mox) people or UW IT ran some tests of their own on this connection and their results suggested that the connection issue is related to a network problem in FTR, and is not related to Owl itself. Whatever the case is, we need to have this issue addressed sometime soon…

Anyway, below is the Jupyter notebook that demonstrates the file manipulations I used to find, copy, rename, and verify data integrity of all the FASTQ files from this sequencing run.

Next up is to get these FASTQ files added to the DMP spreadsheets.

Jupyter notebook (GitHub): 20170622_oly_pacbio_data_management.ipynb
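The notebook has the actual commands. As a rough outline of the find/copy/rename/verify idea, the sketch below collects each SMRT cell's `filtered_subreads.fastq.gz` into one folder and checks each copy against its source. Prefixing each file with the movie name taken from the cell's `.metadata.xml` file is an assumption, as is the byte-for-byte `cmp` check standing in for a checksum comparison.

```shell
#!/bin/sh
# Copy every SMRT cell's FASTQ into one flat folder, renamed with the cell's
# movie name, and confirm each copy is byte-identical to its source.
collect_fastqs() (
  src_root=$1   # top of the downloaded run, one subfolder per SMRT cell
  dest=$2       # flat destination directory
  mkdir -p "$dest" || exit 1
  status=0
  # Write the file list first so the loop runs in this shell, not in a pipeline.
  list=$(mktemp) || exit 1
  find "$src_root" -type f -name 'filtered_subreads.fastq.gz' > "$list"
  while IFS= read -r fq; do
    cell_dir=$(dirname "$fq")
    movie=
    # The movie name is the metadata file name minus its suffix (assumed scheme).
    for meta in "$cell_dir"/*.metadata.xml; do
      [ -e "$meta" ] && movie=$(basename "$meta" .metadata.xml) && break
    done
    if [ -z "$movie" ]; then
      echo "SKIP (no metadata.xml): $fq" >&2; status=1; continue
    fi
    new="$dest/${movie}_filtered_subreads.fastq.gz"
    if cp "$fq" "$new" && cmp -s "$fq" "$new"; then
      echo "OK $new"
    else
      echo "FAILED $new" >&2; status=1
    fi
  done < "$list"
  rm -f "$list"
  exit "$status"
)
```

Called as `collect_fastqs SRC_DIR DEST_DIR`, it returns non-zero if any file was skipped or failed verification, so it can gate a later step.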

I’ve also embedded the notebook below, but it might be easier to view at the GitHub link provided above.

Data Received – Olympia oyster PacBio Data

Back in December 2016, we sent off Ostrea lurida DNA to the UW PacBio sequencing facility. This is an attempt to fill in the gaps left from the BGI genome sequencing project.

See the GitHub Wiki dedicated to this for an overview of this UW PacBio sequencing.

I downloaded the data to http://owl.fish.washington.edu/nightingales/O_lurida/20170323_pacbio/ using the required browser plugin, Aspera Connect. Technically, saving the data to a subfolder within a given species’ data folder goes against our data management plan (DMP) for high-throughput sequencing data, but the sequencing data output is far different than what we normally receive from an Illumina sequencing run. Instead of just FASTQ files, we received the following from each PacBio SMRT cell we had run (we had 10 SMRT cells run):

├── Analysis_Results
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.bax.h5
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.bax.h5
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.bax.h5
│   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.bas.h5
├── filter
│   ├── data
│   │   ├── control_reads.cmp.h5
│   │   ├── control_results_by_movie.csv
│   │   ├── data.items.json
│   │   ├── data.items.pickle
│   │   ├── filtered_regions
│   │   │   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.rgn.h5
│   │   │   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.rgn.h5
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.rgn.h5
│   │   ├── filtered_regions.fofn
│   │   ├── filtered_subread_summary.csv
│   │   ├── filtered_subreads.fasta
│   │   ├── filtered_subreads.fastq
│   │   ├── filtered_summary.csv
│   │   ├── nocontrol_filtered_subreads.fasta
│   │   ├── post_control_regions.chunk001of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.rgn.h5
│   │   ├── post_control_regions.chunk002of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.rgn.h5
│   │   ├── post_control_regions.chunk003of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.rgn.h5
│   │   ├── post_control_regions.fofn
│   │   └── slots.pickle
│   ├── index.html
│   ├── input.fofn
│   ├── input.xml
│   ├── log
│   │   ├── P_Control
│   │   │   ├── align.cmpH5.Gather.log
│   │   │   ├── align.plsFofn.Scatter.log
│   │   │   ├── align_001of003.log
│   │   │   ├── align_002of003.log
│   │   │   ├── align_003of003.log
│   │   │   ├── noControlSubreads.log
│   │   │   ├── summaryCSV.log
│   │   │   ├── updateRgn.noCtrlFofn.Gather.log
│   │   │   ├── updateRgn_001of003.log
│   │   │   ├── updateRgn_002of003.log
│   │   │   └── updateRgn_003of003.log
│   │   ├── P_ControlReports
│   │   │   └── statsJsonReport.log
│   │   ├── P_Fetch
│   │   │   ├── adapterRpt.log
│   │   │   ├── overviewRpt.log
│   │   │   └── toFofn.log
│   │   ├── P_Filter
│   │   │   ├── filter.rgnFofn.Gather.log
│   │   │   ├── filter.summary.Gather.log
│   │   │   ├── filter_001of003.log
│   │   │   ├── filter_002of003.log
│   │   │   ├── filter_003of003.log
│   │   │   ├── subreadSummary.log
│   │   │   ├── subreads.subreadFastq.Gather.log
│   │   │   ├── subreads.subreads.Gather.log
│   │   │   ├── subreads_001of003.log
│   │   │   ├── subreads_002of003.log
│   │   │   └── subreads_003of003.log
│   │   ├── P_FilterReports
│   │   │   ├── loadingRpt.log
│   │   │   ├── statsRpt.log
│   │   │   └── subreadRpt.log
│   │   ├── master.log
│   │   └── smrtpipe.log
│   ├── metadata.rdf
│   ├── results
│   │   ├── adapter_observed_insert_length_distribution.png
│   │   ├── adapter_observed_insert_length_distribution_thumb.png
│   │   ├── control_non-control_readlength.png
│   │   ├── control_non-control_readlength_thumb.png
│   │   ├── control_non-control_readquality.png
│   │   ├── control_non-control_readquality_thumb.png
│   │   ├── control_report.html
│   │   ├── control_report.json
│   │   ├── filter_reports_adapters.html
│   │   ├── filter_reports_adapters.json
│   │   ├── filter_reports_filter_stats.html
│   │   ├── filter_reports_filter_stats.json
│   │   ├── filter_reports_filter_subread_stats.html
│   │   ├── filter_reports_filter_subread_stats.json
│   │   ├── filter_reports_loading.html
│   │   ├── filter_reports_loading.json
│   │   ├── filtered_subread_report.png
│   │   ├── filtered_subread_report_thmb.png
│   │   ├── overview.html
│   │   ├── overview.json
│   │   ├── post_filter_readlength_histogram.png
│   │   ├── post_filter_readlength_histogram_thumb.png
│   │   ├── post_filterread_score_histogram.png
│   │   ├── post_filterread_score_histogram_thumb.png
│   │   ├── pre_filter_readlength_histogram.png
│   │   ├── pre_filter_readlength_histogram_thumb.png
│   │   ├── pre_filterread_score_histogram.png
│   │   └── pre_filterread_score_histogram_thumb.png
│   ├── toc.xml
│   └── workflow
│       ├── P_Control
│       │   ├── align.cmpH5.Gather.sh
│       │   ├── align.plsFofn.Scatter.sh
│       │   ├── align_001of003.sh
│       │   ├── align_002of003.sh
│       │   ├── align_003of003.sh
│       │   ├── noControlSubreads.sh
│       │   ├── summaryCSV.sh
│       │   ├── updateRgn.noCtrlFofn.Gather.sh
│       │   ├── updateRgn_001of003.sh
│       │   ├── updateRgn_002of003.sh
│       │   └── updateRgn_003of003.sh
│       ├── P_ControlReports
│       │   └── statsJsonReport.sh
│       ├── P_Fetch
│       │   ├── adapterRpt.sh
│       │   ├── overviewRpt.sh
│       │   └── toFofn.sh
│       ├── P_Filter
│       │   ├── filter.rgnFofn.Gather.sh
│       │   ├── filter.summary.Gather.sh
│       │   ├── filter_001of003.sh
│       │   ├── filter_002of003.sh
│       │   ├── filter_003of003.sh
│       │   ├── subreadSummary.sh
│       │   ├── subreads.subreadFastq.Gather.sh
│       │   ├── subreads.subreads.Gather.sh
│       │   ├── subreads_001of003.sh
│       │   ├── subreads_002of003.sh
│       │   └── subreads_003of003.sh
│       ├── P_FilterReports
│       │   ├── loadingRpt.sh
│       │   ├── statsRpt.sh
│       │   └── subreadRpt.sh
│       ├── Workflow.details.dot
│       ├── Workflow.details.html
│       ├── Workflow.details.svg
│       ├── Workflow.profile.html
│       ├── Workflow.rdf
│       ├── Workflow.summary.dot
│       ├── Workflow.summary.html
│       └── Workflow.summary.svg
├── filtered_subreads.fasta.gz
├── filtered_subreads.fastq.gz
├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.metadata.xml
└── nocontrol_filtered_subreads.fasta.gz

That’s 20 directories and 127 files – for a single SMRT cell!
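To tally any SMRT cell folder the same way, and to pull out only the files that are candidates for "data" rather than metadata, something like the following works. It is a small sketch; the function names are hypothetical and the extension list is an assumption about which file types matter.

```shell
#!/bin/sh
# Count subdirectories and files beneath a folder (the folder itself is not counted).
count_tree() (
  dir=$1
  dirs=$(( $(find "$dir" -mindepth 1 -type d | wc -l) ))
  files=$(( $(find "$dir" -type f | wc -l) ))
  echo "$dirs directories, $files files"
)

# List only the likely data files (assumed extensions), sorted by path.
list_data_files() {
  find "$1" -type f \( -name '*.h5' -o -name '*.fastq' -o -name '*.fastq.gz' \) | sort
}
```

For example, `count_tree` followed by the path to one SMRT cell folder prints a single summary line in the form `N directories, M files`.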

Granted, there is the familiar FASTQ file (filtered_subreads.fastq), which is what will likely be used for downstream analysis, but it’s hard to make a decision on how we manage this data under the guidelines of our current DMP. It’s possible we might separate data files from the numerous other files (the other files are, essentially, metadata), but we need to decide which file type(s) (e.g. .h5 files, .fastq files) will serve as the data files people will rely on for analysis. So, for the time being, this will be how the data is stored.

I’ll update the readme file to reflect the addition of the top level folders (e.g. ../20170323_pacbio/170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1/).

I’ll also update the GitHub Wiki.

Sample Submission – Ostrea lurida gDNA for PacBio Sequencing

Submitted 10 μg (30.7 μL) of the O. lurida gDNA I isolated on 20161214 to the UW PacBio facility – Order #450.

Sequencing will be 10 SMRT cells. Turnaround time is ~7-8 weeks for UW customers (UW customers get queue priority).

Sam's Notebook

University of Washington – Fishery Sciences – Roberts Lab
