Goals – May 2017

A day late, but definitely not a dollar short!

No goals posted last month because I didn’t want anyone to think they were just an April Fool’s joke. ;)

This month my goal is to continue my domination of Pub-a-Thon 2017!


I plan on doing so by getting a second paper submitted this month! That’s right! I’m working on getting the following paper re-submitted:

Differential response to stress in Ostrea lurida (Carpenter 1864)

 

Manuscript Writing – Submitted!

Boom!

 

Here are some useful links:

data records repo-URL: https://osf.io/j8rc2/
draft repo-URL: https://github.com/kubu4/paper_oly_gbs
draft: https://www.authorea.com/users/4974/articles/149442
preprint (Overleaf): https://www.overleaf.com/read/mqbbvmwxhncg
preprint (PDF): https://osf.io/cdj7m/

Manuscript – Oly GBS 14-Day Plan

For Pub-a-thon 2017, Steven has asked us to put together a 14-day plan for our manuscripts.

My manuscript is accessible in three locations:

Current: Overleaf for final editing/formatting before submission to Scientific Data.
Archival: Authorea for initial writing/comments.
Archival: GitHub for initial writing/issues.

Additionally, I have established a data repository with a Digital Object Identifier (DOI) at the Open Science Framework (OSF).

Here’s what I have going on:

Day 1

  • Convert .xls data records to .csv to see if they will render in the OSF repo (see the sketch after this list).
  • Assemble figure: phylogenetic tree.
  • Add figure to manuscript.
  • Deal with any minor edits.
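
For the .xls-to-.csv conversion mentioned in the first bullet, something like this is what I have in mind (a rough sketch, assuming csvkit's in2csv is installed; the file names are placeholders, not the actual data records):

# Convert each Excel data record to CSV so the OSF repo can render it.
# Assumes csvkit is installed (pip install csvkit); file names are placeholders.
for xls in *.xls; do
    in2csv "$xls" > "${xls%.xls}.csv"
done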

Day 2

  • Assemble figure: Puget Sound map.
  • Add figure to manuscript.
  • Deal with any minor edits.

Day 3

  • Submit? Depends on Steven’s availability to finish off the Background & Summary and write up the Abstract.

Data Received – Olympia oyster PacBio Data

Back in December 2016, we sent off Ostrea lurida DNA to the UW PacBio sequencing facility. This is an attempt to fill in the gaps left by the BGI genome sequencing project.

See the GitHub Wiki dedicated to this project for an overview of the UW PacBio sequencing.

I downloaded the data to http://owl.fish.washington.edu/nightingales/O_lurida/20170323_pacbio/ using the required browser plugin, Aspera Connect. Technically, saving the data to a subfolder within a given species’ data folder goes against our data management plan (DMP) for high-throughput sequencing data, but the sequencing data output is far different from what we normally receive from an Illumina sequencing run. Instead of just FASTQ files, we received the following from each PacBio SMRT cell we had run (we had 10 SMRT cells run):
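
(For reference, a listing like the one below – including the directory/file tally mentioned after it – can be generated with the tree utility, assuming it is installed; the path is the example SMRT cell folder referenced in the readme note further down.)

tree 20170323_pacbio/170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1/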

├── Analysis_Results
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.bax.h5
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.bax.h5
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.bax.h5
│   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.bas.h5
├── filter
│   ├── data
│   │   ├── control_reads.cmp.h5
│   │   ├── control_results_by_movie.csv
│   │   ├── data.items.json
│   │   ├── data.items.pickle
│   │   ├── filtered_regions
│   │   │   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.rgn.h5
│   │   │   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.rgn.h5
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.rgn.h5
│   │   ├── filtered_regions.fofn
│   │   ├── filtered_subread_summary.csv
│   │   ├── filtered_subreads.fasta
│   │   ├── filtered_subreads.fastq
│   │   ├── filtered_summary.csv
│   │   ├── nocontrol_filtered_subreads.fasta
│   │   ├── post_control_regions.chunk001of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.rgn.h5
│   │   ├── post_control_regions.chunk002of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.rgn.h5
│   │   ├── post_control_regions.chunk003of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.rgn.h5
│   │   ├── post_control_regions.fofn
│   │   └── slots.pickle
│   ├── index.html
│   ├── input.fofn
│   ├── input.xml
│   ├── log
│   │   ├── P_Control
│   │   │   ├── align.cmpH5.Gather.log
│   │   │   ├── align.plsFofn.Scatter.log
│   │   │   ├── align_001of003.log
│   │   │   ├── align_002of003.log
│   │   │   ├── align_003of003.log
│   │   │   ├── noControlSubreads.log
│   │   │   ├── summaryCSV.log
│   │   │   ├── updateRgn.noCtrlFofn.Gather.log
│   │   │   ├── updateRgn_001of003.log
│   │   │   ├── updateRgn_002of003.log
│   │   │   └── updateRgn_003of003.log
│   │   ├── P_ControlReports
│   │   │   └── statsJsonReport.log
│   │   ├── P_Fetch
│   │   │   ├── adapterRpt.log
│   │   │   ├── overviewRpt.log
│   │   │   └── toFofn.log
│   │   ├── P_Filter
│   │   │   ├── filter.rgnFofn.Gather.log
│   │   │   ├── filter.summary.Gather.log
│   │   │   ├── filter_001of003.log
│   │   │   ├── filter_002of003.log
│   │   │   ├── filter_003of003.log
│   │   │   ├── subreadSummary.log
│   │   │   ├── subreads.subreadFastq.Gather.log
│   │   │   ├── subreads.subreads.Gather.log
│   │   │   ├── subreads_001of003.log
│   │   │   ├── subreads_002of003.log
│   │   │   └── subreads_003of003.log
│   │   ├── P_FilterReports
│   │   │   ├── loadingRpt.log
│   │   │   ├── statsRpt.log
│   │   │   └── subreadRpt.log
│   │   ├── master.log
│   │   └── smrtpipe.log
│   ├── metadata.rdf
│   ├── results
│   │   ├── adapter_observed_insert_length_distribution.png
│   │   ├── adapter_observed_insert_length_distribution_thumb.png
│   │   ├── control_non-control_readlength.png
│   │   ├── control_non-control_readlength_thumb.png
│   │   ├── control_non-control_readquality.png
│   │   ├── control_non-control_readquality_thumb.png
│   │   ├── control_report.html
│   │   ├── control_report.json
│   │   ├── filter_reports_adapters.html
│   │   ├── filter_reports_adapters.json
│   │   ├── filter_reports_filter_stats.html
│   │   ├── filter_reports_filter_stats.json
│   │   ├── filter_reports_filter_subread_stats.html
│   │   ├── filter_reports_filter_subread_stats.json
│   │   ├── filter_reports_loading.html
│   │   ├── filter_reports_loading.json
│   │   ├── filtered_subread_report.png
│   │   ├── filtered_subread_report_thmb.png
│   │   ├── overview.html
│   │   ├── overview.json
│   │   ├── post_filter_readlength_histogram.png
│   │   ├── post_filter_readlength_histogram_thumb.png
│   │   ├── post_filterread_score_histogram.png
│   │   ├── post_filterread_score_histogram_thumb.png
│   │   ├── pre_filter_readlength_histogram.png
│   │   ├── pre_filter_readlength_histogram_thumb.png
│   │   ├── pre_filterread_score_histogram.png
│   │   └── pre_filterread_score_histogram_thumb.png
│   ├── toc.xml
│   └── workflow
│       ├── P_Control
│       │   ├── align.cmpH5.Gather.sh
│       │   ├── align.plsFofn.Scatter.sh
│       │   ├── align_001of003.sh
│       │   ├── align_002of003.sh
│       │   ├── align_003of003.sh
│       │   ├── noControlSubreads.sh
│       │   ├── summaryCSV.sh
│       │   ├── updateRgn.noCtrlFofn.Gather.sh
│       │   ├── updateRgn_001of003.sh
│       │   ├── updateRgn_002of003.sh
│       │   └── updateRgn_003of003.sh
│       ├── P_ControlReports
│       │   └── statsJsonReport.sh
│       ├── P_Fetch
│       │   ├── adapterRpt.sh
│       │   ├── overviewRpt.sh
│       │   └── toFofn.sh
│       ├── P_Filter
│       │   ├── filter.rgnFofn.Gather.sh
│       │   ├── filter.summary.Gather.sh
│       │   ├── filter_001of003.sh
│       │   ├── filter_002of003.sh
│       │   ├── filter_003of003.sh
│       │   ├── subreadSummary.sh
│       │   ├── subreads.subreadFastq.Gather.sh
│       │   ├── subreads.subreads.Gather.sh
│       │   ├── subreads_001of003.sh
│       │   ├── subreads_002of003.sh
│       │   └── subreads_003of003.sh
│       ├── P_FilterReports
│       │   ├── loadingRpt.sh
│       │   ├── statsRpt.sh
│       │   └── subreadRpt.sh
│       ├── Workflow.details.dot
│       ├── Workflow.details.html
│       ├── Workflow.details.svg
│       ├── Workflow.profile.html
│       ├── Workflow.rdf
│       ├── Workflow.summary.dot
│       ├── Workflow.summary.html
│       └── Workflow.summary.svg
├── filtered_subreads.fasta.gz
├── filtered_subreads.fastq.gz
├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.metadata.xml
└── nocontrol_filtered_subreads.fasta.gz

That’s 20 directories and 127 files – for a single SMRT cell!

Granted, there is the familiar FASTQ file (filtered_subreads.fastq), which is likely what will be used for downstream analysis, but it’s hard to decide how to manage this data under the guidelines of our current DMP. It’s possible we might separate the data files from the numerous other files (which are, essentially, metadata), but we need to decide which file type(s) (e.g. .h5 files, .fastq files) will serve as the data files people will rely on for analysis. So, for the time being, this will be how the data is stored.
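
If we do end up splitting things, the FASTQ/FASTA files could be pulled out from everything else with something along these lines (just a sketch of the idea, not something that’s in place; the destination folder is a placeholder):

# Copy only the FASTQ/FASTA files out of the PacBio delivery (if those are what we
# settle on as the "data" files), leaving everything else where it is.
mkdir -p fastq_only
find 20170323_pacbio/ -name "*.fastq*" -o -name "*.fasta*" | \
    xargs -I{} rsync -av {} fastq_only/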

I’ll update the readme file to reflect the addition of the top-level folders (e.g. ../20170323_pacbio/170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1/).

I’ll also update the GitHub Wiki.

Data Management – SRA Submission Oly GBS Batch Submission

An earlier attempt at submitting these files failed.

I re-uploaded the failed files (indicated in my previous notebook entry linked above) and tried again.

 

It failed again, despite the file in question having been successfully uploaded just minutes before.

I re-uploaded that “missing” file and tried again.

This time, it succeeded (and no end-of-stream error for the 1SN_1A file!)!

Will post here with the SRA accession number once it goes live!

 

Computing – Owl Partially Restored

Heard back from Synology and they indicated I should click the “Repair” option to fix the “System Partition Failed” error message seen previously.

I did that and our data is now accessible again. However, all of the user account info, scheduled tasks (e.g. Glacier backups, notebook backup script), IP configurations, mail configurations, etc. have been reset.

I downloaded/installed the various packages needed to have the server accessible via the web and configured the IP address settings.

Have a note out to Synology to see if the configurations can be restored somehow. Once I hear back, we’ll get user accounts re-established.

Below is a chronological set of screen caps of the repair process:

 

Our data is still here! This is before performing the “Repair” operation, btw. It seems it just required some time to re-populate the directory structure.

 

 

 

 

Still getting a “degraded” error message, but all drives appear normal. However, Disk 3 in the DX513 is not showing; possible cause for “degraded” status?

 

 

 

 

Set up manual IP settings by expanding the “LAN 1” connection.

Data Management – SRA Submission Oly GBS Batch Submission Fail

I had previously submitted the two non-demultiplexed genotype-by-sequencing (GBS) files provided by BGI to the NCBI short read archive (SRA).

Recently, Jay responded to an issue I had posted on the GitHub repo for the manuscript we’re writing about this data.

He noticed that the SRA no longer wants “raw data dumps” (i.e. the type of submission I made before). So, that means I had to prepare the demultiplexed files provided by BGI to actually submit to the SRA.

Last week, I uploaded all 192 of the files via FTP. It took over 10hrs.

FTP tips:

  • Use ftp -i to initiate FTP.
  • Use open ftp.address.IP to connect.
  • Use mput with wildcards to upload multiple files.
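
For my own future reference, the scripted equivalent of that interactive session looks roughly like this (the address, login, target folder, and file glob are placeholders, not the actual SRA account details):

# -i turns off per-file prompting for mput; -n suppresses auto-login so "user" works in a script.
ftp -in ftp.address.IP <<'END'
user my_username my_password
cd my_upload_folder
mput *.fq.gz
bye
END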

Today, I created a batch BioSample submission:

 

 

 

Initiated the submission process (Ummm, this looks like it’s going to take a while…):

 

 

 

Aaaaaand, it failed:

 

 

It seems like the FTP failed at some point, as there’s nothing about those seven files that would separate them from the remaining 185 files. Additional support for FTP failure is that the 1SN_1A_1.fq.gz error message makes it sound like only part of the file got transferred.

I’ll retrieve those files from our UW Google Drive (since their original home on Owl is still down) and get them transferred to the SRA.
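
Before re-uploading, a quick integrity check on the retrieved copies seems prudent – something like this (a sketch; the glob is a placeholder for the seven failed files):

# gzip -t verifies each archive is complete; a truncated transfer (like the apparent
# end-of-stream problem with 1SN_1A_1.fq.gz) should fail this test.
for f in *.fq.gz; do
    gzip -t "$f" && echo "OK: $f" || echo "TRUNCATED/CORRUPT: $f"
done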

Troubleshooting – Synology NAS (Owl) Down After Update

TL;DR – Server didn’t recover after firmware update last night. “Repair” is an option listed in the web interface, but I want to confirm with Synology what will happen if/when I click that button…

The data on Owl is synced here (Google Drive): UW Google Drive

However, not all of Owl was fully synced at the time of this failure, so it seems like a decent amount of data is not accessible. Inaccessible data is mostly from individual user directories.

All high-throughput sequencing data is also backed up to Amazon Glacier, so we do have all of that data.

 

Here is what happened, in chronological order:

 

  1. Updated DSM via web interface in “Update & Restore”. Did NOT perform manual install.
  2. System became inaccessible via web interface and Synology Assistant.
  3. The physical unit showed a flashing blue power light and a flashing green LAN1 light.
  4. No other lights were illuminated (this includes no lights for any of the drive bays).
  5. The attached expansion unit (DX513) showed steady blue power light, steady green lights on all drive bays, and steady green eSATA light.
  6. I powered down both units via the DS1812+ power button.
  7. I turned on both units via the DS1812+ power button.
  8. Both units returned to their previous status and were still inaccessible via the web interface and Synology Assistant.
  9. I powered down both units via the DS1812+ power button.
  10. I removed all drives from both units.
  11. I turned on both units via the DS1812+ power button.
  12. I connected to the DS1812+ via Synology Assistant. A message indicated “No Hard Disk Found on 1812+”.
  13. I powered down both units via the DS1812+ power button.
  14. I added a single HDD to the DS1812+.
  15. I turned on both units via the DS1812+ power button.
  16. I connected to the DS1812+ via Synology Assistant. I was prompted to install the latest DSM. I followed the steps and created a new admin account. Now the system shows 7 drives in the DS1812+ with a message: “System Partition Failed; Healthy”. Disk 1 shows a “Normal” status; this is the disk that I used to re-install DSM in Step 14. Additionally, the system shows one unused disk in the DX513.
  17. I powered down both units via the web interface.
  18. I removed Disk 1 from DS1812+.
  19. I turned on both units via the DS1812+ power button.
  20. The DS1812+ returned to its initial state as described in Step 3.
  21. I powered down both units via the DS1812+ power button.
  22. I returned Disk 1 to its bay.
  23. I turned on both units via the DS1812+ power button.
  24. There’s an option to “Repair” the Volume, but I’m not comfortable doing so until I discuss the ins and outs of this with Synology. I’ve submitted a tech support ticket with Synology.

Below are pictures of the entire process, for reference.

 

Server status when I arrived to lab this morning.

 

Pulled the HDDs from both units, in an attempt to be able to connect via Synology Assistant.

 

Units w/o HDDs.

 

Removing the HDDs made the server detectable via Synology Assistant, but it indicates “Not installed” in the “Status” column…

 

Successfully connected, but the DS1812+ indicates no HDDs installed.

 

 

Added a single HDD back to the DS1812+. Notice that the drive light is green and the “Status” light is amber. This is actually an improvement over what I saw when I arrived.

 

Added back a single HDD to the DS1812+ and now have this setup menu.

 

I’m prompted to install the Synology DSM.

 

Installing DSM. This “Formatting system partition” message might be related to the eventual error message that I see (“System Partition Failed”) after this is all set up…

 

 

 

 

 

 

 

 

Prompted to create an admin account. This does not bode well, since this is behaving like a brand new installation (i.e. no record of the previous configuration, users, etc.).

 

Continuing set up.

 

All set up…

 

 

Added all the HDDs back and detected via Synology Assistant.

 

This shows that there are no other users – i.e. previous configuration is not detected.

 

After putting all the HDDs back in, got this message after logging in.

 

Looking at the Storage info in DSM; seems bad.

 

 

Physically, the drives all look fine (green lights on all drive bays), despite the DSM indicating “System Partition Failed” for all of them (except Disk 1). The expansion unit’s bay lights are actually all green, but the drives were actively being read when the picture was taken (i.e. flashing), so the image didn’t capture all of them being green. The amber light on the expansion unit reflects what was seen in the DSM – the middle drive is “Not initialized”. Note that the drive is actually inserted, but the handle has been released.

 

This is how I left the system. Notice that after rebooting, the expansion unit no longer shows that “Not initialized” message for Disk 3. Instead, Disk 3 is now detected as installed, but not used…

 

Computing – Oly BGI GBS Reproducibility; fail?

OK, so things have improved since the last attempt at getting this BGI script to run and demultiplex the raw data.

I played around with the index.lst file format (based on the error I received last time, it seemed likely that the file formatting was incorrect) and actually got the script to run to completion! Granted, it took over 16 hrs (!!), but it completed!

See the Jupyter notebook link below.

 

Results:

Well, although the script finished and kicked out all the demultiplexed FASTQ files, the contents of the FASTQ files don’t match the original set of demultiplexed files – the read counts differ between these results and the BGI files. I’m not entirely sure whether this is to be expected, since the script allows for a single nucleotide mismatch when demultiplexing. Is it possible that the mismatch could be interpreted slightly differently each time this is run? I’m not certain.

Theoretically, you should get the same results every time…

Maybe I’ll re-run this over the weekend and see how the results compare to this run and the original BGI demultiplexing…
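
When I do that comparison, tallying per-sample read counts should be enough to see how the runs line up – something like this (a sketch; the directory names are placeholders, and I’m assuming gzipped FASTQ output):

# Each FASTQ record is 4 lines, so reads = lines / 4.
for fq in rerun_demultiplexed/*.fq.gz; do
    sample=$(basename "$fq")
    rerun_count=$(( $(zcat "$fq" | wc -l) / 4 ))
    bgi_count=$(( $(zcat "bgi_demultiplexed/$sample" | wc -l) / 4 ))
    printf "%s\t%s\t%s\n" "$sample" "$rerun_count" "$bgi_count"
done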

Jupyter notebook (GitHub): 20170314_docker_Oly_BGI_GBS_demultiplexing_reproducibility.ipynb

 


Computing – Oly BGI GBS Reproducibility Fail (but less so than last time)…

Well, my previous attempt at reproducing the demultiplexing that BGI performed was an exercise in futility. BGI got back to me with the following message:

 

Hi Sam,

We downloaded it and it seems fine when compiling. You can compile it with the below command under Linux system.

tar -zxvf ReSeqTools_XXX.tar.gz ; cd iTools_Code; chmod 775 iTools ; ./ iTools -h

 

I gave that a whirl and got the following message:

Error opening terminal: xterm

Some internet searching sucked me into a useless black hole about 64-bit systems running 32-bit programs, and enabling the 64-bit kernel on Mac OS X 10.7.5 (Lion) since it’s not enabled by default, and on and on. In the end, I can’t seem to enable the 64-bit kernel on my Mac Pro, likely due to hardware limitations related to the graphics card and/or the displays that are connected.

Anyway, I decided to try getting this program installed again, using a Docker container (instead of trying to install locally on my Mac).

 

Results:

It didn’t work again, but for a different reason! Despite the instructions in the readme file provided with iTools, you don’t actually need to run make! All that has to be done is unzipping the tarball!! However, despite figuring this out, the program fails with the following error message: “Warming : sample double in this INDEX Files. Sample ID: OYSzenG1AAD96FAAPEI-109; please renamed it diff” (note: this is copied/pasted – the spelling errors are not mine). So, I think there’s something wrong with the formatting of the index file that BGI provided me with.
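
For the record, the steps that do work boil down to something like this (a sketch; the Docker image name is a placeholder for the container I’ve been using, and the tarball name keeps BGI’s “XXX” placeholder):

# Unpack iTools inside the container and print its help - no "make" step needed.
docker run --rm -v "$(pwd)":/data -w /data my_bioinformatics_image \
    bash -c 'tar -zxvf ReSeqTools_XXX.tar.gz && cd iTools_Code && chmod 775 iTools && ./iTools -h'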

I’ve contacted them for more info.

See the Jupyter notebook linked below to see what I tried.

Jupyter notebook (GitHub): 20170314_docker_Oly_BGI_GBS_demultiplexing_reproducibility.ipynb