Deb's Delvings in Genealogy: WGS


15 May 2018

Whole Genome Sequence (Part 2) - Analysis Tools

For earlier parts see
Whole Genome Sequence (Part 1) — YSEQ.net options and files received

I am investigating bioinformatics¹ tools to analyze Whole Genome Sequence (WGS) data. I have access to a WGS for someone who has also tested at several genealogy testing companies. I want to do some comparisons between the raw data from the genealogy testing companies and the WGS, checking for accuracy of the reads. To satisfy my curiosity, I plan to investigate some of the medical implications and traits discussed in scientific papers.

Once I have multiple WGSs from relatives, I plan to check whether segments that the testing companies report as matching really do match completely in the higher resolution data. I am interested in how closely the statistical predictions on linkage disequilibrium and crossovers mirror what is seen in real family multi-generational studies. For example, in a shared segment reported by a testing company, not every SNP is tested. A number of SNPs in the segment are tested and we assume the non-tested SNPs match based on statistical predictions.
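The kind of check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the genotypes are made up, the function names are hypothetical, and the rule used (two unphased genotypes "match" when they share at least one allele) is the simple half-identical test, not the matching algorithm of any particular testing company.

```python
def share_allele(genotype_a, genotype_b):
    """Return True if two unphased genotypes (e.g. 'AG' and 'GG') share an allele."""
    return bool(set(genotype_a) & set(genotype_b))


def check_segment(person_a, person_b, start, end):
    """Compare every position both people have inside the range start..end.

    person_a and person_b map a chromosome position to a genotype string.
    Returns the number of matching positions and a list of mismatching positions.
    """
    matches = 0
    mismatches = []
    for pos in sorted(set(person_a) & set(person_b)):
        if start <= pos <= end:
            if share_allele(person_a[pos], person_b[pos]):
                matches += 1
            else:
                mismatches.append(pos)
    return matches, mismatches


# Made-up genotypes for illustration only (position -> genotype).
relative_1 = {100: "AA", 150: "AG", 200: "CC", 250: "GT", 300: "TT"}
relative_2 = {100: "AG", 150: "GG", 200: "TT", 250: "GG", 300: "CT"}

matches, mismatches = check_segment(relative_1, relative_2, 100, 300)
print(matches, "positions share an allele; mismatches at", mismatches)
```

With WGS data the two dictionaries would hold every position in the segment rather than only the SNPs on a testing company's chip, so a mismatch hidden between tested SNPs would show up in the second list.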

My previous career in software development, testing, and support made me familiar with Open-Source Software, so I look for available tools before spending time writing my own. Tools I am checking out include Samtools, National Center for Biotechnology Information's (NCBI) Genome Workbench, and Broad Institute's Integrative Genomics Viewer (IGV).

This new BioRxiv paper is timely for my quest:

"A large-scale analysis of bioinformatics code on GitHub," by Pamela H Russell, Rachel L Johnson, Shreyas Ananthan, Benjamin Harnke, and Nichole E Carlson, doi: https://doi.org/10.1101/321919. The meaty data is in the supplemental material which consists of several large files (some over 200MB) linked from the article abstract.

By the way, just as with some of the best genealogy articles, the reference notes in this article led me to several additional sources I now need to consult.

As a woman, I find this sentence especially depressing: "... the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists".² I hope this changes and more women participate in bioinformatics.

geralt, Pixabay (https://pixabay.com/en/learn-mathematics-child-girl-2405206/ : accessed 15 May 2018), CC0 Creative Commons.

I am impressed with how many databases and tools are out there for DNA analysis. I did not realize there are over 1,700 bioinformatics repositories and "23 'high profile' GitHub repositories containing source code for popular and highly respected bioinformatic tools."³ "Our analysis points to simple recommendations for selecting bioinformatic tools from among the thousands available."⁴ Some of these will not be useful for genealogy, but some will.

One tool aimed at the genetic genealogy community is Thomas Krahn's tool for annotating a BigY VCF file and identifying derived and novel SNPs.⁵ Thomas kindly shared this tool so others can do the analysis instead of having it done by his company YSEQ.net.
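For readers who have not looked inside a VCF file, here is a minimal Python sketch of how the data lines are laid out. It is not Thomas Krahn's pipeline and does not reproduce how his tool decides that a SNP is novel. The sample lines, positions, and the identifier "example_id" are made up; the only thing taken from the VCF format itself is the column order (CHROM, POS, ID, REF, ALT, ...) and the use of "." for a missing identifier.

```python
def read_vcf_records(lines):
    """Yield (chrom, pos, snp_id, ref, alt) from VCF text lines, skipping header lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # '##' meta lines, the '#CHROM' column header, and blank lines
        chrom, pos, snp_id, ref, alt = line.rstrip("\n").split("\t")[:5]
        yield chrom, int(pos), snp_id, ref, alt


# Made-up sample lines for illustration only.
sample_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chrY\t2781653\texample_id\tA\tG\t50\tPASS\t.\n",
    "chrY\t6753511\t.\tC\tT\t48\tPASS\t.\n",
]

records = list(read_vcf_records(sample_vcf))
# Records whose ID column is '.' carry no identifier in the file.
unnamed = [record for record in records if record[2] == "."]
print(len(records), "variants,", len(unnamed), "without an identifier")
```

For a real, gzip-compressed VCF the lines could come from `gzip.open(path, "rt")` in the Python standard library instead of the sample list.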

Some of the discussions in the scientific world parallel those we are having in the genealogy world.

"In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. Reproducibility requires that authors publish original data and a clear protocol to allow repetition of the analysis in a paper."⁶ In the genealogy world we are discussing publicly available DNA data, such as on GEDmatch.com, allowing DNA analysis to be reproduced and referenced from a publication.

OpenClipart-Vectors, Pixabay (https://pixabay.com/en/analysis-biology-biotechnology-2025786/ : accessed 15 May 2018), CC0 Creative Commons.

"The bioinformatics field embraces a culture of sharing — for both data and source code — that supports rapid scientific and technical progress."⁷ In the genealogy world we are discussing privacy issues versus sharing data, especially with the recent proliferation of stories on law enforcement use of genealogy databases.

I have been musing on whether to learn Python or Ruby. A recent discussion with a young programmer had me leaning towards Python. Since the "greatest amount of code in the main dataset was in Javascript, followed by Java, Python, C++, and C"⁸ maybe I will stay with JavaScript and Java, which I already know, if I develop any new tools for web usage. I have a few tools I wrote in Perl for my own use that I hope to clean up and share eventually.

In addition to DNA adding to my knowledge of my family tree, it is forcing me to upgrade my data analysis knowledge and computer tools familiarity. I hope all of this study helps keep my mind active and reduces those "senior moments" that seem to occur more frequently with the years.

1. The science of collecting and analyzing complex biological data such as genetic code.
2. Pamela H Russell, et al., "A large-scale analysis of bioinformatics code on GitHub," 15 May 2018, BioRxiv pre-publication, https://doi.org/10.1101/321919, line 35.
3. Ibid., line 27.
4. Ibid., line 148.
5. Thomas Krahn, "bigY_hg38_pipeline.sh," GitHubGist (https://gist.github.com/tkrahn/283462028c61cd213399ba7f6b773893).
6. Russell, "A large-scale analysis of bioinformatics code on GitHub," line 84.
7. Ibid., line 120.
8. Ibid., line 208.

All statements made in this blog are the opinion of the post author. This blog is not sponsored by any entity other than Debbie Parker Wayne nor is it supported through free or reduced price access to items discussed unless so indicated in the blog post. Hot links to other sites are provided as a courtesy to the reader and are not an endorsement of the other entities except as clearly stated in the narrative.

To cite this blog post:
Debbie Parker Wayne, "Whole Genome Sequence (Part 2) - Analysis Tools," Deb's Delvings, 15 May 2018 (http://debsdelvings.blogspot.com/ : accessed [date]).

© 2018, Debbie Parker Wayne, Certified Genealogist®, All Rights Reserved

25 April 2018

Whole Genome Sequence (Part 1)

This first article on Whole Genome Sequence (WGS) analysis is posted today to celebrate DNA Day, 25 April 2018.

This is the first in a continuing series on the files received when a person's entire genome is sequenced, the contents of those files, the tools needed to access the file data, and some things a genealogist can do with the data.

I now have access to the WGS data for several people who have also tested at most of the genealogy companies offering DNA tests. I am excited to be able to analyze these files so others can decide if a WGS test may be right for them now that prices are below $1000 and probably going lower "soon." When ordering higher resolution sequencing that is consistent with medical testing the price may be over $1000.

https://www.genome.gov/27541954/dna-sequencing-costs-data/

The first WGS I have access to was done through YSEQ.net. This is the company of Thomas and Astrid Krahn, who are well known in the genetic genealogy community. YSEQ.net has excellent explanations of the options and processes on its website, and the owners are very responsive to questions via Facebook and their online contact form.

YSEQ.net was chosen as the testing company because they

provide the data files on a micro SD card as an option
offer 15x, 30x, and 50x options for coverage (30x coverage is generally the minimum used for medical purposes; the test taker wanted to be able to use this for health purposes and did not want to pay for additional sequencing later unless it becomes possible to phase data as it is sequenced)
provide privacy acceptable to the test taker (the outsourced sequencing does not have the test taker's name attached, the outsource sequencing company will not use the DNA data for other purposes, raw data is archived at YSEQ.net where German law prohibits the data being used without permission from the test taker)
and have owners with a good reputation

Adapted by Debbie Parker Wayne from mcmurryjulie, chromosomes, Pixabay (https://pixabay.com/en/chromosomes-genetics-dna-genes-2817314/ : accessed 15 November 2017), CC0 Creative Commons.

A kit was ordered from YSEQ.net on 15 November 2017, four swabs arrived on 18 November 2017, the kit was returned on 20 November 2017, and it was received by the lab on 27 November 2017. Online mtDNA results were available 39 days later on 5 January 2018. Online WGS results were available 24 days after that on 29 January 2018. That is only 75 days from the order date (70 days from the date the kit was mailed back), including mail time between the USA and Germany. The micro SD card was received later.
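The intervals in this timeline are easy to check with date arithmetic. The short Python sketch below uses only the dates given in the paragraph above; the variable names are just labels chosen for this example.

```python
from datetime import date

# Dates taken from the timeline above.
ordered = date(2017, 11, 15)
kit_returned = date(2017, 11, 20)
lab_received = date(2017, 11, 27)
mtdna_results = date(2018, 1, 5)
wgs_results = date(2018, 1, 29)

# Subtracting two dates gives a timedelta; .days is the whole-day difference.
lab_to_mtdna = (mtdna_results - lab_received).days
mtdna_to_wgs = (wgs_results - mtdna_results).days
order_to_wgs = (wgs_results - ordered).days
return_to_wgs = (wgs_results - kit_returned).days

print("Lab receipt to mtDNA results:", lab_to_mtdna, "days")
print("mtDNA results to WGS results:", mtdna_to_wgs, "days")
print("Order to WGS results:", order_to_wgs, "days")
print("Kit mailed back to WGS results:", return_to_wgs, "days")
```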

The files received consisted of

a text file with information on how to download the online DNA data, an mtDNA comparison to the rCRS, and Y chromosome analysis if the test taker is male
a text file with 23andMe V3-style data with about 958,000 lines that could be used with third-party DNA websites and tools; any test taker who has also tested at other testing companies can compare the two files to see if both companies found the same allele values at all locations
an mtDNA FASTA file (this is also a plain text file format); any test taker who has also tested the full mtDNA sequence can compare the two files to see if both companies found the same allele values at all locations
a very large Variant Call Format (VCF) file with a complete set of extracted mutations, about 695MB, provided as a GZIP file (which you must unzip) along with a TBI index file; once unzipped it is readable with a text file reader such as NoteTab Pro, but its size may slow down your system; this has interesting information on the length of data read from the test taker's chromosomes and the mutations of this test taker
a BAM and BAI (BAM Index) file with the WGS data; these require special tools to view because the files are compressed (this will be covered more in a later post; a SAM (Sequence Alignment Map) file is a text-based format for storing biological sequences aligned to a reference sequence, and BAM is the compressed binary version of a SAM file; Samtools is available for LINUX systems at http://www.htslib.org/)
a BAM.stats and BAM.idxstats.tsv file; both are small, 450 to 5,000 lines, and readable in any plain text file reader (stats is an abbreviation for statistics; idx is a common abbreviation for index in the computer world; TSV is a tab-separated-values file similar to the CSV comma-separated-values files we use all of the time in DNA analysis)
what seems to be the mtDNA data in a BAM file format along with a BAM index
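The comparison suggested for the 23andMe-style file in the list above can be sketched in Python. This is a minimal sketch, assuming both files use the usual 23andMe raw data layout (tab-separated rsid, chromosome, position, genotype, with comment lines starting with "#"). The identifiers and genotypes are made up, the function names are hypothetical, and the sketch does not handle differences in reference build or strand, either of which can produce apparent mismatches.

```python
def parse_raw_data(lines):
    """Map rsid -> genotype from 23andMe-style lines (rsid, chromosome, position, genotype)."""
    genotypes = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment lines and blank lines
        rsid, _chrom, _pos, genotype = line.rstrip("\n").split("\t")[:4]
        genotypes[rsid] = genotype
    return genotypes


def compare_calls(first, second):
    """Compare genotypes at SNPs present in both files.

    Allele order is ignored (AG equals GA) and no-calls ('--') are skipped.
    Returns (number the same, list of differing rsids, number skipped).
    """
    same, different, skipped = 0, [], 0
    for rsid in sorted(set(first) & set(second)):
        a, b = first[rsid], second[rsid]
        if "-" in a or "-" in b:
            skipped += 1
        elif sorted(a) == sorted(b):
            same += 1
        else:
            different.append(rsid)
    return same, different, skipped


# Made-up lines for illustration only.
wgs_lines = [
    "# rsid\tchromosome\tposition\tgenotype\n",
    "rsA\t1\t1000\tAG\n",
    "rsB\t1\t2000\tCC\n",
    "rsC\t2\t3000\tTT\n",
    "rsD\t2\t4000\t--\n",
]
chip_lines = [
    "rsA\t1\t1000\tGA\n",
    "rsB\t1\t2000\tCT\n",
    "rsC\t2\t3000\tTT\n",
    "rsD\t2\t4000\tGG\n",
    "rsE\t3\t5000\tAA\n",
]

same, different, skipped = compare_calls(parse_raw_data(wgs_lines), parse_raw_data(chip_lines))
print(same, "same;", len(different), "different:", different, ";", skipped, "skipped")
```

With real files, each list of lines would be replaced by an open file, for example `open("wgs_23andme_style.txt")`, where the file name is a placeholder.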

Image by Debbie Parker Wayne

I am in the process of installing Samtools on my LINUX system so I can read the BAM files. I suspect many genealogists will not do this unless they have experience with LINUX/UNIX systems. There are some Windows/Mac-based genome analysis tools also.

Even without Samtools there is a lot of interesting information here to analyze in the coming months and compare to data from the genealogy testing companies. If you are interested in learning more about BAM files see Samtools and NCBI Genome Workbench.
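As one example of what can be done without Samtools, the small idxstats file can be summarized with a few lines of Python. This sketch assumes the BAM.idxstats.tsv file follows the column layout that `samtools idxstats` writes (reference name, sequence length, mapped reads, unmapped reads); the file delivered by a given company may differ. The numbers below are made up.

```python
def parse_idxstats(lines):
    """Return {reference_name: (length, mapped, unmapped)} from idxstats-style TSV lines."""
    stats = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, length, mapped, unmapped = line.rstrip("\n").split("\t")[:4]
        stats[name] = (int(length), int(mapped), int(unmapped))
    return stats


# Made-up lines for illustration only; '*' collects reads not placed on any reference.
sample_idxstats = [
    "chr1\t1000\t600\t2\n",
    "chrM\t100\t300\t0\n",
    "*\t0\t0\t100\n",
]

stats = parse_idxstats(sample_idxstats)
total_mapped = sum(mapped for _length, mapped, _unmapped in stats.values())

for name, (length, mapped, _unmapped) in stats.items():
    if length:  # skip the '*' line, which has no reference length
        print(name, f"{mapped / total_mapped:.1%} of mapped reads")
```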

To cite this blog post:
Debbie Parker Wayne, "Whole Genome Sequence (Part 1)," Deb's Delvings, 25 April 2018 (http://debsdelvings.blogspot.com/ : accessed [date]).
