12 January 2019

ConnectedDNA Graphs and Clues

I have been missing out on the excitement of network graph analysis for my DNA matches. While everyone else was playing with genetic network graphs, I was busy with another project. I could not stop to learn how to use Gephi (https://gephi.org/) or Google Fusion (https://fusiontables.google.com/) or NodeXL (http://www.smrfoundation.org/nodexl/installation/) to create graphs.

Shelley Crawford first documented the methods to create a genetic network graph using Excel and NodeXL. See her method beginning at Visualising Ancestry DNA matches–Part 1–Getting ready. Recently, I discovered Shelley's company ConnectedDNA selling charts she creates using custom code, Gephi, and data the customer supplies (or, optionally, AncestryDNA data that Shelley downloads using ‘viewer’ permission).

This is exciting! I can justify paying someone else to do the hard work for me while I finish my other tasks. I can make time to play with the graphs created by someone else. Thank you, Shelley.

The first part of this post (Product 1) is similar to what others have posted on their graph analysis. The second part of this post (Product 2) is something you may not have seen yet—graphs created using half sibling matches.

Product 1. Single profile graph using AncestryDNA data

My first purchase was a "Single profile graph" using one person's AncestryDNA data.

I used the DNAGedcom Client to download my matches and ICW files from AncestryDNA using the options recommended by ConnectedDNA on the order page. After placing my order I received instructions on how to provide the files to ConnectedDNA.

After processing by ConnectedDNA, I received PDFs and images of two graphs—a Network Map and a Group Map—and a Match file spreadsheet (XLSX) file. The spreadsheet contains information added by ConnectedDNA beyond what was in the DNAGedcom Client file.

Network Map Graph

At a shared match size threshold of 30 cM, the Network Graph includes 1,006 people (each represented by a colored dot on the graph), with 13,556 lines between them. An image of the full graph is shown here with no names displayed.

Names are visible when zoomed in on the PDF. The name labels are blurred here on this small section of the graph extracted from the group of brown dots in the upper left of the image above.

The PDF is searchable, allowing me to locate matching test takers by name. The size of each colored dot is based on the amount of shared DNA; bigger dots are likely more closely related. Each line represents a relationship (shared DNA) between two people.
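
For readers who like to see the idea in code, here is a minimal Python sketch of how dots and lines could be assembled from a match list and an in-common-with (ICW) list. It is not ConnectedDNA's code. The names and cM values are invented, and I am assuming the 30 cM threshold applies to the cM each match shares with the test taker.

```python
from collections import defaultdict

# Hypothetical data: total cM each match shares with the test taker.
shared_cm = {"Ann": 212.0, "Bob": 48.5, "Cal": 31.0, "Dee": 22.0}

# Hypothetical ICW pairs: two matches who also match each other.
icw_pairs = [("Ann", "Bob"), ("Bob", "Cal"), ("Cal", "Dee")]

THRESHOLD_CM = 30.0  # the value used for my graph; yours may differ


def build_graph(shared_cm, icw_pairs, threshold):
    """Return (nodes, edges), keeping only matches at or above the threshold."""
    # Each node keeps its shared cM so a drawing tool could size the dot by it.
    nodes = {name: cm for name, cm in shared_cm.items() if cm >= threshold}
    edges = defaultdict(set)
    for a, b in icw_pairs:
        # A line is drawn only when both people survive the threshold.
        if a in nodes and b in nodes:
            edges[a].add(b)
            edges[b].add(a)
    return nodes, dict(edges)


nodes, edges = build_graph(shared_cm, icw_pairs, THRESHOLD_CM)
# Every line is stored from both ends, so halve the total.
edge_count = sum(len(neighbors) for neighbors in edges.values()) // 2
print(len(nodes), "people,", edge_count, "lines")
```

A tool such as Gephi then handles the layout and coloring; the sketch only covers which dots and lines exist.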

Where I have identified the common ancestor I share with a match, I can label that match's group with the ancestral line. It is likely that I share that same line with others in the same group (same color). This is a clue as to where to focus research for a common ancestor for matches in this group. I can annotate the graph with the ancestral surnames associated with each colored group as shown in the image below.

The groupings can lead to genealogical discoveries. The "Anderson-McSpadden" cluster is purple. Some purple dots are outside of the main cluster and have lines linking them to the "Johnson" and "Richards" clusters. These outlier circles are marked as "Johnson-Richards and Anderson-McSpadden" and represent people that I share two ancestral lines with. A pair of Richards siblings married two Anderson first cousins. There is cross-over here in shared DNA between my maternal grandfather’s line (Anderson-McSpadden) and maternal grandmother’s line (Johnson-Richards) that is clearly reflected in the graph by the "Johnson-Richards and Anderson-McSpadden" outliers. If I did not know of this cross-over this would be a great clue for me.

The connecting lines are hard to see on this reduced image, but they are clear in the PDF and in the Group Map below.

Group Map Graph

The Group Map is a summary version of the Network Map chart. It serves as a finding aid for the numbered groups and illustrates the relationships and strength of connection between the groups. I added surnames and boxes to this version of the graph. The groups with a named common ancestor clearly split into maternal (green box) and paternal (blue box) lines. This is a clue as to whether the unnamed groups are more likely on the maternal or paternal side.

The general locations of groups and the assigned colors are the same as in the Network Map Graph. The connecting lines are simpler in this Group Map graph. The lines represent links between clusters or circles. Thicker connecting lines represent more individuals in one circle with connections in the second circle. For example, a Parker ancestor (blue circle, group 3) married a Rogers ancestor (red circle, group 2). There is a thick line between these two circles indicating many matches in the two circles share DNA. This is expected; many test takers share both of these ancestral lines with me—they are also descendants of this Parker-Rogers ancestral couple.
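
The idea behind the line thickness can be sketched in a few lines of Python. This is only an illustration with invented names and group numbers. As a simplifying assumption it counts shared-match pairs that cross between two groups, which may not be exactly how ConnectedDNA weights the lines.

```python
from collections import Counter

# Hypothetical group assignments and in-common-with (ICW) pairs.
group_of = {"Ann": 3, "Bob": 3, "Cal": 2, "Dee": 2, "Eve": 18}
icw_pairs = [
    ("Ann", "Cal"), ("Ann", "Dee"), ("Bob", "Cal"),
    ("Ann", "Bob"),  # same group, so it adds nothing between groups
    ("Dee", "Eve"),
]


def group_link_weights(group_of, icw_pairs):
    """Count ICW pairs whose two members sit in different groups."""
    weights = Counter()
    for a, b in icw_pairs:
        ga, gb = group_of[a], group_of[b]
        if ga != gb:
            # Sort so (2, 3) and (3, 2) are counted as the same line.
            weights[tuple(sorted((ga, gb)))] += 1
    return weights


weights = group_link_weights(group_of, icw_pairs)
for (g1, g2), count in sorted(weights.items()):
    print(f"group {g1} to group {g2}: {count} connection(s)")
```

In this toy data the line between groups 2 and 3 would be drawn thicker than the line between groups 2 and 18.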

Match File

The Match File is a modified version of the file provided to ConnectedDNA. The addition of the group allocations (with colors as seen on the graphs) gives a quick visual clue of the likely relationship. It is easy to find surnames of interest.

Clues also lurk in this file. My Maples line is one that is not traced past the first ancestor with that surname. My Maples ancestor married a Parker. The match here who has Parker and Maples in their surname list would be a good place for me to begin investigating links to identify Maples ancestors. Another great clue as to where to look for the common ancestor is seen in Group 18 which is my Johnson line. The Group 18 match in the image below does not have Johnson in the surname list, but Parrott ancestors are further back in my Johnson line. This is likely how I am related to this match. Jarvis is another surname in my family tree. I can investigate if we also share Jarvis ancestors or if this person has a different Jarvis line from mine.

The spreadsheet file can be searched, sorted, or filtered by data found in each column using the drop down box in the column header. This allows me to focus on a subset of matches of interest, eliminating clutter from the other lines. For example, filtering the Group column for 3 displays only the matches that are in group 3 (blue circle). Clicking the drop-down funnel allows the filter to be cleared.

The image below shows the Match List filtered for Group 3 which is my Parker paternal line. Henry Parker married Nancy Black and I see Parker and Black in the surname lists of most of these matches. Haynes is a surname a few generations further back in this Parker line. Those other surnames listed for each match may be their lines not related to me or could be clues for the spots in my lineage where I have not yet identified an ancestor.

My ethnicity is always 95-98% European so the ethnicity estimates are not generally much help to me. However, anyone with recent ancestral origins in a specific biogeographical region might find this information helpful. For example, if one grandfather’s lineage is African American or Native American, then such an ethnicity prediction for a match is a clue that the relationship might be on that grandfather’s line.

Even if you do not prefer visual representations of data, the Excel file, with the groups added so the matches can be filtered, may be worth the cost. Having the links to go directly to a match's tree or profile on Ancestry is also a time saver.

As with all non-exact searches, be aware that filtering the Surnames column for Ryan displays all of the matches who have a surname that contains the letters ryan. This includes surnames Ryan, Bryant, O'Bryan, and so on.
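
The difference between a "contains" filter and a whole-surname filter is easy to show in Python. This sketch uses invented rows and assumes the surnames in a cell are separated by commas, semicolons, or slashes; the real spreadsheet may be laid out differently.

```python
import re

# Hypothetical Surnames cells, one per match.
rows = ["Ryan, Parker", "Bryant, Black", "O'Bryan, Haynes", "Maples, Jarvis"]


def contains_filter(rows, text):
    """Mimic a spreadsheet 'contains' filter: case-insensitive substring match."""
    return [r for r in rows if text.lower() in r.lower()]


def whole_surname_filter(rows, surname):
    """Keep only rows where the surname is a complete entry in the list."""
    wanted = surname.lower()
    return [
        r for r in rows
        if wanted in (s.strip().lower() for s in re.split(r"[,;/]", r))
    ]


loose = contains_filter(rows, "Ryan")
strict = whole_surname_filter(rows, "Ryan")
print(loose)
print(strict)
```

The loose filter returns the Ryan, Bryant, and O'Bryan rows; the strict one returns only the Ryan row. The loose result is not necessarily wrong, since spelling variants can be the same family, but it helps to know which kind of filter is in use.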

Product 2. Close family graph (FamilyTreeDNA)

My second purchase was a graph using FamilyTreeDNA data for five siblings. The standard offering at the time was for full siblings (that is the price I paid). I asked Shelley if she would consider an option to include a half sibling. She agreed to use me as a guinea pig and now offers products with siblings, close family including half siblings, and extended family. See her website for details (https://www.connecteddna.com/). Again, I used DNAGedcom Client to collect data for the siblings. I then supplied the match list, ICW file, and chromosome browser file for four full siblings and one half sibling to ConnectedDNA.

After processing by ConnectedDNA, I received PDFs and images of two graphs—a Profile Graph and a Group Graph—and a Match file spreadsheet file with information added by ConnectedDNA beyond what was in the DNAGedcom Client file.

Profile Graph

These are the summary statistics for the unfiltered combined FamilyTreeDNA files sent to ConnectedDNA:
Profile           Nodes    Edges
full sibling 1    5,519  270,225
full sibling 2    5,489  277,941
full sibling 3    6,124  343,106
full sibling 4    5,473  271,225
half sibling      5,480  246,245
Combined file    15,052  982,276

When the data is graphed unfiltered it is too much of a blob to be meaningfully interpreted.

Shelley uses her expertise to filter the data so the graph becomes meaningful. The thresholds used for each data set vary; your thresholds may differ from the ones used for my data. Shelley used the chromosome data to identify shared-match pairs who also have at least one overlapping segment of DNA with the focus person. She then filtered the matches to those who share between 50 cM and 1,300 cM with at least one of the five profiles. The closest matches are excluded (people I asked to test and who are related to me on almost all lines). She filtered the connections between matches so that, for matches who share at least 130 cM with at least one kit, all connections are shown.

For pairs of matches below 130 cM, connections are only shown if there is also an overlapping segment of about 12 cM or more between the shared matches and one of the kits. Connections where there is an 'overlap' are darker than connections where there is not. As Shelley states clearly, "This is not triangulation, but works as a reasonable proxy since true triangulation data is not available." True triangulation is not available from the data supplied by FamilyTreeDNA, but may be when using third-party tools.
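
The filtering rules described above can be written down as a short Python sketch. This is my reading of the description, not Shelley's actual code. The function names are invented, and I am assuming a connection is always shown when at least one of the two matches reaches 130 cM.

```python
# Thresholds described for my data; other data sets may use different values.
MIN_CM, MAX_CM = 50.0, 1300.0
CLOSE_CM = 130.0
MIN_OVERLAP_CM = 12.0


def keep_match(cm_by_kit):
    """Keep a match who shares 50 to 1,300 cM with at least one kit.

    cm_by_kit maps a kit label to the total cM shared with that kit.
    """
    return any(MIN_CM <= cm <= MAX_CM for cm in cm_by_kit.values())


def keep_connection(cm_a, cm_b, overlap_cm):
    """Decide whether to draw the line between two shared matches.

    cm_a, cm_b: each match's largest total shared cM across the kits.
    overlap_cm: longest overlapping segment with one of the kits (0 if none).
    """
    if max(cm_a, cm_b) >= CLOSE_CM:
        return True  # assumption: one match at 130 cM or more is enough
    return overlap_cm >= MIN_OVERLAP_CM


print(keep_match({"full1": 40.0, "half": 55.0}))
print(keep_connection(80.0, 60.0, 0.0))
print(keep_connection(80.0, 60.0, 12.5))
```

The overlap test is what stands in for triangulation: two smaller matches are linked only when their segments also overlap with one of the kits.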

These siblings all share a mother. The half sibling has a different father. Three groups of matches emerge:
  • Half sibling only - people likely related on the half sibling's paternal line (red color)
  • Half and full siblings - people likely related on the shared maternal line (green color)
  • Full siblings only - people likely related on the full siblings' paternal line (blue color)
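
The three-way split above can be sketched in Python. The kit labels are hypothetical, and the labels returned are only the likely interpretation; as discussed below, some matches will need investigation.

```python
# Hypothetical kit labels: four full siblings and one half sibling
# who all share a mother.
FULL_KITS = {"full1", "full2", "full3", "full4"}
HALF_KIT = "half"


def classify(matched_kits):
    """Label a match by which sibling kits it appears in."""
    kits = set(matched_kits)
    in_full = bool(kits & FULL_KITS)
    in_half = HALF_KIT in kits
    if in_half and in_full:
        return "green: likely shared maternal line"
    if in_half:
        return "red: likely half sibling's paternal line"
    if in_full:
        return "blue: likely full siblings' paternal line"
    return "no match"


print(classify(["half"]))
print(classify(["full2", "half"]))
print(classify(["full1", "full3"]))
```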

This reveals likely paternal or maternal links and some matches that need investigation. Several clusters contain circles of primarily one color but with a small number of red circles that represent matches to the half sibling only. The completely red clusters are likely the paternal line of the half sibling. A mix of red with green-blue circles may indicate the half sibling inherited a segment that the full siblings did not. The mix could also indicate the half sibling’s paternal line has ancestors shared with the full siblings’ line. Investigation is required.

Match Group Overview Graph

The Match Group Overview graph shows how the groups link together and the numbers assigned to each group. This is similar to the AncestryDNA Group Map described above. The colors and group numbers match those used in the FamilyTreeDNA Match List spreadsheet file.

Group Network Graph

The Group Network Graph includes one circle for each match in a group. This details the matches included in the circles of the Match Group Overview Graph. Just as with the AncestryDNA product, the PDF file has searchable names attached to each circle. Identification of the shared ancestor with one person in a group provides a clue for the likely shared ancestor with others in the group.

Match List

The spreadsheet Match List for this product includes match ID; full name; relationship range and predicted relationship; a list of close ICW matches; whether the match is full sibling only, half sibling only, or both; group number; number of shared matches; longest DNA segment shared; total shared cM; total shared cM with each sibling; ancestral surnames; Y-DNA and mtDNA haplogroups (if included in the DNAGedcom Client data files); notes; and email address for each match. Each column can be filtered to focus on the group under investigation.

Be aware that group numbers are assigned as each ConnectedDNA product is created. This FamilyTreeDNA product is completely separate from the AncestryDNA product discussed above. Therefore, group numbers assigned are different in the two products. The matches in Group 3 of my AncestryDNA product are in Group 21 in my FamilyTreeDNA product. I can easily determine both of these are my Parker paternal line based on matches to people who tested at both AncestryDNA and FamilyTreeDNA.
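
Lining up group numbers between the two products can be sketched in Python. This assumes I have already worked out by hand which people tested at both companies; the names and the extra group numbers are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical: group number per person in each product.
ancestry_group = {"Ann": 3, "Bob": 3, "Cal": 18}
ftdna_group = {"Ann": 21, "Bob": 21, "Cal": 7, "Dee": 21}


def group_crosswalk(groups_a, groups_b):
    """Map each group in product A to its most common partner in product B."""
    votes = defaultdict(Counter)
    for person, group_a in groups_a.items():
        if person in groups_b:  # only people found in both products count
            votes[group_a][groups_b[person]] += 1
    return {ga: counts.most_common(1)[0][0] for ga, counts in votes.items()}


crosswalk = group_crosswalk(ancestry_group, ftdna_group)
print(crosswalk)
```

With only a handful of people in both products, a single misplaced match can skew the result, so treat the crosswalk as a clue rather than a conclusion.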

Some General Guidance on Interpretation

The graphs and spreadsheets provide clues that require investigation before conclusions can be reached. Researchers must realize that random recombination may cause some people to sort into one group genetically when they belong in another group genealogically. This is especially true when a half sibling inherited a segment of DNA from the shared parent that the other siblings did not inherit. Take care not to misinterpret these cases based on erroneous assumptions. In general, a known relative in a group is a strong clue to which part of your family tree that group represents. Random recombination may cause some of these clues to be misleading.

Shelley Crawford gives this good advice on how to use these files:
Some groups may be a mystery to you. Make your way through the matches in those groups, reviewing their trees and looking for common elements. Is there a surname or a place that appears in several of the trees? With a little research effort, you may be able to expand upon the information your matches have provided and find a common ancestor.

I am excited to explore these files more fully and make discoveries to add to my family tree.

Edited 13 January 2019: Modified phrasing to correctly reflect permissions used when the customer asks ConnectedDNA to download the data.

