Friday, January 23, 2015

Domain Change Update!

Since I changed my name to Felix Immanuel, I decided to change my personal domain as well to reflect the new name. Hence, I changed my domain from fc.id.au to fi.id.au, and my email address has changed with it; the new address is on the contacts page. However, I will still receive emails at my old address and forward them to the new one for as long as I can. The website at fc.id.au will be forwarded to fi.id.au. Unfortunately, I wasn't able to retain comments, likes and follows from my personal blog due to this change. In summary, the domain change is a smooth and slow transition and will not have any impact on others.

Wednesday, January 14, 2015

How to choose thresholds for segment matches?

Choosing thresholds for segment matches is one of the greatest challenges for a genetic genealogist. Thresholds used by different people vary from 200 SNPs / 2 cM to 700 SNPs / 7 cM. The underlying question is: what is the logic behind them? Why is a threshold set at a particular SNP/cM pair of values? What is the relation between SNPs and cM? These questions are crucial in deciding why there is a threshold and which threshold is right for a genetic genealogist.

Before proceeding with this post, you need to be familiar with noise thresholds and the SNP density of a kit.

Noise Thresholds

A segment that is just noise cannot occur above the 150 SNPs / 1 Mb threshold. The mathematics behind how this was arrived at can be read in The true IBS noise range and Noise threshold on atDNA Matches. I also tested this in GEDmatch, and you can see the result in the post IBS Noise Kit at GEDMatch. 1 Mb is roughly equal to 1 cM. This is one of the reasons FamilyTreeDNA includes segments as small as 1 cM in total shared DNA.

SNP Density

It is the number of SNPs per cM. SNP density varies between kit versions from different companies. E.g., if we take an FTDNA kit having 700k SNPs, then the number of SNPs per cM is 700000 SNPs / 3600 cM ~ 194 SNPs per cM.
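
As a quick check of the arithmetic above, here is a minimal Python sketch. The function name snp_density is my own choice for illustration, and the 3600 cM genome length is simply the figure used in the example above.

```python
def snp_density(total_snps, genome_cm=3600):
    """Return the average number of SNPs per cM for a kit."""
    return total_snps / genome_cm

# FTDNA kit with roughly 700k SNPs, as in the example above
density = snp_density(700_000)
print(round(density))  # 194
```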

Below are the SNP densities for different kit versions:

KIT VERSION         SNP DENSITY (per cM)
23andMe V2          157
23andMe V3          265
23andMe V4          168
Ancestry            194
FTDNA Affymetrix    160
FTDNA Illumina      194

Motive

Most DNA companies advertise matches only up to 5 generations. In other words, the common ancestor for a 4th cousin is 5 generations back, and they are not interested in going beyond that.

FTDNA: Up to 5 generations. Anything beyond is reported as 5th to remote cousins.

23andMe: Up to 7 generations, but restricted to 7 cM, which means that only very rarely are a few lucky 6th cousins identified.

Ancestry: 5th cousins, i.e. up to 6 generations.

Hence, any thresholds provided by DNA testing companies will always target what they advertise. This is one of the reasons DNA companies have different thresholds and their own internal logic for confirming a match.

Autosomal DNA undergoes recombination. This recombination breaks a single strand into 2 to 6 segments of unequal length each time. Hence, segment length varies greatly as the cousin relationship becomes more distant. I would like to point to the FamilyTreeDNA FAQ.

How accurate is Family Finder’s Relationship Range?
I would also like to point to two blogs:
Hence, as the common ancestor goes further back in time, the IBD segments shared with cousins also vary significantly. There is no known algorithm to accurately predict how far back in time the common ancestor is.

The table below is from ISOGG - Identical by Descent, provided by Tim Janzen.

Relationship                    Range            Expected    Range of number of shared segments
Parent/child                    3539-3748 cMs                23-29
First cousins                   548-1139 cMs     888 cMs     17-32
First cousins once removed      220-638 cMs      444 cMs     12-23
Second cousins                  86-426 cMs       222 cMs     10-18
Second cousins once removed     19-197 cMs       111 cMs     4-12
Third cousins                   16-111 cMs       55.4 cMs    2-6?
Third cousins once removed      0-99 cMs         27.8 cMs    1-4
Fourth cousins                  0-54 cMs         13.8 cMs    0-2

As you can see, the range of shared DNA for fourth cousins varies greatly. Not only do they have very few segments (0-2), but some of those segments also fall well below 7 cM, the threshold used by several DNA companies.

There is also a level of false-positive tolerance: if a DNA company says a match is a 3rd cousin, but in reality it is a 10th cousin or vice-versa, the customer may not be very happy using their product. However, if the same company says the match is a 2nd or 3rd cousin, but in reality it is a 4th cousin, the customer will not be as disappointed as in the previous scenario. The motive of any DNA testing company is to make their customers happy, and in the process they are happy to eliminate even some valid close cousin matches because of their smaller segments, in favor of completely eliminating very distant cousins who don't add any value to most pedigrees. As you can see from the above table, a fourth cousin is expected to share 13.8 cM in up to 2 IBD segments, and the total can range from 0 to 54 cM. To confirm a match with reasonable accuracy, the only way is to stop at a threshold which has only recent cousins. 23andMe found that 7 cM yields mostly recent cousins and started using it. This, however, does not mean that a segment shorter than 7 cM cannot come from a recent cousin, nor that such a segment is invalid.

If you look at another FTDNA FAQ, How much genetic sharing is needed for two people to be considered a Family Finder match?
For the program to consider two people a potential match, the largest matching DNA segment between two people must be at least 5.5 centiMorgans (cM) long.* The program then uses additional matching segments to confirm the relationship and to calculate the degree of relatedness. 
Based on the extensive Family Finder database, it is rare for two genuine genealogical cousins to have a largest shared segment of less than 7 cM and one less than 6 cM is exceptional. 
*updated from 5 cM 15 Aug. 2012
As you can see, the largest-segment threshold used by FTDNA is 5.5 cM (up from the 5 cM used 2.5 years back). 'Genuine genealogical cousins' could mean very different things to different people, depending on what 'genealogical time-frame' means: the common ancestor of 4th cousins might have lived just 150 years or 5 generations back, while a genealogical time-frame could mean 500 years back. With reference to FTDNA, I believe they are referring here only to 4th-5th cousins, as advertised.

Thanks to GEDmatch, which allows matching down to 1 cM.

How is the threshold arrived at?

As we saw previously, most DNA companies narrowed and tuned their thresholds to get more accurate results only up to 5 to 7 generations back. Based on their research, they found the segment length to be 5.5 cM to 7 cM. However, segment length (cM) is just one half. The other half is SNPs. This is where SNP density comes in. For 2 FTDNA Illumina kits to match each other at 7 cM, they both have 7 cM * 194 SNPs per cM = 1358 SNPs. Hence, for matching 2 FTDNA Illumina kits, thresholds can be comfortably kept at 7 cM / 1300 SNPs. This is why you will notice relatively high SNP counts for segments if the kits are of the same version.
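
The same multiplication can be repeated for every kit version. This is only a sketch: the densities are copied from the SNP density table earlier in this post, and the name expected_snps is mine, not something from GEDmatch or any testing company.

```python
# SNP density (SNPs per cM) per kit version, copied from the table earlier in this post
SNP_DENSITY = {
    "23andMe V2": 157,
    "23andMe V3": 265,
    "23andMe V4": 168,
    "Ancestry": 194,
    "FTDNA Affymetrix": 160,
    "FTDNA Illumina": 194,
}

def expected_snps(segment_cm, density_per_cm):
    """SNPs expected in a segment of the given length at the given density."""
    return segment_cm * density_per_cm

# How many SNPs two kits of the same version would share in a 7 cM segment
for kit, density in SNP_DENSITY.items():
    print(f"{kit}: {expected_snps(7, density)} SNPs in a 7 cM segment")
```

For FTDNA Illumina this gives 7 * 194 = 1358 SNPs, the figure used above.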

But in reality, a segment doesn't need such a high SNP density to be confirmed as valid. As we saw earlier, the noise threshold is 150 SNPs, which means noise cannot occur above this threshold. Can we then have 150 SNPs / 7 cM? Yes and no. If we reduce the SNP threshold, then adjacent IBD segments will start to merge and create a longer segment, a 'segment merging effect'. This merge effect is not caused just by reducing the SNP threshold, but by SNPs not being available to check in between the two adjacent IBD segments - i.e., incompatible DNA kit versions have no-calls in between, which are assumed to be matching. The reduced SNP threshold simply aids in matching a longer segment, creating a merge effect with such incompatible DNA kit versions.


This does not mean we can't reduce the SNP threshold. Reducing the SNP threshold between two kits of the same version (or kits that have the same SNPs), while maintaining a higher segment length threshold, will give matches similar to those from higher SNP thresholds. Hence, there is no problem in matching at lower SNP thresholds as long as the kit versions are the same and the allowed errors are kept in check. E.g., 700 SNPs / 7 cM will give similar results to 150 SNPs / 7 cM because a segment cannot accidentally occur at 150 SNPs (150 SNPs is more than enough to confirm the validity of a segment). The prime reason for this is that kits of the same version don't have no-calls in between and have a higher overlapping SNP density - hence, any mismatch in these SNPs will simply break the segment.

How much overlapping SNP density is required for two different kit versions? What is the rule of thumb for thresholds?

Any segment that is not noise will always exceed the 150 SNPs / 1 cM threshold. If we truly look into such a segment, the common ancestor for this small segment could be thousands of years back, because segment length with respect to generations to the common ancestor follows an exponential decay curve. It could also represent a 4th cousin, as we saw earlier. Hence, 1 cM is as important as 7 cM and should be eliminated in favor of recent cousins only after careful investigation.

If we start off with 1 cM (having 150 SNPs) and increase the threshold to 2 cM, to have the same SNP density, we must have 300 SNPs. However, not many pairs of kit versions have enough overlapping SNPs to reach 300 SNPs in 2 cM (i.e. an overlap density of 150 SNPs per cM).

Below are the overlapping SNP densities for different pairs of kit versions:

KIT VERSION 1       KIT VERSION 2       OVERLAP SNP DENSITY (per cM)
23andMe V2          23andMe V3          148
23andMe V2          23andMe V4          127
23andMe V2          Ancestry            88
23andMe V2          FTDNA Affymetrix    40
23andMe V2          FTDNA Illumina      90
23andMe V3          23andMe V4          145
23andMe V3          Ancestry            188
23andMe V3          FTDNA Affymetrix    62
23andMe V3          FTDNA Illumina      194
23andMe V4          Ancestry            85
23andMe V4          FTDNA Affymetrix    40
23andMe V4          FTDNA Illumina      86
Ancestry            FTDNA Affymetrix    49
Ancestry            FTDNA Illumina      193
FTDNA Affymetrix    FTDNA Illumina      50
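
The table above can be summarised with a few lines of Python. This is a sketch only: the values are copied from the table, the variable names are mine, and the 100 SNPs per cM cut-off is the rule of thumb used in this post rather than anything published by the testing companies.

```python
# Overlap SNP density (per cM) for each pair of kit versions, copied from the table above
OVERLAP = {
    ("23andMe V2", "23andMe V3"): 148,
    ("23andMe V2", "23andMe V4"): 127,
    ("23andMe V2", "Ancestry"): 88,
    ("23andMe V2", "FTDNA Affymetrix"): 40,
    ("23andMe V2", "FTDNA Illumina"): 90,
    ("23andMe V3", "23andMe V4"): 145,
    ("23andMe V3", "Ancestry"): 188,
    ("23andMe V3", "FTDNA Affymetrix"): 62,
    ("23andMe V3", "FTDNA Illumina"): 194,
    ("23andMe V4", "Ancestry"): 85,
    ("23andMe V4", "FTDNA Affymetrix"): 40,
    ("23andMe V4", "FTDNA Illumina"): 86,
    ("Ancestry", "FTDNA Affymetrix"): 49,
    ("Ancestry", "FTDNA Illumina"): 193,
    ("FTDNA Affymetrix", "FTDNA Illumina"): 50,
}

# Average overlap density across all 15 pairs
mean_density = sum(OVERLAP.values()) / len(OVERLAP)
print(round(mean_density, 1))  # 105.7

# Pairs whose overlap density is at least 100 SNPs per cM
at_least_100 = sorted(pair for pair, d in OVERLAP.items() if d >= 100)
for kit1, kit2 in at_least_100:
    print(kit1, "/", kit2, OVERLAP[(kit1, kit2)])
```

Only 6 of the 15 pairs reach 100 SNPs per cM, which is why the density of each matching segment has to be checked and not only the threshold.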

The average overlap SNP density is ~100 SNPs per cM. Blue are highly compatible, red are marginally compatible and gray are not compatible. However, we need to make the incompatible kits compatible, and the only way is to raise the threshold so that segments cannot occur accidentally and yet can be validated. How can we do this? The thresholds must follow two rules:

  • Number of SNPs in the segment must be greater than 150.
  • The SNP density must be 100 SNPs per cM or above.

While 150 SNPs will make sure the segment doesn't occur accidentally, the SNP density per cM will make sure it doesn't merge with nearby IBD segments because of no-calls assumed as matching. Hence, beyond setting the thresholds, we must also verify that each matching segment in the result has the right SNP density (100 SNPs per cM). In other words, to match a 3 cM segment you need 300 overlapping SNPs, for a 5 cM segment 500 SNPs, and for 25 cM 2500 SNPs. I will demonstrate this with an example.

As we saw above, FTDNA Illumina and Ancestry are highly compatible, with an overlap SNP density of 193 per cM. Comparing them provides two segments.

  • 8.5 cM requires 850 SNPs and above to confirm the segment (but has significantly more - 1727 SNPs, hence valid). 
  • 16.3 cM requires 1630 SNPs and above to confirm the segment (but has significantly more - 3483 SNPs, hence valid)



However, if the same person tested with a kit version that is not compatible, let's see how it shows. Lisa did her 23andMe v4 test. Comparing Lisa's 23andMe v4 and Ancestry kits, we have the match below. As we saw in the above table, 23andMe v4 and Ancestry are not compatible kits and have a very low overlap SNP density. This provides only 1 matching segment.

25.2 cM requires 2520 SNPs and above (but has only 2361 SNPs, hence invalid).
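
The check applied to the three segments above can be sketched in Python. The two constants are the rules of thumb from this post (more than 150 SNPs, and at least 100 SNPs per cM); the function names are my own and not part of any GEDmatch tool.

```python
MIN_SNPS = 150      # noise threshold: a segment needs more SNPs than this
MIN_DENSITY = 100   # minimum overlapping SNPs per cM

def required_snps(segment_cm):
    """SNPs needed for a segment of this length to meet the density rule."""
    return MIN_DENSITY * segment_cm

def is_valid_segment(segment_cm, snps):
    """Apply both rules: above the noise threshold and dense enough."""
    return snps > MIN_SNPS and snps >= required_snps(segment_cm)

# (comparison, segment length in cM, SNPs in segment) from the examples above
segments = [
    ("FTDNA Illumina vs Ancestry", 8.5, 1727),
    ("FTDNA Illumina vs Ancestry", 16.3, 3483),
    ("23andMe v4 vs Ancestry", 25.2, 2361),
]

for label, cm, snps in segments:
    verdict = "valid" if is_valid_segment(cm, snps) else "invalid"
    print(f"{label}: {cm} cM needs {round(required_snps(cm))} SNPs, has {snps} -> {verdict}")
```

The first two segments pass and the 25.2 cM segment fails, the same result as worked out by hand above.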



This is true of all incompatible kits and is extremely helpful for weeding out false-positive segments. It is also extremely helpful in eliminating false positives in smaller segments, e.g., 2 cM. A 2 cM segment can occur in a 4th cousin or can have a common ancestor thousands of years back.


As you can see in the above, all matching segments that pass the SNP density check on the right also match on the left, even though the right is between two non-compatible kits. However, of the 2 segments that do not pass the SNP density check, only 1 is found on the left.

Conclusion

Choosing a threshold is an important task for genetic genealogy. To summarize, the threshold must have the following properties:
  • Number of SNPs in the segment must be greater than 150.
  • The SNP density must be 100 SNPs per cM or above.
Hence, if you decide to match at 7 cM, make sure you use 700 SNPs as the threshold, and don't forget to validate each matching segment for SNP density.

A word of caution: just because a segment has a lower SNP density does not always mean it is merged. It simply means that the possibility of it being merged from two segments is extremely high, and verifying whether such segments are merged is not easy.


Friday, January 9, 2015

Match differences in DNA testing companies

FTDNA recently announced autosomal transfer of 23andMe v3 kits but not the new v4. There has been some discussion in the FTDNA forums as well. Hence, I decided to investigate and compare matches across DNA testing companies. To do so, I first needed a volunteer who had tested with all 3 DNA testing companies. I would like to thank Lisa for allowing me to use her kits in this analysis and post the results here. Lisa did her DNA testing with FTDNA, Ancestry and 23andMe (v4).

Below are all the different kits in GEDmatch which belong to the same person.

  • FTDNA Kit: F317230
  • Ancestry Kit: A645520
  • 23andMe v4: M155233

This allows us to actually test how well each company's kit compares with the other companies' kits. Most people don't notice the difference easily, but the difference itself is substantial.

One-to-Many

Matching kits indicating difference.

The above Excel sheet can be downloaded from here: FTDNA_vs_Ancestry_vs_23andMeV4.xlsx

Observations

  • There are a total of 1282 kits overlapping each other.
  • 197 kits do not match between 23andMe v4 and Ancestry
  • 48 kits do not match between FTDNA and Ancestry
  • 199 kits do not match between FTDNA and 23andMe v4
  • Difference in total segment length between FTDNA and Ancestry is 14.1 cM
  • Difference in total segment length between FTDNA and 23andMeV4 is 64.9 cM
  • Difference in total segment length between Ancestry and 23andMeV4 is 68.7 cM
  • Difference in largest segment length between FTDNA and Ancestry is 5.7 cM
  • Difference in largest segment length between FTDNA and 23andMeV4 is 8.8 cM
  • Difference in largest segment length between Ancestry and 23andMeV4 is 8.8 cM
  • The 23andMe V4 kit's one-to-many matches have several V4 kits at the top, while the same kits are pushed to the bottom in the FTDNA and Ancestry kits' one-to-many matches

One-to-One

A random match from the top of each list is picked for one-to-one comparison with all 3 kits to see how the segments agree. To do this, the following kits were selected from the top.
  • FTDNA Kit: F112962
  • Ancestry Kit: A578161
  • 23andMeV4 Kit: M371343
  • 23andMeV3 Kit: M223652
One-to-one comparison is done between the above sample matches from FTDNA, Ancestry and 23andMe V4 and Lisa's kits from the 3 DNA testing companies.

Comparison with 23andMe v4

Comparison with Ancestry

Comparison with FTDNA

Comparison with 23andMe v3

There are a few kits that match reasonably well. However, most of the samples don't match.

Conclusion

FTDNA and Ancestry kits are perfectly compatible with each other. They are also compatible with 23andMe v3. However, 23andMe v4 kits are not compatible with FTDNA, Ancestry or 23andMe v3. Segments seem to drop off when these kits are compared with 23andMe v4, creating many differences in segment lengths and total segments. Also, comparing a 23andMe v4 kit with other 23andMe v4 kits seems to add segments not found when comparing with FTDNA and Ancestry. Caution must be taken when using 23andMe v4 matches, as they do not always agree with FTDNA and Ancestry kit matches.


Thursday, January 8, 2015

The fallacy of Identity by Descent (IBD)

The general definition of IBD as per ISOGG is as follows:
An IBS segment is identical by descent (IBD) in two or more individuals if they have inherited it from a common ancestor without recombination, that is, the segment has the same ancestral origin in these individuals. (isogg.org)
Technically, it is a misleading term because it makes it seem as though any segment that has undergone recombination is not through descent. Just because two cousins belonging to the same lineage marry each other, does that mean the recombined segments in their descendants are not through descent?

Was the above definition the same all along? Absolutely not! Let's go back in time to see what happened and how this definition came into existence, and let's see if the above definition is still accepted by all.

Inherited from Common Ancestor

Below are papers that refer to IBD as just inherited from a common ancestor.

Year 1994
Whittemore, Alice S., and Jerry Halpern. "Probability of gene identity by descent:
computation and applications." Biometrics (1994): 109-117.

Year 2001
Zhao, Hongyu, and Feng Liang. "On relationship inference using gamete identity by descent data."
Journal of Computational Biology 8.2 (2001): 191-200.

Year 2008
Browning, Sharon R. "Estimation of pairwise identity by descent from dense genetic
marker data in a population sample of haplotypes." Genetics 178.4 (2008): 2123-2132.

Year 2012

Su, Shu-Yi, et al. "Detection of identity by descent using next-generation
whole genome sequencing data." BMC bioinformatics 13.1 (2012): 121.

Year 2014

Even last year, some papers did not include recombination in the definition of IBD.
Rodriguez, Jesse M., et al. "Parente2: A fast and accurate method for
detecting identity by descent." Genome research (2014): gr-173641.

I have given just a few, but it is evident that not everyone in the scientific community agrees with the IBD definition that includes 'without recombination'. So, let's see how 'without recombination' was added to the definition of IBD.

Without Recombination

How did "without recombination" get attached to the definition of IBD?

Year 2003

The paper below was the first to propose a definition of IBD that includes 'without recombination'.

Hayes, Ben J., et al. "Novel multilocus measure of linkage disequilibrium to
estimate past effective population size." Genome Research 13.4 (2003): 635-643.

The purpose of the above paper was to estimate the historical population size of humans and other species. For this purpose, the paper is absolutely correct in using Identity By Descent without recombination, because it achieves what they are after, 'historical population size', by counting only directly related ancestors.


This definition is wrongly taken and misused by several recent authors for an entirely different purpose - genetic genealogy.

All subsequent papers cited the above correctly as intended for the next 5 years.
  • Wang, Jinliang. "Estimation of effective population sizes from data on genetic markers." Philosophical Transactions of the Royal Society B: Biological Sciences 360.1459 (2005): 1395-1409.
  • Meuwissen, Theo HE, and Mike E. Goddard. "Multipoint identity-by-descent prediction using dense markers to map quantitative trait loci and estimate effective population size." Genetics 176.4 (2007): 2551-2560.

Year 2010

Then, someone came along and messed up the definition by treating IBD alleles and chromosome-segment IBD differently, misquoting the 2003 paper, which was intended for a different purpose.
Powell, Joseph E., Peter M. Visscher, and Michael E. Goddard. "Reconciling the analysis
of IBD and IBS in complex trait studies." Nature Reviews Genetics 11.11 (2010): 800-805.

The above definition was picked up by several other authors in the last few years, who began to use it exclusively for genetic genealogy.

Year 2014
Durand, Eric Y., Nicholas Eriksson, and Cory Y. McLean. "Reducing Pervasive False-Positive Identical-by-Descent
Segments Detected by Large-Scale Pedigree Analysis." Molecular biology and evolution (2014): msu151.

Is IBD definition used correctly in genetic genealogy today?
No. Identity by Descent (IBD) was always intended to find 'linkage disequilibrium', or how two people are related. That's the whole purpose of genetic genealogy. It was never intended to match only a few people in a pedigree. Recombination does not nullify linkage disequilibrium, and the segment still comes from a single ancestral chromosome. IBD segments without recombination, aided by phasing, can help identify which segment came from whom by eliminating the cousin-marriage lineage and looking at direct lineages. This does not mean that non-IBD segments, which are IBS segments, are random/accidental occurrences.

Let's look into the definition of IBS:
Identical by state or identity by type is a term used in genetics to describe two identical alleles or two identical segments or sequences of DNA. In genetic genealogy the term IBS is generally used to describe segments which are not identical by descent and therefore do not share a recent common ancestor. IBS is also used to describe small half-identical regions which are shared by many people both within and between populations and which have no genealogical relevance. (isogg.org)
The definition of IBS is clearly wrong and misleading. IBS segments do share a common ancestor. Just because a person's parents are first cousins, does that mean the segment a child inherits from his own great-grandparent through recombination between his parents is not by descent? It is the noise that isn't from a common ancestor. As the thresholds are increased, IBS cannot occur accidentally.

IBS within 23andMe v4 IBD

Several bloggers and genetic genealogists believe that segments at lower thresholds like 2 to 5 cM can occur randomly. Based on an experiment in GEDmatch with an uploaded random kit, and several other experiments, a random segment cannot match beyond the 150 SNPs / 1 Mb threshold. Hence, any segment that is greater than 200 SNPs / 2 cM cannot occur randomly.
Many genetic genealogists think that small segments can occur randomly, but 23andMe thinks otherwise, because it uses imputation exclusively for matching v3 and v4.

Have you wondered how 23andMe matches v3 and v4 kits with one another, when one has just half the SNPs of the other? Through imputation. Has anyone actually calculated how many "segments" are imputed, what thresholds they meet and what the consequences are?

I will demonstrate this now for chromosome 1 alone, where ~45 IBS segments are imputed as matching.

CHR START END LENGTH SNPS
1 776546 998395 0.57 cM 33
1 1255130 1308982 0.20 cM 14
1 1314172 1418112 0.39 cM 15
1 1439671 1458567 0.14 cM 4
1 1510801 1691050 0.68 cM 21
1 1707740 1731756 0.12 cM 6
1 1762014 1776269 0.12 cM 2
1 1812521 1829482 0.11 cM 3
1 1900232 1959022 0.23 cM 8
1 1992748 2017297 0.12 cM 3
1 2564183 2587026 0.47 cM 2
1 5834437 5870534 0.10 cM 4
1 12391953 12418700 0.11 cM 2
1 12429611 12496021 0.22 cM 7
1 12915847 13374687 0.72 cM 5
1 17202355 17261738 0.22 cM 2
1 24206985 24365512 0.17 cM 34
1 29870952 30039872 0.30 cM 3
1 35174826 35301579 0.18 cM 46
1 38279987 38423588 0.35 cM 29
1 104129880 104309132 0.12 cM 6
1 109549512 109643160 0.12 cM 19
1 117126812 117164591 0.12 cM 4
1 121346616 144474542 0.69 cM 8
1 145748199 146089268 0.51 cM 4
1 147382550 147729232 0.20 cM 9
1 147826789 149039031 0.45 cM 3
1 149248927 149815323 0.11 cM 7
1 160238232 160251572 0.12 cM 3
1 167973637 168036735 0.13 cM 5
1 203430997 203571766 0.30 cM 33
1 206190902 206223556 0.47 cM 6
1 206232798 206559362 0.46 cM 26
1 207684359 207696717 0.10 cM 4
1 213273769 213318765 0.10 cM 6
1 222338759 222456692 0.13 cM 14
1 222619902 222762128 0.13 cM 28
1 222927951 223283653 0.28 cM 59
1 235137472 235190683 0.27 cM 3
1 235273179 235334252 0.18 cM 13
1 235988566 235997314 0.11 cM 5
1 242106296 242200884 0.28 cM 15
1 243147031 243298257 0.25 cM 5
1 243826674 243853830 0.14 cM 4
1 243954407 243976837 0.13 cM 2

Data for the above table: 23andMe_SNPs.xlsx
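
To put numbers on the table above, here is a small Python sketch. The (length, SNP count) pairs are copied row by row from the table; the variable names are mine, and the 150 SNPs / 1 cM figures are the noise threshold discussed earlier, used here only as a yardstick.

```python
# Imputed chromosome 1 segments from the table above: (length in cM, SNP count)
SEGMENTS = [
    (0.57, 33), (0.20, 14), (0.39, 15), (0.14, 4), (0.68, 21),
    (0.12, 6), (0.12, 2), (0.11, 3), (0.23, 8), (0.12, 3),
    (0.47, 2), (0.10, 4), (0.11, 2), (0.22, 7), (0.72, 5),
    (0.22, 2), (0.17, 34), (0.30, 3), (0.18, 46), (0.35, 29),
    (0.12, 6), (0.12, 19), (0.12, 4), (0.69, 8), (0.51, 4),
    (0.20, 9), (0.45, 3), (0.11, 7), (0.12, 3), (0.13, 5),
    (0.30, 33), (0.47, 6), (0.46, 26), (0.10, 4), (0.10, 6),
    (0.13, 14), (0.13, 28), (0.28, 59), (0.27, 3), (0.18, 13),
    (0.11, 5), (0.28, 15), (0.25, 5), (0.14, 4), (0.13, 2),
]

count = len(SEGMENTS)
longest_cm = max(cm for cm, _ in SEGMENTS)
most_snps = max(snps for _, snps in SEGMENTS)
total_cm = sum(cm for cm, _ in SEGMENTS)
total_snps = sum(snps for _, snps in SEGMENTS)

print(count)       # 45
print(longest_cm)  # 0.72
print(most_snps)   # 59
print(total_snps)  # 534
print(round(total_cm, 2))

# Every one of these segments is below the 150 SNPs / 1 cM noise threshold
all_below_noise = all(snps < 150 and cm < 1 for cm, snps in SEGMENTS)
print(all_below_noise)  # True
```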

The above look like the tiny segment matches we often tend to ignore as noise. Guess what? The above tiny segments are assumed to be matching between 23andMe v3 and v4 for chromosome 1 alone - one segment is as long as 0.72 cM and another has an SNP count as high as 59 - completely guessed. For a v4 kit to match a v3 kit, IBS segments are taken from totally unrelated people to fill in the missing bits by imputation, and a sequence of broken segments joined by these imputed IBS segments is presented to a 23andMe user as one long matching IBD segment - and the 23andMe user has absolutely no idea that only half of the long matching segment actually matches and the rest is assumed using IBS from an unrelated population.

If chromosome 1 alone requires 45 imputed IBS segments, how many IBS segments must be imputed for the entire set of 22+1 chromosomes to match between a v3 and a v4 kit? Thus the fallacy of the IBD segment 'without recombination', proudly presented with IBS from an unrelated population.