Tuesday, July 14, 2015

Ancient mtDNA from Xiaohe Cemetery

The Tarim Basin in western China, known for its amazingly well-preserved mummies, has for thousands of years been an important crossroads between the eastern and western parts of Eurasia. Despite its key position in communications and migration, and its highly diverse peoples, languages and cultures, its prehistory is poorly understood. To shed light on the origin of the populations of the Tarim Basin, the authors analysed mitochondrial DNA polymorphisms in human skeletal remains excavated from the Xiaohe cemetery, which was used by the local community between 4000 and 3500 years before present and possibly represents some of the earliest settlers (ref link).

From the raw data provided, I extracted the RSRS markers, plotted them against the mtDNA haplogroup tree and found that the published haplogroup assignments are a bit off. The corrected mtDNA haplogroups are listed below.

Sample number | mtDNA-Hg (HVR-I) | mtDNA-Hg (SNP) | RSRS | Corrected mtDNA-Hg | Ambiguous downstream haplogroups
T18-1 | D | D | A16051G, A16129G, T16187C, C16189T, G16230A, T16278C, C16311T, T16362C | L3 | M9b, U2e, R31b, E2, C1d1c, G2a1e
T18-7 | G2a | G | A16129G, T16187C, C16189T, G16230A, A16293G, T16297C, C16311T, T16362C | A2y |
T22-6 | D | D | A16129G, T16187C, C16189T, G16230A, C16234T, T16278C, C16311T, A16316G, T16362C | M9a1a |
T23-4 | C4 | C4 | A16129G, T16187C, C16189T, T16223C, G16230A, T16278C, T16298C, C16311T, C16327T | L3 | B2g1, H6a1a8, B4b1b, R30b1, HV0, F3, U1b, P3b1, T2f1, C
T24-7 | C4 | C4 | A16129G, T16187C, C16189T, T16223C, G16230A, T16278C, T16298C, C16311T, C16327T | L3 | B2g1, H6a1a8, B4b1b, R30b1, HV0, F3, U1b, P3b1, T2f1, C
T24-12 | M5 | M5 | T16187C, C16189T, G16230A, T16278C, C16311T | L3 | L3, U5b2a1a, G2a1d2
T28-5 | U5a | U | A16129G, T16187C, C16189T, C16192T, T16223C, G16230A, C16256T, C16270T, T16278C, C16311T | U5a |
T28-8 | B5 | B | A16129G, A16182C, A16183C, T16187C, T16217C, T16223C, G16230A, T16243C, T16278C, C16311T, C16355T | R | R0a1a, B4, H11a3, H1o, HV1a1, B5b, R9b1a2, HV2, H5a1c2, F1e1a, H1c4b, F3a, J1c14, T1b, H1b1b
T28-9 | C4 | C4 | A16129G, T16187C, C16189T, T16223C, G16230A, T16278C, T16298C, C16311T, C16327T | L3 | B2g1, H6a1a8, B4b1b, R30b1, HV0, F3, U1b, P3b1, T2f1, C
T29-12 | M | M | T16187C, C16189T, G16230A, T16278C, T16304C, C16311T | L3 | N9a2a1, R22, M27c, U5b3, U5a1d2b, R5, U7a5, M21d, U5b2a2c, L3h2, R0a2c, R21, R9, L3d3a, T2b, R31a, H5, M25, A2y
T35-1 | M | M | A16129G, C16186T, T16187C, C16189T, G16230A, T16278C, T16298C, C16311T, G16319A | M8 | C1c3, M8a
MW | U7 | U | A16129G, T16187C, C16189T, T16223C, G16230A, T16278C, C16311T, A16318T | L3 | U5b2a1a, F1a3a1, F1a2, M39a2, R, M7c3b, L3i1a, L3x2a, D4b2b2b, X2h, D4g2a1a, M76, Y, M7b3, D4m1, M1a3, M27b, D4c1
M12 | H | H | A16129G, T16187C, C16189T, T16223C, G16230A, T16278C, C16311T | L3 | U5b2a1a, F1a3a1, F1a2, M39a2, R, M7c3b, L3i1a, L3x2a, D4b2b2b, X2h, D4g2a1a, M76, Y, M7b3, D4m1, M1a3, M27b, D4c1
M39 | C4 | C4 | T16187C, C16189T, G16230A, T16278C, T16298C, C16311T, C16327T | C |
M55 | C4 | C4 | T16093C, T16187C, C16189T, G16230A, T16278C, T16298C, C16327T | C | C5a2, C4a1, C4b8
M62 | K | K | A16129G, A16183C, T16187C, T16223C, T16224C, G16230A, C16256T, T16278C | R | H14a, F1a1c1, P10, R9c1a, U5b2b3, H3b1, H1x, K, D4c1b, H13a2c, J1c2a2, J1c11, R23, H3ao, U2e1b, U5a, V20, H1e5
Bm1 | C4 | C4 | A16129G, T16187C, C16189T, G16230A, T16278C, T16298C, A16309G, C16311T, C16327T | C |
Bm2 | C4 | C4 | A16129G, T16187C, C16189T, G16230A, T16278C, T16298C, A16309G, C16311T, C16327T | C |
Bm5 | T | T | T16126C, A16129G, T16187C, C16189T, T16223C, G16230A, T16278C, C16292T, C16294T, C16311T | T | T2b3a, T2c
Bm9 | U2e | U | A16051G, A16129C, G16153A, A16182C, A16183C, T16187C, T16223C, G16230A, C16261T, T16278C, C16311T, T16362C | R | U2e, R31b, R7a'b, H8c2, P8
Bm10 | C4 | C4 | A16129G, T16187C, C16189T, G16230A, T16278C, T16298C, A16309G, C16311T, C16327T | C |
BM18 | C5 | C | A16129G, T16187C, C16189T, G16230A, T16278C, T16288C, T16298C, C16311T, C16327T | C5 |
Bm20 | D | D | A16129G, T16187C, C16189T, C16192T, G16230A, C16266T, T16278C, C16311T, T16362C | L3 | D4a3b1, R6a, A2f1, U5b2b1a, D5a2a, A2aa, M7d
Bm22 | U7 | U | A16129G, T16187C, C16189T, T16223C, G16230A, T16278C, C16311T, A16318T | L3 | U5b2a1a, F1a3a1, F1a2, M39a2, R, M7c3b, L3i1a, L3x2a, D4b2b2b, X2h, D4g2a1a, M76, Y, M7b3, D4m1, M1a3, M27b, D4c1
BM24 | C4 | C4 | A16129G, T16187C, C16189T, G16230A, T16278C, T16298C, A16309G, C16311T, C16327T | C |
Bm25 | R | R | A16129G, A16183C, T16187C, T16223C, G16230A, C16261T, T16278C, G16390A | R7a1b2 |
M70 | C4 | C4 | A16129G, T16187C, C16189T, G16230A, T16278C, T16298C, A16309G, C16311T, C16327T | C |
M73 | D | D | A16129G, T16172C, A16183C, T16187C, C16189T, T16209C, G16230A, T16278C, C16311T, T16362C | L3 | M9b, R0a2h, A2q, D1g5, R0a1a1a, D5a2, D4b2b2b, L3f1b2a, R31a, F1a4, N9a5, M20, N10
M75 | C4 | C4 | A16129G, T16187C, C16189T, G16230A, T16278C, T16298C, A16309G, C16311T, C16327T | C |
M87 | C4 | C4 | A16129G, T16187C, C16189T, G16230A, T16278C, T16298C, A16309G, C16311T, C16327T | C |
M89 | C4 | C4 | A16129G, T16187C, C16189T, T16223C, G16230A, T16278C, T16298C, C16311T, C16327T | L3 | B2g1, H6a1a8, B4b1b, R30b1, HV0, F3, U1b, P3b1, T2f1, C
M93 | C4 | C4 | A16129G, T16187C, C16189T, T16223C, G16230A, T16278C, T16298C, C16311T, C16327T | L3 | B2g1, H6a1a8, B4b1b, R30b1, HV0, F3, U1b, P3b1, T2f1, C
M95 | U5a | U | A16129G, T16187C, C16189T, C16192T, T16223C, G16230A, C16256T, C16270T, T16278C, C16291T, C16311T | U5a1b1 |
M99 | C4 | C4 | A16129G, T16187C, C16189T, G16230A, T16278C, T16298C, A16309G, C16311T, C16327T | C |
M129 | C4 | C4 | A16129G, T16187C, C16189T, G16230A, T16278C, T16298C, C16311T, C16327T | C |
M130 | C4 | C4 | A16129G, T16187C, C16189T, G16230A, T16278C, T16298C, A16309G, C16311T, C16327T | C |
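
Several samples in the table carry exactly the same RSRS string (MW and Bm22, for example), and others differ by a single marker. A minimal Python sketch of how such strings can be parsed and compared is given here. It is not the code used for the analysis above: the helper names `parse_mutations` and `group_by_motif` are made up for illustration, and only simple substitutions of the form `A16129G` are handled.

```python
import re
from collections import defaultdict

# One substitution such as "A16129G": reference base, position, observed base.
MUTATION = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def parse_mutations(text):
    """Turn 'A16129G, T16187C' into a frozenset of (position, ref, alt)."""
    result = set()
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate a trailing comma
        match = MUTATION.match(token)
        if match is None:
            raise ValueError(f"unrecognised mutation: {token!r}")
        ref, pos, alt = match.groups()
        result.add((int(pos), ref, alt))
    return frozenset(result)


def group_by_motif(samples):
    """Group sample names that share an identical set of mutations."""
    groups = defaultdict(list)
    for name, text in samples.items():
        groups[parse_mutations(text)].append(name)
    return dict(groups)


# RSRS strings copied from four rows of the table.
samples = {
    "MW": "A16129G, T16187C, C16189T, T16223C, G16230A, T16278C, C16311T, A16318T",
    "Bm22": "A16129G, T16187C, C16189T, T16223C, G16230A, T16278C, C16311T, A16318T",
    "M12": "A16129G, T16187C, C16189T, T16223C, G16230A, T16278C, C16311T",
    "M39": "T16187C, C16189T, G16230A, T16278C, T16298C, C16311T, C16327T",
}

groups = group_by_motif(samples)
for motif, names in groups.items():
    print(sorted(names), len(motif), "markers")

# Markers present in MW but absent from M12.
extra = parse_mutations(samples["MW"]) - parse_mutations(samples["M12"])
print(sorted(extra))
```

Sets are used because the order of the markers within a row carries no meaning; two rows describe the same HVR-I motif whenever their sets are equal.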

The raw mtDNA data (provided by the authors) can be downloaded from ncbi.nlm.nih.gov. My complete analysis, plotted on the mtDNA PhyloTree, can be downloaded from here (download the HTML files and view them in any web browser).

Edit: The authors explain (page 4 column 1) that they assigned haplogroups using the signature SNPs and then looked at the HVR-Is. However, I can't find any raw data uploaded/submitted to suggest those samples have the signature SNPs.

Monday, July 6, 2015

BAM Analysis Kit version 1.7 released

After nearly a year, and with many lessons learnt from processing hundreds of BAM files, I finally decided to upgrade BAM Analysis Kit and make it faster and as accurate as possible.

The first and foremost improvement is that all bundled software is upgraded and the latest patched genome is used. Hence, the human genome GRCh37.75 is used. I don't want to move to human genome build 38 because it has not been adopted by DNA testing companies yet. SAMtools, lobSTR, dbSNP, GATK and Picard are all upgraded to their latest versions.

Several improvements have been made to extract rCRS and RSRS mutations exactly. The tool also identifies mtDNA and Y-DNA haplogroups without having to post-process the mutations. A simple SNPedia report is also generated using the annotation dump available from OpenSNP. The download size and operating size are reduced by half. The tool also processes population admixture using dv3, globe13 and eurogenes36.

To calculate Y-STR values, lobSTR is used. However, based on past experience, CODIS wasn't useful and has been removed, since it took too much time and added no value to genealogy. Hence, lobSTR is now optimized to process Y-STR markers only. I am also in the process of building my own Y-STR kit to extract Y-STR values and adjust the motif/nomenclature so that they align with FamilyTreeDNA values, which are used by most genetic genealogists.

Let me know if there are any bugs or issues. Feedback and suggestions are also welcome.

Monday, June 22, 2015

RISE150 - Ancient DNA has matches with living people

RISE150, an ancient DNA sample from Przeclawice, Poland, which is 3469 years old, has been uploaded to GEDmatch as F999948. Interestingly, it matches a few living people.

GEDmatch
Largest Segment
The largest segment matching a living person is around 8.3 cM (using a 5 cM / 500 SNP threshold).
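
A threshold such as 5 cM / 500 SNPs means a shared segment is only counted when it is at least that long and covers at least that many SNPs. The sketch below illustrates the filtering idea in Python. It is not GEDmatch's matching algorithm, the `Segment` class and both helper functions are hypothetical, and the segment values are invented for illustration; they are not taken from the RISE150 results.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    chromosome: str
    length_cm: float  # genetic length in centimorgans
    snp_count: int    # number of SNPs covered by the segment


def filter_segments(segments, min_cm, min_snps):
    """Keep segments that satisfy both the cM and the SNP-count threshold."""
    return [s for s in segments
            if s.length_cm >= min_cm and s.snp_count >= min_snps]


def largest_segment(segments):
    """Return the longest segment, or None when the list is empty."""
    return max(segments, key=lambda s: s.length_cm, default=None)


# Invented example data, NOT real match results.
segments = [
    Segment("3", 8.3, 900),
    Segment("7", 5.4, 510),
    Segment("11", 6.0, 450),   # long enough, but too few SNPs
    Segment("15", 3.1, 800),   # enough SNPs, but too short
]

relaxed = filter_segments(segments, min_cm=5.0, min_snps=500)
default = filter_segments(segments, min_cm=7.0, min_snps=700)

print(len(relaxed), len(default))
print(largest_segment(relaxed))
```

Lowering the thresholds lets shorter segments through, which matters for ancient samples where long shared segments are not expected.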

Mt-DNA
The mtDNA haplogroup is U5a1b1.

Admixture
MDLP
Eurogenes
For more Ancient DNA, please visit - http://www.y-str.org/p/ancient-dna.html

RISE395 - Ancient DNA has matches with living people

RISE395, an ancient DNA sample from Bol'shekaraganskii, Russia, which is 3540 years old, has been uploaded to GEDmatch as F999949. Interestingly, it matches a few living people.

GEDmatch

Largest Segment
The largest segment is 7.9 cM (using a 5 cM / 500 SNP threshold).

Mt-DNA
The mtDNA haplogroup is U2e1

Admixture
Eurogenes

MDLP
For more Ancient DNA, please visit - http://www.y-str.org/p/ancient-dna.html

RISE505 - Ancient DNA has matches with living people

RISE505, an ancient DNA sample from Kytmanovo, Russia, which is 3391 years old, has been uploaded to GEDmatch as F999953. Interestingly, it matches living people.

GEDmatch
Largest Segment
The largest segment is 8 cM (using the default 7 cM / 700 SNP threshold).

Mt-DNA
The mtDNA haplogroup is U4a1b

Admixture
Eurogenes

MDLP

For more Ancient DNA, please visit - http://www.y-str.org/p/ancient-dna.html