Saturday, January 31, 2015

Designer 101 for every evolutionist!

I have my own thoughts on evolution, and especially on the arguments put forward in its favor using genetics. In this post, I will address these arguments using the standard procedures and methodologies that most software companies use for designing, developing, testing and rolling out software.

Genetic similarity of humans with primates

One of the most common answers I come across is the genetic similarity of humans with primates. To understand what this statement assumes, we must analyze what it means. The assumption is that only the genes are considered and that the rest, the so-called junk DNA, is useless. Hence, the genes (taken to be the only biologically significant part) are said to be up to 90% identical.

A gene is the basic physical and functional unit of heredity. Genes, which are made up of DNA, act as instructions to make molecules called proteins. Every developer has a set of libraries. Most developers don't write these libraries of common functions from scratch each time. They put all the common functions in one DLL or library and link it, either statically or dynamically, with every program that requires it. Hence, a gene is comparable to a function or a set of functions, more like a library or DLL file, used in all the organisms that require it. Even though Microsoft Word and Excel load the exact same set of common libraries, call the same functions and have a similar look and feel, they function differently, and both are designed. Using the fact that humans and chimps rely on the same genes as evidence for evolution is like saying that Microsoft Word and Excel had a common ancestor and evolved, because they share the same set of functions and load the same common libraries. Just because a DLL is used in a program does not necessarily mean the program will use all the functions in that DLL. A program will use only the functions it requires, and the rest will never be used. This is exactly why some genes are switched off or inactivated while others function.

As we know, operating systems have come a long way, but do they have a common ancestor from which they evolved by themselves? Similar operating systems do share code, libraries and so on, but they are always developed. Take Linux, for example: it is modeled on Unix. Does that mean Linux evolved by itself from Unix?



Hence, identical genes between species are actually evidence of a Designer at work rather than of evolution.

C-value enigma or the Onion Test

Some organisms (like some amoebae, onions, some arthropods, and amphibians) have much more DNA per cell than humans, but cannot possibly be more developmentally or cognitively complex, implying that eukaryotic genomes can and do carry varying amounts of unnecessary baggage.

If you look into the code of any developer, you might find unused functions, unused imports, unused libraries, and debug and trace information. I personally have the habit of embedding data within the code to make the code more portable. These things increase the size of the code but not its functionality.

For example, some software that does simple things, such as a dictionary, is 100 MB in size, while Stuxnet, which sabotaged Iranian uranium-enrichment centrifuges, is just half a megabyte. The size of the code is not directly proportional to its complexity, as every developer knows. Any project manager knows that complexity does not depend on the amount of code, and does not estimate costs based on the lines of code required.
“Measuring programming progress by lines of code is like measuring aircraft building progress by weight.” --Bill Gates
Murphy's law on software engineering states: the chances of a program doing what it's supposed to do are inversely proportional to the number of lines of code used to write it. If this is the case for computer code designed and developed by humans, isn't it perfect evidence of a grand Designer at work, creating complex organisms with smaller genomes, perfectly satisfying Murphy's law?

There is another piece of logic, which is used even in DOS and Windows. If space is allocated in fixed-size units, a lot of space is left over whenever the actual information is small, which makes the overall size huge. The DOS and Windows file systems use fixed-size clusters. Even if the data being stored needs less space than the cluster size, an entire cluster is reserved for the file. This is why a million files holding 1 byte each take up not 1 MB but about 4 GB with a 4 KB cluster. Does the genome have such features? We do know that introns within a gene are removed by RNA splicing, which suggests some kind of cluster-like structure.

Courtesy: killdisk.com/wiping.htm
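
The cluster arithmetic above can be checked with a small sketch. It assumes every file reserves a whole number of 4 KB clusters and ignores file-system metadata; the function name allocated_bytes is only illustrative.

```python
def allocated_bytes(file_size: int, cluster_size: int = 4096) -> int:
    """Bytes reserved on disk when space is handed out in whole clusters."""
    if file_size <= 0:
        return 0  # assumption: an empty file reserves no cluster
    clusters = (file_size + cluster_size - 1) // cluster_size  # round up
    return clusters * cluster_size


FILE_COUNT = 1_000_000
actual_data = FILE_COUNT * 1                     # one byte of data per file
space_on_disk = FILE_COUNT * allocated_bytes(1)  # one full cluster per file

print(actual_data)    # 1000000 bytes, about 1 MB
print(space_on_disk)  # 4096000000 bytes, about 4 GB
```

The gap between the two numbers is slack space: storage that is reserved but holds no data.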


Hence, the C-value enigma, or Onion Test, is actually evidence of a Designer at work, satisfying Murphy's law and using a cluster-style design, rather than of evolution.

Reverse Engineering Genome

Most of the work biologists do today is just reverse engineering the genome: trying to understand its design and documenting it. Most decompilers show the data region, the code region, the functions used, the data strings used and the entry point. However, none of these luxuries exists in genome reverse engineering. All we have are sequences of code that perform specific functions when executed, and codons to identify them. About most of the remaining genome we have no clue. If you look into any executable file without knowing its structure, you can quickly come to the conclusion that most of the information in it is junk. But if you know the actual structure, it isn't junk but actual data.


As you can see, the only readable information in the file is MZ. Apart from this, nothing can be decoded without documentation of the format.
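
To make the point concrete, here is a minimal sketch of what a plain text view of an executable shows. The 16 sample bytes are an illustrative stand-in for the start of a DOS/Windows executable (which begins with the two bytes "MZ"), and the helper names are hypothetical.

```python
MZ_SIGNATURE = b"MZ"  # first two bytes of a DOS/Windows executable


def has_mz_signature(data: bytes) -> bool:
    """True if the data starts with the MZ signature."""
    return data[:2] == MZ_SIGNATURE


def printable_view(data: bytes) -> str:
    """Show bytes roughly as a text viewer would: printable ASCII kept, the rest as dots."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in data)


# Illustrative sample bytes, not read from a real file.
sample = b"MZ\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff\x00\x00"

print(has_mz_signature(sample))  # True
print(printable_view(sample))    # MZ..............
```

Without the format's documentation, everything after the signature looks like meaningless dots, even though each byte has a defined role.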


The same logic applies to the genome. Just because a sequence of binary code is not part of a function (or a gene) does not mean it is not used. To be more accurate, a genome is not just code; there is much more to it. A genome is like code that has the instructions to build the computer itself, then install all the required operating systems and application code, and finally boot the computer and execute the required code in the right order.

Conclusion

I would like to add more, but this is more than enough, as it covers most of the arguments from genetics. Only a designer can understand and detect the architecture of another Designer. People who can't detect design or understand a complex architecture need some design skills to actually understand it. If they don't have them, it is better to at least look at the computer world in order to relate to and detect design concepts. Evolutionary biologists are not designers or developers, which is understandable, but that does not mean they should reject design concepts and ways of detecting design. Of the 16 trillion bits in my computer, not a single 0 or 1 is there by chance.

Tuesday, January 27, 2015

Downloading large files from Google Drive using Download Manager

Downloading large files is always a pain. It takes a long time, and it is worse if the download disconnects or fails just before it completes. I rely mostly on Google Drive and sometimes upload large files. Hence, I thought it would be appropriate to write a post explaining how to download from Google Drive without any issues using a download manager. This post assumes you are using Google Chrome, and the screenshots are from it. As an example, I will show how to download SNP Prophet.zip, which is 13 GB, a really massive file.

Download the file normally

Just click on the link (or follow the link) and download the file normally from the browser.




Cancel the download

Now, cancel the download from the browser. You can cancel it by clicking on the small context menu in the download display. Once cancelled, click on 'Show all downloads' to go to the downloads page.



Copy the link

On the downloads page, right-click on the link and select 'Copy link address'.


Paste the link on Download Manager

You can paste the copied URL into any download manager. Here, I used Free Download Manager. Make sure 'Save As' is changed to the required filename.




Download using Download Manager

Now, the file is being downloaded by a download manager.



The important advantage of using a download manager is that, if any network interruption happens, the download resumes and will not download from the beginning. This helps in avoiding download failures and saving bandwidth when trying to download large files.
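
For the curious, resuming usually works by asking the server for only the missing bytes through the standard HTTP Range header. The sketch below only builds such a request and does not contact any server. The URL is a placeholder, the helper name is hypothetical, and it is an assumption that the server behind a given link honors range requests.

```python
import urllib.request


def resume_request(url: str, bytes_saved: int) -> urllib.request.Request:
    """Build a request for the part of the file that has not been saved yet."""
    request = urllib.request.Request(url)
    if bytes_saved > 0:
        # Ask for everything from byte offset `bytes_saved` to the end of the file.
        request.add_header("Range", f"bytes={bytes_saved}-")
    return request


# Placeholder URL; 1 MiB is assumed to be on disk already.
req = resume_request("https://example.com/large-file.zip", 1_048_576)
print(req.get_header("Range"))  # bytes=1048576-
```

A download manager would take bytes_saved from the size of the partial file on disk and append the response to that file.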

Friday, January 23, 2015

Domain Change Update!

Since I changed my name to Felix Immanuel, I thought of changing my personal domain as well to reflect the change. Hence, I changed my domain from fc.id.au to fi.id.au, and my email address along with it, which I have updated on the contacts page. However, I will still receive emails sent to my old address and forward them to my new one for as long as I can. The website at fc.id.au will be forwarded to fi.id.au. Unfortunately, I wasn't able to retain comments, likes and follows from my personal blog due to this change. In summary, the domain change is a smooth and slow transition and will not have any impact on others.

Wednesday, January 14, 2015

How to choose thresholds for segment matches?

Choosing thresholds for segment matches is one of the greatest challenges for a genetic genealogist. The thresholds people use vary from 200 SNPs / 2 cM to 700 SNPs / 7 cM. The question that follows is: what is the logic behind them? Why is a threshold set at a particular SNP/cM pair of values? What is the relation between SNPs and cM? These questions are crucial in deciding why there is a threshold and which threshold is right for a genetic genealogist.

Before proceeding with this blog, you need to make yourself familiar with noise thresholds and SNP density for a kit.

Noise Thresholds

A segment that is just noise cannot occur above the 150 SNPs / 1 Mb threshold. The mathematics behind how this was arrived at can be read in The true IBS noise range and Noise threshold on atDNA Matches. I also tested it in GEDmatch, and you can see the result in the post IBS Noise Kit at GEDMatch. 1 Mb is roughly equal to 1 cM. This is one of the reasons FamilyTreeDNA includes 1 cM segments in total shared DNA.

SNP Density

SNP density is the number of SNPs per cM. It varies between kit versions from different companies. For example, for an FTDNA kit with 700k SNPs, the number of SNPs per cM is 700000 SNPs / 3600 cM ≈ 194 SNPs per cM.
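
The same calculation as a minimal sketch. The 3600 cM genome length and the 700k SNP count are the figures quoted above; the function name is illustrative.

```python
GENOME_LENGTH_CM = 3600  # approximate genome length in cM, as used above


def snp_density(total_snps: int, genome_cm: float = GENOME_LENGTH_CM) -> float:
    """Average number of tested SNPs per centiMorgan for a kit."""
    return total_snps / genome_cm


print(round(snp_density(700_000)))  # 194
```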

Below is the SNP density for different kit versions:

KIT VERSION         SNP DENSITY (per cM)
23andMe V2          157
23andMe V3          265
23andMe V4          168
Ancestry            194
FTDNA Affymetrix    160
FTDNA Illumina      194

Motive

Most DNA companies advertise matches only up to 5 generations back. In other words, the common ancestor of 4th cousins is 5 generations back, and the companies are not interested in going beyond that.

FTDNA: up to 5 generations. Anything beyond is reported as 5th to remote cousins.

23andMe: up to 7 generations, but restricted to 7 cM, which means that only a few lucky 6th cousins are identified.

Ancestry: 5th cousins, that is, up to 6 generations.

Hence, any thresholds provided by DNA testing companies will always target what they advertise. This is one of the reasons DNA companies have different thresholds and their own internal logic for confirming a match.

Autosomal DNA undergoes recombination. Each time, recombination breaks a single strand unequally into 2 to 6 segments. Hence, segment length varies greatly as the cousin relationship becomes more distant. I would like to point to the FamilyTreeDNA FAQ.

How accurate is Family Finder’s Relationship Range?
I would also like to point to two blogs:
Hence, as the common ancestor goes further back in time, the IBD segment shared with cousins also varies significantly. There is no known algorithm to accurately predict how far back in time the common ancestor is.

The below table is from ISOGG - Identical by Descent, provided by Tim Janzen.
Relationship                   Range            Expected    Range of number of shared segments
Parent/child                   3539-3748 cMs                23-29
First cousins                  548-1139 cMs     888 cMs     17-32
First cousins once removed     220-638 cMs      444 cMs     12-23
Second cousins                 86-426 cMs       222 cMs     10-18
Second cousins once removed    19-197 cMs       111 cMs     4-12
Third cousins                  16-111 cMs       55.4 cMs    2-6?
Third cousins once removed     0-99 cMs         27.8 cMs    1-4
Fourth cousins                 0-54 cMs         13.8 cMs    0-2

As you can see, the range of segment length for fourth cousins varies greatly. Not only do they share very few segments (0-2), but some of their segments also fall well below 7 cM, the threshold used by several DNA companies.

There is also a level of false-positive tolerance. If a DNA company says a match is a 3rd cousin when in reality it is a 10th cousin, or vice versa, the customer may not be very happy with the product. However, if the same company says the match is a 2nd or 3rd cousin when in reality it is a 4th cousin, the customer will not be as disappointed. The motive of any DNA testing company is to make its customers happy, and in the process it will even eliminate some valid close cousin matches that have smaller segments in favor of completely eliminating very distant cousins who add no value to most pedigrees. As you can see from the above table, a fourth cousin is expected to share 13.8 cM in up to 2 IBD segments, with a total that can range from 0 to 54 cM. The only way to confirm a match with reasonable accuracy is to stop at a threshold that admits only recent cousins. 23andMe found that 7 cM yields mostly recent cousins and started using it. This, however, does not mean that a segment smaller than 7 cM cannot come from a recent cousin, nor that any segment smaller than 7 cM is invalid.

If you look at another FTDNA FAQ, How much genetic sharing is needed for two people to be considered a Family Finder match?
For the program to consider two people a potential match, the largest matching DNA segment between two people must be at least 5.5 centiMorgans (cM) long.* The program then uses additional matching segments to confirm the relationship and to calculate the degree of relatedness. 
Based on the extensive Family Finder database, it is rare for two genuine genealogical cousins to have a largest shared segment of less than 7 cM and one less than 6 cM is exceptional. 
*updated from 5 cM 15 Aug. 2012
As you can see, the largest-segment threshold used by FTDNA is 5.5 cM (up from the 5 cM used 2.5 years ago). 'Genuine genealogical cousins' could mean very different things to different people, depending on what 'genealogical time-frame' means: the common ancestor of 4th cousins might have lived just 150 years or 5 generations back, while a genealogical time-frame could extend 500 years back. With reference to FTDNA, I believe they are referring here only to 4th-5th cousins, as advertised.

Thanks to GEDmatch, which allows matching down to 1 cM.

How is the threshold arrived at?

As we saw previously, most DNA companies have narrowed and tuned their thresholds to get accurate results only up to 5 to 7 generations back. Based on their research, they found the required segment length to be 5.5 cM to 7 cM. However, segment length (cM) is just one half. The other half is SNPs. This is where SNP density comes in. For two FTDNA Illumina kits to match each other at 7 cM, they both have 7 cM * 194 SNPs per cM = 1358 SNPs. Hence, for matching two FTDNA Illumina kits, the thresholds can comfortably be kept at 7 cM / 1300 SNPs. This is why you will notice relatively high SNP counts for segments if the kits are of the same version.
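
The arithmetic above extends to the other kit versions. The sketch below multiplies the segment length by each density from the SNP density table earlier in this post. The helper name is illustrative, and the result is only the SNP count expected between two kits of the same version.

```python
# SNP densities per cM, copied from the kit version table earlier in this post.
SNP_DENSITY_PER_CM = {
    "23andMe V2": 157,
    "23andMe V3": 265,
    "23andMe V4": 168,
    "Ancestry": 194,
    "FTDNA Affymetrix": 160,
    "FTDNA Illumina": 194,
}


def expected_snps(segment_cm: float, density_per_cm: float) -> int:
    """SNPs expected inside a segment of the given length at the given density."""
    return round(segment_cm * density_per_cm)


for kit, density in SNP_DENSITY_PER_CM.items():
    print(kit, expected_snps(7, density))
```

For FTDNA Illumina this prints 1358, the figure worked out above.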

But in reality, the SNP density doesn't need to be that high to confirm the validity of a segment. As we saw earlier, the noise threshold is 150 SNPs, which means noise cannot occur above this threshold. Can we then use 150 SNPs / 7 cM? Yes and no. If we reduce the SNP threshold, adjacent IBD segments will start to merge into a longer segment, creating a 'segment merging effect'. This merging is not caused just by reducing the SNP threshold, but by SNPs not being available to check in between the two adjacent IBD segments, i.e., incompatible DNA kit versions have no-calls in between, which are assumed to be matching. The reduced SNP threshold simply makes it easier to match a longer segment, creating the merge effect with such incompatible DNA kit versions.


This does not mean we can't reduce the SNP threshold. Reducing the SNP threshold between two kits of the same version (or kits that have the same SNPs), while maintaining a higher segment length threshold, gives matches similar to those from higher SNP thresholds. Hence, there is no problem in matching at lower SNP thresholds as long as the kit versions are the same and the allowed errors are kept in check. E.g., 700 SNPs / 7 cM will give results similar to 150 SNPs / 7 cM because a segment cannot occur accidentally at 150 SNPs (150 SNPs is more than enough to confirm the validity of a segment). The prime reason is that kits of the same version don't have no-calls in between and have a higher overlapping SNP density; hence, any mismatch in these SNPs will simply break the segment.

How much overlapping SNP density is required for two different kit versions? What is the rule of thumb for thresholds?

Any segment that is not noise will always be above the 150 SNPs / 1 cM threshold. If we truly look into such a segment, the common ancestor for this small segment could be thousands of years back, because segment length with respect to the number of generations to the common ancestor follows an exponential decay curve. It could also represent a 4th cousin, as we saw earlier. Hence, 1 cM is as important as 7 cM and should be eliminated in favor of recent cousins only after careful investigation.

If we start off with 1 cM (having 150 SNPs) and increase the threshold to 2 cM, we must have 300 SNPs to keep the same SNP density. However, not many pairs of kit versions overlap densely enough to provide 300 SNPs within 2 cM.

Below is the overlapping SNP density for different pairs of kit versions:

KIT VERSION 1       KIT VERSION 2       OVERLAP SNP DENSITY (per cM)
23andMe V2          23andMe V3          148
23andMe V2          23andMe V4          127
23andMe V2          Ancestry            88
23andMe V2          FTDNA Affymetrix    40
23andMe V2          FTDNA Illumina      90
23andMe V3          23andMe V4          145
23andMe V3          Ancestry            188
23andMe V3          FTDNA Affymetrix    62
23andMe V3          FTDNA Illumina      194
23andMe V4          Ancestry            85
23andMe V4          FTDNA Affymetrix    40
23andMe V4          FTDNA Illumina      86
Ancestry            FTDNA Affymetrix    49
Ancestry            FTDNA Illumina      193
FTDNA Affymetrix    FTDNA Illumina      50
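
As a rough check, the fifteen overlap densities listed above can be averaged with a short sketch. The list is copied from the table in row order, and the variable names are illustrative.

```python
# Overlap SNP densities per cM, copied from the table above in row order.
OVERLAP_DENSITIES = [148, 127, 88, 40, 90, 145, 188, 62, 194, 85, 40, 86, 49, 193, 50]

average = sum(OVERLAP_DENSITIES) / len(OVERLAP_DENSITIES)
print(round(average, 1))  # 105.7
```

The mean comes out at roughly 106 SNPs per cM, which is close to the round figure of 100 used in the rest of this post.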

The average overlap SNP density is ~100 SNPs per cM. Pairs marked blue are highly compatible, red are marginally compatible and gray are not compatible. However, we need to make the incompatible kits compatible, and the only way is to raise the threshold so that segments don't occur accidentally and yet can be validated. How can we do this? The thresholds must follow two rules:

  • Number of SNPs in the segment must be greater than 150.
  • The SNP density per cM must be 100 and above.

While the 150 SNP minimum makes sure the segment doesn't occur accidentally, the SNP density per cM makes sure it doesn't merge with nearby IBD segments because of no-calls assumed to be matching. Hence, beyond setting the thresholds, we must also verify that each matching segment in the result has the right SNP density (100 SNPs per cM). In other words, to match a 3 cM segment you need 300 overlapping SNPs, for a 5 cM segment you need 500 SNPs, and for 25 cM you need 2500 SNPs. I will demonstrate this with an example.

As we saw above, FTDNA Illumina and Ancestry are highly compatible, with an overlap SNP density of 193 per cM. Comparing them provides two segments.

  • 8.5 cM requires 850 SNPs and above to confirm the segment (but has significantly more - 1727 SNPs, hence valid). 
  • 16.3 cM requires 1630 SNPs and above to confirm the segment (but has significantly more - 3483 SNPs, hence valid)



However, if the same person has also tested with a kit version that is not compatible, let us see how the match shows up. Lisa did her 23andMe v4 test. Comparing Lisa's 23andMe v4 and Ancestry kits, we have the match below. As we saw in the above table, 23andMe v4 and Ancestry are not compatible kits and have a very low overlap SNP density. This provides only 1 matching segment.

25.2 cM requires 2520 SNPs and above (but has only 2361 SNPs, hence invalid).
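
The two rules can be applied to the three segments above with a minimal sketch. The function and constant names are illustrative; the segment lengths and SNP counts are the ones quoted in the examples.

```python
MIN_SNPS = 150           # rule 1: more than 150 SNPs, so the segment is not noise
MIN_DENSITY_PER_CM = 100  # rule 2: at least 100 SNPs per cM, to guard against merging


def is_valid_segment(length_cm: float, snps: int) -> bool:
    """Apply both threshold rules to one matching segment."""
    density = snps / length_cm
    return snps > MIN_SNPS and density >= MIN_DENSITY_PER_CM


# (segment length in cM, SNP count) taken from the examples above.
segments = [(8.5, 1727), (16.3, 3483), (25.2, 2361)]

for length_cm, snps in segments:
    density = snps / length_cm
    print(length_cm, snps, round(density, 1), is_valid_segment(length_cm, snps))
```

The first two segments have densities of about 203 and 214 SNPs per cM and pass, while the 25.2 cM segment comes in at about 94 SNPs per cM and fails the density rule.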



This is true for all incompatible kits and is extremely helpful for weeding out false-positive segments. It is also extremely helpful in eliminating false positives among smaller segments, e.g., 2 cM. A 2 cM segment can occur in a 4th cousin or can even have a common ancestor thousands of years back.


As you can see in the above, all matching segments that pass the SNP density check on the right also match on the left, even though the right is between two non-compatible kits. However, only 1 of the 2 segments that do not pass the SNP density check is found on the left.

Conclusion

Choosing a threshold is an important task for genetic genealogy. To summarize, the threshold must have the following properties:
  • Number of SNPs in the segment must be greater than 150.
  • The SNP density per cM must be 100 and above.
Hence, if you decide to use 7 cM, make sure you set 700 SNPs as the threshold, and don't forget to validate each matching segment for SNP density.

A word of caution: just because a segment has a lower SNP density does not always mean it is merged. It simply means that the possibility of it being merged from two segments is extremely high, and verifying whether such a segment is merged is not easy.