Thursday, February 26, 2015

Second-generation PLINK

"... these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM"  :-)

Interview with author Chris Chang. User Google group.

If one estimates a user population of ~1000, each saving on the order of $1000 in CPU/work time per year (roughly $1 million per year in total), then over the next few years PLINK 1.9 and its successors will deliver millions of dollars in value to the scientific community.

Second-generation PLINK: rising to the challenge of larger and richer datasets

Background
PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format.

Findings
To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0).
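
To give a rough feel for what "bit-level parallelism" means here, the following is a minimal Python sketch, not PLINK's actual code. It assumes a simplified encoding in which each sample's genotype is stored as a 2-bit alternate-allele count (0, 1, or 2); PLINK's real binary format uses a different bit layout, and the function names below are made up for illustration. The point is that a whole block of samples can be tallied with a couple of mask-and-popcount operations instead of a per-sample loop.

```python
def pack_genotypes(genotypes):
    """Pack alt-allele counts (0, 1 or 2) into one integer, 2 bits per sample.

    Hypothetical encoding for illustration only: the 2-bit field simply
    holds the allele count. This is NOT PLINK's on-disk bit layout.
    """
    word = 0
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2):
            raise ValueError("genotype must be 0, 1 or 2")
        word |= g << (2 * i)
    return word


def count_alt_alleles(word, n_samples):
    """Total alt-allele count across all samples, without a per-sample loop."""
    # Mask with the low bit of every 2-bit field set (binary 0101...01).
    # The sum of 4**i for i < n_samples equals (4**n_samples - 1) // 3.
    mask = (4 ** n_samples - 1) // 3
    low = word & mask          # low bit of each field (worth 1 allele)
    high = (word >> 1) & mask  # high bit of each field (worth 2 alleles)
    # Population count of each bit plane, weighted by what the bit is worth.
    return bin(low).count("1") + 2 * bin(high).count("1")


genotypes = [0, 1, 2, 2, 1, 0, 0, 2]
packed = pack_genotypes(genotypes)
total = count_alt_alleles(packed, len(genotypes))
```

In a compiled language the same idea would operate on fixed-width machine words with a hardware popcount instruction, handling 32 two-bit genotypes per 64-bit word.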

Conclusions
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

2 comments:

  1. Yan Shen, 9:00 PM

    "... these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM" :-)

    Can someone please set this man up on a date with Miranda Kerr? We need these genes to proliferate in the gene pool.

    https://petitions.whitehouse.gov/

    Can we petition Obama to make it happen?

  2. Endre Bakken Stovner, 11:38 AM

    Not only is it faster, but it actually works, which is as common as not in bioinformatics.

    When I need to do something, I always try to find a way to make plink do it. This isn't just to keep the number of different programs used in pipelines low, but also because I can be pretty confident it works. The speed is a nice bonus, of course. And Chang has been incredibly quick to come up with patches for those edge cases that are off, less than 24 hours every single time afaicr.
