
Monday, December 02, 2013

PLINK 1.90 alpha


WDIST is now PLINK 1.9 alpha. WDIST (short for "weighted distance" calculator) was originally written to compute pairwise genomic distances. The mighty Chris Chang has since re-implemented essentially all of PLINK, with significant improvements (see below).

PLINK 1.9 even has support for the LASSO (i.e., L1-penalized optimization, the core technique used in compressed sensing).
This is a comprehensive update to Shaun Purcell's popular PLINK command-line program, developed by Christopher Chang with support from the NIH-NIDDK's Laboratory of Biological Modeling and others. (What's new?) (Credits.)

It isn't finished yet (hence the 'alpha' designation), but it's getting there. We are working with Dr. Purcell to launch a large-scale beta test in the near future. ...

Unprecedented speed 
Thanks to heavy use of bitwise operators, sequential memory access patterns, multithreading, and higher-level algorithmic improvements, PLINK 1.9 is much, much faster than PLINK 1.07 and other popular software. Several of the most demanding jobs, including identity-by-state matrix computation, distance-based clustering, LD-based pruning, and association analysis max(T) permutation tests, now complete hundreds or even thousands of times as quickly, and even the most trivial operations tend to be 5-10x faster due to I/O improvements.
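
To give a rough idea of where the bit-level speedups come from, here is a minimal, illustrative IBS kernel for genotypes packed two bits per variant. This is not PLINK's actual code or its actual .bed encoding (which, among other things, reserves a code for missing genotypes); it just shows how a single XOR plus a couple of popcounts classifies 32 variant comparisons at a time (GCC/Clang __builtin_popcountll assumed).

    /* Illustrative only: genotypes coded 0 (00), 1 (01), 2 (10), packed
     * 2 bits per variant, 32 variants per 64-bit word; missing data ignored. */
    #include <stdint.h>
    #include <stdio.h>

    /* Count IBS0 / IBS1 / IBS2 variants for one pair of samples. */
    void ibs_counts(const uint64_t *ga, const uint64_t *gb, int nwords,
                    long *ibs0, long *ibs1, long *ibs2) {
        const uint64_t low_bits = 0x5555555555555555ULL;
        long c0 = 0, c2 = 0;
        for (int w = 0; w < nwords; w++) {
            uint64_t x  = ga[w] ^ gb[w];
            uint64_t lo = x & low_bits;          /* low bit of each 2-bit field  */
            uint64_t hi = (x >> 1) & low_bits;   /* high bit of each 2-bit field */
            c0 += __builtin_popcountll(hi & ~lo);     /* fields with |g1-g2| == 2 */
            c2 += 32 - __builtin_popcountll(hi | lo); /* fields with identical genotypes */
        }
        *ibs0 = c0;
        *ibs2 = c2;
        *ibs1 = (long)nwords * 32 - c0 - c2;
    }

    int main(void) {
        /* 32 variants: sample A all genotype 0, sample B alternating 2,0,2,0,... */
        uint64_t a[1] = { 0 };
        uint64_t b[1] = { 0x2222222222222222ULL };
        long i0, i1, i2;
        ibs_counts(a, b, 1, &i0, &i1, &i2);
        printf("IBS0=%ld IBS1=%ld IBS2=%ld\n", i0, i1, i2);
        return 0;
    }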

We hasten to add that the vast majority of ideas contributing to PLINK 1.9's performance were developed elsewhere; in several cases, we have simply ported little-known but outstanding implementations without significant further revision (even while possibly uglifying them beyond recognition; sorry about that, Roman...). See the credits page for a partial list of people to thank. On a related note, if you are aware of an implementation of a PLINK command which is substantially better than what we currently do, let us know; we'll be happy to switch to their algorithm and give them credit in our documentation and papers.

Nearly unlimited scale 
The main genomic data matrix no longer has to fit in RAM, so bleeding-edge datasets containing tens of thousands of individuals with exome- or whole-genome sequence calls at millions of sites can be processed on ordinary desktops (and this processing will usually complete in a reasonable amount of time). In addition, several key individual x individual and variant x variant matrix computations (including the GRM mentioned below) can be cleanly split across computing clusters (or serially handled in manageable chunks by a single computer).
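
A toy sketch of the streaming idea, under the assumption that genotypes arrive one variant chunk at a time (here faked with random data rather than read from a .bed file): only the current chunk plus the sample x sample accumulator needs to be resident, so memory use is independent of the number of variants. This illustrates the strategy, not PLINK's implementation.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N_SAMPLES      8
    #define CHUNK_VARIANTS 1000
    #define N_CHUNKS       5

    int main(void) {
        static uint32_t dist[N_SAMPLES][N_SAMPLES];          /* running pairwise distances */
        static uint8_t  chunk[CHUNK_VARIANTS][N_SAMPLES];    /* one variant chunk in memory */
        srand(1);
        for (int c = 0; c < N_CHUNKS; c++) {
            /* in real life this chunk would be streamed from disk; here we fake it */
            for (int v = 0; v < CHUNK_VARIANTS; v++)
                for (int s = 0; s < N_SAMPLES; s++)
                    chunk[v][s] = rand() % 3;                 /* genotype 0/1/2 */
            /* update the accumulator, then discard the chunk */
            for (int v = 0; v < CHUNK_VARIANTS; v++)
                for (int i = 1; i < N_SAMPLES; i++)
                    for (int j = 0; j < i; j++)
                        dist[i][j] += abs((int)chunk[v][i] - (int)chunk[v][j]);
        }
        printf("dist(1,0) = %u\n", dist[1][0]);
        return 0;
    }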

Command-line interface improvements
We've standardized how the command-line parser works, migrated from the original 'everything is a flag' design toward a more organized flags + modifiers approach (while retaining backwards compatibility), and added a thorough command-line help facility.

Additional functions
In 2009, GCTA didn't exist. Today, there is an important and growing ecosystem of tools supporting the use of genetic relationship matrices in mixed model association analysis and other calculations; our contributions are a fast, multithreaded, memory-efficient --make-grm-gz/--make-grm-bin implementation which runs on OS X and Windows as well as Linux, and a closer-to-optimal --rel-cutoff pruner.
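
For reference, the relationship matrix in question is the standard GCTA-style GRM. With x_{ij} the minor-allele count (0/1/2) of individual j at variant i, p_i the minor-allele frequency, and M the number of variants, the off-diagonal entries are (GCTA treats the diagonal slightly differently):

    A_{jk} = \frac{1}{M} \sum_{i=1}^{M} \frac{(x_{ij} - 2p_i)(x_{ik} - 2p_i)}{2 p_i (1 - p_i)}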

There are other additions here and there, such as cluster-based filters which might make a few population geneticists' lives easier, and a coordinate-descent LASSO. New functions are not a top priority for now (reaching 95%+ backward compatibility, and supporting dosage/phased/triallelic data, are more important...), but we're willing to take time off from just working on the program core if you ask nicely.
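
For readers unfamiliar with the method, the textbook cyclic coordinate-descent update for the LASSO objective \min_\beta \tfrac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 is a soft-thresholding step applied to each coefficient in turn (Friedman, Hastie & Tibshirani); PLINK's implementation may differ in details such as standardization and stopping rules:

    \beta_j \leftarrow \frac{S\!\left(\frac{1}{n}\sum_i x_{ij}\, r_i^{(j)},\ \lambda\right)}{\frac{1}{n}\sum_i x_{ij}^2},
    \qquad
    S(z,\lambda) = \operatorname{sign}(z)\,\max(|z| - \lambda,\ 0),
    \qquad
    r_i^{(j)} = y_i - \sum_{k \neq j} x_{ik}\,\beta_k .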

Saturday, August 31, 2013

"Hadoop wildly overhyped" and other database arcana

Michael Stonebraker gave a talk on the current state of database architecture and technology, entitled The Traditional RDBMS Wisdom is (Almost Certainly) All Wrong. This field was pretty dead for a long time, but big changes are afoot now. Readers might be familiar with StreamBase, one of his startups.

A friend summarizes the talk as follows:
1. Relational databases are going through large transformations (not just under threat from non-relational NoSQL databases; the internal model that Oracle and MySQL themselves use is breaking apart and needs to be rewritten).

2. The current relational model has effectively already broken apart into two solutions (in large companies). Huge datasets for after-the-fact statistical analysis (e.g., Walmart analyzing a recent sale) have already moved over to column-grouped storage (as opposed to row records), which lets you stream through a given column of data quickly; a toy illustration of the row-store vs. column-store difference appears after this list. Real-time data, on the other hand, is kept in the terabyte range, and the whole database is loaded into memory. Although this speeds things up significantly, the next bottlenecks appear around thread locking and the like. Fixing this is an area of current research, but the gains are ultimately orders of magnitude over the old relational database model, and so things will probably all change soon.

3. The speaker described Hadoop as being terrible at everything except embarrassingly parallel multi-machine computations. He mentioned that Google itself doesn't use MapReduce anymore and is probably amused at all the attention it is getting in the world.
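
As promised above, here is a toy contrast between row-oriented and column-oriented layouts, written in C rather than SQL. Real column stores add compression, vectorized execution, and much more; this only illustrates why a scan over a single attribute touches far less memory in the columnar layout.

    #include <stdio.h>

    #define N_ROWS 1000000

    /* Row store: one struct per record.  Scanning just `amount` still drags
     * the neighbouring fields through the cache. */
    struct sale_row { int store_id; int item_id; double amount; };
    static struct sale_row rows[N_ROWS];

    /* Column store: each attribute lives in its own contiguous array, so a
     * scan of `amount` touches only that array. */
    static int    col_store_id[N_ROWS];
    static int    col_item_id[N_ROWS];
    static double col_amount[N_ROWS];

    int main(void) {
        for (int i = 0; i < N_ROWS; i++) {
            rows[i].store_id = col_store_id[i] = i % 50;
            rows[i].item_id  = col_item_id[i]  = i % 10000;
            rows[i].amount   = col_amount[i]   = (double)(i % 100);
        }
        /* "SELECT SUM(amount) FROM sales" against each layout */
        double total_row = 0.0, total_col = 0.0;
        for (int i = 0; i < N_ROWS; i++) total_row += rows[i].amount;
        for (int i = 0; i < N_ROWS; i++) total_col += col_amount[i];
        printf("row-store sum = %.0f, column-store sum = %.0f\n",
               total_row, total_col);
        return 0;
    }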

There were some other interesting points in the talk, but the main takeaway is that we probably are in for some huge changes in the next decade.

Saturday, June 22, 2013

WDIST and PLINK

News from BGI Cognitive Genomics.
31 May 2013: We have started the process of returning genetic data to our first round of volunteers. Everyone who was sequenced will be contacted within the next few weeks.

We are also starting public testing of our new bioinformatics tool: WDIST, an increasingly complete rewrite of PLINK designed for tomorrow's large datasets, developed by Christopher Chang with support from the NIH-NIDDK's Laboratory of Biological Modeling and others. It uses a streaming strategy to reduce memory requirements, and executes many of PLINK's slowest functions, including identity-by-state/identity-by-descent computation, LD-based pruning of marker sets, and association analysis max(T) permutation tests, over 100x (and sometimes even over 1000x) as quickly. Some newer calculations, such as the GCTA relationship matrix, are also supported. We have developed several novel algorithms, including a fast Fisher's exact test (2x2/2x3) which comfortably handles contingency tables with entries in the millions (try our browser demo!). Software engineers can see more details on our WDIST core algorithms page, and download the GPLv3 source code from our GitHub repository.
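
For the curious, here is a minimal log-space 2x2 Fisher's exact test in C. It is the straightforward O(min margin) enumeration using lgamma, not the fast algorithm WDIST actually uses (which is far smarter about early termination and relative-error control); it merely shows what the test computes. Compile with -lm.

    #include <math.h>
    #include <stdio.h>

    /* log of the hypergeometric probability of the table [[a,b],[c,d]] */
    static double log_hyper_prob(long a, long b, long c, long d) {
        return lgamma(a+b+1) + lgamma(c+d+1) + lgamma(a+c+1) + lgamma(b+d+1)
             - lgamma(a+1) - lgamma(b+1) - lgamma(c+1) - lgamma(d+1)
             - lgamma(a+b+c+d+1);
    }

    /* Two-sided Fisher's exact test for a 2x2 table: sum the probabilities of
     * all tables with the same margins that are no more likely than the
     * observed one. */
    double fisher_2x2(long a, long b, long c, long d) {
        double lobs = log_hyper_prob(a, b, c, d);
        long row1 = a + b, col1 = a + c, n = a + b + c + d;
        long amin = (col1 + row1 > n) ? col1 + row1 - n : 0;
        long amax = (row1 < col1) ? row1 : col1;
        double p = 0.0;
        for (long x = amin; x <= amax; x++) {
            double lx = log_hyper_prob(x, row1 - x, col1 - x, n - row1 - col1 + x);
            if (lx <= lobs + 1e-9) p += exp(lx);
        }
        return (p > 1.0) ? 1.0 : p;
    }

    int main(void) {
        printf("p = %g\n", fisher_2x2(12, 5, 29, 2));
        return 0;
    }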

Thursday, January 17, 2013

US-China software arbitrage

Who says outsourcing doesn't work?  :-)

This is just a single anecdote, but it suggests that US software developers cost many times more than coders of similar quality in China ...
Verizon RISK team security blog: ... As it turns out, Bob had simply outsourced his own job to a Chinese consulting firm. Bob spent less than one fifth of his six-figure salary for a Chinese firm to do his job for him. Authentication was no problem: he physically FedExed his RSA token to China so that the third-party contractor could log in under his credentials during the workday. It would appear that he was working an average 9 to 5 work day. Investigators checked his web browsing history, and that told the whole story.

A typical ‘work day’ for Bob looked like this:

9:00 a.m. – Arrive and surf Reddit for a couple of hours. Watch cat videos
11:30 a.m. – Take lunch
1:00 p.m. – Ebay time.
2:00 – ish p.m Facebook updates – LinkedIn
4:30 p.m. – End of day update e-mail to management.
5:00 p.m. – Go home

Evidence even suggested he had the same scam going across multiple companies in the area. All told, it looked like he earned several hundred thousand dollars a year, and only had to pay the Chinese consulting firm about fifty grand annually. The best part? Investigators had the opportunity to read through his performance reviews while working alongside HR. For the last several years in a row he received excellent remarks. His code was clean, well written, and submitted in a timely fashion. Quarter after quarter, his performance review noted him as the best developer in the building.

Saturday, March 31, 2007

Letter to a former student

Here's some advice I wrote to a former student who is headed into software development. Can anyone add to or improve my comments?


1) software development in general

The Mythical Man-Month (overrated, in my view, but everyone in the industry has read it)

Joel Spolsky on Software (a successful entrepreneur who has a big following in the developer community; he has an extensive web site)

Paul Graham (a CS PhD who writes about software and startups; very opinionated; check out his web site)


2) algorithms

The standard text is Cormen, Leiserson, Rivest, and Stein (CLRS), but for an informal introduction to some of the best stuff in CS try The Turing Omnibus. For security and crypto, try Applied Cryptography by Schneier. Knuth's The Art of Computer Programming volumes are probably mainly of academic interest, but you might enjoy a look.


3) general

Anyone in the working world should read How To Win Friends and Influence People by Dale Carnegie -- I swear by it. There is an outline of the whole book on my blog somewhere :-)


4) miscellaneous comments

Unix shell tools are amazingly powerful and will serve you time and again -- grep, piping, awk, sed, emacs, etc. Many Windows programmers are unfamiliar with these and are blown away by their power to do quick and dirty stuff.

The more you know the more effective you become -- experienced programmers think a lot before they code, and they tend to reuse existing code or libraries.

Don't forget to think like a physicist -- get to the heart of the problem before starting in any direction; make quick and dirty models and test them out; test your assumptions throughout the process by checking against the accumulating evidence. After you have a feel for the problem spend some time generalizing about or abstracting from what you've done.
