Sunday, May 11, 2008

On data mining

Last week we had Jiawei Han of UIUC here to give a talk: Exploring the Power of Links in Information Network Mining. He's the author of a well-known book on data mining.

During our conversation we discussed a number of projects his group has worked on in the past, all of which involve teasing out the structure in large bodies of data. Being a lazy theorist, my attitude in the past about data mining has been as follows: sit and think about the problem, come up with a list of potential signals, then analyze the data to see which signals actually work. The point being that the good signals would turn out to be a subset (or possibly a combination) of the ones you could think of a priori -- i.e., those for which there is a plausible, human-comprehensible reason.

In many of the examples we discussed I was able to guess the main signals that turned out to be useful. However, Han impressed on me that, these days, with gigantic corpora of data available, one often encounters very subtle signals that are identified only by algorithm -- that human intuition completely fails to identify. (Gee, why that weird linear combination of those inputs, with alternating signs, even?! :-)

Our conversation made me want to get my hands dirty on some big data mining project. Of course, it's much easier for him -- his group has something like ten graduate students at a time! Interestingly, he identified this ability to tap into large chunks of manpower as an advantage of being in academia as opposed to, e.g., at Microsoft Research. Of course, if you are doing very commercially applicable research you can access even greater resources at a company lab/startup, but for blue sky academic work it wouldn't be the case.
