WIRED: ... Google’s engineers have discovered that some of the most important signals can come from Google itself. PageRank has been celebrated as instituting a measure of populism into search engines: the democracy of millions of people deciding what to link to on the Web. But Singhal notes that the engineers in Building 43 are exploiting another democracy — the hundreds of millions who search on Google. The data people generate when they search — what results they click on, what words they replace in the query when they’re unsatisfied, how their queries match with their physical locations — turns out to be an invaluable resource in discovering new signals and improving the relevance of results. The most direct example of this process is what Google calls personalized search — a feature that uses someone’s search history and location as signals to determine what kind of results they’ll find useful. But more generally, Google has used its huge mass of collected data to bolster its algorithm with an amazingly deep knowledge base that helps interpret the complex intent of cryptic queries.
Take, for instance, the way Google’s engine learns which words are synonyms. “We discovered a nifty thing very early on,” Singhal says. “People change words in their queries. So someone would say, ‘pictures of dogs,’ and then they’d say, ‘pictures of puppies.’ So that told us that maybe ‘dogs’ and ‘puppies’ were interchangeable. We also learned that when you boil water, it’s hot water. We were relearning semantics from humans, and that was a great advance.”
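To make the query-rewrite idea concrete, here is a minimal sketch of how one might mine candidate synonyms from search sessions. It is an illustration only, not Google's method: the `substitution_pairs` function and the toy `sessions` log are hypothetical, and the rule used (count word swaps between consecutive queries that differ in exactly one position) is an assumption chosen for simplicity.

```python
from collections import Counter


def substitution_pairs(sessions):
    """Count word swaps between consecutive queries that differ in exactly one word."""
    counts = Counter()
    for queries in sessions:
        for before, after in zip(queries, queries[1:]):
            a, b = before.lower().split(), after.lower().split()
            if len(a) != len(b):
                continue  # only compare rewrites of the same length
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) == 1:
                # frozenset makes the pair order-independent
                counts[frozenset(diffs[0])] += 1
    return counts


# Hypothetical toy log: each inner list is one user's consecutive queries.
sessions = [
    ["pictures of dogs", "pictures of puppies"],
    ["cute dogs", "cute puppies"],
    ["boiling water", "hot water"],
]

pairs = substitution_pairs(sessions)
for pair, n in pairs.most_common():
    print(sorted(pair), n)
```

Pairs that are swapped often become synonym candidates. Note that this naive count is exactly what would also let "hot dog" drift toward "boiling puppy", since it treats each word in isolation.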
But there were obstacles. Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games” — not poached pooches. That helped the algorithm understand what “hot dog” — and millions of other terms — meant. “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.” ...
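The context idea can be sketched with simple co-occurrence counts. The snippet below is a toy illustration under stated assumptions, not the system described in the article: the four `documents` are invented, `context_vector` and `cosine` are hypothetical helper names, and cosine similarity over bag-of-words contexts is just one common way to compare the company a phrase keeps.

```python
import math
from collections import Counter


def context_vector(phrase, documents):
    """Count the words that appear in the same document as `phrase`."""
    vec = Counter()
    for doc in documents:
        text = doc.lower()
        if phrase in text:
            # drop the phrase itself so only its neighbours are counted
            vec.update(text.replace(phrase, " ").split())
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical toy corpus.
documents = [
    "hot dog with mustard on bread at baseball games",
    "grilled sausage with mustard on bread",
    "boiling water for tea",
    "a puppy is a young dog",
]

hot_dog = context_vector("hot dog", documents)
sausage = context_vector("sausage", documents)
puppy = context_vector("puppy", documents)

print(cosine(hot_dog, sausage) > cosine(hot_dog, puppy))
```

In this toy corpus "hot dog" shares its neighbours (mustard, bread) with "sausage" and none with "puppy", so the phrase as a whole lands near food rather than near pets.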
[mike siwek lawyer mi]
The Mike Siwek query illustrates how Google accomplishes this. When Singhal types in a command to expose a layer of code underneath each search result, it’s clear which signals determine the selection of the top links: a bi-gram connection to figure it’s a name; a synonym; a geographic location. “Deconstruct this query from an engineer’s point of view,” Singhal explains. “We say, ‘Aha! We can break this here!’ We figure that lawyer is not a last name and Siwek is not a middle name. And by the way, lawyer is not a town in Michigan. A lawyer is an attorney.”
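A rough sketch of that kind of query deconstruction is below. Everything in it is hypothetical: the three lookup tables stand in for signals that would really be mined from data, and the `annotate` function is an invented name, not anything the article attributes to Google.

```python
# All three tables are hypothetical stand-ins for signals learned from data.
KNOWN_NAME_BIGRAMS = {("mike", "siwek")}
SYNONYMS = {"lawyer": ["attorney"]}
PLACE_ABBREVIATIONS = {"mi": "michigan"}


def annotate(query):
    """Label each part of a query as a name, a place, or a term with synonyms."""
    tokens = query.lower().split()
    annotations = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in KNOWN_NAME_BIGRAMS:
            # two adjacent words that are known to form a name
            annotations.append(("name", " ".join(pair)))
            i += 2
        elif tokens[i] in PLACE_ABBREVIATIONS:
            annotations.append(("place", PLACE_ABBREVIATIONS[tokens[i]]))
            i += 1
        else:
            # plain term, expanded with any known synonyms
            annotations.append(("term", [tokens[i]] + SYNONYMS.get(tokens[i], [])))
            i += 1
    return annotations


result = annotate("mike siwek lawyer mi")
print(result)
```

The design choice worth noticing is that the bigram check runs first, so "mike siwek" is consumed as one unit before any single word is interpreted on its own.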
This is the hard-won realization from inside the Google search engine, culled from the data generated by billions of searches: a rock is a rock. It’s also a stone, and it could be a boulder. Spell it “rokc” and it’s still a rock. But put “little” in front of it and it’s the capital of Arkansas. Which is not an ark. Unless Noah is around. “The holy grail of search is to understand what the user wants,” Singhal says. “Then you are not matching words; you are actually trying to match meaning.”
- Steve Hsu
- Pessimism of the Intellect, Optimism of the Will. Twitter: @hsu_steve
Sunday, February 28, 2010
Seeds of AI at Google
Semantic meaning from statistical learning and mechanical turk workers like you and me :-)
Labels: ai, google, machine learning