WIRED: ... Google’s engineers have discovered that some of the most important signals can come from Google itself. PageRank has been celebrated as instituting a measure of populism into search engines: the democracy of millions of people deciding what to link to on the Web. But Singhal notes that the engineers in Building 43 are exploiting another democracy — the hundreds of millions who search on Google. The data people generate when they search — what results they click on, what words they replace in the query when they’re unsatisfied, how their queries match with their physical locations — turns out to be an invaluable resource in discovering new signals and improving the relevance of results. The most direct example of this process is what Google calls personalized search — a feature that uses someone’s search history and location as signals to determine what kind of results they’ll find useful. But more generally, Google has used its huge mass of collected data to bolster its algorithm with an amazingly deep knowledge base that helps interpret the complex intent of cryptic queries.
Take, for instance, the way Google’s engine learns which words are synonyms. “We discovered a nifty thing very early on,” Singhal says. “People change words in their queries. So someone would say, ‘pictures of dogs,’ and then they’d say, ‘pictures of puppies.’ So that told us that maybe ‘dogs’ and ‘puppies’ were interchangeable. We also learned that when you boil water, it’s hot water. We were relearning semantics from humans, and that was a great advance.”
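To make the query-rewrite idea concrete, here is a minimal sketch of mining candidate synonyms from search sessions. It is only an illustration of the principle Singhal describes, not Google's actual method: the function name `substitution_pairs`, the session format (a list of consecutive queries per user), and the toy data are all assumptions made up for this example.

```python
from collections import Counter


def substitution_pairs(sessions):
    """Count single-word swaps between consecutive queries in a session.

    Hypothetical helper: a swap is recorded only when two consecutive
    queries have the same length and differ in exactly one position.
    """
    counts = Counter()
    for queries in sessions:
        for before, after in zip(queries, queries[1:]):
            a, b = before.split(), after.split()
            if len(a) != len(b):
                continue
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) == 1:
                # Store the pair in sorted order so (dogs, puppies) and
                # (puppies, dogs) are counted together.
                counts[tuple(sorted(diffs[0]))] += 1
    return counts


# Toy sessions (invented for illustration).
sessions = [
    ["pictures of dogs", "pictures of puppies"],
    ["cute puppies", "cute dogs"],
    ["boiling water", "hot water"],
]
pairs = substitution_pairs(sessions)
```

Pairs that are swapped often, such as `("dogs", "puppies")` here, become synonym candidates. A real system would presumably need far more data and some filtering, since people also rewrite queries to change their meaning.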
But there were obstacles. Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games” — not poached pooches. That helped the algorithm understand what “hot dog” — and millions of other terms — meant. “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.” ...
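The "words are defined by context" fix can be sketched with simple co-occurrence counts. The code below is a toy illustration under stated assumptions, not the system the article describes: the helper names `context_vector` and `cosine`, the substring matching, and the five example documents are all invented here.

```python
import math
from collections import Counter


def context_vector(phrase, documents):
    """Count words that share a document with `phrase` (hypothetical helper)."""
    phrase_words = set(phrase.split())
    vec = Counter()
    for doc in documents:
        if phrase in doc:  # naive substring match, fine for a toy corpus
            vec.update(w for w in doc.split() if w not in phrase_words)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (invented for illustration).
documents = [
    "hot dog with mustard on bread at baseball games",
    "grilled hot dog bread and mustard",
    "sausage with mustard on bread",
    "puppy chewing a toy in the yard",
    "training a puppy with a toy",
]

hot_dog = context_vector("hot dog", documents)
sausage = context_vector("sausage", documents)
puppy = context_vector("puppy", documents)

print(cosine(hot_dog, sausage) > cosine(hot_dog, puppy))
```

Treating "hot dog" as a single unit and looking at its neighbors puts it near food words rather than near "puppy", which is the intuition behind the fix. The word-by-word synonym substitution never gets a chance to produce a boiling puppy.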
[mike siwek lawyer mi]
The Mike Siwek query illustrates how Google accomplishes this. When Singhal types in a command to expose a layer of code underneath each search result, it’s clear which signals determine the selection of the top links: a bi-gram connection to figure it’s a name; a synonym; a geographic location. “Deconstruct this query from an engineer’s point of view,” Singhal explains. “We say, ‘Aha! We can break this here!’ We figure that lawyer is not a last name and Siwek is not a middle name. And by the way, lawyer is not a town in Michigan. A lawyer is an attorney.”
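As a rough sketch of the three signals mentioned for this query (a bi-gram that marks a name, a synonym, a geographic location), the following toy parser breaks the query apart. Everything in it is an assumption for illustration: the lookup tables `NAME_BIGRAMS`, `SYNONYMS`, and `REGIONS` are hand-filled stand-ins for what would in practice be learned from data, and `deconstruct` is a hypothetical name.

```python
# Hand-filled toy tables (assumptions, not real Google data).
NAME_BIGRAMS = {("mike", "siwek")}
SYNONYMS = {"lawyer": ["attorney"]}
REGIONS = {"mi": "Michigan"}


def deconstruct(query):
    """Split a query into a name, expanded terms, and a location."""
    tokens = query.lower().split()
    parsed = {"name": None, "terms": [], "location": None}
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in NAME_BIGRAMS:
            # Two adjacent tokens known to form a personal name.
            parsed["name"] = " ".join(pair)
            i += 2
            continue
        token = tokens[i]
        if token in REGIONS:
            parsed["location"] = REGIONS[token]
        else:
            # Keep the original term and append any known synonyms.
            parsed["terms"].append([token] + SYNONYMS.get(token, []))
        i += 1
    return parsed


result = deconstruct("mike siwek lawyer mi")
```

Here `result` groups "mike siwek" as a name, expands "lawyer" to include "attorney", and reads "mi" as Michigan, mirroring the breakdown Singhal walks through.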
This is the hard-won realization from inside the Google search engine, culled from the data generated by billions of searches: a rock is a rock. It’s also a stone, and it could be a boulder. Spell it “rokc” and it’s still a rock. But put “little” in front of it and it’s the capital of Arkansas. Which is not an ark. Unless Noah is around. “The holy grail of search is to understand what the user wants,” Singhal says. “Then you are not matching words; you are actually trying to match meaning.”
- Steve Hsu
- Professor of physics at the University of Oregon. Twitter: @hsu_steve
Sunday, February 28, 2010
Seeds of AI at Google
Semantic meaning from statistical learning and mechanical turk workers like you and me :-)
Labels: ai, google, machine learning