WIRED: ... Google’s engineers have discovered that some of the most important signals can come from Google itself. PageRank has been celebrated as instituting a measure of populism into search engines: the democracy of millions of people deciding what to link to on the Web. But Singhal notes that the engineers in Building 43 are exploiting another democracy — the hundreds of millions who search on Google. The data people generate when they search — what results they click on, what words they replace in the query when they’re unsatisfied, how their queries match with their physical locations — turns out to be an invaluable resource in discovering new signals and improving the relevance of results. The most direct example of this process is what Google calls personalized search — a feature that uses someone’s search history and location as signals to determine what kind of results they’ll find useful. But more generally, Google has used its huge mass of collected data to bolster its algorithm with an amazingly deep knowledge base that helps interpret the complex intent of cryptic queries.
Take, for instance, the way Google’s engine learns which words are synonyms. “We discovered a nifty thing very early on,” Singhal says. “People change words in their queries. So someone would say, ‘pictures of dogs,’ and then they’d say, ‘pictures of puppies.’ So that told us that maybe ‘dogs’ and ‘puppies’ were interchangeable. We also learned that when you boil water, it’s hot water. We were relearning semantics from humans, and that was a great advance.”
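As a rough illustration of the idea Singhal describes (not Google's actual method), the sketch below counts single-word substitutions between consecutive queries in a search session. The `sessions` data, the `synonym_candidates` name, and the `min_count` threshold are all assumptions made up for this example.

```python
from collections import Counter

def synonym_candidates(sessions, min_count=2):
    """Count one-word swaps between consecutive queries in each session.

    `sessions` is a hypothetical list of query lists, one list per user session.
    """
    swaps = Counter()
    for queries in sessions:
        for before, after in zip(queries, queries[1:]):
            a, b = before.lower().split(), after.lower().split()
            if len(a) != len(b):
                continue  # only compare rewrites of the same length
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) == 1:
                # Sort the pair so dogs->puppies and puppies->dogs pool together.
                swaps[tuple(sorted(diffs[0]))] += 1
    # Keep only swaps seen often enough to be more than noise.
    return {pair: n for pair, n in swaps.items() if n >= min_count}

# Toy query log, invented for illustration.
sessions = [
    ["pictures of dogs", "pictures of puppies"],
    ["cute puppies", "cute dogs"],
    ["boiling water", "hot water"],
    ["pictures of dogs", "videos of cats"],  # two words changed: ignored
]
print(synonym_candidates(sessions))
```

On this toy log only the dogs/puppies pair clears the threshold. A real system would need far more data and some guard against unrelated rewrites, which this sketch does not attempt.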
But there were obstacles. Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games” — not poached pooches. That helped the algorithm understand what “hot dog” — and millions of other terms — meant. “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.” ...
[mike siwek lawyer mi]
The Mike Siwek query illustrates how Google accomplishes this. When Singhal types in a command to expose a layer of code underneath each search result, it’s clear which signals determine the selection of the top links: a bi-gram connection to figure it’s a name; a synonym; a geographic location. “Deconstruct this query from an engineer’s point of view,” Singhal explains. “We say, ‘Aha! We can break this here!’ We figure that lawyer is not a last name and Siwek is not a middle name. And by the way, lawyer is not a town in Michigan. A lawyer is an attorney.”
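A hedged sketch of what such a deconstruction might look like follows. The three lookup tables are hypothetical stand-ins for signals that, per the article, are learned from data; the `annotate` function and its output format are likewise assumptions for illustration, not a description of Google's code.

```python
# Hypothetical lookup tables standing in for learned signals.
KNOWN_NAME_BIGRAMS = {("mike", "siwek")}   # bi-gram signal: these two words form a name
SYNONYMS = {"lawyer": ["attorney"]}        # synonym signal
REGIONS = {"mi": "Michigan"}               # geographic signal

def annotate(query):
    """Split a query into labeled parts using the toy tables above."""
    tokens = query.lower().split()
    parts = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in KNOWN_NAME_BIGRAMS:
            parts.append(("name", " ".join(pair)))
            i += 2  # consume both words of the name
        elif tokens[i] in REGIONS:
            parts.append(("location", REGIONS[tokens[i]]))
            i += 1
        else:
            # Plain term, expanded with any known synonyms.
            parts.append(("term", [tokens[i]] + SYNONYMS.get(tokens[i], [])))
            i += 1
    return parts

print(annotate("mike siwek lawyer mi"))
```

Each labeled part could then be matched differently: the name as an exact phrase, the term against any of its synonyms, and the location as a filter.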
This is the hard-won realization from inside the Google search engine, culled from the data generated by billions of searches: a rock is a rock. It’s also a stone, and it could be a boulder. Spell it “rokc” and it’s still a rock. But put “little” in front of it and it’s the capital of Arkansas. Which is not an ark. Unless Noah is around. “The holy grail of search is to understand what the user wants,” Singhal says. “Then you are not matching words; you are actually trying to match meaning.”
Pessimism of the Intellect, Optimism of the Will
Sunday, February 28, 2010
Seeds of AI at Google
Semantic meaning from statistical learning and mechanical turk workers like you and me :-)
Labels: ai, google, machine learning