Showing posts with label data mining.

Monday, September 03, 2018

PanOpticon in my Pocket: 0.35GB/month of surveillance, no charge!

Your location is monitored roughly every 10 minutes, if not more often, thanks to your phone. There are multiple methods: GPS, Wi-Fi connections, cell-tower pings, even Bluetooth. This data is stored indefinitely and is available to certain people for analysis. Technically the data is anonymous, but it is easy to connect your geolocation data to your real-world identity -- the data shows where you sleep at night (home address) and where you work during the day. It can be cross-referenced with cookies placed on your browser by ad networks, so your online activities (purchases, web browsing, social media) can be linked to your spatio-temporal movements.

Some quantities that can be easily calculated from this data: How many people visited a specific Toyota dealership last month? How many times did someone test drive a car? Who were those people who test drove a car? How many people stopped or started a typical 9-to-5 commute pattern? (The BLS only dreams of knowing this number.) What was the occupancy of a specific hotel or rental property last month? How many people were on the 1:30 PM flight from LAX to LaGuardia last Friday? Who were they? ...

Of course, absolute numbers may be noisy, but diffs from month to month or year to year, with reasonable normalization / averaging, can yield insights at the micro, macro, and individual firm level.
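To make the flavor of these queries concrete, here is a minimal sketch of the first one: unique visitors to a point of interest per month, plus the month-over-month change. The ping schema (device ID, timestamp, latitude, longitude) and the 75-meter geofence are assumptions for illustration, not any vendor's actual pipeline.

```python
# Toy sketch (not any vendor's actual pipeline): given anonymized location pings
# with a hypothetical schema (device_id, timestamp, lat, lon), estimate monthly
# unique visitors to a point of interest and the month-over-month change.
import math
from collections import defaultdict

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def monthly_unique_visitors(pings, poi_lat, poi_lon, radius_m=75.0):
    """pings: iterable of (device_id, datetime, lat, lon). Returns {(year, month): count}."""
    seen = defaultdict(set)
    for device_id, ts, lat, lon in pings:
        if haversine_m(lat, lon, poi_lat, poi_lon) <= radius_m:
            seen[(ts.year, ts.month)].add(device_id)
    return {month: len(devices) for month, devices in sorted(seen.items())}

def mom_change(counts):
    """Month-over-month relative change; absolute counts are noisy, diffs less so."""
    months = sorted(counts)
    return {m2: (counts[m2] - counts[m1]) / max(counts[m1], 1)
            for m1, m2 in zip(months, months[1:])}
```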

If your quant team is not looking at this data, it should be ;-)

Google Data Collection
Professor Douglas C. Schmidt, Vanderbilt University
August 15, 2018

... Both Android and Chrome send data to Google even in the absence of any user interaction. Our experiments show that a dormant, stationary Android phone (with Chrome active in the background) communicated location information to Google 340 times during a 24-hour period, or at an average of 14 data communications per hour. In fact, location information constituted 35% of all the data samples sent to Google. In contrast, a similar experiment showed that on an iOS Apple device with Safari (where neither Android nor Chrome were used), Google could not collect any appreciable data (location or otherwise) in the absence of a user interaction with the device.

e. After a user starts interacting with an Android phone (e.g. moves around, visits webpages, uses apps), passive communications to Google server domains increase significantly, even in cases where the user did not use any prominent Google applications (i.e. no Google Search, no YouTube, no Gmail, and no Google Maps). This increase is driven largely by data activity from Google’s publisher and advertiser products (e.g. Google Analytics, DoubleClick, AdWords). Such data constituted 46% of all requests to Google servers from the Android phone. Google collected location at a 1.4x higher rate compared to the stationary phone experiment with no user interaction. Magnitude-wise, Google’s servers communicated 11.6 MB of data per day (or 0.35 GB/month) with the Android device. This experiment suggests that even if a user does not interact with any key Google applications, Google is still able to collect considerable information through its advertiser and publisher products.

f. While using an iOS device, if a user decides to forgo the use of any Google product (i.e. no Android, no Chrome, no Google applications), and visits only non-Google webpages, the number of times data is communicated to Google servers still remains surprisingly high. This communication is driven purely by advertiser/publisher services. The number of times such Google services are called from an iOS device is similar to an Android device. In this experiment, the total magnitude of data communicated to Google servers from an iOS device is found to be approximately half of that from the Android device.

g. Advertising identifiers (which are purportedly “user anonymous” and collect activity data on apps and 3rd-party webpage visits) can get connected with a user’s Google identity. This happens via passing of device-level identification information to Google servers by an Android device. Likewise, the DoubleClick cookie ID (which tracks a user’s activity on the 3rd-party webpages) is another purportedly “user anonymous” identifier that Google can connect to a user’s Google Account if a user accesses a Google application in the same browser in which a 3rd-party webpage was previously accessed. Overall, our findings indicate that Google has the ability to connect the anonymous data collected through passive means with the personal information of the user.
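A toy sketch of the linkage mechanism described in (g), with invented field names (this illustrates the join, not Google's actual schema or internal pipeline): activity recorded under an "anonymous" advertising ID can be attached to a signed-in account whenever both records carry the same device-level identifier.

```python
# Toy illustration of de-anonymization by join (invented field names; not
# Google's actual schema): ad-ID activity plus a device identifier, joined
# against signed-in events that carry the same device identifier.
ad_events = [
    {"ad_id": "A-123", "device_id": "dev-9", "event": "3rd-party page visit"},
    {"ad_id": "A-123", "device_id": "dev-9", "event": "app activity"},
]
account_events = [
    {"account": "user@example.com", "device_id": "dev-9", "event": "signed-in app use"},
]

device_to_account = {e["device_id"]: e["account"] for e in account_events}
linked = [
    {**e, "account": device_to_account[e["device_id"]]}
    for e in ad_events
    if e["device_id"] in device_to_account
]
print(linked)  # the "anonymous" ad-ID history is now attached to a real account
```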

Monday, October 19, 2015

Men Are Easy



@9 min: 26 million matches per day on Tinder. Male preferences are easy to predict, female preferences more complex! Linear vs. multivariate nonlinear preferences? Calling Geoffrey Miller ...
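Here is what "linear vs. multivariate nonlinear" would mean in practice, sketched on purely synthetic data (no real dating data used or implied): when the underlying preference function is a weighted sum of features, a linear model predicts it well; when it involves interactions and thresholds, only a nonlinear model keeps up.

```python
# Synthetic illustration only (no real dating data): a "linear" rater vs. a
# "multivariate nonlinear" rater, fit with a linear model and a random forest.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))  # four hypothetical profile features

# Linear rater: a simple weighted sum of features.
y_linear = X @ np.array([0.8, 0.1, 0.05, 0.05]) + rng.normal(scale=0.1, size=5000)
# Nonlinear rater: interactions and thresholds matter.
y_nonlin = np.tanh(X[:, 0] * X[:, 1]) + (X[:, 2] > 0.5) * X[:, 3] + rng.normal(scale=0.1, size=5000)

for name, y in [("linear rater", y_linear), ("nonlinear rater", y_nonlin)]:
    lin_r2 = LinearRegression().fit(X[:4000], y[:4000]).score(X[4000:], y[4000:])
    rf_r2 = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:4000], y[:4000]).score(X[4000:], y[4000:])
    print(f"{name}: linear model R^2 = {lin_r2:.2f}, random forest R^2 = {rf_r2:.2f}")
```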

Some data from OkCupid:



Thursday, November 01, 2012

Quants and campaigns

Big data, analytics, randomized experiments and modern political campaigns. "... the most advanced political marketers are ahead of commercial marketers." Turnout efforts targeted at voters whose preference is predictable are more efficient (in votes per dollar spent) than attempts at persuasion.
In The Victory Lab: The Secret Science of Winning Campaigns, Sasha Issenberg shows how cutting-edge social science and analytics are upending the way political campaigns are run in the 21st century. He describes the techniques -- persuasion experiments, innovative ways to mobilize voters, heavily researched electioneering methods -- and shows how they're being used.
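A back-of-the-envelope version of the turnout-vs-persuasion arithmetic above (every number here is made up purely for illustration):

```python
# Made-up numbers, purely to illustrate the votes-per-dollar comparison.
cost_per_contact = 2.0        # dollars per voter contacted (hypothetical)

# Turnout targeting: contact likely supporters who might otherwise stay home.
p_support = 0.90              # predicted probability the contact backs your candidate
turnout_lift = 0.05           # extra probability of voting caused by the contact
votes_per_dollar_turnout = (p_support * turnout_lift) / cost_per_contact

# Persuasion: contact likely voters and try to change their minds.
p_votes = 0.90                # probability the contact votes anyway
persuasion_rate = 0.01        # probability the contact actually switches
# A switched vote is worth ~2 (one gained, one denied to the opponent).
votes_per_dollar_persuasion = (p_votes * persuasion_rate * 2) / cost_per_contact

print(votes_per_dollar_turnout, votes_per_dollar_persuasion)  # 0.0225 vs 0.009
```

With these assumed rates, turnout targeting delivers more than twice the votes per dollar; the real-world comparison obviously depends on how accurate the underlying support and turnout models are.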


Sunday, March 25, 2012

Time machines, robots and silicon gods



The NY Times has a great profile of Gil Elbaz, founder of the startup which eventually became Google AdSense.

Elbaz's early aspirations sound like those of other Caltechers, except perhaps the part about being rich :-)

NYTimes: At 7 years old, Gilad Elbaz wrote, “I want to be a rich mathematician and very smart.” That, he figured, would help him “discover things like time machines, robots and machines that can answer any question.”

He is old enough to have experienced the horribly dull pre-internet technology world. Young people today can't imagine how much more limited the opportunities were back then for people with math/science/engineering ability. From the article, Elbaz definitely had an entrepreneurial bent already as a kid.

... At Caltech, Mr. Elbaz majored in applied science and economics. Interested in the subject of monopolies, he won an award for a paper that determined that companies would take financial losses to corner their markets.

He worked for I.B.M. for two years, looking at the use of computers in problems of manufacturing, then went to Sybase, a database company. This was in the early 1990s, when I.B.M. was stumbling in the transition from mainframe computers to servers and PCs.

His younger brother says he thinks that the experience changed him. Many employees were “just trying to hold on to their jobs, not working together for the company,” Eytan says. He recalls how Gil, concerned about how employees were hoarding their data, “started talking about how much better it would be if people shared data.”

Mr. Elbaz then joined a semiconductor start-up called Microunity and became a consultant, saving money and playing the stock market to help finance his own first business. His father gave him $10,000 to invest for him, which Mr. Elbaz tripled in 18 months. When Mr. Elbaz and a Caltech friend decided to form a company in 1998 — it became Applied Semantics — his father told him to put the stock winnings into it.

Applied Semantics software quickly scanned thousands of Web pages for their meaning. By parsing content, it could tell businesses what kind of ads would work well on a particular page. It had 45 employees and was profitable when Google acquired it in 2003 for $102 million in cash and pre-I.P.O. stock.
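For flavor, here is a toy version of contextual ad matching of the kind described above; this is not Applied Semantics' actual algorithm, just the simplest possible keyword-overlap scorer, with hypothetical ad categories.

```python
# Toy illustration of contextual ad matching (not Applied Semantics' actual
# algorithm): score hypothetical ad categories against a page by keyword overlap.
from collections import Counter
import math

AD_CATEGORIES = {  # hypothetical category keyword lists
    "auto": ["car", "dealer", "sedan", "lease", "mpg"],
    "travel": ["flight", "hotel", "beach", "itinerary", "resort"],
}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_category(page_text: str) -> str:
    words = Counter(page_text.lower().split())
    scores = {cat: cosine(words, Counter(kw)) for cat, kw in AD_CATEGORIES.items()}
    return max(scores, key=scores.get)

print(best_category("Lease a new sedan at your local car dealer and compare mpg"))  # -> "auto"
```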

While Mr. Elbaz would not say how much he made from the deal, his father’s $30,000 from the stock investments was eventually worth $18 million. “He certainly changed my retirement,” Nissim Elbaz says.

When I met Elbaz a few years ago, I have to admit I thought the business model for his current startup Factual was kind of shaky. Although I'm a proponent of Big Data, doing stuff that is technically cool is not the same as creating economic value. My understanding from someone at WolframAlpha is that cleaning and "curating" data is a lot of work!

... His mental and financial assets, he says, are like gifts he needs to deploy so the world works better.

“If all data was clear, a lot fewer people would subtract value from the world,” he says. “A lot more people would add value.”

Creating clear, reliable data could also make Factual a very big company.

“Gil is pretty far ahead of the rest of us, the one entrepreneur where it takes a few meetings before I really understand everything he is talking about,” says Ben Horowitz, a venture capitalist who backed Factual through his firm, Andreessen Horowitz. “Three years ago, he thought Factual was his biggest chance to change the world. Over time, the world has moved his way.”

... Factual’s plan, outlined in a big orange room with a few tables and walled with whiteboards, is to build the world’s chief reference point for thousands of interconnected supercomputing clouds. The digital world is expected to hold a collective 2.7 zettabytes of data by year-end, an amount roughly equivalent to 700 billion DVDs. Factual, which now has 50 employees, could prove immensely valuable as this world grows and these databases begin to interact.

Thursday, August 11, 2011

A problem for data scientists

If flash mobs or riots (like the ones in London) are organized using Twitter, Facebook, BlackBerry and SMS, won't it be very easy to catch the people responsible? Not only are the organizers / initiators easy to track down, but with geolocation (GPS or cell tower) and a court order it would be easy to determine whether any particular individual had participated. Perhaps current privacy laws prevent that data from being stored, but we can easily modify the laws if necessary.
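As a sketch of how simple that determination would be given the data (hypothetical ping schema of device ID, timestamp, latitude, longitude; the radius and distance approximation are illustrative):

```python
# Sketch only: was a given device inside an event's location and time window?
import math

def within_m(lat1, lon1, lat2, lon2, radius_m):
    # Equirectangular approximation; adequate at city scale.
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return math.hypot(x, y) * 6371000.0 <= radius_m

def was_present(pings, device_id, event_lat, event_lon, start, end, radius_m=200.0):
    """pings: iterable of (device_id, datetime, lat, lon)."""
    return any(
        d == device_id and start <= ts <= end
        and within_m(lat, lon, event_lat, event_lon, radius_m)
        for d, ts, lat, lon in pings
    )
```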

Where are those law enforcement data scientists when you need them? :-)

The rise of data science

See also this follow up article from O'Reilly Radar, and the earlier post Exuberant geeks.

What is data science: ... Data science requires skills ranging from traditional computer science to mathematics to art. Describing the data science group he put together at Facebook (possibly the first data science group at a consumer-oriented web property), Jeff Hammerbacher said:

"... on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization."

Where do you find the people this versatile? According to DJ Patil, chief scientist at LinkedIn (@dpatil), the best data scientists tend to be "hard scientists," particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you've just spent a lot of grant money generating data, you can't just throw the data out if it isn't as clean as you'd like. You have to make it tell its story. You need some creativity for when the story the data is telling isn't what you think it's telling.

... Entrepreneurship is another piece of the puzzle. Patil's first flippant answer to "what kind of person are you looking for when you hire a data scientist?" was "someone you would start a company with." That's an important insight: we're entering the era of products that are built on data. We don't yet know what those products are, but we do know that the winners will be the people, and the companies, that find those products. Hilary Mason came to the same conclusion. Her job as scientist at bit.ly is really to investigate the data that bit.ly is generating, and find out how to build interesting products from it. No one in the nascent data industry is trying to build the 2012 Nissan Stanza or Office 2015; they're all trying to find new products. In addition to being physicists, mathematicians, programmers, and artists, they're entrepreneurs.

Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: "here's a lot of data, what can you make from it?"

The future belongs to the companies who figure out how to collect and use data successfully. Google, Amazon, Facebook, and LinkedIn have all tapped into their datastreams and made that the core of their success. They were the vanguard, but newer companies like bit.ly are following their path. Whether it's mining your personal biology, building maps from the shared experience of millions of travellers, or studying the URLs that people pass to others, the next generation of successful businesses will be built around data.

Here is a nice talk on machine learning and data science by Hilary Mason of bit.ly. One of my students will be working with her starting in the fall.

Saturday, October 10, 2009

Spooks drowning in data

Almost every technical endeavor, from finance to high energy physics to biology to internet security to spycraft, is either already or soon to be drowning in Big Data. This is an inevitable consequence of exponential Moore's Laws in bandwidth, processing power, and storage, combined with improved "sensing" capability. The challenge is extracting meaning from all that data.

My impression is that the limiting factor at the moment is the human brainpower necessary to understand the idiosyncrasies of the particular problem and, simultaneously, to develop the appropriate algorithms. There are simply not enough people around who are good at this; it's not just a matter of algorithms, you need insight into the specific situation. Equally important, the (usually non-technical) decision makers who have to act on the data need some rough grasp of the strengths and limitations of the methods, so that they don't have to treat the results as coming from a black box.

To give you my own little example of big data: on my desk (in Oakland, not in Eugene) I have stacks of terabyte drives holding copies of essentially every Windows executable (program that runs on a flavor of Windows) that has appeared on the web in the past few years. About 5 percent of these are malware, and we also store a record of what each executable does once it's installed. Gathering this data was only modestly hard; analyzing it in a meaningful way is a lot harder!
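Even the very first, boring step of working with a corpus like this is instructive. A minimal sketch (hypothetical directory layout and a hypothetical known-malware hash list, not our actual system):

```python
# Minimal sketch (hypothetical paths and label source): deduplicate a corpus of
# executables by cryptographic hash and join against known-malware labels.
import hashlib
from pathlib import Path

KNOWN_MALWARE_SHA256 = set()  # would be loaded from a labels database

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def dedupe_and_label(corpus_dir: str):
    seen = {}
    for path in Path(corpus_dir).rglob("*.exe"):
        digest = sha256_of(path)
        if digest not in seen:
            seen[digest] = {"path": path, "malware": digest in KNOWN_MALWARE_SHA256}
    return seen  # one entry per unique binary

# e.g. dedupe_and_label("/data/windows_executables")  # hypothetical location
```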

NY Review of Books: On a remote edge of Utah's dry and arid high desert, where temperatures often zoom past 100 degrees, hard-hatted construction workers with top-secret clearances are preparing to build what may become America's equivalent of Jorge Luis Borges's "Library of Babel," a place where the collection of information is both infinite and at the same time monstrous, where the entire world's knowledge is stored, but not a single word is understood. At a million square feet, the mammoth $2 billion structure will be one-third larger than the US Capitol and will use the same amount of energy as every house in Salt Lake City combined.

Unlike Borges's "labyrinth of letters," this library expects few visitors. It's being built by the ultra-secret National Security Agency—which is primarily responsible for "signals intelligence," the collection and analysis of various forms of communication—to house trillions of phone calls, e-mail messages, and data trails: Web searches, parking receipts, bookstore visits, and other digital "pocket litter." Lacking adequate space and power at its city-sized Fort Meade, Maryland, headquarters, the NSA is also completing work on another data archive, this one in San Antonio, Texas, which will be nearly the size of the Alamodome.

Just how much information will be stored in these windowless cybertemples? A clue comes from a recent report prepared by the MITRE Corporation, a Pentagon think tank. "As the sensors associated with the various surveillance missions improve," says the report, referring to a variety of technical collection methods, "the data volumes are increasing with a projection that sensor data volume could potentially increase to the level of Yottabytes (10^24 bytes) by 2015."[1] Roughly equal to about a septillion (1,000,000,000,000,000,000,000,000) pages of text, numbers beyond Yottabytes haven't yet been named. Once vacuumed up and stored in these near-infinite "libraries," the data are then analyzed by powerful infoweapons, supercomputers running complex algorithmic programs, to determine who among us may be—or may one day become—a terrorist. In the NSA's world of automated surveillance on steroids, every bit has a history and every keystroke tells a story.

... Where does all this leave us? Aid concludes that the biggest problem facing the agency is not the fact that it's drowning in untranslated, indecipherable, and mostly unusable data, problems that the troubled new modernization plan, Turbulence, is supposed to eventually fix. "These problems may, in fact, be the tip of the iceberg," he writes. Instead, what the agency needs most, Aid says, is more power. But the type of power to which he is referring is the kind that comes from electrical substations, not statutes. "As strange as it may sound," he writes, "one of the most urgent problems facing NSA is a severe shortage of electrical power." With supercomputers measured by the acre and estimated $70 million annual electricity bills for its headquarters, the agency has begun browning out, which is the reason for locating its new data centers in Utah and Texas. And as it pleads for more money to construct newer and bigger power generators, Aid notes, Congress is balking.

The issue is critical because at the NSA, electrical power is political power. In its top-secret world, the coin of the realm is the kilowatt. More electrical power ensures bigger data centers. Bigger data centers, in turn, generate a need for more access to phone calls and e-mail and, conversely, less privacy. The more data that comes in, the more reports flow out. And the more reports that flow out, the more political power for the agency.

Rather than give the NSA more money for more power—electrical and political—some have instead suggested just pulling the plug. "NSA can point to things they have obtained that have been useful," Aid quotes former senior State Department official Herbert Levin, a longtime customer of the agency, "but whether they're worth the billions that are spent, is a genuine question in my mind."

Based on the NSA's history of often being on the wrong end of a surprise and a tendency to mistakenly get the country into, rather than out of, wars, it seems to have a rather disastrous cost-benefit ratio. Were it a corporation, it would likely have gone belly-up years ago. The September 11 attacks are a case in point. For more than a year and a half the NSA was eavesdropping on two of the lead hijackers, knowing they had been sent by bin Laden, while they were in the US preparing for the attacks. The terrorists even chose as their command center a motel in Laurel, Maryland, almost within eyesight of the director's office. Yet the agency never once sought an easy-to-obtain FISA warrant to pinpoint their locations, or even informed the CIA or FBI of their presence.

But pulling the plug, or even allowing the lights to dim, seems unlikely given President Obama's hawkish policies in Afghanistan. However, if the war there turns out to be the train wreck many predict, then Obama may decide to take a much closer look at the spy world's most lavish spender. It is a prospect that has some in the Library of Babel very nervous. "It was a great ride while it lasted," said one.

Sunday, May 11, 2008

On data mining

Last week we had Jiawei Han of UIUC here to give a talk: Exploring the Power of Links in Information Network Mining. He's the author of a well-known book on data mining.

During our conversation we discussed a number of projects his group has worked on in the past, all of which involve teasing out the structure in large bodies of data. Being a lazy theorist, my attitude toward data mining has been as follows: sit and think about the problem, come up with a list of potential signals, then analyze the data to see which signals actually work. The point being that the good signals would turn out to be a subset (or possibly a combination) of the ones you could think of a priori -- i.e., ones for which there is a plausible, human-comprehensible reason.

In many of the examples we discussed I was able to guess the main signals that turned out to be useful. However, Han impressed on me that, these days, with gigantic corpora of data available, one often encounters very subtle signals that are found only by algorithm -- signals that human intuition completely fails to identify. (Gee, why that weird linear combination of those inputs, with alternating signs, even?! :-)
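A toy version of that phenomenon, on synthetic data purely to illustrate the point: generate a couple hundred candidate signals, make the target depend on an odd alternating-sign combination of a few of them, and let a sparse regression recover a combination no human would have proposed a priori.

```python
# Toy example of an algorithm finding a "weird" combination humans wouldn't guess.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 200))                      # 200 candidate signals
true_w = np.zeros(200)
true_w[[17, 42, 99, 143]] = [1.3, -0.9, 0.7, -0.4]    # the odd alternating-sign mix
y = X @ true_w + rng.normal(scale=0.5, size=2000)

model = LassoCV(cv=5).fit(X, y)
recovered = {i: round(w, 2) for i, w in enumerate(model.coef_) if abs(w) > 0.05}
print(recovered)  # approximately {17: 1.3, 42: -0.9, 99: 0.7, 143: -0.4}
```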

Our conversation made me want to get my hands dirty on some big data mining project. Of course, it's much easier for him -- his group has something like ten graduate students at a time! Interestingly, he identified this ability to tap into large chunks of manpower as an advantage of being in academia as opposed to, e.g., at Microsoft Research. Of course, if you are doing very commercially applicable research you can access even greater resources at a company lab or startup, but for blue-sky academic work that wouldn't be the case.
