Developers working on data mining projects should build Iron Man, not R2D2. That’s according to speakers at the second annual Predictive Analytics World Conference, which ended yesterday in San Francisco. The conference focused on building better tools for analysts rather than building autonomous robots that make decisions.
John F. Elder IV, founder and CEO of Elder Research, said the U.S. government has been an early adopter of text-mining technology. “It turns out that government agencies are taking the lead in text mining. Their data is huge, their need is great and they are willing to act on it,” he said.
Text mining is the act of extracting information from streams of text.
“Text mining is a lot like the Wild West right now, like data mining was a few years ago,” Elder said, explaining that the market for text mining was still in a fledgling state. Elder Research has contracted with the U.S. government to implement data mining solutions.
Elder said one of the key problems with text mining is that many projects attempt to automate the entire process. He said this is the wrong way to deal with data mining.
“Anyone who works with computers and humans knows their strengths are complementary; they are not alike,” he said. That means developers shouldn’t build analytics robots, but rather exoskeletal systems, metaphorically similar to the superhero Iron Man, that can enhance the comprehension and usefulness of the statisticians who use those systems.
Using predictive analytics, Elder implemented a system for the Social Security Administration that sped up the process of approving disabled applicants for insurance. For 20% of those applying for disability insurance, the process is fairly straightforward, but due to the complications introduced by the other 80% of applicants, everyone had to wait.
Elder’s system allowed the 20% to be instantly approved if their applications were similar to others that were worthy of the express lane. The system, he said, was 90% to 95% accurate in predicting which applications could be quickly approved, and it did so by analyzing the words used in the application.
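The general idea of routing an application by comparing its wording to previously approved cases can be illustrated with a short sketch. The Python below is not Elder’s system, whose internals were not described at the conference; it assumes a simple bag-of-words cosine similarity, and the function names and the 0.6 threshold are hypothetical choices made only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count how often each word appears."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def route_application(text, approved_examples, threshold=0.6):
    """Return "express" if the text closely resembles any previously
    approved example, otherwise "standard".

    The threshold is an assumed value, not one taken from the article.
    """
    vector = bag_of_words(text)
    best = max(
        (cosine_similarity(vector, bag_of_words(example))
         for example in approved_examples),
        default=0.0,
    )
    return "express" if best >= threshold else "standard"
```

In keeping with the Iron Man theme, a sketch like this would only sort the queue: anything routed to "standard" still goes to a human reviewer.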
One of the true pathfinders of predictive analytics, as they apply to marketing, is Andreas Weigend, former chief scientist at Amazon.com. He spoke about the many analytic upselling tools Amazon uses. These include systems like “Customers who bought this also bought…” and the company’s “Share the love” program, which gives discounts to people who buy a product based on a friend’s purchase.
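A “Customers who bought this also bought…” feature can be approximated, in its simplest form, by counting which items appear together in the same order. The Python sketch below shows only that counting idea; it is not Amazon’s method, and the function names and the shape of the order data are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_co_purchases(orders):
    """Count, for every item, how often each other item shares an order.

    `orders` is assumed to be a list of orders, each a list of item names.
    """
    co_purchases = defaultdict(Counter)
    for order in orders:
        # set() ensures a repeated item in one order is counted only once.
        for item, other in permutations(set(order), 2):
            co_purchases[item][other] += 1
    return co_purchases


def also_bought(co_purchases, item, top_n=3):
    """Return up to top_n items most often bought alongside `item`."""
    counts = co_purchases.get(item, Counter())
    return [other for other, _ in counts.most_common(top_n)]
```

A production recommender would also need to correct for items that are simply popular with everyone, which raw co-occurrence counts do not do.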
Weigend said that “data is only worth as much as the decisions made based on that data. A good approach to look at the economics of data is to look at what is genuinely scarce and what is abundant.” He also pointed out that while customers help companies generate data, these same companies rarely share that data with their users.
Some companies do share that data, however. Facebook recently published a blog entry where it correlated relationship status with happiness, and dating site OkCupid frequently publishes data about its users, such as how often people take showers, by state.
But no matter what kind of data was being crunched at the conference, there was one general sentiment shared by all: There is just too much data out there. The big data problem gets worse every year, said attendees, and finding a place to put all that data is half of the challenge of predictive analytics.
Cloudera CEO Mike Olson believes that this is where the Apache Hadoop project will see great growth. Hadoop is an application framework for building scalable batch processing systems, and it’s often used by companies such as Hulu, as well as government agencies such as the National Security Agency, where it is used to turn unstructured data into information-rich data stores.
“It used to be that all of the data in your business was well-structured data,” Olson said. “Increasingly, people are having to deal with weblogs, documents and sensor data from assembly lines. Universal data is much more interesting than it used to be. You just need a different kind of platform for that.”