Keeping track of personal Web updates is becoming more and more difficult, if only because of the number of times sites are updated each day. Michael Tung, cofounder and CEO of Diffbot, realized this while in grad school. Tung attended Stanford with the company’s other cofounder, Leith Abdulla, and quickly learned there that he would need some sort of system to organize his computer science course load.
“I had an idea to create a tool that would review the class websites and extract new information on the site, and then notify me on my phone when information changed,” Tung said. This robotics-based technology became known as Diffbot, which he said simply stands for “different robot,” a computer program that scans the Web for different types of information.
Diffbot learned the Web with the help of Abdulla, Tung and months of research. Tung said that while doing projects for his artificial intelligence courses, he realized that he could use computer-vision techniques to understand, analyze and extract information from websites, much as humans make sense of individual websites.
Tung said the Diffbot team spent several months researching the Web, asking friends and test users to submit URLs to the service. As they received more and more types of websites, Tung said it became apparent that the Web as we know it is divided into 30 different categories.
He created learning APIs that correspond to these categories, two of which have been released to the public: On-Demand and Follow. “There is a fixed amount of page types that humans can recognize, and they span cultures and languages,” he said, adding that no matter what language a website is written in, all ads, headers, footers and modules are rendered in such a way that any human can tell what he or she is looking at.
The learning APIs can be accessed for free for up to 50,000 calls, after which pricing follows an on-demand model. The On-Demand API, divided into Frontpage and Articles, analyzes home pages, index pages and article pages. It can “learn” specific information based on headlines, bylines, images, article text and tags from a variety of articles.
The Follow API can notify a user of changes or updates made to any webpage.
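The idea behind change notification can be illustrated with a minimal sketch. It is not Diffbot's implementation, which the article does not describe; it simply compares two saved text snapshots of a page using Python's standard difflib module. The function names `changed_lines` and `check_for_update` are hypothetical, and fetching the page is left out so that the example stays self-contained.

```python
import difflib


def changed_lines(old_text: str, new_text: str) -> list[str]:
    """Return the lines that are new or modified in new_text."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff marks lines that exist only in the second sequence with "+ ".
    return [line[2:] for line in diff if line.startswith("+ ")]


def check_for_update(old_text: str, new_text: str, notify=print) -> list[str]:
    """Call notify once per new line (hypothetical stand-in for a phone alert)."""
    added = changed_lines(old_text, new_text)
    for line in added:
        notify(f"Page changed: {line}")
    return added


if __name__ == "__main__":
    yesterday = "Homework 1 due Friday\nOffice hours: Tuesday"
    today = "Homework 1 due Friday\nHomework 2 posted\nOffice hours: Tuesday"
    check_for_update(yesterday, today)
```

A line-level diff of this kind reports every textual change, including ads and boilerplate, which is one reason a service would first need to work out which parts of a page matter.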
On-Demand is used with news sites and news applications, like the AOL Editions application, an iPad magazine that, according to its tagline, “reads you.” The developers who created the Editions application used the search capabilities to direct the Diffbot APIs to “learn” what readers of the iPad application like to see, and then deliver it on a daily basis.
Applications built on the Diffbot technology and APIs can also analyze the information displayed on an article page, understand keywords and let developers categorize the content. They can analyze homepages, generate an RSS feed based on specific keywords so that the application can follow any topic on the Internet, and convert Web pages into a mobile format.
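As a rough illustration of a keyword-based feed, the sketch below filters a list of already-extracted articles and writes the matches out as RSS 2.0. It does not use the Diffbot APIs. The function `build_keyword_feed`, the dictionary keys `title`, `url` and `text`, and the example.com addresses are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_keyword_feed(articles, keyword,
                       feed_title="Keyword feed",
                       feed_link="http://example.com/"):
    """Return an RSS 2.0 document (as a string) of articles mentioning keyword."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires a title, link and description on the channel.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = f"Articles mentioning '{keyword}'"

    needle = keyword.lower()
    for article in articles:
        # Case-insensitive substring match on title and body (an assumed rule).
        haystack = f"{article['title']} {article['text']}".lower()
        if needle in haystack:
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = article["title"]
            ET.SubElement(item, "link").text = article["url"]
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    sample = [
        {"title": "Robots learn the Web", "url": "http://example.com/a",
         "text": "A startup teaches software to read pages."},
        {"title": "Stock update", "url": "http://example.com/b",
         "text": "Markets rose."},
    ]
    print(build_keyword_feed(sample, "robot"))
```

Plain substring matching is the simplest possible rule; a production service would presumably rely on the extracted tags and keywords instead.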
Tung said that developers can use this technology to create any type of Web or mobile app they can conceive of, and end users can try it out through Diffbot’s beta front-end user interface, FeedBeater.com. The FeedBeater.com site gives end users and developers a feel for how the technology works, while the developer site offers more granular and direct access. Developer access to the APIs lets them regulate the extraction terms through HTML instead of by clicking on different modules of the site, giving them a deeper, more direct way to extract information from certain websites.
The algorithms that Diffbot uses to extract information learn at an exceptional rate; Tung said they are updated three times per week and will continue learning perpetually. He added that the Diffbot team is working on releasing the 28 other categories it identified on the Web as APIs in the near future. He also said that the technology will continue to change and grow as the Internet grows; there may be no limit to this robot’s learning abilities.