From Chaucer’s “The Canterbury Tales” to newspapers documenting 19th-century wars, the recent credit crunch and the Olympics, the British Library has worked for more than 250 years to preserve the history and social culture those published works hold. Now, as more information goes digital, those working to collect as much information as possible are beginning to worry about the “digital black hole.” So much information is created every day that older information is ultimately erased or lost, leaving generations to come with a potential void in their history.
“About 15 petabytes of information are created every day in the world,” said David Boloker, IBM’s CTO for emerging Internet technologies. A petabyte can be thought of as about eight times the amount of information held in all United States libraries today, he said.
Recent research by the British Library estimates the average life expectancy of a website at between 44 and 75 days, and it suggests that every six months, 10% of all U.K. websites are either lost through assimilation into other content or replaced by new data.
In an attempt to help the British Library harness an endless trove of information from the U.K. and Ireland, IBM worked with the library this past December to install IBM BigSheets. The program is a new technology prototype designed to build “a Web-insight engine that can deal with huge amounts of data and basically ingest it,” Boloker explained. “BigSheets can then in essence map and graph the data and reduce to what is understandable by a human being.”
Taking about two years to develop, and now in its fifth generation of code, BigSheets is built on Apache’s Hadoop framework and has about two more years to go before becoming a robust technology, Boloker said. In use, the program will crawl the roughly 8 million websites in the .uk domain (expected to grow to as many as 11 million by 2011) and take “snapshots” of Web pages by fetching and copying each page, which is saved in the WARC (Web ARChive) file format. BigSheets then stores that content in the Hadoop Distributed File System (HDFS) for further processing.
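The snapshot step can be sketched roughly as follows. This is a minimal illustration, not the library’s actual crawler: the article names only the WARC format and HDFS, so the use of the third-party warcio and requests libraries, and every function and file name below, is an assumption made for the example.

```python
# Hypothetical sketch of the "snapshot" step: fetch one page and append it
# to a WARC file, using the warcio and requests libraries (an assumption;
# the article does not say what tooling the crawler uses).
import requests
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders


def snapshot(url: str, warc_path: str) -> None:
    """Fetch a single page and append it as a response record in a gzipped WARC file."""
    with open(warc_path, 'ab') as output:
        writer = WARCWriter(output, gzip=True)
        resp = requests.get(url, stream=True)

        # Preserve the HTTP status line and headers inside the WARC record.
        http_headers = StatusAndHeaders(
            f"{resp.status_code} {resp.reason}",
            resp.raw.headers.items(),
            protocol='HTTP/1.1')

        record = writer.create_warc_record(
            url, 'response', payload=resp.raw, http_headers=http_headers)
        writer.write_record(record)


if __name__ == '__main__':
    snapshot('http://example.co.uk/', 'snapshots.warc.gz')
    # The resulting file could then be copied into HDFS for processing,
    # e.g. with: hdfs dfs -put snapshots.warc.gz /archive/
```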
Analytics are then run on the copy, and the information is split and stored in HDFS on disks in the cloud. British Library patrons can then run the BigSheets insight engine against all of the stored data. The previously unstructured Web data can now be visualized as Web 2.0-standard data feeds, pie charts and tag clouds.
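The kind of aggregation that feeds a tag cloud can be illustrated with a small sketch. This is not BigSheets’ code; in practice such counts would run as distributed Hadoop jobs over the archive in HDFS, but the word-frequency idea is the same. The function and sample data below are hypothetical.

```python
# Illustrative sketch only: count term frequencies across archived page text
# to feed a tag cloud. In BigSheets this would be a distributed job over HDFS;
# here it is simplified to a local pass over already-extracted text.
import re
from collections import Counter
from typing import Iterable, List, Tuple


def tag_cloud_counts(pages: Iterable[str], top_n: int = 20) -> List[Tuple[str, int]]:
    """Return the most frequent words across a collection of page texts."""
    counts = Counter()
    for text in pages:
        # Strip any leftover HTML tags, then tokenize into lowercase words.
        text = re.sub(r'<[^>]+>', ' ', text)
        counts.update(re.findall(r'[a-z]{3,}', text.lower()))
    return counts.most_common(top_n)


if __name__ == '__main__':
    sample_pages = [
        '<p>The 2008 credit crunch dominated coverage.</p>',
        '<p>Credit conditions and the crunch in lending.</p>',
    ]
    print(tag_cloud_counts(sample_pages))
```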
Because entire pages, or whole series of pages and sites, will be archived and analyzed, insights can be drawn from the data. “So when you say politics have changed literature, you can go back and look at it,” Boloker said. The project will offer insight into the information, from which patterns and trends can emerge, he added.
In the U.K. Web Archive so far, special collections include data from the U.K.’s 2005 general election, the 2012 Olympics, the avian and swine flu outbreaks, and the 2008 credit crunch. Other archives include all books and some periodicals from the U.K. and Ireland. Governmental bodies, Parliament and the House of Scotland have also allowed some 5,000 websites to be collected, and the British Library is now working to gather more sites in the .uk domain.
“We’re on the bridge period between yesteryear, when we had documents and occasional periodicals,” Boloker said, “to where we are today, when we have documents that the British Library gets every day.”