Tomorrow, the final chapter of the Da Vinci Code saga, “Inferno,” hits theaters. While Tom Hanks scours the ancient past for mysteries and riddles, there’s a deeper evil brewing in the form of a deadly virus waiting to infect the world and bring the series to its denouement.
In the real world, however, research into viruses and how they spread is also about decrypting puzzling messages, and much of that work is now aided by software developers. From scientists writing their own tools to open-source projects shared within the community, computers are now inseparable from biology, even if humans aren’t quite cyborgs yet.
Eclipse, for example, has a sub-project called uDig, the User-friendly Desktop Internet GIS, which viral researchers have repurposed to chart the propagation of avian flu. Python, likewise, has a tremendously large scientific user base, as do MATLAB and R.
Dr. Alexei Aravin, professor of biology at Caltech, said that biology and software development have a lot in common. His work is primarily focused on figuring out what “normal” means when it comes to how organisms interpret and implement the instructions encoded in their DNA.
Aravin said that going from DNA to results and back again is a difficult proposition, one that science likely will not solve for many years to come. “You can compare a living organism easily with a computer. With DNA, it’s code for some instructions,” he said. “It’s actually a relatively straightforward comparison. There’s code and how this code works. We as a whole community of scientists are trying to understand what instructions you can find in DNA, and how exactly they are fulfilled on a cellular level.”
That work may be daunting, but undertaking it 10 to 15 years ago would have been impossible, said Aravin. Sequencing the DNA is just part of the larger picture for that instruction set, he said.
“Research is just one level of information. You have to store information,” said Aravin. “There are many other biological letters that compose the chemical building blocks, and there are other huge amounts of information to store. Without computers, we wouldn’t be able to process anything at that scale. Software now plays a very big part. Just to store it without processing 15 to 20 years ago wouldn’t be possible.”
It’s not just about the data processing and storage, however. Aravin and his team rely heavily on the UC Santa Cruz Genome Browser. This open-source tool and service allows users to browse through the full genome of various animals, from humans and dogs to fruit flies and guinea pigs. Not only is this tool available online, it can be downloaded and used offline. As a visual analytics tool, it’s a great example of making a very large dataset easily accessible in a visual and highly customizable fashion.
Aravin’s own team even builds custom software for its work, blurring the lines between scientist and developer. “We do not really develop software for someone else, it’s mostly just so we can use it,” he said. “It’s not a big package we’re working on. We mostly just need scripts to do certain things, like transfer files from one format to another and so on.”
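To give a sense of what a format-conversion script of this kind can look like, here is a minimal Python sketch that turns FASTQ sequencing reads into FASTA by dropping the quality scores. It is an illustration only, not code from Aravin’s lab: the function names `parse_fastq` and `fastq_to_fasta`, the line width, and the sample reads are all assumptions made for the example, and it assumes simple four-line FASTQ records.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) pairs, assuming four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        try:
            sequence = next(it).strip()
            separator = next(it).strip()
            next(it)  # quality line, discarded in FASTA output
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator, got: {separator!r}")
        yield header[1:], sequence


def fastq_to_fasta(lines: Iterable[str], width: int = 60) -> str:
    """Return FASTA text, wrapping each sequence at `width` characters."""
    out: List[str] = []
    for read_id, sequence in parse_fastq(lines):
        out.append(f">{read_id}")
        out.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "".join(line + "\n" for line in out)


# Hypothetical sample input: two made-up reads, not real data.
example = [
    "@read1\n", "ACGTAC\n", "+\n", "IIIIII\n",
    "@read2\n", "GGCC\n", "+\n", "IIII\n",
]
fasta_text = fastq_to_fasta(example, width=4)
print(fasta_text, end="")
```

In practice the same functions could be fed an open file handle instead of a list, since both simply iterate over lines.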
As for the potential of a virus wreaking havoc on the world, like in “Inferno,” Aravin said that scientists like him are still trying to tease out the differences between a normal gene and one that a virus has altered in a way that causes problems for its host organism.
It’s a problem not unlike that of the modern software developer: establishing a baseline of analytics so you can actually tell when something has gone wrong. It’s just a shame biologists don’t have access to disassemblers and debuggers like IDA Pro: They’re going to be reverse-engineering the genetic code for a long time, said Aravin.
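On the software side, the baseline idea can be sketched in a few lines of Python: record measurements while things are known to be healthy, then flag any new measurement that falls too far from that norm. This is a generic illustration rather than anything Aravin described; the function names, the three-standard-deviation threshold, and the latency figures are all assumptions made up for the example.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def build_baseline(samples: Sequence[float]) -> Tuple[float, float]:
    """Summarize known-good measurements as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value: float, baseline: Tuple[float, float],
                 threshold: float = 3.0) -> bool:
    """Flag a value more than `threshold` standard deviations from the mean."""
    center, spread = baseline
    if spread == 0:
        return value != center  # no variation seen: any change is unusual
    return abs(value - center) / spread > threshold


# Hypothetical response times (ms) collected while the system was healthy.
normal_latencies_ms = [101.0, 99.0, 100.0, 102.0, 98.0]
baseline = build_baseline(normal_latencies_ms)

print(is_anomalous(101.0, baseline))  # within the normal range
print(is_anomalous(150.0, baseline))  # far outside the normal range
```

The hard part, for developers and biologists alike, is collecting enough trustworthy “normal” data for the baseline to mean anything.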