No more code than you need, no more math than you have to.
If not quite the mantra of Alpine Data Labs cofounder Steven Hillion, it’s certainly the philosophy driving the company’s efforts to foster collaboration among data scientists, DBAs and business analysts.
Alpine Data Labs sells Chorus, a collaboration tool that enables teams to share data sets, code and ideas. To take a step or two back: Alpine grew out of Greenplum and was spun off before Greenplum was purchased by EMC. EMC and VMware, which EMC already owned, later formed what is now called Pivotal (a Chorus reseller).
Hillion said he began work on Chorus while at Greenplum but brought it with him to Alpine Data Labs, so the product works against many data platforms. Whatever the platform, it solves one problem: the need for a central location to share common workflows, models and external data sets. Chorus, he said, began as a wiki-style, Facebook-style hub for working on data projects.
“The mythical data scientist has math skills, computer skills and domain skills. Those people don’t actually exist,” Hillion said. On top of that, companies need to include DBAs and business analysts.
“What I found historically is that the role of a data scientist is a lonely one. For me, it was important to have them sit alongside DBAs and business users, to provide context and use cases. Business analysts are usually data-savvy. They understand the types of work data scientists do, so this can be a rich relationship. There are all these different roles along this conveyor belt of analytics. They all need to be involved in an iterative way. Some interactions are obvious. Some scientists will work with software engineers to get their data models into production. Chorus was made to support that collaboration.”
Hillion explained:
“We’re seeing time and again that the pure scientists shouldn’t have to be software engineers to get their work done. Working with Hadoop, or distributed SQL, shouldn’t be necessary. At the same time, the business analyst is asking, ‘Is this correlated with that?’ He shouldn’t have to understand the language of mathematics any more than he needs to, and shouldn’t have to understand the details of gradient descent to take advantage of mathematical models.
“In order for the Big Data revolution to become real, it’s important for people to focus on just the things they need to focus on and not be waylaid by obscure methods and techniques that they shouldn’t have to be aware of. In some ways, this is a controversial point. People say if you don’t have an understanding of mathematics and statistics, it’s dangerous to use models. If you don’t take into account holdout samples, or if you don’t look at the right statistics to determine the efficacy of the model, then yes, it’s dangerous. But software should help you do that. There are measures of model accuracy and variable significance. But how significant is the variable? Those things are relatively easy to understand but are often couched in technical terms that make it daunting.
“So what if a business user builds a simple model, or runs some simple correlations? They might not be valid, but [they] can spawn some ideas, and then he can work with a data scientist to provide a level of rigor, but it allows other users to try things out.”
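To make the holdout and correlation ideas in Hillion's remarks concrete, here is a minimal Python sketch. It is not Chorus code and does not reflect Alpine's implementation; the function names (pearson, holdout_split, fit_line, mean_abs_error) and the synthetic data are assumptions made purely for illustration. It runs a simple correlation, fits a straight line on part of the data, and then checks the error on rows the model never saw.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def holdout_split(rows, holdout_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and set a fraction aside as the holdout sample."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    # Returns (training rows, held-out rows).
    return shuffled[n_holdout:], shuffled[:n_holdout]


def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept on (x, y) pairs."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, rows):
    """Average absolute gap between the line's prediction and the actual y."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in rows) / len(rows)


if __name__ == "__main__":
    # Synthetic, made-up data: y depends on x plus random noise.
    rng = random.Random(42)
    rows = [(float(x), 2.0 * x + 1.0 + rng.gauss(0, 3)) for x in range(50)]

    # "Is this correlated with that?"
    r = pearson([x for x, _ in rows], [y for _, y in rows])
    print(f"correlation: {r:.3f}")

    # Fit on the training rows only, then judge the model on the holdout rows.
    train, held_out = holdout_split(rows)
    model = fit_line(train)
    print(f"training error: {mean_abs_error(model, train):.3f}")
    print(f"holdout error:  {mean_abs_error(model, held_out):.3f}")
```

A holdout error far above the training error is the kind of warning sign that, in Hillion's view, software should surface for a business user before a data scientist steps in to add rigor.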