For Revolution Analytics, many of its largest customers in finance and retail began asking for the ability to use the R analytics programming language with Hadoop. Fortunately for the company, one developer had already been building exactly that piece of software for over a year: RHIPE. At Hadoop World in New York, Revolution Analytics announced that it had hired the man behind the project.
Saptarshi Guha is now a consultant at Revolution Analytics, and he said that the RHIPE project is already quite usable, though it is only at version 0.63.
“I actually started developing RHIPE in the last few years of my Ph.D.,” he said. “A lot of my colleagues had data sets that were getting increasingly bigger. They were hard-pressed to compute with this data across the cluster. I saw Hadoop and I saw R, which all data managers need to use, and the two were disjointed.”
Thus, he began working on RHIPE. Guha said his first goal was to give R users access to the same capabilities that Java developers have in Hadoop. That goal, he said, has now largely been met, and he has made the project available for free.
When asked why he felt that R was a good match with Hadoop, Guha said: “R has about 2,700 packages that bring every possible statistical algorithm and tool to the user. To apply algorithms to different subsets of data, there’s no other language choice but R. The idea is to push R to the computing back end, rather than send the data to R.”
To that end, RHIPE takes the form of a single R package that must be installed on each Hadoop node; a minimal sketch of that setup follows below. Once it's up and running, Guha said, it provides two types of workflows for developers.
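Installation itself is ordinary R source-package mechanics. The snippet below is a sketch under two labeled assumptions: the tarball name is hypothetical, and the HADOOP environment variable (which RHIPE builds of this era used to locate the Hadoop installation) may be named differently across versions.

    # Sketch: installing the RHIPE source package on one Hadoop node.
    # The tarball name is hypothetical; RHIPE builds of this era looked
    # for the Hadoop installation via an environment variable (assumed
    # here to be HADOOP).
    Sys.setenv(HADOOP = "/usr/lib/hadoop")
    install.packages("Rhipe_0.63.tar.gz", repos = NULL, type = "source")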
The first allows developers to interact with the Hadoop data using R and a command-line-like interface, manipulating the data in near real time; Guha said this enables R developers to crank out data visualizations in minutes. The second workflow is similar to the traditional Hadoop workflow: Developers write jobs, upload them to the cluster, then return when the jobs are done.
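Here is what those two workflows might look like in practice. This is a sketch, not a verbatim RHIPE script: the function names (rhinit, rhread, rhmr, rhex, rhcollect) follow the project's documentation from this period, but exact arguments vary across versions, and the HDFS paths and the record layout (values as lists with region and amount fields) are invented for illustration.

    library(Rhipe)
    rhinit()    # connect the R session to the Hadoop cluster

    # Workflow 1: interactive exploration from the R prompt.
    kv <- rhread("/tmp/sales/part-r-00000")    # hypothetical HDFS path
    head(kv, 3)                                # each element is a (key, value) pair
    amounts <- sapply(kv, function(p) p[[2]]$amount)  # hypothetical field
    summary(amounts)

    # Workflow 2: batch MapReduce. The map and reduce steps are written
    # as R expressions that RHIPE ships to every node, so the data stays
    # on the cluster rather than being pulled into one R session.
    map <- expression({
      # map.values holds the current block of input values; emit
      # (region, amount) pairs for the reducer (hypothetical fields).
      for (v in map.values) rhcollect(v$region, v$amount)
    })
    reduce <- expression(
      pre    = { total <- 0 },
      reduce = { total <- total + sum(unlist(reduce.values)) },
      post   = { rhcollect(reduce.key, total) }
    )
    job <- rhmr(map = map, reduce = reduce,
                ifolder = "/tmp/sales", ofolder = "/tmp/sales-by-region")
    rhex(job)    # submit the job; return when it is done
    result <- rhread("/tmp/sales-by-region")

The second workflow is the point Guha makes above about pushing R to the computing back end: the map and reduce expressions travel to the data, so the full data set never has to be moved into a single R process.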
Mike Minelli, VP of sales at Revolution Analytics, said that Hadoop will now be a focus for the company. “[Guha] is now working with us closely. We hired him because Revolution Analytics will be working to basically marry up Hadoop and ScaleR, which is a component that makes R scream as far as speed and the ability to handle large sets of data.
“The next generation is to marry those two components, so those computations happen fast in R in Hadoop.”