Thankfully, Hortonworks and others are acting quickly to fill this gap in the stack. XA Secure, which Hortonworks acquired in May 2014, provided the code base subsequently contributed to the Apache Software Foundation and now incubating as Apache Argus. “It’s a UI to create a single holistic security policy across all engines,” explained Walker. “It needs to reach down into YARN and HDFS. So we’re trying to get ahead of that—we made the first security acquisition in this space.”
For its part, Cloudera acquired Gazzang in June 2014 for enterprise-grade data encryption and key management. Last year, the company delivered Apache Sentry, which is also incubating. Another security effort is Project Rhino, an open source project founded by Intel in 2013 to provide perimeter security, entitlements and access control, and data protection.
“We now have a 30-person security center of excellence team as part of our partnership with Intel. If we don’t solve the security problems, we don’t get the benefits of the data. One thing is new here, with the data hub methodology: You can’t onboard new users if you don’t solve the security problem,” said Collins.
Not surprisingly, relational database market leader Oracle has incorporated Sentry, LDAP-based authorization and support for Kerberos authentication in its Big Data Appliance. Oracle also alerts admins to suspicious activities happening in Hadoop via its Audit Vault and Database Firewall.
Ironically, some early adopters in retail and finance haven’t feared the lack of data-level security protocols—they’ve simply built it themselves. Square, the mobile credit card payment processing company, stores some customer data in Hadoop, but sidesteps the security and privacy quandaries by redacting that data. Other data isn’t so harmless, however, and must be protected. After weighing multiple options, including a separate, secure Hadoop cluster, Square crafted an encryption model for data stored in protobuf format using an online crypto service: keys are kept separate from the data, access is controlled at the cell level, and encryption covers data in flight as well as at rest. Though the end product was successful, “The lack of data-level security is keeping relevant projects off of Hadoop,” blogged Ben Spivey, a solutions architect at Cloudera, and Joey Echeverria, a Cloudera software engineer.
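The key-separation pattern described above can be sketched roughly as follows. Everything here is illustrative, not Square’s actual code: the `KeyService` class stands in for an online key-management service, records store only a key ID, a nonce, and ciphertext, and the HMAC-SHA256 counter-mode stream cipher is a self-contained stand-in for what a production system would do with a vetted authenticated cipher such as AES-GCM from a maintained crypto library.

```python
import hashlib
import hmac
import os
import secrets


class KeyService:
    """Stand-in for an online crypto/key service, kept apart from the data store."""

    def __init__(self):
        self._keys = {}  # key_id -> 32-byte data key

    def create_key(self):
        key_id = secrets.token_hex(8)
        self._keys[key_id] = secrets.token_bytes(32)
        return key_id

    def get_key(self, key_id):
        # Cell-level access control and audit logging would hook in here.
        return self._keys[key_id]


def _keystream(key, nonce, length):
    """Derive a keystream from the data key via HMAC-SHA256 in counter mode."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"),
                        hashlib.sha256).digest()
        counter += 1
    return out[:length]


def encrypt_record(ks, key_id, plaintext):
    nonce = os.urandom(16)
    stream = _keystream(ks.get_key(key_id), nonce, len(plaintext))
    ciphertext = bytes(a ^ b for a, b in zip(plaintext, stream))
    # Only the key ID travels with the data; the key itself never does.
    return {"key_id": key_id, "nonce": nonce, "ciphertext": ciphertext}


def decrypt_record(ks, record):
    stream = _keystream(ks.get_key(record["key_id"]),
                        record["nonce"], len(record["ciphertext"]))
    return bytes(a ^ b for a, b in zip(record["ciphertext"], stream))


ks = KeyService()
kid = ks.create_key()
rec = encrypt_record(ks, kid, b"card_last4: 4242")
assert decrypt_record(ks, rec) == b"card_last4: 4242"
```

The point of the design is that a compromise of the Hadoop cluster alone yields only ciphertext; an attacker would also need to breach the separately administered key service.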
Flint, meet steel
Heating up the Hadoop community are real-time data processing options such as the nascent Spark, which emerged from the famed UC Berkeley AMP Lab, home of the Berkeley Data Analytics Stack. Akka, Finagle and Storm, which run on the JVM and work with Java and Scala, are bringing the power of concurrent programming to the masses. Google is continually introducing new tools, such as Dataflow, intended to supplant the obtuse MapReduce model. Interestingly, the entire Hadoop ecosystem points to mainstream success of parallel programming in a world of exponentially increasing data and shrinking silicon real estate. With Hadoop as the new data OS, the abstraction layer continually rises. The wisdom of clusters wins.
Still, moderation is key. “Don’t let the digital obscure the analog,” warned Peter Coffee in his keynote at the 2014 Dreamforce conference in San Francisco. “All modes of information, those are part of the new business too: The expression on someone’s face, the feel of someone’s handshake.” Patterns in Big Data only emerge because there are these networks of interaction, and those networks can only arise because people voluntarily connect, he argued.
“If you parachuted a bunch of consultants with clipboards into the Philippines and said ‘We’re going to build you a disaster preparedness system for $5 million that no one trusts and no one uses’—that sounds like the old IT, doesn’t it? The new IT is to take the behavior people enthusiastically and voluntarily engage in and use it to find the emergent opportunities.”
Your big data project: What could go wrong?