HPCC Systems (High Performance Computing Cluster), a dba of LexisNexis Risk Solutions, is an open-source big-data computing platform. Flavio Villanustre, vice president of technology and CISO at LexisNexis Risk Solutions, explained that HPCC Systems evolved out of necessity.
“In 2000 we were getting into data analytics, using the platforms, databases, and data integration tools that were available at the time. None of these tools would scale to handle the quantity of data and complexity of processes that we were doing.” He added, “That drove us to create our own platform, now known as HPCC Systems, a completely free, end-to-end big data platform.”
According to Villanustre, Accurint was the first product to use the platform. Accurint began as a data lookup service that took large amounts of data from numerous data sets and provided basic search capabilities to other companies and organizations. Today, Accurint has evolved to include capabilities that help detect fraud and verify identities.
The open-source HPCC Systems platform is a programmable Big Data store made up of two major cluster processing environments: Thor and ROXIE. Each can be used independently, but the real power of the system comes when they are used together, allowing data analysts and data scientists to cover the entire data lifecycle, from acquisition to delivery, in one homogeneous and consistent platform.
Thor, a homage to the ancient Norse god of thunder, is the data refinery on the left side of the data pipeline, able to ingest massive amounts of data. Its job is the general processing of large volumes of any type of raw data, to perform ETL (extract, transform, load), data cleansing, normalization and hygiene, and to perform the data integration process, either rules-based or probabilistic. Challenges that this part of the data pipeline can pose include timely processing of huge data volumes, non-stop operation, expressiveness for extremely complex transformations, and managing the complexities of parallel processing tasks.
Data delivery poses its own challenges, including security, fast response times and scalability to huge numbers of clients. ROXIE, the other parallel data processing component in the HPCC Systems platform, provides rapid data delivery for online applications through web services, using a distributed indexed file system. According to Villanustre, it functions similarly to Hadoop with HBase and Hive added, but is significantly faster.
Villanustre said, “Thor is the data refinery engine of HPCC Systems that performs massive batch-oriented data processing. It can easily profile, clean, enhance, transform and analyze mixed-schema data. All of the data and models are then taken into the data delivery system called ROXIE, which can provide high performance, highly concurrent, highly reliable data querying and delivery strategies in massive data stores.”
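The division of labor Villanustre describes, a batch refinery feeding an indexed delivery layer, can be illustrated with a minimal Python sketch. It is not how Thor or ROXIE are implemented. The function names (`refine`, `build_index`, `query`) and the sample records are hypothetical, and an in-memory dictionary stands in for a distributed indexed file system.

```python
from collections import defaultdict


def refine(raw_rows):
    """Batch step: drop incomplete rows, normalize fields, remove duplicates."""
    seen = set()
    cleaned = []
    for row in raw_rows:
        name = row.get("name", "").strip().title()
        city = row.get("city", "").strip().title()
        if not name or not city:
            continue  # hygiene: skip rows missing a required field
        key = (name, city)
        if key in seen:
            continue  # de-duplicate after normalization
        seen.add(key)
        cleaned.append({"name": name, "city": city})
    return cleaned


def build_index(rows, field):
    """Build a lookup index once, so that each later query is a key lookup."""
    index = defaultdict(list)
    for row in rows:
        index[row[field]].append(row)
    return dict(index)


def query(index, value):
    """Delivery step: answer a request from the prebuilt index."""
    return index.get(value.strip().title(), [])


# Hypothetical raw input with inconsistent casing, padding and a bad row.
RAW_ROWS = [
    {"name": " jane doe", "city": "atlanta"},
    {"name": "Jane Doe", "city": "Atlanta "},
    {"name": "john roe", "city": ""},
    {"name": "ann poe", "city": "atlanta"},
]

CITY_INDEX = build_index(refine(RAW_ROWS), "city")
```

The expensive cleaning and indexing work happens once, up front; each call to `query` afterwards is a single dictionary lookup, which is the property that makes a delivery layer suitable for many concurrent requests.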
Both platform components use the HPCC Systems Enterprise Control Language (ECL). “HPCC Systems leverages several strengths that are inherent to ECL,” Villanustre said. “ECL is a declarative dataflow programming language that disallows side effects in functions, ensuring referential transparency. This fact, combined with a number of other capabilities, allows your ECL work to compile into a highly efficient machine code version of that program and ensures that the parallel process across the platform will run as fast as possible. In lower-level programming languages like Java for Hadoop, or programming languages that are more imperative in nature, it can be quite challenging to implement a program efficiently in a distributed platform where processing close to where the data resides (data locality) is key to the overall performance of the system.”
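The link between referential transparency and parallelism can be shown with a short sketch. It is written in Python rather than ECL and only illustrates the general idea, not the ECL compiler or the HPCC Systems runtime. The helper names and sample records are hypothetical, and threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize_name(record):
    """Pure function: the result depends only on the input record."""
    first, last = record
    return (first.strip().title(), last.strip().title())


def partition(rows, n_parts):
    """Split rows into contiguous chunks (stand-ins for data held on nodes)."""
    size = max(1, -(-len(rows) // n_parts))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def transform_partition(rows):
    """Apply the pure transformation to every row of one partition."""
    return [normalize_name(r) for r in rows]


def run_sequential(rows):
    """Reference result: process all rows in one place."""
    return transform_partition(rows)


def run_partitioned(rows, n_parts=3):
    """Process each partition independently, then concatenate in order."""
    parts = partition(rows, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = pool.map(transform_partition, parts)  # map preserves order
    return [row for part in results for row in part]


# Hypothetical sample records.
RAW = [
    (" ada ", "LOVELACE"),
    ("alan", " turing"),
    ("GRACE ", "hopper "),
    ("edsger", "dijkstra"),
]
```

Because `normalize_name` has no side effects, `run_partitioned(RAW)` returns the same list as `run_sequential(RAW)` regardless of how the rows are split or which worker handles each chunk. That freedom to move the computation to wherever the data sits is what a scheduler can exploit for data locality.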
There are several benefits for developers using HPCC Systems. Villanustre pointed out a key difference: “The HPCC Systems platform gives you a single homogeneous data pipeline. This significantly reduces the effort necessary to install and manage the platform. First and foremost, this eliminates that dependency nightmare that people managing other open source big data platforms usually suffer when patching and upgrading their systems.”
He also said that ECL helps data analysts, data programmers and data scientists focus on the solution of the problem at hand rather than dealing with the underlying platform details.
HPCC Systems Community Edition, as a platform, is completely open source and licensed under the Apache 2.0 license. As such, it’s free to download, use and change, and it gives users the power to “look under the hood” to understand how things are done. Villanustre said this has played a major role in HPCC Systems’ growth: “The goal is not to sell the platform, but to create a community that can help bring new innovation to it. Our open-source users regularly contribute capabilities, documentation, code, new environments, integration, and projects that we are able to leverage in our systems operations.”
Learn more at www.hpccsystems.com