Delivering the future of analytics, Pentaho Corporation today announced the native integration of Pentaho Data Integration (PDI) with Apache Spark, enabling orchestration of Spark jobs. A development effort initiated by Pentaho Labs, this integration will enable customers to increase productivity, reduce maintenance costs, and dramatically reduce the skills required as Spark is incorporated into big data projects.

Spark is a powerful open source processing engine built around speed, ease of use, and machine learning. Engineered from the bottom up for performance, Spark is a next-generation big data technology to store, blend, and govern data at entirely new levels of speed, scale, and simplicity. Building on complementary open source foundations allowed Pentaho to innovate early with this emerging big data technology.

“For two years, we experimented with possible use cases based on our big data blueprints and sized the enterprise market opportunity for Spark. Our customers now benefit from that work with simplified, real-time analytic capabilities,” said James Dixon, Chief Technology Officer at Pentaho. “Our open source heritage and modern, extensible platform allow us to quickly evolve our capabilities, keeping our customers’ big data technology options open, reducing risk, and saving considerable development time while taking advantage of the latest innovations in popular big data stores.”

As big data technologies evolve at breakneck speed, the Pentaho Labs team continues to leverage and drive innovation in big data integration and analytics, allowing customers to advance their big data deployments without risk. Today’s integration with Spark follows other Labs efforts that have led to support for YARN and the Adaptive Big Data Layer. Following the native support of YARN alone, enterprise customers like RichRelevance, edo Interactive, and MultiPlan have been able to innovate and drive greater value from Hadoop.

“Apache Spark couples high-performance, in-memory data processing and multiple computation models that make it well-suited to power next-generation data processing platforms,” said Matt Aslett, Research Director, Data Platforms and Analytics, 451 Research. “The integration with Spark illustrates how Pentaho’s open source approach enables it to respond as emerging technologies rise to prominence in the ever-evolving big data market, and to integrate them with its data management and analytics platform.”


Pentaho Data Integration for Apache Spark is currently available in Pentaho Labs and is scheduled for general availability (GA) in June 2015. To learn more about the innovation in Pentaho Labs, visit:

Attend the webinar, Emerging Big Data Technologies: Pentaho Labs Presents Apache Spark, on Tuesday, June 2, 2015, at 10 a.m. PT. Register at