SparkR

Databricks has announced the general availability of Apache Spark 1.4, including SparkR, a new R API for data scientists.

Version 1.4 of the open-source Big Data processing and streaming engine also brings enhancements to Spark’s DataFrame API, Python 3 support, the graduation of Spark’s machine learning pipeline component out of alpha, and new visualization and monitoring capabilities for Spark Streaming and Core.

(Related: Spark ignites excitement at Big Data TechCon)

Below is a more detailed breakdown of the new features and improvements in Spark 1.4:

  • SparkR: The first new language API in Spark since PySpark in 2012, SparkR is based on the engine’s parallel DataFrame abstraction, and it allows developers to create SparkR DataFrames from local R data frames or from other sources, including Apache Hive, Parquet, HDFS and JSON. SparkR’s API features include aggregation, filtering, grouping, summary statistics, and analytical functions for data science tasks, as well as the ability to mix in SQL queries.
  • DataFrame improvements: Spark SQL gains new window functions, and DataFrames pick up better serializer memory use, support for statistical and mathematical functions, and support for Project Tungsten, the execution-engine overhaul announced back in April whose performance improvements arrive in Spark 1.5.
  • Machine learning pipeline: The Spark ML pipeline, first introduced in Spark 1.2, now includes stable APIs for building production-ready machine-learning workflows, covering stages such as data pre-processing, feature extraction and transformation, model fitting, and validation.
  • DataVis and monitoring: New visual debugging and monitoring utilities are designed to help developers better understand Spark application runtime behavior. Additional data visualization tools include an application timeline viewer, a computation graph visualizer, visual monitoring over data streams to track latency and throughput, and a monitoring UI for the Spark SQL JDBC server.

More details about Apache Spark 1.4 are available here.