Until recently, data science was a mostly academic pursuit and the subject of papers rather than practice. Over time, data science became an applied science with data scientists being paired with data engineers to develop production systems. We are now entering a new phase where much of the work being performed by data scientists (hyperparameter … continue reading
Earlier this week, Microsoft announced the open-source release of its streaming data analysis and query tool Trill, a single-node engine library that can be incorporated into .NET applications to process complex queries in real-time or offline data sets. In a blog post about the release, principal software engineer James Terwilliger explained that Trill has been … continue reading
Open-source software foundation Apache released version 1.0.0 of its Kafka distributed data streaming platform today, with the first full version number indicating Apache’s confidence that Kafka is ready for major professional use. “Apache Kafka is playing a bigger role as companies are moving to real-time streaming and embracing stream processing,” Jun Rao, vice president of … continue reading
Hazelcast today released the first version of Hazelcast Jet, a distributed processing engine for Big Data streams. Jet integrates with the Hazelcast in-memory data grid to process data in parallel across nodes in near real time. Hazelcast Jet uses directed acyclic graphs to model relationships between tasks in the data-processing pipeline. The system is built … continue reading
The Apache Foundation this morning announced the promotion of Apache Beam to the top level. That’s good news for the many contributors and users of this unified programming model, which allows them to write batch and streaming jobs at the same time, and to run the resulting artifact on various execution engines. Within the Hadoop … continue reading
Databricks has announced the general availability of Apache Spark 2.0 on its platform, building on what the Databricks community has learned and focusing on three themes: ease of use, more speed, and more intelligence, according to the company. From the preview release of Apache Spark 2.0, Databricks said that 10% of its clusters were already … continue reading
Confluent this week introduced its first commercial product, Confluent Control Center, as part of the newly released Confluent Platform 3.0 and Apache Kafka 0.10.0. The combined package is aimed at operationalizing Kafka-based streaming applications and near real-time data processing efforts. Neha Narkhede, cofounder and CTO of Confluent (and one of the creators of Kafka), said, … continue reading
When you’re watching the Oakland Athletics beat up on the Anaheim Angels on the MLB.tv streaming service, you’re probably not thinking about the software involved in that equation. But it’s the software that has taken 15 years of continual evolution to get to the point where, nowadays, baseball’s premium subscription service isn’t just a nice … continue reading
While many in the Big Data space are talking about stream processing, MapR today announced the availability of Streams, a new product in its Hadoop stack that can be used to stream events across clusters distributed around the world. The new product offers a publish-and-subscribe model for event-driven data access and decision-making. While MapR Streams … continue reading
Google this morning removed the beta label from Google Cloud Dataflow. The company also introduced Google Cloud Pub/Sub, as well as integrations between these services and the Google BigQuery service. Together, developers can use these services to provide analytics, stream processing and batch processing applications within the same infrastructure. Cloud Dataflow is built around the … continue reading
Machine-learning algorithms are everywhere. They’ve permeated the way platforms work in the cloud, in analyzing Big Data, and in making AI more intelligent. Now, a software-defined application delivery startup is using machine learning to streamline code performance. SmartSequence, the new HTML and JavaScript streaming technology from Instart Logic, uses real-time machine-learning techniques to sift through … continue reading
The Apache Software Foundation has announced Apache Flink as a Top-Level Project (TLP). Flink is an open-source Big Data system that fuses processing and analysis of both batch and streaming data. The data-processing engine, which offers APIs in Java and Scala as well as specialized APIs for graph processing, is presented as an alternative to … continue reading