Although Apache Kafka is widely adopted, teams still run into operational challenges when they run it at scale. To help keep Kafka clusters balanced, LinkedIn developed and open sourced Cruise Control, a general-purpose system that continuously monitors clusters and automatically adjusts the resources needed to meet pre-defined performance goals.
According to LinkedIn staff software engineer Jiangjie Qin in a LinkedIn engineering post, Cruise Control started off as an intern project by Efe Gencer, who is currently a research assistant at Cornell University. Several members of the Kafka development team helped to brainstorm and design Cruise Control, and the project received several other contributions from the Kafka SRE team at LinkedIn.
Cruise Control for Kafka is currently deployed at LinkedIn, where it monitors user-specified goals, makes sure there are no violations of these goals, analyzes the existing workload on the cluster, and then automatically executes administrative operations to satisfy those goals, according to Qin.
Cruise Control was designed with a few requirements in mind: it needed to be reliable, resource-efficient, and extensible, and to serve as a general framework “that could only understand the application and migrate only a partial state and be used in any stateful distributed system,” writes Qin.
Cruise Control follows a monitor-analysis-action working cycle and exposes a REST API through which users interact with it. This REST API supports “querying the workload and optimization proposals of the Kafka cluster, as well as triggering admin operations,” according to Qin.
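As a rough illustration of the three kinds of request Qin describes, the sketch below builds URLs for querying the workload, fetching optimization proposals, and triggering a rebalance as a dry run. The host, the port, and the helper names build_url and call are assumptions made for this example; the endpoint and parameter names follow the project's REST API documentation as best understood here and should be checked against the version you deploy.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed address; the real host and port depend on your deployment.
BASE_URL = "http://localhost:9090/kafkacruisecontrol"


def build_url(endpoint, **params):
    """Return the full URL for an endpoint, with optional query parameters."""
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/{endpoint}" + (f"?{query}" if query else "")


def call(endpoint, method="GET", **params):
    """Send the request and return the response body as text.

    Hypothetical helper: it needs a reachable Cruise Control instance.
    """
    request = Request(build_url(endpoint, **params), method=method)
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


# Print the URLs only, so the sketch runs without a live cluster.
print(build_url("load", json="true"))                       # query the workload
print(build_url("proposals", json="true"))                  # query proposals
print(build_url("rebalance", dryrun="true", json="true"))   # admin operation
```

The first two requests would be sent with GET. The rebalance request would be sent with POST, for example call("rebalance", method="POST", dryrun="true"), and keeping dryrun set to true asks for a proposal without moving any replicas.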
Cruise Control includes a Load Monitor, which collects standard Kafka metrics from the cluster and derives per-partition resource metrics that are not directly available. For instance, it estimates CPU utilization on a per-partition basis, writes Qin.
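To make the idea of a derived per-partition metric concrete, here is a minimal sketch that splits a broker's measured CPU utilization across its partitions in proportion to each partition's byte traffic. This proportional rule is an assumption chosen for illustration, not Cruise Control's actual model, and the function name and sample figures are hypothetical.

```python
def estimate_partition_cpu(broker_cpu_pct, partition_bytes):
    """Split a broker's CPU utilization across its partitions.

    partition_bytes maps a partition name to (bytes_in_rate, bytes_out_rate).
    Assumption: CPU cost is proportional to total bytes moved.
    """
    total = sum(b_in + b_out for b_in, b_out in partition_bytes.values())
    if total == 0:
        # No traffic at all: attribute no CPU to any partition.
        return {name: 0.0 for name in partition_bytes}
    return {
        name: broker_cpu_pct * (b_in + b_out) / total
        for name, (b_in, b_out) in partition_bytes.items()
    }


# Hypothetical sample: one broker at 60% CPU hosting three partitions.
sample = {
    "orders-0": (300.0, 100.0),
    "orders-1": (100.0, 100.0),
    "clicks-0": (150.0, 50.0),
}
for name, cpu in estimate_partition_cpu(60.0, sample).items():
    print(f"{name}: {cpu:.1f}%")
```

The per-partition estimates always sum to the broker-level figure, which lets a model built from them stay consistent with what was actually measured.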
The Analyzer is the actual “brain” of the open source project, using a heuristic method to generate optimization proposals based on the goals and the cluster workload model from the Load Monitor.
According to Qin:
“Cruise Control also allows for specifying hard goals and soft goals. A hard goal is one that must be satisfied (e.g., replica placement must be rack-aware). Soft goals, on the other hand, may be left unmet if doing so makes it possible to satisfy all the hard goals. The optimization would fail if the optimized results violate a hard goal. Usually, the hard goals will have a higher priority than the soft goals.”
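The distinction between hard and soft goals can be sketched as a priority-ordered check over a candidate replica placement. This only illustrates the behavior Qin describes and is not the Analyzer's actual code: the Goal class, the two example goals, and the placement format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical placement format: partition name -> racks hosting its replicas.
Placement = Dict[str, List[str]]


class OptimizationFailure(Exception):
    """Raised when a candidate placement violates a hard goal."""


@dataclass
class Goal:
    name: str
    is_hard: bool
    is_satisfied: Callable[[Placement], bool]


def rack_aware(placement):
    # Every replica of a partition sits on a different rack.
    return all(len(set(racks)) == len(racks) for racks in placement.values())


def evenly_spread(placement):
    # Replica counts on the racks in use differ by at most one.
    counts = {}
    for racks in placement.values():
        for rack in racks:
            counts[rack] = counts.get(rack, 0) + 1
    if not counts:
        return True
    return max(counts.values()) - min(counts.values()) <= 1


# Goals listed in priority order, hard goal first.
GOALS = [
    Goal("rack-aware", True, rack_aware),
    Goal("even-spread", False, evenly_spread),
]


def evaluate(placement, goals):
    """Check goals in priority order and return the unmet soft goals.

    A violated hard goal fails the whole optimization.
    """
    unmet_soft = []
    for goal in goals:
        if goal.is_satisfied(placement):
            continue
        if goal.is_hard:
            raise OptimizationFailure(f"hard goal violated: {goal.name}")
        unmet_soft.append(goal.name)
    return unmet_soft


skewed = {"t-0": ["r1", "r2"], "t-1": ["r1", "r3"], "t-2": ["r1", "r2"]}
print(evaluate(skewed, GOALS))
```

Here the skewed placement is still accepted, because only the soft goal is unmet. A placement that puts two replicas of one partition on the same rack would raise OptimizationFailure instead.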