The team at Google has recently announced that PerfKit Benchmarker (PKB), the open-source benchmarking tool used to measure and compare cloud offerings, now supports testing Dataflow jobs.
According to Google, Dataflow is a managed service for executing a wide variety of data processing patterns.
Released in 2015, PKB provisions and cleans up resources in the cloud, selects and executes benchmark tests, and collects and publishes results for actionable reporting.
Performance benchmarking can help ensure that a pipeline is sized and configured correctly, so that it can meet expected data volumes without hitting capacity limits or breaking cost budgets.
To get started with PKB, see the public PKB docs. Users who prefer walkthrough tutorials can follow the beginner lab, which covers PKB setup, PKB command-line options, and how to visualize test results in Data Studio.
The repo includes example PKB config files, such as dataflow_template.yaml, which can be used to re-run the sequence of tests.
Before running them, users will need to replace all <MY_PROJECT> and <MY_BUCKET> instances with their own GCP project and bucket. They will also need to create an input Pub/Sub subscription with their own test data preprovisioned, and an output BigQuery table with the correct schema to receive the test data.
According to the company, the PKB benchmark handles saving and restoring a snapshot of that Pub/Sub subscription for every test run iteration.
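The placeholder substitution described above can be scripted rather than done by hand. The following is a minimal sketch, not part of PKB itself: the function name `fill_placeholders`, the sample config lines, and the `demo-project` and `demo-bucket` values are all hypothetical, and the sample text does not reproduce the real contents of dataflow_template.yaml. Only the <MY_PROJECT> and <MY_BUCKET> markers come from the announcement.

```python
import re

# Matches any remaining marker of the form <MY_SOMETHING>.
# Assumption: other placeholders, if any, follow the same naming pattern.
LEFTOVER_PATTERN = re.compile(r"<MY_[A-Z_]+>")


def fill_placeholders(text: str, project: str, bucket: str) -> str:
    """Return config text with <MY_PROJECT> and <MY_BUCKET> replaced.

    Raises ValueError if any other <MY_...> marker is still present,
    so that an unfilled placeholder is caught before a benchmark run.
    """
    filled = text.replace("<MY_PROJECT>", project)
    filled = filled.replace("<MY_BUCKET>", bucket)

    leftovers = sorted(set(LEFTOVER_PATTERN.findall(filled)))
    if leftovers:
        raise ValueError(f"Unfilled placeholders remain: {leftovers}")
    return filled


if __name__ == "__main__":
    # Hypothetical config fragment, for illustration only.
    sample = (
        "project: <MY_PROJECT>\n"
        "staging_location: gs://<MY_BUCKET>/staging\n"
    )
    print(fill_placeholders(sample, "demo-project", "demo-bucket"))
```

In practice the text would be read from and written back to a copy of the example config file, which keeps the original template untouched for later runs.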
To learn more, read Google’s blog.