AWS announced the availability of AWS Glue Data Quality, a feature designed to help maintain high-quality data across data lakes and pipelines.
Many organizations build data lakes, but without data quality controls these can degrade into data swamps, according to AWS. Establishing data quality is a complex and time-consuming process: it requires manually analyzing data, writing data quality rules, and coding alerts for quality degradation. According to AWS, Glue Data Quality cuts the effort for these manual tasks from days to hours.
The service automatically computes statistics, recommends quality rules, monitors data, and sends alerts when quality deteriorates, making it easier to spot missing, stale, or incorrect data before it harms the business.
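For teams that want to drive this workflow programmatically, the Glue API exposes rule recommendation directly. The sketch below, assuming the boto3 Glue client's data quality calls (start_data_quality_rule_recommendation_run and get_data_quality_rule_recommendation_run), starts a recommendation run against a catalog table and polls for the suggested ruleset; the database, table, role ARN, and ruleset names are placeholders, and parameters should be checked against the current boto3 documentation.

```python
import time
import boto3

glue = boto3.client("glue")

# Kick off a rule-recommendation run against a Data Catalog table.
# Glue profiles the data, computes statistics, and proposes DQDL rules.
run = glue.start_data_quality_rule_recommendation_run(
    DataSource={"GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}},
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",  # placeholder role
    CreatedRulesetName="orders-recommended-rules",
)

# Poll until the run finishes, then print the recommended DQDL ruleset.
while True:
    status = glue.get_data_quality_rule_recommendation_run(RunId=run["RunId"])
    if status["Status"] in ("SUCCEEDED", "FAILED", "STOPPED"):
        break
    time.sleep(30)

print(status.get("RecommendedRuleset", "no ruleset produced"))
```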
AWS Glue Data Quality is a serverless feature of AWS Glue, eliminating the need for infrastructure management and maintenance. It automates computing data statistics and recommending data quality rules that check for freshness, accuracy, and integrity.
This reduces the manual work of analyzing data and identifying rules from days to hours. The service also provides a set of predefined data quality rules; the current list of supported rule types is documented in the Data Quality Definition Language (DQDL) reference.
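To illustrate what DQDL rules look like, the sketch below combines a few of the predefined rule types into a ruleset and registers it against a catalog table through the boto3 Glue client; the table and column names are invented, and the exact set of supported rule types should be confirmed in the DQDL reference.

```python
import boto3

# A DQDL ruleset combining several predefined rule types
# (column names and thresholds are illustrative only).
ruleset = """
Rules = [
    IsComplete "order_id",
    Uniqueness "order_id" > 0.99,
    ColumnValues "status" in ["PENDING", "SHIPPED", "DELIVERED"],
    Completeness "customer_email" > 0.95,
    RowCount > 1000
]
"""

glue = boto3.client("glue")

# Store the ruleset in the Data Catalog so it can be evaluated on a schedule
# or referenced from ETL jobs.
glue.create_data_quality_ruleset(
    Name="orders-quality-rules",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)
```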
AWS Glue Data Quality can be accessed through several interfaces, including the AWS Glue Data Catalog, Glue Studio, and Glue Studio notebooks. This flexibility lets data stewards define rules in the Data Catalog, while developers build data integration pipelines through notebook-based interfaces. Data engineers can also submit jobs from their preferred code editor via interactive sessions.
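Inside a Glue Studio notebook or ETL job, the same DQDL rules can be evaluated in-line against a DynamicFrame. The sketch below assumes the EvaluateDataQuality transform from the awsgluedq package that Glue Studio generates; the publishing options, database, table, and column names are indicative only and should be checked against the generated job script or the Glue documentation.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality  # Glue Data Quality transform

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the source table from the Data Catalog as a DynamicFrame.
orders = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Evaluate a small DQDL ruleset against the frame; results can be published
# to CloudWatch so alarms fire when quality drops.
ruleset = """
Rules = [
    IsComplete "order_id",
    ColumnValues "order_total" > 0
]
"""

dq_results = EvaluateDataQuality.apply(
    frame=orders,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "orders_dq_check",  # label for the results
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
    },
)

# Each row of the result frame describes one rule outcome (Pass/Fail).
dq_results.toDF().show(truncate=False)
```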