Data Profiler is an open-source Python library, originally developed at Capital One, that analyzes datasets and detects whether they contain sensitive data, such as bank account numbers, credit card information, or Social Security numbers.
According to the company, when data streams grow large enough, it becomes difficult to monitor everything passing through them, which makes it possible for sensitive data to slip through unnoticed. The goal of the project is to detect when that type of information is present in a dataset.
The company illustrated how Data Profiler might be used with the example of a jeweler in the business of buying and selling diamonds. The jeweler has a large database of customer and transaction details in a structured format of rows and columns. Running Data Profiler on the dataset produces statistics on each column.
“You’ll learn the exact distribution of the price of diamonds, that cut is a categorical column of several unique values, that the carat is organized in ascending order, and most importantly, you’ll learn the classification of each column for sensitive data. Our machine-learning model will then automatically classify columns as credit card information, email, etc. This will help you discover if sensitive data exists in columns they shouldn’t exist in,” Grant Eden, who was a principal software engineer at Capital One, explained in a blog post.
Data Profiler comes with a default set of 19 labels used to recognize data categories, including ADDRESS, CREDIT_CARD, EMAIL_ADDRESS, PHONE_NUMBER, and SSN.
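To make the idea of per-column profiling concrete, here is a minimal sketch using only the Python standard library. It is not Data Profiler's API or its method: the library classifies columns with a machine-learning model, whereas this sketch substitutes simple regular expressions. The names `PATTERNS`, `profile_column`, and `profile_csv` are hypothetical, the sample rows are invented, and only the label names come from the list above.

```python
import csv
import io
import re
from statistics import mean

# Hypothetical rule table. The label names appear in the article; the
# regular expressions are illustrative assumptions, not the library's model.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "CREDIT_CARD": re.compile(r"(?:\d{4}[- ]?){3}\d{4}"),
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def _as_float(value):
    """Return value as a float, or None if it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None


def profile_column(values):
    """Return simple statistics and a guessed label for one column."""
    summary = {"unique": len(set(values)), "label": None}

    # Numeric columns get a range, a mean, and an ordering check.
    numbers = [_as_float(v) for v in values]
    if values and all(n is not None for n in numbers):
        summary.update(
            min=min(numbers),
            max=max(numbers),
            mean=mean(numbers),
            ascending=numbers == sorted(numbers),
        )

    # A column is labeled only if every value fully matches one pattern.
    for label, pattern in PATTERNS.items():
        if values and all(pattern.fullmatch(v) for v in values):
            summary["label"] = label
            break
    return summary


def profile_csv(text):
    """Profile every column of a CSV document supplied as a string."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        return {}
    return {name: profile_column([row[name] for row in rows]) for name in rows[0]}


# Invented sample data: an email address has leaked into a "notes" column.
SAMPLE = """carat,cut,price,notes
0.23,Ideal,326,jane@example.com
0.29,Premium,334,sam@example.org
0.31,Good,335,lee@example.net
"""

report = profile_csv(SAMPLE)
for column, summary in report.items():
    print(column, summary)
```

In this sketch the `notes` column is flagged as EMAIL_ADDRESS while the diamond attributes carry no label, which is the kind of "sensitive data in an unexpected column" finding the quote describes. A rule table like this only catches rigidly formatted values, which helps explain why a learned model is attractive for the real task.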
“Our library has a list of labels of which a subset is considered non-public personally identifiable pieces of information… the data labeler is able to use that deep learning model to identify where that exists in a dataset… and calls out where that exists to that user that’s doing the analysis,” Jeremy Goodsitt, a lead machine learning engineer at Capital One, told SD Times previously.
The labeler model can also be customized to meet specific use cases. The jeweler in the example, for instance, could customize the data labeler to identify specific gem types.
At the time of this writing, the project has 1,600 stars on GitHub, has been forked 146 times, and has 48 people contributing to it.