Data Quality Part 3: Three Characteristics of Effective Data Quality

Ryan Gross


In part one of this series, we presented the four main causes of bad data quality: issues keeping up with data scale, errors propagating in dynamic ways, abnormalities that aren't necessarily bad data, and differing uses of data across departments.

In part two we identified that the usual suspect (organizational scale and complexity) is actually a cover-up for the real root cause: bad data governance.

The after-the-fact, rules-based data quality approaches taken in the past are too cumbersome to keep up with the pace of change in a modern enterprise. To develop a scalable data quality approach, our solutions must answer three questions.

How do modern solutions bring data quality into the development of all data products? Additionally, how can we effectively monitor for unforeseen (and therefore untested) scenarios? Finally, what should we do to maintain data consumers’ trust when, inevitably, undetected data issues do occur?

In this article, we are going to switch from problems to solutions and discuss how to answer those questions, painting a picture of effective modern data quality.

1. Integrated Into the Development of Data Products

The answer to our first question – how modern solutions bring data quality into data product development – lies in the adoption of total quality management. Just like in manufacturing, organizations with consistently high data quality practice total quality management (TQM) by integrating quality checks into every step of their data product development lifecycle.

Adopting TQM for Data Quality

A first step to adopting TQM is to recognize that the dashboards, machine learning (ML) models, and self-service reporting your business relies on are products (hence the term data product) that customers rely on to do their jobs effectively.

No organization would build and ship a product to customers without properly prototyping, testing in a lab environment, and adding quality control checks throughout the manufacturing process to check for defects. These same approaches apply to the creation of data products. The emerging field of DataOps defines a set of approaches to integrate TQM into data pipeline development.

Identifying Potential Data Quality Issues at the Beginning

First, the data product planning team works backward from the data product to identify the potential data quality issues that would be most impactful to the product they are delivering. This prioritization effort focuses on the issues that have a real impact rather than trying to prevent every potential quality issue, a goal that becomes increasingly difficult as the organization scales.

It also frames data quality in the eyes of the data consumer, which requires the discussion of enterprise versus departmental views of data and helps the team identify the metadata necessary to clarify data meaning within the experience of the data product (for instance, by showing an info icon with a tool tip displaying the detailed descriptions).

Defining Necessary Data Tests to Ensure Feasibility

Next, the design team defines the data tests that must pass at various stages of the data pipeline that transforms source data into the data product. For many advanced analytics products (e.g., predictive models), the feasibility of the predictions is not known at design time, so a prototype of the critical predictive model must be built from source data in an experimentation environment (for example, using Jupyter Notebooks). This prototype should incorporate the tests necessary to ensure the experiment results are not based on bad data, but these tests do not need to be fully automated at this point.

Once feasibility has been established, the data scientists, data engineers, and data analysts who build the data product incorporate automated tests into the development of the full data product. A data quality specialist helps the team develop test data sets that demonstrate potential edge conditions that could occur in the data.

Tools like Great Expectations provide a means for expressing expected properties of the data at each step in a data pipeline. These data tests can be run in a non-production environment of the data platform as part of a continuous integration process.
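To make the idea of "expected properties" concrete, here is a minimal sketch in plain Python. It does not use the Great Expectations API; the helper names (expect_not_null, expect_between, expect_unique, run_suite), the column names, and the 300-850 credit score range are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Expectation = Callable[[List[Row]], bool]


def expect_not_null(column: str) -> Expectation:
    """Every row must have a non-null value in the column."""
    return lambda rows: all(r.get(column) is not None for r in rows)


def expect_between(column: str, low: float, high: float) -> Expectation:
    """Every non-null value must fall in [low, high]; nulls are left to expect_not_null."""
    return lambda rows: all(
        low <= r[column] <= high for r in rows if r.get(column) is not None
    )


def expect_unique(column: str) -> Expectation:
    """No two rows may share a value in the column."""
    def check(rows: List[Row]) -> bool:
        values = [r.get(column) for r in rows]
        return len(values) == len(set(values))
    return check


def run_suite(rows: List[Row], suite: Dict[str, Expectation]) -> Dict[str, bool]:
    """Run every named expectation and report pass/fail per expectation."""
    return {name: check(rows) for name, check in suite.items()}


# Hypothetical edge-case test data set, including a missing value.
edge_case_rows = [
    {"customer_id": 1, "credit_score": 300},
    {"customer_id": 2, "credit_score": 850},
    {"customer_id": 3, "credit_score": None},
]

suite = {
    "customer_id is unique": expect_unique("customer_id"),
    "credit_score is present": expect_not_null("credit_score"),
    "credit_score in 300-850": expect_between("credit_score", 300, 850),
}

results = run_suite(edge_case_rows, suite)
failed = [name for name, passed in results.items() if not passed]
print(failed)
```

In a continuous integration job, a non-empty failed list would typically fail the build, so the pipeline change is not promoted until the data tests pass.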

When the data tests are passing in a lower environment, a test run of the data pipeline can be executed using live production data sets, while keeping the data pipeline code in a pre-production environment. This allows the team to determine if their test data sets accurately match production data and exposes the data pipeline code to large-scale data to detect performance issues. Only once all these checks are passing should the product be promoted to production. This step could also be automated as part of a continuous delivery process.

Figure 1: Data product development process using DataOps principles

2. Automated to Detect Exceptions

Once in production, the same set of tests can be used to ensure that production data matches expectations, catching issues caused by upstream changes in data sources. This can be done on all new data for absolutely critical data inputs like the credit score in an underwriting application, with bad data flagged or written to a quarantine location depending on the severity of impact. For non-critical data, live production data quality metrics are captured by sampling to determine the statistical likelihood of data issues. The live data quality checks produce metadata that describes the quality of the data used within the data product.
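The two production modes described above can be sketched as follows: a full check with quarantine for a critical input, and a sampled pass rate for a non-critical one. This is an illustrative sketch only; the function names, field names, the 300-850 validity range, and the sample size are assumptions, not part of any specific tool.

```python
import random
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def is_valid_credit_score(row: Row) -> bool:
    """Assumed validity rule for the critical field: present and within 300-850."""
    value = row.get("credit_score")
    return isinstance(value, int) and 300 <= value <= 850


def split_critical(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Check every new row; failing rows go to quarantine instead of the data product."""
    accepted: List[Row] = []
    quarantined: List[Row] = []
    for row in rows:
        (accepted if is_valid_credit_score(row) else quarantined).append(row)
    return accepted, quarantined


def sampled_pass_rate(
    rows: List[Row], check: Callable[[Row], bool], sample_size: int, seed: int = 0
) -> float:
    """Estimate the share of rows passing a check from a random sample."""
    if not rows:
        return 1.0
    rng = random.Random(seed)
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    return sum(1 for row in sample if check(row)) / len(sample)


# Hypothetical incoming batch.
batch = [
    {"application_id": 1, "credit_score": 720, "referral_source": "web"},
    {"application_id": 2, "credit_score": 9999, "referral_source": ""},
    {"application_id": 3, "credit_score": None, "referral_source": "branch"},
]

accepted, quarantined = split_critical(batch)
referral_rate = sampled_pass_rate(
    batch, lambda row: bool(row.get("referral_source")), sample_size=100
)
print(len(accepted), len(quarantined), round(referral_rate, 2))
```

The pass rate returned by the sampled check is an example of the quality metadata that can later be surfaced to data consumers.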

While these pre-defined tests measure data quality against expectations, data pipeline testing alone won’t address our second question, which asks how we can effectively monitor and respond to unforeseen scenarios.

Creating the Baseline for “Normal Data”

However good your tests are, it is impossible to test for every potential problem, so an effective data quality system relies on establishing a definition of 'normal data' and then flagging data that is 'abnormal' for further review. Modern machine learning techniques for this type of anomaly detection are quite mature at this point, so enabling the usage of this technology is a matter of cost/benefit analysis. By leveraging this approach in monitoring, organizations become proactive in identifying untested, unexpected quality issues.
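As a rough illustration of learning a baseline and flagging departures from it, the sketch below uses a simple robust statistic (median and median absolute deviation) rather than a machine learning model. The function names, the history values, and the 3.5 threshold are assumptions for illustration; a production system would likely use a more capable technique.

```python
from statistics import median
from typing import List, Tuple


def fit_baseline(history: List[float]) -> Tuple[float, float]:
    """Learn 'normal' as the median and the median absolute deviation (MAD)."""
    center = median(history)
    mad = median(abs(x - center) for x in history)
    return center, mad


def is_abnormal(value: float, baseline: Tuple[float, float], threshold: float = 3.5) -> bool:
    """Flag a value whose robust z-score exceeds the threshold."""
    center, mad = baseline
    if mad == 0:
        # No variation in the history: anything different from it is abnormal.
        return value != center
    robust_z = 0.6745 * abs(value - center) / mad
    return robust_z > threshold


# Hypothetical history of a weekly customer count metric.
history = [150_000, 210_000, 250_000, 300_000, 350_000, 240_000, 260_000]
baseline = fit_baseline(history)

print(is_abnormal(320_000, baseline))  # within the usual spread
print(is_abnormal(500_000, baseline))  # far above the usual spread
```

A flagged value is a prompt for review, not proof of bad data, which matches the article's point that abnormal data is not necessarily incorrect data.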

Avoiding “Alert Fatigue”

Because these techniques must learn a definition of normal, they will work better over time. This means newly released or changed data products will require increased human attention until the system can be properly tuned. In addition to the hard costs of paying for the anomaly detection computation, there are 'alert fatigue' costs associated with putting anomaly detection onto every field. The same prioritization that went into determining the fields requiring data tests can be used to prioritize anomaly detection.

While many organizations only consider running anomaly detection on data values, there are additional metrics that can be monitored to provide indications of other issues to investigate:

Figure 2: Anomaly detection opportunities go far beyond just the value of a specific metric

3. Transparent to Manage Data Consumer Trust

Implementing total quality management practices and adopting anomaly detection techniques are key steps in developing a scalable data quality approach, but undetected data issues will inevitably still arise. To answer our final question, how to maintain the trust of data consumers, we focus on providing transparency through the information captured and displayed to consumers.

Continuing with a data consumer experience mindset, you must design a data product that surfaces enough data quality metadata for data consumers to understand when they should (and when they shouldn't) trust the data they are seeing. It is very easy to lose the trust of data consumers when they spot problems themselves and don't see those problems being resolved, so control measures must be put in place to stop the spread of mistrust when a data issue occurs.

Figure 3: Managing the spread of data mistrust when data issues occur

Finding the Right Metadata

Managing data consumer trust starts with capturing the proper metadata. Typically, the data catalog that is part of the organization’s data platform becomes the central home for this metadata. If your organization does not have a data catalog, then investing in one (open source, home-grown, or purchased are all viable options today) will be the first step.

Once a catalog is established, integrations can feed in:

The data test list run on a given data product in pre-production.
Data compliance statistics captured by running data tests in production.
Data anomalies detected and the reasoning for them.
Data defects reported and the status of the associated fixes.
Data definitions and descriptions, written to help the consumer understand the meaning of the metrics or predictions they are seeing.
Data lineage showing which system(s) the data underlying a metric or prediction came from.
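One way to picture the metadata listed above is as a single record per data product in the catalog. The sketch below is a hypothetical shape, not the schema of any particular catalog product; every class name, field name, and sample value is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Defect:
    ticket_id: str
    description: str
    status: str  # e.g. "open", "root cause identified", "resolved"


@dataclass
class DataProductQualityRecord:
    product_name: str
    definition: str                      # plain-language meaning for consumers
    lineage: List[str]                   # source systems behind the metric
    pre_production_tests: List[str] = field(default_factory=list)
    production_pass_rates: Dict[str, float] = field(default_factory=dict)
    anomalies: List[str] = field(default_factory=list)
    defects: List[Defect] = field(default_factory=list)

    def unresolved_defects(self) -> List[Defect]:
        """Defects a consumer should still be warned about."""
        return [d for d in self.defects if d.status != "resolved"]


# Hypothetical catalog entry.
record = DataProductQualityRecord(
    product_name="weekly customer count",
    definition="Number of distinct customers with at least one order in the week.",
    lineage=["orders_db.orders", "crm.customers"],
    pre_production_tests=["customer_id is unique", "order_date is present"],
    production_pass_rates={"customer_id is unique": 1.0, "order_date is present": 0.998},
    anomalies=["Value above the usual range; under review"],
    defects=[
        Defect("1234", "Source data did not load on time", "root cause identified"),
        Defect("1200", "Duplicate customers after CRM migration", "resolved"),
    ],
)

print([d.ticket_id for d in record.unresolved_defects()])
```

Each integration (test runner, production checks, anomaly detection, ticketing) would update only its own part of the record, which keeps the catalog as the single place consumers and dashboards read from.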

Displaying Metadata in the Right Way

Next, you must find the right place to display the metadata to the data consumer. A successful data quality system will proactively alert affected data consumers in the context of their usage of the data when an issue occurs. To maintain the trust of these consumers, the alerts should provide hints on the actions the data consumers should take, taking into account that data consumers rarely have the ability to correct data problems themselves.

For data consumers who seek out data quality details on their own, an information, warning, or error icon on a dashboard or application screen that lets them click through to the details will often suffice.

An example message for a warning might read: "We have detected abnormal data in the weekly customer count field. Typically, this field has had a range of 150,000-350,000, but now the value is 500,000. This does not mean the value is incorrect, but it is higher than expected. You can learn more about the customer count field here. Please add a note if you believe this new value is correct. If you believe this is an incorrect value, please click here to increase the priority of this investigation."

An example message for an error might read: "We have detected that the source data for the customer count metric did not load on time today. The following ticket was automatically created to track the resolution of this issue: Ticket 1234, and the status of the ticket is 'root cause identified, waiting for resolution.' Subscribe to ticket updates to get an alert when the issue has been resolved. For reference, these types of issues have typically been resolved in four to six hours in the past."
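Messages like these can be generated from the captured metadata rather than written by hand. The following is a minimal sketch under that assumption; the function names and parameters are hypothetical, and the wording is abbreviated from the examples above.

```python
def render_warning(field_name: str, low: int, high: int, observed: int) -> str:
    """Build a consumer-facing warning for a value outside its usual range."""
    direction = "higher" if observed > high else "lower"
    return (
        f"We have detected abnormal data in the {field_name} field. "
        f"Typically, this field has had a range of {low:,}-{high:,}, "
        f"but now the value is {observed:,}. "
        f"This does not mean the value is incorrect, but it is {direction} than expected."
    )


def render_error(metric: str, ticket_id: str, status: str) -> str:
    """Build a consumer-facing error that points to the tracking ticket."""
    return (
        f"We have detected that the source data for the {metric} metric "
        f"did not load on time today. Ticket {ticket_id} was automatically created "
        f"to track the resolution of this issue, and its status is '{status}'."
    )


warning = render_warning("weekly customer count", 150_000, 350_000, 500_000)
error = render_error(
    "customer count", "1234", "root cause identified, waiting for resolution"
)
print(warning)
print(error)
```

Generating the text from the same range, ticket, and status values stored in the catalog keeps the alert consistent with what the consumer sees when they click through to the details.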

Putting Effective Data Quality All Together

In an organization with an effective data quality system, the culture of the organization moves toward acceptance that everyone is a data steward, each with their own role to play.

By managing data quality from planning through development and into production, leveraging ML techniques to monitor for the unexpected, and keeping stakeholders well-informed about their data and the actions they can take to improve it, these organizations are able to move data quality forward continuously.

The data platform plays a key role in enabling proactiveness and collaboration, and we will review effective data platform technology in the final post in this series. If you’d like to have a conversation about data quality with one of our data experts, reach out at marketing@credera.com.