Most big data and artificial intelligence (AI) projects fail or at least don’t live up to the hype and expectations. In the first part of this blog series, we identified four reasons and provided some guidance on dealing with the first one (a rush to do something).
A rush to “do something” (without understanding why).
Poor data quality/governance.
Immature data engineering practices.
People issues.
In the next two posts we want to cover data governance: specifically, what data governance is, why a lack of it is a problem, and what to do about it.
Before we start, we want to acknowledge that nothing makes people’s eyes roll back in their heads like the “g” word: governance. It conjures up visions of seemingly endless committee meetings, massive policy documents, and draconian review boards. So we’ll start by stating unequivocally that this is not what we are suggesting, so please keep reading. This “big G” Governance is the type of corporate bureaucracy that grinds productivity and innovation to a halt. As you will see, we suggest a lightweight (“little g”) governance approach to data management that supports innovation and new capability rather than hindering it.
We’ll cover those details in the “Not Your Father’s Governance” section below, but for now, we’ll just define data governance as the set of standards, tools, and processes that ensure business users and applications have access to the right data at the right time.
Charles Babbage, the great English mathematician and inventor of the first mechanical “computer,” is quoted as saying:
“On two occasions I have been asked, ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.”
Essentially, “garbage in—garbage out.” Unfortunately, most of our data suffers from this problem. We see it all around us. Have you ever:
Received two reports from different systems with conflicting numbers?
Built some model, algorithm, or campaign based on data only to see dismal results?
Heard someone say, “While I understand your analysis, I don’t trust that data”?
You are not alone:
98% of companies believe they have inaccurate data.
One in three business leaders don’t trust the information they use to make decisions.
Poor data quality costs the U.S. economy around $3.1 trillion annually.
66% of data scientists say that cleaning data is the most time-consuming task.
Bad data quality is costing businesses 30% or more of their revenue.
Knowledge workers waste 50% of their time hunting for data, finding and correcting errors, and searching for confirmatory sources for data they don’t trust.
With all the focus on data, how does this happen?
No single source of the truth.
In most companies there is, at some level, a decentralization in the organization. This decentralization is adopted with the promise of quicker innovation, fewer roadblocks, and frankly, less empire building. However, it also promotes isolated, uncoupled systems.
These systems are bound to get out of sync. Take a very simple example. Golden record, single view of the customer, the list to rule all lists: whatever you call it, every organization wants to know the correct contact information for every one of its customers. It is the only way to tell them about changes, let them know their payment method is about to expire, or even just let them know they are important to you. Because every part of the company sees the value, every department makes some attempt to collect that information. Take a second and try to answer the following: How many contact lists do you manage? (Home email contact list, phone contact list, work email contact list, shipping addresses, billing addresses, etc.) Do they all live in a single location? Are they all up to date? Are they all in sync?
If you are like most modern organizations that have embraced the divide-and-conquer mentality of decentralization, each of those lists is maintained and updated by a different part of the organization. Perhaps sales is responsible for phone and email, marketing is responsible for mailing address and email, and customer service is a catch-all responsible for everything it may encounter during a typical call. What’s more, in the spirit of moving fast and removing bureaucracy, each team has its own copy of the subset of data it needs to run its part of the business. Keeping these in sync is a hopeless cause. Most medium to large companies have dozens of systems managing millions of contacts, orders, or customer records.
[Note: We aren’t suggesting centralizing data. Indeed, we think these massive centralization efforts are just as likely to make the problem worse. They create yet another data source to manage. More on that below.]
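To make the drift between departmental copies concrete, here is a minimal Python sketch. The three dictionaries, the customer ids, and the email addresses are all hypothetical stand-ins for real systems; the point is only that a simple comparison across copies surfaces the customers whose contact details disagree.

```python
from collections import defaultdict

# Hypothetical departmental copies of customer contact data (illustrative values only).
sales = {"c-001": "ann@example.com", "c-002": "bob@example.com"}
marketing = {"c-001": "ann@example.com", "c-002": "robert@example.com"}
support = {"c-001": "ann.smith@example.com", "c-003": "cy@example.com"}


def find_conflicts(systems):
    """Return {customer_id: {email: [system names]}} for ids with more than one distinct email."""
    seen = defaultdict(lambda: defaultdict(list))
    for system_name, records in systems.items():
        for customer_id, email in records.items():
            # Normalize lightly so case or stray whitespace is not reported as a conflict.
            seen[customer_id][email.strip().lower()].append(system_name)
    return {
        customer_id: dict(emails)
        for customer_id, emails in seen.items()
        if len(emails) > 1
    }


conflicts = find_conflicts({"sales": sales, "marketing": marketing, "support": support})
for customer_id, emails in sorted(conflicts.items()):
    print(customer_id, emails)
```

A report like this does not say which value is right. That decision still needs an owner, which is where governance comes in.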
Beyond the problem of keeping things in sync (or at least knowing the schedule on which synchronization will occur), these disparate systems and organizations will also use the same data with different definitions. For example, consider two airports that track the time through security lines. Both might report a statistic like “time waiting in line.” But one airport might measure this from the point you arrive until you get to the agent, while another might measure from the point you arrive until you exit the security screening. And how does each one count travelers who are selected for random searches? To be clear, we are not advocating that there is a single correct answer. In fact, both could be right (more on that in our blog post about choosing the right KPIs). But without consistent definitions and logic about the edge cases, the usage of this data (especially its comparison) is pointless.
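The airport example can be sketched in a few lines of Python. The event names and timestamps below are made up for illustration; both airports see the same traveler, yet they report different values for a metric with the same name.

```python
from datetime import datetime

# Hypothetical checkpoint timestamps for one traveler (illustrative values only).
events = {
    "joined_line": datetime(2024, 1, 1, 8, 0),
    "reached_agent": datetime(2024, 1, 1, 8, 12),
    "exited_screening": datetime(2024, 1, 1, 8, 25),
}


def minutes_between(events, start, end):
    """Elapsed minutes between two named events."""
    return (events[end] - events[start]).total_seconds() / 60


# Airport A: "time waiting in line" ends when the traveler reaches the agent.
airport_a = minutes_between(events, "joined_line", "reached_agent")
# Airport B: the same label, but the clock runs until the traveler exits screening.
airport_b = minutes_between(events, "joined_line", "exited_screening")

print(f"Airport A reports {airport_a:.0f} min, Airport B reports {airport_b:.0f} min")
```

Neither number is wrong, but comparing them directly would be misleading unless the definition travels with the metric.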
Given these business realities, there really can be no argument against some form of data governance. The problem is that as soon as people start to consider how to manage their data, they myopically focus on the data itself. The key to actual success is to remember that the data is not the end but rather the means to the end. Thus, the successful approach starts with the business. What business decisions, actions, or interventions is the data driving? This is the same point we made in the previous blog post in this series. Those tasked with governance are in such a rush to do something, they end up solving the wrong problem.
Consider the airport wait time from above. If the point of the metric is to help travelers plan how early they should arrive at the airport, then that wait time should start the moment they get in line and include the extra time spent on random screenings. After all, if catching your flight depends on that wait time, we’d rather use the upper bound, not the lower. However, if the point of sharing that wait time is simply to help travelers choose the shortest current line, then so long as the measurements are consistent, the metric will serve its purpose. Lastly, if that wait time is designed to help the TSA ensure correct staffing and throughput of its x-ray technicians, then it should only include the time after the ID has been verified.
Once you understand the business problem you are trying to solve, answering the following becomes much easier, though it remains critical. Companies must make sure that each piece of critical data (e.g., customer, invoice, etc.) has:
A canonical definition of data attributes and the possible values they may take.
The upstream lineage of those data: where they are generated, how often they are updated, and what other processes manipulate them.
The downstream dependent systems and their allowed use, including frequency of updates: what systems are going to use this, who needs to know if something changes, and who will be impacted when something breaks.
Most importantly, an owner that is responsible for managing changes to the governance.
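As a rough illustration, the four items above could be captured in a small record per data element. The sketch below is one possible shape, not a prescribed schema; the class name, field names, and example values are all assumptions made for this example.

```python
from dataclasses import dataclass, field, fields


@dataclass
class GovernedElement:
    """Hypothetical lightweight governance record for one critical data element."""
    name: str
    definition: str = ""                # canonical definition of the attribute
    allowed_values: str = ""            # the possible values it may take
    upstream_sources: list = field(default_factory=list)      # where it is generated
    update_frequency: str = ""          # how often it is updated
    downstream_consumers: list = field(default_factory=list)  # who depends on it
    owner: str = ""                     # who approves changes

    def missing(self):
        """Names of governance attributes that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative values only; no owner has been assigned yet.
customer_email = GovernedElement(
    name="customer_email",
    definition="Primary email address used to contact the customer",
    allowed_values="A syntactically valid email address",
    upstream_sources=["sales CRM"],
    update_frequency="on customer request",
    downstream_consumers=["billing reminders", "marketing campaigns"],
)

print(customer_email.missing())
```

Running this prints the attributes that still need attention, which in this example is only the owner. Keeping the record this small is deliberate: it covers just what a single use case needs.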
At this point, you might very logically think, “Well, we need to go catalog every piece of data, in every system, understand all the attributes, understand all the ins and outs of data movement, and establish a committee for each business object (e.g., customer, invoice, etc.).” Please don’t. This type of enterprise-wide effort leads to the “big G” Governance issues we all fear. Remember, as with all data problems, don’t boil the ocean; it won’t work out for you and nothing will ever improve. Set your scope to the data required to solve your current problem.
Instead, start with that small use case like we defined in the first part of this post. What data elements are used for that specific need? Assign an owner for those elements that puts just enough controls in place to achieve that use case.
A small effort like this can be easily documented centrally. Any changes can be requested and approved by the owner without the need for committees and review boards. We use the analogy of an open source project on GitHub. The “source” is open to the public and anyone can request changes. “Repository owners” can approve them easily. In fact, GitHub or an equivalent makes an excellent home for this central repository. Changes can be reviewed and approved with the “pull request” mechanism, and history is easily tracked. Just as an open source project grows, you can build up the governance this way over time for the things that are important. Don’t aim for perfection. There are plenty of things that may not have a business case to address.
Governance is not easy and is not something that has to happen all at once. In fact, it’s probably better if it doesn’t. Taking a case-by-case approach can help your organization avoid being overwhelmed and bogged down by data governance (small “g”). This method will help you fight the data governance battle. But having a lightweight and business-driven approach doesn’t solve the entire problem. We’ll cover additional steps in the next blog post.
At Credera, we love helping organizations understand how to leverage their data. Please reach out to us to start a conversation.