Realizing Value From Your Data Lake: Lessons Learned and Four Critical Mistakes to Avoid

Maina Musa and Shawnasty Bankovich

Data lakes are a great way to organize and store your company’s data, enabling better problem solving and a faster path to returning value. Google Trends shows an increase of over 300% in search interest in data lakes between 2014 and 2020, a clear sign of their popularity and prevalence in the modern data landscape. Despite this trend, in 2014 Gartner predicted that 90% of data lakes were destined to be useless by 2018. Given this growing popularity and the expected lack of success in many of these ventures, a staggering number of data lakes will fail to reach their true potential while costing tens or even hundreds of thousands of dollars each.

Credera has experience helping clients across all industries and technical maturity levels achieve their data-related goals. We have built custom data lake solutions, integrated commercial data lake products, and ultimately helped our clients realize impactful value from their data lake investments. Here are four key lessons we learned along the way.

Four Data Lake Lessons Learned

1. Define Success

Building a data lake should not itself be the goal, but rather a path to achieve the goal. Many data lakes are set up for failure because they are not linked to an overarching objective. When considering a data lake, it is imperative to define the business goals at hand and assess if a data lake is the optimal solution. Some example goals of data lakes are:

Increased operational efficiency – A data lake can reduce transaction costs and offload strain from native data sources, allowing for more efficient data consumption.
Data availability – Some data lake solutions can provide increased reliability for data access.
Ease of development – Data lakes allow for multiple data sources to be combined and coexist in one central repository.
Enhanced data use – Expanded machine learning opportunities, real-time dashboarding, and advanced analytical capabilities are just a few examples of what a data lake can accomplish.

Once you’ve aligned on the goal you’re working toward, define how you will measure success. Some examples of success metrics could be:

Response times – The average data request needs to complete in under a certain amount of time.
Downtime (data inaccessible) – Data can become inaccessible during scheduled maintenance or deployments. Reducing downtime or limiting it to off-peak hours can have a significant business impact.
Cost of new development – Data lakes can reduce the cost of new development by storing data cheaply and without needing to define a schema up front.
Machine learning performance – Data lakes can improve machine learning evaluation metrics such as accuracy, F1 score, or area under the receiver operating characteristic (ROC) curve by increasing data quality, volume, and accessibility.
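
To make metrics like these concrete, here is a minimal Python sketch of how two of them might be checked. The function names, the millisecond unit, and the sample numbers are assumptions for illustration, not part of any particular data lake product.

```python
from statistics import mean


def meets_response_target(latencies_ms, target_ms):
    """Return True if the average request latency is under the target."""
    return mean(latencies_ms) < target_ms


def f1_score(true_positives, false_positives, false_negatives):
    """F1 is the harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample values: three request latencies and a 250 ms target.
on_target = meets_response_target([100, 200, 300], target_ms=250)
model_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```

Tracking a small set of numbers like these before and after each release gives stakeholders a shared, objective view of progress.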

These are the same steps we follow when working with our clients. For example, a consumer goods manufacturer was struggling to define what success meant for their data lake. We helped them arrive at a tangible metric that represents success. In this case the key performance indicator (KPI) was a frequency cap, a limit on the number of times a person is shown an ad in a given period, and we worked together tirelessly to achieve this goal.

We implemented identity matching that enabled more efficient marketing spend, allowing ads to be served by customer rather than by the traditional by-device approach. This gave the organization the building blocks of identity resolution necessary to meet their frequency capping metric. They anticipate being able to reduce digital ad spend for certain products by up to 60%! Data lakes can provide immense value, but not everyone in the organization has the same idea of what success looks like, so you need to decide what success looks like for your organization.
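
As a rough illustration of why identity matching matters for frequency capping, the sketch below counts impressions per customer instead of per device. It is a simplified stand-in, not the client’s implementation: the function name, the device-to-customer map, and the sample device IDs are all hypothetical.

```python
from collections import defaultdict


def capped_impressions(impressions, device_to_customer, cap):
    """Return the impressions that keep each customer at or under `cap`.

    impressions: device IDs in serving order, all within one capping period.
    device_to_customer: hypothetical identity map from device ID to customer ID.
    """
    served = defaultdict(int)
    allowed = []
    for device_id in impressions:
        # Fall back to the device ID when no identity match exists.
        customer = device_to_customer.get(device_id, device_id)
        if served[customer] < cap:
            served[customer] += 1
            allowed.append(device_id)
    return allowed


requests = ["phone", "laptop", "phone", "tablet", "tv"]
identity_map = {"phone": "alice", "laptop": "alice", "tablet": "alice"}

by_customer = capped_impressions(requests, identity_map, cap=2)
by_device = capped_impressions(requests, {}, cap=2)
```

With the identity map, the three devices that belong to one customer share a single cap, so fewer ads are served than when each device is capped on its own.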

2. Provide Value as Soon as Possible

Since so many technology projects fail to meet expectations, large investments such as data lakes must begin showing value long before the project is complete. Delivering business value, even in small increments, builds stakeholders’ confidence in a project. Gradually delivering tangible value will reduce insecurities, leading to project longevity and success.

One way to incrementally deliver value is by beginning with a minimum viable product (MVP) and continuing with a phased rollout. Regularly releasing small chunks of work allows small amounts of value to be realized each step of the way. Provide value early and incrementally bring in stakeholder input while instilling confidence and making progress toward key success metrics.

A phased data lake rollout may look something like:

Loading data from a SQL data source into an unstructured data lake (i.e., an MVP).
Introducing structure to the data lake where appropriate.
Creating a data warehouse from the data lake.
Querying the data warehouse from a reporting tool or other application.
Loading data from any additional integrations into the data lake.
Offloading strain from native data sources.
Performing advanced analytics on the data lake.
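
The first phase above can be very small. The sketch below copies one table from a SQL source into a raw zone as JSON Lines. It assumes SQLite as the source and a local directory standing in for lake storage; the function name, the `raw/<table>/` layout, and the file name are illustrative choices, not a standard.

```python
import json
import sqlite3
from pathlib import Path


def export_table_to_lake(conn, table, lake_root):
    """Dump every row of `table` as JSON Lines under lake_root/raw/<table>/."""
    # Table names cannot be bound as SQL parameters, so check the catalog first.
    known = {name for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table}")

    target = Path(lake_root) / "raw" / table / "part-0000.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)

    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [column[0] for column in cursor.description]
    count = 0
    with target.open("w", encoding="utf-8") as out:
        for row in cursor:
            # One JSON object per line keeps the file easy to append and scan.
            out.write(json.dumps(dict(zip(columns, row))) + "\n")
            count += 1
    return target, count
```

Returning the row count alongside the file path makes it easy to compare against the source later, which supports the monitoring discussed in the next section.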

The same consumer goods manufacturer mentioned in the previous section quickly reached their first set of goals in the project. Data pipelines were designed to provide incremental value long before completion. When we implemented pipelines to enable the previously mentioned advertising goals, they were able to rapidly populate data into the new workstream even at early stages. Because of this design, a team was not required to track down data or perform additional development. The process was ready for implementation as soon as the objective was defined.

3. Avoid Data Swamps

One of the toughest challenges when dealing with data lakes is maintaining data integrity across unstructured and semi-structured storage. Without proper safeguards, data lakes can easily become stagnant and turn into data swamps with very little value. The inherent lack of structure in a data lake makes data quality difficult to enforce. Alerting and monitoring tools can be used to provide cohesion at each step of the data pipeline, ensuring the source data arrives in the data lake and that the data warehouses (if they exist) are an accurate representation of the data lake.
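
One simple form of this monitoring is a row-count reconciliation between pipeline stages. The sketch below is a minimal example under the assumption that the warehouse table is a row-for-row copy of the lake data (aggregated tables would need a different check). The function name and alert wording are hypothetical.

```python
def reconcile_counts(source_rows, lake_rows, warehouse_rows=None):
    """Return a list of alert messages; an empty list means all checks passed."""
    alerts = []
    if lake_rows != source_rows:
        alerts.append(f"lake has {lake_rows} rows but source has {source_rows}")
    # The warehouse check is skipped when no warehouse exists yet.
    if warehouse_rows is not None and warehouse_rows != lake_rows:
        alerts.append(
            f"warehouse has {warehouse_rows} rows but lake has {lake_rows}")
    return alerts
```

In practice the returned messages would be sent to whatever alerting channel the team already uses, so that a mismatch is noticed shortly after a load rather than weeks later.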

Recently, we worked with a large financial services company to design, develop, and implement a fully custom data lake solution. Although custom solutions can be susceptible to erroneous data, the client’s data was safeguarded by automated testing and quality assurance measures. A data lake is only as useful as the data in it, so validating the accuracy of that data must be a priority.
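
Automated quality checks do not need to be elaborate to be useful. The sketch below shows one possible record-level check that quarantines bad records instead of loading them. The rule format (field name mapped to an expected Python type) and the field names in the usage example are assumptions for illustration, not the checks used on the project described above.

```python
def validate_record(record, required_fields):
    """Return a list of problems found in one record (empty means valid).

    required_fields maps a field name to the Python type its value must have.
    """
    problems = []
    for field, expected_type in required_fields.items():
        if field not in record or record[field] is None:
            problems.append(f"missing {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} is not {expected_type.__name__}")
    return problems


def split_valid_invalid(records, required_fields):
    """Separate clean records from ones to quarantine for review."""
    valid, quarantined = [], []
    for record in records:
        problems = validate_record(record, required_fields)
        if problems:
            quarantined.append((record, problems))
        else:
            valid.append(record)
    return valid, quarantined


# Hypothetical rules and records.
rules = {"account_id": str, "balance": float}
incoming = [
    {"account_id": "a1", "balance": 10.0},
    {"account_id": "a2"},
    {"account_id": 3, "balance": 1.0},
]
clean, rejected = split_valid_invalid(incoming, rules)
```

Keeping the rejected records together with the reasons they failed gives the team something actionable to fix at the source.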

4. Choose the Right Solution for Your Goals

There is an overwhelming number of data lake solutions on the market spanning a wide variety of offerings. Databricks, Snowflake, and Cloudera are entire companies focused on enterprise data platforms. Additionally, major cloud providers such as AWS, Azure, and GCP have built multiple offerings around data lakes. Tools such as Spark and Dask allow you to build custom data lake solutions yourself. With all of these options, it can be hard to know which solution best fits your use case.

The cost of data lake solutions can vary dramatically, and you may find that it is best for your organization to build your own custom solution rather than purchasing software. When choosing a solution, find one that aligns with your goals.

Considerations should include:

Security/compliance requirements – Depending on your industry, location, and user base there may be data compliance standards to follow such as GDPR, CCPA, HIPAA, SOX, etc.
Technical skills and experience of your IT team – Different data lake solutions vary in skillsets required for implementation and maintenance.
Storage requirements – Size, type, format, accessibility, and location of your data will drive which data lake solution is appropriate for your needs.
Ease of sustainability/maintenance – Data lake solutions come in a range of maintenance levels, from fully managed services to custom application development.
Cost of investment – Data lakes can be very expensive, but there are different models of payment options, such as upfront software cost, monthly licensing fees, or even paying for custom software development.
Integrations – Determine how your data lake will interact with the rest of your data architecture, including data sources, third-party software, data warehouses, etc.
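
One common way to compare options against considerations like these is a weighted scoring matrix. This is a general decision aid rather than a method prescribed here, and the sketch below is only an illustration: the criteria names, weights, option labels, and scores are made-up placeholders that you would replace with your own assessment.

```python
def rank_options(weights, scores):
    """Rank options by weighted average score, best first.

    weights: criterion -> importance (any positive numbers).
    scores: option name -> {criterion -> score on a shared scale, e.g. 1 to 5}.
    """
    total_weight = sum(weights.values())
    ranked = []
    for option, option_scores in scores.items():
        weighted = sum(weights[c] * option_scores[c] for c in weights)
        ranked.append((option, weighted / total_weight))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Placeholder inputs, not real product evaluations.
weights = {"security": 3, "cost": 1}
scores = {
    "managed_service": {"security": 3, "cost": 5},
    "custom_build": {"security": 5, "cost": 2},
}
ranking = rank_options(weights, scores)
```

The value of the exercise is less in the final number than in forcing stakeholders to agree on which considerations matter most before comparing solutions.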

Choosing the wrong solution can be a significant financial and timeline burden that can be difficult to bounce back from. Choosing the correct solution can unlock the extraordinary.

For example, when a financial services client of ours was struggling to utilize an existing data warehouse, we assessed their existing technology and found that a data lake would better support their goals. This client had multiple data sources they needed to map together to get a 360-degree view of their customer. Additionally, they handled sensitive financial information and therefore required very granular control of data access. Due to the strict constraints of this use case, we helped them move forward with a custom data lake build that allowed them to map together their data at different levels of access and achieve their goals.

Implementing a Data Lake for Your Business

Data lakes pose challenges that can differ from many other technology projects. Solutions are far from one size fits all. Credera has years of experience helping our clients with a variety of data lake challenges. If you think a data lake might be right for your organization or you’re interested in realizing more value from an existing data lake, we would love to help! Reach out to us at marketing@credera.com. We can talk through your data lake challenges and help you achieve your goals.