
When Pristine Data Isn't Pristine

Data that you consider pristine and perfectly suited to its intended use can turn into an absolute mess overnight if that data is used in a different way. While it isn’t common, a major underlying quality issue can leave a dataset’s current uses untouched and yet, if not identified, completely corrupt a new use of the data. No matter how clean you believe your data to be, you must revisit that assumption every time the data is put to a new use. This blog will explain how this can happen and provide a real, very intuitive example.

The Data Was Fully Tested and Pristine…

My first major run-in with the issue of data quality varying by usage took place in the early 2000s. A team within my company was working with a major retailer to implement some new analytical processes. At the time, transaction-level data had only just become available for analysis. As a result, the analytics being implemented were the first for that retailer to use line-item detail instead of rolling up to store, product, timeframe, or some other dimension.

The retailer had a robust reporting environment that was well-tested. Business users could dive deep into sales by store, product type, timeframe, or any combination thereof. The output of these reports had been validated prior to implementation and then again through everyday use, with users confirming that the numbers in the reports matched exactly what was expected based on other sources. All was well with the reporting, so when it was determined that some initial market basket affinity reports would be implemented, it was expected to be a pain-free process.

Then, things went off the rails.

…Until Suddenly It Wasn’t!

The initial testing of the market basket data was going smoothly overall. However, some very odd results were occurring in only certain cases. For example, items from the deli were producing numbers that just didn’t make sense. As a result, the project team dug more deeply into the data to see what was going on.

What they found was that some stores showed only a single transaction involving deli items each day. That end-of-day transaction would contain 10 lbs of American cheese, 20 lbs of salami, and so on, which are clearly unrealistic quantities for a single shopper. At first glance, this made absolutely no sense and was assumed to be an error of some sort. Then the team dug some more.
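To make the pattern concrete, here is a rough Python sketch of the kind of check that surfaces it. The table, column names, store IDs, and quantities are all hypothetical stand-ins for the retailer’s actual line-item feed, which was never described in detail.

```python
import pandas as pd

# Toy line-item table standing in for the retailer's transaction detail.
line_items = pd.DataFrame(
    {
        "store_id":         ["S01", "S01", "S01", "S02", "S02"],
        "transaction_id":   ["T1",  "T2",  "T3",  "T4",  "T4"],
        "transaction_date": ["2003-05-01"] * 5,
        "department":       ["deli", "grocery", "deli", "deli", "deli"],
        "item":             ["ham", "bread", "salami", "american_cheese", "salami"],
        "quantity_lbs":     [0.5,   1.0,     0.75,     10.0,              20.0],
    }
)

deli = line_items[line_items["department"] == "deli"]

# Count distinct deli-containing transactions per store per day.
deli_txns_per_day = (
    deli.groupby(["store_id", "transaction_date"])["transaction_id"].nunique()
)
print(deli_txns_per_day)  # store S02 shows exactly one deli transaction for the day

# Flag line items with quantities no single shopper would plausibly buy.
print(deli[deli["quantity_lbs"] >= 5])
```

A simple pair of checks like this, one on transaction counts and one on quantities, is enough to spotlight the stores behaving differently from the rest.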

It turned out that, for whatever reason, some of the store locations had not yet integrated the deli’s cash register with the core point-of-sale system. As a result, the deli manager would create a summary tally at the end of each day when the deli closed. The manager would then go to one of the front registers and enter a single transaction with the day’s total for each item. Those totals were actually valid and accurate!

The Implications of What Was Found

The team now knew that the odd deli data was correct. At the same time, the market basket analysis was not working properly. How could these both be true at once? The answer is that the scope of how the data was being analyzed had changed. For years, the company had only looked at aggregated sales across transactions. The manually entered deli end-of-day sales totals were 100% accurate if looking at sales by day or by store or by product. In the way the data had been used in the past, the data truly was pristine. The deli managers’ workaround was ingenious.

The problem was that the new affinity analysis was looking a level lower and diving into each transaction. The large deli transactions weren’t valid at the line-item level because they were, in fact, fake transactions even though the totals weren’t fake. Each nightly deli “transaction” was really an aggregate being forced into a transactional structure. As a result, while the data was pristine when looking at aggregates, it was completely inaccurate for market basket analysis.
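To see why this matters, consider a toy example. Treating a whole day of deli sales as one basket makes every deli item look like it was purchased alongside every other deli item, which inflates exactly the pairwise co-occurrence counts that affinity analysis relies on. The transactions below are made up purely to illustrate the effect.

```python
from itertools import combinations
import pandas as pd

baskets = {
    "txn_001": ["bread", "milk"],                    # genuine customer baskets
    "txn_002": ["bread", "butter"],
    "txn_900": ["american_cheese", "salami", "ham",  # end-of-day deli summary
                "turkey", "roast_beef"],             # forced into one "transaction"
}

# Count how often each pair of items appears together in a "transaction".
pair_counts = pd.Series(
    [pair for items in baskets.values() for pair in combinations(sorted(items), 2)]
).value_counts()

print(pair_counts)
# Every deli pair (e.g., ham + salami) shows a co-occurrence, even though no
# real customer ever bought those items together in a single basket.
```

One fake mega-basket per store per day is enough to make deli items appear strongly affiliated with one another, which is exactly the kind of nonsensical result the team was seeing.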

There was an easy solution to the problem. The team simply filtered out the deli transactions from the stores with the separate deli system. Once the false transactions were removed, the analysis started to work well and the issue was resolved.
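In code, the fix might look something like the sketch below, which reuses the toy table from the earlier example. The list of stores with non-integrated deli registers is hypothetical; the idea is simply to drop every transaction that represents an end-of-day deli summary rather than a real basket.

```python
import pandas as pd

# Toy line-item table (same shape as the earlier sketch).
line_items = pd.DataFrame(
    {
        "store_id":       ["S01", "S01", "S02", "S02"],
        "transaction_id": ["T1",  "T2",  "T4",  "T4"],
        "department":     ["deli", "grocery", "deli", "deli"],
        "item":           ["ham", "bread", "american_cheese", "salami"],
    }
)
non_integrated_deli_stores = {"S02"}  # hypothetical store IDs

# Identify the end-of-day summary "transactions": deli line items rung up at
# stores whose deli register was not tied into the main point-of-sale system.
summary_txns = line_items.loc[
    line_items["store_id"].isin(non_integrated_deli_stores)
    & (line_items["department"] == "deli"),
    "transaction_id",
].unique()

# Drop those transactions entirely before running the market basket analysis.
basket_input = line_items[~line_items["transaction_id"].isin(summary_txns)]
print(basket_input)
```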

Data Governance and Quality Procedures Aren’t One and Done

The takeaway here is that you can never assume that data that has been checked and validated for one use will automatically be acceptable for others. It is necessary to validate that all is well anytime a new use is proposed. A more modern example might be a large set of images that have worked perfectly for building models that identify whether one or more people are present in an image. The quality of those images might not be sufficient, however, if the goal is to identify specifically who is in each image rather than merely detecting that some person is present.

It will certainly be rare for a new use of data to uncover a previously unimportant data quality issue, but it will happen. Having appropriate data governance protocols in place that ensure someone validates assumptions before each new use of data can head off unpleasant surprises down the road. After all, the grocer’s data in the example truly was pristine for every use case it had ever supported in the past. It was only when a new use approached the data from a different perspective that a major flaw was found for the new purpose.

Originally published by the International Institute for Analytics