
When Big Data Can't Predict

Most people think that in the age of big data, we always have more than enough information to build robust analytics. Unfortunately, this isn’t always the case. In fact, there are situations where massive amounts of data still don’t allow even basic predictions to be made with confidence. In many cases, there isn’t much that can be done other than to recognize the facts and stick to the basics instead of getting fancy. Big data that can’t be used to predict seems like an impossible paradox at first, but let’s explore why it isn’t.


One example where issues arise is when we have a ton of data on a very small population. This makes it tough to find meaningful patterns. Let’s think about an aircraft manufacturer. Today’s airplanes generate terabytes of data every hour of operation. A lot of benefits can come out of analyzing that data, such as understanding how the engines are operating under differing conditions. At the same time, however, some exciting analytics like predictive maintenance can be difficult. Why is that?

Realize that even the biggest aircraft manufacturers only put out a few hundred airplanes per year. By the time the different models are taken into account, perhaps only a couple dozen of some models are produced in any given year. Even if the aircraft come fully loaded with sensors throughout, it will be hard to develop meaningful predictive part failure models. Why? Because with only a few dozen or hundred aircraft, the sample is too small.

This is exacerbated by the low failure rate of things like an engine (or engine component), especially on a new aircraft. So, while petabytes of data might be collected over a couple years of operation, there simply may not be enough aircraft to create a large enough pool of good and bad events from which to build predictive models that really work. Certainly, we can monitor the data to look for anomalous patterns that might support an investigation or intervention. But, that’s not a predictive model.
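To make the small-sample problem concrete, here is a rough sketch of how few failure events a small fleet might yield. Every number in it (fleet size, failure rate, the 30-event threshold) is a made-up illustration, not a figure from any manufacturer, and the function names are hypothetical. It also assumes failures arrive independently, so that the count of events is roughly Poisson distributed.

```python
import math


def expected_failures(fleet_size: int, annual_failure_rate: float, years: float) -> float:
    """Expected number of failure events across the whole fleet.

    annual_failure_rate is the assumed chance that one aircraft has
    the failure of interest in one year (a hypothetical input).
    """
    return fleet_size * annual_failure_rate * years


def prob_fewer_than(k: int, expected: float) -> float:
    """Poisson probability of observing fewer than k events."""
    return sum(
        math.exp(-expected) * expected**i / math.factorial(i)
        for i in range(k)
    )


if __name__ == "__main__":
    # Hypothetical: 50 aircraft of one model, 1% annual failure rate, 2 years.
    events = expected_failures(fleet_size=50, annual_failure_rate=0.01, years=2)
    print(f"Expected failure events: {events:.1f}")
    # Hypothetical rule of thumb: a model needs at least 30 failure events.
    print(f"Chance of seeing fewer than 30: {prob_fewer_than(30, events):.4f}")
```

Under these assumed inputs, the petabytes of sensor readings describe only a handful of actual failures, which is the quantity a predictive model has to learn from.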


There are other situations where there is a large universe of people or things to analyze and lots of data about them all. However, when events are exceedingly rare, you can still end up with a situation where there just aren’t enough exceptions to build truly effective predictive models. Again, this isn’t to say that there isn’t a lot of value in analyzing the data and understanding various aspects of the behavior of the people or things. It is simply saying that it may not be possible to build effective predictive models.

Let’s consider computer chips. Many millions, if not billions, of chips are produced each year and the rate is ever increasing. Decades ago, defect rates on the order of one in 10,000 or one in 100,000 might have been acceptable. With today’s chip-infused products, defect rates need to be closer to the one-in-millions level. I’ve had clients mention that there is pressure from the auto industry to drive chip defect rates down to one in a billion or less. Why is that?

The answer is that if any given new car has 1,000 chips in it a few years from now, even small defect rates start to translate into a lot of defective vehicles. With a defect rate of one in 1,000,000, about one of every thousand cars produced would have at least one critical defect, since 1,000 chips at one in a million each works out to roughly one in 1,000 per car. That translates to a lot of cost. It can also lead to lost lives if a chip fails in an autonomous vehicle and causes it to malfunction while in operation. Hence, the push for incredibly low defect rates.
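The arithmetic can be checked with a short sketch. It assumes that chip defects are independent and that every chip has the same defect rate, which are simplifications I am adding for illustration. The function name is hypothetical.

```python
import math


def prob_at_least_one_defect(chips_per_car: int, defect_rate: float) -> float:
    """Probability that a car contains at least one defective chip.

    Assumes independent chips with an identical defect rate.
    Computes 1 - (1 - defect_rate) ** chips_per_car using log1p/expm1
    so that very small rates do not lose floating-point precision.
    """
    return -math.expm1(chips_per_car * math.log1p(-defect_rate))


if __name__ == "__main__":
    # 1,000 chips per car, at the defect rates mentioned in the text.
    for rate in (1e-4, 1e-5, 1e-6, 1e-9):
        p = prob_at_least_one_defect(1_000, rate)
        print(f"defect rate {rate:.0e}: roughly 1 in {1 / p:,.0f} cars affected")
```

At one in a million, the result lands very close to one car in a thousand. Pushing the rate to one in a billion moves it to about one car in a million, which shows why the auto industry would press for it.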

The issue is what happens once such low defect rates are achieved. If we assume that a wide range of issues could lead to a defective chip, there will be very few instances of any given defect happening for any specific set of reasons. We may never have enough of a sample to build a good model that predicts when and where those failures might occur. Considering that chips become outdated and are replaced with newer models within just a few years, it is quite plausible that this will be an ongoing issue.


Keep in mind that the issues I’ve raised here are not the rule, but the exception. However, as data is collected from more and more sources and we analyze more and more aspects of our businesses, these exceptions are almost certain to pop up within your organization now and then. The important thing to do is simply to be on the lookout for cases where you have a very small universe to analyze, an incredibly rare event to analyze, or, worst of all, a rare event within a small universe. I am assuming, naturally, that you are only considering situations where the data is relevant to your business problem. Data that isn’t relevant will never add value no matter how big or small.

When occasions arise where you’re uncertain your data is going to be effective for prediction, make sure you assess what will plausibly be possible before investing too much energy into developing sophisticated analytics on the data. You may have to settle for basic analytics in some cases. Remember, however, that you should still be better off than if you had no data at all to analyze. That’s the upside to focus on instead of letting frustration get the best of you.