
Avoiding Analytical Malpractice With COVID-19

The general population is not used to working with data and analytics or to interpreting analytical results. With the COVID-19 crisis, millions of people have become armchair analysts with plenty of time on their hands to practice their newfound analytical “skills”. Add to this a media industry that loves shocking headlines that grab clicks and a social media complex that similarly rewards attention, however it is achieved. We are left with an environment ripe for widespread confusion about what’s really happening with COVID-19. This post discusses a few common traps that distort what the data is really telling us, traps that even professionals fall into and that can lead to analytical malpractice.

Definitions Really Matter

At the most fundamental level, it is critical to understand how a piece of data is collected and defined. For example, a “COVID-19 Death” is defined in many different ways by different jurisdictions. In some cases, only those dying of clear COVID-19 complications, in the context of a firm COVID-19 test and diagnosis, are counted. In other cases, anyone exhibiting symptoms associated with COVID-19 is counted, even if no test was given to confirm it wasn’t just a cold or allergies.

It isn’t that any of the many definitions is wrong in any absolute sense; they all have their merits and downsides. However, they are fundamentally different and will lead to differing estimates of deaths. This means that you can’t take two pieces of data from two sources and directly compare them without first checking how the data has been defined in each case. If you compare death counts computed one way this week with death counts computed another way next week, both your approach and your results are flawed: the trend you think you see may be a result of differences in the data definitions rather than differences in deaths.
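To make this concrete, here is a minimal sketch in Python (the records and field names are purely hypothetical) showing how two defensible definitions, applied to the exact same records, yield different death counts:

```python
# Hypothetical records -- invented purely to illustrate the point.
records = [
    {"died": True, "positive_test": True,  "covid_symptoms": True},
    {"died": True, "positive_test": False, "covid_symptoms": True},   # symptoms, no test
    {"died": True, "positive_test": False, "covid_symptoms": False},  # unrelated death
]

# Definition 1: count only deaths with a confirmed positive test.
confirmed_only = sum(r["died"] and r["positive_test"] for r in records)

# Definition 2: also count deaths with COVID-19 symptoms but no confirming test.
symptoms_included = sum(
    r["died"] and (r["positive_test"] or r["covid_symptoms"]) for r in records
)

print(confirmed_only)     # 1
print(symptoms_included)  # 2 -- same records, different "death count"
```

Neither count is wrong; they simply answer different questions, which is exactly why they can’t be compared directly.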

Lesson: Always validate how the data from multiple sources is defined before you try to compare them and draw conclusions. If a definition isn’t offered with data, then don’t use or trust the data at all!

Assumptions Matter Even More

We all know that models depend on a range of assumptions. This is true of models predicting the spread of COVID-19 as well. Even two people using the exact same “standard model” will get vastly different results if each makes different assumptions. Among the assumptions that feed COVID-19 forecasting models:

  • How easily will COVID-19 spread between people?

  • What is the mortality rate?

  • What mitigation efforts will be made?

If another user of the same model I am using assumes a higher rate of spread, a higher mortality rate, and fewer mitigation efforts than I do, then their estimate of the final impact will be higher than mine. If we both create our results today but release them one week apart, it could be reported that the outlook has worsened or improved. In reality, nothing has changed in the underlying data or the actual trends; all that changed were the assumptions layered on top of those base facts before they were fed into the model. The timing of our reports made a trend appear that was not there. Reverse the order of release and you reverse the “trend”.
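As a toy illustration only (this is not any real epidemiological model, and the growth and mitigation numbers are invented), here is how the same simple projection, fed two different sets of assumptions, produces very different outlooks from identical starting data:

```python
# A toy projection -- not any actual COVID-19 forecasting model.
def project_cases(current_cases, daily_growth_rate, mitigation_factor, days):
    """Project cases forward assuming compounding growth dampened by mitigation."""
    effective_rate = daily_growth_rate * (1 - mitigation_factor)
    return current_cases * (1 + effective_rate) ** days

current_cases = 10_000  # identical starting data for both analysts

# Analyst 1 assumes slower spread and stronger mitigation.
print(round(project_cases(current_cases, daily_growth_rate=0.05,
                          mitigation_factor=0.5, days=30)))  # ~21,000

# Analyst 2 assumes faster spread and weaker mitigation.
print(round(project_cases(current_cases, daily_growth_rate=0.10,
                          mitigation_factor=0.2, days=30)))  # ~101,000
```

Same data, same model, different assumptions, and roughly a fivefold difference in the projected outcome.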

This whole situation gets even messier when different models are used. Not only would different models produce different results with the exact same data and assumptions, but different models may also be using different versions of the data with different assumptions. All of the models we have seen projections from may have merit and be fully defensible. But they are all different. Without taking those differences into account, incorrect conclusions can be drawn.

Lesson: Take the time to understand the assumptions being made to help you identify if the latest forecasts you’re seeing reflect a true upward or downward trend, or if they are just as likely to be an artifact of changing assumptions or differing models.

Clarity And Context May Matter The Most

Perhaps the area where current news stories and social media posts go most astray is in focusing on one specific metric and then using that metric to support the positive or negative point the author wants to reinforce. The exact same set of data can be presented in ways that make it appear to support polar opposite opinions.

Consider a county that does 10x as many COVID-19 tests this week as it did last week. Last week, there were 100 positives out of 1,000 tests; this week there are 500 positives out of 10,000 tests. Here are two competing headlines, one from an opponent of lockdowns and one from a supporter:

  • Negative headline: Confirmed county COVID-19 cases go up 5x week over week!

  • Positive headline: County rate of infection among those tested drops 50% week over week!

Which headline is factually true? Both are! Each reports a mathematically true fact. The problem is that neither acknowledges the full context of those numbers. While the diagnosis count did go up 5x, the number of tests went up 10x, so, all else being equal, we would have expected cases to go up 10x as well. That makes the first headline misleading, even though 5x sounds scary. The second headline leaves out context too. The detected infection rate really is half of what it was last week, but there is still a significant number of positive results, which shows that the virus hasn’t yet gone away.
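A quick sketch of the arithmetic behind both headlines, using the numbers above:

```python
# The same two weeks of numbers support both headlines.
last_week = {"tests": 1_000, "positives": 100}
this_week = {"tests": 10_000, "positives": 500}

# "Negative" headline: confirmed cases rose 5x week over week.
case_growth = this_week["positives"] / last_week["positives"]  # 5.0

# "Positive" headline: positivity among those tested fell by half.
rate_last = last_week["positives"] / last_week["tests"]        # 0.10
rate_this = this_week["positives"] / this_week["tests"]        # 0.05
rate_change = (rate_this - rate_last) / rate_last              # -0.5

print(f"Cases grew {case_growth:.0f}x; positivity went from "
      f"{rate_last:.0%} to {rate_this:.0%} ({abs(rate_change):.0%} drop).")
```

Both figures come from the same data; reporting only one of them is what makes each headline misleading.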

The proper way to present these weekly results would be a headline that simply states: “The latest COVID-19 testing results are in.” Then, within the article, a variety of metrics should be presented that provide the full context of the situation. In other words, report both the increase in case counts AND the decrease in infection rate, then explain how those fit together and what is both good and bad about where things stand as a result. Cherry-picking a single statistic, taking it out of context, and using it to amplify the point you want to make isn’t being data driven; it is being sloppy at best and dishonest at worst.

Lesson: Be sure to offer clarity and sufficient context for any figures you communicate. Provide multiple metrics that capture different dynamics. Don’t fall for headlines or social media posts that sensationalize a single metric. Always dig into the context of the data provided to make sure the reality is what the headline implies it is.

Don’t Forget About Experimental Design!

Any study will have collected data from a specific subset of the population. It is always important to understand the sampling methodology as it informs how broadly a result can be generalized. For example, consider two completely valid studies on two very different populations:

  • Study A looked at COVID-19 infection rates by going door to door and testing people who had been sheltering in place since the start of the crisis

  • Study B looked at infection rates among shoppers at a big box retail store

Which study is valid? Both! However, based on its sampling methodology, each looks at a different subset of the population, and neither subset is representative of the general population. Clearly, the risk profile of those willing to go to a big box store differs from that of those who haven’t left home in weeks. The results of each study, though probably different, can still be useful, but only when keeping in mind the factors we’ve discussed in this blog: look at how the sample and data were defined, review the assumptions behind any forecasts made from the data, and keep in mind how the underlying populations differ as you do so.
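As a rough sketch (the infection rates below are invented purely for illustration), here is how the subpopulation being sampled drives the estimate each study produces:

```python
import random

random.seed(0)  # for reproducibility

def sample_infection_rate(true_rate, n):
    """Estimate an infection rate from n tests of a group with the given true rate."""
    positives = sum(random.random() < true_rate for _ in range(n))
    return positives / n

# Study A: people sheltering in place (assumed lower exposure -- invented rate).
print(sample_infection_rate(true_rate=0.02, n=1_000))

# Study B: big box store shoppers (assumed higher exposure -- invented rate).
print(sample_infection_rate(true_rate=0.08, n=1_000))

# Each estimate is valid for the group that was sampled, but neither
# generalizes to the overall population on its own.
```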

Lesson: As is always true, different COVID-19 studies will have different goals and designs. You can’t just compare results across multiple studies without taking the time to understand each study’s design and sampling methodology.

How Can We All Help?

As analytics and data science professionals, we can do our part by first and foremost not falling into the basic traps outlined here ourselves. But, more importantly, we can try to educate those we know about the same pitfalls so that our friends and family don’t inadvertently make completely erroneous conclusions based on data that they don’t fully understand.

I have personally been very disappointed to see so many news headlines and social media posts, from all sides of the argument, that so blatantly violate the principles outlined in this blog. A silver lining may be that we can use examples from this period in the future to help better educate kids in school on how to properly and critically assess and interpret the data they are exposed to.

In the meantime, those of us who know better should force ourselves to have the uncomfortable discussions required to point out to someone we know when they are making one of the mistakes outlined here. Do it in person or via a phone call, not in a public online forum, or else the chance of a negative confrontation instead of a positive discussion will go way up. With your knowledge of analytics comes a responsibility to share it. Not everyone will listen, but you’ll know you did the right thing.

Originally published by the International Institute for Analytics

Bill Franks, Chief Analytics Officer, helps drive IIA's strategy and thought leadership, as well as heading up IIA's advisory services. IIA's advisory services help clients navigate common challenges that analytics organizations face throughout each annual cycle. Bill is also the author of Taming The Big Data Tidal Wave and The Analytics Revolution. His work has spanned clients in a variety of industries, ranging in size from Fortune 100 companies to small non-profit organizations. You can learn more at http://www.bill-franks.com.

