A Practitioner’s Case for Structured Data Discovery

Back in the slammin' 70s, John Tukey published Exploratory Data Analysis, in which he championed the idea of playing around with datasets before jumping into hypothesis testing. Tukey argued that by doing so, one can uncover new information about the data and the topic in question and develop new hypotheses that might lead to interesting results. He wrote: "It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE IT."

Since then, exploratory data analysis (EDA) has grown in popularity, and today it would be quite difficult to find a Kaggle challenge submission notebook that does not start with an EDA.

If you've ever been curious enough to read Exploratory Data Analysis (and done so recently), you probably found it filled with outdated techniques—like how to easily find logarithms of numbers or how to plot data points by hand. But if you braved the book and either scoffed at or pondered how far we've come from such prehistoric tutorials, you would also find plenty of useful examples and ideas. Primarily, you'd see that EDA is not a specific set of instructions one can execute—it is a way of thinking about the data and practicing curiosity.

However, while Tukey brilliantly describes EDA and its techniques, his book omits the first and frequently overlooked step in data analysis: understanding the topic. While this idea may be intuitive to some, not everyone may actually practice it. And although it may be tempting to jump into programming or creating beautiful visualizations, EDA may be misleading if we do not comprehend what our dataset represents. So, before embarking on our data exploration, we should also cultivate curiosity about the topic or process that our data attempts to describe.

Specifically, we should ask ourselves two questions:

  1. What do we know?
  2. What do we not know?

By attempting to answer these questions, we can establish a frame of reference that will guide our analysis.

Knowledge Is Power

When attempting to solve a math problem, a good strategy is to first write down everything that is known about the problem. Similarly, in data analysis, if we already have a dataset that we plan to analyze, it is natural to want to know what the data represents. If we don't yet have the dataset, it is equally natural to ask questions about our topic to gather appropriate requirements for the dataset and to grasp the end goal. In this section, I propose a structured approach to gathering the facts about our analysis: the question "what do we know?" can be divided into three separate "what" questions.

What is the subject matter?

While subject matter expertise can be left to the experts, a proficient data analyst should investigate the subject and acquire comprehensive knowledge about the topic. The rationale behind this extends beyond mere curiosity. Understanding the subject matter aids in identifying the necessary information for analysis and helps gather specific requirements. When working with an existing dataset, it can also enhance the effectiveness of EDA. Additionally, it enables analysts to avoid redundant work.

For instance, if we know that a company publicly announces quarterly earnings, it can provide an explanation for why the stock price undergoes sudden changes on a quarterly basis. An analyst can include this information in a list of known facts, saving time when conducting EDA for an analysis of the company's stock price fluctuations. Furthermore, the analyst may request quarterly financial statements as an additional data requirement.

What are the definitions?

Before proceeding with the analysis, it is crucial to compile a dictionary of definitions and known terms. Having a readily available dictionary can aid in uncovering nuances in the analysis, understanding the logic involved in various calculations, and facilitating communication with stakeholders. The process of compiling a dictionary can also generate additional questions and hypotheses to enrich the analysis.

For instance, if you are given a wine quality dataset (like this one), and you are tasked with predicting the quality of wine, you could simply import the dataset, import scikit-learn, and run a model. But if you took the time to build a dictionary of terms, you’d come to understand that "volatile acidity," for example, is defined as the measure of fatty acids with low molecular weight in wine and is associated with wine spoilage. So, if your model predicts that volatile acidity positively contributes to the predicted quality of wine, it may be necessary to revisit your model or be prepared to justify this outcome to your stakeholders.
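To make this concrete, here is a minimal sketch of that sanity check. It uses synthetic stand-in data rather than the actual wine dataset, with made-up coefficients chosen so that volatile acidity lowers quality, as its definition suggests it should:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the wine data: quality is generated so that
# higher volatile acidity (a spoilage signal) lowers it.
rng = np.random.default_rng(0)
n = 200
volatile_acidity = rng.uniform(0.1, 1.2, n)
alcohol = rng.uniform(8.0, 14.0, n)
quality = 3.0 - 2.0 * volatile_acidity + 0.3 * alcohol + rng.normal(0, 0.2, n)

X = np.column_stack([volatile_acidity, alcohol])
model = LinearRegression().fit(X, quality)
coef = dict(zip(["volatile_acidity", "alcohol"], model.coef_))

# The dictionary of terms says volatile acidity signals spoilage,
# so a positive coefficient here would be a red flag worth investigating.
if coef["volatile_acidity"] > 0:
    print("Warning: volatile acidity appears to improve quality - revisit the model.")
```

The point is not the model itself but the habit: every learned coefficient gets compared against the definition you wrote down for that term.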

What are the underlying processes?

The final step in gathering the knowns is to understand the underlying processes that govern the subject of your analysis. Typically, this should be accomplished through systems analysis, which identifies processes and aids in raising and answering various questions about the data. It can also serve as a tool to guide the analysis.

Let’s say you are tasked with identifying factors that contribute to your company's growth. Developing a systems diagram of growth paths can be a valuable starting point for understanding the data that should be collected and for testing various hypotheses. A tree diagram, for instance, could highlight three ways in which your company can increase its revenue: sales to new clients, revenue growth from existing clients, or a decrease in the number of churned clients. From these factors, you can begin to construct a detailed picture. You could explore the various methods by which your company acquires clients and identify the data needed to verify the importance of those factors.
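Such a tree can even live in code as a lightweight artifact of the discovery phase. A hedged sketch, with purely illustrative factor names, might represent it as nested dicts and walk it to enumerate the leaf factors, each of which suggests data to collect:

```python
# Hypothetical growth tree from the example above; the factor
# names are invented for illustration, not from any real company.
growth_tree = {
    "revenue_growth": {
        "new_clients": ["marketing channels", "referrals"],
        "existing_clients": ["upsells", "price changes"],
        "churn_reduction": ["support quality", "retention offers"],
    }
}

def leaf_factors(tree):
    """Walk the tree and collect leaf factors - each suggests data to collect."""
    if isinstance(tree, list):
        return list(tree)
    return [leaf for subtree in tree.values() for leaf in leaf_factors(subtree)]

factors = leaf_factors(growth_tree)
```

Keeping the decomposition explicit like this makes it easy to review with stakeholders and to check that every leaf has a data source behind it.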

Ignorance Is Bliss

Once we have a better understanding of everything we do know about our subject, it is time to ask ourselves, "What do we not know?" While additional data collection may be required in certain cases, the true purpose of this question is to understand our limitations and the requirements we cannot fulfill. We also need to recognize the biases and assumptions under which we will make any recommendations.

Is the information complete?

To address this question, let's examine the data we have or are able to obtain and assess the following criteria:

  1. Does the data present the entire picture or only a part of it? For example, if we are tasked with analyzing a stroke dataset from Electronic Patient Records (EPR), we are only examining individuals who have a record, not everyone who may have experienced a stroke. What implications does this have for our analysis and any recommendations we plan to make?
  2. Are there missing data points? It is common for datasets to have missing records or columns of data. Often, an analyst must devise a strategy for handling these missing data points. However, the strategy may vary depending on the reasons for the missing data.
  3. Is the data of good quality? In many instances, data transformation, errors in data collection, or manual inputs can result in poor data quality. Before utilizing the provided information, an analyst should first confirm its correctness and cross-reference it with existing sources if possible. Secondly, they should develop a strategy to address poor quality and uncertainty.

Let's imagine that you are assigned the task of analyzing product reviews on a website (like these) to uncover trends and insights and influence future product stocking. Such a dataset relies on individuals who voluntarily submit their product reviews. However, not all individuals engage in this activity. Therefore, we must question whether this dataset is representative of the entire population. If we apply natural language processing to the reviews, should we expect the quality of the reviews to be good, or do we need to perform some preprocessing to enhance the readability of certain entries? If we do not preprocess the data, are we potentially missing out on important information? Is there any additional data or supplementary analysis we can obtain to enhance our analysis?
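As a small illustration of the preprocessing question, a first-pass cleaner for noisy review text might look like the following. The regexes are a rough sketch under the assumption that reviews contain stray HTML and punctuation noise, not a full NLP pipeline:

```python
import re

def clean_review(text: str) -> str:
    """Minimal preprocessing sketch: strip HTML remnants, punctuation
    noise, and extra whitespace so downstream NLP sees uniform text."""
    text = re.sub(r"<[^>]+>", " ", text)               # drop stray HTML tags
    text = re.sub(r"[^a-z0-9'\s]", " ", text.lower())  # keep words, digits, apostrophes
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_review("LOVED it!!! <br/>Would buy again :)")
```

Note what this throws away: the exclamation marks and emoticon carried sentiment. Deciding whether that loss matters is itself part of answering "is the information complete?"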

What assumptions are we making?

In situations where data is incomplete or of poor quality, and it is not possible to obtain additional information, an analyst may have to accept their fate and design strategies to proceed with the analysis. However, it is essential to explicitly state the assumptions made, as they will define the boundaries and scope for interpreting the results of the analysis. Making assumptions carries inherent risks, and decision-makers must be informed of the risks they are willing to accept in utilizing the analysis results.

For example, when analyzing a dataset of happiness scores across the globe (like this one), and no additional information is available, several assumptions need to be made. Primarily, if we want to draw inferences about the general population based on the represented countries, we must assume that the results of individual countries accurately represent their respective populations. This assumes that the survey participants represent the wider population of their countries (which is a hefty assumption). We must also assume that the methodology employed by the Gallup World Poll, which conducted the survey, is unbiased and not skewed toward specific populations or methods (which is another hefty assumption). Additionally, we must assume that the translation of the survey into different languages did not affect the results.

Are we biased?

The final question that determines the strategy for analysis and interpretation of results deals with bias—an inclination or prejudice towards a particular viewpoint. Biases stem from individuals' perceptions of the world and their experiences and interactions. Without understanding different types of biases and critically evaluating information through an unbiased lens, the insights derived may not accurately represent reality. Instead, they may be skewed toward a specific point of view. Biased analysis not only fails to fully represent reality but can also be unfair and unethical.

In the case of analyzing a dataset of US presidential debate transcripts (like this one), the analysis may be influenced by recency bias, wherein greater importance is placed on more recent debates while disregarding the fact that culture changes and evolves over time. Additionally, an analyst with a strong political standpoint may exhibit confirmation bias, selectively ignoring evidence that does not support their viewpoint.

This post advocates for a structured approach to data discovery—a process that must occur before engaging in exploratory data analysis. The purpose of this process is twofold: to understand the known information and to evaluate the unknown. By answering questions that aid in the discovery process, we can develop a perspective and lens through which the results will be interpreted. It also helps us determine whether our insights can be utilized for decision-making and the limitations on the types of decisions we can make.

Before jumping into the exploration of unknown lands, we should discover those lands, understand them, survey them, and prepare ourselves for the journey. Only then will we be able to explore safely and with confidence.

Originally published in Towards Data Science. The version posted here has been slightly edited for clarity.