I have been thinking about some of the changes in analytics over the last decade, coinciding with the revised and updated release of my book with Jeanne Harris, Competing on Analytics. The book is ten years old, and much has changed in the world of analytics in the meantime. In updating the book (and in a previous blog post about the updates), we focused on such changes as big data, machine learning, streaming analytics, embedded analytics, and so forth. But some commenters have pointed out that one change that’s just as important is the move to self-service analytics. We described this trend in our book (due in early September), but we may not have given it the focus it deserves.
There should be no doubt that analytics of virtually every type are becoming more of a self-service activity. There is also little doubt that analytics were once an activity requiring analytical professionals. For the last several decades, however, and especially today, analytics have been getting steadily easier to use. It’s easier to perform most key tasks in the analytical process, such as:
Acquiring, integrating, reviewing, and cleaning data;
Running descriptive and predictive analytics on the data;
Finding the model that best fits your data;
Displaying descriptive analytics in an appealing visual format;
Interpreting the results.
WHAT MAKES SELF-SERVICE POSSIBLE?
Why have things gotten easier for analytics users? There is no single breakthrough, but rather a series of incremental improvements. Analytical software has gotten better in terms of the basic user interface, which is almost always a point-and-click one these days. There are a variety of common data formats (e.g., comma-separated values, or CSVs) that make it relatively easy to acquire and integrate data. Almost all analytical systems allow the user to view data as a series of points on a grid, which facilitates identification of outliers and data entry errors.
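As a simple illustration of how approachable these acquisition and review steps have become, here is a minimal sketch in Python using the pandas library; the file name and the three-standard-deviation outlier rule are hypothetical choices for illustration, not anything prescribed by a particular self-service tool.

```python
import pandas as pd

# Load a CSV file (hypothetical file name) into a tabular data structure.
df = pd.read_csv("sales.csv")

# View the data as a grid of rows and columns, much as a self-service
# tool presents it, and summarize each numeric column.
print(df.head())
print(df.describe())

# Flag rows with values more than three standard deviations from the
# column mean as candidate outliers or data entry errors for review.
numeric = df.select_dtypes("number")
suspect = (numeric - numeric.mean()).abs() > 3 * numeric.std()
print(df[suspect.any(axis=1)])
```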
Running descriptive analytics has gotten easier, both in terms of creating the analyses and displaying them visually. So-called OLAP systems, which involved manipulating pre-structured data cubes, were relatively easy to use once the cube had been constructed, but that typically needed to be done by IT professionals. And users often realized that the data they wanted to analyze wasn’t in their cube, so they needed a new one to be constructed.
Newer tools not only have better interfaces but also eschew the cube idea, allowing users to work on an entire dataset. This eliminates, or at least reduces, the need for help from IT professionals. In addition, most analyses with contemporary tools take place entirely in memory, which speeds analysis dramatically and makes it possible to iterate frequently until the best results are achieved.
Finding the model that best fits your data (a problem in predictive and prescriptive analytics) sometimes requires machine learning, but not always. Some more traditional statistical analysis systems can now make recommendations about what kinds of analyses to perform as well. These systems can examine the data and the modeling roles (independent or dependent, for example) of the selected variables, and then suggest, for instance, that a bivariate correlation is the best analysis for the data.
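To make the example concrete, here is a minimal sketch (not any particular vendor's feature) of computing a bivariate correlation in Python; the variable names and values are invented for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: two numeric variables selected by the user.
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "revenue":  [12, 25, 31, 44, 48],
})

# Pearson's r measures the strength of the linear (bivariate) relationship.
r, p_value = stats.pearsonr(df["ad_spend"], df["revenue"])
print(f"correlation r = {r:.2f}, p = {p_value:.3f}")
```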
For the most automated (or at least semi-automated) approach, machine learning systems can try out more than a hundred different algorithms on thousands or millions of possible variable combinations and transformations. Some machine learning systems simply ask for a dataset and the variable to be predicted, and the system does the rest. They will even point out likely outliers and errors in data, and exclude them from the analysis automatically if you want. Of course, there is a downside to this ease of analysis; it may be difficult to understand and interpret the results. Hypothesis-driven analyses tend to be much more interpretable.
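The sketch below is a drastically simplified, hypothetical version of that idea using scikit-learn: it tries a handful of algorithms on a synthetic dataset and keeps the one with the best cross-validated score. Commercial machine learning systems do this at far greater scale, with many more algorithms and automated variable transformations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical dataset: the user supplies features X and a target y to predict.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Try several candidate algorithms and keep the one with the best
# cross-validated score.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print(f"best model: {best}")
```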
Finally, analytical tasks related to displaying and interpreting results have gotten substantially easier for amateurs to perform. Visual analytics displays can be created easily and quickly. Some vendors even recommend particular visual display types for particular types of data, e.g., a line chart for time-series data.
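For instance, the kind of chart such a tool would recommend for time-series data can be produced in a few lines; the sketch below uses matplotlib with invented monthly figures.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly revenue figures.
months = pd.date_range("2017-01-01", periods=12, freq="MS")
revenue = [110, 115, 120, 118, 125, 130, 128, 135, 140, 138, 145, 150]

# A line chart is the conventional recommendation for time-series data.
plt.plot(months, revenue)
plt.title("Monthly revenue")
plt.xlabel("Month")
plt.ylabel("Revenue ($K)")
plt.tight_layout()
plt.show()
```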
Interpretation of analytical results is eased not only by visuals, but also by automatically generated textual narratives. More than one vendor of “natural language generation” software can create a paragraph or so of interpretive text about a particular bit of descriptive analytics. It is early days for this technology, but some viewers and decision-makers may find text easier to interpret than bar and line charts.
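A toy, template-based version conveys the flavor of the idea (real natural language generation products are far more sophisticated); the weekly sales figures below are invented.

```python
import pandas as pd

# Hypothetical weekly sales figures to summarize.
sales = pd.Series([230, 245, 260, 220, 275, 290],
                  index=["Week 1", "Week 2", "Week 3",
                         "Week 4", "Week 5", "Week 6"])

# A template-based narrative generated from descriptive statistics.
change = (sales.iloc[-1] - sales.iloc[0]) / sales.iloc[0] * 100
narrative = (
    f"Sales averaged {sales.mean():.0f} units per week, "
    f"peaking at {sales.max()} in {sales.idxmax()} and "
    f"dipping to {sales.min()} in {sales.idxmin()}. "
    f"Overall, sales {'rose' if change >= 0 else 'fell'} "
    f"{abs(change):.0f}% over the period."
)
print(narrative)
```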
LIMITATIONS OF SELF-SERVICE
All of these technological advancements have made it much easier for analytical amateurs to create professional-level results. This is mostly a good thing. However, there are some limits to the self-service movement, at least at the present time. As with spreadsheets (perhaps the first self-service analytics technology), amateur analysts can still get in trouble in several different ways.
The ease of collecting, integrating, and transforming data, for example, means that it is very easy for organizations to end up with “multiple versions of the truth”: one person’s analysis, presented at a meeting, will conflict with a colleague’s. The ease of manipulating data may also lead to errors, as was the case with spreadsheets. Depending on which research study you believe about spreadsheet errors (ironically, there are multiple versions of the truth here, too), somewhere between 20 and 80 percent of spreadsheets contain an error. With other types of analytics software the chance of error is lower (in part because the user isn’t generally creating an algorithm or data transformation logic), but it’s still present.
Beyond simply avoiding errors, several aspects of analytics still require some expertise, despite increasing levels of analytical ease and automation. Decisions on such questions as how to frame the overall analysis, which dataset to use, how best to handle missing data, and whether more or better data are needed still require human judgment. In addition, many statistical modeling approaches make certain assumptions about the data, and it’s important to ensure that those assumptions are not being violated. Because of these remaining needs for expertise, amateurs may still need at least to consult with analytical professionals as they go about their work.
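As a small illustration of the judgment these questions still require, the sketch below (with invented data) checks how much data is missing and runs one quick test of a distributional assumption; deciding what to do with the answers remains a human task.

```python
import pandas as pd
from scipy import stats

# Hypothetical dataset with some missing values.
df = pd.DataFrame({
    "price":  [10.0, 12.5, None, 14.0, 15.5, 18.0],
    "demand": [100,  95,   90,   None, 80,   72],
})

# How much data is missing, and where? Whether to drop rows, impute
# values, or go get better data is still a human decision.
print(df.isna().sum())

# Many classical techniques assume roughly normal data or residuals;
# a Shapiro-Wilk test is one quick (and imperfect) check on a column.
stat, p = stats.shapiro(df["price"].dropna())
print(f"Shapiro-Wilk p-value: {p:.3f}")
```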
THE GREAT MIDDLE GROUND
The press and some vendors tend to discuss analytical expertise as binary: rank amateurs vs. experienced professional analysts, or Ph.D. data scientists vs. “citizen data scientists.” You probably already realize that the world is a little more complex than that. Not all amateurs are created equal, and neither are all professionals. There is a continuum of expertise in almost every phase of analytics. Some “amateurs” may not know when to employ logistic regression, but may be quite wise about how to frame a decision and how to communicate the results of analyses in a way that inspires trust and action. And the most sophisticated statistician or data scientist may be lacking in some of those same attributes.
If your organization is trying to increase the number of people who work with data and make decisions on the basis of analytics, don’t succumb to simplistic binary distinctions. Instead, figure out the skills that are needed to succeed with analytics. Create a set of roles—“business analyst,” “quantitative analyst,” “data scientist,” and the like—and specify what level of the needed skills each role should have. You won’t be able to capture all of the complexity of skill/role combinations, but at least model some of it. Then start thinking about certifying people in the various roles.
And by all means, take advantage of the easier-to-use technologies that make it possible for more people to perform their own analyses. You may also want to classify the technologies themselves in terms of how well suited they are to each role. And make sure you revise that classification often, because this technology changes very quickly.