If I gave you the choice of winning either $1,000,000 or one penny doubled every day for a month, which one would you pick? The million dollars sounds pretty good, doesn’t it? However, by the time day 30 comes along that penny doubling will be worth more than $5 million due to the power of exponential growth.
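To make the arithmetic concrete: a penny on day 1, doubled on each subsequent day, is worth $0.01 × 2^29 by day 30. A quick sketch:

```python
# Day 1 is one cent; the value doubles on each subsequent day.
def penny_value_cents(day: int) -> int:
    return 2 ** (day - 1)

print(f"Day 30: ${penny_value_cents(30) / 100:,.2f}")  # Day 30: $5,368,709.12
```

That final-day value alone is more than five times the lump-sum million, and the cumulative total across all 30 days is roughly double the day-30 figure again.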
When considering which new analytics to tackle, organizations often focus on the short-term cost to build, test, and implement the various options and on the returns each is expected to generate. There is nothing wrong with this, but in the age of big data, I'd like to suggest another criterion that should also be considered: how much growth is expected in the volume of the data used for the analysis, and how will that growth impact costs and analytics processes in the future? The issue is not simply storing the data, but actually impacting the business with powerful analytics that must scale with the data.
Will the data required for a new analysis follow a linear, easy-to-predict growth curve or an exponential, impossible-to-predict one? This question matters because the answer will heavily impact the ongoing maintenance and updating of your analytics processes. Let's look at what I mean by linear and exponential growth in this context and then discuss the implications for planning and decision making.
LINEAR, PREDICTABLE GROWTH
Most historical data followed a linear growth pattern, and many sources of big data do too. Let's start with a historical example. When capturing customer transactions, regardless of industry, it is easy to project future storage, processing, and analytics needs. There are only so many new customers to acquire, and each customer will only execute so many transactions.
This means that even if you were to acquire everyone in the world as a customer and managed to get them all to execute three times today's average transaction count, it is very easy math to identify how much data that will translate into. It is also easy to project the processing and analytics approach required to handle that volume. It may not be cheap, but you know exactly what you're getting into from the start and have a very good idea, before you begin, of whether you can keep things cost-effective in the long run.
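As an illustration, here is the kind of back-of-the-envelope projection the linear case allows. Every input here (customer count, transaction rate, record size) is a hypothetical placeholder, not a figure from the text:

```python
# Hypothetical inputs; substitute your own organization's figures.
customers = 10_000_000           # assumed total addressable customers
txns_per_customer_year = 200     # assumed annual transactions per customer
bytes_per_txn = 500              # assumed size of one transaction record

annual_tb = customers * txns_per_customer_year * bytes_per_txn / 1e12
print(f"~{annual_tb:.0f} TB/year today, ~{annual_tb * 3:.0f} TB/year at 3x volume")
```

The specific numbers matter less than the structure: every factor in the multiplication has a known ceiling, so the worst case is computable up front.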
A modern example is monitoring the temperature and humidity of a home to optimize climate settings for comfort and cost. Realistically, most homes will have only one or two sensor points collecting temperature and humidity, corresponding to each home's one or two thermostats. Even if every room gets its own set of sensors, that still adds only perhaps five to seven more readings per house. There are only so many houses, and even being aggressive, you only need temperature and humidity readings in perhaps one-minute increments. So, any analytics planned against this data will similarly be very easy to project and very easy to plan for.
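The same kind of ceiling can be estimated for the home-climate example. Again, every figure below is an assumption chosen for illustration; the point is that a worst-case bound exists and is easy to compute:

```python
homes = 100_000_000       # assumed number of connected homes
points_per_home = 8       # thermostats plus per-room sensors (assumed)
readings_per_minute = 2   # temperature and humidity
bytes_per_reading = 50    # assumed record size

minutes_per_year = 365 * 24 * 60
annual_pb = (homes * points_per_home * readings_per_minute
             * bytes_per_reading * minutes_per_year) / 1e15
print(f"Upper bound: ~{annual_pb:.0f} PB/year")
```

Even with deliberately aggressive assumptions, the result is a fixed number you can budget against, which is exactly what the exponential cases below do not offer.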
The catch is that in today’s world there are also data sources that aren’t quite so easy.
EXPONENTIAL, UNPREDICTABLE GROWTH
Data with exponential growth is difficult to plan for. This is because the volume of data, the number of points generating the data, and the complexity of the data are all unclear beyond a short time from now. Let’s go back to our homes. Solving the specific issue of monitoring temperature and humidity is itself linear in nature as outlined above. However, managing a fully connected home is exponential. Why?
At this time, we have no idea how many sensors might end up in each home. We also have no idea how many different environmental readings each sensor might produce, nor how frequently we would need the readings to be produced. We also have no idea how various sensor information will interact with other sensor information and require mixing and matching for analytic processes. In other words, the resources required to fully utilize the data generated by a connected home may well grow in an exponential and unpredictable way. Will we have 100 sensors measuring 10 metrics each minute or 10,000 sensors measuring 100 metrics every millisecond? And, how will all of those readings interact over time? We have no idea as of today.
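The gap between those two hypothetical scenarios is worth quantifying, because it shows why the exponential case resists planning:

```python
# Readings per second under the two scenarios described above.
modest = 100 * 10 / 60            # 100 sensors x 10 metrics, once per minute
extreme = 10_000 * 100 * 1_000    # 10,000 sensors x 100 metrics, every millisecond

print(f"{extreme / modest:,.0f}x difference in data rate")
```

A spread of seven-plus orders of magnitude between plausible outcomes is the planning problem in a nutshell: no single architecture or budget comfortably covers both ends of that range.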
Another example is the monitoring of our activities and vital signs via sensors such as fitness bands. The analysis of steps, for example, is a linear problem. People can only take so many steps, and we know how many people there are in the world. But as sensors proliferate and start to monitor everything from body temperature, to blood sugar levels, to blood pressure, to myriad other metrics, we have no idea what we're getting into. Though we may know how many people we'll have to capture data from, we have no idea how many sensor readings we may eventually collect or how often each will have to be collected. We also have no idea how complex our analytics will have to be to make full use of that information. For example, what other bodily readings must be taken into account along with blood pressure? The data growth is exponential, as are the complexity and volume of the analytics processes required to analyze the data.
MAKING THE RIGHT DECISIONS
The takeaway from the prior examples is that there is additional risk in pursuing analytics that involve exponential growth. Therefore, one consideration when prioritizing which initiatives to pursue should be the extent to which the problem can be classified as linear or exponential. When two projects appear fairly equal on other measures, take the one with linear growth. There is much less long-term risk and a high level of certainty that whatever analytics you create will stay relevant for a long time.
Perhaps one of the biggest challenges with many sources of big data will be taking a scenario with exponential growth properties and figuring out how to filter and limit what is captured, transforming it into a scenario with far slower, if not linear, growth.