The concept of a blockchain has become quite a phenomenon in recent years. It has quickly risen from a relatively obscure idea, known mostly within small circles, to one being discussed as having the potential to change some of the fundamentals of the world’s economic systems.
I don’t claim to be a blockchain expert, but given that it is a new paradigm for generating and storing data, my mind has naturally drifted toward thinking about how the mechanics and performance of analyzing data within a blockchain environment would be different from how we analyze data within other platforms. My initial thoughts point to some significant challenges.
A SYSTEM THAT ISN’T BUILT FOR ANALYTICS ISN’T OPTIMIZED FOR ANALYTICS
Let’s start with a historical perspective by examining the early days of data warehousing and 3rd normal form data. Storing data in 3rd normal form has a range of benefits, particularly when storing massive amounts of data at an enterprise scale. For one, it minimizes data duplication and, therefore, storage costs. However, for building models and executing deeper analytics, we need to denormalize that data, so 3rd normal form adds overhead to our analytic processing. That overhead is worth accepting for benefits such as increased data integrity, but the mechanics of preparing the data for any given analysis are more involved.
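To make that concrete, here is a minimal sketch in Python of the join work needed to flatten 3rd normal form data before analysis. The customer and order tables, and their column names, are made up purely for illustration:

```python
# A minimal sketch of the extra work 3rd normal form imposes on analytics.
# The customer/order schema and column names here are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 3rd normal form: each fact stored once, no duplication.
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "East"), (2, "West")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)])

# Analytics typically wants one wide, denormalized row per order,
# which means joining the normalized tables back together first.
denormalized = cur.execute("""
    SELECT o.order_id, c.customer_id, c.region, o.amount
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
""").fetchall()

for row in denormalized:
    print(row)
```

The join itself is trivial here, but at enterprise scale that preparation step is where much of the analytic effort and processing time goes.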
I see the same theme playing out with blockchain data. Blockchain has a number of advantages because the data is validated by a range of unrelated servers and can’t be changed once logged. But, as with 3rd normal form data, blockchain technology is not aimed first and foremost at enabling complex analytics. Rather, it is aimed at ensuring the secure and accurate recording of transactions. It is natural, therefore, that the ability to perform deeper analytics is today a secondary consideration (at best!) to the ability to validate and log transactions.
SOME POTENTIAL BLOCKCHAIN “GOTCHAS” TO THINK ABOUT
First, within a blockchain, the history of each transaction is stored. The history of a given chain of events can therefore be recreated by traversing the entire chain. For example, I can go back and determine who held a specific Bitcoin at a specific point in time. The problem is that I must recursively scan the history of that coin to get the answer. That isn’t too bad when I need to validate a single historical record. But what if I want to do analytics across millions of people? To do that, I must traverse the records for all of the holdings of each person to get back to the point in time of interest. In other words, just getting the balance for each person one year ago requires quite a bit of work beyond simply querying a table. In the aggregate, this is a time- and processing-intensive task, and it will limit what type of analytics can be done within a given timeframe or cost.
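To illustrate the point, here is a minimal sketch in Python of replaying a simplified ledger to reconstruct balances as of a point in time. The block and transaction structure below is a stand-in of my own, not the actual Bitcoin data model (which is UTXO-based), but it shows why there is no simple “balance as of date” lookup:

```python
# A minimal sketch of why point-in-time balances require replaying history.
# The block/transaction structure is a simplified, hypothetical stand-in.
from collections import defaultdict

chain = [
    {"timestamp": 1, "txs": [{"from": None, "to": "alice", "amount": 50}]},  # newly minted coins
    {"timestamp": 2, "txs": [{"from": "alice", "to": "bob", "amount": 20}]},
    {"timestamp": 3, "txs": [{"from": "bob", "to": "carol", "amount": 5}]},
]

def balances_as_of(chain, cutoff):
    """Replay every transaction up to `cutoff` to reconstruct balances.
    There is no stored 'balance at time T' to query directly."""
    balances = defaultdict(float)
    for block in chain:
        if block["timestamp"] > cutoff:
            break
        for tx in block["txs"]:
            if tx["from"] is not None:
                balances[tx["from"]] -= tx["amount"]
            balances[tx["to"]] += tx["amount"]
    return dict(balances)

print(balances_as_of(chain, cutoff=2))  # {'alice': 30.0, 'bob': 20.0}
```

Multiply that replay across millions of addresses and years of history, and the cost of even a simple “balances one year ago” question becomes clear.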
Next, inherent within blockchain is the idea that the entire history of the chain is stored everywhere. This is terrific, but it also means that the size of the data stored will blow up very quickly. As opposed to 3rd normal form data, which minimizes data redundancy, blockchain actually maximizes redundancy since every single node in the system hosts a complete copy of the data. This keeps the chain secure. However, as the data grows, so must the capacity of the individual nodes. At some point, the data will outgrow the capacity of the servers.
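As a rough illustration of how full replication multiplies the storage footprint, here is a back-of-envelope calculation. The figures are made up purely for illustration:

```python
# An illustrative calculation of how full replication multiplies storage.
# All numbers below are assumptions, not measurements of any real chain.
chain_size_gb = 400          # assumed current size of the full chain
daily_growth_gb = 0.15       # assumed growth per day
node_count = 10_000          # assumed number of full nodes

total_stored_today_gb = chain_size_gb * node_count
total_stored_next_year_gb = (chain_size_gb + 365 * daily_growth_gb) * node_count

print(f"Stored across all nodes today:     {total_stored_today_gb:,.0f} GB")
print(f"Stored across all nodes in a year: {total_stored_next_year_gb:,.0f} GB")
```

Every gigabyte added to the chain is added on every node, so the aggregate storage bill grows with the number of nodes as well as with the chain itself.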
As blockchain data grows, analytics will become an issue even before the data outgrows the servers. The servers are configured to validate transactions, update the blockchain, and store the results. They are not configured to support complex analysis across ranges of time and people. Performance for analytic logic will be spotty at best.
Last, the entire point of a blockchain is the historical record. In other environments, we have the ability to build additional views or summaries of data to support our analytics. I don’t believe there is any mechanism for this built into blockchain today. And, per the prior points, even if it existed it likely wouldn’t be very good. What this means is that to support analytics, we’ll need to build separate environments that have the rollups and analytic views we require. And, these views will need to be updated periodically to reflect the latest data.
One big plus of blockchain for building analytic views is that data can’t be edited or deleted, so a summary table built once on historical data stays fresh. There is no need to process anything but the latest data when updating our analytic views, unless we require metrics that force us to process through all of the data again to get accurate information.
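Here is a minimal sketch of what that incremental refresh might look like, with hypothetical structures and a simple “total received per address” metric. Because historical blocks never change, only the blocks appended since the last run need to be folded into the summary:

```python
# A minimal sketch of an incremental rollup over an append-only chain.
# Structures and names are hypothetical; the metric is total amount received.
from collections import defaultdict

def refresh_summary(chain, summary, last_height_processed):
    """Fold only the blocks appended since the last refresh into the summary."""
    for height in range(last_height_processed + 1, len(chain)):
        for tx in chain[height]["txs"]:
            summary[tx["to"]] += tx["amount"]
        last_height_processed = height
    return last_height_processed

summary = defaultdict(float)
last = -1

chain = [
    {"txs": [{"to": "alice", "amount": 50}]},
    {"txs": [{"to": "bob", "amount": 20}]},
]
last = refresh_summary(chain, summary, last)   # initial build over all history

chain.append({"txs": [{"to": "carol", "amount": 5}]})
last = refresh_summary(chain, summary, last)   # processes only the new block

print(dict(summary))  # {'alice': 50.0, 'bob': 20.0, 'carol': 5.0}
```

Metrics that depend on the full history in a non-additive way (a median, for example) would still force a rescan, which is the exception noted above.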
BRINGING ANALYTICS TO BLOCKCHAIN
Over time, more focus will certainly be given to how to effectively use blockchain data for analytics. In the meantime, expect the prior issues and more like them to make the analysis of blockchain data more cumbersome and less efficient than we’d like. As with any new tool or technology, leave plenty of extra time to deal with the “gotchas” that will certainly accompany early efforts at doing analytics against blockchain data.
As I said before, I’m not a blockchain expert and I’m simply providing some initial thoughts I have on its implications for analytics. I welcome comments to help refine the thinking for me and other readers.