A Major Barrier to Scalable Analytics

Every large company wants and needs scalable analytics. That’s always been the case. It’s also always been the case that achieving the required scale is a mighty struggle. As new technologies and performance enhancements become available, the volume of data being analyzed and the complexity of the analytics always seem to outrun the improvements those enhancements make possible. There is one constant challenge, however, spanning both recent and historical architectures, that creates a major barrier to achieving scale – the movement of data. This blog will discuss that challenge.

The Location of Data Is Everything

When creating an analysis, it is typical to bring multiple data sources, tables, or files together. Those data sources can be stored on the same platform or different platforms. When the volume of data is small, it really doesn’t make much difference where any of it is stored. However, as the volume of data increases, the location of each source of data becomes an important consideration and a critical component of performance.

Before we move on, let’s establish a definition of the “same platform.” In this context, it means that multiple data sources all reside in repositories hosted on the same shared hardware. For example, a classic relational database system can hold many different containers of data, but the database management software stores them all within the same set of hardware. In the case of multi-node, parallel systems, the data will be spread across different physical machines, but those machines will be directly connected with the highest-speed protocols available. This is the best case.

Another acceptable case is when multiple repositories exist and share a high-speed local network, but they are distinct hardware platforms. This would be common when a company has multiple data mart platforms (possibly from different vendors) within a data center, but the machines comprising those marts are all interconnected via high-speed private LAN connections. To the extent that data must flow back and forth, the dedicated network makes it as fast as possible.

The worst case is when multiple data sources are located in multiple locations that are not directly connected and where internet routing is necessary for those repositories to communicate. Most cloud systems follow this model because data is stored on shared servers located anywhere within a vast expanse of the cloud provider’s equipment. The cloud platform may have high-capacity internet bandwidth available, but the servers won’t share a private network except in cases where a company pays for a configuration that enables it (in which case the configuration is more a dedicated hosting service than a public cloud service).

Move Data Source “A” to Data Source “B” or Data Source “B” to Data Source “A”? That Is the Question!

With that background, we’re ready to dive into the root problem that has always existed for creating scalable analytics and that isn’t going away in the future. Namely, whenever you must combine two sources, tables, or files of data, there is no way around moving at least one of them to the other (or both to a common location). It is impossible to join two tables with a SQL query, for example, without bringing the tables together on the same platform.

In a simple case of two tables where one is very large and one is small, the obvious choice is to move the small table to the large one. When two or more very large tables are being joined, however, the combination will be painful no matter which way you go. If you’re combining a many-terabyte transaction table or file with a many-terabyte web browsing table or file, for example, the obvious choice would be to move the relatively smaller transaction table to the web browsing table. However, that’s still a lot of data to move! As the number of tables increases, the complexity increases; however, luckily there are usually only one or two very large tables that pose an issue, along with many smaller lookup tables that don’t.
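The “move the smaller side” heuristic above can be captured in a quick back-of-envelope calculation. Here is a minimal Python sketch – all table names and sizes are hypothetical – that picks the join location minimizing the total bytes that must travel:

```python
# Hypothetical table sizes in gigabytes -- purely illustrative numbers.
TABLE_SIZES_GB = {
    "transactions": 40_000,   # ~40 TB transaction table
    "web_browsing": 90_000,   # ~90 TB web browsing table
    "store_lookup": 0.05,     # small dimension/lookup table
}

def gb_moved_if_join_runs_at(host_table: str, sizes: dict) -> float:
    """If the join runs where host_table lives, every other table must travel there."""
    return sum(gb for name, gb in sizes.items() if name != host_table)

# Hosting the join where the largest table lives minimizes data movement.
best_host = min(TABLE_SIZES_GB, key=lambda t: gb_moved_if_join_runs_at(t, TABLE_SIZES_GB))
print(best_host)  # the web browsing table stays put; everything else moves to it
```

The same calculation generalizes as the table count grows: because there are usually only one or two very large tables alongside many small lookup tables, the choice of host is dominated by where the biggest table lives.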

The point is that it takes a lot of time, network bandwidth, and processing power to bring two huge data sources together. That has clear implications for both process performance and for process cost. In an on-premises environment where the equipment is paid for, the processing cost may not matter nearly as much as the performance cost will. In a cloud environment, both will matter a lot since you’ll pay for all the CPU and disk utilized as you move and join your large data sources. That can add up fast!
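To make “that can add up fast” concrete, here is a rough transfer-time estimate in Python. The data size and bandwidth figures are illustrative assumptions, and the arithmetic ignores protocol overhead, compression, and retries:

```python
def transfer_hours(gigabytes: float, gigabits_per_second: float) -> float:
    # 1 gigabyte = 8 gigabits; divide by 3600 to convert seconds to hours.
    return gigabytes * 8 / gigabits_per_second / 3600

DATA_GB = 40_000  # moving a hypothetical 40 TB table

lan_hours = transfer_hours(DATA_GB, 100)  # 100 Gb/s private LAN
wan_hours = transfer_hours(DATA_GB, 1)    # 1 Gb/s internet path

print(f"100 Gb/s LAN: {lan_hours:.1f} h, 1 Gb/s internet: {wan_hours:.1f} h")
# → 100 Gb/s LAN: 0.9 h, 1 Gb/s internet: 88.9 h
```

The hundredfold gap between the two lines is why the “best case,” “acceptable case,” and “worst case” platform configurations described earlier matter so much – and in a cloud environment, per-gigabyte egress charges stack on top of the elapsed time.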

Why This Challenge Isn’t Going Away

This challenge exists with traditional relational data warehouses and data marts. It also exists in the more recent NoSQL and file-based data repositories. Most critically, the challenge doesn’t go away with cloud architectures in general, or with newer paradigms like a data mesh or data fabric. While some of the newer approaches make the management of disparate data sources more seamless for users and easier for administrators, the fundamental problem of having to move data is still there. It’s just easier to ask for it to happen. Hybrid cloud environments can be particularly problematic in this respect, since joins may force data to cross the comparatively slow link between on-premises and cloud platforms.

The takeaway here is that no matter how great the latest platform or architecture you’re considering sounds, there is no way to escape the performance issues caused by moving data from place to place to join and analyze it. Be sure to understand and account for that while maintaining reasonable expectations as to the cost and performance you can expect in whatever environment you choose. Also put substantive thought into where you physically store your critical data sources with respect to one another.

Like gravity, the scale challenge when combining disparate data sources is simply a fact of life. Don’t let your guard down simply because a new platform or architecture comes along. It may solve a lot of your problems, but you still won’t be able to bypass the need for data to be brought together to be joined. Success will come with a sober acceptance of that fact and a careful and realistic process setting scale, cost, and performance expectations.

Originally published by the International Institute for Analytics