It is good to remember in today’s hype-filled big data world that there is no “easy” button for big data. In fact, in many ways, big data is quite difficult to deal with. Many organizations seem to be falling for the fallacy that simply implementing new tools or platforms will “automagically” solve their big data problems. Unfortunately, this isn’t the case.
For example, there is a common belief that MapReduce platforms such as Teradata Aster or Hadoop can tame big data in and of themselves. In reality, they don’t inherently enable any new functionality or analytic logic. Rather, they allow certain kinds of functionality and analytic logic to be scaled in a way that makes them far more powerful and widely applicable.
This is an important distinction – and one I want to explore in detail.
Many organizations seem to be thinking of MapReduce as a magic bullet or “easy” button for handling big data. Just set up a system, and your big data problems are solved, right? Wrong. Once the system is in place, it is still necessary to develop the analytic processes that run against it. There really is no shortcut here. If you want great analytics, you’re going to have to build your processes just like you always have. Organizations that don’t understand this fact will be disappointed when they realize they aren’t instantly getting the value they expected from their investment.
As I said earlier, MapReduce doesn’t inherently enable new functionality. When you hear about MapReduce environments, you will quickly come to a discussion of leveraging languages such as Java or Python. These languages have been around for quite a while and had strong followings long before the concept of MapReduce came into existence. Most users of these languages have never used, and may never use, a MapReduce architecture as part of their work. Yet they code away day to day, developing processes just like their big data-focused counterparts.
What many people don’t take the time to think about is that whatever logic you develop today in Java to run in a MapReduce environment is something you could have written in Java years ago. The exact same code, the exact same output for a given piece of data. This is why I said that MapReduce doesn’t directly cause any new analytic logic to come into existence. Rather, MapReduce provides a highly scalable platform so that logic can be executed at a scale far surpassing what was possible in the past.
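To make the point concrete, here is a minimal sketch of word-count logic in plain Java; the class and method names are purely illustrative. Nothing about it requires MapReduce. The same tokenize-and-tally logic could sit inside a Hadoop mapper and reducer, where the framework’s contribution is running it across a cluster rather than changing what the logic does.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain Java word-count logic, no MapReduce required.
public class WordCountSketch {

    // The "map"-style step: break one record (a line of text) into words.
    static String[] tokenize(String line) {
        return line.toLowerCase().split("\\W+");
    }

    // The "reduce"-style step: tally how often each word appears.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : tokenize(line)) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data is big", "data is everywhere");
        // Prints counts such as big=2, data=2, is=2, everywhere=1 (order may vary).
        System.out.println(count(lines));
    }
}
```

Port the same tokenize and tally steps into a Hadoop mapper and reducer and the output for a given piece of data is identical; what changes is that the framework can run them over data spread across many machines.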
This last point is the value that MapReduce brings. Having a terrific facial recognition or text parsing algorithm doesn’t do much good if there is no way to scale the process to a big data environment. MapReduce provides that ability. It lets organizations apply algorithms to a much wider range of problems and much larger volumes of data. It makes it practical to build logic into analytic processes that simply wasn’t practical to include before.
This is no different from how parallel database platforms provide value. A massively parallel processing (MPP) database system runs SQL just like a non-MPP system does. An MPP system doesn’t enable new functionality in the absolute sense, but it does provide the ability to scale an SQL process. As a result, it enables far more value to be derived, and a much wider set of problems to be practically addressed, than a non-MPP architecture does.
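To put the analogy in code terms, here is a minimal sketch assuming a generic JDBC driver and hypothetical connection URLs: the query text and the client code are identical whether the target is an MPP cluster or a single-node database. Only the connection string changes; the MPP platform parallelizes the work behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative only: the same SQL, unchanged, against either platform.
public class SameSqlAnyScale {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection strings; substitute real drivers, hosts, and credentials.
        String mppUrl        = "jdbc:exampledb://mpp-cluster:5480/analytics";
        String singleNodeUrl = "jdbc:exampledb://single-node:5432/analytics";

        // The query is written once; neither platform needs a different version of it.
        String sql = "SELECT region, SUM(amount) AS total FROM orders GROUP BY region";

        // Swap in singleNodeUrl to run the identical code against the non-MPP system.
        try (Connection conn = DriverManager.getConnection(mppUrl, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + ": " + rs.getLong("total"));
            }
        }
    }
}
```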
In summary, we can expect MapReduce to continue to be a force behind the taming of big data. But the onus will still be on the organizations that use it to develop and implement the required analytic processes, just as they always have. Many analytics that were theoretically possible but impractical will now be within reach, and that will lead to a lot of value. The key is to understand what the architecture will do for you and not to underestimate the effort required to use it correctly. It will take work to get the benefits. There is no “easy” button for big data.