Unstructured data has been a very popular topic lately since so many big data sources are unstructured. However, an important nuance is often missed - the fact is that virtually no analytics directly analyze unstructured data. Unstructured data may be an input to an analytic process, but when it comes time to do any actual analysis, the unstructured data itself isn’t utilized. “How can that be?” you ask. Let me explain…
Let’s start with the example of fingerprint matching. If you watch shows like CSI, you see them match up fingerprints all the time. A fingerprint image is totally unstructured and also can be fairly large in size if the image is of high quality. So, when police on TV or in real life go to match fingerprints, do they match up actual images to find a match? No. What they do is first identify a set of important points on each print. Then, a map or polygon is created from those points. It is the map or polygon created from the prints that is actually matched. More important is the fact that the map or polygon is fully structured and small in size, even though the original prints were not. While unstructured prints are an input to the process, the actual analysis to match them up doesn’t use the unstructured images, but rather structured information extracted from them.
An example everyone will appreciate is the analysis of text. Let’s consider the now popular approach of social media sentiment analysis. Are tweets, Facebook postings, and other social comments directly analyzed to determine their sentiment? Not really. The text is parsed into words or phrases. Then, those words and phrases are flagged as good or bad. In a simple example, perhaps a “good” word gets a “1”, a “bad” word gets a “-1”, and a “neutral” word gets a “0”. The sentiment of the posting is determined by the sum of the individual word or phrase scores. Therefore, the sentiment score itself is created from fully structured numeric data that was derived from the initially unstructured source text. Any further analysis on trends or patterns in sentiment is based fully on the structured, numeric summaries of the text, not the text itself.
This same logic applies across the board. If you’re going to build a propensity model to predict customer behavior, you’re going to have to transform your unstructured data into structured, numeric extracts. That’s what the vast majority of analytic algorithms require. An argument can be made that extracting structured information from an unstructured source is a form of analysis itself. However, my point is simply that the final analysis, which is what started the process of acquiring the unstructured data to begin with, does not use the unstructured data. It uses the structured information that has been extracted from it. This is an important nuance.
One reason it is important is that it gets to the heart of how to handle unstructured big data sources in the long run. Clearly, some new tools can be useful to aid in the initial processing of unstructured data. However, once the information extraction step is complete, you’re left with a set of data that is fully structured and, typically, much smaller than what you had when you started. This makes the information much easier to incorporate into analytic processes and standard tools than most people think. Through an appropriate information extraction process, a big data source can shrink to a much more manageable size and format. At that point, you can proceed with your analytics as usual. For this reason, the thought of using unstructured data really shouldn’t intimidate people as much as it often does.