Why Nobody Is Actually Analyzing Unstructured Data

Unstructured data has been a popular topic lately, since so many big data sources are unstructured. However, an important nuance is often missed: virtually no analytics directly analyze unstructured data. Unstructured data may be an input to an analytic process, but when it comes time to do any actual analysis, the unstructured data itself isn’t utilized. “How can that be?” you ask. Let me explain…

Let’s start with the example of fingerprint matching. If you watch shows like CSI, you see them match up fingerprints all the time. A fingerprint image is totally unstructured and also can be fairly large in size if the image is of high quality. So, when police on TV or in real life go to match fingerprints, do they match up actual images to find a match? No. What they do is first identify a set of important points on each print. Then, a map or polygon is created from those points. It is the map or polygon created from the prints that is actually matched. More important is the fact that the map or polygon is fully structured and small in size, even though the original prints were not. While unstructured prints are an input to the process, the actual analysis to match them up doesn’t use the unstructured images, but rather structured information extracted from them.
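The fingerprint idea can be sketched in a few lines: each print reduces to a small structured record of minutiae points, and matching compares those records rather than the images. Here is a minimal illustrative sketch in Python; the coordinates, angles, and tolerances below are invented for illustration and are not taken from any real matching system:

```python
from math import hypot

# A minutia: (x, y, angle_in_degrees) -- a tiny structured record
# extracted from an otherwise unstructured fingerprint image.
# All values here are made up for illustration.
print_a = [(10, 12, 45), (40, 80, 90), (72, 33, 130)]
print_b = [(11, 13, 44), (41, 79, 92), (70, 30, 135)]

def match_score(a, b, dist_tol=5.0, angle_tol=10.0):
    """Fraction of minutiae in `a` that have a close counterpart in `b`."""
    matched = 0
    for (x1, y1, t1) in a:
        for (x2, y2, t2) in b:
            # Two points "match" if they are near each other in both
            # position and ridge angle, within the given tolerances.
            if hypot(x1 - x2, y1 - y2) <= dist_tol and abs(t1 - t2) <= angle_tol:
                matched += 1
                break
    return matched / len(a)

print(match_score(print_a, print_b))  # 1.0: every point has a close match
```

Note that the comparison never touches the original images: a few numeric tuples per print are all the matching step needs.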

An example everyone will appreciate is the analysis of text. Let’s consider the now popular approach of social media sentiment analysis. Are tweets, Facebook postings, and other social comments directly analyzed to determine their sentiment? Not really. The text is parsed into words or phrases. Then, those words and phrases are flagged as good or bad. In a simple example, perhaps a “good” word gets a “1”, a “bad” word gets a “-1”, and a “neutral” word gets a “0”. The sentiment of the posting is determined by the sum of the individual word or phrase scores. Therefore, the sentiment score itself is created from fully structured numeric data that was derived from the initially unstructured source text. Any further analysis on trends or patterns in sentiment is based fully on the structured, numeric summaries of the text, not the text itself.
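The word-scoring scheme above can be sketched just as briefly. The tiny lexicon here is invented for illustration and is far smaller than any real sentiment dictionary:

```python
import re

# Toy sentiment lexicon: +1 for "good" words, -1 for "bad" words;
# any word not listed scores 0. The word list is invented for illustration.
LEXICON = {"great": 1, "love": 1, "terrible": -1, "hate": -1}

def sentiment(text):
    """Sum of per-word scores: a structured number derived from raw text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(w, 0) for w in words)

print(sentiment("I love this great product"))    # 2
print(sentiment("terrible service, I hate it"))  # -2
```

Once each posting is reduced to a single score like this, every downstream trend or pattern analysis works on plain numbers, not on the original text.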

This same logic applies across the board. If you’re going to build a propensity model to predict customer behavior, you’re going to have to transform your unstructured data into structured, numeric extracts. That’s what the vast majority of analytic algorithms require. An argument can be made that extracting structured information from an unstructured source is a form of analysis itself. However, my point is simply that the final analysis, which is what started the process of acquiring the unstructured data to begin with, does not use the unstructured data. It uses the structured information that has been extracted from it. This is an important nuance.

One reason it is important is that it gets to the heart of how to handle unstructured big data sources in the long run. Clearly, some new tools can be useful to aid in the initial processing of unstructured data. However, once the information extraction step is complete, you’re left with a set of data that is fully structured and, typically, much smaller than what you had when you started. This makes the information much easier to incorporate into analytic processes and standard tools than most people think. Through an appropriate information extraction process, a big data source can shrink to a much more manageable size and format. At that point, you can proceed with your analytics as usual. For this reason, the thought of using unstructured data really shouldn’t intimidate people as much as it often does.

Originally published by the International Institute for Analytics

  • http://sethgrimes.com Seth Grimes

    As I wrote in 2005, “Most unstructured data is merely unmodeled.” http://www.informationweek.com/news/60406733

    Seth, http://twitter.com/sethgrimes

  • Scott Radcliffe

    Great insight, mostly not understood by the masses. Meaningful analysis of unstructured data always requires the development of a semantic framework, as you describe, just to begin making sense of the content. Linking those sentiments, topics, etc. to some actionable insight is then a second step. For example, I recently worked on a project aimed at building mobile audience content preference communities. The two-step process, similar to what you describe, was: 1. the creation of a relevant taxonomy of keywords and concepts that allows the classification of mobile usage/browsing behavior into interest categories; and 2. the correlation of interest categories with subsequent events of interest, such as mobile purchasing.

    Scott, http://www.scottradcliffe.com

  • Tj Vogel

    I agree with you in spades, Bill.

    Even my esteemed colleague, Dr. Tom Adi, uses the sounds in the constituent phonemes of words to build symbolic links, a map as it were, of structured relationships between the sounds of words and their meaning.
    http://commonsensical.wordpress.com/adis-semantic-theory/
    His “Adi Theory of Cognition and Emotions” still models the cognitive process as if it were structured. And despite my own experience with his outputs being orders of magnitude more predictive of human behavior and intent, the real problem with all unstructured data is that the differing units render any numeric understanding system I know of moot in the context of building unstructured relationships between these unstructured upstream processes.

    Perhaps the word “structure” is the real problem, and the subtle, nuanced difference between process and product is what defies our understanding right now. It’s rather quantal, eh? Thoughts are merely a complex network of non-random synaptic outputs firing in specific patterns until action is taken on them in the physical world in the form of “an actual human behavior”.

    They then instantly become past behaviors that, once actualized as such, we can then catalog, study, and even come to predict with varying levels of uncertainty!

    Whenever someone mentions fingerprint analysis, I still remind them that no one has ever undertaken a study to prove that no two people have the same fingerprints. It is assumed that “enough” similarity is “enough”.

    Alfred Korzybski sought just such a system, and despite his fabulous work on {Â} vs. {not-Â} reasoning, he never realized a system of symbolism that transcended what even today we wrestle with as structured vs. non-structured! “Science and Sanity: An Introduction to Non-Aristotelian Systems and General Semantics”


  • Bob Boeri

    Really don’t like the near-universal phrase “unstructured data.” If it were truly unstructured (truly random, like smoke particles) then there would be no information content. A better phrase would be “subtly structured,” or even “less structured,” since the information continuum has “highly structured” at one end, XML in the middle perhaps, and office documents etc. at the less-structured end.

  • Emmett

    Bill, great insight. But one that raises the question: will existing lexicons (word dictionaries) suffice for the more robust unstructured data sources (streaming conversations)? When we include more and more phrases and rely less on single words, we will need more intelligent parsing (sentient logic, a.k.a. AI).
    How do you see this progressing?

  • Alistair Sykes

    Maybe we can help: http://www.semantic-evolution.com – come and talk to us.

  • http://twitter.com/prakash_bhanu Bhanu Prakash

    Bill, this is a great article. Thank you for writing it. What about graph data and document data? Do we also convert them to a structured format and analyse them? I am guessing there must be some tools to analyse these formats, but are they generic enough, or are they tailored to specific needs?

    Bhanu Prakash

  • http://twitter.com/s_Daniel s.Daniel

    “Why Nobody Is Actually Analyzing Unstructured Data”
    I think that the statement is only correct when thinking of unstructured data as a bunch of random bits. Since the word ‘data’ is included, I would argue that people do analyze such data, just maybe little in the BI area. The reason is that this kind of analysis is very different from the usual statistical approach and requires different tools, like neural networks, which may not be available as a standard toolset yet.

    Simple Example:
    Structured data: “1234567890”
    Less structured data: “6172839405”

    It’s basically the same data and your brain is able to find the pattern within the less structured data without first setting up a rule how to extract parts.

    Another example: In science terabytes of RNA sequences are being analyzed to find patterns without knowing what exactly to look for.

    Here is another example how people work with what I would understand as more or less unstructured data: http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html?_r=2&hp&


    So yes, reducing the problem to smaller chunks and better-organized sets of data is often possible, and then probably a good idea. But it’s not necessarily always the only way to look at data analysis.