The Lanza Approach to Letter Analytics to Quickly Identify Text Analytic Deviations
By Rich Lanza, Sep 24, 2015
Updated February 2016
Back in 2014, the audit and fraud detection community set out to create a comprehensive list of key words or rather “bad phrases” that highlight potential issues in accounting data.
This initiative, led by AuditNet®, identified phrases such as “facilitation payments” or “plug revenue” that generally correlated with fraud or with noncompliance with key regulations around corruption and anti-money laundering. Many organizations that maintained such phrase lists realized the lists had proprietary value and therefore kept them private. To assist the community at large, we embarked on a survey that led to a database of over 4,000 unique phrases, updated with input from over 500 survey responses.
Once we had 4,000 phrases, we were presented with a data problem: How can we search for thousands of “key words” in data sets and make sense of the results in a swift fashion? Many companies were limiting their searches to 50 to 100 words because looking for any more became a time-consuming exercise. They essentially needed to count occurrences of the listed key words in different time frames, compare year-over-year trends, and then select transactions with concerning patterns for additional review.
It was at that time that I discovered the trick to reducing analysis time in key word counting and, furthermore, in any word counting. You simply need to summarize the results by the first and last letters to quickly gain analytic coverage of the population. Instead of looking at a population of words and asking “What’s the Word?”, we can now find word deviations faster by asking “What’s the Letter?”
This is not the first time such a technique has been used. The frequency of letters in text has long been used to decipher codes, most famously simple substitution ciphers such as the Caesar cipher (named after Julius Caesar). Linguists use letter frequency analysis as an elementary technique for identifying a given language based on its characteristic letter distribution. The focus of this new approach is to use letter frequency analysis to quickly identify deviations in letters, which act as pointers to the words that vary over time.
Therefore, the Lanza Approach to Letter Analytics (“LALA”)™ focuses on identifying word deviations swiftly by using letter frequency rates of the English language and prior-period letter occurrences as expected benchmarks when analyzing the data set at hand. It primarily focuses on the following four measures, given their consistent nature over time:
- First letter (26 letters)
- Last letter (26 letters)
- First two letters (702 combinations)
- Last two letters (702 combinations)
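The four measures above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `letter_profile`, the tokenizing rule (runs of the letters a to z), and the sample text are all assumptions made for the example. One possible reading of the 702 figure, also an assumption, is 26 single letters (for one-letter words) plus 26 × 26 = 676 two-letter pairs.

```python
import re
from collections import Counter

def letter_profile(text):
    """Tally first/last letter and first/last two letters for each word.

    Hypothetical helper for illustration. Words are taken to be runs of
    a-z after lowercasing, so "don't" splits into "don" and "t"; a real
    analysis may need a different tokenizing rule.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return {
        "first": Counter(w[0] for w in words),
        "last": Counter(w[-1] for w in words),
        # One-letter words contribute a single-letter key here.
        "first_two": Counter(w[:2] for w in words),
        "last_two": Counter(w[-2:] for w in words),
    }

profile = letter_profile("Plug revenue entry; facilitation payments to vendor")
print(profile["first"].most_common(3))
```

However large the input, each of the four tallies has a fixed maximum number of keys, which is what keeps the summary small enough to review by eye.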
Using LALA, any population of word data (from hundreds to billions of words) is summarized down to simple patterns of 26 single letters and 702 letter combinations, allowing for relative analysis of one- and two-letter deviations, respectively. Once I realized the usefulness of the technique, I began using LALA on various analytic fronts, including:
- Reviewing my own emails for new or increasing word usage over time to understand what is trending upward and downward
- Looking for personal information that may be falsified in company masterfile data (e.g., XX as the first two letters in first name or address data fields)
- Assessing changes in financial description fields, with primary focus on journal entry descriptions and travel & entertainment line description fields
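For the masterfile case in the list above, one way to move from a suspicious letter pair back to the underlying records is to group records by the first two letters of a field. The sketch below is only an illustration: `records_by_prefix`, the `first_name` field, and the sample vendor rows are hypothetical and not taken from any real data set.

```python
from collections import defaultdict

def records_by_prefix(records, field, length=2):
    """Group records by the first `length` characters of one field.

    Hypothetical helper: values are lowercased and stripped; records with
    an empty or missing field are skipped.
    """
    groups = defaultdict(list)
    for record in records:
        value = str(record.get(field, "")).strip().lower()
        if value:
            groups[value[:length]].append(record)
    return dict(groups)

# Made-up rows for illustration only.
vendors = [
    {"id": 1, "first_name": "Maria"},
    {"id": 2, "first_name": "XXtest"},
    {"id": 3, "first_name": "Mark"},
]
groups = records_by_prefix(vendors, "first_name")
# Pull the records behind the "xx" prefix for review.
print([r["id"] for r in groups.get("xx", [])])
```

Grouping rather than just counting keeps the link from each letter pair to the records that produced it, so a reviewer can go straight from an odd prefix to the entries worth a closer look.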
In each case, I realized that the word populations displayed a fingerprint, or rather a “letter-print,” over time. While deviations happened at times, they were few in number, especially when compared with previous-period data for the population at hand. This led me to the next realization: letter pattern usage should not vary greatly over time for the English language as a whole. So I set out to identify a data set of all such words and their change over time.
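A period-over-period comparison of a “letter-print” could look like the following sketch. It assumes the words for each period have already been extracted into lists, and it compares each first letter's share of the population rather than raw counts, so that periods of different size stay comparable. The function names and the sample words are hypothetical.

```python
from collections import Counter

def first_letter_shares(words):
    """Return each first letter's share of all words (0.0 to 1.0)."""
    counts = Counter(w[0].lower() for w in words if w and w[0].isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: n / total for letter, n in counts.items()}

def share_changes(prior_words, current_words):
    """List (letter, change in share), largest absolute change first."""
    prior = first_letter_shares(prior_words)
    current = first_letter_shares(current_words)
    letters = set(prior) | set(current)
    changes = {l: current.get(l, 0.0) - prior.get(l, 0.0) for l in letters}
    return sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Made-up expense description words for two periods.
prior = ["travel", "taxi", "meals", "hotel"]
current = ["travel", "gift", "gift", "hotel"]
for letter, delta in share_changes(prior, current)[:3]:
    print(f"{letter}: {delta:+.2f}")
```

In this toy data the letter "g" shows the largest shift, which points back to the new word "gift" in the current period.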
Fortunately, I found such a benchmark in the Corpus of Contemporary American English (COCA). The COCA is the largest available corpus of English, and the only large and balanced corpus of American English. Because of its design, it is perhaps the only corpus of English that is suitable for looking at current, ongoing changes in the English language. The corpus was created by Mark Davies, Professor of Linguistics at Brigham Young University, and contains more than 450 million words of text, equally divided among spoken, fiction, popular magazines, newspapers, and academic texts.
After analyzing the COCA data for the years 1990 through 2011 (Chart 1), I realized the letter rankings rarely changed over time and, when they did, changed by only one or maybe two positions in rank. Please note that rank was based on a summary of first-letter and last-letter occurrences of the words in the COCA, with top-ranked letters having the most occurrences. With the COCA data analyzed by letter frequency, I was able to prove not only that changes in language usage over time are not profound, but also that the COCA can be used as a benchmark against any data set to highlight variations from what is “expected.” Chart 1 below shows the first letter rankings for the English language over the 21-year period, with the average of all rankings on the left-hand side of the chart.
Chart 1: Ranking of First Letters of COCA Words (1990 to 2011)
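The ranking comparison described above can be sketched as follows. The counts are invented for the example and are not COCA figures. The default `tolerance` of two positions simply mirrors the observation that ranks rarely moved by more than one or two places, and the function names are hypothetical.

```python
from collections import Counter

def rank_letters(counts):
    """Map each letter to its rank (1 = most occurrences).

    Ties are broken alphabetically so the ranking is deterministic.
    """
    ordered = sorted(counts, key=lambda letter: (-counts[letter], letter))
    return {letter: pos for pos, letter in enumerate(ordered, start=1)}

def rank_shifts(benchmark_counts, observed_counts, tolerance=2):
    """Return letters whose rank differs from the benchmark by more
    than `tolerance` positions (negative = moved up in the ranking)."""
    bench = rank_letters(benchmark_counts)
    seen = rank_letters(observed_counts)
    return {
        letter: seen[letter] - bench[letter]
        for letter in bench
        if letter in seen and abs(seen[letter] - bench[letter]) > tolerance
    }

# Made-up first-letter counts, for illustration only.
benchmark = Counter({"t": 50, "a": 40, "s": 30, "c": 20, "x": 1})
observed = Counter({"t": 48, "a": 41, "s": 29, "c": 22, "x": 60})
print(rank_shifts(benchmark, observed))  # {'x': -4}
```

Here only "x" is flagged: it jumps from last place in the benchmark to first place in the observed data, while every other letter moves by a single position.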
The Lanza Approach to Letter Analytics (“LALA”)™, as explained in this article, will be explored further in future articles to highlight the value of using the relative change in letter patterns, or rather word patterns, over time in a variety of business applications.
The full research brief, just released by the International Institute for Analytics, is available to IIA clients; non-clients can request access. This research brief provides additional supporting analysis using LALA and letter frequency examples to apply to your business data.
About the author
Rich Lanza, CPA, CFE, CGMA (www.richlanza.com) has 25 years of audit and fraud detection experience with specialization in data analytics and cost recovery efforts. He is currently a Director of Data Analytics with Grant Thornton, LLP, where he is weaving analytics into their audit and advisory practices. Rich wrote the first book on practical applications of using data analytics in an audit environment, titled 101 ACL Applications: A Toolkit for Today’s Auditor, in addition to writing over 19 publications and numerous articles. Rich is proficient in the practical use of analytic software including ACL, ActiveData for Excel, Arbutus Analyzer, IDEA, and TeamMate Analytics, as well as auditing with Microsoft Excel techniques. Rich has received an award from the Association of Certified Fraud Examiners for his research on proactive fraud reporting. He is also a regular presenter for CFO.com, the Institute of Internal Auditors, Association of Certified Fraud Examiners, AuditNet®, Lorman, and Fraud Resource Net LLC. Rich consults with companies ranging in size from $30 million to $100 billion and has helped them find value through the use of technology and recovery auditing.