Knowledge Graphs Drastically Improve the Accuracy of Large Language Models on Complex Data, New Research Shows

Natural language interfaces to databases have long been a holy grail of both industry and academia. Recently, advances in large language models (LLMs) such as GPT-4 have brought this dream closer to reality. LLMs now have an impressive capacity to translate natural language questions into formal database queries.

However, fundamental questions remain about whether LLMs can live up to their promise for real-world enterprise applications:

  • How accurately can LLMs interpret complex business questions and translate them into correct SQL queries?
  • Do enterprise databases with hundreds of tables stump LLMs without additional context?
  • Can representing the database as a knowledge graph provide the missing semantics to boost accuracy?

A new research paper titled “A Benchmark to Understand the Role of Knowledge Graphs on Large Language Model’s Accuracy for Question Answering on Enterprise SQL Databases” (Sequeda et al. 2023) seeks to answer these questions in a rigorous, scientific manner.

The authors introduce a high-quality benchmark specifically designed to evaluate enterprise question answering accuracy over SQL databases. The benchmark comprises:

  1. A schema excerpt from a real-world insurance industry model containing 13 complex tables covering policies, claims, coverages and their intricate relationships.
  2. 43 natural language questions representative of diverse business queries ranging from simple lookups to advanced analytics.
  3. A knowledge graph representation of the database schema encoded as an OWL ontology and R2RML mappings.

Using this robust benchmark, the researchers perform experiments to quantify LLM accuracy in translating the business questions into SQL queries over the database schema, both with and without the knowledge graph enhancement.

The results demonstrate that incorporating knowledge graphs leads to accuracy improvements of more than 3x, establishing their value for enterprise applications of LLMs.

This paper introduces a first-of-its-kind enterprise benchmark that paves the way for continued research into combining the statistical power of large language models with structured knowledge graphs. The benchmark and methodology provide a solid framework for the community to build on.

The implications are far-reaching, both for advancing scientific progress and for enterprises looking to tap into their data using natural language and AI. Overall, this research represents an important milestone at the intersection of semantics, knowledge representation, and deep learning.

The Problem: LLMs Struggle with Complex Database Schemas

Existing benchmarks that evaluate LLMs for question answering over databases have some key limitations:

  1. The databases used are generally simplistic, with very few tables. Spider, for example, uses schemas with only a handful of tables.
  2. The table and column names are clean and intuitive, like “student_details” or “student_grade.”
  3. The data itself is relatively simple, covering domains like academic databases.

In contrast, real-world enterprise databases are far more complex:

  1. Schemas often comprise hundreds of tables spanning multiple interconnected domains.
  2. Tables and columns frequently have cryptic abbreviated names like “POL_COVG_DET_EXP_DT” rather than descriptive names.
  3. The data models intricate industry concepts and processes like insurance claims and policies.

When faced with such dense, complex schemas, LLMs struggle for multiple reasons:

  1. The huge number of tables makes it difficult for LLMs to determine which ones are relevant for answering a specific question.
  2. Opaque table/column names provide few clues about the underlying entities and relationships.
  3. Without deeper industry knowledge, the schema alone gives little context into what real-world concepts the data represents.
  4. Relationships between tables, like foreign key constraints, are not self-evident.

As a result, when presented with a business question over a messy industry schema, LLMs often fail to select the correct set of relevant tables and join them properly.

They may hallucinate incorrect column names or table relationships during query generation, producing completely inaccurate SQL queries. Without additional context, massive enterprise schemas easily confuse LLMs and lead to poor text-to-SQL performance.

This research hypothesized that equipping LLMs with structured knowledge of the business domain could fill these context gaps and significantly boost enterprise text-to-SQL accuracy. The proposed solution: augment LLMs with knowledge graphs.

To provide the missing business context, the researchers proposed enhancing LLM question answering with knowledge graphs. Knowledge graphs explicitly represent the concepts, properties, relationships and constraints in a domain. They can embed the semantics LLMs need to reason about complex data.

The researchers developed a benchmark to test whether knowledge graphs improve LLM accuracy on enterprise questions and schemas.

The Method: Leveraging a KG to Generate SQL Queries

The system takes a natural language question and an ontology as inputs. The ontology represents domain knowledge as classes, properties, and relationships. The question and ontology are passed to the large language model GPT-4 in a single zero-shot prompt, meaning GPT-4 generates a query based solely on the prompt, without any task-specific training or examples. GPT-4 analyzes the prompt and outputs a SPARQL query. SPARQL is a query language used to retrieve data from RDF graphs and knowledge bases.
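
To make this concrete, here is a minimal sketch of what such a zero-shot call might look like in Python, assuming the `openai` client library; the prompt wording and the function name are our illustration, not the paper's actual code.

```python
# Hypothetical sketch of the zero-shot step: the ontology and the
# question go into a single prompt, and the model is asked to return
# only a SPARQL query. Requires the `openai` package and an
# OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def question_to_sparql(question: str, ontology_turtle: str) -> str:
    prompt = (
        "Given the following OWL ontology in Turtle:\n"
        f"{ontology_turtle}\n\n"
        f"Write a SPARQL query that answers: {question}\n"
        "Return only the SPARQL query, with no explanation."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output aids reproducibility
    )
    return response.choices[0].message.content
```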

The generated SPARQL query contains semantic constructs like paths, classes, and constraints that reflect the intent of the original natural language question. It is executed against a knowledge graph hosted in an RDF database. The knowledge graph provides a virtual view of the data, enriched with the domain concepts defined in the ontology. Under the hood, the SPARQL query is translated, via the R2RML mappings, into an equivalent SQL query over the relational database. The SQL query contains the necessary joins, constraints, and functions to retrieve the answer to the original question from the relational tables.
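
As a self-contained illustration of the execution step, the sketch below runs a SPARQL query with rdflib over a tiny in-memory graph. The paper's system queries a virtual RDF view of the SQL database; the class and property names here (in:Claim, in:status) are invented for the example.

```python
# Toy stand-in for the SPARQL execution step: a two-claim in-memory
# graph with made-up terms is enough to show the mechanics.
from rdflib import Graph

g = Graph()
g.parse(data="""
    @prefix in: <http://example.com/insurance#> .
    in:claim1 a in:Claim ; in:status "OPEN" .
    in:claim2 a in:Claim ; in:status "CLOSED" .
""", format="turtle")

results = g.query("""
    PREFIX in: <http://example.com/insurance#>
    SELECT ?claim WHERE { ?claim a in:Claim ; in:status "OPEN" . }
""")
for row in results:
    print(row.claim)  # -> http://example.com/insurance#claim1
```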

Finally, the SQL query is executed on the database and the results are returned in a tabular format as the answer to the natural language question. The system logs metadata about each question-answering attempt including the question text, generated query, timestamp, etc. in an RDF format.

Image from Sequeda et al., 2023.

The Benchmark: Mirroring Real-World Challenges

  1. Select enterprise SQL schema — The researchers chose a real-world industry data model called the Property and Casualty schema from the Object Management Group. This schema covers complex insurance domain concepts using 13 interconnected tables.
  2. Extract schema subset — To keep the scope focused, they extracted a relevant subset of tables from the full insurance schema containing key entities like claim, policy, coverage.
  3. Generate sample data — Sample data rows were generated for each table, creating a populated database instantiation.
  4. Develop natural language questions — 43 natural language questions were authored to cover typical business queries of varying complexity, from simple retrievals to analytical KPIs.
  5. Classify questions — The questions were categorized based on conceptual complexity and number of tables required, to span easy and challenging queries.
  6. Create ground truth SQL queries — For each question, the corresponding correct SQL query was written by a human expert against the database schema.
  7. Develop ontology — An OWL ontology was built modeling key insurance entities and relationships to represent the domain semantics.
  8. Define R2RML mappings — Mappings were created between the SQL schema constructs and ontology concepts using the R2RML language (a minimal example appears below).
  9. Generate knowledge graph — The ontology and mappings were used to automatically generate a knowledge graph representation of the sample database.
  10. Load knowledge graph — The knowledge graph was hosted in an RDF database that could be queried using SPARQL.

This rigorous process resulted in a robust benchmark containing realistic enterprise data, queries, schema, and knowledge model. It effectively mirrors real-world challenges faced by businesses aiming to leverage natural language and AI for data access.
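
For a flavor of what step 8 involves, here is a minimal R2RML-style mapping parsed with rdflib. The table, column, and ontology names (CLAIM, CLM_OPN_DT, in:Claim) are invented for illustration and are not the benchmark's actual mappings.

```python
# Hypothetical R2RML fragment: rows of a CLAIM table become in:Claim
# instances, and the cryptic CLM_OPN_DT column surfaces as the
# readable property in:claimOpenDate. rdflib parses the mapping as
# ordinary RDF.
from rdflib import Graph

R2RML_TTL = """
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix in: <http://example.com/insurance#> .

<http://example.com/mapping/ClaimMap>
    rr:logicalTable [ rr:tableName "CLAIM" ] ;
    rr:subjectMap [
        rr:template "http://example.com/claim/{CLAIM_ID}" ;
        rr:class in:Claim
    ] ;
    rr:predicateObjectMap [
        rr:predicate in:claimOpenDate ;
        rr:objectMap [ rr:column "CLM_OPN_DT" ]
    ] .
"""

g = Graph()
g.parse(data=R2RML_TTL, format="turtle")
print(f"parsed {len(g)} mapping triples")
```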

Experimental Results: Knowledge Graphs Provide Significant Accuracy Gains

  1. Configure GPT-4 model — The researchers used GPT-4 for query generation, tuning generation parameters such as max tokens and temperature for consistent results.
  2. Design prompts — Simple zero-shot prompts were created providing either the SQL schema or ontology alongside the natural language question as context to GPT-4.
  3. Generate SQL queries — GPT-4 was prompted to produce an SQL query representing each question over the database schema.
  4. Execute SQL queries — The generated SQL queries were programmatically executed over the sample insurance database.
  5. Evaluate SQL accuracy — The rows returned by the GPT-4 generated SQL query were compared to the ground truth rows from the human-authored SQL query to quantify accuracy (a toy version of this check is sketched after this list).
  6. Generate SPARQL queries — Similarly, GPT-4 was prompted to generate SPARQL queries against the knowledge graph for each question.
  7. Execute SPARQL queries — These were executed over the knowledge graph representation of the database.
  8. Evaluate SPARQL accuracy — Accuracy was computed by comparing result rows to the ground truth answers.
  9. Compare approaches — SQL and SPARQL accuracies were aggregated across questions and compared.

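As referenced in step 5, the evaluation boils down to execution accuracy: the rows a generated query returns are compared against the rows the ground-truth query returns. A toy version of that check, under the assumption that row order does not matter, could look like this:

```python
# Toy execution-accuracy check (our illustration, not the paper's
# exact metric): a generated query counts as correct when its result
# rows match the ground-truth rows as multisets, ignoring row order.
from collections import Counter

def rows_match(generated, ground_truth):
    return Counter(map(tuple, generated)) == Counter(map(tuple, ground_truth))

truth = [("POL-1", 1200.0), ("POL-2", 850.0)]
generated = [("POL-2", 850.0), ("POL-1", 1200.0)]  # same rows, different order
print(rows_match(generated, truth))  # True
```
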
The final results showed:

  • For simple queries:
    • SQL accuracy was 25.5% for low complexity questions over low complexity schemas.
    • SPARQL accuracy was 71.1% for the same simple queries, 2.8X higher than SQL.

The knowledge graph provided significant accuracy gains even for basic retrieval-style questions.

  • For complex queries:
    • SQL accuracy was 37.4% for high complexity questions over simple schemas.
    • SPARQL accuracy was 66.9% for the same queries, still 1.8X higher than SQL.
  • For complex schemas:
    • For low complexity questions on complex schemas, SQL accuracy plummeted to 0%, while SPARQL retained 35.7% accuracy thanks to the knowledge graph context.
    • For high complexity questions over complex schemas, SQL accuracy was again 0%, while SPARQL still achieved roughly 38% accuracy thanks to the semantic assistance.

  • Key takeaways:
    • Even for simple queries, the knowledge graph boosted accuracy substantially.
    • The benefits were even more prominent for complex, multi-table analytical queries.
    • For schema complexity, SQL accuracy degraded rapidly without knowledge graph context.
    • SPARQL maintained higher accuracies by leveraging the ontology relationships.

The knowledge graph’s semantic enrichment consistently improved accuracy across both query complexity and schema complexity dimensions. The gap between SQL and SPARQL accuracy grew wider as the queries and schema got more sophisticated. This highlights the value of semantic knowledge in enhancing LLM reasoning, especially for challenging enterprise workloads.

Image from Sequeda et al., 2023.

Discussion: Exploring Alternative Techniques

The researchers used a knowledge graph approach to improve question-answering accuracy over an enterprise database. But there are other techniques they could have tried to see if they work better or complement the knowledge graph method:

  1. Rather than a knowledge graph, an alternative approach could represent the database content as numerical vectors (embeddings) in a vector database and retrieve relevant context from it at query time. This would allow a direct comparison between two of the most prominent retrieval-augmented generation (RAG) methodologies, highlighting their respective strengths and limitations (see the toy sketch after this list).
  2. The knowledge graph itself could be enhanced with embeddings. The concepts and relationships would get vector representations in addition to their symbolic semantics. This hybrid approach of symbols and vectors could improve accuracy.
  3. Rather than simplifying the ontology in the prompt to GPT-4, they could have queried the full ontology in a graph database. This might improve accuracy but slow down the system. A tradeoff could be analyzed.
  4. Since full ontologies can get very large, they could be modularized into smaller subdomains, with a multi-stage process that queries the most relevant subdomain ontology first before generating the full query.
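
For point 1, here is a toy sketch of the vector-database idea: embed short table descriptions, then pick prompt context by cosine similarity to the question. A real system would use a learned embedding model and a vector store; a bag-of-words vector stands in here so the example runs on its own, and the table names and descriptions are invented.

```python
# Bag-of-words stand-in for embedding-based table retrieval: rank
# candidate tables by cosine similarity to the question, then feed
# the top matches to the LLM as context.
from collections import Counter
import math

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

tables = {
    "CLAIM": "insurance claim records with open and close dates",
    "POLICY": "insurance policy records with effective and expiration dates",
    "AGENT": "selling agents and their license numbers",
}

question = "Which claims are still open on active policies?"
q_vec = embed(question)
ranked = sorted(tables, key=lambda t: cosine(q_vec, embed(tables[t])), reverse=True)
print(ranked)  # most relevant tables first
```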

Additional experiments could analyze different representation techniques like vectors, incorporate both symbolic and sub-symbolic knowledge, query larger ontologies, and use multi-stage querying. This could provide further insights into optimizing natural language interfaces to enterprise data.

Key Takeaways

This rigorous research provides strong evidence that equipping large language models with knowledge graphs substantially boosts their ability to answer natural language questions over enterprise databases.

It provides a blueprint for how enterprises can build production-ready semantic search and query systems using an LLM + knowledge graph architecture.

Specifically:

  • Investing in high-quality knowledge graphs is vital for achieving usable accuracy.
  • Knowledge graphs provide essential business context missing in raw database schemas.
  • Accuracy gains are especially significant on complex real-world schemas.

To leverage large language models and transform data access, enterprises should make knowledge graphs a first-class citizen within their data management platforms.

This paper introduces an invaluable benchmark for the research community to advance the state of semantic data access. It sets a strong foundation for integrating deep learning with structured knowledge.

The commercial implications are also profound. Any enterprise looking to stay competitive must consider leveraging knowledge graphs and LLMs to unlock the value hidden within their data.

Originally published in Artificial Intelligence in Plain English.