Natural language query (NLQ) empowers business users to analyze data by asking questions in everyday language. Modern business intelligence (BI) tools offer NLQ functionality through proprietary natural language processing (NLP) models.
Data analysis with natural language expressions has become vital for accessible and user-friendly data analytics. Modern business intelligence (BI) tools, such as Power BI Q&A, Tableau Ask Data, and AWS QuickSight Q, now include NLQ features for self-service BI. Each tool may vary slightly in approach, but their features are comparable and have improved with NLP model advancements.
Typically, NLQ features these tools offer include enable users retrieve specific subsets of data, perform complex aggregations, compare time periods, filter and sort results and view trends.
Some BI tools also include features like autocompletion of user questions, synonym curation, and visualization type recommendations to further enhance user experience. As NLQ capabilities advance, users find it easier to extract insights and make data-driven decisions.
LLM use for data and analytics
Since ChatGPT's release by OpenAI in late 2022, generative AI and LLMs have garnered significant interest and investment for diverse use cases. This paper covers a limited array of applications and operationalization techniques, but you can find comprehensive materials on these topics from Infosys and other sources.
As organizations increasingly leverage LLMs to gain insights from their data assets, BI vendors also intensify their efforts to incorporate LLMs into their data and analytics tools. One of the key focus areas for these efforts is enabling NLQ by integrating LLMs, instead of proprietary NLP models. However, the potential use cases extend to various areas, including data sourcing, synthetic data generation, metadata enrichment, and guided analytics.
Another related use case involves generating or augmenting SQL code using LLMs for software or data engineering purposes, directly with tools like ChatGPT or with software engineering tools like GitHub Copilot. Some data platforms have already integrated Copilot for this purpose.
Text-to-SQL with LLMs
Text-to-SQL with LLMs for NLQ is an evolving research field, with academic papers and implementation-oriented materials generated continuously. A key focus is designing comprehensive NLQ solutions using various techniques and tools, with the current state-of-the-art encompassing the following approaches:
Another crucial area is enhanced natural language to SQL conversion through context learning, fine-tuning, and in-context techniques to enhance NLQ system performance and accuracy, enabling more precise and reliable data analysis.
Our experimentation overview
Our experiment focused on integrating LLMs to enable NLQ capabilities for data consumers in a modern data environment, independent of any specific BI tool. We tested this in a data landscape inspired by a real client’s needs. We introduced a conversational UI component and an application to convert user queries into SQL using LLM integration. This implementation follows the “Embed LLM As-is into an Application Frame” pattern from Infosys’ LLM pattern repository. Figure 1 presents a high-level architecture and describes the key components used in our experiment.
Figure 1. Solution design

Source: Infosys
| Component | Specifics | 
|---|---|
| Cloud Data Warehouse - Snowflake | Snowflake is a cloud-native unified data platform that offers data warehousing and data lake capabilities, along with various tools for different data processing and data science scenarios. For this paper, we used a trial version of Snowflake and TPCH dataset, publicly available from TPC and provided within Snowflake as a sample dataset. | 
| ETL/ELT – dbt | dbt (data build tool) is an open-source command-line tool that enables data engineers and analysts to transform, evaluate, and manage data within their data warehouses. It provides a development environment and a set of best practices to build data transformation pipelines. The dbt schema includes relevant information, including table names, columns, and relationships, to guide query generation within the prompts. | 
| Web UI – Streamlit | Streamlit is an open-source Python library to create custom web apps for machine learning and data science applications. | 
| LLM – OPENAI GTP 3.5 text-davinci-003, GPT3.5-turbo | GPT 3.5 models are descendants to OPENAI Codex, which was the original model for code completion tasks. GPT3.5-turbo is a powerful model for code completion now. | 
| Data application | This application serves as the principal component and performs the following steps: 
 | 
| LLM library – LangChain | LangChain is a library offering different tools for LLM application development. SQL database agent is one of the tools used in text-to-SQL scenarios. In this paper, we used prompt templates. | 
Step 1: Import everything

Step 2: Import Open API using LangChain LLMs:

Step 3: Read DBT schema and create a prompt by passing user question:

Step 4: Get a response from OpenAI model by executing through LangChain:

For real-time use cases, LangChain Vectorstore gathers the DBT schema and enables look-up activities to retrieve schema details based on user questions, facilitating prompt creation.
DBT schema.yml appears as follows:

To display data graphically, the SQL query output from the enterprise warehouse should include at least two columns. Utilize Python's Matplotlib for visual representation.
| User question | Model | LLM response | Comments | 
|---|---|---|---|
| The maximum ship mode for product type PROMO ANODIZED NICKEL | text-davinci-003 Gpt-3.5-turbo | SELECT P_TYPE, L_SHIPMODE, MAX(L_SHIPMODE) FROM lineitem INNER JOIN part ON lineitem.L_PARTKEY = part.P_PARTKEY WHERE P_TYPE = 'PROMO ANODIZED NICKEL' GROUP BY P_TYPE, L_SHIPMODE; SELECT MAX(L_SHIPMODE) AS Maximum_Ship_Mode FROM lineitem JOIN part ON lineitem.L_PARTKEY = part.P_PARTKEY WHERE part.P_TYPE = 'PROMO ANODIZED NICKEL'; | Davinci model’s response is not correct. However, when we changed the question (next question) it returns the correct SQL code. | 
| What is the maximum ship mode for product type PROMO ANODIZED NICKEL | Both models | SELECT MAX(L_SHIPMODE) AS Maximum_Ship_Mode FROM lineitem JOIN part ON lineitem.L_PARTKEY = part.P_PARTKEY WHERE part.P_TYPE = 'PROMO ANODIZED NICKEL'; | |
| List the number of open orders by market segment' | Text-davinci-003 Gpt-3.5-turbo | SELECT C_MKTSEGMENT, COUNT(*) AS NUM_OPEN_ORDERS FROM customer c INNER JOIN orders o ON c.C_CUSTKEY = o.O_CUSTKEY WHERE o.O_ORDERSTATUS = 'OPEN' GROUP BY C_MKTSEGMENT; SELECT C_MKTSEGMENT, COUNT(*) AS num_open_orders FROM orders o JOIN customer c ON o.O_CUSTKEY = c.C_CUSTKEY WHERE O_ORDERSTATUS = 'O' GROUP BY C_MKTSEGMENT; | Davinci model used the status code as ‘OPEN’ whereas gpt-3.5 used the correct code ‘O’. This may be due to the fact that gpt-3.5 has knowledge of data set as TPC website documents benchmark datasets in it is web site. in the next question we provided the correct code as a part of the context and Davinci returned the correct response. | 
| 'List the number of open orders by market segment. The code for open order status is "O"' | text-davinci-003 | SELECT C_MKTSEGMENT, COUNT(*) AS NUM_OPEN_ORDERS FROM customer c INNER JOIN orders o ON c.C_CUSTKEY = o.O_CUSTKEY WHERE o.O_ORDERSTATUS = 'O' GROUP BY C_MKTSEGMENT; | |
| How many vendors are there from India?' | text-davinci-003 Gpt-3.5-turbo | SELECT COUNT (DISTINCT S_SUPPKEY) FROM TPCH_SF1.SUPPLIER S INNER JOIN TPCH_SF1.NATION N ON S.S_NATIONKEY = N.N_NATIONKEY WHERE N.N_NAME = 'INDIA'; SELECT COUNT (*) AS total_vendors FROM supplier WHERE S_NATIONKEY IN ( SELECT N_NATIONKEY FROM nation WHERE N_NAME = 'INDIA'); | The schema we provided used the term “supplier” rather than “vendor” for table and column names. Both models used the correct table and column names. We also observed style differences between queries provided. | 
| What is country of the supplier for the SMALL ANODIZED TIN product type? | text-davinci-003 Gpt-3.5-turbo | SELECT N.N_NAME FROM nation N INNER JOIN supplier S ON N.N_NATIONKEY = S.S_NATIONKEY INNER JOIN partsupp PS ON S.S_SUPPKEY = PS.PS_SUPPKEY INNER JOIN part P ON PS.PS_PARTKEY = P.P_PARTKEY WHERE P.P_TYPE = 'SMALL ANODIZED TIN' SELECT n.N_NAME AS Country FROM supplier s JOIN nation n ON s.S_NATIONKEY = n.N_NATIONKEY JOIN partsupp ps ON s.S_SUPPKEY = ps.PS_SUPPKEY JOIN part p ON ps.PS_PARTKEY = p.P_PARTKEY WHERE p.P_TYPE = 'SMALL ANODIZED TIN' | Twenty-five countries should display for this question. But these queries return more than five thousand rows repeating countries multiple times. We had to add “Ensure each country is listed only once” to the question to get the correct list. | 
| List the supplier region, supplier country, name, ship mode, and the product cost for each supplier of the product 180002. | Gpt-3.5-turbo | SELECT s.S_NAME AS supplier_name, n.R_NAME AS supplier_region, n.N_NAME AS supplier_country, l.L_SHIPMODE AS ship_mode, ps.PS_SUPPLYCOST AS product_cost FROM supplier s JOIN nation n ON s.S_NATIONKEY = n.N_NATIONKEY JOIN partsupp ps ON s.S_SUPPKEY = ps.PS_SUPPKEY JOIN part p ON ps.PS_PARTKEY = p.P_PARTKEY JOIN lineitem l ON p.P_PARTKEY = l.L_PARTKEY WHERE p.P_PARTKEY = 180002 | This is an example of a hallucination. There is no region column with Name R_NAME in supplier table. The model made up a column name in the supplier table due to the way we asked the question. When we changed the question like in next row, we got the correct result. | 
| List the region, country, name, ship mode, and product cost for each supplier of the product 180002 | Gpt-3.5-turbo | SELECT r.R_NAME, n.N_NAME, s.S_NAME, l.L_SHIPMODE, ps.PS_SUPPLYCOST FROM region r JOIN nation n ON r.R_REGIONKEY = n.N_REGIONKEY JOIN supplier s ON n.N_NATIONKEY = s.S_NATIONKEY JOIN partsupp ps ON s.S_SUPPKEY = ps.PS_SUPPKEY JOIN part p ON ps.PS_PARTKEY = p.P_PARTKEY JOIN lineitem l ON p.P_PARTKEY = l.L_PARTKEY WHERE p.P_PARTKEY = 180002 | 
Our experimental application, which utilizes LLMs to dynamically generate SQL code from user questions, performs satisfactorily as a conversational analytical agent across various evaluated scenarios. These findings affirm the potential of integrating generative AI models with enterprise data platforms, enabling business users to conduct data analysis independently, without SQL expertise or dependence on technology teams. Nonetheless, we've also observed instances where a solution relying solely on prompting general-purpose LLMs may underperform, especially within complex enterprise data platforms.
Large enterprises typically have diverse data environments, comprising various analytical data stores like data lakes and data warehouses. These stores have cloud or on-premises infrastructure, with modern or legacy technologies. These data stores might integrate with modern data catalogs or rely on technical data dictionaries specific to each data store. Given this complexity, users from different departments or line of business (LOB) often use specific business vocabulary and expect data analysis to adhere to specific business rules.
In this diverse environment, an NLQ solution must offer beyond SQL statement generation based on a schema. It should identify the right data store, interpret questions using specific data catalogs and terminologies, and construct queries that implement relevant business rules with the right SQL dialect for each data store. Relying solely on prompts for these tasks is neither scalable nor efficient. Instead, a more effective approach fine tunes a private model with knowledge of the specific data environment.
The effectiveness of fine-tuning versus in-context learning for adapting LLMs to specialized tasks is an active area of research. In a recent study, Mosbach et al. compared both approaches, outlining their advantages and disadvantages. Additionally, in the context of text-to-SQL tasks, Sun et al. discussed a specialized text-to-SQL model and compared fine-tuning with prompting for performance enhancement.
Future work for real-world NLQ scenarios should involve creating a specialized text-to-SQL model by fine-tuning general-purpose LLMs for a specific enterprise data environment. Fine-tuned models and specialized techniques for complex data environments develop more efficient NLQ systems that empower seamless and effective data analysis.
 
    To keep yourself updated on the latest technology and industry trends subscribe to the Infosys Knowledge Institute's publications
Count me in!