{"id":1067841,"date":"2024-03-02T02:39:05","date_gmt":"2024-03-02T07:39:05","guid":{"rendered":"https:\/\/www.immortalitymedicine.tv\/build-a-robust-text-to-sql-solution-generating-complex-queries-self-correcting-and-querying-diverse-data-sources-aws-blog\/"},"modified":"2024-08-18T11:39:53","modified_gmt":"2024-08-18T15:39:53","slug":"build-a-robust-text-to-sql-solution-generating-complex-queries-self-correcting-and-querying-diverse-data-sources-aws-blog","status":"publish","type":"post","link":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/machine-learning\/build-a-robust-text-to-sql-solution-generating-complex-queries-self-correcting-and-querying-diverse-data-sources-aws-blog.php","title":{"rendered":"Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources &#8230; &#8211; AWS Blog"},"content":{"rendered":"<p>    Structured Query Language (SQL) is a complex language that requires an understanding of databases and metadata. Today, generative AI can enable people without SQL knowledge to query data in natural language. This generative AI task is called text-to-SQL: it uses natural language processing (NLP) to convert text into semantically correct SQL queries. The solution in this post aims to bring enterprise analytics operations to the next level by shortening the path to your data using natural language.  <\/p>\n<p>    With the emergence of large language models (LLMs), NLP-based SQL generation has undergone a significant transformation. Demonstrating exceptional performance, LLMs are now capable of generating accurate SQL queries from natural language descriptions. However, challenges still remain. First, human language is inherently ambiguous and context-dependent, whereas SQL is precise, mathematical, and structured. This gap may result in inaccurate conversion of the user's needs into the SQL that's generated. 
Second, you might need to build text-to-SQL features for every database because data is often not stored in a single target. You may have to recreate the capability for every database to enable users with NLP-based SQL generation. Third, despite the broader adoption of centralized analytics solutions like data lakes and warehouses, complexity rises with different table names and other metadata that is required to create the SQL for the desired sources. Therefore, collecting comprehensive and high-quality metadata also remains a challenge. To learn more about text-to-SQL best practices and design patterns, see Generating value from enterprise data: Best practices for Text2SQL and generative AI.  <\/p>\n<p>    Our solution aims to address those challenges using Amazon Bedrock and AWS Analytics Services. We use Anthropic Claude v2.1 on Amazon Bedrock as our LLM. To address the challenges, our solution first incorporates the metadata of the data sources within the AWS Glue Data Catalog to increase the accuracy of the generated SQL query. The workflow also includes a final evaluation and correction loop, in case any SQL issues are identified by Amazon Athena, which is used downstream as the SQL engine. Athena also allows us to use a multitude of supported endpoints and connectors to cover a large set of data sources.  <\/p>\n<p>    After we walk through the steps to build the solution, we present the results of some test scenarios with varying SQL complexity levels. Finally, we discuss how straightforward it is to incorporate different data sources into your SQL queries.  <\/p>\n<p>    There are three critical components in our architecture: Retrieval Augmented Generation (RAG) with database metadata, a multi-step self-correction loop, and Athena as our SQL engine.  
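The metadata-grounded RAG step can be sketched as follows. This is a minimal illustration rather than the solution's actual code: the function names are ours, and the table dictionaries only mimic the shape of the AWS Glue GetTables response (TableList entries with Name, Description, and StorageDescriptor.Columns). In practice, you would fetch the metadata with boto3's Glue client and send the assembled prompt to Anthropic Claude v2.1 through the Amazon Bedrock runtime.

```python
# Sketch of the RAG step: turn Glue Data Catalog metadata into prompt context.
# Assumes Glue GetTables-shaped dictionaries; names here are illustrative.

def render_schema_context(tables):
    """Turn Glue-style table metadata into a prompt-friendly schema description."""
    lines = []
    for t in tables:
        cols = ", ".join(
            f"{c['Name']} ({c['Type']})" for c in t["StorageDescriptor"]["Columns"]
        )
        lines.append(f"Table {t['Name']}: {t.get('Description', '')} Columns: {cols}")
    return "\n".join(lines)

def build_sql_prompt(question, schema_context):
    """Combine the user question with the retrieved schema metadata."""
    return (
        "You are a SQL generator for Amazon Athena.\n"
        f"Schema:\n{schema_context}\n"
        f"Question: {question}\n"
        "Return only a valid Athena SQL query."
    )
```

With boto3, `glue.get_tables(DatabaseName="imdb")["TableList"]` would supply the `tables` argument, and the resulting prompt would go to the Bedrock `invoke_model` API.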
<\/p>\n<p>    We use the RAG method to retrieve the table descriptions and schema descriptions (columns) from the AWS Glue metastore to ensure that the request is related to the right table and datasets. In our solution, we built the individual steps to run a RAG framework with the AWS Glue Data Catalog for demonstration purposes. However, you can also use knowledge bases in Amazon Bedrock to build RAG solutions quickly.  <\/p>\n<p>    The multi-step component allows the LLM to correct the generated SQL query for accuracy. Here, the generated SQL is checked for syntax errors. We use Athena error messages to enrich our prompt for the LLM for more accurate and effective corrections in the generated SQL.  <\/p>\n<p>    You can consider the error messages occasionally coming from Athena as feedback. The cost implications of an error correction step are negligible compared to the value delivered. You can even include these corrective steps as supervised reinforcement learning examples to fine-tune your LLMs. However, we did not cover this flow in our post for simplicity.  <\/p>\n<p>    Note that there is always an inherent risk of inaccuracies, which naturally comes with generative AI solutions. Although Athena error messages are highly effective at mitigating this risk, you can add more controls and views, such as human feedback or example queries for fine-tuning, to further minimize such risks.  <\/p>\n<p>    Athena not only allows us to correct the SQL queries, but it also simplifies the overall problem for us because it serves as the hub, where the spokes are multiple data sources. Access management, SQL syntax, and more are all handled via Athena.  <\/p>\n<p>    The following diagram illustrates the solution architecture.  <\/p>\n<p>      Figure 1. The solution architecture and process flow.    
<\/p>\n<p>    The process flow includes the following steps:  <\/p>\n<p>    At this stage, the process is ready to receive the query in natural language. Steps 7-9 represent a correction loop, if applicable.  <\/p>\n<p>    For this post, you should complete the following prerequisites:  <\/p>\n<p>    You can use the following Jupyter notebook, which includes all the code snippets provided in this section, to build the solution. We recommend using Amazon SageMaker Studio to open this notebook with an ml.t3.medium instance with the Python 3 (Data Science) kernel. For instructions, refer to Train a Machine Learning Model. Complete the following steps to set up the solution:  <\/p>\n<p>    In this section, we run our solution with different example scenarios to test different complexity levels of SQL queries.  <\/p>\n<p>    To test our text-to-SQL solution, we use two datasets available from IMDb. Subsets of IMDb data are available for personal and non-commercial use. You can download the datasets and store them in Amazon Simple Storage Service (Amazon S3). You can use the following Spark SQL snippet to create tables in AWS Glue. For this example, we use title_ratings and title:  <\/p>\n<p>    In this scenario, our dataset is stored in an S3 bucket. Athena has an S3 connector that allows you to use Amazon S3 as a data source that can be queried.  <\/p>\n<p>    For our first query, we provide the input I am new to this. Can you help me see all the tables and columns in imdb schema?  <\/p>\n<p>    The following is the generated query:  <\/p>\n<p>    The following screenshot and code show our output.  <\/p>\n<p>    For our second query, we ask Show me all the title and details in US region whose rating is more than 9.5.  <\/p>\n<p>    The following is our generated query:  <\/p>\n<p>    The response is as follows.  <\/p>\n<p>    For our third query, we enter Great Response! 
Now show me all the original type titles having ratings more than 7.5 and not in the US region.  <\/p>\n<p>    The following query is generated:  <\/p>\n<p>    We get the following results.  <\/p>\n<p>    This scenario simulates a SQL query that has syntax issues. Here, the generated SQL will be self-corrected based on the response from Athena. In the following response, Athena gave a COLUMN_NOT_FOUND error and mentioned that table_description can't be resolved:  <\/p>\n<p>    To use the solution with other data sources, Athena handles the job for you. To do this, Athena uses data source connectors that can be used with federated queries. You can consider a connector as an extension of the Athena query engine. Pre-built Athena data source connectors exist for data sources like Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB (with MongoDB compatibility), and Amazon Relational Database Service (Amazon RDS), and JDBC-compliant relational data sources such as MySQL and PostgreSQL under the Apache 2.0 license. After you set up a connection to any data source, you can use the preceding code base to extend the solution. For more information, refer to Query any data source with Amazon Athena's new federated query.  <\/p>\n<p>    To clean up the resources, you can start by cleaning up your S3 bucket where the data resides. Unless your application invokes Amazon Bedrock, it will not incur any cost. For the sake of infrastructure management best practices, we recommend deleting the resources created in this demonstration.  <\/p>\n<p>    In this post, we presented a solution that allows you to use NLP to generate complex SQL queries with a variety of resources enabled by Athena. We also increased the accuracy of the generated SQL queries via a multi-step evaluation loop based on error messages from downstream processes. 
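The multi-step evaluation loop can be sketched as follows. This is a minimal sketch with illustrative names, not the solution's actual code: run_query stands in for an Athena execution call that raises with the engine's error message (such as COLUMN_NOT_FOUND), and regenerate_sql stands in for re-prompting the LLM with that message appended.

```python
# Sketch of the self-correction loop: execute the generated SQL, and on
# failure feed the Athena error message back to the LLM for a corrected query.
# run_query and regenerate_sql are illustrative injected callables.

def correct_and_execute(sql, run_query, regenerate_sql, max_attempts=3):
    """Run SQL; on failure, return the engine's error to the LLM and retry."""
    for attempt in range(max_attempts):
        try:
            return run_query(sql)
        except Exception as err:  # e.g. COLUMN_NOT_FOUND from Athena
            if attempt == max_attempts - 1:
                raise
            sql = regenerate_sql(sql, str(err))
```

In the actual workflow, run_query would wrap Athena's StartQueryExecution and GetQueryExecution calls, and regenerate_sql would invoke the model on Amazon Bedrock with the error-enriched prompt.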
Additionally, we used the metadata in the AWS Glue Data Catalog to identify the table names referenced in the query through the RAG framework. We then tested the solution in various realistic scenarios with different query complexity levels. Finally, we discussed how to apply this solution to different data sources supported by Athena.  <\/p>\n<p>    Amazon Bedrock is at the center of this solution, and it can help you build many generative AI applications. To get started with Amazon Bedrock, we recommend following the quick start in the following GitHub repo and familiarizing yourself with building generative AI applications. You can also try knowledge bases in Amazon Bedrock to build such RAG solutions quickly.  <\/p>\n<p>    Sanjeeb Panda is a Data and ML engineer at Amazon. With a background in AI\/ML, data science, and big data, Sanjeeb designs and develops innovative data and ML solutions that solve complex technical challenges and achieve strategic goals for global 3P sellers managing their businesses on Amazon. Outside of his work as a Data and ML engineer at Amazon, Sanjeeb Panda is an avid foodie and music enthusiast.  <\/p>\n<p>    Burak Gozluklu is a Principal AI\/ML Specialist Solutions Architect located in Boston, MA. He helps strategic customers adopt AWS technologies, and specifically generative AI solutions, to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a post-doc in system dynamics from MIT in Cambridge, MA. Burak is still a research affiliate at MIT. Burak is passionate about yoga and meditation.  
<\/p>\n<p>Excerpt from:<br \/>\n<a target=\"_blank\" href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/build-a-robust-text-to-sql-solution-generating-complex-queries-self-correcting-and-querying-diverse-data-sources\" title=\"Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources ... - AWS Blog\" rel=\"noopener\">Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources ... - AWS Blog<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p> Structured Query Language (SQL) is a complex language that requires an understanding of databases and metadata.  <a href=\"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/machine-learning\/build-a-robust-text-to-sql-solution-generating-complex-queries-self-correcting-and-querying-diverse-data-sources-aws-blog.php\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"limit_modified_date":"","last_modified_date":"","_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[1231415],"tags":[],"class_list":["post-1067841","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"modified_by":null,"_links":{"self":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts\/1067841"}],"collection":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-n
ews-blog\/wp-json\/wp\/v2\/comments?post=1067841"}],"version-history":[{"count":0,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts\/1067841\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/media?parent=1067841"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/categories?post=1067841"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/tags?post=1067841"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}