{"id":1121122,"date":"2024-01-18T18:09:22","date_gmt":"2024-01-18T23:09:22","guid":{"rendered":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/uncategorized\/introducing-aspire-for-selective-prediction-in-llms-google-research-blog-google-research\/"},"modified":"2024-01-18T18:09:22","modified_gmt":"2024-01-18T23:09:22","slug":"introducing-aspire-for-selective-prediction-in-llms-google-research-blog-google-research","status":"publish","type":"post","link":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/google\/introducing-aspire-for-selective-prediction-in-llms-google-research-blog-google-research\/","title":{"rendered":"Introducing ASPIRE for selective prediction in LLMs  Google Research Blog &#8211; Google Research"},"content":{"rendered":"<p><p>Posted by Jiefeng Chen, Student  Researcher, and Jinsung Yoon, Research Scientist, Cloud AI  Team   <\/p>\n<p>    In the fast-evolving landscape of artificial intelligence,    large language models (LLMs) have revolutionized the way we    interact with machines, pushing the boundaries of natural    language understanding and generation to unprecedented heights.    Yet, the leap into high-stakes decision-making applications    remains a chasm too wide, primarily due to the inherent    uncertainty of model predictions. Traditional LLMs generate    responses recursively, yet they lack an intrinsic mechanism to    assign a confidence score to these responses. Although one can    derive a confidence score by summing up the probabilities of    individual tokens in the sequence, traditional approaches    typically fall short in reliably distinguishing between correct    and incorrect answers. But what if LLMs could gauge their own    confidence and only make predictions when they're sure?  <\/p>\n<p>        Selective prediction aims to do this by enabling LLMs to    output an answer along with a selection score, which indicates    the probability that the answer is correct. With selective    prediction, one can better understand the reliability of LLMs    deployed in a variety of applications. Prior research, such as    semantic    uncertainty and self-evaluation, has    attempted to enable selective prediction in LLMs. A typical    approach is to use heuristic prompts like Is the proposed    answer True or False? to trigger self-evaluation in LLMs.    However, this approach may not work well on challenging    question    answering (QA) tasks.  <\/p>\n<p>    In \"Adaptation    with Self-Evaluation to Improve Selective Prediction in    LLMs\", presented at Findings of    EMNLP 2023, we introduce ASPIRE  a novel framework    meticulously designed to enhance the selective prediction    capabilities of LLMs. ASPIRE fine-tunes LLMs on QA tasks via    parameter-efficient    fine-tuning, and trains them to evaluate whether their    generated answers are correct. ASPIRE allows LLMs to output an    answer along with a confidence score for that answer. Our    experimental results demonstrate that ASPIRE significantly    outperforms state-of-the-art selective prediction methods on a    variety of QA datasets, such as the CoQA benchmark.  <\/p>\n<p>    Imagine teaching an LLM to not only answer questions but also    evaluate those answers  akin to a student verifying their    answers in the back of the textbook. That's the essence of    ASPIRE, which involves three stages: (1) task-specific tuning,    (2) answer sampling, and (3) self-evaluation learning.  
Task-specific tuning: ASPIRE performs task-specific tuning to train adaptable parameters (p) while freezing the LLM. Given a training dataset for a generative task, it fine-tunes the pre-trained LLM to improve its prediction performance. Towards this end, parameter-efficient tuning techniques (e.g., soft prompt tuning and LoRA) can be employed to adapt the pre-trained LLM to the task, given their effectiveness in obtaining strong generalization from small amounts of target-task data. Specifically, the LLM parameters (θ) are frozen and adaptable parameters (p) are added for fine-tuning; only p are updated to minimize the standard LLM training loss (e.g., cross-entropy). Such fine-tuning improves selective prediction performance because it not only improves prediction accuracy but also raises the likelihood of correct output sequences.

Answer sampling: After task-specific tuning, ASPIRE uses the LLM with the learned p to generate different answers for each training question, creating a dataset for self-evaluation learning. The goal is to generate output sequences that have a high likelihood, so we use beam search as the decoding algorithm, and we use the Rouge-L metric to determine whether a generated output sequence is correct.

Self-evaluation learning: After sampling high-likelihood outputs for each query, ASPIRE adds adaptable parameters (s) and fine-tunes only s for learning self-evaluation. Since output sequence generation depends only on θ and p, freezing θ and the learned p avoids changing the prediction behavior of the LLM while learning self-evaluation. We optimize s so that the adapted LLM can distinguish between correct and incorrect answers on its own.

In the proposed framework, p and s can be trained using any parameter-efficient tuning approach. In this work, we use soft prompt tuning, a simple yet effective mechanism for learning soft prompts that condition frozen language models to perform specific downstream tasks more effectively than traditional discrete text prompts. The approach rests on the recognition that if prompts capable of stimulating self-evaluation exist, it should be possible to discover them through soft prompt tuning combined with targeted training objectives.

After training p and s, we obtain the prediction for a query via beam search decoding. We then define a selection score that combines the likelihood of the generated answer with the learned self-evaluation score (i.e., the likelihood of the prediction being correct for the query) to make selective predictions. Minimal sketches of the three stages and of the selection score follow.
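First, a sketch of stage 1 (task-specific tuning) using the Hugging Face transformers and peft libraries. The model name, soft-prompt length, learning rate, and toy training example are illustrative assumptions, not the paper's exact configuration; the key point is that the base weights stay frozen while only the prompt parameters p are updated against the standard cross-entropy loss.

```python
# Stage 1 (sketch): train soft-prompt parameters p while the LLM stays frozen.
# Model name and hyperparameters are illustrative, not the paper's recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, TaskType, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b")

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=50,  # length of the soft prompt p
)
model = get_peft_model(base_model, peft_config)  # base weights (θ) are frozen

# Optimize only the trainable (soft prompt) parameters.
optimizer = torch.optim.AdamW(
    (prm for prm in model.parameters() if prm.requires_grad), lr=3e-2
)
batch = tokenizer("Q: Who wrote Hamlet?\nA: William Shakespeare",
                  return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])  # standard LM cross-entropy
outputs.loss.backward()
optimizer.step()
```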
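Next, a sketch of stage 2 (answer sampling): beam search produces several high-likelihood candidate answers per training question, and Rouge-L against the reference labels each one as correct or incorrect. The 0.7 cutoff is an illustrative threshold we chose for the sketch, not necessarily the paper's value.

```python
# Stage 2 (sketch): sample high-likelihood answers with beam search and label
# them via Rouge-L against the reference. The 0.7 cutoff is an assumption.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def sample_and_label(model, tokenizer, question, reference, num_beams=5):
    inputs = tokenizer(f"Q: {question}\nA:", return_tensors="pt")
    sequences = model.generate(
        **inputs,
        num_beams=num_beams,
        num_return_sequences=num_beams,  # keep several high-likelihood beams
        max_new_tokens=32,
    )
    labeled = []
    for seq in sequences:
        # Strip the prompt tokens, keep only the generated answer.
        answer = tokenizer.decode(seq[inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True).strip()
        f1 = scorer.score(reference, answer)["rougeL"].fmeasure
        labeled.append((answer, f1 >= 0.7))  # True = treated as correct
    return labeled
```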
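Finally, a sketch of stage 3 (self-evaluation learning) and of the combined selection score. The prompt format, the "correct"/"wrong" verdict tokens, and the combination rule with weight alpha are illustrative assumptions on our part; the paper defines the exact self-evaluation objective and scoring function.

```python
# Stage 3 (sketch): build training examples for the self-evaluation
# parameters s, then combine answer likelihood with the self-eval score.
# Prompt format, verdict tokens, and alpha are illustrative assumptions.
import math

def self_eval_training_example(question, answer, is_correct):
    # With θ and p frozen, s is tuned so the model emits the right verdict.
    prompt = (f"Q: {question}\nProposed answer: {answer}\n"
              f"Is the proposed answer correct?")
    target = " correct" if is_correct else " wrong"
    return prompt, target

def selection_score(answer_log_likelihood, p_correct, alpha=0.25):
    """Combine the answer's likelihood with the learned self-eval score.

    `p_correct` is the probability the self-evaluator assigns to the
    "correct" verdict; `alpha` trades off the two signals (assumed form).
    """
    return (1 - alpha) * math.exp(answer_log_likelihood) + alpha * p_correct
```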
To demonstrate ASPIRE's efficacy, we evaluate it across three question-answering datasets (CoQA, TriviaQA, and SQuAD) using various open pre-trained transformer (OPT) models. By training p with soft prompt tuning, we observed a substantial improvement in the LLMs' accuracy. For example, the OPT-2.7B model adapted with ASPIRE outperformed the larger, pre-trained OPT-30B model on the CoQA and SQuAD datasets. These results suggest that with suitable adaptations, smaller LLMs might match or even surpass the accuracy of larger models in some scenarios.

When delving into the computation of selection scores with fixed model predictions, ASPIRE achieved a higher AUROC score (the probability that a randomly chosen correct output sequence receives a higher selection score than a randomly chosen incorrect one) than baseline methods across all datasets. For example, on the CoQA benchmark, ASPIRE improves the AUROC from 51.3% to 80.3% relative to the baselines.

An intriguing pattern emerged from the TriviaQA evaluations. While the pre-trained OPT-30B model demonstrated higher baseline accuracy, its selective prediction performance did not improve significantly when traditional self-evaluation methods (Self-eval and P(True)) were applied. In contrast, the smaller OPT-2.7B model, when enhanced with ASPIRE, outperformed it in this respect. This discrepancy underscores a vital insight: larger LLMs using conventional self-evaluation techniques may be less effective at selective prediction than smaller, ASPIRE-enhanced models.
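For readers who want to reproduce this kind of evaluation, a short sketch of the AUROC computation over selection scores using scikit-learn; the score and label values below are made up for illustration.

```python
# Sketch: evaluating selective prediction via AUROC, i.e., the probability
# that a randomly chosen correct answer outscores a randomly chosen
# incorrect one. The toy scores and labels below are illustrative only.
from sklearn.metrics import roc_auc_score

selection_scores = [0.91, 0.35, 0.78, 0.52, 0.88, 0.20]  # toy model outputs
is_correct = [1, 0, 1, 0, 1, 0]  # e.g., Rouge-L-based correctness labels

print(f"AUROC: {roc_auc_score(is_correct, selection_scores):.3f}")
```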
<a href=\"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/google\/introducing-aspire-for-selective-prediction-in-llms-google-research-blog-google-research\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[345634],"tags":[],"class_list":["post-1121122","post","type-post","status-publish","format-standard","hentry","category-google"],"_links":{"self":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/posts\/1121122"}],"collection":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/comments?post=1121122"}],"version-history":[{"count":0,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/posts\/1121122\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/media?parent=1121122"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/categories?post=1121122"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/tags?post=1121122"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}