{"id":1075650,"date":"2024-02-22T02:36:53","date_gmt":"2024-02-22T07:36:53","guid":{"rendered":"https:\/\/www.immortalitymedicine.tv\/scale-ai-to-set-the-pentagons-path-for-testing-and-evaluating-large-language-models-defensescoop\/"},"modified":"2024-08-18T12:53:14","modified_gmt":"2024-08-18T16:53:14","slug":"scale-ai-to-set-the-pentagons-path-for-testing-and-evaluating-large-language-models-defensescoop","status":"publish","type":"post","link":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/ai\/scale-ai-to-set-the-pentagons-path-for-testing-and-evaluating-large-language-models-defensescoop.php","title":{"rendered":"Scale AI to set the Pentagon&#8217;s path for testing and evaluating large language models &#8211; DefenseScoop"},"content":{"rendered":"<p>    The Pentagon&#8217;s Chief Digital and Artificial Intelligence Office    (CDAO) tapped    Scale AI to produce a trustworthy means for testing and    evaluating large language models that can support, and    potentially disrupt, military planning and decision-making.  <\/p>\n<p>    According to a statement the San Francisco-based company shared    exclusively with DefenseScoop, the outcomes of this new    one-year contract will supply the CDAO with a framework to    deploy AI safely by measuring model performance, offering    real-time feedback for warfighters, and creating specialized    public sector evaluation sets to test AI models for military    support applications, such as organizing the findings from    after-action reports.  <\/p>\n<p>    Large language models and the overarching field of generative AI    include emerging technologies that can generate (convincing but    not always accurate) text, software code, images and other    media, based on prompts from humans.  <\/p>\n<p>    This rapidly evolving realm holds a lot of promise for the    Department of Defense, but also poses unknown and serious    potential challenges. 
Last year, Pentagon leadership launched    Task    Force Lima within the CDAO&#8217;s Algorithmic Warfare    Directorate to accelerate its components&#8217; grasp, assessment and    deployment of generative artificial intelligence.  <\/p>\n<p>    The department has long leaned on test-and-evaluation (T&E)    processes to assess and ensure its systems, platforms and    technologies perform in a safe and reliable manner before they    are fully fielded. But AI safety standards and policies have    not yet been universally set, and the complexities and    uncertainties associated with large language models make    T&E even more complicated when it comes to generative AI.  <\/p>\n<p>    Broadly, T&E enables experts to determine the baseline    performance of a specific model.  <\/p>\n<p>    For instance, to test and evaluate a computer vision algorithm    that differentiates between images of dogs and cats and things    that are not dogs or cats, an official might first train it    with millions of different pictures of those types of animals as    well as objects that aren&#8217;t dogs or cats. In doing so, the    expert will also hold back a diverse subset of data that can    then be presented to the algorithm down the line.  <\/p>\n<p>    They can then assess that evaluation dataset against the test    set, or ground truth, and ultimately determine failure rates    for cases where the model is unable to determine whether something is or is    not one of the categories they&#8217;re trying to identify.  <\/p>\n<p>    Experts at Scale AI will adopt a similar approach for T&E    with large language models, but because they are generative in    nature and the English language can be hard to evaluate, there    isn&#8217;t that same level of ground truth for these complex    systems. For example, if prompted to supply five different    responses, an LLM might be generally factually accurate in all    five, yet contrasting sentence structures could change the    meanings of each output.  
<\/p>\n<p>    So, part of the company&#8217;s effort to develop the framework,    methods and technology CDAO can use to test and evaluate large    language models will involve creating holdout datasets,    where they include DOD insiders to prompt response pairs and    adjudicate them through layers of review, and ensure that each is as    good a response as would be expected from a human in the    military.  <\/p>\n<p>    The entire process will be iterative in nature.  <\/p>\n<p>    Once datasets that are germane to the DOD for world knowledge,    truthfulness, and other topics are made and refined, the    experts can then evaluate existing large language models    against them.  <\/p>\n<p>    Eventually, as they have these holdout datasets, experts will    be able to run evaluations and establish model cards, or short    documents that supply details on the context for best use    of various machine learning models and information for    measuring their performance.  <\/p>\n<p>    Officials plan to automate as much of this development as    possible, so that as new models come in, there can be some    baseline understanding of how they will perform, where they    will perform best, and where they will probably start to fail.  <\/p>\n<p>    Further in the process, the ultimate intent is for models to    essentially send signals to CDAO officials that engage with    them, if they start to waver from the domains they have been    tested against.  <\/p>\n<p>    &#8220;This work will enable the DOD to mature its T&E policies    to address generative AI by measuring and assessing    quantitative data via benchmarking and assessing qualitative    feedback from users. The evaluation metrics will help identify    generative AI models that are ready to support military    applications with accurate and relevant results using DoD    terminology and knowledge bases. 
The rigorous T&E process    aims to enhance the robustness and resilience of AI systems in    classified environments, enabling the adoption of LLM    technology in secure environments,&#8221; Scale AI&#8217;s statement    reads.  <\/p>\n<p>    Beyond the CDAO, the company has also partnered with Meta,    Microsoft, the U.S. Army, the Defense Innovation Unit,    OpenAI, General Motors, Toyota Research Institute, Nvidia, and    others.  <\/p>\n<p>    &#8220;Testing and evaluating generative AI will help the DoD    understand the strengths and limitations of the technology, so    it can be deployed responsibly. Scale is honored to partner    with the DoD on this framework,&#8221; Alexandr Wang, Scale AI&#8217;s    founder and CEO, said in the statement.  <\/p>\n<p><!-- Auto Generated --><\/p>\n<p>Continue reading here: <\/p>\n<p><a target=\"_blank\" rel=\"nofollow noopener\" href=\"https:\/\/defensescoop.com\/2024\/02\/20\/scale-ai-pentagon-testing-evaluating-large-language-models\/\" title=\"Scale AI to set the Pentagon's path for testing and evaluating large language models - DefenseScoop\">Scale AI to set the Pentagon's path for testing and evaluating large language models - DefenseScoop<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p> The Pentagon&#8217;s Chief Digital and Artificial Intelligence Office (CDAO) tapped Scale AI to produce a trustworthy means for testing and evaluating large language models that can support and potentially disrupt military planning and decision-making.  
<a href=\"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/ai\/scale-ai-to-set-the-pentagons-path-for-testing-and-evaluating-large-language-models-defensescoop.php\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"limit_modified_date":"","last_modified_date":"","_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[1234935],"tags":[],"class_list":["post-1075650","post","type-post","status-publish","format-standard","hentry","category-ai"],"modified_by":null,"_links":{"self":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts\/1075650"}],"collection":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/comments?post=1075650"}],"version-history":[{"count":0,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts\/1075650\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/media?parent=1075650"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/categories?post=1075650"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/tags?post=1075650"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}