{"id":1028496,"date":"2024-05-21T02:34:01","date_gmt":"2024-05-21T06:34:01","guid":{"rendered":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/uncategorized\/the-new-chatgpt-has-a-huge-problem-in-chinese-2.php"},"modified":"2024-05-21T02:34:01","modified_gmt":"2024-05-21T06:34:01","slug":"the-new-chatgpt-has-a-huge-problem-in-chinese-2","status":"publish","type":"post","link":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/futurist\/the-new-chatgpt-has-a-huge-problem-in-chinese-2.php","title":{"rendered":"The New ChatGPT Has a Huge Problem in Chinese"},"content":{"rendered":"<p><\/p><div><img width=\"300\" height=\"158\" src=\"https:\/\/wp-assets.futurism.com\/2024\/05\/chatgpt-problem-chinese-300x158.jpg\" class=\"attachment-medium size-medium wp-post-image\" alt=\"A data training failure has resulted in OpenAI's new GPT-4o model spitting out spam and porn-littered Chinese-language responses.\" decoding=\"async\" loading=\"lazy\" style=\"padding-left:10px; padding-right: 10px;\"><\/div><h2>Dirty Data<\/h2><p>A pollution problem with OpenAI's training data has rendered its new chatbot's Chinese outputs chock-full of porn and spam, <a href=\"https:\/\/www.technologyreview.com\/2024\/05\/17\/1092649\/gpt-4o-chinese-token-polluted\/\">the <em>MIT Technology Review&nbsp;<\/em>reports<\/a>.<\/p><p>Last week, OpenAI released GPT-4o, a <a href=\"https:\/\/futurism.com\/the-byte\/openai-remove-voice-chatgpt-scarlett-johansson\">decidedly flirty<\/a> new large language model (LLM) equipped with advanced capabilities &mdash; for example, the ability to \"<a href=\"https:\/\/futurism.com\/new-chatgpt-ai-camera-video\">see<\/a>\" through users' device cameras, as well as the power to <a href=\"https:\/\/openai.com\/index\/hello-gpt-4o\/\">converse out loud<\/a> in real time. 
But for all of GPT-4o's apparent advancements, it seems to have at least one massive blind spot: the Chinese language.<\/p><p>To train AI models, you need tokens, or units of data that <em>represent<\/em> information that an AI uses to \"read\" and learn. According to <em>MIT Tech<\/em>, AI researchers were quick to discover that nearly all of the 100 longest Chinese-language tokens used by the AI to decipher Chinese prompts consisted of spammy porn and gambling content &mdash; resulting in bizarre, smut- and spam-ridden responses to completely run-of-the-mill queries.<\/p><p>\"This is sort of ridiculous,\" Tianle Cai, an AI researcher and PhD candidate at Princeton, wrote in a <a href=\"https:\/\/gist.github.com\/ctlllll\/4451e94f3b2ca415515f3ee369c8c374\">GitHub post<\/a> showcasing the polluted tokens.<\/p><h2>Unforced Error<\/h2><p>The worst part? According to experts, the problem of uncleaned data is a well-known AI training hurdle &mdash; and likely wouldn't have been too hard to fix.<\/p><p>\"Every spam problem has a solution,\" Deedy Das, an AI investor at Menlo Ventures who formerly worked on Google's Search team, told <em>MIT Tech<\/em>, adding that just auto-translating tokenized content to detect certain problematic keywords could feasibly \"get you 60 percent of the way\" to a clean dataset.<\/p><p>\"At the end of the day,\" he continued, \"I just don't think they did the work in this case.\"<\/p><p>\"The English tokens seem fine,\" Cai, the Princeton researcher, told <em>MIT Tech<\/em>, \"but the Chinese ones are not.\"<\/p><p>In other words, the likeliest reason for OpenAI's error is that ensuring its Chinese-language tokens were mostly free of porn and gambling spam just didn't make the to-do list.<\/p><p>It's a bad look for OpenAI. The Chinese language has the <a href=\"https:\/\/www.babbel.com\/en\/magazine\/the-10-most-spoken-languages-in-the-world\">most native speakers<\/a> on the planet. 
And numbers aside, if the future of our internet will indeed center on AI-generated material &mdash; as opposed to human-created and built websites, communities, and worlds &mdash; errors like failing to ensure that a premier chatbot can parse the native language of over one billion humans mean that people, not to mention entire cultures, inherently get left out.<\/p><p>That is to say, let's hope this is a learning moment.<\/p><p><strong>More on AI and non-English languages: <\/strong><a href=\"https:\/\/futurism.com\/the-byte\/internet-ai-generated-slime\"><em>Huge Proportion of Internet Is AI-Generated Slime, Researchers Find<\/em><\/a><\/p><p>The post <a href=\"https:\/\/futurism.com\/chatgpt-problem-chinese\">The New ChatGPT Has a Huge Problem in Chinese<\/a> appeared first on <a href=\"https:\/\/futurism.com\">Futurism<\/a>.<\/p><p>Read this article: <\/p><p><a target=\"_blank\" href=\"https:\/\/futurism.com\/the-byte\/chatgpt-problem-chinese\" title=\"The New ChatGPT Has a Huge Problem in Chinese\" rel=\"noopener\">The New ChatGPT Has a Huge Problem in Chinese<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p> Dirty Data A pollution problem with OpenAI training data has rendered its new chatbot's Chinese outputs chock-full of porn and spam, the MIT Technology Review reports. 
<a href=\"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/futurist\/the-new-chatgpt-has-a-huge-problem-in-chinese-2.php\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"limit_modified_date":"","last_modified_date":"","_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[10],"tags":[],"class_list":["post-1028496","post","type-post","status-publish","format-standard","hentry","category-futurist"],"modified_by":null,"_links":{"self":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts\/1028496"}],"collection":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/comments?post=1028496"}],"version-history":[{"count":0,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts\/1028496\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/media?parent=1028496"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/categories?post=1028496"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/tags?post=1028496"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}