{"id":175787,"date":"2017-02-07T08:15:58","date_gmt":"2017-02-07T13:15:58","guid":{"rendered":"http:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/ai-for-matching-images-with-spoken-word-gets-a-boost-from-mit-fast-company\/"},"modified":"2017-02-07T08:15:58","modified_gmt":"2017-02-07T13:15:58","slug":"ai-for-matching-images-with-spoken-word-gets-a-boost-from-mit-fast-company","status":"publish","type":"post","link":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/ai\/ai-for-matching-images-with-spoken-word-gets-a-boost-from-mit-fast-company\/","title":{"rendered":"AI For Matching Images With Spoken Word Gets A Boost From MIT &#8211; Fast Company"},"content":{"rendered":"<p>Children learn to speak, as well as to recognize objects, people, and places, long before they learn to read or write. They can learn from hearing, seeing, and interacting without being given any instructions. So why shouldn't artificial intelligence systems be able to work the same way?<\/p>\n<p>That's the key insight driving a research project under way at MIT that takes a novel approach to speech and image recognition: teaching a computer to associate specific elements of images with corresponding sound files in order to identify imagery (say, a lighthouse in a photographic landscape) when someone in an audio clip says the word \"lighthouse.\"<\/p>\n<p>Though in the very early stages of what could be a years-long process of research and development, the implications of the MIT project, led by PhD student David Harwath and senior research scientist Jim Glass, are substantial. Along with being able to automatically surface images based on corresponding audio clips and vice versa, the research opens a path to language-to-language translation without the laborious steps of training AI systems on the correlation between two languages' words.
<\/p>\n<p>That could be particularly important for deciphering languages that are dying out because there aren't enough native speakers to warrant the expensive investment in manual annotation of vocabulary by bilingual speakers, which has traditionally been the cornerstone of AI-based translation. Of the world's roughly 7,000 spoken languages, Harwath says, speech recognition systems have been applied to fewer than 100.<\/p>\n<p>It could even eventually be possible, Harwath suggested, for the system to translate languages with little to no written record, a breakthrough that would be a huge boon to anthropologists.<\/p>\n<p>\"Because our model is just working on the level of audio and images,\" Harwath told Fast Company, \"we believe it to be language-agnostic. It shouldn't care what language it's working on.\"<\/p>\n<p>[Figure caption: t-SNE analysis of the 150 lowest-variance audio pattern cluster centroids for k = 500, displaying the majority-vote transcription of each audio cluster. All clusters shown contained a minimum of 583 members and an average of 2,482, with an average purity of 0.668.]<\/p>\n<p>The MIT project isn't the first to consider the idea that computers could automatically associate audio and imagery. But the research being done at MIT may well be the first to pursue it at scale, thanks to the \"renaissance\" in deep neural networks, which involve multiple layers of neural units that mimic the way the human brain solves problems. These networks require churning through massive amounts of data, so they've only taken off as a meaningful AI technique in recent years, as computers' processing power has increased.<\/p>\n<p>That's led just about every major technology company to go on hiring sprees in a bid to automate services like search, surfacing relevant photos and news, restaurant recommendations, and so on. 
Many consider AI to be perhaps the next major computing paradigm.<\/p>\n<p>\"It is the most important computing development in the last 20 years,\" Jen-Hsun Huang, the CEO of Nvidia, one of the world's largest makers of the kinds of graphics processors powering many AI initiatives, told Fast Company last year, \"and [big tech companies] are going to have to race to make sure that AI is a core competency.\"<\/p>\n<p>Now that computers are powerful enough to begin utilizing deep neural networks in speech recognition, the key is to develop better algorithms. In the case of the MIT project, Harwath and Glass believe that by employing more organic speech recognition algorithms, they can move faster down the path to truly intelligent systems along the lines of what characters like C-3PO have portrayed in the Star Wars movies.<\/p>\n<p>To be sure, we're many years away from such systems, but the MIT project is aiming to excise one of the most time-consuming and expensive pieces of the translation puzzle: requiring people to train models by manually labeling countless collections of images or vocabularies. That laborious process involves people going through large collections of imagery and annotating them, one by one, with descriptive keywords.<\/p>\n<p>Harwath acknowledges that his team spent quite a lot of time, starting in late 2014, doing that kind of manual, or supervised, learning on sound files and imagery, and that it afforded them a \"big collection of audio.\"<\/p>\n<p>Now they're on to the second version of the project: building algorithms that can learn language as well as the real-world concepts the language is grounded in, and that can do so using very unstructured data.
<\/p>\n<p>Here's how it works: The MIT team sets out to train neural networks on what amounts to a game of \"which one of these things is not like the other,\" Harwath explains.<\/p>\n<p>They want to teach the system to understand the difference between matching pairs (an image of a dog with a fluffy hat and an audio clip with the caption \"dog with a fluffy hat\") and mismatched pairs, like the same audio clip and a photo of a cat.<\/p>\n<p>Matches get a high score and mismatches get a low score. When the goal is for the system to learn individual objects within an image and individual words in an audio stream, they apply the neural network to small regions of the image, or small intervals of the audio.<\/p>\n<p>Right now the system is trained on only about 500 words. Yet it's often able to recognize those words in new audio clips it has never encountered. The system is nowhere near perfect: for some word categories, Harwath says, the accuracy is in the 15%-20% range, but in others it's as high as 90%.<\/p>\n<p>\"The really exciting thing,\" he says, \"is it's able to make the association between the acoustic patterns and the visual patterns. So when I say \"lighthouse,\" I'm referring to a particular [area] in an image that has a lighthouse, [and it can] associate it with the start and stop time in the audio where you say \"lighthouse.\"\"<\/p>\n<p>A different task that they frequently run the system through is essentially an image retrieval task, something like a Google image search. They give it a spoken query, say, \"Show me an image of a girl wearing a blue dress in front of a lighthouse,\" and then wait for the neural network to search for an image that's relevant to the query.<\/p>\n<p>Here's where it's important not to get too excited about the technology being ready for prime time. 
Harwath says the team considers the results of the query accurate if the appropriate image comes up in the top 10 results from a library of only about 1,000 images. The system is currently able to do that just under 50% of the time.<\/p>\n<p>The number is improving, though. When Harwath and Glass wrote a paper on the project for an upcoming conference in France, it was 43%. Still, he believes that although there are regular improvements in accuracy every time they train a new model, they're held back by the available computational power. Even with a set of eight powerful GPUs, it can still take two weeks to train a single model.<\/p>\n<p>[Figure caption: An example of the grounding method. The left image displays a grid defining the allowed start and end coordinates for the bounding box proposals. The bottom spectrogram displays several audio region proposals, drawn as families of stacked red line segments. The image on the right and the spectrogram on top display the final output of the grounding algorithm. The top spectrogram also displays the time-aligned text transcript of the caption, to demonstrate which words were captured by the groundings. In this example, the top three groundings have been kept, with the colors indicating the audio segment that is grounded to each bounding box.]<\/p>\n<p>Perhaps the most exciting potential of the research lies in breakthroughs for language-to-language translation.<\/p>\n<p>\"The way to think about it is this,\" Harwath says. \"If you have an image of a lighthouse, and if we speak different languages but describe the same image, and if the system can figure out the word I'm using and the word you're using, then implicitly, it has a model for translating my word to your word . . . It would bypass the need for manual translations and a need for someone who's bilingual. 
It would be amazing if we could just completely bypass that.\"<\/p>\n<p>To be sure, that is entirely theoretical today. But the MIT team is confident that at some point in the future, the system could reach that goal. It could be 10 years, or it could be 20. \"I really have no idea,\" he says. \"We're always wrong when we make predictions.\"<\/p>\n<p>In the meantime, another challenge is coming up with enough quality data to satisfy the system. Deep neural networks are very hungry models.<\/p>\n<p>Traditional machine learning models were limited by diminishing returns on additional data. \"If you think of a machine learning algorithm as an engine, data is like the gasoline,\" he says. \"Then, traditionally, the more gas you pour into the engine, the faster it runs, but it only works up to a point, and then levels off.<\/p>\n<p>\"With deep neural networks, you have a much higher capacity. The more data you give it, the faster and faster it goes. It just goes beyond what older algorithms were capable of.\"<\/p>\n<p>But he thinks no one's sure of the outer limits of deep neural networks' capacities. The big question, he says, is how far will deep neural networks scale? Will they saturate at some point and stop learning, or will they just keep going?<\/p>\n<p>\"We haven't reached this point yet,\" Harwath says, \"because people have been consistently showing that the more data you give them, the better they work. 
We don't know how far we can push it.\"<\/p>\n<p>Read more:<\/p>\n<p><a target=\"_blank\" rel=\"nofollow\" href=\"https:\/\/www.fastcompany.com\/3067904\/ai-for-matching-images-with-spoken-word-gets-a-boost-from-mit\" title=\"AI For Matching Images With Spoken Word Gets A Boost From MIT - Fast Company\">AI For Matching Images With Spoken Word Gets A Boost From MIT - Fast Company<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Children learn to speak, as well as to recognize objects, people, and places, long before they learn to read or write. <a href=\"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/ai\/ai-for-matching-images-with-spoken-word-gets-a-boost-from-mit-fast-company\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[187743],"tags":[],"class_list":["post-175787","post","type-post","status-publish","format-standard","hentry","category-ai"],"_links":{"self":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/posts\/175787"}],"collection":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/comments?post=175787"}],"version-history":[{"count":0,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/posts\/175787\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.euvolution.com\
/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/media?parent=175787"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/categories?post=175787"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/tags?post=175787"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}