{"id":227295,"date":"2017-07-12T12:07:04","date_gmt":"2017-07-12T16:07:04","guid":{"rendered":"http:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/uncategorized\/china-tunes-neural-networks-for-custom-supercomputer-chip-the-next-platform.php"},"modified":"2017-07-12T12:07:04","modified_gmt":"2017-07-12T16:07:04","slug":"china-tunes-neural-networks-for-custom-supercomputer-chip-the-next-platform","status":"publish","type":"post","link":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/super-computer\/china-tunes-neural-networks-for-custom-supercomputer-chip-the-next-platform.php","title":{"rendered":"China Tunes Neural Networks for Custom Supercomputer Chip &#8211; The Next Platform"},"content":{"rendered":"<p><p>    July 11, 2017 Nicole    Hemsoth  <\/p>\n<p>    Supercomputing centers around the world are preparing their    next generation architectural approaches for the insertion of    AI into scientific workflows. For some, this means retooling    around an existing architecture to make capability of    double-duty for both HPC and AI.  <\/p>\n<p>    Teams in China working on the top performing supercomputer in    the world, the Sunway TaihuLight machine with its custom    processor, have shown that their optimizations for theSW26010    architecture on deep learning models have yielded a 1.91-9.75X    speedup over a GPU accelerated model using the Nvidia Tesla    K40m in a test convolutional neural network run with over 100    parameter configurations.  <\/p>\n<p>    Efforts on this system show that high performance deep learning    is possible at scale on a CPU-only architecture. The Sunway    TaihuLight machine is based on the 260-core Sunway SW26010,    which     we detailed here from both a chip and systems perspective.    The convolutional neural network work was bundled together as    swDNN, a library for accelerating deep learning on the    TaihuLight supercomputer  <\/p>\n<p>    According to Dr. 
Haohuan Fu, one of the leads behind the swDNN framework for the Sunway architecture (and associate director at the National Supercomputing Center in Wuxi, where TaihuLight is located), the processor has a number of unique features that could potentially help the training process of deep neural networks. These include the on-chip fusion of both management cores and computing core clusters, the support of a user-controlled fast buffer for the 256 computing cores, a hardware-supported scheme for register communication across different cores, as well as the unified memory space shared by the four core groups, each with 65 cores.<\/p>\n<\/p>\n<p>Despite some of the features that make the SW26010 a good fit for neural networks, there were some limitations the teams had to work around, the most prominent of which was memory bandwidth, something that is a problem on all processors and accelerators tackling neural network training in particular. The DDR3 memory interface provides a peak bandwidth of 36 GB\/s for each compute group (64 of the compute elements) for a total bandwidth of 144 GB\/s per processor. The Nvidia K80 GPU, with a similar double-precision performance of 2.91 teraflops, provides an aggregate memory bandwidth of 480 GB\/s. Therefore, while CNNs are considered a compute-intensive kernel, care had to be taken with the memory access scheme to alleviate the memory bandwidth constraints. Further, since the processor does not have a shared buffer for the frequent data communications needed in CNNs, the team had to rely on a fine-grained data sharing scheme based on row and column communication buses in the CPE mesh.<\/p>\n<p>The optimized swDNN framework, at the current stage, can provide a double-precision performance of over 1.6 teraflops for the convolution kernels, achieving over 50% of the theoretical peak. 
The significant performance improvements achieved from a careful utilization of the SW26010's architectural features and a systematic optimization process demonstrate that these unique features and corresponding optimization schemes are potential candidates to be included in future DNN architectures as well as DNN-specific compilation tools.<\/p>\n<p>According to Fu, \"By performing a systematic optimization that explores major factors of deep learning, including the organization of convolution loops, blocking techniques, register data communication schemes, as well as reordering strategies for the two pipelines of instructions, the SW26010 processor on the Sunway TaihuLight supercomputer has managed to achieve a double-precision performance of over 1.6 teraflops for the convolution kernel, achieving 54% of the theoretical peak.\"<\/p>\n<p>To further get around the memory bandwidth limitations, the team created a three-pronged approach to memory for its manycore architecture. Depending on what is required, the CPE (computing processing element) mesh can access data items either directly from global memory or through the three-level memory hierarchy (registers, local data memory, and the larger, slower main memory).<\/p>\n<p>Part of the long-term plan for the Sunway TaihuLight supercomputer is to continue work on scaling traditional HPC applications to exascale, but also to continue neural network efforts in a companion direction. Fu says that TaihuLight teams are continuing the development of swDNN and are also collaborating with Face++ on facial recognition applications on the supercomputer, in addition to work with Sogou on voice and speech recognition. Most interesting (and vague) was the passing mention of a potential custom chip for deep learning, although he was non-committal.  
<\/p>\n<p>The team has created a customized register communication scheme that targets maximizing data reuse in the convolution kernels, which reduces the memory bandwidth requirements by almost an order of magnitude, they report in the full paper (IEEE subscription required). The team also carefully designed the instruction pipelining to reduce the idle time of the computation units by maximizing the overlap of memory operations and computation instructions, thus maximizing the overall training performance on the SW26010.<\/p>\n<p>Double-precision performance results for different convolution kernels compared with the Nvidia Tesla K40 using the cuDNNv5 libraries.<\/p>\n<p>To be fair, the Tesla K40 is not much of a comparison point against newer architectures, including Nvidia's Pascal GPUs. Nonetheless, the Sunway architecture could show comparable performance with GPUs for convolutional neural networks, paving the way for more discussion about the centrality of GPUs in current deep learning systems if CPUs can be rerouted to do similar work at a lower price point.<\/p>\n<p>The emphasis on double-precision floating point is also of interest, since the trend in training, and certainly in inference, is to push precision lower while balancing accuracy requirements. Also left unanswered is how convolutional neural network training might scale across the many nodes available; in short, is the test size indicative of the scalability limits before the communication bottleneck becomes too severe to make this efficient? However, armed with these software libraries and the need to keep pushing deep learning into the HPC stack, it is not absurd to think Sunway might build their own custom deep learning chip, especially if the need arises elsewhere in China, which we suspect it will.  
<\/p>\n<p>More on the deep learning library for the Sunway machine can be found at GitHub.<\/p>\n<p>Categories: AI, HPC, ISC17<\/p>\n<p>Tags: ISC17, TaihuLight<\/p>\n<p>Read the original: <\/p>\n<p><a target=\"_blank\" href=\"https:\/\/www.nextplatform.com\/2017\/07\/11\/china-tunes-neural-networks-custom-supercomputer-chip\/\" title=\"China Tunes Neural Networks for Custom Supercomputer Chip - The Next Platform\">China Tunes Neural Networks for Custom Supercomputer Chip - The Next Platform<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p> July 11, 2017 Nicole Hemsoth Supercomputing centers around the world are preparing their next-generation architectural approaches for the insertion of AI into scientific workflows. For some, this means retooling around an existing architecture so that it can do double duty for both HPC and AI. Teams in China working on the top-performing supercomputer in the world, the Sunway TaihuLight machine with its custom processor, have shown that their optimizations for the SW26010 architecture on deep learning models have yielded a 1.91-9.75X speedup over a GPU-accelerated model using the Nvidia Tesla K40m in a test convolutional neural network run with over 100 parameter configurations.  
<a href=\"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/super-computer\/china-tunes-neural-networks-for-custom-supercomputer-chip-the-next-platform.php\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"limit_modified_date":"","last_modified_date":"","_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[41],"tags":[],"class_list":["post-227295","post","type-post","status-publish","format-standard","hentry","category-super-computer"],"modified_by":null,"_links":{"self":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts\/227295"}],"collection":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/comments?post=227295"}],"version-history":[{"count":0,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts\/227295\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/media?parent=227295"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/categories?post=227295"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/tags?post=227295"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}