{"id":1116928,"date":"2023-08-10T19:25:05","date_gmt":"2023-08-10T23:25:05","guid":{"rendered":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/uncategorized\/the-great-8-bit-debate-of-artificial-intelligence-hpcwire\/"},"modified":"2023-08-10T19:25:05","modified_gmt":"2023-08-10T23:25:05","slug":"the-great-8-bit-debate-of-artificial-intelligence-hpcwire","status":"publish","type":"post","link":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/artificial-intelligence\/the-great-8-bit-debate-of-artificial-intelligence-hpcwire\/","title":{"rendered":"The Great 8-bit Debate of Artificial Intelligence &#8211; HPCwire"},"content":{"rendered":"<p> Editor's Note: Users often ask, &#8220;What separates HPC from AI? They both do a lot of number crunching.&#8221; While that is true, one big difference is the precision required for a valid answer. HPC often requires the highest possible precision (i.e., 64-bit double-precision floating point), while many AI applications work with 8-bit integers or floating point numbers. Using lower precision often allows faster CPU\/GPU mathematics and a good-enough result for many AI applications. The following article explains the trend toward lower-precision computing in AI. <\/p>\n<p> A grand competition of numerical representation is shaping up, as some companies promote floating point data types in deep learning while others champion integer data types. <\/p>\n<p> Artificial intelligence (AI) is proliferating into every corner of our lives. The demand for products and services powered by AI algorithms has skyrocketed alongside the popularity of large language models (LLMs) like ChatGPT and image generation models like Stable Diffusion. With this increase in popularity, however, comes an increase in scrutiny over the computational and environmental costs of AI, particularly the subfield of deep learning.
<\/p>\n<p> The primary factors influencing the cost of deep learning are the size and structure of the model, the processor it runs on, and the numerical representation of the data. State-of-the-art models have been growing in size for years, with compute requirements doubling every 6-10 months [1] for the last decade. Processor compute power has increased as well, but not nearly fast enough to keep up with the growing costs of the latest AI models. This has led researchers to delve deeper into numerical representation in attempts to reduce the cost of AI. Choosing the right numerical representation, or data type, has significant implications for the power consumption, accuracy, and throughput of a given model. There is, however, no single answer to which data type is best for AI. Data type requirements vary between the two distinct phases of deep learning: the initial training phase and the subsequent inference phase. <\/p>\n<p> When it comes to increasing AI efficiency, the method of first resort is quantization of the data type. Quantization reduces the number of bits required to represent the weights of a network. Reducing the number of bits not only makes the model smaller but also reduces total computation time, and thus the power required to do the computations. This is an essential technique for those pursuing efficient AI. <\/p>\n<p> AI models are typically trained using single-precision 32-bit floating point (FP32) data types. It was found, however, that all 32 bits aren't always needed to maintain accuracy. Attempts at training models using half-precision 16-bit floating point (FP16) data types showed early success, and the race to find the minimum number of bits that maintains accuracy was on.
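The bit reduction that quantization performs can be sketched with a toy affine quantizer. This is a minimal NumPy illustration, not any vendor's implementation; the function names and the per-tensor scheme are illustrative (production toolchains typically use per-channel scales and calibration data):

```python
import numpy as np

def quantize_int8(weights):
    """Affine (asymmetric) quantization of FP32 weights to INT8.

    Maps the observed float range [min, max] onto the 256 INT8 codes
    [-128, 127] via a scale and zero-point -- the basic mechanism
    behind shrinking each weight from 32 bits to 8.
    """
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0            # one INT8 step, in float units
    zero_point = round(-128 - w_min / scale)   # offset aligning w_min with code -128
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127)
    return q.astype(np.int8), scale, zero_point

def dequantize(q, scale, zero_point):
    """Map INT8 codes back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
# Storage shrinks 4x (int8 vs. float32); the round-trip error stays
# within about one quantization step (scale).
max_error = float(np.abs(dequantize(q, scale, zp) - weights).max())
```

The same idea applies to activations; the smaller the bit width, the coarser the grid and the larger the worst-case rounding error.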
Google introduced its 16-bit brain float (BF16), and models being primed for inference were often quantized to 8-bit floating point (FP8) and integer (INT8) data types. There are two primary approaches to quantizing a neural network: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Both methods aim to reduce the numerical precision of the model to improve computational efficiency, memory footprint, and energy consumption, but they differ in how and when the quantization is applied, and in the resulting accuracy. <\/p>\n<p> Post-Training Quantization (PTQ) occurs after training a model with higher-precision representations (e.g., FP32 or FP16). It converts the model's weights and activations to lower-precision formats (e.g., FP8 or INT8). Although simple to implement, PTQ can result in significant accuracy loss, particularly in low-precision formats, because the model isn't trained to handle quantization errors. Quantization-Aware Training (QAT) incorporates quantization during training, allowing the model to adapt to reduced numerical precision. Forward and backward passes simulate quantized operations, computing gradients with respect to quantized weights and activations. Although QAT generally yields better model accuracy than PTQ, it requires modifications to the training process and can be more complex to implement. <\/p>\n<p> The AI industry has begun coalescing around two preferred candidates for quantized data types: INT8 and FP8, and every hardware vendor seems to have taken a side. In mid-2022, a paper by Graphcore and AMD [2] floated the idea of an IEEE-standard FP8 data type. A joint paper with a similar proposal from Intel, Nvidia, and Arm [3] followed shortly after. Other AI hardware vendors, like Qualcomm [4, 5] and Untether AI [6], have also written papers promoting FP8 and reviewing its merits versus INT8. But the debate is far from settled.
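Both PTQ and QAT revolve around the same simulated-quantization primitive: quantize, then immediately dequantize, so the values land on the low-precision grid while remaining floats. A minimal NumPy sketch, assuming symmetric per-tensor quantization (names illustrative):

```python
import numpy as np

def fake_quant(x, n_bits=8):
    """Simulated ('fake') quantization: quantize, then dequantize.

    QAT inserts this into the forward pass so the network experiences
    quantization error during training, while the weights stay in
    float for the gradient updates.
    """
    qmax = 2 ** (n_bits - 1) - 1           # 127 for 8 bits
    scale = np.abs(x).max() / qmax         # symmetric per-tensor scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                       # float values on the INT8 grid

# PTQ: quantize once after training. Simple, but the model never saw
# the rounding error, so accuracy can drop at low precision.
trained_weights = np.random.randn(256).astype(np.float32)
ptq_weights = fake_quant(trained_weights)

# QAT (sketch): call fake_quant on the weights inside every training
# step; the backward pass typically treats round() as identity (the
# straight-through estimator), so the float weights adapt to the grid.
```

The difference between the two methods is therefore not the quantizer itself but where it sits: after the training loop (PTQ) or inside it (QAT).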
While there is no single answer for which data type is best for AI in general, there are superior and inferior data types for specific AI processors and model architectures with particular performance and accuracy requirements. <\/p>\n<p> Floating point and integer data types are two ways to represent and store numerical values in computer memory. A few key differences between the two formats translate to advantages and disadvantages for various neural networks in training and inference. <\/p>\n<p> The differences all stem from their representation. Floating point data types represent real numbers, which include both integers and fractions. These numbers are expressed in a form of scientific notation, with a significand (mantissa) and an exponent. <\/p>\n<p> Integer data types, on the other hand, represent whole numbers (without fractions). These representations result in a very large difference in precision and dynamic range: floating point numbers have a much wider dynamic range than their integer counterparts, while integer numbers have a smaller range and can only represent whole numbers with a fixed level of precision. <\/p>\n<p> In deep learning, the numerical representation requirements differ between the training and inference phases due to the unique computational demands and priorities of each stage. During the training phase, the primary focus is on updating the model's parameters through iterative optimization, which typically necessitates a higher dynamic range to ensure the accurate propagation of gradients and the convergence of the learning process. Consequently, floating point representations such as FP32, FP16, and, more recently, FP8 are typically employed during training to maintain sufficient dynamic range.
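The range gap between the two families is easy to put numbers on. A back-of-envelope sketch using the standard IEEE 754 half-precision and two's-complement INT8 definitions (an illustration added here, not from the article):

```python
import math

# INT8: 256 evenly spaced integers with fixed absolute precision.
int8_min, int8_max = -128, 127

# FP16 (IEEE 754 half): 1 sign, 5 exponent, 10 mantissa bits.
fp16_max = (2 - 2 ** -10) * 2 ** 15   # largest normal value: 65504.0
fp16_min_normal = 2 ** -14            # smallest positive normal: ~6.1e-5

# Orders of magnitude spanned between the smallest and largest nonzero
# normal magnitudes: ~9 decades for FP16 vs. ~2 for INT8 -- which is
# why gradient propagation (tiny and large values mixed) favors floats.
fp16_decades = math.log10(fp16_max / fp16_min_normal)
int8_decades = math.log10(int8_max / 1)
```

The integer format spends all of its codes on uniform precision inside a narrow window; the float format spreads its codes logarithmically across a far wider window.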
On the other hand, the inference phase is concerned with the efficient evaluation of the trained model on new input data, where the priority shifts toward minimizing computational complexity, memory footprint, and energy consumption. In this context, lower-precision numerical representations, such as 8-bit integer (INT8), become an option in addition to FP8. The ultimate decision depends on the specific model and underlying hardware. <\/p>\n<p> The best data type for inference will vary depending on the application and the target hardware. Real-time and mobile inference services tend to use the smaller 8-bit data types to reduce memory footprint, compute time, and energy consumption while maintaining enough accuracy. <\/p>\n<p> FP8 is growing increasingly popular, as every major hardware vendor and cloud service provider has addressed its use in deep learning. There are three primary flavors of FP8, defined by the split between exponent and mantissa bits. Having more exponent bits increases the dynamic range of a data type, so FP8 E3M4, consisting of 1 sign bit, 3 exponent bits, and 4 mantissa bits, has the smallest dynamic range of the bunch. This representation sacrifices range for precision by reserving more bits for the mantissa, which increases accuracy. FP8 E4M3 has an extra exponent bit, and thus a greater range. FP8 E5M2 has the highest dynamic range of the trio, making it the preferred target for training, which requires greater dynamic range. Having a collection of FP8 representations allows a tradeoff between dynamic range and precision, as some inference applications benefit from the increased accuracy offered by an extra mantissa bit. <\/p>\n<p> INT8, on the other hand, can be thought of as having 1 sign bit, 1 exponent bit, and 6 mantissa bits, sacrificing much of its dynamic range for precision.
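The range/precision tradeoff among the FP8 flavors can be made concrete. The helper below assumes a plain IEEE-754-style layout (conventional bias, all-ones exponent code reserved for inf/NaN); note that shipped FP8 specifications deviate in detail, e.g. the widely used E4M3 variant drops infinities and extends its maximum magnitude to 448:

```python
def fp_max_normal(exp_bits: int, man_bits: int) -> float:
    """Largest normal value of an IEEE-754-style binary format.

    Assumes the conventional bias 2**(exp_bits-1) - 1 and that the
    all-ones exponent code is reserved for inf/NaN. Real FP8 specs
    can differ (e.g. E4M3 variants without infinities reach 448).
    """
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias     # largest non-reserved exponent
    max_significand = 2 - 2 ** -man_bits     # binary 1.111...1
    return max_significand * 2 ** max_exp

# More exponent bits widen the range; more mantissa bits refine precision.
for name, e, m in [("E5M2", 5, 2), ("E4M3", 4, 3), ("E3M4", 3, 4)]:
    print(f"{name}: max normal = {fp_max_normal(e, m)}")
```

Under these assumptions E5M2 tops out around 57344, E4M3 around 240, and E3M4 around 15.5, which is the range ordering the article describes.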
Whether this translates into better accuracy than FP8 depends on the AI model in question, and whether it translates into better power efficiency depends on the underlying hardware. Research from Untether AI [6] shows that FP8 outperforms INT8 in accuracy and, on their hardware, in performance and efficiency as well. Qualcomm research [5], by contrast, found that the accuracy gains of FP8 are not worth the loss of efficiency compared to INT8 on their hardware. Ultimately, the choice of data type when quantizing for inference will often come down to what is best supported in hardware, as well as to the model itself. <\/p>\n<p> References <\/p>\n<p> [1] Compute Trends Across Three Eras of Machine Learning, <a href=\"https:\/\/arxiv.org\/pdf\/2202.05924.pdf\" rel=\"nofollow\">https:\/\/arxiv.org\/pdf\/2202.05924.pdf<\/a><br \/>\n[2] 8-bit Numerical Formats for Deep Neural Networks, <a href=\"https:\/\/arxiv.org\/abs\/2206.02915\" rel=\"nofollow\">https:\/\/arxiv.org\/abs\/2206.02915<\/a><br \/>\n[3] FP8 Formats for Deep Learning, <a href=\"https:\/\/arxiv.org\/abs\/2209.05433\" rel=\"nofollow\">https:\/\/arxiv.org\/abs\/2209.05433<\/a><br \/>\n[4] FP8 Quantization: The Power of the Exponent, <a href=\"https:\/\/arxiv.org\/pdf\/2208.09225.pdf\" rel=\"nofollow\">https:\/\/arxiv.org\/pdf\/2208.09225.pdf<\/a><br \/>\n[5] FP8 versus INT8 for Efficient Deep Learning Inference, <a href=\"https:\/\/arxiv.org\/abs\/2303.17951\" rel=\"nofollow\">https:\/\/arxiv.org\/abs\/2303.17951<\/a><br \/>\n[6] FP8: Efficient AI Inference Using Custom 8-bit Floating Point Data Types, <a href=\"https:\/\/www.untether.ai\/content-request-form-fp8-whitepaper\" rel=\"nofollow\">https:\/\/www.untether.ai\/content-request-form-fp8-whitepaper<\/a> <\/p>\n<p> About the Author <\/p>\n<p> Waleed Atallah is a Product Manager responsible for silicon, boards, and systems at Untether AI.
Currently, he is rolling out Untether AI's second-generation silicon product, the speedAI family of devices. He was previously a Product Manager at Intel, where he was responsible for high-end FPGAs with high-bandwidth memory. His interests span all things compute efficiency, particularly the mapping of software to new hardware architectures. He received a B.S. degree in Electrical Engineering from UCLA. <\/p>\n<p>Read more:<\/p>\n<p><a target=\"_blank\" rel=\"nofollow noopener\" href=\"https:\/\/www.hpcwire.com\/2023\/08\/07\/the-great-8-bit-debate-of-artificial-intelligence\" title=\"The Great 8-bit Debate of Artificial Intelligence - HPCwire\">The Great 8-bit Debate of Artificial Intelligence - HPCwire<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p> Editor's Note: Users often ask, &#8220;What separates HPC from AI? They both do a lot of number crunching.&#8221; While that is true, one big difference is the precision required for a valid answer <a href=\"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/artificial-intelligence\/the-great-8-bit-debate-of-artificial-intelligence-hpcwire\/\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[187742],"tags":[],"class_list":["post-1116928","post","type-post","status-publish","format-standard","hentry","category-artificial-intelligence"],"_links":{"self":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/posts\/1116928"}],"collection":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/comments?post=1116928"}],"version-history":[{"count":0,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/posts\/1116928\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/media?parent=1116928"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/categories?post=1116928"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/wp-json\/wp\/v2\/tags?post=1116928"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}