{"id":195968,"date":"2017-06-01T22:33:54","date_gmt":"2017-06-02T02:33:54","guid":{"rendered":"http:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/spark-gets-automation-analyzing-code-and-tuning-clusters-in-production-zdnet\/"},"modified":"2017-06-01T22:33:54","modified_gmt":"2017-06-02T02:33:54","slug":"spark-gets-automation-analyzing-code-and-tuning-clusters-in-production-zdnet","status":"publish","type":"post","link":"https:\/\/www.euvolution.com\/prometheism-transhumanism-posthumanism\/automation\/spark-gets-automation-analyzing-code-and-tuning-clusters-in-production-zdnet\/","title":{"rendered":"Spark gets automation: Analyzing code and tuning clusters in production &#8211; ZDNet"},"content":{"rendered":"<p><p>    Reasons people are migrating to Spark.    Image: Databricks  <\/p>\n<p>    Hadoop and MapReduce, the parallel programming paradigm and API    originally behind Hadoop, used to be synonymous. Nowadays when    we talk about Hadoop, we mostly talk about an ecosystem of    tools built around the common file system layer of HDFS, and    programmed via Spark.  <\/p>\n<p>    Spark is the new Hadoop. One of the defining trends of this time, confirmed by    both practitioners in the field and surveys, is the en masse    move to Spark for Hadoop users. Spark is itself an ecosystem of    sorts, offering options for SQL-based access to data,    streaming, and machine learning.  <\/p>\n<p>    People are migrating to Spark for a number of reasons,    including easier programming paradigm. Easier than MapReduce    does not necessarily mean easy though, and there are a number    of gotchas when programming and deploying Spark applications.  <\/p>\n<p>    So why are people migrating to Spark? The top reason seems to    be performance: 91 percent of 1615 people from over 900    organizations participating in the Databricks Apache Spark Survey 2016 cited    this as their reason for using Spark. But there's more.    Advanced analytics and ease of programming are almost equally    important, cited by 82 percent and 76 percent of respondents.  <\/p>\n<p>    All industry sources we have spoken to over the last months    point to the same direction: programming against Spark's API is    easier than using MapReduce, so MapReduce is seen as a legacy    API at this point. Vendors will continue to offer support for    it as long as there are clients using it, but practically all    new development is Spark-based.  <\/p>\n<p>    Not everyone using Spark has the same    responsibilities or skills. Image: Databricks  <\/p>\n<p>    As Ash Munshi, Pepperdata CEO puts it: \"Spark offers a unified    framework and SQL access, which means you can do advanced    analytics, and that's where the big bucks are. Plus it's easier    to program: gives you a nice abstraction layer, so you don't    need to worry about all the details you have to manage when    working with MapReduce. Programming at a higher level means    it's easier for people to understand the down and dirty details    and to deploy their apps.\"  <\/p>\n<p>    Great. What's the problem then? Munshi points out that the flip    side of Spark abstraction, especially when running in Hadoop's    YARN environment which does not make it too easy to extract    metadata, is that a lot of the execution details are hidden.    This means it's hard to pinpoint which lines of code cause    something to happen in this complex distributed system, and    it's also hard to tune performance.  
Running programs in a complex distributed system also means you have to be aware not just of your own application's execution and performance, but also of the broader execution environment. Pepperdata calls this the cluster weather problem: the need to know the context in which an application is running. A common issue in cluster deployments, for example, is inconsistency in run times because of transient workloads.

Pepperdata is not the only one that has taken note. A few months back, Alpine Data pinpointed the same issue, albeit with a slightly different framing. Alpine Data pointed to the fact that Spark is extremely sensitive to how jobs are configured and resourced, requiring data scientists to have a deep understanding of both Spark and the configuration and utilization of the Hadoop cluster being used.

Failure to correctly resource Spark jobs frequently results in out-of-memory errors, which in turn lead to inefficient, time-consuming, trial-and-error resourcing experiments. According to Alpine Data, this requirement significantly limits the utility of Spark and restricts its use to deeply skilled data scientists.

This is based on hard-earned experience, as Alpine Data co-founder and CPO Steven Hillion explained. At some point, one of Alpine Data's clients was using the Alpine Data Science Platform (ADSP) to do very large-scale processing on consumer data: billions of rows and thousands of variables. ADSP uses Spark under the hood for data-crunching jobs, but the problem was that these jobs would either take forever or break.

The reason was that the tuning of Spark parameters in the cluster was not right. The people using ADSP in that case were data scientists, not data engineers. They were proficient in finding the right models to process data and extracting insights from them, but not necessarily in deploying them at scale.

The result was that data scientists would get on the phone with ADSP engineers to help them diagnose the issues and propose configurations. As this would obviously not scale, Alpine Data came up with the idea of building the logic its engineers applied in this process into ADSP. Alpine Data says it worked, enabling clients to build workflows within days and deploy them within hours without any manual intervention.

So the next step was to bundle this as part of ADSP and start shipping it, which Alpine Labs did in fall 2016. The feature was presented at Spark Summit East 2017, and Hillion says the response has been "almost overwhelming. In Boston we had a long line of people coming to ask about this."

Hillion emphasized that their approach is procedural, not based on ML. This may sound strange, considering their ML expertise. Alpine Labs says, however, that this is not a static configuration: it works by determining the correct resourcing and configuration for each Spark job at run time, based on the size and dimensionality of the input data, the complexity of the Spark job, and the availability of resources on the Hadoop cluster.

"You can think of it as a sort of equation if you will, in a simplistic way, one that expresses how we tune parameters," says Hillion. "Tuning these parameters comes through experience, so in a way we are training the model using our own data. I would not call it machine learning, but then again we are learning something from machines."
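Alpine Labs has not published the actual logic behind that "equation", but a deliberately simplistic, hypothetical heuristic can illustrate the idea. Everything below (the 3x memory budget, the width factor, the 128 MB partition target) is an assumption for illustration, not Alpine's method; the outputs map onto standard Spark settings such as --num-executors, --executor-memory and spark.sql.shuffle.partitions.

```scala
// Hypothetical resourcing heuristic -- an illustration of the idea,
// NOT Alpine Labs' proprietary logic. All constants are assumptions.
case class SparkResources(numExecutors: Int, executorMemGb: Int, shufflePartitions: Int)

def suggestResources(inputGb: Double,          // size of the input data
                     numColumns: Int,          // dimensionality of the input data
                     freeClusterCores: Int,    // current cluster availability
                     memPerExecutorCapGb: Int): SparkResources = {
  // Rule of thumb: budget ~3x the input size in aggregate executor memory,
  // more for wide (high-dimensional) datasets that inflate row objects.
  val widthFactor   = if (numColumns > 1000) 2.0 else 1.0
  val neededMemGb   = inputGb * 3.0 * widthFactor

  // Cap the per-executor heap to keep GC pauses manageable.
  val executorMemGb = math.min(8, memPerExecutorCapGb)
  val numExecutors  = math.max(2, math.ceil(neededMemGb / executorMemGb).toInt)

  // Aim for ~128 MB per shuffle partition, but keep every free core busy.
  val byCores = freeClusterCores
  val bySize  = math.ceil(inputGb * 1024 / 128).toInt
  SparkResources(numExecutors, executorMemGb, math.max(byCores, bySize))
}

// Example: 200 GB of input with 2,500 columns on a cluster with 120 free cores
// yields SparkResources(150, 8, 1600), i.e. --num-executors 150,
// --executor-memory 8g, spark.sql.shuffle.partitions=1600.
```

A real system would also fold in the structure of the job (joins, iterations) and re-evaluate as cluster weather changes; the point is only that the mapping from data and cluster state to configuration can be made explicit and automatic.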
Pepperdata now also offers a solution for Spark automation, with last week's release of Pepperdata Code Analyzer for Apache Spark (PCAAS), but it addresses a different audience with a different strategy. Data scientists make up 23 percent of all Spark users, but data engineers and architects combined account for 63 percent. This is the audience Pepperdata aims at with PCAAS.

Architects are the people who design (big data) systems, and data engineers are the ones who work with data scientists to take their analyses to production. Munshi says PCAAS aims to give them the ability to take running Spark applications, analyze them to see what is going on, and then tie that back to specific lines of code.

The thinking is that by understanding more about the CPU utilization, garbage collection, or I/O associated with their applications, engineers and architects should be able to optimize them. PCAAS boasts the ability to do part of the debugging itself, by isolating suspicious blocks of code and prompting engineers to look into them.

PCAAS aims to help decipher cluster weather as well, making it possible to tell whether run-time inconsistencies should be attributed to a specific application or to the overall workload at the time of execution. Munshi also points out that YARN relies heavily on static scheduling, while more dynamic approaches could result in better hardware utilization.

Better hardware utilization is clearly a top concern in terms of ROI, but to understand how this relates to PCAAS, and why Pepperdata claims to be able to overcome YARN's limitations, we need to see where PCAAS sits in Pepperdata's product suite. PCAAS is the latest addition to a line of products including the Application Profiler, the Cluster Analyzer, the Capacity Optimizer, and the Policy Enforcer.

The Application Profiler and the Cluster Analyzer are about collecting telemetry data, while the Capacity Optimizer and the Policy Enforcer intervene in real time, says Munshi. Pepperdata's overarching ambition is to bridge the gap between Dev and Ops, and Munshi believes that PCAAS is a step in that direction: a tool Ops can give to Devs to self-diagnose issues, resulting in better interaction and more rapid iteration cycles.

Interestingly, Hillion also sees a clear division between proprietary algorithms for tuning ML jobs and the information a Spark cluster can provide to inform those algorithms. Still, there are differences as well as similarities in the Alpine Labs and Pepperdata offerings.

To begin with, neither offering is stand-alone. Spark auto-tuning is part of ADSP, while PCAAS relies on telemetry data provided by other Pepperdata solutions. So if you are only interested in automating part of your Spark cluster tuning or application profiling, tough luck.
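For readers wondering what that raw telemetry looks like, Spark's public SparkListener interface exposes per-task metrics of the kind such tools build on. The sketch below is an illustration only, not PCAAS's implementation: it flags tasks whose garbage-collection time dominates their run time, one symptom PCAAS surfaces, and the 20 percent threshold is an arbitrary assumption.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Flags GC-heavy tasks as they complete. Illustrative only; the
// threshold below is an assumption, not a recommendation.
class GcPressureListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null && m.executorRunTime > 0) {
      val gcFraction = m.jvmGCTime.toDouble / m.executorRunTime
      if (gcFraction > 0.2) {   // arbitrary: >20% of task time spent in GC
        val pct = gcFraction * 100
        println(f"Stage ${taskEnd.stageId}%d: $pct%.1f%% of ${m.executorRunTime}%d ms " +
          "run time spent in GC -- consider a larger executor heap")
      }
    }
  }
}

// Register on a live session:
//   spark.sparkContext.addSparkListener(new GcPressureListener)
```

Tying such events back to specific source lines, and correlating them with cluster-wide conditions, is where a product like PCAAS claims to add value over the raw listener stream.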
When discussing this with Hillion, we pointed out that not everyone interested in Spark auto-tuning will necessarily want to subscribe to ADSP in its entirety, so perhaps making this capability available as a stand-alone product would make sense. Hillion hinted that the part of their solution that gets Spark cluster metadata from YARN may be open sourced, while the auto-tuning capabilities may be sold separately at some point.

Alpine Labs is worried about giving away too much of its IP, but this concern may be holding it back from commercial success. Not every organization facing a similar situation reacts the same way. Case in point: Metamarkets built Druid and then open sourced it. Why? "We built it because we needed it, and we open sourced it because if we had not, something else would have replaced it."

[Image: The AI lock-in loop: great investment begets greater results, begetting greater investment. Credit: Azeem Azhar / Schibsted]

In all fairness though, for Metamarkets Druid is just infrastructure, not core business, while for Alpine Labs ADSP is its bread and butter. As for Pepperdata, it is toying with the idea of giving free access to PCAAS for non-production clusters to get a foothold in organizations. The reasoning is tried and true: get engineers to know and love a tool, and the tool will eventually spread and find its way into IT budgets.

Either way, if you are among those who would benefit from such automation capabilities for your Spark deployment, for the time being you don't have much of a choice. You will have to either pay a premium and commit to a platform, or wait until such capabilities eventually trickle down.

The bigger picture, however, is clear: automation is taking an increasingly central role in big data. Big data platforms can be the substrate on which automation applications are developed, but it can also work the other way around: automation can help alleviate big data pain points.

Remember the AI lock-in loop? First-mover advantage may prove significant here, as sitting on top of millions of telemetry data points can do wonders for your product. This is exactly the position Pepperdata is in, and it intends to leverage it to apply deep learning for predictive maintenance capabilities, as well as to monetize the data in other ways.

Whether Pepperdata manages to execute on that strategy, and how others will respond, is another issue, but at this point its approach looks well placed to address the need for big data automation services.

Read more: Spark gets automation: Analyzing code and tuning clusters in production - ZDNet
http://www.zdnet.com/article/spark-gets-automation-analyzing-code-and-tuning-clusters-in-production/