{"id":207917,"date":"2017-02-14T10:31:26","date_gmt":"2017-02-14T15:31:26","guid":{"rendered":"http:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/uncategorized\/automations-impace-on-data-center-monitoring-alerts-the-data-center-journal.php"},"modified":"2017-02-14T10:31:26","modified_gmt":"2017-02-14T15:31:26","slug":"automations-impace-on-data-center-monitoring-alerts-the-data-center-journal","status":"publish","type":"post","link":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/automation\/automations-impace-on-data-center-monitoring-alerts-the-data-center-journal.php","title":{"rendered":"Automation&#8217;s Impact on Data Center Monitoring Alerts &#8211; The Data Center Journal"},"content":{"rendered":"<p><p>In my last installment, I discussed a few different areas where data center monitoring automation can not only make life in the data center more convenient but also become a force multiplier. I ran out of space, however, before I ran out of ideas (the story of my life). The one thing I didn't cover was the automation you can implement in response to an alert.<\/p>\n<p>As a data center professional, you probably have a solid understanding of monitoring and alerting already, but to truly appreciate how automation can relieve an enormous burden, it may be helpful to review a few examples.<\/p>\n<p>What follows are some clippings from my garden of automation: alert responses that have had a huge impact on the environments where they were implemented.<\/p>\n<p>Example 1: Disk Full<\/p>\n<p>Disk-full alerting is a simple concept with a deceptively large number of moving parts. So, I want to break it down into specifics. First, get the alert right. As my fellow SolarWinds Head Geek Thomas LaRock and I discussed in a recent episode of SolarWinds Lab, simplistic disk alerts help nobody. 
If you have a 2TB disk, alerting when it's 90 percent used means you still have 204.8GB of disk space remaining.<\/p>\n<p>A good solution to this problem is to check for both percent used and remaining space. A better solution is to include logic in the alert that tests for the total space of the drive, so that drives with less than 1TB of space have one set of criteria and drives with more than 1TB have another. These tests should all be in the same alert, if possible, because who wants to manage hundreds of alert rules? Nevertheless, you want to ensure you are monitoring disk space in a way that is reasonable for the volumes in question, and only create necessary alerts.<\/p>\n<p>Next, clear unnecessary files out of various directories. For the purpose of this article, I'll just say that all systems have a temporary directory and that you can delete all files out of that folder with impunity. The challenge in doing so easily comes down to a problem of impersonation. Many monitoring solutions run on the server as the system account. As a result, performing certain actions requires the script to impersonate a privileged user account. There are a variety of ways to do so, which is why I'll leave the problem here for you to solve in a way that best fits your individual environment.<\/p>\n<p>Once the impersonation issue is resolved, there's another challenge specific to the disk-full alert: knowing that the correct directories for the specific server are being targeted. The best approach is to use a common shared folder that maps to all servers and place a script file there. That script can be set up to first detect the proper directories and then clear them out with all the necessary safeguards and checks in place to avoid accidental damage.
<\/p>\n<p>Example 2: Restart an IIS Application Pool<\/p>\n<p>Sadly, restarting application pools is often the easiest and best fix for website-related issues. I'm not saying that running appcmd stop... and then appcmd start... from the server command line is a quick kludge that ignores the bigger issues. I'm saying that often, resetting the application pool is the fix.<\/p>\n<p>If your web team finds itself in this situation, waking a human being to do the honors is absolutely your most expensive option. But automatically restarting the application pool becomes slightly more challenging because one server could be running multiple websites, which in turn have multiple application pools. Or you could have one big application pool controlling multiple websites. It all depends on how the server and websites were configured, and you have no way of knowing.<\/p>\n<p>If your monitoring solution can monitor the application pool, it will provide the name for you. Most mature monitoring solutions do so already. Once you have the name, you can do the following:<\/p>\n<p>Example 3: Restart IIS<\/p>\n<p>Running a close second behind restarting application pools is resetting IIS. Doing so is, of course, the nuclear option of website fixes since you are bouncing all websites and all connections. Even though it's drastic, it's a necessary step in some cases.<\/p>\n<p>As with restarting application pools, getting a human involved in this incredibly simple action is a waste of everyone's time and the company's money. It's far better to automatically restart and then recheck the website a minute or two later. If all is well, the server logs can be investigated in the morning as part of a postmortem. If the website is still down, it's time to send in the troops.
<\/p>\n<p>You can restart the IIS web server in a number of ways:<\/p>\n<p>Example 4: Restart a Server<\/p>\n<p>If restarting the IIS service is the nuclear option, restarting the entire server is akin to nuclear Armageddon. Yet we all know there are times when restarting the server is the best option, given a certain set of conditions that you can monitor. Assuming your monitoring solution doesn't support a built-in capability for this function, some options include the following:<\/p>\n<p>Example 5: Restart a Service<\/p>\n<p>Occasionally, services stop. They are sometimes even services that you, as a data center professional who needs to monitor your infrastructure, care about, such as SNMP. So, you are cutting dozens of service-down alerts. Have you thought about restarting them? In some cases, a restart doesn't really help much. But in far more situations it does. Computers are funny things. After all, &#8220;Screws fall out all the time. The world is an imperfect place.&#8221; (From The Breakfast Club.)<\/p>\n<p>Sometimes, they just need a gentle nudge. If this is the case, you can do the following:<\/p>\n<p>Example 6: Back Up a Network-Device Configuration<\/p>\n<p>Everything I've gone over so far covers direct remediation-type actions. But in some cases, automation can be defensive and informational. Network-device configuration backups are a good example, in that they don't fix anything, but instead gather additional information to help you fix the issue faster.<\/p>\n<p>It's important to note that between 40 and 80 percent of all corporate-network downtime is the result of unauthorized or uncontrolled changes to network devices. These changes aren't always malicious. Often, the change simply went unreviewed by another set of eyes or an otherwise simple error slipped past the team.
<\/p>\n<p>So, having the ability to spontaneously pull a device configuration based on an event trigger is super helpful. To do so, you can use the following approach:<\/p>\n<p>There are two general cases when you may want to execute this automatic action. The first is when your monitoring solution receives a config change trap. Although the details of SNMP traps are beyond the scope of this article, you can configure your network devices to send spontaneous alerts on the basis of certain events. One of these events is a configuration change. The second is when the behavior of a device changes drastically, such as when ping success drops below 75 percent or ping latency increases. In either case, often the device is in the process of becoming unavailable. But in some situations, it's wobbly, and there's a chance to grab the configuration before it drops completely.<\/p>\n<p>In both of those situations, having the latest configuration provides valuable forensic information that can help troubleshoot the issue. It also gives you a chance to restore the absolutely last-known-good configuration, if necessary. And if it leads you to think, &#8220;Well, if I have the last-known-good configuration, why can't I just push that one back?&#8221; then you, my friend, have caught the automation bug! Run with it.<\/p>\n<p>Example 7: Reset a User Session<\/p>\n<p>Somewhere in the murky past, the first computer went online and became Node 1 in the vast network we now call the Internet. The next thing that probably happened, mere seconds later, was that the first user forgot to log off their session and left it hanging.<\/p>\n<p>For any system that supports remote connections, whether in the form of telnet\/ssh, drive mappings or RDP sessions, having the ability to monitor and manage remote-connection user sessions can make running weekly, if not daily, restarts unnecessary. 
Or at least much smoother.<\/p>\n<p>For Linux, use the who command to discover current sessions, or get greater granularity by remotely running netstat -tnpa | grep 'ESTABLISHED.*sshd'. Once you have the process ID, you can kill it. For Windows, you get the active sessions on a system using the query session \/server:&lt;servername&gt; command and disconnect the session using the reset session &lt;session name or ID&gt; \/server:&lt;servername&gt; command. Or you can use the PowerShell cmdlet Invoke-RDUserLogoff.<\/p>\n<p>Example 8: Clear DNS Cache<\/p>\n<p>At times, a server and\/or application will misbehave because it can't contact an external system. This misbehavior occurs either because the DNS cache (the list of known systems and their IP addresses) is corrupt, or because the remote system has moved. In either case, a really easy fix is to clear the DNS cache and let the server attempt to contact the system at its new location.<\/p>\n<p>In Windows, use the command ipconfig \/flushdns. In Linux, the command varies from one distribution to another, so it's possible that sudo \/etc\/init.d\/nscd restart will do the trick, or \/etc\/init.d\/dns-clean, or perhaps another command. Research may be necessary for this one.<\/p>\n<p>Hopefully at least a few of the things I've shared here and in this series on automation as a whole have inspired you to give automation a try in your data center. If so, or if you're already well on your way to automating all the things, I'd love to hear about your experiences and perspective in the comments section.<\/p>\n<p>Leading article image courtesy of Leonardo Rizzi under a Creative Commons license<\/p>\n<p>Leon Adato, SolarWinds Head Geek and long-time IT systems management and monitoring expert, discusses all things data center in this ongoing series.
<\/p>\n<p>Automation's Impact on Data Center Monitoring Alerts was last modified: February 13th, 2017 by Leon Adato<\/p>\n<p><!-- Auto Generated --><\/p>\n<p>Read this article:<\/p>\n<p><a target=\"_blank\" rel=\"nofollow\" href=\"http:\/\/www.datacenterjournal.com\/automations-impace-data-center-monitoring-alerts\/\" title=\"Automation's Impact on Data Center Monitoring Alerts - The Data Center Journal\">Automation's Impact on Data Center Monitoring Alerts - The Data Center Journal<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my last installment, I discussed a few different areas where data center monitoring automation can not only make life in the data center more convenient but also become a force multiplier. I ran out of space, however, before I ran out of ideas (the story of my life). The one thing I didn't cover was the automation you can implement in response to an alert <a href=\"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/automation\/automations-impace-on-data-center-monitoring-alerts-the-data-center-journal.php\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"limit_modified_date":"","last_modified_date":"","_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[431581],"tags":[],"class_list":["post-207917","post","type-post","status-publish","format-standard","hentry","category-automation"],"modified_by":null,"_links":{"self":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts\/207917"}],"collection":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/comments?post=207917"}],"version-history":[{"count":0,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts\/207917\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/media?parent=207917"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/categories?post=207917"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/tags?post=207917"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}