saylornotes

The Blog of Chris Saylor

    Time Warner Defeated by Munin

    November 18, 2010 engineering Chris Saylor

    Having been a Time Warner (Road Runner) customer for nearly three years, I have had my share of technical issues that required a technician to come out and rummage around in the magic cable box outside my house. Some of the worst issues to correct are the intermittent ones. I sympathize with the Time Warner techs who have to chase seemingly intangible errors, but that doesn’t excuse putting no effort into diagnosing an issue that still exists and simply isn’t presenting at the moment.

    Diligence in identifying a problem that isn’t currently happening usually wanes fast. As it happens, I run a web server in Atlanta, and that server has a monitoring tool called Munin installed. Munin gathers metrics on many aspects of a node and reports them to the Munin server, which logs and graphs the data. In this case, the Munin server is at my house.

    So how did Munin help me convince a technician that something systematic was happening? It turns out that intermittent issues become very obvious when they are graphed over the period in which they occurred. Because the Munin server at my house collects its data over my home connection, every outage leaves a hole in the graphs. I was able to show the technician exactly when, and for how long, I was without internet by pointing to the interruptions in reporting from my Munin node in Atlanta.
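    The same idea can be sketched in a few lines of Python: given the times at which samples arrived, any pair of consecutive samples that are much further apart than the polling interval marks an outage. This is only an illustration, not how Munin itself works internally. The function name find_gaps, the slack factor, the five-minute interval, and the sample times are all assumptions made up for the example.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, interval=timedelta(minutes=5), slack=1.5):
    """Return (last_seen, next_seen) pairs for every hole in the samples.

    A hole is any pair of consecutive samples spaced more than
    slack * interval apart. Both defaults are assumptions.
    """
    ordered = sorted(timestamps)
    threshold = interval * slack
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur))
    return gaps


# Hypothetical sample times: one every five minutes, with a stretch missing.
start = datetime(2010, 11, 15, 12, 0)
samples = [start + timedelta(minutes=5 * i) for i in range(6)]
samples += [start + timedelta(minutes=65 + 5 * i) for i in range(4)]

for last_seen, next_seen in find_gaps(samples):
    print(f"no data from {last_seen} to {next_seen} ({next_seen - last_seen})")
```

    The slack factor keeps a sample that arrives slightly late from being reported as an outage; only holes clearly longer than one polling interval are listed.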

    eth0 traffic graphed by week

    The gaps on the left side of the graph above make it pretty plain that something happened. One could argue that maybe there was simply no traffic going to the server during those times, though that doesn’t explain why the traffic cut off abruptly instead of tapering off. Observe exhibit B:

    Note: This article was restored from an archive and this image was lost. It depicted MySQL activity graphed by week, which showed gaps similar to those in the other examples.

    Still not convinced?

    Disk Usage graphed by week

    Disk utilization does not change that quickly on a web server, and it certainly does not drop to zero unless something has gone horribly wrong.

    Thanks to Munin, the tech acknowledged that there was a problem, quickly determined it was something on their end (hard to BS me), and scheduled a work order for a line technician to take care of the issue. Munin for the win.
