The Blog of Chris Saylor

    Time Warner Defeated by Munin

    November 18, 2010 engineering Chris Saylor

    Having been a Time Warner (Road Runner) customer for nearly three years now, I have had my share of technical issues that required a technician to come out and rummage around in the magic cable box outside my house. Some of the worst issues to correct are the intermittent ones. I share the pain of the Time Warner techs when dealing with seemingly intangible errors, but that doesn’t mean you get to devote no effort to diagnosing an issue that still exists and just isn’t presenting at the moment.

    Diligence in identifying a problem that isn’t currently happening usually wanes fast. I just happen to run a web server in Atlanta, and that server happens to have a monitoring tool called Munin installed. Munin is a tool that gathers data on many aspects of a node and sends it back to the Munin server, where it is logged and graphed. In this case, the Munin server is at my house.

    So how did Munin help me convince a technician that something systematic was happening? It turns out that intermittent issues become very obvious when they are graphed over the period in which they occurred. I was able to demonstrate to the technician exactly when and for how long I was without internet by showing him the interruptions in reporting from my Munin node in Atlanta.

    eth0 traffic graphed by week

    The gaps on the left side of the graph above make it pretty plain that something happened. One could argue that maybe there just was no traffic going to the server during those times, though that doesn’t really explain why the traffic drops out suddenly instead of tapering off. Observe exhibit B:

    Note: This article was restored from archive and this image was lost. It depicted MySQL activity graphed by week which showed a similar gap as other examples.

    Still not convinced?

    Disk Usage graphed by week

    Disk usage does not change that quickly on a web server, and it certainly does not go to zero unless something is horribly wrong.
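
    The same kind of gap can also be found without eyeballing a graph. Here is a minimal sketch in Python, assuming you have a list of timestamps at which samples were successfully recorded. The find_gaps function, the sample data, and the tolerance value are hypothetical and only for illustration; they are not part of Munin. The 5-minute interval is assumed because that is Munin's default polling period.

```python
from datetime import datetime, timedelta

# Assumption: samples arrive every 5 minutes (Munin's default polling period).
EXPECTED_INTERVAL = timedelta(minutes=5)


def find_gaps(timestamps, expected=EXPECTED_INTERVAL, tolerance=2.0):
    """Return (start, end, duration) for each stretch with missing samples.

    A gap is reported when two consecutive samples are more than
    `tolerance` times the expected interval apart.
    """
    ordered = sorted(timestamps)
    threshold = expected * tolerance
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > threshold:
            gaps.append((previous, current, current - previous))
    return gaps


# Hypothetical sample times: steady 5-minute polling with one 40-minute hole.
start = datetime(2010, 11, 15, 12, 0)
samples = [start + timedelta(minutes=5 * i) for i in range(6)]        # 12:00 to 12:25
samples += [start + timedelta(minutes=65 + 5 * i) for i in range(6)]  # 13:05 to 13:30

for gap_start, gap_end, duration in find_gaps(samples):
    print(f"no data from {gap_start:%H:%M} to {gap_end:%H:%M} ({duration})")
```

    A gap on its own only says that no data was recorded for the node. It is the same gap showing up across unrelated graphs (traffic, MySQL, disk usage) that points at the connection rather than at the server.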

    Thanks to Munin, the tech acknowledged that there was a problem, quickly determined it was something on their end (hard to BS me), and scheduled a work order for a line technician to take care of the issue. Munin for the win.
