Thursday, September 15, 2011
This is my first cut of the script, and it will likely change over time, as any script should. HTML output and emailed results are the most likely additions.
The script should be fairly self-explanatory. For each cluster, it traverses all VMs and reads their OverallCpuUsage (the number you see in the vSphere Client when you select a cluster and then the Virtual Machines tab). It takes the top X consumers by that number, retrieves each one’s average CPU usage performance statistic from N days back, and compares it to today’s.
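To make the selection and comparison steps concrete, here is a minimal Python sketch of the logic only. It is not the script from this post and it makes no vSphere calls: the `VmSample` record, its field names, and the functions `top_consumers` and `cpu_change` are all hypothetical, and the sketch assumes the usage numbers have already been collected by some other means.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VmSample:
    """Hypothetical record holding pre-collected numbers for one VM."""
    name: str
    cluster: str
    overall_cpu_mhz: int     # current OverallCpuUsage value
    avg_cpu_mhz_then: float  # average CPU usage N days back
    avg_cpu_mhz_now: float   # average CPU usage today


def top_consumers(vms: List[VmSample], top_x: int) -> Dict[str, List[VmSample]]:
    """Group VMs by cluster and keep the top_x highest OverallCpuUsage in each."""
    by_cluster: Dict[str, List[VmSample]] = {}
    for vm in vms:
        by_cluster.setdefault(vm.cluster, []).append(vm)
    return {
        cluster: sorted(members, key=lambda v: v.overall_cpu_mhz, reverse=True)[:top_x]
        for cluster, members in by_cluster.items()
    }


def cpu_change(vm: VmSample) -> float:
    """Difference between today's average and the average N days back, in MHz."""
    return vm.avg_cpu_mhz_now - vm.avg_cpu_mhz_then
```

A report would then loop over the result of `top_consumers` and print `cpu_change` for each VM. Sorting on the current value first keeps the number of historical lookups small, which is presumably why the top X filter comes before the comparison.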
The output looks something like this:
So here you go:
Wednesday, September 7, 2011
We run a lot of tiny VMs on vSphere 4 in a rather unique environment. The densities are high and the guest OS is Fedora Core 8 (2.6.26 kernel), which is officially unsupported. This causes us to be more tolerant of aberrations.
The biggest aberration of note has been CPU creep. The tiny guests will run along just fine using 30-40 MHz of CPU and then start a slow upward trend that continues over the course of a week. No useful perspective can be gained from within the guest using traditional means. More interesting, a guest-initiated reboot reveals a slow crawl all the way through the BIOS at boot, and CPU usage never dips below the new, elevated baseline. The guests are stuck, and a reset from the vSphere Client resolves the issue.
This has been acceptable so far. The guests are stateless, only a few are impacted at any one time, and no one guest is critical by itself. We automated the remediation, became accustomed, and moved on. The issue has stuck to one functional cluster and persisted across minor vSphere 4 upgrades.
Becoming accustomed caused us to miss another occurrence.
The software architects have been busy troubleshooting the core application running in a separate vSphere cluster on Ubuntu Server 8.04 LTS (2.6.24 kernel). CPU usage has been creeping slowly up for the past couple of months, with a marked recent acceleration. We’ve been attributing it to increased load as we grow. The software was optimized, yet CPU usage stayed on its upward path.
The test: stop all running processes, verify a higher than expected CPU load, and reset the VM. After the reset, we’re down substantially.
In a small shop with few resources and too many projects, it’s time to implement trending alerts.
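One simple form a trending alert could take is a least-squares slope over recent samples, flagged when it exceeds a threshold. The Python sketch below is an illustration under assumptions, not something we run: the function names, the evenly spaced samples, and the threshold value are all hypothetical.

```python
from typing import Sequence


def slope_per_sample(values: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced samples, in units per sample."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_creeping(values: Sequence[float], threshold: float) -> bool:
    """True when usage trends upward faster than the (assumed) threshold."""
    return slope_per_sample(values) > threshold
```

Fed one average per day for each VM, `is_creeping(daily_averages, threshold=5.0)` would flag a guest gaining more than 5 MHz per day. The right threshold and window depend on how noisy the guests are, so treat both as tuning knobs.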
Have you experienced this behavior before?