We run a lot of tiny VMs on vSphere 4 in an unusual environment. The densities are high, and the guest OS is Fedora Core 8 (2.6.26 kernel), which is officially unsupported. This makes us more tolerant of aberrations.
The most notable aberration has been CPU creep. The tiny guests run along just fine using 30 to 40 MHz of CPU and then start a slow upward trend that builds over the course of a week. Traditional tools inside the guest offer no useful insight. More interesting, a guest-initiated reboot crawls slowly all the way through the BIOS, and CPU usage never dips below the new, elevated baseline. The guests are stuck in this state, and only a reset from the vSphere client resolves the issue.
This has been acceptable so far. The guests are stateless, only a few are affected at any one time, and no single guest is critical on its own. We automated the remediation, grew accustomed to it, and moved on. The issue has been confined to one functional cluster and has persisted across minor vSphere 4 upgrades.
Growing accustomed to it caused us to miss another occurrence.
The software architects have been busy troubleshooting the core application, which runs in a separate vSphere cluster on Ubuntu Server 8.04 LTS (2.6.24 kernel). Its CPU usage has been creeping up for the past couple of months, with a marked recent acceleration. We had been attributing this to increased load as we grow. The software was optimized, yet CPU usage stayed steady on its upward path.
So we stopped all running processes, verified that CPU load remained higher than expected, and reset the VM. CPU usage is now down substantially.
In a small shop with few resources and too many projects, it’s time to implement trending alerts.
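A trending alert can be as simple as fitting a line to recent CPU samples and flagging any VM whose slope exceeds a threshold. The sketch below is a minimal illustration under assumptions of our own: samples arrive as (hour, MHz) pairs exported from whatever monitoring is already in place, and the function names `trend_slope` and `should_alert` and the 5 MHz/day threshold are hypothetical, to be tuned per cluster.

```python
from statistics import mean


def trend_slope(samples):
    """Least-squares slope of (hour, mhz) samples, in MHz per hour."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    xs = [t for t, _ in samples]
    ys = [v for _, v in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    # Ordinary least squares: covariance(x, y) / variance(x).
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    if den == 0:
        raise ValueError("samples need distinct timestamps")
    return num / den


def should_alert(samples, max_mhz_per_day=5.0):
    """Flag a VM whose CPU usage trends upward faster than the threshold.

    The default threshold is a hypothetical placeholder, not a tuned value.
    """
    return trend_slope(samples) * 24 > max_mhz_per_day


if __name__ == "__main__":
    flat = [(h, 35.0) for h in range(48)]           # steady 35 MHz guest
    creeping = [(h, 35.0 + h) for h in range(48)]   # +1 MHz per hour
    print(should_alert(flat), should_alert(creeping))
```

A flat guest never alerts, while one climbing 1 MHz per hour (24 MHz per day) does. Alerting on the slope rather than an absolute ceiling is the point: a creep like the one above stays well under any sensible static CPU limit for days before anyone notices.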
Have you experienced this behavior before?