Thursday, September 15, 2011

Capturing CPU Trends with PowerCLI

Inspired by the creeping CPU that we see in Linux guests, and helped greatly by @BoerLowie at his blog, I’ve come up with a little PowerCLI script to capture CPU trends of the top consumers per cluster.
This is my first cut and will likely see changes over time, like any script should. HTML output and emailed results are the most likely candidates.
The script should be fairly self-explanatory. For each cluster, it walks all powered-on VMs and reads their OverallCpuUsage (the number that you see in the vSphere Client when selecting a cluster and then the Virtual Machines tab). It then takes the top X consumers based on that number, gets each one's average CPU usage performance statistic for a full day N days back in time, and compares it to the most recent day's.
The output looks something like this:
[Image: CPU-Trend]

So here you go:
#
#  Produce guest CPU trending from a time period back versus a shorter 
#  more immediate time frame.  e.g. 30 days ago versus past 2 days.
#
param(
    [string] $vCenter
)
 
$DaysOld = -30        # compare to full day stats this many days back
$DaysRecent = -1    # get stats for this many recent days.
$GetTop = 10        # look at top x CPU consumers
 
Add-PSSnapin VMware.VimAutomation.Core -ErrorAction SilentlyContinue
 
#if ($vCenter -eq "") {
#    $vCenter = Read-Host "VI Server: "
#}
 
#if ($DefaultVIServers.Count) {
#    Disconnect-VIServer -Server * -Force -Confirm:$false
#}
#Connect-VIServer $vCenter
 
$AllClusters = Get-Cluster
 
Foreach ($Cluster in $AllClusters) {
    Write-Host "`n$($Cluster.Name)"
    
    # Wrap in @() so that .Count is valid even when the cluster has zero or one VM
    $VMs = @(Get-Cluster $Cluster | Get-VM | `
        Where-Object { $_.PowerState -eq "PoweredOn" })
    $NumVMs = $VMs.Count
    
    # Get the Overall CPU Usage for each VM in the cluster.  Then cap that 
    # list at the top $GetTop highest for Overall CPU Usage
    $vm_list = @()
    $Count = 0
    Foreach ($vm in $VMs)
    {
        $Count += 1
        Write-Progress -Activity "Getting VM views" -Status "Progress:" `
            -PercentComplete ($Count / $NumVMs * 100)
            
        # the vSphere .Net view object has the OverallCpuUsage 
        # (VirtualMachineQuickStats)
        # http://www.vmware.com/support/developer/vc-sdk/visdk400pubs/ReferenceGuide/vim.vm.Summary.QuickStats.html
        $view = Get-View $vm
        
        $objOutput = "" | Select-Object VMName, CpuMhz
        $objOutput.VMName = $view.Name
        $objOutput.CpuMhz = $view.Summary.QuickStats.OverallCpuUsage
        $vm_list += $objOutput
    }
    # Reduce to our Top X
    # (wrapped in @() so that .Count below is valid for a single result)
    $vm_list = @($vm_list | Sort-Object CpuMhz -Descending | Select-Object -First $GetTop)

    #
    # For each of those VMs, get the statistics for past and current CPU usage
    $NumVMs = $vm_list.Count
    $Out_List = @()
    $Count = 0
    Foreach ($vm in $vm_list)
    {
        $Count += 1
        Write-Progress -Activity "Compiling CPU stats" -Status "Progress:" `
            -PercentComplete ($Count / $NumVMs * 100)
            
        # Average CPU usage over one full day, $DaysOld days back
        [Double] $ldblPerfAged = (Get-Stat -Entity $vm.VMName -Stat cpu.usage.average `
            -Start $((Get-Date).AddDays($DaysOld)) `
            -Finish $((Get-Date).AddDays($DaysOld + 1)) -ErrorAction Continue | `
            Measure-Object -Average Value).Average
        
        # Skip VMs that returned no aged statistics
        If ($ldblPerfAged -gt 0) {
            # Average CPU usage from $DaysRecent days back until now
            [Double] $ldblPerfNow = (Get-Stat -Entity $vm.VMName -Stat cpu.usage.average `
                -Start $((Get-Date).AddDays($DaysRecent)) `
                -ErrorAction Continue | Measure-Object -Average Value).Average
            # Percent change of the recent average relative to the aged average
            [Int] $lintTrend = (($ldblPerfNow - $ldblPerfAged) / $ldblPerfAged) * 100
        
            $objOutput = "" | Select-Object VMName, CpuMhz, PerfAged, PerfNow, Trend
            $objOutput.VMName = $vm.VMName
            $objOutput.CpuMhz = $vm.CpuMhz
            $objOutput.PerfAged = "{0:f2}%" -f $ldblPerfAged
            $objOutput.PerfNow = "{0:f2}%" -f $ldblPerfNow
            $objOutput.Trend = "{0}%" -f $lintTrend
        
            $out_list += $objOutput
        }
    }
 
    # Spit 'er out
    # $DaysOld is negative, so print its absolute value in the heading
    Write-Host "Top CPU Consumers Trending, $([Math]::Abs($DaysOld)) days ago vs today`n"
    $out_list | Format-Table -Property VMName, `
        @{Expression={$_.CpuMhz};Name='CPU Mhz';align='right'}, `
        @{Expression={$_.PerfAged};Name='CPU Aged';align='right'}, `
        @{Expression={$_.PerfNow};Name='CPU Now';align='right'}, `
        @{Expression={$_.Trend};Name='Trend';align='right'}
}
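
The Trend column is a plain percent change of the recent average against the aged one. As a rough sketch of that arithmetic on its own, here is a small helper; the function name Get-CpuTrendPercent and the sample numbers are made up for illustration and are not part of the script above.

```powershell
# Hypothetical helper (not part of the script above): percent change of the
# recent CPU average relative to the aged CPU average.
function Get-CpuTrendPercent {
    param(
        [double] $Aged,    # average CPU usage (%) for the older day
        [double] $Recent   # average CPU usage (%) for the recent period
    )
    # Without a baseline there is no meaningful trend; the script skips these VMs too.
    if ($Aged -le 0) { return $null }
    # The [int] cast rounds to a whole percent, as the script does.
    return [int] ((($Recent - $Aged) / $Aged) * 100)
}

# Made-up sample values
Get-CpuTrendPercent -Aged 4.0 -Recent 6.0    # 50
Get-CpuTrendPercent -Aged 8.0 -Recent 6.0    # -25
```

A VM that averaged 4% a month ago and 6% now shows a trend of 50%, while one that dropped from 8% to 6% shows -25%.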

Wednesday, September 7, 2011

Linux Guest CPU Creep

We run a lot of tiny VMs on vSphere 4 in a rather unique environment.  The densities are high and the guest OS is the officially unsupported Fedora Core 8 (2.6.26 kernel). This causes us to be more tolerant of aberrations.

The biggest aberration of note has been CPU creep.  The tiny guests will run along just fine using 30 - 40 MHz of CPU and then start a slow upward trend that creeps along over the course of a week.  No useful perspective can be gained from within the guest using traditional means.  More interestingly, a guest-initiated reboot reveals a slow crawl all the way through the BIOS at boot, and CPU usage never dips below the new, higher baseline.  The guests are stuck, and only a reset from the vSphere Client resolves the issue.

This has been acceptable so far.  The guests are stateless, only a few are impacted at any one time, and no one guest is critical by itself.  We automated the remediation, became accustomed to it, and moved on.  The issue has stuck to one functional cluster and persisted across minor vSphere 4 upgrades.

Becoming accustomed caused us to miss another occurrence.

The software architects have been busy troubleshooting the core application running in a separate vSphere cluster on Ubuntu Server 8.04 LTS (2.6.24 kernel).  CPU has been creeping slowly up for the past couple of months with a marked recent acceleration. We’ve been attributing it to increased load as we grow.  The software was optimized, yet CPU usage stayed steadily on its upward path.

[Image: MQ-Creeping]

The solution:

Stop all running processes, verify a higher than expected CPU load, and reset the VM.  CPU usage is now down substantially.

[Image: MQ-Creeping2]

In a small shop with few resources and too many projects, it’s time to implement trending alerts.

Have you experienced this behavior before?