Tuning vCOps for your environment – Part 2 – Badge Tuning

Welcome to Part 2 on Tuning vCOps for your Environment – Badge Tuning. If you missed my first post, on Alerts, it can be found here.

For Part 2 we are going to focus on adjusting badge thresholds in vCOps and why it is necessary for a noise-free vCOps environment.
In case you are a little rusty, vCOps is essentially broken down into 3 Major badges and 8 Minor badges. The Minor badges are used to make up the score of the Major badges.

They are broken down as follows:

  • Health
    • Workload
    • Anomalies
    • Faults
  • Risk
    • Time Remaining
    • Capacity Remaining
    • Stress
  • Efficiency
    • Reclaimable Waste
    • Density

[Image: Major and Minor badges]

By default the various Major and Minor badges change state at certain levels. E.g. the Workload badge for VMs will be Green at 0-79, Yellow at 80-89, Amber at 90-94 and Red at 95-100. It is important to note that as the badge progresses through the various alert levels, a badge alert is raised at each one. E.g. a VM with 100% Workload due to high CPU will generate 3 badge alerts. Now if this is abnormal behavior it is probably a good thing that alerts are generated, because it probably warrants your attention. However, what if this is normal behavior and you don’t care? Well, then it is just noise that distracts you from other potential issues that may be occurring.
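To make the threshold behaviour concrete, here is a minimal Python sketch of the default VM Workload levels. It is only an illustrative model, not a vCOps API: the function and constant names are made up for this example, and treating a score of exactly 80 as Yellow is an assumption.

```python
# Illustrative model only; none of these names come from vCOps.
# Default VM Workload thresholds: the score at which each state begins.
DEFAULT_THRESHOLDS = [(80, "Yellow"), (90, "Amber"), (95, "Red")]


def badge_state(score, thresholds=DEFAULT_THRESHOLDS):
    """Return the badge colour for a score between 0 and 100."""
    state = "Green"
    for level, colour in thresholds:
        if score >= level:
            state = colour
    return state


def alerts_raised(score, thresholds=DEFAULT_THRESHOLDS):
    """Return every state crossed on the way up to score (one badge alert each)."""
    return [colour for level, colour in thresholds if score >= level]


print(badge_state(100))    # Red
print(alerts_raised(100))  # ['Yellow', 'Amber', 'Red']
```

A VM sitting at 100% Workload crosses all three levels, which is where the 3 badge alerts come from.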

So Chris how do I deal with this tromboner?
Well let me give you a classic scenario and then the steps needed to resolve the situation.

Scenario:
Every Datastore Object is generating a Workload Alert and a yellow or amber Workload badge, as per the example screenshots below.

[Image: Disk Space Alert]
[Image: Datastore Operations view]

Why is this occurring?:
This is a common situation for Datastores, as the Workload Badge for Datastores is composed of the derived attributes Disk Space and Disk I/O. The reason this is so common is most Designers fill their VMFS Datastores to 80% or 90% utilization before marking the Datastore as full.

As the Disk Space Metric in my example is at 89% it has tripped the Yellow (Warning) badge for Workload (as Workload will be set by the most constraining resource). This in turn has generated an Alert which I now have to pay attention to.
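As a rough sketch of "most constraining resource", the Workload score can be modelled as the highest of the contributing percentages. This is a simplified illustration, not how vCOps computes it internally; the function name and the Disk I/O figure are invented for the example, and only the 89% Disk Space value comes from my scenario.

```python
# Simplified illustration; the real derivation inside vCOps may differ.
def workload_score(resource_usage):
    """Return (score, resource) for the most constraining resource."""
    resource = max(resource_usage, key=resource_usage.get)
    return resource_usage[resource], resource


# Disk Space is 89% as in the scenario; the Disk I/O value is hypothetical.
datastore = {"disk_space": 89, "disk_io": 35}

score, constraint = workload_score(datastore)
print(score, constraint)  # 89 disk_space
```

Even with very little I/O, the Datastore's Workload sits at 89 purely because of how full it is.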

What should we not do?:
You may be thinking “I know, just disable the Alert like you explained in Part 1”. However, in this case that is not appropriate. As I explained in Part 1, disabling the Alert does not affect the Badge state. Simply disabling the alert would leave the Workload badge yellow or amber, and it would therefore keep affecting the health score of the Datastore and of its parent objects. As such my Health heatmap for my Cluster, vCenter or World will still contain a sea of red.

So why not simply apply the built in vCOps Policy “Ignore these Objects” to all Datastores?
Although this would work to a degree it would also remove all other badges on the Datastores that we still care about. Anomalies for example are incredibly useful as well as Faults. Applying this policy would disable those unnecessarily.

What should we actually do?:
The answer is – Create a Datastore Group -> Create an All Datastores Policy with the Workload Badge adjusted -> Apply the new Policy to the group.

I will cover creating and applying policies in detail in my next post; however, in essence a new Policy should be created and applied to reflect the environment.

The new Policy can have the Warning and Immediate thresholds disabled altogether (simply left click on the slider box), and the Critical threshold still enabled at 95%. This will ensure that if someone over-provisions a Datastore beyond normal policy, it is alerted on.
[Image: New Datastore Policy badge thresholds]

After this has been applied you can see the result ->
[Image: Fixed Datastore Workload]

The Workload Badge is now Green and the Alert has disappeared automatically.
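The effect of the new Policy can be sketched in a few lines of Python. This is a toy model rather than anything vCOps exposes: the dictionary layout and names are hypothetical, and I am assuming the default Datastore levels match the 80/90/95 values quoted earlier for VMs.

```python
# Toy model of a policy; a level of None means that threshold is disabled.
DEFAULT_POLICY = {"Warning": 80, "Immediate": 90, "Critical": 95}
ALL_DATASTORES_POLICY = {"Warning": None, "Immediate": None, "Critical": 95}


def badge_state(score, policy):
    """Return the highest enabled state that the score has reached."""
    state = "Green"
    for name in ("Warning", "Immediate", "Critical"):
        level = policy.get(name)
        if level is not None and score >= level:
            state = name
    return state


print(badge_state(89, DEFAULT_POLICY))         # Warning
print(badge_state(89, ALL_DATASTORES_POLICY))  # Green
print(badge_state(96, ALL_DATASTORES_POLICY))  # Critical
```

The Datastore at 89% goes back to Green, while one that is over-provisioned past 95% still goes Critical.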

Final Word:
Although this example was a Datastore, there may be dozens of other objects in your environment that need a similar adjustment, e.g. VMs that constantly run at 100% CPU, Mail Servers with very high disk I/O, etc… In these cases they probably warrant a specific policy which disables (or better yet adjusts) badges as your environment dictates.

In my next post we will discuss Policy and Intelligent Operations Group creation to simplify and automate applying policies for these sorts of scenarios.


6 Comments

  1. Ronny

     /  January 22, 2014

    great article, Chris!

    I wanted to add my two cents:
    I’d also recommend to create a custom group for datastore objects. I’d also adjust the thresholds similar to the datastore disk usage alerts in vCenter to avoid any confusion around diskspace usage.

    Great series of blog posts, keep them coming!

    Ronny

    • Chris Slater

       /  January 22, 2014

      Hi Ronny,
      Great comment and that is something I should have mentioned. Yes for something like Disk space the badge should match your normal design policies as well as any vCenter alarms that might also report on a similar metric. This will ensure both systems are not alarming at different levels.
      Chris

  2. Chris,

    Would you still want to look at vCenter alarms once you have vCOps in place? Won’t this cause an overlap? Moreover, one of the reasons someone would have vCOps is to ensure that they do not have to worry about a number of alerts based on raw metrics, and instead get alerts based on derived metrics. What is your opinion on this?

    • Chris Slater

       /  January 29, 2014

      Hi Sunny,
      Some great questions that I missed in my posts so far so thanks for bringing them up.
      Q: Would you want to still look at vCenter alarms once you have vCOps in place?
      A: Great question. As you know, vCOps surfaces vCenter alerts as Faults, which need to be manually cleared, unlike Smart Alerts, which clear automatically. Although this does overlap with vCenter, I find I can now simply observe and act on these alerts in vCOps without the need to have a vSphere client window open all the time. Although this may seem a bit of a duplication of alerts, it is an important step in making vCOps your central operations platform. In the future you may have all sorts of environments in vCOps through Management Packs, such as Storage and Network devices, which will give you a central point to receive alerts.

      Q: Moreover, one of the reasons someone would have vCOps is to ensure that they do not have to worry about a number of alerts based on raw metrics. Instead get alerts based on derived metrics. What is your opinion on this?
      A: This is in my opinion one of the reasons why vCOps is so great. Instead of opening the vSphere client and checking CPU Utilization, CPU Ready, CPU Co Stop, etc… and finding out what acceptable levels are, I can look at one derived metric “Demand” which is common across multiple objects.

      Chris

  3. Jay Rogers

     /  October 8, 2014

    Thank you SO MUCH for this series. I have been saying they need a book or whitepaper on this, or even better a VMworld session on this. Everyone needs skill around this stuff to calm their environments down to the important stuff, like you mentioned.

  4. Raj_mh

     /  January 24, 2015

    Hello All,
    I’m new to vCOPS & wanted your assistance in the configuration of alerting. I can see vCOPS is reporting multiple alerts like “High number of metrics outside their normal bounds: 34 abnormal metrics out of 218 metrics monitored.” Is there a way I can configure alerts based on the actual issue, like CPU/RAM etc…?

    Rajesh
