Solved: VMware Performance Event Triggers use case questions.

Fred Lensink · ‎09-28-2012

1) Can the CPO VMware Adapter Performance Event Triggers and their SAMPLE SIZE, INTERVAL, and CONDITIONS attributes be configured for the following use case?

2) Can VMware Performance Event Triggers be CORRELATED, and if so, would correlation be needed to satisfy the use case?

3) Would an additional performance monitoring tool be required to satisfy the use case?

The use case: if a VM's CPU or RAM has been at 80% for 2 hours, then trigger a workflow

and

If a VM CPU or RAM has been at 60% for X days then trigger a workflow

derevan · ‎09-29-2012

I will first describe how you could implement a correlation method, but ultimately, I'm not sure it is really necessary for the use cases described.

To correlate VMware performance events over a designated timeframe, you would create a correlation process that tracks the underlying performance event and decides whether to start the process you actually want to run once the correlated condition has been met.

Here's how it works.

First, create a global table with three columns:

1) Virtual Machine path

2) Consecutive Trigger Count

3) Last Trigger time

Create a process that is triggered by a VMware performance event (such as Memory Avg > 80%), and set a sample size and interval that make sense for the timeframe of interest (2 hours or X days). For example, a sample size of 10 and an interval of 30 seconds (a 5 minute time slice) is a reasonable time slice from vCenter for a 2 hour timeframe. With that slice, 24 consecutive triggers are required to raise the actual event of interest (2 hours divided by 5 minutes).

That is, the formula is:

Timeframe / (Sample Size * Interval) = Consecutive Trigger Count
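To check the arithmetic, here is a small Python sketch that divides the timeframe by the time slice (sample size times interval). The function name and the choice to round up are my own assumptions for illustration; nothing here is part of the adapter.

```python
import math


def consecutive_trigger_count(timeframe_s: int, sample_size: int, interval_s: int) -> int:
    """Number of back-to-back triggers needed to cover the timeframe.

    Hypothetical helper: one trigger covers sample_size * interval_s seconds.
    """
    time_slice_s = sample_size * interval_s
    # Round up so a partial slice still counts as a required trigger.
    return math.ceil(timeframe_s / time_slice_s)


# 2 hours with 10 samples of 30 seconds (a 5 minute slice)
print(consecutive_trigger_count(2 * 3600, 10, 30))  # 24
```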

The correlation process triggered by the raw VMware event does the following:

1) If there is no existing entry in the global table for the VM, add entry to table with count = 1 and current time

2) If an entry exists, check the current time against the last trigger time

a) If it is the next interval, for example, the current time is 12:15 and the last trigger time was 12:10 (see note below),

Increment the counter and set the last trigger time

If count = 24, then run the process that handles the "VM Memory Avg > 80% for at least 2 hours" event and delete the entry from the table.

If count < 24, the process exits having only incremented the counter and set the last trigger time

b) If the current time is later than the last trigger time + time slice (+ a little padding), the run of consecutive triggers was broken, so set the counter back to 1 and set the last trigger time to the current time

Note: When comparing the current time with the last trigger time, you should pad to account for slight processing delays, so compare current-time < last-time + (time-slice * 2). Anything less than 10 minutes in this case would be considered a consecutive trigger. You could also compare current-time < last-time + time-slice + 1, which is probably also safe.
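The steps and the padding note can be sketched roughly like this in Python. It is only an illustration of the logic, not CPO workflow code: the Correlator class, the Entry record, and the on_sustained callback are hypothetical names, and an in-memory dict stands in for the global table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict


@dataclass
class Entry:
    count: int               # Consecutive Trigger Count
    last_trigger: datetime   # Last Trigger time


class Correlator:
    """Hypothetical stand-in for the correlation process and its global table."""

    def __init__(self, required_count: int, time_slice: timedelta,
                 on_sustained: Callable[[str], None]) -> None:
        self.table: Dict[str, Entry] = {}   # keyed by Virtual Machine path
        self.required_count = required_count
        self.time_slice = time_slice
        self.on_sustained = on_sustained

    def on_raw_event(self, vm_path: str, now: datetime) -> bool:
        """Handle one raw performance event; return True if the handler ran."""
        entry = self.table.get(vm_path)
        if entry is None:
            # Step 1: first trigger seen for this VM
            self.table[vm_path] = Entry(count=1, last_trigger=now)
            return False
        # Step 2a: consecutive if within the padded window (time slice * 2)
        if now < entry.last_trigger + self.time_slice * 2:
            entry.count += 1
            entry.last_trigger = now
            if entry.count >= self.required_count:
                del self.table[vm_path]
                self.on_sustained(vm_path)
                return True
            return False
        # Step 2b: gap too large, start counting again
        entry.count = 1
        entry.last_trigger = now
        return False


# Usage: 24 triggers, 5 minutes apart, for one VM (the path is made up)
fired = []
correlator = Correlator(24, timedelta(minutes=5), fired.append)
start = datetime(2012, 9, 29, 12, 0)
for i in range(24):
    correlator.on_raw_event("datacenter/vm/web01", start + timedelta(minutes=5 * i))
print(fired)  # ['datacenter/vm/web01']
```

In a real deployment the table would need to survive across process runs, which is why the global table is used rather than process-local state.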

Now, having gone through all that, you may not need the correlation process after all. Simply make your sample size large enough to cover your ultimate timeframe, and you can trigger your event handler directly without the need to correlate. So for the 2 hour window, just set a sample size of 240 (240 * 30 seconds = 2 hours). This may or may not work depending on how performance metrics have been configured on the server (sample size, intervals, and how much is saved). You can only set your own sample size and interval to multiples of those configured values, so be careful and refer to the VMware documentation when relying on such metrics.
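For the direct approach, the sample size is just the timeframe divided by the interval. A rough Python sketch follows; the function name and the server_interval_s parameter are hypothetical, and the server's configured interval is something you would have to look up for your own vCenter rather than a value I can vouch for.

```python
def direct_sample_size(timeframe_s: int, interval_s: int, server_interval_s: int) -> int:
    """Sample size needed to cover the whole timeframe in a single trigger.

    server_interval_s is an assumed input: the collection interval
    configured on the server, which your own interval must be a multiple of.
    """
    if interval_s % server_interval_s != 0:
        raise ValueError("interval must be a multiple of the server's configured interval")
    if timeframe_s % interval_s != 0:
        raise ValueError("timeframe is not a whole number of intervals")
    return timeframe_s // interval_s


# 2 hour window sampled every 30 seconds
print(direct_sample_size(2 * 3600, 30, 30))  # 240
```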

You may find that the correlation method I first described is more reliable, especially for longer timeframes such as X days (where you need to sample hourly rather than every 5 minutes).

In any case, I think you can do what is asked without external monitoring, but it will require some experimentation and a deeper knowledge of how performance metric sampling works for ESX.

Fred Lensink · ‎10-16-2012

Thank you for your detailed explanation; it was very valuable.