I have defined an incident with a count measure as its condition. The count measure has a lower warning threshold (I would like to see an alert if the count drops below that threshold). The evaluation timeframe on the incident is one minute.
If I chart the measure with thresholds shown and the chart resolution also set to one minute, I see no readings below the threshold; however, the incident rule is incorrectly triggered, and frequently. It appears that the incident evaluation timeframe of one minute is being ignored, that the incident is evaluating much more often (every 10 seconds, maybe?), and that it thus acts on a reading below the lower warning threshold.
I'm trying to determine whether I'm misunderstanding something or have encountered a bug of some kind. It seems to me that if I've configured the incident to evaluate over one minute, it should work as I expect.
Thanks for any insight!
Answer by Rob V. ·
Yeah, I'd agree with Rick that as you're trying to figure out what your "real" threshold is, looking at the highly granular data (as I did with the 10-second display resolution) is often very insightful. As Rick mentions, to get a clearer idea of where you are "exposed" to potential incidents, I'd change the resolution on the chart to perhaps 1 minute and see how that looks.
Good luck with it!!
Rob
Answer by Andrew W. ·
OK, so it sounds like I'm asking too much to base an incident on a low count of this web request during a 60-second period. That's fine; I can work with that if it's the best way to accomplish this. I'll go for a 5-minute evaluation window and see if I can nail down the threshold appropriately.
This is what my chart looks like at 5-minute resolution:
Keep in mind that if traffic is "spiky" rather than steady, you'll want to infer your threshold from a higher-resolution chart (1 minute or 10 seconds) rather than basing it strictly on a 5-minute chart; the sketch below shows why. Based on your previous 1-minute chart, the low end of your "on the minute" data points was 10/minute, whereas here, because you're looking "on the 5 and 10", it looks like the minimum you're seeing at steady state is over 100 per 5-minute interval (roughly 20-25 per minute).
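Here's a rough sketch of that effect, using made-up 10-second counts rather than anything from your system, just to show how the quietest bucket shifts with the chart resolution:

```python
# Hypothetical spiky traffic (NOT data from AppMon): one hour of 10-second
# order counts, mostly quiet slices with occasional bursts.
import random

random.seed(1)
slices = [random.choice([0, 0, 0, 1, 1, 2, 8]) for _ in range(6 * 60)]

def minimum_per_bucket(slices, slices_per_bucket):
    """Minimum total over consecutive, aligned buckets (what a chart shows)."""
    totals = [sum(slices[i:i + slices_per_bucket])
              for i in range(0, len(slices), slices_per_bucket)]
    return min(totals)

min_1m = minimum_per_bucket(slices, 6)    # 1-minute chart resolution
min_5m = minimum_per_bucket(slices, 30)   # 5-minute chart resolution

print(f"quietest 1-minute bucket : {min_1m}")
print(f"quietest 5-minute bucket : {min_5m} (~{min_5m / 5:.1f}/minute on average)")
# With spiky traffic, min_5m / 5 usually sits well above min_1m, so a threshold
# inferred from the 5-minute chart tends to be too aggressive for a 1-minute window.
```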
Incidents are often like this with live data; they need some tuning after they're first defined to be appropriately sensitive.
Hope that helps,
Rick B
Answer by Rob V. ·
I don't know that I'd characterize the 10-second granularity as a problem. There is an effect, though: you get many more "minutes" than you expect. You get a new "minute" every 10 seconds. For example, at 1:15:00 you evaluate from 1:14:00 to 1:15:00; at 1:15:10 you evaluate from 1:14:10 to 1:15:10, and so on. So instead of 60 minute-intervals per hour, you're getting 6 times that, or 360 minute-intervals per hour. That makes it much more likely that you'll catch a slow stretch of 10-second samples (e.g. a minute interval with 0, 0, 1, 0, 1, 0 orders) that causes the threshold of 2 to kick off; the little sketch below shows the effect with invented numbers.
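To make that concrete, here's a small sketch (invented 10-second counts, nothing pulled from dT) showing how both aligned chart minutes can stay above the threshold while one of the many rolling windows still dips to it:

```python
# Sketch of the rolling evaluation described above, using invented counts.
slices = [3, 1, 0, 0, 0, 0,   # aligned minute 1: sum 4 (chart looks fine)
          0, 0, 1, 0, 1, 3]   # aligned minute 2: sum 5 (chart looks fine)

THRESHOLD = 2  # lower warning: condition met when the sum over the window is <= 2

# What the 1-minute chart shows: two aligned buckets of six 10-second slices.
aligned = [sum(slices[i:i + 6]) for i in range(0, len(slices), 6)]

# What the incident engine effectively checks: a one-minute window starting
# every 10 seconds.
rolling = [sum(slices[i:i + 6]) for i in range(len(slices) - 6 + 1)]

print("aligned minute sums :", aligned)    # both above the threshold
print("rolling minute sums :", rolling)
print("window starts (seconds) at/below threshold:",
      [i * 10 for i, s in enumerate(rolling) if s <= THRESHOLD])
# A new window starts every 10 seconds, so an hour has 360 candidate "minutes"
# instead of 60, and a quiet stretch straddling a chart-minute boundary is caught.
```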
So recognizing that, my only substantive suggestion would be to consider either raising your threshold in light of how this works and what your real expectation is, or raising your evaluation timeframe to 5 minutes (perhaps with an increased threshold as well, maybe 6?) to help smooth things out a little. Another off-the-cuff thought is to consider what you expect your AVERAGE order rate to be. In that case, think about making the evaluation timeframe 15 minutes and work out what an acceptable average aggregation would be for you instead of using a sum; there's a sketch of both ideas below. Just a thought.
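If it helps to play with the idea, here's a minimal sketch (invented traffic again; the specific thresholds of 6 and 1/minute are just placeholders, not recommendations) comparing how often each variant of the condition would trip over the same data:

```python
# Compare three incident conditions over the same hypothetical 10-second counts.
import random

random.seed(7)
# Three hours of bursty but generally nonzero traffic.
slices = [random.choice([0, 0, 1, 1, 2, 6]) for _ in range(6 * 60 * 3)]

def rolling_windows(slices, window_slices):
    """All windows of the given length, advancing one 10-second slice at a time."""
    return [slices[i:i + window_slices] for i in range(len(slices) - window_slices + 1)]

def count_breaches(window_slices, fires):
    """How many rolling windows would satisfy the given incident condition."""
    return sum(1 for w in rolling_windows(slices, window_slices) if fires(w))

# 1-minute window, sum <= 2 (the current definition).
print("1 min,  sum <= 2     :", count_breaches(6, lambda w: sum(w) <= 2))
# 5-minute window, sum <= 6 (longer timeframe, scaled-up threshold).
print("5 min,  sum <= 6     :", count_breaches(30, lambda w: sum(w) <= 6))
# 15-minute window, average rate below 1 order/minute (average instead of sum).
print("15 min, avg < 1/min  :", count_breaches(90, lambda w: sum(w) / 15 < 1))
```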
Rob
Answer by Andrew W. ·
I think I basically follow what you're saying: it's working as designed, which means the thresholds for this type of measure must be set appropriately for a 10-second evaluation granularity. The evaluation timeframe is simply telling the system how often to review all the 10-second samples since the previous evaluation.
So unfortunately, the problem remains: I'd like to report an incident if there are fewer than 2 orders (counts of a particular web request) in a one-minute interval. I can't really work in terms of a 10-second interval, because there are 10-second intervals in which I would expect 0 orders.
Any thoughts on how to set up the incident I'm trying for?
Hi Andrew,
You're close, but what Rob is getting at is that the one-minute interval is evaluated every 10 seconds (as you originally suspected), and the 1-minute resolution chart is deceiving because it only shows, for instance, 9:00:00 to 9:01:00, whereas the incident evaluation could fire for 9:00:30 to 9:01:30. I think what you have is an overly sensitive alert, in which case I usually advise my customers to try a longer evaluation timeframe. Would 10 requests across 5 minutes work for you?
Rick B
Answer by Rob V. ·
Hi Andrew,
I did some testing of this exact scenario in my own system, and I think I have the explanation. For me, it works exactly as designed. The way this works is that the one-minute evaluation period is a rolling minute: every 10 seconds (our evaluation granularity), the previous minute is evaluated to see if the condition is met (in your case, the sum of your order counts over the past minute being <= 2).
It looks from your screenshot like you have the display granularity set to 1 minute. Things become a lot clearer if you set the display granularity to 10 seconds (which happens to correspond to our incident-processing granularity). If you do that, I think you'll find that the incidents you're seeing make sense.
For me, at a 1-minute display granularity, I had "weird" sections of alerts being shown (as indicated by the yellow "heat fields" at the top of the chart):
There were incidents where it seemed they shouldn't be, and none where it appeared there should be.
If I take that chart and change the display resolution to 10 seconds, it becomes much clearer what's going on. I can tell from missing data points (long lines connecting two dots more than 10 seconds apart) that there were times when I had no orders to process (so that "slice" was zero), which caused that particular rolling minute to have a total sum of less than 2:
Here you can see the incidents line up nicely with the times when the sum for that rolling minute would have been less than 2. The long connecting lines between two points are an indication that during some 10-second slices there was a count of 0. The best example of that is just after 15:56 (where the sum was exactly 2) to the next plotted data point at 15:58 (where the sum was 1); the points in between were zeros. The incident shows up after about a minute of that, as expected.
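For anyone who wants to reproduce the arithmetic, here's a rough reconstruction in Python; the timestamps and counts are invented, loosely mirroring the 15:56 to 15:58 stretch above, and missing chart points are treated as zero-count slices before taking the rolling sums:

```python
# Fill the gaps between plotted 10-second points with zeros, then compute the
# rolling one-minute sums the way the incident evaluation effectively does.
from datetime import datetime, timedelta

plotted = {  # only the slices that actually had orders appear on the chart
    datetime(2014, 1, 15, 15, 55, 50): 3,
    datetime(2014, 1, 15, 15, 56, 0): 2,
    datetime(2014, 1, 15, 15, 58, 0): 1,
    datetime(2014, 1, 15, 15, 58, 10): 4,
}

start, end, step = min(plotted), max(plotted), timedelta(seconds=10)

# Every 10-second slice between the first and last point; missing -> 0 orders.
filled, t = [], start
while t <= end:
    filled.append((t, plotted.get(t, 0)))
    t += step

# Rolling one-minute sums (6 slices), re-evaluated every 10 seconds.
for i in range(len(filled) - 5):
    window = filled[i:i + 6]
    total = sum(count for _, count in window)
    if total <= 2:  # lower warning threshold of 2
        print(f"{window[0][0]:%H:%M:%S} - {window[-1][0]:%H:%M:%S} -> sum {total}")
```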
I hope this makes sense, and I hope it will be as obvious when you try it on your system with this 10-second display granularity.
Rob
Answer by Andrew W. ·
I put a 2 in the low warning threshold on the measure.
Well, the incident is reported frequently, with varying durations, so I do have many of them listed until I confirm them all. In other words, the incident occurs, then the condition clears, and then it's encountered again, and the incident is reported again.
Answer by Rob V. ·
Silly question, but since the behavior is confusing I have to ask: You said you have a "low warning" threshold of 2. Do you mean you have a low value (2) in the "Upper Warning" threshold entry, or have you put the 2 into the actual "Lower Warning" threshold box? What does the measure definition dialog look like regarding thresholds?
And what do you mean when you say the alert "fires" frequently? In dT, the alert will "fire" and then stay active until the condition is no longer met, at which point it's no longer active. What is the incident count for this incident if you look in the incident dashlet? Do you have many of them, or just one extended one?
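For what it's worth, here's a toy model of that open/stay-active/close behavior (an illustration of the idea only, not dT's actual incident engine), which shows why a condition that keeps flapping produces many short incidents in the dashlet rather than one long one:

```python
# Toy model: group consecutive "condition met" evaluations into incidents.
def incidents_from_condition(violations):
    """Return (start_index, end_index) pairs, one per contiguous run of True."""
    incidents, open_at = [], None
    for i, violated in enumerate(violations):
        if violated and open_at is None:
            open_at = i                          # condition newly met: incident fires
        elif not violated and open_at is not None:
            incidents.append((open_at, i - 1))   # condition cleared: incident closes
            open_at = None
    if open_at is not None:
        incidents.append((open_at, len(violations) - 1))
    return incidents

# One evaluation every 10 seconds; True means the rolling-minute sum was <= 2.
violations = [False, True, True, False, False, True, False, True, True, True, False]
print(incidents_from_condition(violations))              # [(1, 2), (5, 5), (7, 9)]
print(len(incidents_from_condition(violations)), "separate incidents")
```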
Answer by Andrew W. ·
Yes, the threshold is 2. If I split by agent, the chart looks the same, as it does if I split by application or agent group.
Here's a look at how the measure details are currently set up. I've tried some variations without success; the alert still fires on what seems like a sample from an interval of significantly less than one minute.