Question by Andrew W. · Aug 07, 2014 at 12:33 AM

Incident on Count measure

I have defined an incident with a count measure as a condition. The count measure has a lower warning threshold (I would like to see an alert if the count drops below the threshold). The evaluation timeframe on the incident is one minute.

If I chart the measure, showing thresholds, with the chart resolution also set to one minute, I see no readings below the threshold; however, the incident rule is still triggered, and frequently. It appears that the incident's one-minute evaluation timeframe is being ignored, and the incident is being evaluated much more often (every 10 seconds, maybe?), so it acts on readings below the lower warning threshold.

I'm trying to determine whether I'm misunderstanding something or have encountered a bug of some kind. It seems to me that if I've configured the incident to evaluate over one minute, this should work as I expect.

 

Thanks for any insight!

 

 


10 Replies

Answer by Rob V. · Aug 08, 2014 at 12:57 AM

Yeah, I'd agree with Rick that as you're trying to figure out what your "real" threshold is, looking at the highly granular data (as I did with the 10-second display resolution) is often very insightful. As Rick mentions, to get a clearer idea of where you are "exposed" to potential incidents, I'd change the resolution on the chart to perhaps 1 minute and see how that looks.

Good luck with it!!

Rob

 

 



Answer by Andrew W. · Aug 08, 2014 at 12:39 AM

OK, so it sounds like I'm asking too much to base an incident on a low count of this web request during a 60-second period. That's fine, I can work with that if it's the best way to accomplish this. I'll go for a 5-minute evaluation window and see if I can nail down the threshold appropriately.

This is what my chart looks like at 5-minute resolution:

 


Rick B. · Aug 08, 2014 at 12:44 AM

Keep in mind that if traffic is "spiky" rather than steady, you'll want to infer your threshold from a higher-resolution (1-minute or 10-second) chart rather than basing it strictly on a 5-minute chart. Based on your previous 1-minute chart, the low end of your "on the minute" data points was 10/minute, whereas here, because you're looking "on the 5 and 10", it looks like the minimum you're seeing steady-state is over 100 per 5-minute interval (roughly 20-25 per minute).
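To illustrate how the coarser aggregation can hide a quiet minute, here's a minimal Python sketch. The per-minute counts are made up for illustration, not taken from the actual chart:

```python
# Hypothetical per-minute order counts with one quiet, "spiky" minute.
# These numbers are illustrative only, not from Andrew's chart.
per_minute = [25, 10, 30, 28, 27,   # first 5-minute bucket
              26, 24, 29, 31, 25]   # second 5-minute bucket

# Aggregate into 5-minute buckets, as a 5-minute resolution chart would:
per_5min = [sum(per_minute[i:i + 5]) for i in range(0, len(per_minute), 5)]

print(min(per_minute))  # 10  -- the quiet minute is visible at 1-minute resolution
print(min(per_5min))    # 120 -- about 24/minute on average; the dip is hidden
```

The quiet minute (10 orders) is plainly visible at 1-minute resolution, but once summed into a 5-minute bucket it disappears into the neighboring busy minutes, which is why a threshold inferred only from the 5-minute chart can be misleading.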

Incidents are often like this with live data; they need some tuning after definition to be appropriately sensitive.

Hope that helps,

Rick B


Answer by Rob V. · Aug 08, 2014 at 12:00 AM

I don't know that I'd characterize the 10-second granularity as a problem. There is an effect, though: you get many more "minutes" than you expect, because you get a new "minute" every 10 seconds. For example, at 1:15:00 you evaluate from 1:14:00 to 1:15:00; at 1:15:10 you evaluate from 1:14:10 to 1:15:10, and so on. So instead of 60 minute-intervals per hour, you're getting 6 times that, or 360 minute-intervals per hour. That makes it much more likely that you'll catch a slow collection of 10-second samples (e.g. a minute interval with 0,0,1,0,1,0 orders) that causes the threshold of 2 to kick off.
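To make the arithmetic concrete, here is a small Python sketch (with made-up 10-second counts, not Dynatrace internals) contrasting the fixed one-minute buckets a chart shows with the rolling one-minute windows the incident engine evaluates every 10 seconds:

```python
# Hypothetical 10-second order counts across two minutes.
counts = [1, 0, 1, 0, 0, 0,   # minute 1: fixed-bucket sum = 2
          0, 1, 1, 0, 1, 0]   # minute 2: fixed-bucket sum = 3

WINDOW = 6       # six 10-second slices = one minute
THRESHOLD = 2    # lower warning threshold on the summed count

# Fixed buckets, as a 1-minute resolution chart would aggregate them:
fixed = [sum(counts[i:i + WINDOW]) for i in range(0, len(counts), WINDOW)]
print(fixed)            # [2, 3] -- no chart bucket dips below the threshold

# Rolling windows, re-evaluated every 10 seconds:
rolling = [sum(counts[i:i + WINDOW]) for i in range(len(counts) - WINDOW + 1)]
breaches = [s for s in rolling if s < THRESHOLD]
print(rolling)          # [2, 1, 2, 2, 2, 3, 3]
print(len(breaches))    # 1 -- one rolling window still breaches
```

The chart at 1-minute resolution never shows a reading below 2, yet one of the rolling windows (starting 10 seconds into the first minute) sums to only 1, which is exactly the situation Andrew is seeing.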

So recognizing that, my only substantive suggestion would be to consider either raising your threshold in recognition of how things work and what your real expectation is, or raising your evaluation timeframe to 5 minutes (perhaps also with an increased threshold, maybe 6?) to help smooth things out a little. Another off-the-cuff thought: consider what you expect your AVERAGE order rate to be. In that case, think about making the evaluation timeframe 15 minutes and work out what an acceptable average aggregation is for you, instead of using a sum. Just a thought.
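The average-instead-of-sum idea can be sketched like this (illustrative numbers only; a real setup would use the measure's aggregation setting, not hand-rolled code). Averaging over a long window tolerates the occasional zero slice that trips a short-window sum:

```python
# Sketch of evaluating an AVERAGE per-slice order rate over a longer
# window instead of a short-window sum.  Ninety 10-second slices make
# up one 15-minute window; the counts are hypothetical.
slices = [1, 0, 2, 0, 1, 1] * 15   # 90 slices, including many zeros

window_sum = sum(slices)
window_avg = window_sum / len(slices)   # average orders per 10-second slice

print(window_sum)            # 75 orders across the 15 minutes
print(round(window_avg, 2))  # 0.83 orders per slice, despite the zero slices
```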

Rob

 



Answer by Andrew W. · Aug 07, 2014 at 11:13 PM

I think I basically follow what you're saying: it's working as designed, which means the thresholds for this type of measure must be set appropriately for a 10-second evaluation granularity. The evaluation timeframe simply tells the system how often to review all the 10-second samples since the previous evaluation.

So unfortunately, the problem remains: I'd like to report an incident if there are fewer than 2 orders (counts of a particular web request) in a one-minute interval. I can't really work in terms of a 10-second interval, because there are 10-second intervals where I would expect 0 orders.

Any thoughts on how to set up the incident I'm trying for?

 

 

 

 

 

 


Rick B. · Aug 07, 2014 at 11:51 PM

Hi Andrew,

You're close, but what Rob is getting at is that the one-minute interval is evaluated every 10 seconds (as you originally thought), but the 1-minute resolution chart is deceiving because it only shows, for instance, 9:00:00 to 9:01:00, whereas the incident evaluation could fire for 9:00:30 to 9:01:30. I think what you have is an overly sensitive alert, in which case I usually advise my customers to try a longer evaluation timeframe. Does 10 requests across 5 minutes work for you?

Rick B


Answer by Rob V. · Aug 07, 2014 at 07:45 AM

Hi Andrew,

I did some testing of this exact scenario on my own system, and I think I have the explanation. For me, it works exactly as designed. The one-minute evaluation period is a rolling minute: every 10 seconds (our evaluation granularity), the previous minute is evaluated to see if the condition is met (in your case, the sum of your order counts over the past minute being <= 2).

It looks from your screenshot like you have the display granularity set to 1 minute. Things become a lot clearer if you set the display granularity to 10 seconds (which happens to correspond to our incident-processing granularity). If you do that, I think you'll find that the incidents you're seeing make sense.

For me, at a one-minute display granularity, I had "weird" sections of alerts being shown (as indicated by the yellow "heat fields" at the top of the chart):

There were incidents where it seemed they shouldn't be, and none where it appeared there should be.

If I take that chart and change the display resolution to 10 seconds, it becomes clearer what's going on. I can tell from missing data points (long lines connecting two dots more than 10 seconds apart) that there were times when I had no orders to process (so that "slice" was zero), which caused that particular rolling minute to have a total sum of less than 2:

Here you can see the incidents line up nicely with times when the sum for that rolling minute would have been less than 2. The long connection lines between two points are an indication that during some 10-second slices there was a count of 0. The best example of that is just after 15:56 (where the sum was exactly 2) to the next plotted data point at 15:58 (where the sum was 1); the points in between were zeros. The incident shows up after about a minute of that, as expected.

I hope this makes sense, and I hope it will be as obvious when you try it on your system with the 10-second display granularity.

Rob

 

 

 



Answer by Andrew W. · Aug 07, 2014 at 05:13 AM

I put a 2 in the low warning threshold on the measure.

Well, the incident is reported frequently, with varying durations, so I do have many of them listed until I confirm them all. In other words, the incident occurs, then the condition clears, and then the condition is encountered again, and the incident is reported again.

 



Answer by Rob V. · Aug 07, 2014 at 04:31 AM

Silly question, but since the behavior is confusing I have to ask: You said you have a "low warning" threshold of 2. Do you mean you have a low value (2) in the "Upper Warning" threshold entry, or have you put the 2 into the actual "Lower Warning" threshold box? What does the measure definition dialog look like regarding thresholds?

And what do you mean that the alert "fires" frequently? In dT, the alert will "fire", and then stay active until it goes out of its requested state, at which point it's not active. What is the incident count for this incident, if you look in the incident dashlet? Do you have many of them, or just one extended one?



Answer by Andrew W. · Aug 07, 2014 at 02:41 AM

Yes, the threshold is 2. If I split by agent, the chart looks the same, as it does when splitting by application or agent group.

Here's a look at how the measure details are currently set up. I've tried some variations without success; the alert still fires on what seems like a sample from an interval of significantly less than one minute.

 

 



Answer by Andrew W. · Aug 07, 2014 at 02:10 AM

Sure, no problem.


Rick B. · Aug 07, 2014 at 02:15 AM

So it looks like the configured threshold is 2 or 3, correct? Can you split by agent? What agent is referenced in the incident?

 


Answer by Andreas G. · Aug 07, 2014 at 01:37 AM

Can you attach a screenshot of both your chart and your incident definition?

