Michael V. asked ·

Delay in alerting process for Custom Alerts

I have received a requirement to alert when we have 3 or more HTTP 500 errors within a minute on one of our services. I have created a Custom Alert as follows:

I have done some testing, and there seems to be a delay in the alerting process. For example, if I have 4 HTTP 500 errors at 10:03, the Custom Alert problem is opened at 10:10 and closed at 10:12. I can't find any documentation about the alerting process or the 'for X minutes during any Y minutes period' setting on the custom alert, so I'm not sure what to expect. Could someone please explain how often Dynatrace evaluates the threshold on Custom Alerts, how often it decides to raise or close an alert, and how the X and Y settings affect this?

alerting
5aqgm.png (28.8 KiB)

Wolfgang B. answered ·

Well, the actual timeseries is evaluated every time a cluster-consolidated metric payload is written to the storage. So there is a small delay of 1 to 2 minutes until the metric results of all cluster nodes reach the storage and are written. A 7-minute delay sounds a bit too long from my perspective; 1 to 2 minutes is the typical delay until the metric is checked and the alert is raised and notified on.

Hi Wolfgang

Thank you for answering :-) In my first test I triggered 4 HTTP 500 errors and nothing more, so the delay could be explained by a lack of subsequent traffic. I have now done a new test with 5 HTTP 500 errors at 12:01, followed by subsequent requests without failures, and the alert is raised at 12:08 and closed at 12:09.

When the alert is raised at 12:08, it is reported to be open for 8 minutes:

and then after a minute (12:09) the problem is closed and changed to this:

There is something in the alerting process I don't quite understand :-) I would expect the problem to be raised no later than 12:03 and closed again at 12:05.
nioaj.png (39.2 KiB)
5jlhl.png (41.0 KiB)
7ibrd.png (39.0 KiB)
Suneel K. answered ·

But the app teams expect the alert notification in real time. Sending an email after 7 to 10 minutes would delay the remediation activities. Is there any workaround or configuration that changes this behavior?

Wolfgang B. answered ·

As explained above, we consolidate the data across all cluster nodes before taking the decision, in order not to falsely alert on partial data. If we changed the strategy to alert on partial data, we would gain speed at the cost of a high number of false-positive alerts.
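
To make the partial-data point concrete, here is a minimal sketch, not Dynatrace's actual implementation. It assumes a failure-rate metric (rather than the raw error count from the original question), three hypothetical cluster nodes, and an invented 5 % threshold; every name and number in it is illustrative only.

```python
def failure_rate(payloads):
    """Combine (errors, requests) pairs, one per hypothetical cluster node."""
    errors = sum(e for e, _ in payloads)
    requests = sum(r for _, r in payloads)
    return errors / requests if requests else 0.0


THRESHOLD = 0.05  # hypothetical 5 % failure-rate threshold

# Invented per-node payloads for the same one-minute slot.
node_a = (3, 10)    # arrives first: 3 errors in 10 requests
node_b = (0, 490)   # arrives later
node_c = (0, 500)   # arrives later

partial = failure_rate([node_a])                       # only the first payload
consolidated = failure_rate([node_a, node_b, node_c])  # all payloads

print(partial > THRESHOLD, consolidated > THRESHOLD)
```

Evaluating as soon as the first payload arrives would exceed the threshold, while the consolidated slot does not. That is the kind of false positive that waiting for consolidation avoids.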

Suneel K. answered ·

May I know whether there is any permanent solution or best practice to mitigate this situation? We also noticed the same scenario, where a custom alert triggers the email notification for data from 4 to 5 minutes earlier.

Below is an example:

I enabled the custom alert at 10:43 PM.

Dynatrace triggered the email at 10:47 PM, showing a violation at 10:38 (alerting condition met).

Is this expected behavior? If yes, kindly explain the logic by which Dynatrace scans the logs for custom alerts.

Wolfgang B. answered ·

Sorry, I have to correct my previous answer, as the cluster nodes have to consolidate the incoming data. By default, a timeslot is closed and checked against the threshold only after 5 minutes have passed with no new data written to it.

The same applies to de-alerting: we can only decide that the condition is no longer valid after the consolidation run. If we detect after 5 minutes that the condition is no longer met, we correct the problem duration to the correct timeframe, and the heat field is also corrected back.
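
The slot-closing rule described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not Dynatrace's code: `evaluate_slot` and its parameters are hypothetical names, the 5-minute quiet period is the default mentioned in this answer, and the payload arrival times are invented to reflect a write delay of 1 to 2 minutes.

```python
from datetime import datetime, timedelta

# Assumption taken from the answer above: a slot is closed once no new data
# has been written to it for 5 minutes (the stated default).
QUIET_PERIOD = timedelta(minutes=5)


def evaluate_slot(writes, threshold, quiet_period=QUIET_PERIOD):
    """Return (close_time, violated) for one timeslot.

    writes: list of (arrival_time, error_count) pairs, one per hypothetical
    cluster-node payload that landed in this slot.
    """
    if not writes:
        raise ValueError("slot has no writes")
    last_arrival = max(arrival for arrival, _ in writes)
    close_time = last_arrival + quiet_period       # slot closes after the quiet period
    total = sum(count for _, count in writes)      # consolidated error count
    return close_time, total >= threshold


# The 12:01 slot from the thread: 5 HTTP 500 errors in total. The split into
# two payloads and their arrival times (12:02, 12:03) are invented.
writes = [
    (datetime(2020, 1, 1, 12, 2), 3),
    (datetime(2020, 1, 1, 12, 3), 2),
]
close_time, violated = evaluate_slot(writes, threshold=3)
print(close_time.strftime("%H:%M"), violated)
```

With these assumed arrival times the slot closes at 12:08, which is consistent with the raise time reported earlier in the thread. That agreement depends on the invented arrival times and should not be read as confirmation of the model.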

Jeppe L. answered ·

Hey @Wolfgang B., any explanation for the delay?
