Question by Colin F. · Jul 09, 2013 at 06:39 PM

Incident Duration question/Possible RFE

If I have an evaluation period of, say, 5 minutes, and once the incident triggers the duration reported for the incident is 4 min 50 sec, the "total" duration of that incident was then technically 9 min 50 sec, right? The current "incident duration" value does not include the time spent in evaluation leading up to the incident, correct?

My hunch is this is the case.
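For illustration only, here is the arithmetic behind that claim as a small sketch; the variable names and values are mine, not fields read from AppMon:

```python
from datetime import timedelta

# Hypothetical values from the question above, not data pulled from AppMon.
evaluation_period = timedelta(minutes=5)                        # lead-up before the incident fires
reported_incident_duration = timedelta(minutes=4, seconds=50)   # what the incident record shows

# The "total" duration argued for here: evaluation lead-up plus reported duration.
total_duration = evaluation_period + reported_incident_duration
print(total_duration)  # 0:09:50
```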

My two cents:

With 10-second evaluations it's probably not terribly relevant, but with longer evaluation periods it would be nice to also track, and be able to report, the total time spent outside the bounds of the threshold, not just the time spent in the incident. This is important if one wants to rely on incident data in dynaTrace for accurate, automated reporting to management.

Colin

 

4 Replies

Answer by Colin F. · Jul 09, 2013 at 11:37 PM

Like I said, it's not 100%. Typically, if something goes, it goes in a big way, so it's usually pretty close to accurate: service down, 500s, etc. There are cases where performance degradation just creeps up to the point where it finally triggers the alarm. In those cases I've been finding myself more and more in the new Server Timeline dashlet looking at histograms. What I would really like to do is model the histogram over time, so I can see the movement as a replay, for example. That is one of my side projects, if I ever have enough time. So much of what we monitor in back-end processes is async and poses challenges. Some things can slow down without necessarily affecting anything as tangible as an Apdex.

I'm glad you like the idea. I do too. Though I must give credit where it's due: I got the idea from a paper written quite some time back on the topic. Here's the link:

Ref: http://goo.gl/qqYJo

Would love to have your thoughts after reading this little gem....


Answer by Rob V. · Jul 09, 2013 at 10:20 PM

Hi Colin,

It's an interesting conversation, and definitely beyond what dT does now. Predictive quantile analysis is an RFE, for sure. You've interested me enough to read more on it. (smile)

For our existing system, however: I'm curious how you decided that your "Impact Time" is simply the additional time of the evaluation timeframe. Clearly you could have been cruising along for a long time with individual measurements, or groups of them, breaching the threshold while the overall average for the evaluation timeframe stayed below the threshold. How do you know that your Impact Time is not just 5 minutes, but maybe an hour... or a day?


Answer by Colin F. · Jul 09, 2013 at 09:31 PM

Hi Rob,

I would agree the point is to find that balance where individual spikes and fluctuations are irrelevant until a meaningful amount of time has passed.

However, I still disagree with that line of thought. Perhaps it's just how we do things (I never said it was the right way), but for us, when an incident triggers, that is merely the point at which we say we must act on something. To accurately track the impact of such an event, however, we use something called "Impact Time" in the incident record, which for us feeds the business reports for uptime. What drove Impact Time was the need to track the amount of time from the beginning of customer impact to the point at which teams began "working" the problem.

Personally, I would like to see dT natively support dynamic/predictive quantile analysis. TP99 is what the business reports on, and there is a huge drive to alert on it. SLAs are becoming widespread, and accountability to customers is being mandated publicly as we venture into the cloud. For us, that means we have to take TP99 SLAs and approximate them with higher average thresholds, padded with enough time for the value to fluctuate, which forces us into more of a wait-and-see approach, or into waiting an hour to see what the percentiles look like, which is not workable.
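A minimal sketch of what percentile-based (TP99) alerting could look like, assuming response times arrive as a plain stream of numbers; the window size and SLA value are made up for illustration, and nothing here uses a dynaTrace API:

```python
import statistics
from collections import deque

WINDOW_SIZE = 1000       # number of recent transactions to keep (illustrative)
TP99_SLA_SECONDS = 2.0   # hypothetical 99th-percentile SLA

recent = deque(maxlen=WINDOW_SIZE)

def record(response_time_seconds: float) -> bool:
    """Add one measurement; return True if the rolling TP99 breaches the SLA."""
    recent.append(response_time_seconds)
    if len(recent) < 100:
        return False                                  # too few samples for a stable percentile
    tp99 = statistics.quantiles(recent, n=100)[98]    # 99th percentile of the current window
    return tp99 > TP99_SLA_SECONDS
```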

Many of our incident evaluations are set at either 5 or 10 minutes, simply because there is so much variation in transaction size that response time can spike, and often, which begs for bin analysis. When there is an incident with, say, a 3rd-party service/dependency, we let dynaTrace send the trap for the incident; the incident system can then check whether there is a CR in place for that system (such as a maintenance window) and either silently handle the incident or put it on the screen of a support engineer.
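A rough sketch of that routing step, assuming the incident system can look up open change requests; the function and field names are placeholders, not a real dynaTrace or ticketing API:

```python
from datetime import datetime, timezone

def route_incident_trap(trap: dict, open_change_requests: list) -> str:
    """Suppress a trap that falls inside a maintenance window, otherwise escalate it."""
    now = datetime.now(timezone.utc)
    for cr in open_change_requests:
        same_system = cr["system"] == trap["system"]
        in_window = cr["window_start"] <= now <= cr["window_end"]
        if same_system and in_window:
            return "handled silently (maintenance window)"
    return "escalated to a support engineer"
```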

We have to go back in after the fact and correct the impact start time to include the evaluation time of the incident, in order to accurately document to those other parties the true time at which something began to impact us (a business-driven requirement). It's not 100% accurate, but it's pretty close. By setting our thresholds high and our evaluation durations long, we have eliminated 99% of false positives, but as we began to automate incident ticket creation, we found we had trouble communicating to those teams the actual start time at which something began to breach the norms.
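The after-the-fact correction described above reads as a one-line adjustment; a sketch under the assumption that the impact start is simply the incident start pushed back by the evaluation period:

```python
from datetime import datetime, timedelta

def corrected_impact_start(incident_start: datetime, evaluation_period: timedelta) -> datetime:
    """Back-date the impact start by the evaluation lead-up (illustrative, not an AppMon field)."""
    return incident_start - evaluation_period

# e.g. an incident that fired at 14:05 with a 5-minute evaluation period
print(corrected_impact_start(datetime(2013, 7, 9, 14, 5), timedelta(minutes=5)))
# 2013-07-09 14:00:00
```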

Until we go past the evaluation timeframe, we don't necessarily know whether it's a transient issue or whether it warrants a qualified alarm triggering an incident to the operations center. The number of violations does play into determining whether severity is low or high, or into counting exactly how many of xxx are affected, but in and of itself it doesn't make it easy to report in the trap the actual time a named transaction became impacted by something. We do use it, but more in the context of assessing the level of impact.

 

Probably too much detail, but I am keen to hear opinions.

 

Comment by Rick B. · Jul 09, 2013 at 09:57 PM

While I agree with Rob on a practical level, I think that if your organization needs this type of functionality, it's all the more incentive to upvote percentile-based alerting and to set your thresholds on the 50th percentile (the median), since the variance there is very low.


Answer by Rob V. · Jul 09, 2013 at 07:15 PM

Hi Colin,

I don't think the 9m 50s duration you mention above makes sense. The point of the evaluation period (say, 5 min) and the aggregation (say, AVG) is that you don't want to worry about individual fluctuations above or below the threshold until some "meaningful" amount of time has passed. So at the particular moment when the AVG value over the most recent 5 minutes becomes greater than the threshold, that is the point at which the incident duration starts. Sure, there will have been many points before that where individual measurements were above (or below!) the threshold, but the incident isn't defined to start until the AVG of all such measurements is greater than the threshold. That continues for your 4m 50s until the AVG is no longer greater than the threshold, at which point the incident is done.
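To make that definition concrete, here is a minimal sketch of a sliding-window average check, assuming one measurement per 10-second interval; the threshold and window size are illustrative, and this is not how the AppMon engine is actually implemented:

```python
from collections import deque

THRESHOLD = 1.5          # illustrative response-time threshold (seconds)
WINDOW_INTERVALS = 30    # 5-minute evaluation window at one sample per 10 s

window = deque(maxlen=WINDOW_INTERVALS)
incident_open = False

def evaluate(measurement: float) -> str:
    """Open/close an 'incident' based on the window average, not on individual samples."""
    global incident_open
    window.append(measurement)
    avg = sum(window) / len(window)
    if not incident_open and len(window) == WINDOW_INTERVALS and avg > THRESHOLD:
        incident_open = True
        return "incident starts here"     # duration starts now, not at the first violation
    if incident_open and avg <= THRESHOLD:
        incident_open = False
        return "incident ends here"
    return "no change"
```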

If you're looking to track threshold breaches, how about using the "Violation of XXX" measure that's created for BTs? That can give you some indication of how often values slip up over the threshold, even if they are not in "alert status" yet.
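The "Violation of XXX" measure Rob mentions is the built-in way to get this; if you only need a rough count of individual breaches outside of AppMon, the idea is as simple as this illustrative snippet:

```python
def count_violations(measurements, threshold):
    """Count individual samples above the threshold, regardless of incident state."""
    return sum(1 for m in measurements if m > threshold)

print(count_violations([0.8, 2.1, 1.9, 0.7, 2.4], threshold=1.5))  # 3
```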

Rob

 

