Say I have an evaluation period of 5 minutes. Once the incident triggers, let's say the duration reported for the incident was 4 min 50 sec. The "total" duration of that incident was then technically 9 min 50 sec, right? The current "incident duration" value does not include the time spent in evaluation leading up to the incident, correct?
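To make the arithmetic concrete, here's a rough sketch of what I mean, assuming the reported duration only starts counting once the incident actually fires:

```python
from datetime import timedelta

# Assumed example values from my scenario above, not pulled from dynaTrace itself.
evaluation_window = timedelta(minutes=5)                        # time spent evaluating before the incident fires
reported_incident_duration = timedelta(minutes=4, seconds=50)   # what the incident record shows

# If the reported duration starts only when the incident triggers,
# the "real" impact window would be the two added together.
total_impact = evaluation_window + reported_incident_duration
print(total_impact)  # 0:09:50
```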
My hunch is this is the case.
My two cents:
For 10-second evaluations it's probably not terribly relevant, but for longer evaluation periods it would be nice to also track and report the total time spent outside the threshold bounds, not just the time spent in the incident. This matters if one relies on incident data in dynaTrace for accurate, automated reporting to management.
Colin
Answer by Colin F. ·
Like I said, it's not 100%. Typically if something goes, it goes in a big way, so it's usually pretty close to accurate: service down, 500s, etc. There are cases where performance degradation just gets to the point where it finally triggers the alarm. In those cases I've been finding myself more and more in the new Server Timeline dashlet looking at histograms. What I would really like to do is model the histogram over time, so I can see the movement as a replay, for example. That is one of my side projects, if I have enough time. So much of what we monitor in back-end processes is async and poses challenges. Some things can slow down but not necessarily affect something as tangible as an Apdex.
I'm glad you like the idea. I do too. Though I must give credit where it's due: I got the idea from a paper written quite some time back on the topic. Here's the link:
Ref: http://goo.gl/qqYJo
Would love to have your thoughts after reading this little gem....
Answer by Rob V. ·
Hi Colin,
It's an interesting conversation, and definitely beyond what dT does now. Predictive quantile analysis is an RFE, for sure. You've interested me enough to read more on it.
For our existing system however: I'm curious how you decided that your "Impact Time" is simply the additional time of the evaluation timeframe. Clearly you could have been cruising along for a long time with individual, or groups of, measurements breaching the threshold, but with the overall average for the eval timeframe staying below the threshold. How do you know that your Impact Time is not just 5 minutes, but maybe an hour... or a day?
Answer by Colin F. ·
Hi Rob,
I would agree the point is to find that balance where individual spikes and fluctuations are irrelevant until a meaningful amount of time has passed.
However, I still disagree with that line of thought. Perhaps it's just in how we do things (never said it was the right way), but for us, when an incident triggers, that's merely the point at which we say we must act on something... In order to accurately track the impact of such an event, however, we use something called "Impact Time" in the incident record, which for us affects the business reports for Uptime. What drove Impact Time was the need to track the amount of time from the beginning of customer impact to the point at which teams began "working" the problem.
Personally, I would like to see dT natively support dynamic/predictive quantile analysis. TP99 is what the business reports on, and there is a huge drive to alert on it. SLAs are becoming widespread, and accountability to customers is publicly mandated as we venture into the cloud... For us, that means we have to take TP99 SLAs and provide higher average thresholds, padded with enough time for them to fluctuate, forcing us to take more of a wait-and-see approach, or to wait an hour to see what the percentiles look like, which is not possible.
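For illustration only, here is a minimal sketch of the kind of percentile check I mean, with made-up numbers and a plain nearest-rank calculation; this is not anything dT does today:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) over a list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical response times in milliseconds and a hypothetical SLA value.
response_times_ms = [120, 135, 118, 2400, 130, 125, 980, 140, 122, 131]
tp99 = percentile(response_times_ms, 99)
tp50 = percentile(response_times_ms, 50)

TP99_SLA_MS = 1000
if tp99 > TP99_SLA_MS:
    print(f"TP99 {tp99} ms breaches the {TP99_SLA_MS} ms SLA (median is {tp50} ms)")
```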
Many of our incident evaluations are set at either 5 or 10 minutes, simply because there is so much variation in transaction size that response time can spike, and often does, which begs for bin analysis. When there is an incident with, say, a 3rd-party service/dependency, we let dynaTrace send the trap for the incident; the incident system then checks whether there is a CR in place for that system (such as a maintenance window) and either silently handles the incident or puts it on the screen of a support engineer.
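In pseudocode terms, the routing decision is roughly this (a sketch with invented names; our real incident system is of course more involved):

```python
def handle_trap(incident, change_requests):
    """Suppress the incident if a CR/maintenance window covers it, else escalate."""
    for cr in change_requests:
        if cr["system"] == incident["system"] and cr["start"] <= incident["time"] <= cr["end"]:
            return "suppressed"                   # silently handle the incident
    return "escalated to support engineer"        # put it on a support engineer's screen
```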
We have to go back in after the fact and correct the impact start time to include the evaluation time of the incident, in order to accurately document to those other parties the true time that something began to impact us (a business-driven requirement). It's not 100% accurate, but it's pretty close. By setting our thresholds high and our evaluation durations long, we have eliminated 99% of false positives, but as we began to automate incident ticket creation, we found we had issues communicating to those teams the actual time that something began to really breach the norms.
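Roughly, the correction we apply amounts to this (a sketch with made-up values and field names, not our actual ticketing integration):

```python
from datetime import datetime, timedelta

# Hypothetical values; in practice these come from the incident record and our ticket system.
incident_triggered_at = datetime(2013, 12, 10, 14, 5, 0)   # when dynaTrace fired the incident
evaluation_duration = timedelta(minutes=5)                  # the incident rule's evaluation timeframe

# Our "Impact Time" starts at the beginning of the evaluation window,
# since the breach had to be under way for the whole window before the alarm fired.
impact_start = incident_triggered_at - evaluation_duration
print(impact_start)  # 2013-12-10 14:00:00
```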
Until we go past the evaluation timeframe, we don't necessarily know whether it's a transient issue or whether it needs to be a qualified alarm triggering an incident to the operations center. The number of violations does play into determining whether severity is low or high, or counting exactly how many of xxx are affected, but in and of itself it doesn't make it easy to report in the trap the actual time a named transaction becomes impacted by something. We do use it, but more in the context of assessing the level of impact.
Probably too much detail... but I am keen to hear opinions.
While I agree with Rob on a practical level, I think if your organization needs this type of functionality, that's all the more incentive to upvote percentile-based alerting and to set your thresholds on the 50th percentile (median), as the variance there is very, very low.
Answer by Rob V. ·
Hi Colin,
I don't think that the 9m 50s duration you mention above makes sense. The point of the evaluation period (say, 5 min) and the aggregation (say, AVG) is that you don't want to worry about individual fluctuations above/below the threshold until some "meaningful" time has passed. So at the particular moment when the AVG over the most recent 5 minutes becomes greater than the threshold, that's the point at which the incident duration starts. Sure, there will have been many points prior to that where individual measurements were above (or below!) the threshold, but the incident isn't defined to start until the AVG of all such measurements is greater than the threshold. That continues for your 4m 50s, until the AVG is no longer greater than the threshold, at which point the incident is done.
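To put that in concrete terms, here's a toy sketch of the rolling-average evaluation I'm describing; my own simplification with invented numbers, not dT's actual engine:

```python
from collections import deque

def incident_windows(samples, threshold, window_size):
    """Yield (start_index, end_index) ranges where the rolling average breaches the threshold."""
    window = deque(maxlen=window_size)
    in_incident, start = False, None
    for i, value in enumerate(samples):
        window.append(value)
        breaching = len(window) == window_size and sum(window) / window_size > threshold
        if breaching and not in_incident:
            in_incident, start = True, i          # incident duration starts here
        elif not breaching and in_incident:
            in_incident = False
            yield (start, i)                      # incident ends when the average recovers
    if in_incident:
        yield (start, len(samples))

# Individual spikes (the lone 900s) don't open an incident until the average itself breaches 500.
samples = [100, 900, 120, 110, 900, 950, 980, 990, 970, 200, 150, 120, 110, 100]
print(list(incident_windows(samples, threshold=500, window_size=5)))  # [(5, 11)]
```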
If you're looking to track threshold breaches, how about using the "Violation of XXX" measure that's created for BTs? That can give you some indication of how often values slip up over the threshold, even if they are not in "alert status" yet.
Rob