Hi All.
We would like to talk with users whose response time behavior is driven by atypical load patterns, e.g. applications that only spike on a certain day of the week or during certain periods of the year. We would like to see how our current baseline implementation works in these environments and see how we can improve it. Please either post your feedback on this forum or contact me through email so that we can start a discussion: andreas.grabner@compuware.com
Andi
Answer by James M.
We aren't retail, but we would be happy to share our thoughts and experiences with the new auto-baselining functionality. We have only been looking at it closely for the last couple of months, so we are still fairly new to it. Having looked at other vendors trying to do similar things, the one big advantage is that you get a reasonable model out of the box without having to spend a lot of time tuning it.
Summary: it's an interesting implementation, and for us it has the most utility in the PureStack dashboards for visualization purposes.
Observations:
1) We see a very high degree of correlation between the High Overall Failed Transaction Rate and Excessive Web Response Time incidents, so we merge them when we move into Incident Notification to avoid what are essentially duplicate tickets.
2) The inability to override/tune thresholds such as the evaluation timeframe and the acceptable deviations, and to define different rules for different BTs, somewhat limits this going forward.
3) We observe a large amount of incident chatter with both the Excessive Web Response Time (especially) and Response Time Degraded incidents; we had to implement an incident suppression model that reduces the chatter and treats both incidents as equivalent (a rough sketch of that logic follows this list).
4) As an aside, the rules around a host system being unhealthy should also be overridable on a host-by-host basis.
5) As we requested in the RFE and other product management discussions, we would like to bring in host data as first-class objects that can be populated via host agents or via the (deprecated, but it shouldn't be) SSH host monitoring. Because:
6) The next cut is to highlight advanced correlation with the incident: is it because the host the process runs on is running hot? Is it because the host the database runs on is hot (i.e. auto-correlation of all host data)? And to correlate against the Host Site attribute: we run in 5 data centers and many of our apps run simultaneously in at least two of them, so if there is a correlation to one site when more than one is involved, it would be interesting to know.
7) Where we stand right now: we are weakly recommending internally not to turn on incident alerting for the auto-baseline incidents for our apps, largely because they are too sensitive and our world isn't predictable enough for the alerts to be meaningful. That's purely an artifact of our business, and I suspect retail/high-volume sites would see a different value proposition.
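To make points 1 and 3 concrete, here is a minimal sketch of the kind of suppression/merging logic we run on our side before tickets are raised. This is not dynaTrace functionality or its API; the type mapping, hold-down window, and function names are our own illustrative choices.

```python
from datetime import datetime, timedelta

# Hypothetical equivalence map: both response-time incidents collapse into one canonical type,
# so they raise a single ticket instead of duplicates.
EQUIVALENT_TYPES = {
    "Excessive Web Response Time": "Response Time Problem",
    "Response Time Degraded": "Response Time Problem",
}

# Hold-down window: suppress repeats of the same (app, canonical type) within this period.
HOLD_DOWN = timedelta(minutes=30)

_last_ticket = {}  # (application, canonical_type) -> time the last ticket was raised


def should_raise_ticket(application: str, incident_type: str, occurred_at: datetime) -> bool:
    """Return True only if no equivalent ticket was raised for this application recently."""
    canonical = EQUIVALENT_TYPES.get(incident_type, incident_type)
    key = (application, canonical)
    last = _last_ticket.get(key)
    if last is not None and occurred_at - last < HOLD_DOWN:
        return False  # suppressed: an equivalent ticket is still fresh within the window
    _last_ticket[key] = occurred_at
    return True


# Example: the second incident within the window is treated as a duplicate and suppressed.
now = datetime.now()
print(should_raise_ticket("OrderApp", "Excessive Web Response Time", now))                      # True
print(should_raise_ticket("OrderApp", "Response Time Degraded", now + timedelta(minutes=5)))    # False
```

The point of the sketch is just that merging the two incident types and applying a hold-down window cuts most of the duplicate tickets without touching the baselining itself.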