Solved: Business impact analysis is incorrectly evaluating impacted users

kalle_lahtinen · ‎08 Dec 2020

Hi,

I ran into a case where the "impacted users" section in the problem summary was listing apps and services that weren't actually affected in any way. To avoid showing customer data here, I'll just describe this in a simplified way:

1. Application BananaApp uses service BananaSvc which then calls external service ExtSvc with request GetBananaInfo.

2. Service OrangeSvc calls external service ExtSvc with request GetOrangeInfo. OrangeSvc has no web UI usage and thus no RUM / "Application" data.

In my case, ExtSvc requests for "GetOrangeInfo" were failing, which generated a new problem. During this time, the failure rate for GetBananaInfo was a stable 0 %.

In the problem analysis, Dynatrace said that 100 users from BananaApp were impacted by this. In fact, zero users were, because the issue was limited to OrangeSvc.

So Davis apparently deduced that because ExtSvc was having issues, and BananaApp generally uses that service, BananaApp must be impacted. Had Davis instead correlated the data at the request level, it would have seen that no GetOrangeInfo requests come from BananaApp, and that BananaApp is therefore not impacted.
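
To make the difference concrete, here is a small toy model of the two ways of attributing impact. This is only a sketch of my example above, not how Davis actually works internally: the `Call` record, the sample data, and both function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Call:
    """One observed call to a service (hypothetical record, not a Dynatrace type)."""
    app: Optional[str]  # originating application, None when there is no RUM data
    caller: str         # calling service
    service: str        # called service
    request: str        # request name on the called service
    failed: bool


# Sample data mirroring the simplified scenario from this post.
CALLS = [
    Call("BananaApp", "BananaSvc", "ExtSvc", "GetBananaInfo", False),
    Call("BananaApp", "BananaSvc", "ExtSvc", "GetBananaInfo", False),
    Call(None, "OrangeSvc", "ExtSvc", "GetOrangeInfo", True),
    Call(None, "OrangeSvc", "ExtSvc", "GetOrangeInfo", True),
]


def impacted_apps_service_level(calls, service):
    """Flag every app that uses the service if the service has any failure."""
    to_service = [c for c in calls if c.service == service]
    if not any(c.failed for c in to_service):
        return set()
    return {c.app for c in to_service if c.app is not None}


def impacted_apps_request_level(calls, service):
    """Flag only apps that issued one of the requests that are actually failing."""
    to_service = [c for c in calls if c.service == service]
    failing_requests = {c.request for c in to_service if c.failed}
    return {
        c.app
        for c in to_service
        if c.request in failing_requests and c.app is not None
    }


print(impacted_apps_service_level(CALLS, "ExtSvc"))  # {'BananaApp'}
print(impacted_apps_request_level(CALLS, "ExtSvc"))  # set()
```

The service-level variant reproduces what I saw in the problem summary, while the request-level variant gives the result I would have expected.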

I'm not sure whether this should be regarded as a bug or an RFE. I'm also not certain what role the external service plays here, i.e. would the same thing happen if it were a local service monitored by OneAgent? ExtSvc was originally under "Requests to public networks", and I then isolated it via "Monitor as separate service".

Anonymous · ‎08 Dec 2020

When you open the user sessions of the affected users and check the service flow from the application's point of view, what does it show?

kalle_lahtinen · ‎08 Dec 2020

I'm not sure exactly how to view the service flow for a single user session, at least not in the sense that you can view it as a diagram from the "Transactions and services" point of view ("View service flow" button).

But looking at the individual Load actions from the impacted user sessions, I can see that this external service was indeed requested (by drilling down from the Load action to the PurePath view). The requests were successful, though; as I said, the requests affecting these apps were not the same ones that were failing at the time. Davis just lumped all of this together because of the common denominator, i.e. the external service.

Anonymous · ‎08 Dec 2020

I see.

The external service shows up as "ExtSvc". Is it possible to see GetOrangeInfo and GetBananaInfo in DT as individual requests under the root ExtSvc? Or do they all appear as the same request inside ExtSvc, so that you only see the difference when checking the PurePath?

You might need to either:

A) install OneAgent to add that context and allow the causation engine to see the difference correctly, or

B) split ExtSvc into individual services so that each one is treated as its own external service (via API service detection for external services).

kalle_lahtinen · ‎08 Dec 2020

Hi,

Thanks for the link! I was actually about to do that anyway, not just due to this issue. So you actually took me right to the config I was looking for 🙂

Regarding your first question, yes those requests are separately visible under ExtSvc, not just under the PurePaths. So e.g. by activating the failure rate tab, I can see right away which requests are failing and which are not. That's why I was surprised that the business impact wasn't correctly detected by Davis, since it should be possible to do based on the data at hand. But again, I'm not certain which part of the misconception is due to this being an external service.

There will never be a possibility to install OneAgents on the remote hosts, so they will remain external services. As for the impact of separating ExtSvc by subdomain, I guess I'll see in time whether it helps, during the next similar problem event.

For now I'm not sure whether this question is answered 🙂 Maybe I'll keep it open for a while yet.

kalle_lahtinen · ‎08 Dec 2020

Just to reply to myself: after splitting the service based on the subdomain, of course I won't see this exact issue anymore, because the application usage is no longer linked to the shared "ExtSvc" I referred to earlier. To use my earlier example names, "GetBananaInfo" and "GetOrangeInfo" are now under separate services and thus won't be mixed up in the root cause analysis.
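
As a rough illustration of why the split helps, here is a toy sketch that groups calls either under one shared service or by hostname. It is not the actual service detection logic; the hostnames under `example.com`, the `OBSERVED` data, and all function names are invented for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# (calling service, requested URL, failed?) with hypothetical hostnames
OBSERVED = [
    ("BananaSvc", "https://banana.ext.example.com/GetBananaInfo", False),
    ("OrangeSvc", "https://orange.ext.example.com/GetOrangeInfo", True),
]


def shared_key(url):
    """Before the split: every call lands in one shared external service."""
    return "ExtSvc"


def host_key(url):
    """After the split: one external service per hostname (subdomain)."""
    return urlsplit(url).hostname


def group_calls(observed, key):
    """Group calls into services and record callers and failure state."""
    groups = defaultdict(lambda: {"callers": set(), "failed": False})
    for caller, url, failed in observed:
        group = groups[key(url)]
        group["callers"].add(caller)
        group["failed"] = group["failed"] or failed
    return dict(groups)


def callers_of_failing_services(groups):
    """Callers that a purely service-level analysis would tie to a failure."""
    failing = (g["callers"] for g in groups.values() if g["failed"])
    return set().union(*failing)


before = callers_of_failing_services(group_calls(OBSERVED, shared_key))
after = callers_of_failing_services(group_calls(OBSERVED, host_key))
print(sorted(before))  # ['BananaSvc', 'OrangeSvc']
print(sorted(after))   # ['OrangeSvc']
```

With the shared service, BananaSvc (and therefore BananaApp) sits next to the failure; after the split, only OrangeSvc does, even if the analysis only looks at the service level.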

I'm still not sure whether Davis evaluates things at the request level or only at the service level when it analyzes the business impact. But I'll nonetheless mark this as answered, because my problem basically got solved.

Enrico_F · ‎09 Jun 2021

I'm investigating a very similar case in our environment, where any anomalies detected on the built-in service "Requests to unmonitored hosts" seem to cause spurious problem associations. What did you mean by "split the service based on the subdomain"? Did you create key requests for the detected requests to BananaSvc and OrangeSvc on the "ExtSvc" service that you separated from "Requests to public networks" earlier? Or did you create custom services on the PGs that provide BananaSvc and OrangeSvc, representing the client-side code of the requests to GetBananaInfo/GetOrangeInfo? Or something else entirely?

Your feedback would be highly appreciated.