Solved: How should I answer a customer who asks how reliable "Root cause analysis" is?

kohei-saito · 28 Aug 2018

Hi,

My customer (or trial user) has some ambiguous questions about root cause analysis.
One of the questions is: How reliable is "Root cause analysis"?

When I introduced the Dynatrace AI to them, they asked, "What are the odds that a root cause is correct?"
I couldn't answer well, so I told them that I would explain the root cause function again later, after investigating.

After that, I read this page (Artificial Intelligence & Dynatrace), which says the following:

calculates the probability of individual incidents causing other incidents, applying an eigenvector centrality algorithm (the same ranking approach used by Google Search) to build a weighted graph of all related incidents to determine what issue has the highest statistical probability of being the root cause.

According to this description, Dynatrace calculates the probability of individual incidents causing other incidents, builds a weighted graph of all related incidents, and determines which incident has the highest statistical probability of being the root cause.
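To make the idea of eigenvector centrality on a weighted graph concrete, here is a minimal sketch in Python. It is not Dynatrace's implementation: the incident names, the edge weights, and the choice of an undirected graph are assumptions made purely for illustration, and `build_graph` and `eigenvector_centrality` are hypothetical helpers. The scores are normalized to sum to 1 so that they can be compared with each other.

```python
def build_graph(edges):
    """Build a symmetric weighted adjacency map from (a, b, weight) triples."""
    graph = {}
    for a, b, weight in edges:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def eigenvector_centrality(graph, iterations=200, tol=1e-12):
    """Power iteration on (A + I); returns scores that sum to 1.

    Adding the identity keeps the same eigenvectors as A but prevents the
    iteration from oscillating on bipartite graphs.
    """
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {
            n: rank[n] + sum(w * rank[m] for m, w in graph[n].items())
            for n in nodes
        }
        total = sum(new.values())
        new = {n: value / total for n, value in new.items()}
        if max(abs(new[n] - rank[n]) for n in nodes) < tol:
            return new
        rank = new
    return rank


# Assumed example: weights express how strongly two incidents are related.
EDGES = [
    ("db-host", "service-a", 0.9),
    ("db-host", "service-b", 0.8),
    ("service-a", "frontend", 0.5),
    ("service-b", "frontend", 0.4),
]

ranks = eigenvector_centrality(build_graph(EDGES))
top_candidate = max(ranks, key=ranks.get)
print(top_candidate, ranks)
```

In this toy graph the incident with the strongest connections to the other incidents receives the highest score. Note that such a score is a ranking within one graph, not a measured probability that the diagnosis is correct.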

I think this probability could also be described as "the probability that the root cause is correct", which is what my customer wants to know.

Is this understanding correct?
If so, how high is the probability when a root cause is determined? And is it a relative value or an absolute value?

By "relative" I mean that the root cause is determined by comparing the probabilities of the individual incidents causing other incidents with each other.

In this case, for example, when the probability of incident A is 60% and that of incident B is 80%, B is higher than A, so B is determined to be the root cause.

By "absolute" I mean that the root cause is determined by comparing the probability of each incident causing other incidents against some threshold or baseline.

In this case, for example, when the threshold probability for a root cause is 70%, incident A is at 60%, and incident B is at 80%, B is over the 70% threshold, so B is determined to be the root cause.
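The two interpretations can be written down as a small Python sketch. The function names `pick_relative` and `pick_absolute` are hypothetical and only restate the question; they do not describe how Dynatrace actually decides.

```python
def pick_relative(probabilities):
    """'Relative': the incident with the highest probability wins."""
    return max(probabilities, key=probabilities.get)


def pick_absolute(probabilities, threshold):
    """'Absolute': every incident above a fixed threshold qualifies."""
    return [name for name, p in probabilities.items() if p > threshold]


example = {"A": 0.60, "B": 0.80}
relative_choice = pick_relative(example)        # compares A against B
absolute_choice = pick_absolute(example, 0.70)  # compares each against 70%

# The two rules differ when every probability is low:
low = {"A": 0.30, "B": 0.40}
relative_low = pick_relative(low)        # still names an incident
absolute_low = pick_absolute(low, 0.70)  # names nothing
```

With the 60%/80% example both rules select B. They only disagree when no incident reaches the threshold, in which case the relative rule still names one and the absolute rule names none.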

If my understanding is far from the facts, please tell me the correct one.

Regards,

Kohei Saito

wolfgang_beer · 28 Aug 2018

Your description of how the probability of detecting the root cause of a complex incident is calculated is much too simplified, and that probability cannot be stated in either an absolute or a relative manner. First of all, Dynatrace root cause detection tries to find out which component is the most probable candidate for containing the root cause. Each candidate, such as a host or a specific process, gets a calculated rank. If the rank is high enough, we show a root cause in terms of a host, process, or service. The rank, as in most other machine learning algorithms, is a number between 0 and 1, where 1 is the highest possible rank. If we detect many similar but very low ranks, such as 0.0001 for 10 different hosts, we assume that we do not have enough evidence to blame an individual host, and we show no root cause.

A root cause would be shown if the first candidate has a rank of 0.8 and the second candidate has a rank of 0.01. That is a clear indicator that host number one stands out as the root cause candidate.
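The decision described above can be sketched in Python as follows. The cutoffs `min_rank` and `min_ratio` are hypothetical values chosen only so that the two examples from this answer behave as described; the actual criteria Dynatrace uses are not stated here.

```python
def select_root_cause(ranks, min_rank=0.5, min_ratio=10.0):
    """Return the top candidate only if its rank is high and stands out.

    min_rank and min_ratio are assumed, illustrative cutoffs.
    """
    if not ranks:
        return None
    ordered = sorted(ranks.items(), key=lambda item: item[1], reverse=True)
    top_name, top_rank = ordered[0]
    if top_rank < min_rank:
        # Not enough evidence to blame any single component.
        return None
    if len(ordered) > 1:
        runner_up_rank = ordered[1][1]
        if runner_up_rank > 0 and top_rank / runner_up_rank < min_ratio:
            # The best candidate does not clearly stand out.
            return None
    return top_name


clear_case = select_root_cause({"host-1": 0.8, "host-2": 0.01})
flat_case = select_root_cause({f"host-{i}": 0.0001 for i in range(10)})
print(clear_case, flat_case)
```

Under these assumed cutoffs, the 0.8 versus 0.01 case yields a root cause candidate, while ten hosts at 0.0001 yield none.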

In any case, a rank of 1 still does not automatically mean that we found the real root cause, which in many cases is hidden somewhere in the custom program logic of the customer's software services.

Therefore, the process of finding the root cause always starts with finding the most likely component, then following the suspicious metric-based evidence, such as garbage collector metric anomalies, and drilling down into the code-level details during the problematic timeframe to find the lines of code that are responsible for the incident.