Process unavailable alert after Pod was destroyed and restarted

GAnantula
Participant

We have Dynatrace OneAgent installed on our OpenShift platform. We have enabled "Availability Monitoring" (the PROCESS_UNAVAILABLE option) for a service at the Process Group level, and we noticed that a Problem card was created when one Pod was manually destroyed (to address an issue) and a new Pod was started for the same service. However, the Problem card didn't close after the new process was spun up, I believe because the 5 characters at the end of the full Kubernetes pod name are different from the previous one, which is expected.

Usually the "Process unavailable" events will not be closed automatically until Dynatrace detects the same process. But in this case the Problem card is (still) open for more than 2 days and needs to be manually closed.

Process name identified by dynatrace is "archive-4-c6hfg" before the pod was destroyed.

Process name identified by dynatrace is "archive-4-abc7d" AFTER the new pod was started.

Questions:

  1. Is there any way for Dynatrace to detect that a new Pod was started after the old Pod was destroyed, even with the "PROCESS_UNAVAILABLE" option selected for "Availability Monitoring"? That way we wouldn't need to close the Problem card manually in Dynatrace.
  2. Is the "PROCESS_GROUP_LOW_INSTANCE_COUNT" option for "Availability Monitoring" the only way to have Dynatrace identify and close the Problem card automatically in this case?

Please advise.

Thanks,

Ganesh

1 REPLY

Anonymous
Not applicable

Well, the situation AFAIK is the expected behavior:

Dynatrace will open a new problem if a single process in this group shuts down or crashes. Details of the related impact on service requests will be included in the problem summary.
Note: If a process is intentionally shut down or retired while this setting is active, you'll need to manually close the problem.

Maybe a better option is to set a minimum number of processes, as you said.
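
If you go the minimum-instance route, here is a rough sketch of what that change might look like through the anomaly detection configuration API. This is only an illustration under assumptions: the endpoint path, the minimumThreshold field, and the placeholder values are guesses to verify against the API documentation for your Dynatrace version; only the option names themselves come from the question above.

    import requests

    # Illustration only: switch a process group's availability monitoring from
    # PROCESS_UNAVAILABLE to PROCESS_GROUP_LOW_INSTANCE_COUNT (option names taken
    # from the question). The endpoint path and field names are assumptions;
    # check the Dynatrace API docs for your environment before relying on them.
    DT_ENV = "https://<your-environment>.live.dynatrace.com"  # placeholder environment URL
    API_TOKEN = "<token with configuration write scope>"      # placeholder token
    PG_ID = "PROCESS_GROUP-0000000000000000"                  # placeholder process group ID

    payload = {
        "availabilityMonitoring": "PROCESS_GROUP_LOW_INSTANCE_COUNT",
        "minimumThreshold": 1,  # assumed field: alert when fewer instances than this are running
    }

    resp = requests.put(
        f"{DT_ENV}/api/config/v1/anomalyDetection/processGroups/{PG_ID}",
        headers={"Authorization": f"Api-Token {API_TOKEN}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()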

Or, better, turn the setting off? For Pods like these, the process name will always be different, since the last part of the name (the -xxxxx suffix) is a random ID. That is expected behavior as well, unless you have a custom rule for process detection.
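
To make the naming point concrete: only the random suffix differs between the two instances, so any custom detection rule would have to key on the stable prefix. A minimal sketch of that idea (illustration only, not something Dynatrace consumes directly; the suffix pattern is assumed from the names in the question):

    import re

    # Illustration only: strip the random hash that Kubernetes/OpenShift appends
    # to pod names, so both instances map to the same stable identifier.
    # The 4-6 character lowercase alphanumeric suffix is assumed from the examples above.
    POD_SUFFIX = re.compile(r"-[a-z0-9]{4,6}$")

    def stable_name(pod_name: str) -> str:
        return POD_SUFFIX.sub("", pod_name)

    print(stable_name("archive-4-c6hfg"))  # archive-4
    print(stable_name("archive-4-abc7d"))  # archive-4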

Also, Dynatrace should detect, even without such a setting, if the process has a problem: if it crashes, or if it is not working and not responding to requests.
