What is the timeout for an agent attempting to connect to a collector? There is conflicting information in at least the 5.5 and 6.1 documentation. The definition of the DT_WAIT environment variable says it is 20 seconds, but the .NET Agent Troubleshooting section says it is 60 seconds. The agent log file also indicates that it will try for 60 seconds. Considering that the default Windows service timeout is 30 seconds, doesn't 60 seconds seem long? This appears to cause one of our applications, which runs as a Windows service, to fail to start when the collector is not available. When I set DT_WAIT to 10, the service started fine.
Answer by Guenter H. ·
Thanks Gregg,
It would be great if you tried the new default and it worked!
I'll talk with our good support guys and the devs in charge on Monday about the exact reason for the change and whether they can think of any new implications.
Have a nice weekend!
G.
Answer by Christian S. ·
Actually, we changed the default from 60s to 20s for exactly this reason: there were services (especially on Windows) that did not come up in time when the collector was, for whatever reason, not reachable.
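To make the arithmetic behind that change concrete, here is a minimal sketch. It assumes the only things that matter are the agent's wait time, the application's own start-up time, and the 30-second Windows service timeout mentioned in the question. The function name and the `app_startup_seconds` parameter are made up for illustration and are not part of any dynaTrace API.

```python
def agent_wait_fits_service_timeout(dt_wait_seconds,
                                    service_timeout_seconds=30.0,
                                    app_startup_seconds=0.0):
    """Return True if a fully blocked agent wait still leaves the service
    enough time to report that it has started.

    Simplified model (an assumption, not documented product behavior):
    the agent blocks for the whole of dt_wait_seconds when the collector
    is unreachable, then the application needs app_startup_seconds more.
    """
    worst_case_start = dt_wait_seconds + app_startup_seconds
    return worst_case_start < service_timeout_seconds
```

Under this model the old 60s default exceeds a 30s service timeout, while 20s and 10s fit as long as the application itself starts quickly. It is only a rough check and will not explain every case.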
Answer by Gregg K. ·
We tested increasing the service timeout to 1 minute and then 2 minutes, but it didn't help. Windows only seems to have that one global setting for all services, so we were not that crazy about the idea anyway. We only tried 10s for DT_WAIT, but I suppose we could have gone with 20s since that is the default going forward.
Answer by Guenter H. ·
Good morning Gregg,
I reverted the copy-edit changes and referenced the DT_WAIT constant in the .NET troubleshooting section instead of duplicating the value, so it won't haunt us in the future.
While editing the troubleshooting section I think I found (quite literally) a good pointer to your problem:
"If a configuration is found, the Agent tries to connect to the given Collector ('server' setting) and creates a new dt_<agentName>_bootstrap_<pid>.log file in the <DT_HOME>\log folder. On connection problems, the Agent tries for DT_WAIT seconds (default timeout setting) and blocks the process from executing. When the Agent times out, no instrumentation is done."
I suspect the Server > Collector > Agent update chain gets in the way time-wise. If you set the timeout to 10s, the app is blocked for only 10s and will come up, but uninstrumented.
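As an illustration of that blocking behavior, here is a minimal Python sketch of "try to connect for up to N seconds, then carry on without instrumentation". It is not the Agent's actual code. The function name, the retry interval, and the injectable `connect` parameter are assumptions made for the example, and the host and port in the usage note are placeholders.

```python
import socket
import time


def wait_for_collector(host, port, wait_seconds=20.0, retry_interval=1.0,
                       connect=socket.create_connection):
    """Block for at most wait_seconds while trying to reach host:port.

    Returns True as soon as a connection succeeds, or False once the
    deadline has passed (the caller would then start uninstrumented).
    `connect` is injectable so the logic can be exercised without a network.
    """
    deadline = time.monotonic() + wait_seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # timed out: give up and let the process continue
        try:
            # Never let a single attempt run past the overall deadline.
            conn = connect((host, port), timeout=remaining)
            conn.close()
            return True
        except OSError:
            # Collector not reachable yet: pause briefly, then retry,
            # without sleeping beyond the deadline.
            pause = min(retry_interval, deadline - time.monotonic())
            time.sleep(max(0.0, pause))
```

A caller could use it as `instrumented = wait_for_collector("collector.example", 9998, wait_seconds=10)`. The process is held up for at most the chosen wait time, and a False result corresponds to the "no instrumentation is done" case in the quote.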
Edit: How about increasing the Windows service start-up timeout instead of decreasing DT_WAIT?
If we can't nail it with Christian's next reply, I suggest you contact support with a support archive, so they can look at it more deeply with better visibility.
Thanks
G.
Answer by Gregg K. ·
This could happen during maintenance if the application starts while the collector is being rebooted for patching. This is a critical application, and it cannot be prevented from starting for any reason.
We tested this scenario because the application previously would not start before the collector had connected to the dt server for the first time. It just so happened that the proper firewall rules were not in place to allow that traffic, although the collector was reachable by the agent. After the collector's initial connection to the dt server, the agents started up fine.
Answer by Christian S. ·
Hi Gregg,
Apart from the changed defaults, this is a scenario that you should not experience on a regular basis, as it indicates that the Agent could not connect to the Collector. In this case the Agent will not instrument and will only provide some metrics, but no PurePaths and the like.
So changing this timeout to a lower value also increases the chance that the Agent will not work as expected.
So my question is: what is the reason for the Agent not connecting to the Collector? Is this expected on your side?
Best,
Christian
Answer by Guenter H. ·
Sorry for the confusion, Gregg!
You stumbled over a combination of bad luck and ignorant, overzealous copy-editing.
Firstly, the bad luck:
The 5.5 documentation you looked at covers the last version where the timeout was 60 seconds.
I normally put change notes in the new documentation when such parameters change, but I don't have (or take) the time to put forward references in the old docs.
Secondly, the copy-editing:
My original in 5.6 for wait=<seconds>, because there were no copy editors yet:
Optional: Specifies the initial wait timeout – the maximum time to wait for a connection to a dynaTrace Collector in seconds. If the connection cannot be established within this timeframe, the application continues uninstrumented. Defaults to 20 seconds now; was 60s until 5.5.
After the first round of copy-editing:
Optional: Specifies the initial wait timeout – the maximum time to wait for a connection to a dynaTrace Collector in seconds. If the connection is not established within this timeframe, the application continues without instrumentation. It defaults to 20 seconds.
Any reference to the change is gone. I have no idea what the problem with this text is, but I know they edited out other back references (which Java version was needed for installation on *NIX until when), and they can't tell a *NIX chmod mask (777) from a (European) area code (0777) and edited exactly the wrong version out... <grrrrrin>
Thanks for bringing this up! I will try to find all occurrences and revert them.
G.