AppD Archive

Agents not reporting correctly for about a day

CommunityUser
Splunk Employee

Hi,

Right now I have the problem that the agents are no longer reporting correctly.

In the log files I see a lot of ERROR messages:

[AD Thread Pool-Global1] 10 Mär 2015 15:46:10,119 ERROR RequestSegmentDataQueue - Fatal transport error: Read timed out
[AD Thread Pool-Global1] 10 Mär 2015 15:46:10,119 ERROR RequestSegmentDataQueue - Could not send snapshots to controller Fatal transport error: Read timed out
[AD Thread Pool-Global3] 10 Mär 2015 15:46:20,791 ERROR RequestSegmentDataQueue - Fatal transport error: Connection reset
[AD Thread Pool-Global3] 10 Mär 2015 15:46:20,791 ERROR RequestSegmentDataQueue - Could not send snapshots to controller Fatal transport error: Connection reset

Any idea what is causing this?

CommunityUser
Splunk Employee

Prior to those error messages, I see the following error in the log:

ERROR MetricHandler - Error registering metrics
com.singularity.ee.agent.commonservices.metricgeneration.metrics.e: Error registering metrics with controller Fatal transport error: Connection reset
at com.singularity.ee.agent.appagent.kernel.ub.a(ub.java:125)
at com.singularity.ee.agent.commonservices.metricgeneration.a.a(a.java:169)
at com.singularity.ee.agent.commonservices.metricgeneration.d.a(d.java:270)
at com.singularity.ee.agent.commonservices.metricgeneration.g.run(g.java:103)
at com.singularity.ee.util.javaspecific.scheduler.n.run(n.java:118)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at com.singularity.ee.util.javaspecific.scheduler.y.e(y.java:315)
at com.singularity.ee.util.javaspecific.scheduler.a.b(a.java:150)
at com.singularity.ee.util.javaspecific.scheduler.b.a(b.java:123)
at com.singularity.ee.util.javaspecific.scheduler.b.b(b.java:208)
at com.singularity.ee.util.javaspecific.scheduler.b.run(b.java:238)
at com.singularity.ee.util.javaspecific.scheduler.i.a(i.java:683)
at com.singularity.ee.util.javaspecific.scheduler.i.run(i.java:715)
at java.lang.Thread.run(Thread.java:745)

Arun_Dasetty
Super Champion

Hi Constantin,

We see such errors when there are network connectivity problems between the instance where the AppServerAgent is installed and the controller. We understand that your agent is trying to register with the SaaS account at https://medicalcolumbusag.saas.appdynamics.com, that you have provided the account-name and access-key details in <agent_dir>/conf/controller-info.xml in addition to the controller host and port, and that you have restarted the JVM and still see the issue.

If that is not the case, please send a zipped archive of the <AppServerAgent_dir>/logs and <AppServerAgent_dir>/conf folders, and also provide the output of the following commands run from the agent instance:

shell> telnet medicalcolumbusag.saas.appdynamics.com 443

shell> telnet medicalcolumbusag.saas.appdynamics.com 80 (if you are using http port in agent config)

- Also, please confirm whether any proxy sits between the agent instance and the controller.
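
Where telnet is not installed on the agent host, the same reachability check can be sketched in Python. This is a minimal sketch, not an AppDynamics tool: `can_connect` is a hypothetical helper name and the 5-second timeout is an assumption. Like telnet, it only shows that a TCP connection can be opened; it does not exercise the requests on which the agent reports "Read timed out" or "Connection reset".

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False
```

Calling `can_connect("medicalcolumbusag.saas.appdynamics.com", 443)` from the agent host corresponds to the first telnet command above; port 80 is only relevant if the agent is configured to use HTTP.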

Let us know if that clarifies your query, and send us the requested details so we can debug further.

In addition, we see that a few apps in the SaaS UI are reporting fine, so this could be either a network issue or an agent registration issue on the affected agent instance. Please send us the logs so we can assist you further.

Regards,

Arun

CommunityUser
Splunk Employee

Hi,

OK, here is what I have done so far:

1. I have reset the agent so it reloads. Nothing changed.

2. I have restarted the services. Still the same effect.

3. I just tested the telnet connections, and they work fine.

What confuses me is that the agents reporting to the controller are located in different networks and locations:

We have one server reporting from our on-premises network (which is having this issue).

The rest of the servers are located in an Amazon VPC in Frankfurt (also showing the same issues).

So it seems that this is not a network issue on the agents' side but might be one on the controller side. I cannot analyze that myself, though, because it is a SaaS controller.

Regards

Constantin

Arun_Dasetty
Super Champion

Hi Constantin,

Can you provide the following so we can assist you further:

a) an archive of the <AppServerAgent_dir>/logs folder

b) a screenshot of the controller UI screen you were referring to, for clarity

Regards,

Arun

CommunityUser
Splunk Employee

Hi,

Please find attached the requested information.

Regards


Constantin

Arun_Dasetty
Super Champion

Hi Constantin,

Although the logs contain fatal errors, we see that the agent logs are for node "192.168.100.105" of application "transactor", and from your screenshot and the attached screenshot it is clear that data is now reporting fine:

image.png

We see that the issue no longer exists, and data for the past day shows up fine in the app dashboard as well. Let us know if you need further assistance with this.

CommunityUser
Splunk Employee

Hi,

I'm not sure about that. I have those fatal errors in the log, and I am afraid that some data might be missing. It is true that we currently have reports.

But in the time slot around 13:15 - 13:20 today there was no data, and there still isn't, so monitoring data is definitely lost.

Regards

Constantin

Arun_Dasetty
Super Champion

Hi Constantin,

Can you provide a screenshot from the UI depicting the issue, and also create and share a custom time range with us so that we can drill down from our end?

Data will not be persisted for periods in which a network connectivity issue at the agent end lasts for a long time.
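
To illustrate why a long outage loses data, here is a toy model of a bounded buffer. It is not the agent's actual implementation: the class name `RetentionBuffer` and the capacity of 5 are assumptions chosen for illustration. While the controller is unreachable, each new time slice is queued; once the buffer is full, the oldest slice is discarded to make room.

```python
from collections import deque


class RetentionBuffer:
    """Toy bounded buffer: keeps the newest `capacity` unsent time slices."""

    def __init__(self, capacity: int = 5):
        self.capacity = capacity   # illustrative value, not an agent setting
        self.pending = deque()     # time slices waiting to be sent
        self.dropped = []          # time slices discarded for lack of room

    def add(self, timeslice: str) -> None:
        # When full, discard the oldest slice before queuing the new one.
        if len(self.pending) == self.capacity:
            self.dropped.append(self.pending.popleft())
        self.pending.append(timeslice)

    def flush(self) -> list:
        """Simulate a successful send: return and clear everything pending."""
        sent = list(self.pending)
        self.pending.clear()
        return sent
```

In this model, an outage spanning eight time slices with a capacity of five leaves the three oldest slices unrecoverable, while the five newest are delivered once the connection returns.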

Regards,

Arun

CommunityUser
Splunk Employee
Hi,

I cannot provide you with the time frame, because right now I no longer see the gap I saw yesterday.

But what confuses me is that you are telling me that everything is fine, yet the log of my agent is full of messages like these:

[AD Thread Pool-Global65] 12 Mär 2015 14:47:08,853 ERROR RequestSegmentDataQueue - Fatal transport error: Read timed out
[AD Thread Pool-Global65] 12 Mär 2015 14:47:08,853 ERROR RequestSegmentDataQueue - Could not send snapshots to controller Fatal transport error: Read timed out
[AD Thread Pool-Global65] 12 Mär 2015 14:47:53,385 WARN EventHandler - The retention queue is at full capacity [5]. Dropping events for timeslice [Thu Mar 12 14:38:00 CET 2015] to accomodate events for timeslice [Thu Mar 12 14:47:00 CET 2015]

The last line in particular tells me the agent is dropping information because it cannot deliver it. I have a single node for that application, so it is essential for me that all the data the agent gathers is delivered to the controller.
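
To gauge how much is affected, the agent log can be scanned for these two message types. The sketch below assumes the line layout shown in the excerpt above; the function name `summarize_agent_log` and the regular expressions are illustrative and not part of any AppDynamics tooling.

```python
import re
from collections import Counter

# Matches ERROR lines ending in "Fatal transport error: <cause>".
TRANSPORT_ERROR = re.compile(r"ERROR \S+ - .*?Fatal transport error: (.+)$")
# Matches the retention-queue warning; captures the dropped and the newest time slice.
DROPPED = re.compile(
    r"Dropping events for timeslice \[(.+?)\] to \w+ events for timeslice \[(.+?)\]"
)


def summarize_agent_log(lines):
    """Count transport-error lines by cause and collect dropped time slices."""
    causes = Counter()
    dropped = []
    for line in lines:
        line = line.rstrip("\n")
        match = TRANSPORT_ERROR.search(line)
        if match:
            # Counts log lines; one failed send may produce more than one line.
            causes[match.group(1).strip()] += 1
            continue
        match = DROPPED.search(line)
        if match:
            dropped.append((match.group(1), match.group(2)))
    return causes, dropped
```

The function accepts any iterable of lines, so an open file handle on the agent log can be passed directly. The `dropped` list then names every time slice for which events were discarded.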

So what is the solution to get rid of those errors? I checked our network connection, and there is nothing wrong there.

For a second application I have agents running in AWS. They also try to report, and they give me the same transport error messages. Since there is no load on those systems, they do not drop anything, though.


Regards

Constantin
