I'm tring to troubleshoot a problem with sending data from a light forwarder to a splunk server. In this particular case, the clients are 2 Windows 2008 R2 boxes running the "Client access server" role for exchange 2010. The splunk server is a redhat linux 5 64 bit box. Both are running splunk 4.1.4.
I'm trying to harvest logs from 2 cas servers. I suppose for those who are familiar with exchange 2010, I'm trying to harvest the following logs:
C:\Program Files\Microsoft\Exchange Server\V14\Logging\AddressBook Service\AddressBook*
C:\Program Files\Microsoft\Exchange Server\V14\Logging\Imap4\IMAP4*
C:\Program Files\Microsoft\Exchange Server\V14\Logging\Pop3\POP3*
C:\Program Files\Microsoft\Exchange Server\V14\Logging\RPC Client Access\RCA*
C:\inetpub\logs\LogFiles\W3SVC1\u_ex*
I started off with the first 4 logs (address book, imap, pop3 and rca). Things looked to be ok, and I was seeing active data from both machines. I was adding them one at a time to verify that each log was being collected properly. When I added the last log, I noticed that I stopped receiving events from one of the two cas servers. When I left it last night, the first one wasn't reporting. I tried running the splunk add oneshot command, and that got data to show up, but of course that was manually. I left it alone, and this morning, I noticed that the 2nd cas server was the one reporting back.
I'm not sure where the problem may be, but I see two things on logs (one on the client and one on the server) that point to a problem. On the client, I see this in the splunkd.log file over and over:
08-15-2010 15:45:53.760 INFO TailingProcessor - Could not send data to output queue (parsingQueue), retrying...
08-15-2010 15:45:53.760 INFO TailingProcessor - ...continuing.
I don't really understand what that really means or what its trying to do. I'm not sure if the timing matches, but I also see this on the splunkd.log of the splunk server:
08-15-2010 15:50:24.234 WARN DateParserVerbose - Failed to parse timestamp for event.
Context="source::C:\inetpub\logs\LogFiles\W3SVC1\u_ex100108.log|host::cas1|exchangewebaccesslog|remoteport::14759" Text="#Software: Microsoft Internet Information Services 7.5..."
The value after "Text=" isn't always that, but it is never actual data. I figure that particular warning on the splunk server is because the top lines of each log file contain header info that it can't parse. However, I'm not sure if that's enough to stop it from receiving data. I"m guessing its not because it seemed to be working fine before I added the web log.
I took a guess that maybe I was sending too many events after I added the web logs, so I tried adding this to my outputs.conf file:
maxQueueSize=200000
However, that didn't seem to make a difference. The other thing is that after a few restarts of the splunk light forwarder, the web logs seemed to be showing up for both clients, but then I stopped receiving data for all the other logs that were already working. When I attempted to put the TrailingProcessor in debug mode, I saw this for a file:
08-15-2010 15:40:52.195 DEBUG TailingProcessor - ****
08-15-2010 15:40:52.195 DEBUG TailingProcessor - File state notification for path='C:\Program Files\Microsoft\Exchange Server\V14\Logging\RPC Client Access\RCA_20100728-1.LOG' (first time).
08-15-2010 15:40:52.195 DEBUG TailingProcessor - Item 'C:\Program Files\Microsoft\Exchange Server\V14\Logging\RPC Client Access\RCA_20100728-1.LOG' matches stanza: C:\Program Files\Microsoft\Exchange Server\V14\Logging\RPC Client Access\RCA*.
08-15-2010 15:40:52.195 DEBUG TailingProcessor - Storing config 'C:\Program Files\Microsoft\Exchange Server\V14\Logging\RPC Client Access\RCA*' for app ''.
08-15-2010 15:40:52.195 DEBUG TailingProcessor - Entry is associated with 1 configuration(s).
08-15-2010 15:40:52.195 DEBUG TailingProcessor - Will attempt to read file: C:\Program Files\Microsoft\Exchange Server\V14\Logging\RPC Client Access\RCA_20100728-1.LOG.
08-15-2010 15:40:52.336 DEBUG TailingProcessor - Got classified_sourcetype='exchangerpcclientaccesslog' and classified_charset='UTF-8'.
08-15-2010 15:40:52.336 DEBUG TailingProcessor - About to read data (Opening file: C:\Program Files\Microsoft\Exchange Server\V14\Logging\RPC Client Access\RCA_20100728-1.LOG).
08-15-2010 15:40:52.367 DEBUG TailingProcessor - Will doublecheck EOF (in 3000ms)..
I don't see anything wrong that would tell me that its a problem reading the file itself. Here are the relevant files:
inputs.conf on the client side:
[monitor://C:\Program Files\Microsoft\Exchange Server\V14\Logging\AddressBook Service\AddressBook*]
index=idx-exchangemaillogs
sourcetype=exchangeaddressbooklog
[monitor://C:\Program Files\Microsoft\Exchange Server\V14\Logging\Imap4\IMAP4*]
index=idx-exchangemaillogs
sourcetype=exchangeimap4log
[monitor://C:\Program Files\Microsoft\Exchange Server\V14\Logging\RPC Client Access\RCA*]
index=idx-exchangemaillogs
sourcetype=exchangerpcclientaccesslog
[monitor://C:\inetpub\logs\LogFiles\W3SVC1]
index=idx-exchangemaillogs
sourcetype=exchangewebaccesslog
props.conf on the server side:
[exchangepop3log]
SHOULD_LINEMERGE=false
KV_MODE = none
REPORT-exchangepop3extract = exchangepop3log_extractions
[exchangeaddressbooklog]
SHOULD_LINEMERGE=false
KV_MODE = none
REPORT-exchangeaddressbookextract = exchangeaddressbooklog_extractions
[exchangeimap4log]
SHOULD_LINEMERGE=false
KV_MODE = none
REPORT-exchangeimap4extract = exchangeimap4log_extractions
[exchangerpcclientaccesslog]
SHOULD_LINEMERGE=false
KV_MODE = none
REPORT-exchangerpcclientaccessextract = exchangerpcclientaccesslog_extractions
[exchangewebaccesslog]
SHOULD_LINEMERGE=false
KV_MODE = none
REPORT-exchangewebaccessextract = exchangewebaccesslog_extractions
transforms.conf on the server side:
[exchangepop3log_extractions]
DELIMS = ","
FIELDS = "date-time","session-id","sequence-number","local-endpoint","remote-endpoint","event","data","context"
[exchangeaddressbooklog_extractions]
DELIMS = ","
FIELDS = "date-time","session-id","seq-number","client-name","organization-info","client-ip","server-ip","protocol","operation","rpc-status","processing-time","operation-specific","failures","authentication"
[exchangeimap4log_extractions]
DELIMS = ","
FIELDS = "date-time","session-id","sequence-number","local-endpoint","remote-endpoint","event","data","context"
[exchangerpcclientaccesslog_extractions]
DELIMS = ","
FIELDS = "date-time","session-id","seq-number","client-name","organization-info","client-software","client-software-version","client-mode","client-ip","server-ip","protocol","operation","rpc-status","processing-time","operation-specific","failures"
[exchangewebaccesslog_extractions]
DELIMS = " "
FIELDS = "date","time","s-ip","cs-method","cs-uri-stem","cs-uri-query","s-port","cs-username","c-ip","cs-version","cs(User-Agent)","cs(Cookie)","cs(Referer)","cs-host","sc-status","sc-substatus","sc-win32-status","time-taken"
and...just for sake of completion....the first bit of the files in question:
C:\Program Files\Microsoft\Exchange Server\V14\Logging\AddressBook Service\AddressBook*
#Software: Microsoft Exchange
#Version: 14.00.0694.000
#Log-type: AddressBook Protocol Logs
#Date: 2010-08-15T00:00:00.477Z
#Fields: date-time,session-id,seq-number,client-name,organization-info,client-ip,server-ip,protocol,operation,rpc-status,processing-time,operation-specific,failures,authentication
....data goes here....
C:\Program Files\Microsoft\Exchange Server\V14\Logging\Imap4\IMAP4*
#Software: Microsoft Exchange Server
#Version: 14.0.0.0
#Log-type: IMAP4 Protocol Log
#Date: 2010-08-15T19:44:10.128Z
#Fields: date-time,session-id,sequence-number,local-endpoint,remote-endpoint,event,data,context
C:\Program Files\Microsoft\Exchange Server\V14\Logging\Pop3\POP3*
#Software: Microsoft Exchange Server
#Version: 14.0.0.0
#Log-type: POP3 Protocol Log
#Date: 2010-08-15T16:26:19.597Z
#Fields: date-time,session-id,sequence-number,local-endpoint,remote-endpoint,event,data,context
C:\Program Files\Microsoft\Exchange Server\V14\Logging\RPC Client Access\RCA*
#Software: Microsoft Exchange
#Version: 14.00.0694.000
#Log-type: RCA Protocol Logs
#Date: 2010-08-15T00:00:30.087Z
#Fields: date-time,session-id,seq-number,client-name,organization-info,client-software,client-software-version,client-mode,client-ip,server-ip,protocol,operation,rpc-status,processing-time,operation-specific,failures
C:\inetpub\logs\LogFiles\W3SVC1\u_ex*
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2010-08-15 00:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status time-taken
Unless I'm missing something obvious in my config files, can anyone point me in the right direction as to what would be causing the "C:\inetpub\logs\LogFiles\W3SVC1\u_ex*" to be showing up in my splunkd.log files on the light forwarder (as well as cause my data not to show up in splunk either!)
Wow, very thorough. 🙂
The TailingProcessor message means that it was unable to insert data into the parsingQueue, which, as you might guess, is where event parsing happens. This only happens when the queue is full - to confirm this, run 'grep blocked=true var/log/splunk/metrics.log*' and you should see many results.
Data travels through Splunk's queues pretty much linearly - meaning that if a queue further down the line (indexQueue) is blocked, it will eventually fill and block the queues that are feeding it (parsingQueue). The metrics.log output will indicate which queues are blocked.
The timeparsing warning doesn't have anything to do with the blocked queue message. Here are some things to check:
Does metrics.log on the indexer (receiver of the forwarded data) also contain blocked=true messages?
If not: does metrics.log on the forwarder indicate that only the parsingQueue is blocked, or indexQueue as well?
Seeing only the parsingQueue would indicate that something along the lines of too much time spent running complex regexes on the data.
A blocked indexQueue would mean the data isn't being sent to the indexer fast enough. What type of link is between the forwarders and indexer? How much bandwidth are you actually seeing being used by the forwarding port?
Following the above steps and poking around a bit should point out the bottleneck fairly quickly.
We recently ran into similar issue, 12 of our 40 forwarders were down for a period of 8 hours,when these 12 forwarders were brought up data ingestion into indexer was happenning at very very slow pace, we were almost lagging behind by 10 hours [difference between eventtimestamp versus indexedtime].In our topology data traversal happens from splunkforwarder-->HeavyForwarder-->indexer. On troubleshooting we did notice 'grep blocked=true var/log/splunk/metrics.log*' on all of our forwarders[for parsing queue] and the heavy forwarder[for parsing queue, ae queue]. We bumped up the queue size on both HeavyForwarder and Forwarders using below configurations and restarted both forwarders/heavyForwarders. After this activity in couple of hours we could notice reduction in the lag between the eventtimestamp versus indexedtime. In 3 to 6 hours after the change our data almost became current.
We shutdown few less critical forwarders for few hours in the hope that the heavy forwarder does not have to deal with too much data to parse/filtering(same machine acts as our search head and we were running mad searches in the hope to see the data come through), also we added a configuration in our forwarders input.conf to ignore any file whose modification time is older than 2d.
We are almost good in terms of data coming over and indexing fairly quickly and do not see any blocked=true messages in metrics.log, however we still notice below message in the forwarder splunkd.log, not sure what else we can do at this point to elimminate this other than horizontally scaling our HeavyForwarder and making sure forwarders are NOT down for such long time in future and adding splunkforwarder onto machine boot start scripts. We are below error will be gone in few days
Message in splunkd.log
TailReader - Could not send data to output queue (parsingQueue), retrying...
TailReader - Continuing...
[queue]
maxSize = 1GB
[queue=aq]
maxSize = 20MB
[queue=aeq]
maxSize = 20MB
inputs.conf
ignoreOlderThan=2d
Thanks @amrit and others for your sugesstions were very helpful. One other link which we found useful where the guy has collated all possible responses in one place is here
http://splunkgeek.blogspot.in/2015/05/could-not-send-data-to-output-queue_7.html.
If anyone has suggestion on how to get rid of "Could not send data to output queue (parsingQueue)" from showing in the splunkd.log it would help in future.
Can anyone please advice how to troubleshoot if the issue is on the indexer? When this happened I was able to see that ram had filled up but the CPU was not particularly overloaded.
If anyone has suggestion on how to get rid of "Could not send data to output queue (parsingQueue)"
It's important to remember that this message in the presence of no other problems does not really indicate a problem. In your case, all it probably means is that data is coming into your forwarder at a higher rate than the forwarder is able to send over the network.
For example: on a 1Gbit network, the HWF can accept network data at somewhere around 110MB/s (in theory). While the HWF is also capable of sending at 110MB/s, its destination is a single Splunk indexer, at any given time - whose indexing thruput is generally 25MB/s or lower. So, you potentially have data coming into the system at 100+MB/s, and exiting the system at <25MB/s. The math doesn't add up, so periodically you will see the forwarder back up, and emit the log message you're seeing.
I'm going to assume you have multiple indexers in your environment, and are using Auto Load Balancing (the default) as your forwarding strategy:
Assuming other things are healthy, these changes should help reduce or eliminate blockages at the forwarder level.
Also ran into this, and for us the solution was to add the following to limits.conf
on the forwarder:
[thruput]
maxKBps = 0
It seems the universal forwarder has a default value of 256kbps. This change removes the limit.
From various googling, it seems that this default was created to prevent network saturation. However the limit is on the forwarder's internal processing, which is uncompressed data, and what goes across the network is compressed (about 80% smaller). Meaning that this ends up rate limiting the network at around 50kbps, which is a rather absurd default by my book.
FWIW, this fixed our issue. We were also trying to index exchange SMTP protocol logs. Adding the file $SPLUNK_HOME\etc\system\local\limits.conf and putting in this stanza allowed the forwarder to begin sending events.
I recently ran into this issue and thought it might help someone. Another item to check would be /opt/splunk/etc/system/local/serverclass.conf on the indexer. We use apps to define what servers talk to what indexer. For us, each app has a whitelist.XX entry that specifies the subnets allowed. I had to add the subnet where the server lived and then /opt/splunk/bin/splunk reload deploy-server to reload the file. I then verified that the app was pushed to the client. I stopped and restarted the client and was able to see the data in Search.
BobM, perhaps this is a case just when you're running Splunk out of space. What we're discussing here is when you have more then one Universal Forwarder sending logs and just one of twenty, for example, is not working properly, not sending data to Splunk. But, this is a valuable information to consider when start analyzing the Splunk env.
Our fix was to stop the forwarder, remove the metric logs from var/log/splunk/metrics* and re-start, the files were 25MB
Splunk starts blocking when free space falls below 2GB and it sounds like this was the cause of your problem. Just deleting the files will work as a temporary fix but they will grow back.
You can change the default in "$SPLUNK_HOME/etc/system/local/server/conf"
You need a stanza
[diskUsage]
minFreeSpace = 1000 #or whatever value you decide on
Bob
Wow, very thorough. 🙂
The TailingProcessor message means that it was unable to insert data into the parsingQueue, which, as you might guess, is where event parsing happens. This only happens when the queue is full - to confirm this, run 'grep blocked=true var/log/splunk/metrics.log*' and you should see many results.
Data travels through Splunk's queues pretty much linearly - meaning that if a queue further down the line (indexQueue) is blocked, it will eventually fill and block the queues that are feeding it (parsingQueue). The metrics.log output will indicate which queues are blocked.
The timeparsing warning doesn't have anything to do with the blocked queue message. Here are some things to check:
Does metrics.log on the indexer (receiver of the forwarded data) also contain blocked=true messages?
If not: does metrics.log on the forwarder indicate that only the parsingQueue is blocked, or indexQueue as well?
Seeing only the parsingQueue would indicate that something along the lines of too much time spent running complex regexes on the data.
A blocked indexQueue would mean the data isn't being sent to the indexer fast enough. What type of link is between the forwarders and indexer? How much bandwidth are you actually seeing being used by the forwarding port?
Following the above steps and poking around a bit should point out the bottleneck fairly quickly.
I should note that as of recent Splunk versions (4.2 or 4.3), it may be harder to see the "blocked" messages. An additional thing to check would be current vs max queue size, with:
grep --color group=queue.*current_size_kb var/log/splunk/metrics.log
...as the queues could be at 95-99% capacity, but not emitting "blocked" messages because they're not at 100% (due to queuing strategy changes in recent versions).
I am having the same problem. Only parsingqueue is showing up blocked=true.
I determined that the network port was not open between this server and the indexer and that was the problem.
-Ron C.
Yeah. This basically indicates either network problems, or problems with the config of outputs.conf on the forwarder (to get the the indexer) or inputs.conf in the indexer (to receive from the forwarder).
right, the problem is not on the indexer then - it's certainly possible that the link between your forwarder and indexer was having issues.
I really have no idea what "unblocked" the parsing queue on the light forwarders though (both were unblocked at more or less the same time)...I wish I knew because I have a suspicion that I'll see that even again somewhere along the way. So if I didn't see this at all on our indexer, would that mean the problem isn't there at all? I'm not running any regexes on the data..just field extraction via the "delim=...fields=..." method.
Thanks amrit! I think I have a better understanding now of that thanks to your explanation. I did not find any references at all to "blocked=true" in the metrics.log file on the index server. However, I did find reference to "blocked=true" on our light forwarders: "name=parsingqueue, blocked=true, max_size=1000, filled_count=44, empty_count=0, current_size=1000, largest_size=1000, smallest_size=997". The unusual thing though is that they seemed to have stopped at 6:57am this morning. I also stopped receiving the "Could not send data to output queue" at 7:22 this morning.
oops...that last bit shouldn't have reference the web log files...I meant why would "TailingProcessor - Could not send data to output queue (parsingQueue)" be showing up in the splunkd.log of the light forwarder?