I'm running Splunk version 5.0.3, build 163460, on SUSE Linux 3.0.13-0.27.
I have a Splunk Dashboard Search and a Splunk Alert Search that look as follows...
Below is the saved search used in the dashboard panel:
source=udp:515 f5type="RESPONSE" earliest=-60m@h latest=@h | bucket _time span=1m | fields status, _time | eventstats count as TotalTransx by _time | eventstats count as TransactionByStatusCount by _time status | dedup status, TransactionByStatusCount, TotalTransx | where status!="200" | eventstats sum(TransactionByStatusCount) as TotalExceptions by _time | dedup _time, TotalExceptions, TotalTransx | eval SuccessfullTransx=(TotalTransx - TotalExceptions) | convert timeformat="%H:%M" ctime(_time) AS Time | sort Time | table Time, TotalTransx, SuccessfullTransx, TotalExceptions | chart avg(TotalTransx) AS "Total Transactions" avg(SuccessfullTransx) AS "Transaction Success" avg(TotalExceptions) AS "Transaction Failures"
Below is the search used for the alert:
source=udp:515 f5type="RESPONSE" earliest=-2m@m latest=-1m@m | stats count | where count=0
All of a sudden today (after running fine for many months) these searches started returning zero results, so the dashboard showed no values and the alert triggered! When I checked with the application team (for the application we're using Splunk to monitor) to see whether the alert was correct in reporting the application down, they told me their application was running just fine. Further checks confirmed that the alert was false. I realized the problem only happens when I specify the time range in the search itself, using (earliest=-60m@h latest=@h) or (earliest=-2m@m latest=-1m@m), in the search app. I ultimately decided to restart the Splunk server, but that did not seem to resolve the issue, as the problem persisted for some 20 minutes after the restart. The false alerts lasted about an hour and a half, and then the problem automagically resolved itself!
Has anyone ever experienced this odd phenomenon? Any ideas on what may have suddenly caused this strange behavior and how I can prevent it from happening in the future?
Just to be clear, are you saying that the original source data kept arriving during this time? Is it not the case that you were missing events? Since you're relying on UDP, it's quite possible that the application was available but no log data was received.
Indeed, the incoming data was still flowing when this odd phenomenon took place. When I specified the time using the "Time Range Picker" and NOT the search field, I could see the events for the desired time range. I'm not missing any events at all, and UDP was not the culprit. Thank you very much for the swift response. Much appreciated.
Are you sure that the data arrived when it should have? My next thought would be to run something like:
sourcetype=blah | eval IndexTime=_indextime | eval diff=_time-IndexTime | timechart avg(diff)
or something similar for your data, over the hour in which you experienced problems, to see whether the events did arrive but were perhaps indexed later due to latency or some other hiccup.
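As a hedged variation on the search above, the sketch below charts the event count next to the indexing lag, so that a stretch with no events at all stands out from a stretch where events merely arrived late. It assumes the f5ltm sourcetype mentioned elsewhere in this thread; the time range, the 5m span, and the field names lag, avg_lag, and max_lag are illustrative choices, not values from the original posts.

```
sourcetype=f5ltm earliest=-4h@h latest=@h
| eval lag=_indextime-_time
| timechart span=5m count avg(lag) AS avg_lag max(lag) AS max_lag
```

Here lag is positive when an event was indexed after its own timestamp (the opposite sign of diff above). Buckets with a count of 0 mean no events carry a timestamp in that interval, while a large max_lag points to events that were indexed late.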
OK, bad news! The problem is recurring now as I type, and the data is flowing in through UDP port 515, as reflected in the tcpdump output. So the problem is NOT the network dropping data packets, as we had concluded. 😞
mxxxmachine:~ # tcpdump dst port 515
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
14:41:42.691803 IP 188.8.131.52.39560 >
Any other suggestions that could help get to the root of this problem and prevent it from recurring?
Thanks in advance for your help.
This problem is becoming a daily pain. I did some log analysis, and it turns out the problem happens when the following entries appear in the log. Unfortunately, it doesn't happen all the time. Help!
10-22-2013 07:37:00.603 +0200 WARN AggregatorMiningProcessor - Breaking event because limit of 256 has been exceeded - datasource="udp :515", datahost="172.17.100.75", datasourcetype="f5ltm"
10-22-2013 09:37:00.586 +0200 WARN AggregatorMiningProcessor - Breaking event because limit of 256 has been exceeded - datasource="udp
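This warning is logged by Splunk's line-merging stage when a single merged event reaches the MAX_EVENTS limit (256 lines by default in props.conf). One way to check whether the warnings line up with the empty alert windows is to chart them from Splunk's own logs. The sketch below is only a starting point: it assumes the default _internal index and the splunkd.log source, and the time range, span, and the name merge_limit_warnings are illustrative.

```
index=_internal source=*splunkd.log* AggregatorMiningProcessor "Breaking event because limit" earliest=-24h@h latest=@h
| timechart span=5m count AS merge_limit_warnings
```

If the spikes coincide with the windows in which the alert fired, the line-merging settings for this sourcetype (for example SHOULD_LINEMERGE in props.conf) may be worth reviewing, since many UDP lines merged into one event would share a single timestamp.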
I ran the search you provided above for the time range 12h00 to 14h00 today and got the results below. It seems you're correct that the indexing stopped for some reason!
sourcetype="f5ltm" | eval IndexTime=_indextime | eval diff=_time-IndexTime | timechart avg(diff)
1 10/1/13 12:00:00.000 PM -0.035207
2 10/1/13 12:05:00.000 PM -0.034583
3 10/1/13 12:10:00.000 PM -0.034141
4 10/1/13 12:15:00.000 PM -0.033685
5 10/1/13 12:20:00.000 PM -0.035301
Not sure what could have happened though... 😞
Those results look alright though? That's just the latency, and it's pretty small, unless there are other bits you haven't pasted 🙂 I have some UDP data, and sometimes a blip in the network introduces latency; if Splunk searches before the data arrives, it will rightly think there is an issue.
There is in fact a gap in the "diff" values between 12h20 and 14h30 that I didn't show in my previous comment. So there was an apparent loss of data in that time span, and the reporting also reflects no results for that period. So perhaps, as you suggested, UDP misbehaved or the network had some glitch at that moment. Thank you very much for helping get to the root cause of the problem. Tomorrow I'll try AGAIN to find out from the network team whether they picked up anything odd during that time span that may have resulted in data not reaching the Splunk server.
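If late arrival is the concern, one possible way to make the alert less sensitive to it is to select events by when they were indexed rather than by their own timestamps. The sketch below reuses the alert search from the question and adds Splunk's _index_earliest and _index_latest time modifiers; it assumes those modifiers are available in the version in use, and the 60-minute event-time window is an arbitrary illustrative allowance for late or mis-timestamped events.

```
source=udp:515 f5type="RESPONSE" earliest=-60m@m latest=now _index_earliest=-2m@m _index_latest=-1m@m
| stats count
| where count=0
```

With this form the alert fires only if nothing at all was indexed during the previous minute, whatever timestamps the events carry. The event-time window (earliest/latest) still has to be wide enough to include the events being counted.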
No worries, I've converted the comment to an answer so feel free to click the tick and accept it if you're feeling generous 🙂 Hope you find the problem! (Also, switch to TCP!)