I have a clustered Splunk environment with monitoring in place for quite a few application logs.
Lately, I have been encountering an issue with data collection in Splunk.
For some period every day (2 to 5 hours), I do not see any data even though the application server is generating logs.
But for the rest of the day it works just fine.
Universal Forwarders and indexers are working just fine.
This is affecting the dashboards and alerts, as data is being missed.
Example log:
2020-02-13T05:01:45.249-0500 INFO 801 | UNIQ_ID=20200213050500000170151fda6c-171dcee | TRANS_ID=000001da6c-171dce8 | VERSION=1.09 | TYPE=AUDIT | INTERNAL_ERROR_MSG= | UTC_ENTRY=2020-02-13T10:05.178Z | UTC_EXIT=2020-02-13T10:01:45.230Z,"Timestamp":"2020-02-13T10:01:45.062Z","Organization":"abc","Region":"RStS","ApplicationName":"Anoid"},"Data":{"rsCommand":"Clization","rsStatus":"executed","statusTimeStamp":"2020-02-13T09:57:13.000Z","rsStatusReason":"executed","lastRemoTimeStamp":"2020-02-13T09:57:29.000Z","rsMinutesRemaining":"6","remoDuration":"10","interTemperature":"12","interTimeStamp":"2020-02-13T10:00:20.000Z","Successful Execution"}}
Can anyone give some insight if you have faced or come across this kind of issue?
I suspect Splunk is getting confused between the time format at the start of the actual event and the time values inside the event, like statusTimeStamp and lastRemoTimeStamp in the example log above. But that doesn't tell me how to go about solving this issue.
@jacobevans and @nickhillscpl
I was able to resolve this issue by rebalancing the indexer cluster:
https://docs.splunk.com/Documentation/Splunk/8.0.2/Indexer/Rebalancethecluster
When I checked all aspects of the Splunk instances that could cause this issue, I saw discrepancies in bucket replication and in the transition from hot to warm buckets, which explains why data is available when it is first ingested (hot bucket) and unavailable later (the transition to a warm bucket failed and no replication happened).
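For anyone hitting the same thing, data rebalance is kicked off from the cluster master; a minimal sketch based on the documentation linked above (commands run on the master node's CLI):

# On the cluster master: start a cluster-wide data rebalance
splunk rebalance cluster-data -action start
# Check progress, or stop it if needed
splunk rebalance cluster-data -action status
splunk rebalance cluster-data -action stop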
Thanks anyway for helping me narrow down the issue 🙂
Cheers
Take care!!
Hi,
Were you able to actually get the logs back, or did you just fix the issue so that it did not happen again while the missing logs remained missing?
Thanks.
Try this in your props.conf
[mySourcetype]
TIME_PREFIX=^
TIME_FORMAT=%Y-%m-%dT%H:%M:%S.%3N%z
MAX_TIMESTAMP_LOOKAHEAD = 28
@nickhills, I already have this configuration in props:
[mysourcetype]
SHOULD_LINEMERGE=false
LINE_BREAKER = ([\r\n]+)\d{4}-\d\d
MAX_EVENTS=1000
TRUNCATE=10000
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%3N%Z
What's surprising to me is that I can see logs getting ingested in the 2 to 8 AM timeframe,
but later in the day I do not see them at all.
Is the line breaking affecting things here?
You shouldn't need the LINE_BREAKER or line merging if it's finding the time properly, but make sure you set the lookahead to 28. This tells Splunk that the timestamp you care about is in the first 28 characters (and to ignore anything else that looks like a timestamp deeper in the event).
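For example, a trimmed-down stanza along these lines (a sketch using the sourcetype name from your config above; note the lowercase %z to match a numeric offset like -0500):

[mysourcetype]
# Timestamp sits at the very start of the event, e.g. 2020-02-13T05:01:45.249-0500 (28 characters)
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%3N%z
MAX_TIMESTAMP_LOOKAHEAD = 28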
@nickhillscpl, I applied the configs you mentioned but the issue persists.
The data was logged successfully until an hour ago, but now when I check, it's gone from 4 AM to 5 AM EST, whereas it was there before when I checked. The rest of the logs are intact.
Odd, I posted a comment but it's vanished.
If data is randomly appearing and disappearing, I wonder if you have an issue with a search peer not returning results. Are you sure you have no errors reported in the job inspector?
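One quick way to spot-check the peers from the search head is the distributed-search peers REST endpoint; just a sketch, and field names can vary slightly between versions:

| rest /services/search/distributed/peers
| table title status

The job inspector itself is under Job > Inspect Job in the search UI once a search has run.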
You are right, I no longer see the data that was indexed a few hours back.
How do you suggest I check that the search peers are fine and whether the job inspector has reported any errors?
What is the output of this query? (Last 24 hours or whenever you have the missing data)
index=<your index>
| eval latency = abs(_indextime - _time)
| stats max(latency) as max_latency by sourcetype
| sort 0 - max_latency
Does this line chart have the same gaps you see in your original search?
index=<your index>
| eval _time = _indextime
| timechart count
@jacobevans, thanks for the response.
I could not share the entire output; I have pasted a few rows below:
sourcetype max_latency
abc 4492800.0000
def 3486215.000
ghi 3196800.3570
jkl 3196798.3560
mno 1814345.000
pqr 777648.3730
Also, I see the same gaps in the line chart from the search you gave:
index=
| eval _time = _indextime
| timechart count
You have incredible latency between the time events were actually indexed on your indexer(s) (_indextime) and the time that Splunk parsed out (the standard _time). Were you indexing historical logs, or are these logs supposed to be ingested in real-time?
The second answer does not make sense to me. If your _indextime and _time are that far off, I would not expect you to see the same gaps in both _time and _indextime. You may just have a few outliers that are throwing it off.
This will be difficult for us to help with without more information. Could you share some sample events? This will show us your biggest outliers. Be sure to censor all sensitive information. The only information we really need is _indextime, _time, and the time in your _raw event from the search below:
index=<your index>
| eval latency_mins = round(abs(_indextime - _time)/60),
indextime = strftime(_indextime, "%Y-%m-%d %H:%M:%S")
| table _time indextime latency_mins index sourcetype _raw
| sort 0 - latency_mins
If the time is at the beginning of the _raw event, you can add something like this to eliminate the sensitive information (change the 30 so that the full time is visible):
| eval _raw = substr(_raw, 1, 30)
Here's the documentation on what I'm trying to get at: https://docs.splunk.com/Documentation/Splunk/latest/Data/Configuretimestamprecognition
@jacobpevans, let me break it down some more, using 16 Feb 2020 as an example.
The sourcetype I am facing the issue with is, say, ABC; the latency value for it is 59.846.
The line chart has gaps in it for a particular time frame, 12 AM to 7 AM.
_time indextime latency_mins _raw
2020-02-16 10:56:46.856 2020-02-16 10:56:47 0 2020-02-16T10:56:46.856-0500 INFO
2020-02-16 10:56:46.792 2020-02-16 10:56:47 0 2020-02-16T10:56:46.792-0500 INFO
2020-02-16 10:56:46.617 2020-02-16 10:56:47 0 2020-02-16T10:56:46.617-0500 INFO
2020-02-16 10:56:46.562 2020-02-16 10:56:47 0 2020-02-16T10:56:46.562-0500 INFO
2020-02-16 10:56:46.395 2020-02-16 10:56:47 1 2020-02-16T10:56:46.395-0500 INFO
2020-02-16 10:56:46.381 2020-02-16 10:56:47 1 2020-02-16T10:56:46.381-0500 INFO
2020-02-16 10:56:46.129 2020-02-16 10:56:47 1 2020-02-16T10:56:46.129-0500 INFO
2020-02-16 10:56:46.126 2020-02-16 10:56:47 1 2020-02-16T10:56:46.126-0500 INFO
2020-02-16 10:56:46.085 2020-02-16 10:56:47 1 2020-02-16T10:56:46.085-0500 INFO
Logs:
2020-02-16T14:51:07.112-0500 INFO 209 ABC.splunk.log: -4: | UNIQ_ID=2d-228523f | TRANS_ID=0000285235 | CLIENT_IP=123.34.45.7.67.56 | API_KEY=5f1c42727678a| ROUTING_CODE=Routed_200 | RESPONSE_ERROR_CODE=0 | RESPONSE_HTTP_CODE=200 | STATUS_MESSAGE=Message processed successfully_ | HTTP_BACKEND_CODE=200 | HTTP_EPAT_CODE= | HTTP_GUA_CODE= | HTTP_TCU_CODE= | INTERNAL_ERROR_MSG= | UTC_ENTRY=2020-02-16T19:51:06.985Z | UTC_EXIT=2020-02-16T19:51:07.103Z | REQUEST_PAYLOAD={"Data":{"AccountID":"46035002"}} | RESPONSE_PAYLOAD={"Data","Response":{"ResponseCode":"2000","ResponseStatus":"Success","ResponseDescription":"Successful Execution"}}
All the logs have the first timestamp at the start of _raw (ending just before INFO) as well as UTC_ENTRY and UTC_EXIT timestamps,
which differ by almost 5 hours, the same span of time for which the data is missing.
And logs in this format are generated only at this time, early morning to be specific.
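To double-check whether Splunk is sometimes picking up those UTC values and treating them as local time, a signed (not absolute) version of the earlier latency search, bucketed by hours of offset, might help. This is just a sketch, with the index and sourcetype names as placeholders:

index=<your index> sourcetype=ABC
| eval offset_hours = round((_time - _indextime)/3600, 1)
| stats count by offset_hours
| sort 0 - count

If a chunk of events lands around +5, those events were timestamped roughly 5 hours ahead of when they were indexed, which would match the UTC strings being read as local time.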