Hello all,
I have 4 search heads (SH), 2 indexers, and 1 deployment server in one of my environments (Windows).
I'm now noticing a long delay before some of my data shows up in searches. This is a BIG issue for me, because in operations you need to catch things in near real time.
Some items I'm not able to search on until the next day. For example, with my IIS logs, if I search the last 15 minutes, maybe 4 of the 8 web servers show as producing logs. If I run the same search an hour later I'll get 7/8 servers, and an hour after that maybe 2/8 (so it's sporadic and it varies). If I search for IIS events older than 6 hours, all is well.
For my IIS indexer
12 CPU, 24GB memory
Indexing rate: around 250 KB/s (status = normal)
Indexing rate every 5 minutes is around 394 KB's
props.conf on indexer
[iis]
TZ = GMT
Index size= 700GB
Max size of Hot/Warm/Cold Bucket set to: auto
Homepath 263/ unlimited
cold 436/ unlimited
The highest host IIS Log Event Count: 343,166,069
by sourcetype (iis) 1,743,109,978
maxDataSize auto
maxHotBuckets 3
maxWarmDBCount 300
The Splunk data pipeline is at 0% across the board and shows no delays.
I also noticed under Index Detail: Instance that my cold bucket size is much larger than my hot/warm bucket size.
How have you solved it?
Have you verified all of the IIS servers have the correct time and time zone?
When you compare _time to _indextime, what do you see?
| tstats latest(_time) AS _time latest(_indextime) AS _indextime where index=iis by host
| eval delta=_indextime - _time
| where delta != 0
| eval indexTime=_indextime
| fields delta indexTime _time host
| sort - delta
| eval indexTime=strftime(indexTime, "%F %T")
| eval Time=strftime(_time, "%F %T")
| table delta indexTime Time host
Yes, the timestamps on all the IIS servers look fine. They are in UTC and, as stated in the OP, I've added a props.conf entry for that sourcetype that normalizes the data. If I search for events in the future nothing is returned, so I don't think it's a timestamp issue.
One thing I meant to mention, which I discovered as I was leaving work: another log source is also delayed. These two are the largest log sources I'm pulling.
However, smaller logs and sources still come through.
I'm starting to think I'm hitting my limit in limits.conf.
Putting TZ = GMT in props.conf does not normalize data. It's merely information to help indexers parse timestamps. If the timestamp is not in UTC, TZ = GMT will result in events being out of sequence.
Are the logs being sent by a forwarder? If so, consider increasing the maxKBps setting in the forwarder's limits.conf file.
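For reference, a minimal sketch of that change, assuming a universal forwarder that is still at its default limit of 256 KBps. The file location and the value 1024 are only illustrative assumptions, so pick a value that suits your network. Setting it to 0 removes the cap entirely. The forwarder needs a restart to pick up the change.

```ini
# limits.conf on the forwarder, e.g. $SPLUNK_HOME/etc/system/local/limits.conf
# (assumed location; an app pushed from the deployment server works too)
[thruput]
# Universal forwarders default to 256. 0 means unlimited.
# 1024 is an illustrative value, not a recommendation.
maxKBps = 1024
```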
Depending on what else the indexer is doing, 250GB/day is near the limit of what can be expected from a single indexer. If you can't increase the storage I/O rate then consider adding an indexer.
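Before changing anything, it may be worth checking whether any forwarder is actually being throttled. A rough sketch, assuming the forwarders send their own splunkd.log to the _internal index (the default) and that the ThruputProcessor message wording matches your version:

```spl
index=_internal sourcetype=splunkd component=ThruputProcessor "has reached maxKBps"
| stats count AS throttle_events latest(_time) AS last_seen by host
| eval last_seen=strftime(last_seen, "%F %T")
| sort - throttle_events
```

Hosts that show up here are hitting their cap. As a rough guide, 256 KBps works out to about 21 GB/day per forwarder, so a handful of busy web servers can reach it even when the total daily volume looks modest.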
I meant normalize the data with respect to the timestamp; I should have been clearer. Generally I do my field extractions at search time, on the search heads only.
You may have missed that I currently have 2 indexers, so each indexer is getting half this amount, and I don't think it's the indexers. There are no issues with the data pipeline, so I'm thinking limits.conf is probably where I need to concentrate.
Today I reviewed my actual logs and found someone doing something crazy with web API calls that has more than quadrupled the log size. So I'll have them stop what they are doing first and look into limits.conf at the same time.
Thanks for your help
Maybe I'm naive about hitting any limits, as I'm only ingesting 250 GB daily and I know of plenty who pull terabytes of data a day. Perhaps they've adjusted their limits.conf to allow the data to flow, or perhaps they are pulling from 1,000 devices to reach that 1 TB, so no individual node hits the default limit in limits.conf, whereas I'm pulling 250 GB from only 84 devices?
I definitely need to fix this problem asap!