I am providing a directory for Splunk to index. The directory contains both plain-text log files and gzipped log files (.gz); the .gz files are older logs that were compressed to save space.
But while indexing, splunkd.log shows many warnings like "Breaking event because limit of 256 has been exceeded - data_source=<.gz file name>". This causes a drop in the overall indexing rate, as the parsing stage alone is taking longer than expected.
How can I mitigate this issue? Is any extra configuration needed in props.conf to support heterogeneous input types?
In an ideal world you don't want to index .gz rolled logs.
Your archived .gz files contain the historic logs that have rolled over, but in a working system you will already have indexed those log files while they were 'new' and being written to 'your_log.log'.
Indexing the .gz files as well will therefore likely give you duplicates. Your monitor stanza should be restricted to only the .log files, either by whitelisting them or by blacklisting the .gz versions.
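As a sketch, a monitor stanza restricted to the live .log files might look like this (the path, sourcetype, and index names are hypothetical; adjust to your environment):

```ini
# inputs.conf -- monitor the directory but only pick up live .log files
# /var/log/myapp, myapp_logs, and main are assumptions for illustration
[monitor:///var/log/myapp]
whitelist = \.log$
sourcetype = myapp_logs
index = main
```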
There is, however, a caveat: when you first install Splunk! It's quite conceivable that you want to import your old archived logs at that point - you could configure a sourcetype for the .gz files, or simply extract the original logs and let Splunk ingest those.
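For a one-time import of the extracted archive logs, a batch input is one option - it reads the files once rather than monitoring them continuously. A minimal sketch, assuming a hypothetical staging directory you have extracted the .gz files into:

```ini
# inputs.conf -- one-shot ingestion of extracted archive logs
# /var/log/myapp/archive_import is a hypothetical staging path
[batch:///var/log/myapp/archive_import]
# sinkhole deletes each file after it has been indexed,
# so only copy files here that you are willing to lose
move_policy = sinkhole
sourcetype = myapp_logs
index = main
```

Because the files are deleted after ingestion, stage copies of the extracted logs here rather than the originals.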
As long as the files in the gz are the same format (and your event breaking is perfect on the text .log files) you shouldn't have any issues indexing archived logs with the same sourcetype.
Splunk will not process archive files multi-threaded, so indexing archived files takes significantly longer than flat text files (which are indexed in parallel).
If you are receiving event-breaking warnings on the .gz files, you are probably getting them on the .log files too; you just may not have noticed.
It's odd - that shouldn't be necessary if they are detected as the same sourcetype. I presume they are indexed with the same sourcetype but a different source, and that your props are applied to the sourcetype, not the source?
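If the props are keyed on the sourcetype, one stanza covers both the .log and .gz sources. A sketch of explicit event breaking for single-line, timestamp-led events (the sourcetype name and timestamp format are assumptions - match them to your actual data):

```ini
# props.conf -- scoped to the sourcetype, so it applies to both
# the .log sources and the .gz sources
# myapp_logs and the %Y-%m-%d timestamp layout are assumptions
[myapp_logs]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)(?=\d{4}-\d{2}-\d{2})
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
# If you must keep line merging enabled, raising MAX_EVENTS above
# its default of 256 is what silences the "limit of 256" warning:
# MAX_EVENTS = 1024
```

Disabling line merging with an explicit LINE_BREAKER is generally cheaper at parse time than letting the aggregator merge lines up to the MAX_EVENTS limit.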
Ideally, it is good practice to store gz/archived files in a different folder or disk mount. If that cannot be done and you do not want to index the .gz files, add a blacklist for .gz files to your input stanza. This will index only the recent files, improve the indexing rate, and avoid possible duplicate events.
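A minimal blacklist sketch (the directory path is hypothetical):

```ini
# inputs.conf -- keep monitoring the directory, skip rolled .gz logs
[monitor:///var/log/myapp]
blacklist = \.gz$
```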
It's a directory I added for indexing, which already contains both recent and rolled-over logs. And since Splunk hadn't indexed any data yet, it won't cause a duplicate issue.
But my question is: how do we index data that is a mixture of different file types? Is segregation necessary, and do we need an extra stanza to handle a directory containing .gz files?