We are having an issue with our new Splunk 8.2.2 instance whenever a search includes a subsearch that processes a lot of data (1,500,000+ events). It happens with any kind of subsearch, whether it's a join, an append, or a regular subsearch. We get live results while the search is running, but when it finishes we get this error:
StatsFileWriterLz4 file open failed file=D:\Program Files\Splunk\var\run\splunk\srtemp\555759916_17936_at_1646839312.2\statstmp_merged_44.sb.lz4
The search job has failed due to an error. You may be able view the job in the Job Inspector
We watched the "srtemp" directory while a search was running and saw a directory being created for the job (like the 555759916_17936_at_1646839312.2 above), with a bunch of temp files being populated inside it. When smaller searches finish, the directory and all of its contents are successfully deleted and we get results as expected. With larger searches we get the error above and the directory is left behind, although all of the temp files inside it are deleted. The drive the directory is on has 9TB of free space, so we definitely aren't running out of space.
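In case it helps anyone reproduce this, a small polling script along the following lines could capture the same observation with timestamps. It is only a rough sketch: the function names (`snapshot`, `describe_changes`, `watch`) are made up for illustration, the one-second interval is an arbitrary assumption, and polling can miss files that live for less than one interval.

```python
import os
import time


def snapshot(root):
    """Map each job directory name under root to the number of entries it holds."""
    result = {}
    try:
        entries = list(os.scandir(root))
    except FileNotFoundError:
        return result
    for entry in entries:
        if entry.is_dir():
            try:
                result[entry.name] = len(os.listdir(entry.path))
            except OSError:
                # The directory vanished or was locked between the two calls.
                result[entry.name] = -1
    return result


def describe_changes(before, after):
    """Compare two snapshots and report created, removed, and emptied job directories."""
    events = []
    for name in sorted(after.keys() - before.keys()):
        events.append(("created", name))
    for name in sorted(before.keys() - after.keys()):
        events.append(("removed", name))
    for name in sorted(after.keys() & before.keys()):
        # "emptied" is the case we see on failure: files gone, directory still there.
        if before[name] != 0 and after[name] == 0:
            events.append(("emptied", name))
    return events


def watch(root, interval=1.0):
    """Poll root forever and print a timestamped line for every change (assumed interval)."""
    previous = snapshot(root)
    while True:
        time.sleep(interval)
        current = snapshot(root)
        for kind, name in describe_changes(previous, current):
            print(time.strftime("%H:%M:%S"), kind, name)
        previous = current
```

Running `watch(r"D:\Program Files\Splunk\var\run\splunk\srtemp")` while a search executes would print a "created" line when the job directory appears, then either "removed" (the small-search case) or "emptied" (the failing case where the directory is left behind).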
We have an old Splunk instance (7.3.0) that does not have this issue at all. In fact, when we watch its srtemp directory, nothing is created there at all. Clearly there is some key difference we are missing, but we are not sure what. We've tried increasing various limits in limits.conf and tried switching the journal compression from LZ4 to gzip, and nothing has worked. We are stumped and not sure what to do next. None of the internal logs tell us anything more than what the error in the search says. Any sort of insight on what to do next would be greatly appreciated!