
Splunk ingesting duplicate data from Azure File Share monitored file - Checksum for seekptr didn't match, will re-read e

gjlewis
Explorer

I have an issue where I have set up a Universal Forwarder on a Windows Azure server to monitor data stored on an Azure file share server. 

This is my inputs.conf:

 

[monitor://\\********.file.core.windows.net\KanaResponse\RespShare\logs\log20*.xml]
disabled = 0
index = kana
sourcetype = kana_xml
crcSalt = <SOURCE>

 

The problem is that Splunk thinks the CRC has changed each time the file is written to, and re-ingests the whole file. The header of the file does not change, so I'm not sure why this happens. Other posts suggest that Azure file share caching alters metadata that feeds into the CRC calculation, but I'm not sure that is definitely the case here.
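The log message below names the seek-pointer checksum rather than the header CRC, so one thing worth checking is whether bytes before the previous end of file get rewritten when the application appends (for example, if a closing XML tag is moved). The following is a minimal diagnostic sketch, not Splunk's actual algorithm: `fingerprint` is a hypothetical helper, SHA-256 stands in for Splunk's internal checksum, and the 256-byte window is borrowed from the default `initCrcLength` and is only an assumption for the tail region.

```python
import hashlib
from pathlib import Path


def fingerprint(path, offset, window=256):
    """Hash the first `window` bytes and the `window` bytes ending at `offset`.

    This only approximates the idea of a header CRC plus a seek-pointer
    checksum; Splunk's real implementation is not reproduced here.
    """
    with Path(path).open("rb") as f:
        head = f.read(window)
        start = max(0, offset - window)
        f.seek(start)
        tail = f.read(offset - start)
    return hashlib.sha256(head).hexdigest(), hashlib.sha256(tail).hexdigest()
```

To use it, record `fingerprint(path, size)` with `size` set to the current file size, wait for the application to write again, then call it once more with the same `size`. If the second hash differs while the first stays the same, data before the old end of file was rewritten, which would be consistent with a seek-pointer checksum mismatch.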

Each file generates approximately 6,000 events, but because of the re-ingestion this can grow to over a million events per file. Our license would get eaten up pretty quickly if I left the feed enabled constantly.

A knock-on issue is that when the log fills and a new file is created, Splunk doesn't see the new file, and the data feed stops until the Splunk forwarder is restarted. It does, however, stop ingesting the previous file.

Splunk's internal log shows the following details confirming it thinks the file is new:

 

10-05-2022 11:37:12.556 +0100 DEBUG TailReader [8280 tailreader0] - Defering notification for file=\\********.file.core.windows.net\KanaResponse\RespShare\logs\log20221005_101607.xml by 3.000ms
10-05-2022 11:37:12.556 +0100 DEBUG TailReader [8280 tailreader0] - Finished reading file='\\********.file.core.windows.net\KanaResponse\RespShare\logs\log20221005_101607.xml' in tailreader0 thread, disposition=NO_DISPOSITION, deferredBy=3.000
10-05-2022 11:37:12.556 +0100 DEBUG WatchedFile [8280 tailreader0] - Reached EOF: fname=\\********.file.core.windows.net\KanaResponse\RespShare\logs\log20221005_101607.xml fishstate=key=0x8908643efe7e891f sptr=865145 scrc=0x77aadaaeb3af22ee fnamecrc=0xbd1b79bedeae4211 modtime=1664963939
10-05-2022 11:37:12.556 +0100 DEBUG WatchedFile [8280 tailreader0] - seeking \\********.file.core.windows.net\KanaResponse\RespShare\logs\log20221005_101607.xml to off=857837
10-05-2022 11:37:12.524 +0100 DEBUG TailReader [8280 tailreader0] - About to read data (Reusing existing fd for file='\\********.file.core.windows.net\KanaResponse\RespShare\logs\log20221005_101607.xml').
10-05-2022 11:37:12.524 +0100 INFO  WatchedFile [8280 tailreader0] - Will begin reading at offset=0 for file='\\********.file.core.windows.net\KanaResponse\RespShare\logs\log20221005_101607.xml'.
10-05-2022 11:37:12.524 +0100 INFO  WatchedFile [8280 tailreader0] - Checksum for seekptr didn't match, will re-read entire file='\\********.file.core.windows.net\KanaResponse\RespShare\logs\log20221005_101607.xml'.
10-05-2022 11:37:12.478 +0100 DEBUG TailReader [8280 tailreader0] -   Will attempt to read file: \\********.file.core.windows.net\KanaResponse\RespShare\logs\log20221005_101607.xml from existing fd.
10-05-2022 11:37:12.478 +0100 DEBUG TailReader [8280 tailreader0] - Start reading file="\\********.file.core.windows.net\KanaResponse\RespShare\logs\log20221005_101607.xml" in tailreader0 thread
10-05-2022 11:37:00.394 +0100 INFO  Metrics - group=per_source_thruput, series="\\********.file.core.windows.net\kanaresponse\respshare\logs\log20221005_101607.xml", kbps=0.622, eps=0.064, kb=19.295, ev=2, avg_age=1134.000, max_age=2268
10-05-2022 11:36:48.567 +0100 DEBUG TailReader [5484 MainTailingThread] - Enqueued file=\\********.file.core.windows.net\KanaResponse\RespShare\logs\log20221005_101607.xml in tailreader0
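To gauge how often each file is being re-read, the INFO message above can be counted per file. This is a rough sketch that assumes the message format shown in this excerpt; `count_rereads` is a hypothetical helper, not part of any Splunk tooling.

```python
import re
from collections import Counter

# Matches the INFO line shown in the excerpt and captures the quoted file path.
REREAD = re.compile(
    r"Checksum for seekptr didn't match, will re-read entire file='([^']+)'"
)


def count_rereads(lines):
    """Count full re-read events per monitored file in splunkd.log lines."""
    counts = Counter()
    for line in lines:
        match = REREAD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open handle to the forwarder's `splunkd.log` (normally under `$SPLUNK_HOME/var/log/splunk/`) yields a `Counter` keyed by file path, which makes it easier to see whether a re-read follows every write or only some of them.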

 

If anyone has any ideas how to circumvent this issue, I'd be hugely grateful.

I have tried using MonitorNoHandle, but that doesn't work because (a) Splunk wants the network location to be mapped to a drive, which we aren't able to do, and (b) it requires individual files to be monitored, which we can't do easily because each new file has its creation timestamp in its filename.

Thanks 


Zane
Explorer

Is there any resolution?
