Getting Data In

How to avoid re-indexing data after overwriting a file?

Marco-IT
Path Finder

Hi everybody, 

let's say I'm monitoring a file, test.log, that contains these lines:

2022-02-25 14:00 - row 1
2022-02-25 14:00 - row 2
2022-02-25 14:03 - row 3
2022-02-25 14:05 - row 4

At some point, I overwrite the original file with another test.log containing these lines:

2022-02-25 14:00 - row 1
2022-02-25 14:00 - row 2
2022-02-25 14:03 - row 3
2022-02-25 14:05 - row 4
2022-02-25 17:10 - row 5
2022-02-25 17:10 - row 6

Currently, all the lines of the new test.log are ingested, so I end up with duplicates.

Is there a way to index only the last two rows?


richgalloway
SplunkTrust

Splunk re-indexes the entire file because it notices the whole thing has changed.  Rows 1-4 are different because they have different timestamps.  There are settings in inputs.conf that help Splunk detect whether it has already processed a file, but if the data itself changes, those settings don't help.

The best fix, IMO, is to change the application to not re-write the entire log file.  It's not a true log if the data changes (again, IMO).

---
If this reply helps you, Karma would be appreciated.

Marco-IT
Path Finder

Hi @richgalloway,

thank you for your answer.

I worded my question incorrectly: the first four lines of the two files, including the timestamps, are identical (I have already edited the question).

richgalloway
SplunkTrust

If the first four lines rarely change, then Splunk should index only the new rows.  Perhaps a setting (crcSalt or initCrcLength) can be adjusted to help.  Can you share the inputs.conf settings for the file?
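For illustration only, a stanza along these lines might help (the monitor path and the value 1024 are placeholders, not recommendations; initCrcLength defaults to 256):

[monitor:///path/to/test.log]
disabled = false
# Hash more of the file's head when computing the initial CRC, so a
# replacement file that starts with the same lines is recognized as the
# same file and reading resumes from the saved offset.
initCrcLength = 1024

Note that crcSalt = <SOURCE> adds the file's path to the CRC, which distinguishes files with identical headers at different paths; it would not help here, where the path never changes.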

---
If this reply helps you, Karma would be appreciated.

Marco-IT
Path Finder

Hi @richgalloway,

Right now I'm running some tests on a test machine, and the content of inputs.conf is very basic:

[monitor:///tmp/test_folder/test1.log]
disabled = false
sourcetype = test01

Also, the content of the file above is just an example: in practice I'll have a file with some lines (not exactly four); during the day, a new file with the same name, the same lines, and a few more will overwrite the previous one. This will happen several times a day, and I want to avoid duplicates.

I hope this clarifies the situation.

Thank you for any suggestions you have.


richgalloway
SplunkTrust

First, the inputs.conf stanza needs an index setting.  It's not good practice to send data to the default index.
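For example, building on your stanza (the index name test_idx is a placeholder; the index must already exist on the indexer):

[monitor:///tmp/test_folder/test1.log]
disabled = false
sourcetype = test01
# Route events to a dedicated index instead of the default one
index = test_idx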

I'm not sure there's a solution to this problem other than to change how the application writes logs.  I think Splunk is seeing a new file every time the application re-writes it, and so it indexes the whole thing.  Splunk cannot check whether incoming data already exists in an index.

---
If this reply helps you, Karma would be appreciated.