Getting Data In

How to avoid re-indexing data after overwriting a file?

Marco-IT
Explorer

Hi everybody, 

let's say I'm monitoring the file test.log, which contains these lines:

2022-02-25 14:00 - row 1
2022-02-25 14:00 - row 2
2022-02-25 14:03 - row 3
2022-02-25 14:05 - row 4

At some point, I overwrite the original file with another test.log containing these lines:

2022-02-25 14:00 - row 1
2022-02-25 14:00 - row 2
2022-02-25 14:03 - row 3
2022-02-25 14:05 - row 4
2022-02-25 17:10 - row 5
2022-02-25 17:10 - row 6

Currently, all the lines of the new test.log are ingested again, so I end up with duplicates.

Is there a way to index only the last two rows?


richgalloway
SplunkTrust

Splunk re-indexes the entire file because it noticed the whole thing changed.  Rows 1-4 are different because they have different timestamps.  There are settings in inputs.conf that help Splunk detect if it's already processed a file, but if the data changes then those settings don't help.
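
For reference, these are the kinds of settings I mean (the monitor path below is just an illustration, not your actual input):

[monitor:///path/to/test.log]
# How many bytes at the start of the file are hashed to recognize it (default is 256)
initCrcLength = 256
# Optional extra string mixed into that hash; the literal <SOURCE> adds the file's full path
crcSalt = <SOURCE>

These help Splunk recognize a file it has already read; they cannot deduplicate content that has genuinely changed.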

The best fix, IMO, is to change the application to not re-write the entire log file.  It's not a true log if the data changes (again, IMO).

---
If this reply helps you, Karma would be appreciated.

Marco-IT
Explorer

Hi @richgalloway,

thank you for your answer.

I worded my question incorrectly: the first 4 lines of the two files are identical, including the timestamps (I have already corrected the question above).

richgalloway
SplunkTrust

If the first four lines rarely change, then Splunk should index just the new rows.  Perhaps a setting (crcSalt or initCrcLength) can be adjusted to help.  Can you share the inputs.conf settings for the file?
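
For context, initCrcLength controls how many bytes from the start of the file Splunk hashes to recognize it; if those bytes are stable across rewrites, Splunk should pick up where it left off.  A sketch, with an illustrative path and value:

[monitor:///path/to/test.log]
# Hash the first 1024 bytes instead of the default 256 when identifying the file
initCrcLength = 1024

Whether this helps depends on how the file is actually rewritten.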

---
If this reply helps you, Karma would be appreciated.

Marco-IT
Explorer

Hi @richgalloway,

Right now I'm doing some tests on a test machine, and the content of inputs.conf is very basic:

[monitor:///tmp/test_folder/test1.log]
disabled = false
sourcetype = test01

Also, the content of the file above is just an example: the real file will have some number of lines (not exactly four). During the day, a new file with the same name, the same lines, and a few more will overwrite the previous one. This will happen several times a day, and I want to avoid duplicates.

I hope this clarifies the situation.

Thank you for any suggestions you have.


richgalloway
SplunkTrust

First, the inputs.conf stanza needs an index setting.  It's not good practice to send data to the default index.
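
For example (the index name below is a placeholder; use an index that actually exists in your environment):

[monitor:///tmp/test_folder/test1.log]
disabled = false
sourcetype = test01
# "test01_idx" is a placeholder index name
index = test01_idx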

I'm not sure there's a solution to this problem other than to change how the application writes logs.  I think Splunk is seeing a new file every time the application re-writes it and so indexes the whole thing.  Splunk cannot test if data coming in already exists in an index.

---
If this reply helps you, Karma would be appreciated.