Getting Data In

How to avoid re-indexing data after overwriting a file?

Marco-IT
Explorer

Hi everybody, 

let's say I'm monitoring the file test.log, which contains these lines:

2022-02-25 14:00 - row 1
2022-02-25 14:00 - row 2
2022-02-25 14:03 - row 3
2022-02-25 14:05 - row 4

At some point, I overwrite the original file with another test.log containing these lines:

2022-02-25 14:00 - row 1
2022-02-25 14:00 - row 2
2022-02-25 14:03 - row 3
2022-02-25 14:05 - row 4
2022-02-25 17:10 - row 5
2022-02-25 17:10 - row 6

Currently, all the lines of the new test.log are ingested again, so I end up with duplicates.

Is there a way to index only the last two rows?


richgalloway
SplunkTrust

Splunk re-indexes the entire file because it noticed the whole thing changed.  Rows 1-4 are different because they have different timestamps.  There are settings in inputs.conf that help Splunk detect if it's already processed a file, but if the data changes then those settings don't help.
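
For reference, these are the kinds of settings I mean (the monitor path below is just an illustration, not your actual input):

[monitor:///path/to/test.log]
# How many bytes at the start of the file are hashed to recognize it (default is 256)
initCrcLength = 256
# Optional extra string mixed into that hash; the literal <SOURCE> adds the file's full path
crcSalt = <SOURCE>

These help Splunk recognize a file it has already read; they cannot deduplicate content that has genuinely changed.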

The best fix, IMO, is to change the application to not re-write the entire log file.  It's not a true log if the data changes (again, IMO).

---
If this reply helps you, Karma would be appreciated.

Marco-IT
Explorer

Hi @richgalloway,

thank you for your answer.

I worded my question incorrectly: the first 4 lines of the two files are identical, including the timestamps (I have already corrected the question above).

richgalloway
SplunkTrust

If the first four lines rarely change, then Splunk should index just the new rows.  Perhaps a setting (crcSalt or initCrcLength) can be adjusted to help.  Can you share the inputs.conf settings for the file?
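
For context, initCrcLength controls how many bytes from the start of the file Splunk hashes to recognize it; if those bytes are stable across rewrites, Splunk should pick up where it left off.  A sketch, with an illustrative path and value:

[monitor:///path/to/test.log]
# Hash the first 1024 bytes instead of the default 256 when identifying the file
initCrcLength = 1024

Whether this helps depends on how the file is actually rewritten.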

---
If this reply helps you, Karma would be appreciated.

Marco-IT
Explorer

Hi @richgalloway,

Right now I'm doing some tests on a test machine, and the content of inputs.conf is very basic:

[monitor:///tmp/test_folder/test1.log]
disabled = false
sourcetype = test01

Also, the content of the file above is just an example: the real file will have some number of lines (not exactly four). During the day, a new file with the same name, the same lines, and a few more will overwrite the previous one. This will happen several times a day, and I want to avoid duplicates.

I hope this clarifies the situation.

Thank you for any suggestions you have.


richgalloway
SplunkTrust

First, the inputs.conf stanza needs an index setting.  It's not good practice to send data to the default index.
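
For example (the index name below is a placeholder; use an index that actually exists in your environment):

[monitor:///tmp/test_folder/test1.log]
disabled = false
sourcetype = test01
# "test01_idx" is a placeholder index name
index = test01_idx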

I'm not sure there's a solution to this problem other than to change how the application writes logs.  I think Splunk is seeing a new file every time the application re-writes it and so indexes the whole thing.  Splunk cannot test if data coming in already exists in an index.

---
If this reply helps you, Karma would be appreciated.