topic Re: How do we define a duplicate record? in Getting Data In

How do we define a duplicate record?

ws — Thu, 06 Feb 2025 08:44:23 GMT

We are currently using a custom app to connect to a case/monitoring system and retrieve data.

I found out that Splunk has the ability to detect whether data has already been indexed.

But what about the following scenario: will it be treated as a duplicate or as new data, given that the case has a new closing time after being closed again?

One of the previously closed cases has been reopened and closed again with a new case-closed time. Will Splunk Enterprise treat it as new data to index?

Re: How do we define a duplicate record?

gcusello — Thu, 06 Feb 2025 09:21:21 GMT

Hi @ws ,

Splunk indexes all new data; the only exception is when the first 256 chars of the event are all the same.

Then, after indexing, you can dedup the results at search time, excluding duplicated data according to your requirements.

Deduping is usually done on one or more fields; it's also possible to find fully duplicated events by deduping on _raw.
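To make the difference between deduping on a field and deduping on _raw concrete, here is a small Python sketch. It only imitates the idea of keeping the first event per distinct key; it is not how Splunk implements dedup, and the event layout and field names (`id`, `closed`) are made up for illustration.

```python
from typing import Iterable


def dedup(events: Iterable[dict], keys: list[str]) -> list[dict]:
    """Keep the first event seen for each distinct combination of `keys`."""
    seen = set()
    kept = []
    for event in events:
        signature = tuple(event.get(k) for k in keys)
        if signature not in seen:
            seen.add(signature)
            kept.append(event)
    return kept


# Hypothetical events: the same case closed twice, plus one exact repeat.
events = [
    {"_raw": "id=42 status=closed closed=2025-02-01T10:00", "id": "42", "closed": "2025-02-01T10:00"},
    {"_raw": "id=42 status=closed closed=2025-02-05T16:30", "id": "42", "closed": "2025-02-05T16:30"},
    {"_raw": "id=42 status=closed closed=2025-02-05T16:30", "id": "42", "closed": "2025-02-05T16:30"},
]

by_id = dedup(events, ["id"])      # collapses everything for case 42
by_raw = dedup(events, ["_raw"])   # drops only the exact repeat

print(len(by_id), len(by_raw))
```

Deduping on the ID alone hides the second closure, while deduping on the whole raw event keeps both closures and removes only the identical copy.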

Ciao.

Giuseppe

Re: How do we define a duplicate record?

ITWhisperer — Thu, 06 Feb 2025 09:22:22 GMT

Splunk does not work like a database in this respect. So, it depends on how Splunk has been set up to detect "duplicates" of this nature. This is normally done with searches in reports or alerts or dashboards. These will normally depend on your data.

What searches do you already have set up?

What does your data look like?

How is it being ingested into Splunk?

What criteria do you want to use to determine that an event represents a duplicate?

Please provide as much detail as you can (without giving away sensitive information).

Re: How do we define a duplicate record?

PickleRick — Thu, 06 Feb 2025 11:58:10 GMT

Are you sure you're not talking about the first 256 bytes of a monitored file? (Of course, the header length is configurable.) The only duplicate detection I recall is connected with useACK, and even then it indexes the event twice but emits a warning, AFAIR.
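A rough Python sketch of what a file-level check on the first 256 bytes looks like, as opposed to an event-level check. This is a simplification for illustration only: `zlib.crc32` stands in for whatever checksum Splunk actually uses, the in-memory `seen_crcs` set stands in for its persistent tracking, and the file contents are invented.

```python
import zlib

HEAD_LENGTH = 256  # bytes checked at the start of each file (the figure mentioned above)

seen_crcs: set[int] = set()


def head_crc(content: bytes, length: int = HEAD_LENGTH) -> int:
    """Checksum of only the first `length` bytes of a file."""
    return zlib.crc32(content[:length])


def should_read(content: bytes) -> bool:
    """Return True the first time a given file head is seen, False afterwards."""
    crc = head_crc(content)
    if crc in seen_crcs:
        return False
    seen_crcs.add(crc)
    return True


# Two hypothetical files sharing a long identical header (270 bytes) but
# carrying different records further down.
header = b"case_id,status,closed_time\n" * 10
file_a = header + b"42,closed,2025-02-01T10:00:00\n"
file_b = header + b"42,closed,2025-02-05T16:30:00\n"

read_a = should_read(file_a)
read_b = should_read(file_b)
print(read_a, read_b)
```

The point of the sketch is that the check looks at the head of the file, not at individual events, so it says nothing about whether a particular case record was indexed before.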

Re: How do we define a duplicate record?

richgalloway — Thu, 06 Feb 2025 12:43:07 GMT

Splunk cannot and does not detect if data has already been indexed. As @gcusello said, it will attempt to avoid re-ingesting data, but that's not perfect.

It's up to the app doing the ingestion to prevent reading the same data twice. In DB Connect, for example, a "rising column" is defined to identify unique records. Your app could do something similar, using case ID and Closed Time, perhaps.
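A minimal sketch of that idea in Python, assuming the app can persist a small checkpoint file between runs. The record layout, the field names `case_id` and `closed_time`, and the helper names are all hypothetical; this is not DB Connect's mechanism, just the same principle applied to a (case ID, closed time) pair.

```python
import json
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> set[tuple[str, str]]:
    """Load the (case_id, closed_time) pairs that were already forwarded."""
    if not path.exists():
        return set()
    return {tuple(pair) for pair in json.loads(path.read_text())}


def save_checkpoint(path: Path, seen: set[tuple[str, str]]) -> None:
    """Persist the seen pairs so the next run can skip them."""
    path.write_text(json.dumps(sorted(seen)))


def filter_new(records: list[dict], seen: set[tuple[str, str]]) -> list[dict]:
    """Return only records whose (case_id, closed_time) pair is unseen, and mark them seen."""
    fresh = []
    for record in records:
        key = (record["case_id"], record["closed_time"])
        if key not in seen:
            seen.add(key)
            fresh.append(record)
    return fresh


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"

    # First pull: the case is closed for the first time.
    seen = load_checkpoint(ckpt)
    first_pull = [{"case_id": "42", "closed_time": "2025-02-01T10:00:00"}]
    sent_first = filter_new(first_pull, seen)
    save_checkpoint(ckpt, seen)

    # Second pull: the old closure comes back, plus the reopened-and-reclosed one.
    seen = load_checkpoint(ckpt)
    second_pull = first_pull + [{"case_id": "42", "closed_time": "2025-02-05T16:30:00"}]
    sent_second = filter_new(second_pull, seen)
    save_checkpoint(ckpt, seen)

print(sent_first, sent_second)
```

Keying on the pair rather than on the ID alone means a reopened case that is closed again is forwarded as a new record, while an unchanged closure fetched twice is skipped.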

Re: How do we define a duplicate record?

ws — Mon, 10 Feb 2025 05:30:56 GMT

Currently, we are not focusing on searches but rather on the application created to pull data from the API provided by the destination party.

Based on my understanding of the current setup, the new data is being retrieved by the application through the destination API.

The data includes fields such as ID, case status, case close date, and others.

At this point, duplicates will be identified based on the ID field.

Please correct me if I'm wrong, but given the current setup, wouldn't this result in duplicate data, since we call the API every hour and each call retrieves four hours of logs?

For example:

10am, 6am-10am
11am, 11am-3pm
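A quick way to check how much two consecutive pulls overlap, as a Python sketch. It assumes each call fetches the four hours ending at the call time (as in the 10am line above); the dates and the helper name `window` are only for illustration.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(hours=1)   # how often the API is called
LOOKBACK = timedelta(hours=4)   # how much history each call retrieves


def window(call_time: datetime) -> tuple[datetime, datetime]:
    """Time range fetched by a call, assuming it ends at the call time."""
    return call_time - LOOKBACK, call_time


first_call = datetime(2025, 2, 10, 10, 0)
second_call = first_call + INTERVAL

start1, end1 = window(first_call)
start2, end2 = window(second_call)

# Overlap is the shared part of the two ranges, never negative.
overlap = max(min(end1, end2) - max(start1, start2), timedelta(0))
print(overlap)  # 3:00:00
```

Under that assumption, three of the four hours in each pull were already fetched by the previous call, so any record in that span arrives again unless the app filters it out.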

Re: How do we define a duplicate record?

ws — Mon, 10 Feb 2025 05:37:27 GMT

I understand that Splunk checks whether the first 256 chars of an event are the same.

But in my current situation, would you recommend customizing the application to implement a checkpoint mechanism that tracks previously indexed records?

Re: How do we define a duplicate record?

gcusello — Mon, 10 Feb 2025 07:18:20 GMT

Hi @ws ,

You have many ways to handle repeated logs; the easiest is to save the logs in files with different names (e.g. adding the date and time) and use the crcSalt = <SOURCE> option in the related inputs.conf stanza.

Ciao.

Giuseppe