Splunk Search

Can the dedup command run within props.conf?

luck123813
Explorer

Hey everyone,

I have an issue where I am ingesting data via a REST API, but I am getting a lot of duplicate data in the index. The issue seems to reside in the table the API sources from, so in the meantime I have to dedup the results.

index=index1 sourcetype=dataset1 | dedup data_id | table column_1, column_2, column_3

My question is: is there a way to run the dedup command within the props.conf file?
I have read that I could use an eval with mvdedup(field), but I would need to dedup across events, not just within one multivalue field.

Any thoughts?

1 Solution

nickhills
Ultra Champion

If I understand your problem and my assumptions are correct, dedup will likely not help you.

My first assumption is that you have a rest API method which polls a webservice on an interval and imports a number of events.
My second assumption is that on each subsequent poll, you are bringing in events which have already been collected.

Dedup removes duplicates within the results of a single search (a "stream"). Its concept could be useful if one of your polls contains duplicated events, but it cannot compare a set of new events against events that have already been indexed.

Ideally your poll (and the API) would allow you to maintain a checkpoint of the last event you imported, and on subsequent polls, only collect events following that checkpoint.

That does rely on the API giving you a sequential record ID, (with which you would handle the checkpointing logic) or its own checkpointing function.
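To make the checkpointing idea concrete, here is a minimal sketch in Python. It is not part of the REST API Modular Input; the `fetch_events` callable and the checkpoint file name are hypothetical stand-ins for whatever the API and add-on actually provide, and it assumes the API returns a sequential record id (like the `data_id` field above).

```python
import json
import os

CHECKPOINT_FILE = "last_record_id.json"  # hypothetical checkpoint location

def load_checkpoint(path=CHECKPOINT_FILE):
    """Return the last record id we successfully ingested, or 0 on first run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["last_id"]
    return 0

def save_checkpoint(last_id, path=CHECKPOINT_FILE):
    """Persist the highest record id seen so far."""
    with open(path, "w") as f:
        json.dump({"last_id": last_id}, f)

def poll(fetch_events, path=CHECKPOINT_FILE):
    """One polling cycle.

    fetch_events(since_id) is a stand-in for the REST call; it must return
    events as dicts carrying a sequential 'data_id'. Only events newer than
    the stored checkpoint are kept, so re-delivered events are dropped
    before they ever reach the index.
    """
    last_id = load_checkpoint(path)
    new_events = [e for e in fetch_events(last_id) if e["data_id"] > last_id]
    if new_events:
        save_checkpoint(max(e["data_id"] for e in new_events), path)
    return new_events
```

On the second poll against an API that re-sends everything, `poll` returns an empty list, which is exactly the duplicate suppression the dedup search was compensating for.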

If my comment helps, please give it a thumbs up!


luck123813
Explorer

Currently, I am using the REST API Modular Input from Splunkbase. Is there any way I can manually set a checkpoint, as you mentioned, within this REST API Modular Input (in the UI)?

I also tried running a script (external to Splunk) in which the API writes to a file, and it also produces the same data each time it is run. @nickhillscpl


