Splunk Search

Can the dedup command run within props.conf?

luck123813
Explorer

Hey everyone,

I have an issue where I am ingesting data via a REST API, but I am getting a lot of duplicate data in the index. The issue seems to reside in the table the API sources from, so in the meantime I have to dedup the results:

index=index1 sourcetype=dataset1 | dedup data_id | table column_1, column_2, column_3

My question is: is there a way to run the dedup command within the props.conf file?
I have read that I could use an eval with mvdedup(value), but I would need to dedup across events and not just within one field.
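To show why mvdedup does not fit here, this is a minimal sketch (plain Python, not Splunk internals) of what mvdedup does: it removes duplicate values inside a single event's multivalue field, so it never compares one event against another.

```python
# Sketch of mvdedup semantics: dedup values WITHIN one event's
# multivalue field, preserving first-seen order. It cannot look
# across events, which is what this use case needs.
def mvdedup(values):
    seen, out = set(), []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

result = mvdedup(["a", "b", "a", "c", "b"])  # -> ['a', 'b', 'c']
```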

Any thoughts?

0 Karma
1 Solution

nickhills
Ultra Champion

If I understand your problem and my assumptions are correct, dedup will likely not help you.

My first assumption is that you have a REST API input which polls a webservice on an interval and imports a number of events.
My second assumption is that on each subsequent poll, you are bringing in events which have already been collected.

Dedup removes duplicates within a search's result "stream" - its concept could be useful if one of your polls contains duplicated events, but it cannot evaluate a set of new events against those previously indexed.
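To make the distinction concrete, here is a minimal sketch (my own illustration, not Splunk's implementation) of what `| dedup data_id` does at search time: it keeps only the first event seen for each `data_id` in the result set. The duplicates are still stored in the index; they are only filtered at search time.

```python
# Sketch of search-time dedup: keep the first event per data_id value.
# The underlying data (the "index") is untouched; only the results shrink.
def dedup_events(events, key="data_id"):
    seen = set()
    out = []
    for event in events:
        value = event.get(key)
        if value in seen:
            continue  # drop later events that repeat this data_id
        seen.add(value)
        out.append(event)
    return out

events = [
    {"data_id": 1, "column_1": "a"},
    {"data_id": 2, "column_1": "b"},
    {"data_id": 1, "column_1": "a"},  # duplicate brought in by a later poll
]
deduped = dedup_events(events)  # keeps data_id 1 and 2 once each
```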

Ideally your poll (and the API) would allow you to maintain a checkpoint of the last event you imported, and on subsequent polls, only collect events following that checkpoint.

That does rely on the API giving you a sequential record ID (with which you would handle the checkpointing logic) or its own checkpointing function.
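The checkpointing idea above can be sketched like this. Everything here is hypothetical: `fetch` stands in for your API call, the records are assumed to carry a sequential `id` field, and the checkpoint file name is arbitrary.

```python
import json
import os

CHECKPOINT_FILE = "last_id.json"  # hypothetical checkpoint location


def load_checkpoint():
    # Return the last record ID already imported, or 0 on the first run.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["last_id"]
    return 0


def save_checkpoint(last_id):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_id": last_id}, f)


def poll(fetch):
    # fetch(since_id) stands in for the API call; it is assumed to return
    # records with a sequential "id" field, oldest first.
    last_id = load_checkpoint()
    new_records = [r for r in fetch(last_id) if r["id"] > last_id]
    if new_records:
        save_checkpoint(new_records[-1]["id"])
    return new_records


# Demo with a fake API that returns the same data on every poll.
if os.path.exists(CHECKPOINT_FILE):
    os.remove(CHECKPOINT_FILE)  # start from a clean slate for the demo
records = [{"id": i} for i in range(1, 4)]
first = poll(lambda since: records)   # imports all three records
second = poll(lambda since: records)  # imports nothing: checkpoint filters them
```

The second poll returns no records even though the API sent the same data again, which is exactly the duplicate-free behaviour you are after.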

If my comment helps, please give it a thumbs up!


luck123813
Explorer

Currently, I am using the REST API Modular Input from Splunkbase. Is there any way I can manually put in a checkpoint, as you mentioned, within this REST API Modular Input (at the UI)?

I also tried running a script (external to Splunk), in which the API writes to a file, and it also produces the same data each time it is run. @nickhillscpl
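For the external-script route, one option is to dedup before the data ever reaches the file Splunk monitors. This is a sketch under stated assumptions: the function name, file paths, and record shape are all hypothetical; it hashes each record and appends only records it has not written on a previous run.

```python
import hashlib
import json


def append_new(records, out_path="api_output.log", state_path="seen.txt"):
    # Hypothetical helper: append only records not written on earlier runs,
    # so a file monitor input never sees duplicates.
    try:
        with open(state_path) as f:
            seen = set(f.read().split())
    except FileNotFoundError:
        seen = set()  # first run: nothing written yet
    written = 0
    with open(out_path, "a") as out, open(state_path, "a") as state:
        for rec in records:
            # sort_keys gives a stable serialization, so equal records
            # always hash to the same digest
            digest = hashlib.sha256(
                json.dumps(rec, sort_keys=True).encode()
            ).hexdigest()
            if digest in seen:
                continue  # already written on an earlier run
            out.write(json.dumps(rec) + "\n")
            state.write(digest + "\n")
            seen.add(digest)
            written += 1
    return written
```

Running it twice on the same API output writes the records once: the first call returns the number of new records, the second returns 0.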

0 Karma
