Hey everyone,
I have an issue where I am ingesting data via a REST API, but I am getting a lot of duplicate data in the index. The issue seems to reside in the table the API sources from, so in the meantime I have to dedup the results:
index=index1 sourcetype=dataset1 | dedup data_id | table column_1, column_2, column_3
My question is: is there a way to run the dedup command from within the props.conf file?
I have read that I could use an eval field=mvdedup(field) command, but I would need to dedup across events, not just within a single multivalue field.
Any thoughts?
If I understand your problem and my assumptions are correct, dedup will likely not help you.
My first assumption is that you have a REST API input which polls a web service on an interval and imports a number of events.
My second assumption is that on each subsequent poll, you are bringing in events which have already been collected.
Dedup is used to remove duplicates within a "stream" of search results - the concept could be useful if a single poll contained duplicated events, but it cannot evaluate a set of new events against those previously indexed.
Ideally your poll (and the API) would allow you to maintain a checkpoint of the last event you imported, and on subsequent polls, only collect events following that checkpoint.
That does rely on the API giving you a sequential record ID (with which you would handle the checkpointing logic yourself) or its own checkpointing function.
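To illustrate the idea, here is a minimal sketch of the checkpointing pattern as a standalone Python poller. It assumes the API accepts a since_id parameter and returns JSON records with a sequential data_id field; the endpoint and both names are placeholders for whatever your API actually offers.

import json
import os

import requests  # third-party library: pip install requests

CHECKPOINT_FILE = "last_id.txt"              # where the checkpoint is persisted
API_URL = "https://example.com/api/events"   # hypothetical endpoint

def load_checkpoint():
    # Return the last record id we indexed, or 0 on the very first poll.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return int(f.read().strip())
    return 0

def save_checkpoint(last_id):
    # Persist the highest id seen so the next poll resumes after it.
    with open(CHECKPOINT_FILE, "w") as f:
        f.write(str(last_id))

def poll():
    last_id = load_checkpoint()
    # Hypothetical: ask the API only for records newer than the checkpoint.
    resp = requests.get(API_URL, params={"since_id": last_id}, timeout=30)
    resp.raise_for_status()
    events = resp.json()

    for event in events:
        print(json.dumps(event))  # emit to stdout for Splunk to index

    if events:
        save_checkpoint(max(e["data_id"] for e in events))

if __name__ == "__main__":
    poll()

Each run only requests records after the last id it saw, so reruns never re-import events that are already in the index.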
Currently, I am using the REST API Modular Input from Splunkbase. Is there any way I can manually set a checkpoint, as you mentioned, within this REST API Modular Input (in the UI)?
I also tried running a script external to Splunk, in which the API writes to a file, but it also produces the same data each time it is run. @nickhillscpl
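Would it make sense for that script to filter out records it has already written before Splunk monitors the file? Something like this rough sketch, where data_id and the file names are just placeholders for my actual fields:

import json

SEEN_FILE = "seen_ids.txt"    # ids already written in previous runs
OUTPUT_FILE = "events.log"    # the file Splunk monitors

def load_seen():
    # Load the ids written by earlier runs; empty set on the first run.
    try:
        with open(SEEN_FILE) as f:
            return set(f.read().split())
    except FileNotFoundError:
        return set()

def write_new_events(events):
    # Append only events whose data_id has not been written before.
    seen = load_seen()
    with open(OUTPUT_FILE, "a") as out, open(SEEN_FILE, "a") as seen_log:
        for event in events:
            key = str(event["data_id"])
            if key in seen:
                continue  # duplicate from a previous run: skip it
            out.write(json.dumps(event) + "\n")
            seen_log.write(key + "\n")
            seen.add(key)

if __name__ == "__main__":
    # Hypothetical sample payload; in practice this comes from the API call.
    sample = [{"data_id": 1, "column_1": "a"}, {"data_id": 2, "column_1": "b"}]
    write_new_events(sample)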