Getting Data In

How to clean duplicates from an index?

delly_fofie
Engager

Hello,

We have a use case.

Using Splunk DB Connect, we ingest data from various systems, especially from our ERP.

Every change to an article in the ERP is pushed into a temp DB, which is monitored by Splunk DB Connect.

There are millions of data movements each day.

But at the end of the day, we just need to work with the latest unique data in the system for each article. Each event has some 10-30 fields.

What is the best way to get rid of all the duplicates that are coming into the system?
Delete? How?
Skip? Lookup? Summary DB? How?

What ideas might you have, or maybe some ideas I'm missing?


gcusello
SplunkTrust

Hi @delly_fofie,

Deleting duplicates in Splunk is possible, but this way you only perform a logical deletion; in other words, you save neither disk space nor license usage.

My hint is to optimize your extraction query to avoid indexing the same data twice; see the sketch below.
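
For example, a minimal sketch of a DB Connect rising-column input query, assuming a hypothetical temp table article_changes with a last_modified timestamp column (DB Connect substitutes the stored checkpoint for the ? placeholder):

    SELECT *
    FROM article_changes
    WHERE last_modified > ?    -- only rows newer than the last checkpoint
    ORDER BY last_modified ASC

With a rising column, each run retrieves only the rows added since the previous checkpoint, so the same change is never indexed twice.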

Ciao.

Giuseppe


delly_fofie
Engager

Hello @gcusello, let's assume I go with your idea.

But still, even if on day 1 I manage to index only unique entries, the next day I will get both new entries and entries that already exist in Splunk, which will still create duplicate data in Splunk.


gcusello
SplunkTrust

Hi @delly_fofie,

the only way to avoid indexing a log twice is to run an SQL query that checks whether the data is already present before indexing it.

This check can be performed in SQL, not in Splunk.

In Splunk you could also ingest the duplicated events and then, using the Search Processing Language (SPL), remove the duplicates from your search results, but not at indexing time.

In other words, it isn't possible to check whether data is already indexed before indexing it; the only way to do this is in the SQL query that you use to extract events with DB Connect.
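
At search time, removing duplicates looks something like this (a minimal sketch; the index, sourcetype, and article_id field names are assumptions, not your actual configuration):

    index=erp sourcetype=dbx:articles
    | dedup article_id
    | table _time article_id status price

dedup keeps the first event it sees for each article_id; since search results come back in reverse time order by default, that is the latest change per article.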

Ciao.

Giuseppe


gcusello
SplunkTrust

Hi @delly_fofie,

good for you, see you next time!

Ciao and happy splunking

Giuseppe

P.S.: Karma Points are appreciated by all the contributors 😉


ITWhisperer
SplunkTrust

Deleting events from an index is tricky, as it is easy to accidentally delete all the events in the index. This is why deletion is protected by its own level of security and is usually granted only to specific, isolated users, to minimise the likelihood of accidental deletions.
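
For completeness, this is roughly what a search-time deletion looks like, assuming a user holding the can_delete capability and a hypothetical article_id field (note that delete only marks events unsearchable; it does not reclaim disk space):

    index=erp sourcetype=dbx:articles article_id=12345
    | delete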

So, assuming you aren't going to delete events from the index, and that your DB Connect input is potentially retrieving events which are already in the index, you should consider comparing the retrieved events with those already in the index and adding only the new or updated events.

Another possibility is to have a "summary" index which you refresh (delete and insert) with the latest events for each article.
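
A minimal sketch of such a refresh, as a scheduled search that writes the latest event per article into a summary index (the index names and the article_id field are assumptions):

    index=erp sourcetype=dbx:articles earliest=-1d@d latest=@d
    | dedup article_id sortby -_time
    | collect index=erp_articles_summary

Note that collect only appends, so each scheduled run adds a fresh snapshot; scope your searches on the summary index to the latest run.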
