Knowledge Management

Query to remove duplicates in the summary index

Prakash23
Observer

Hi Team,

We have duplicates in our summary index: the same records appear multiple times for a single host.

I would appreciate some quick help with a query to remove duplicates from the summary index.

Thanks in advance.

Regards,

Prakash Mohan Doss

 


ITWhisperer
SplunkTrust

Firstly, delete is a very powerful command and cannot be undone - use it with extreme caution. Best practice is to execute the search without the delete command and confirm that the events identified are indeed the events you want to delete. Then add the delete command and execute. Finally, remove the delete command so that it is not left in your search history.
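As a sketch of that workflow (search_name=daily_report and the time window are placeholders - substitute your own), first run the identifying search alone and inspect the events:

index=summary search_name=daily_report earliest=-24h@h latest=@h

Only once you are satisfied that these are exactly the events you want gone should you append the delete and re-run:

index=summary search_name=daily_report earliest=-24h@h latest=@h
| delete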

A search can be used to identify the events you want to delete from the index. However, the search cannot use non-distributable, non-streaming commands before the delete, partly because delete has to execute on the indexers. This means that in many practical cases, the search used to identify the events to be removed is unsuitable for direct use with the delete command. For example, suppose you wanted to remove duplicate summary index events. For this, you might have used eventstats to find the earliest times of the duplicated summary events for each of the dimensions in the summary, during the time period when the events were duplicated.

One way to overcome this issue is to use a subsearch to provide values with which to filter the events in the summary index.

However, there is a problem with this approach - what happens if the subsearch returns zero events?

In this situation, the subsearch will not filter out any events, which means your delete command could delete ALL the events in the index!

One way to avoid this problem is to ensure that the subsearch always returns at least one event - this one event should not match to any of the events in the index so the delete has nothing to do.

 

index=summary search_name=daily_report
    [search index=summary search_name=daily_report
    | eval host=orig_host
    | eval sourcetype=orig_sourcetype 
    | eval index=orig_index 
    | table _time, info_*, volume, index, sourcetype, host
    | eventstats count as copies min(info_search_time) as redundant_search by _time, volume, index, sourcetype, host
    | where copies > 1 AND info_search_time=redundant_search
    | fields - copies redundant_search
    | appendpipe [stats count | where count = 0 | eval gobbledygook = random()]
    | return 4000 _time volume info_search_time info_min_time info_max_time]
| eval index="summary"
| eval search_name="daily_report"
| delete

 

Note the use of the return command to limit the maximum number of search terms, and thereby restrict the number of events deleted at any one time - this is useful in reducing the impact of mistakes.
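For illustration, return with multiple fields expands each subsearch row into an AND'd group of field="value" terms, with the rows OR'd together - something like the following (the values here are made up):

((_time="1700000000" volume="42" info_search_time="1700000100" info_min_time="1699996400" info_max_time="1700000000") OR (_time="1700003600" volume="17" info_search_time="1700003700" info_min_time="1700000000" info_max_time="1700003600"))

The outer search therefore only matches events carrying one of those exact field combinations, which is what keeps the delete tightly scoped.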


Vardhan
Contributor

Hi @Prakash23,

You can remove data from the summary index using the delete command. Add | delete to the end of the search string and run it again - for example: index=summary "exception message logs" | delete

Note: to use the delete command, your role needs the additional capability "can_delete". The delete command does not physically remove data from the index or reclaim disk space; it simply marks the events so that they are no longer searchable.

Better still, identify the reports that are sending duplicate logs to the summary index, and use the dedup command to control the duplicates.
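For example, while you fix the populating report, you can read the summary duplicate-free at report time (search_name=daily_report and the dedup field list are illustrative - dedup on whichever fields uniquely identify a summary row):

index=summary search_name=daily_report
| dedup _time, host, volume

Note that this only hides the duplicates from the report; the redundant events remain in the index until they are deleted or age out.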

If this answer helps you then upvote it.
