Getting Data In

Why is huge duplicate and unwanted data being indexed into Splunk?

santosh11
New Member

Dear All,

We are getting a huge amount of duplicate and unwanted data in Splunk, and query performance is being affected. Below is the scenario:

We are using a Heavy Forwarder (HF) to push the data into Splunk Cloud.

This is an example of duplicate data:

source type A: 1, AA
Source Type A: 1, AA

This is an example of unwanted data:

source type A: 1, AA
Source Type A: 1, AB

Here the second event was updated to AB, so we no longer need the first one (AA).

Because of this, Splunk scans 200,000,000 events, of which only 15,000,000 are useful.

Can someone suggest a better way to maintain the data in the index?

Regards,
Santosh

1 Solution

gcusello
SplunkTrust

Hi santosh11,
Splunk ingests all the data in the monitored files; if your files contain duplicated data, Splunk cannot analyze the data before ingestion.
If you can find a regex to match the unwanted data (e.g. you want to delete all events where the "S" and "T" in "Source Type" are uppercase), you can filter them out before indexing, but you cannot check whether the data was already indexed.
This is mainly a problem of license consumption.
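
As a sketch of that index-time filtering on the Heavy Forwarder, unwanted events can be routed to the nullQueue via props.conf and transforms.conf. The stanza name and the REGEX below are placeholders, not values from the original post; adapt them to your sourcetype and pattern:

```ini
# props.conf on the Heavy Forwarder -- "my_sourcetype" is a placeholder
[my_sourcetype]
TRANSFORMS-drop_unwanted = drop_uppercase_events

# transforms.conf -- the REGEX is only an example pattern
[drop_uppercase_events]
REGEX = ^Source Type
DEST_KEY = queue
FORMAT = nullQueue
```

Events routed to the nullQueue are discarded before indexing, so they do not count against the license.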

If your searches scan too many events and you want to speed them up, you could schedule a search (e.g. every hour) that extracts only the records you want at search time and saves them in a summary index, which you can then use for your quick searches.
Deduplicating data is easier; e.g. you could schedule a search like this:

index=my_index
| eval first_field=lower(first_field)
| dedup first_field
| table _time first_field second_field third_field
| collect index=my_summary_index

For unwanted data, you have to define a rule (one or more regexes) to filter the events, and then create a scheduled search like the one above.
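
A minimal sketch of such a search, assuming the unwanted events can be matched by a regex on the raw event and that the latest value per key should be kept (the pattern and field names below are placeholders):

```
index=my_index
| regex _raw!="^Source Type"
| sort 0 -_time
| dedup first_field
| table _time first_field second_field third_field
| collect index=my_summary_index
```

The explicit sort by descending _time makes dedup keep the most recent event per key, so outdated values (like the AA superseded by AB in the question) are dropped from the summary index.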

Bye.
Giuseppe
