Getting Data In

Why is huge duplicate and unwanted data being indexed into Splunk?

santosh11
New Member

Dear All,

We are getting a huge amount of duplicate and unwanted data into Splunk, and query performance is getting affected. Below is the scenario:

We are using a heavy forwarder (HF) to push the data into Splunk Cloud.

This is an example of duplicate data:

source type A: 1, AA
Source Type A: 1, AA

This is an example of unwanted data:

source type A: 1, AA
Source Type A: 1, AB

Here the second event was updated to AB, so we no longer need the first one (AA).

Because of this, Splunk scans 200,000,000 events, and only 15,000,000 of them are useful.

Can someone suggest a better way to maintain the data in the index?

Regards,
Santosh

1 Solution

gcusello
SplunkTrust

Hi santosh11,
Splunk ingests all the data in the monitored files; it cannot analyze the data for duplicates before ingestion.
If you can find a regex that identifies the unwanted data, e.g. you want to delete all the events where the "S" and "T" in "Source Type" are uppercase, you can filter them out before indexing, but you cannot check whether the data was already indexed.
It's mainly a problem of license consumption.
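
As a minimal sketch (assuming a hypothetical sourcetype my_sourcetype, and that the unwanted events are the ones whose raw text starts with "Source Type" in uppercase, as in your example), the index-time filter on the heavy forwarder would be a props.conf / transforms.conf pair that routes the matching events to the nullQueue:

# props.conf on the heavy forwarder
[my_sourcetype]
TRANSFORMS-drop_unwanted = drop_unwanted_events

# transforms.conf on the heavy forwarder
[drop_unwanted_events]
REGEX = ^Source Type
DEST_KEY = queue
FORMAT = nullQueue

Events matching the regex are discarded before indexing, so they also do not count against your license.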

If you have too many events in your searches and you want to speed them up, you could schedule a search (e.g. every hour) that extracts at search time only the records you want and saves them in a summary index, which you can then use for your quick searches.
Duplicated data is the easier case; e.g. you could schedule a search like this:

index=my_index
| eval first_field=lower(first_field)
| dedup first_field
| table _time first_field second_field third_field
| collect index=my_summary_index
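
Once that scheduled search has populated the summary index (my_summary_index is just a placeholder name here), your day-to-day searches can run against the much smaller summary index instead of the raw one, for example:

index=my_summary_index
| stats count by first_field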

For unwanted data, you have to find a rule (one or more regexes) that identifies the events to discard, and create a scheduled search like the one above.
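
As a sketch only (assuming, as above, that the unwanted events are the ones whose raw text starts with "Source Type" in uppercase), that scheduled search could exclude them with the regex command before collecting into the summary index:

index=my_index
| regex _raw!="^Source Type"
| table _time first_field second_field third_field
| collect index=my_summary_index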

Bye.
Giuseppe
