Reporting

How to not include the duplicated events while accelerating the data model

luhadia_aditya
Path Finder

How should I add '| dedup' as one of the constraints of the data model?

We have a data model with a sourcetype as its base constraint, plus other fields that we use to generate statistical reports through tstats searches. This sourcetype contains duplicated data.

I want to filter out the duplicated events in the data model so that the generated summaries, and the statistics reported from them, are accurate (something like '| dedup _raw' right before the 'stats' command in a regular search).
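As a rough sketch of the ad hoc pattern mentioned above (the sourcetype name `my_sourcetype` and the `host` split-by field are placeholders, not taken from the actual environment):

```spl
sourcetype=my_sourcetype
| dedup _raw
| stats count BY host
```

Here `dedup _raw` keeps only the first event seen for each distinct raw text, so the `stats` count no longer includes the repeated copies.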

Please suggest some ideas. Thanks!

0 Karma

Richfez
SplunkTrust

I don't know about the precise question you asked - but I'd investigate why you have duplicate data in the first place. I know that won't help with historical information but it seems like the right answer here.

Is there information lacking in the logs making events appear duplicated? Are you grabbing a set of logs twice? Do two hosts both report the same information?
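A search along these lines might help answer those questions. It is only a sketch: `my_sourcetype` is a placeholder, and the output field names (`copies`, `hosts`, `sources`, and so on) are made up for illustration.

```spl
sourcetype=my_sourcetype
| eval raw_hash=md5(_raw)
| stats count AS copies dc(host) AS hosts dc(source) AS sources values(host) AS host_list values(source) AS source_list BY raw_hash
| where copies > 1
| sort - copies
```

If `hosts` or `sources` is above 1 for a given hash, the same event is probably arriving by more than one path. If both are 1 while `copies` is above 1, the same input may have been read twice, or the events may simply lack a detail that would tell them apart.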

0 Karma

luhadia_aditya
Path Finder

Well, I have found the root cause of the duplication and have resolved it too.

To sum up the question: the issue persisted for a month, so we have duplicated events for that period. Users run reports on this data every now and then, and the reported stats are inaccurate because of the dupes. These reports come from the accelerated summaries created by a data model.
Now, how can I exclude the duplicated events from the data model summaries so that the stats are accurate?

Thanks for your concern and response.

0 Karma

Richfez
SplunkTrust

I'm glad to hear you've got it straightened out now.

I think you have a couple of options. d and bwooden do a far better job of summarizing some of them in this answer, though I'd caution TEST TEST TEST before doing some of those! Remember, you should be 100% sure the results of that search are really what you want to delete before you ever even enable the ability to USE delete. 😉

Anyway, if that's helpful, please upvote that very thorough tag-teamed answer to give them some credit for it.

Your idea about including a dedup would probably work really well, except it'll have a huge performance impact all the time. Now, if you only need it for a short while until that data expires out of the system, then maybe that's the easy way to go.
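Before paying that cost, it may be worth measuring how many duplicates the affected window actually holds. A minimal sketch, assuming a placeholder sourcetype `my_sourcetype` and a made-up window of 60 to 30 days ago (adjust `earliest` and `latest` to the real month):

```spl
sourcetype=my_sourcetype earliest=-60d@d latest=-30d@d
| eval raw_hash=md5(_raw)
| stats count AS total_events dc(raw_hash) AS distinct_events
| eval duplicate_events=total_events - distinct_events
```

Restricting the time range this way keeps the expensive raw-event scan limited to the period where the duplication happened.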

0 Karma

luhadia_aditya
Path Finder

Thanks a lot for your pointer Rich.

I had already considered the scenarios they presented, and that's the reason I wanted the dedup to be incorporated into the data model itself rather than have the dupes deleted.

I would highly appreciate any ideas on how to incorporate dedup into the data model itself. The perf impact is acceptable, as it will only apply to a month's worth of data.

As far as I know, there is no way to do this, which is why I'm approaching all the Splunk Ninjas!! Thanks once again!

0 Karma