topic Re: Dedup is extremely Slow in Splunk Search

Dedup is extremely Slow

tdavison76 — Tue, 20 May 2025 13:27:43 GMT

Hello,

I have a Search that is taking 5 min to complete when looking at only the last 24 hrs. If possible, could someone help me figure out how I can improve this Search? I need to dedup by SessionId and combine 3 fields into a single field.

source="mobilepro-test" | dedup Session.SessionId | strcat UserInfo.UserId " " Location.Site " " Session.StartTime label | table Session.SessionId, label

It looks like it's the dedup that is causing the slowness, but I have no idea how to improve that.

Thanks for any help on this one,

Tom

Re: Dedup is extremely Slow

ITWhisperer — Tue, 20 May 2025 13:43:31 GMT

You could try stats instead of dedup:

source="mobilepro-test" | stats first(UserInfo.UserId) as UserInfo.UserId first(Location.Site) as Location.Site first(Session.StartTime) as Session.StartTime by Session.SessionId | strcat UserInfo.UserId " " Location.Site " " Session.StartTime label | table Session.SessionId, label

Re: Dedup is extremely Slow

livehybrid — Tue, 20 May 2025 14:35:44 GMT

Hi @tdavison76

I would recommend using stats for this instead:

source="mobilepro-test" | strcat UserInfo.UserId " " Location.Site " " Session.StartTime label | stats latest(label) as label by Session.SessionId

You could also run stats first and strcat afterwards, so that strcat only runs once per Session.SessionId instead of once per event:

source="mobilepro-test" | stats latest(UserInfo.UserId) as UserInfo_UserId, latest(Location.Site) as Location_Site, latest(Session.StartTime) AS Session_StartTime by Session.SessionId | strcat UserInfo_UserId " " Location_Site " " Session_StartTime label | table Session.SessionId, label

Note: We are using latest() here, which keeps the value from the most recent event (by _time) for each Session.SessionId.

Re: Dedup is extremely Slow

PickleRick — Tue, 20 May 2025 15:22:54 GMT

As a side note, completely unrelated to the original problem: I'm wondering whether there will be any noticeable performance difference between first(something) and latest(something) when a default base search returns its results in reverse chronological order.

Re: Dedup is extremely Slow

livehybrid — Tue, 20 May 2025 15:26:23 GMT

That's a good point @PickleRick. For some reason I've always used latest, mainly in case there is any reason that events don't get returned with the most recent first (e.g. sorting of some sort, changes to _time, lookups, appends etc.). I suppose stats can stop looking after the first event when using first(), whereas with latest() it could have to read all events to check that it still has the "latest" one.
I might try this on a big dataset to see if it makes much difference!

Re: Dedup is extremely Slow

ITWhisperer — Tue, 20 May 2025 16:19:04 GMT

first is "closer" to dedup, since dedup keeps the first event it sees in the event pipeline for each unique value of the dedup'd field(s).
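
To make the first() vs latest() difference concrete, here is a minimal sketch that uses generated events rather than the mobilepro-test data. The field names SessionId, value, n, first_value and latest_value are made up for illustration and are not from the original search. The three events are deliberately arranged oldest first, i.e. not in the default reverse chronological order:

```spl
| makeresults count=3
| streamstats count as n
| eval _time=_time + n, SessionId="abc", value="v".n
| stats first(value) as first_value latest(value) as latest_value by SessionId
```

Here first_value should come back as v1 (the first event in the pipeline, which is what dedup would keep), while latest_value should come back as v3 (the event with the highest _time). When the base search returns events newest first and nothing reorders them, the two functions should agree, so the choice mostly matters when something upstream changes the order or _time.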