Dedup is extremely Slow

tdavison76 · ‎05-20-2025

Hello,

I have a Search that is taking 5 min to complete when looking at only the last 24 hrs. If possible, could someone help me figure out how I can improve this Search? I am in need of deduping by SessionId and combing 3 fields into a single field.

source="mobilepro-test"
| dedup Session.SessionId
| strcat UserInfo.UserId " " Location.Site " " Session.StartTime label
| table Session.SessionId, label

It looks like it's the dedup that is causing the slowness, but I have no idea how to improve that.

Thanks for any help on this one,

Tom

livehybrid · ‎05-20-2025

Hi @tdavison76

I would recommend using stats for this instead, see below:

source="mobilepro-test"
| strcat UserInfo.UserId " " Location.Site " " Session.StartTime label
| stats latest(label) as label by Session.SessionId

You could switch the order of strcat to save on processing multiple strcat:

source="mobilepro-test"
| stats latest(UserInfo.UserId) as UserInfo_UserId, latest(Location.Site) as Location_Site, latest(Session.StartTime) AS Session_StartTime by Session.SessionId
| strcat UserInfo_UserId " " Location_Site " " Session_StartTime label
| table Session.SessionId, label

Note: We are using "latest" here which keeps the most recent event.

🌟 Did this answer help you? If so, please consider:

Adding karma to show it was useful
Marking it as the solution if it resolved your issue
Commenting if you need any clarification

Your feedback encourages the volunteers in this community to continue contributing

PickleRick · ‎05-20-2025

As a side note, completely irrelevant to the original problem - I'm wondering whether there will be any noticeable performance difference between first(something) and latest(something) in case of a default base search returning results in reverse chronological order.

livehybrid · ‎05-20-2025

Thats a good point @PickleRick - for some reason I've always used latest, mainly incase there is any reason that events dont get returned with the most recent first (e.g. sorting of some sort, changes to _time, lookups, appends etc) but I suppose stats will stop looking after the first event if using first() but could read all events to check its still the "latest".
I might try this on a big dataset to see if it makes much difference!

ITWhisperer · ‎05-20-2025

first is "closer" to dedup since it keeps the first event in the event pipeline for each unique value of the dedup'd field(s)

ITWhisperer · ‎05-20-2025

You could try stats

source="mobilepro-test"
| stats first(UserInfo.UserId) as UserInfo.UserId first(Location.Site) as Location.Site first(Session.StartTime) as Session.StartTime by Session.SessionId
| strcat UserInfo.UserId " " Location.Site " " Session.StartTime label
| table Session.SessionId, label

Dedup is extremely Slow

fields

table

Index This | When is October more than just the tenth month?

Observe and Secure All Apps with Splunk

What’s New & Next in Splunk SOAR

Are you a member of the Splunk Community?

Dedup is extremely Slow

fields

table

Index This | When is October more than just the tenth month?

Observe and Secure All Apps with Splunk

What’s New & Next in Splunk SOAR