Splunk Search

Help with deduping similar values

tomapatan
Contributor

Hi Everyone,

I have a field called "User" that contains similar values, and I was wondering how to remove or merge them.

For example: "Tony W" and "Anthony W" (both values of the same field) should be merged together.

I was looking at the fuzzy search and jellyfish apps on SplunkBase, but couldn't find a solution to the problem.

My search query:

(index="abc" Name=*) OR (index="xyz" department=* displayName=*)
| eval User=if(isnull(Name), upper(displayName), upper(Name))
| stats values(department) as department by User


yuanliu
SplunkTrust

Unless you want to delve into the language processing wonderland, your best bet is to create a lookup and do it yourself.  Something like

shortened   spelled
tony        antony
nick        nicolas
nick        nikola
nick        nocole
sam         samuel
sam         samantha
bill        william
will        william
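
For anyone who wants to try this out, here is one way the table could be loaded as a CSV lookup. This is only a sketch: it assumes a Splunk version where makeresults accepts format=csv and data (9.0 or later, as far as I know), and the file name nicknames.csv is my own choice for illustration, not something prescribed here.

```spl
| makeresults format=csv data="shortened,spelled
tony,antony
nick,nicolas
nick,nikola
nick,nocole
sam,samuel
sam,samantha
bill,william
will,william"
| table shortened spelled
| outputlookup nicknames.csv
```

Calling the lookup as plain nicknames assumes a lookup definition of that name pointing at the file; without one, referring to it as nicknames.csv in the lookup command should work as well. Also note that the sample table spells antony, so a value like "Anthony W" would only be merged with "Tony W" if a tony,anthony row is added.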

Let's call this table nicknames. Then,

(index="abc" Name=*) OR (index="xyz" department=* displayName=*)
| eval User=if(isnull(Name), lower(displayName), lower(Name)) ``` lower or upper depends on lookup table design ```
| eval parts = split(trim(replace(User, "\s+", " ")), " ") ``` split takes a literal delimiter, so collapse whitespace runs first; User itself stays intact ```
| eval firstName = mvindex(parts, 0), lastName = mvindex(parts, -1), middleInit = if(mvcount(parts) > 2, mvjoin(mvindex(parts, 1, -2), " "), "-") ``` placeholder instead of null, because stats drops rows whose by field is null ```
| lookup nicknames shortened AS firstName output spelled
| lookup nicknames spelled AS firstName output shortened
| lookup nicknames spelled output shortened AS shortened2 ``` handle bill and will ```
| eval firstName = mvdedup(mvappend(firstName, spelled, shortened, shortened2))
| eval firstName = upper(mvjoin(mvsort(firstName), "/")) ``` upper wouldn't be needed if lookup table is in upper ```
| stats values(department) as department values(User) by firstName middleInit lastName ``` values(User) retains original data ```

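Note that I deliberately listed a bunch of ambiguous nicknames: nick maps to three spelled-out names, and both bill and will map to william. The only sensible way to handle them is to admit the uncertainty and retain the data, which is why firstName ends up as a slash-joined list of every candidate rather than a single guessed name.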
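
A quick way to see how the ambiguity is carried along is to run just the lookup steps against a few made-up first names. This is a sketch under assumptions: the nicknames lookup already exists with the rows listed above, makeresults supports format=csv, and candidates is a field name invented for the illustration.

```spl
| makeresults format=csv data="firstName
tony
bill
will
nick"
| lookup nicknames shortened AS firstName OUTPUT spelled
| lookup nicknames spelled AS firstName OUTPUT shortened
| lookup nicknames spelled OUTPUT shortened AS shortened2
| eval candidates = upper(mvjoin(mvsort(mvdedup(mvappend(firstName, spelled, shortened, shortened2))), "/"))
| table firstName candidates
```

If the lookups behave as intended, bill and will should come out with the same joined value, which is what lets the final stats group them together, while nick keeps every spelled-out candidate instead of guessing one.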

tomapatan
Contributor

Hi yuanliu,

Appreciate the response, exactly what I needed.
