perform a cleanup activity

simon21 · ‎03-13-2018

I have to perform a cleanup activity. The scenario is that there is no primary key; the only available columns are fullname and address. First, I split the full name into firstname and surname. Now I have to make sure that there are no duplicates across firstname and surname. For example, if one user's firstname is John and surname is Stinson, and another user's firstname is J and surname is Stinson, I have to build logic that says John Stinson and J Stinson are the same person. In another scenario, if one user's firstname is Stacy and surname is C, and another user's firstname is S and surname is Cyrus, I have to state that Stacy C and S Cyrus are the same person. Similarly, if one entry gives the full name as Sam Smith and another gives it as Smith Sam, I have to determine that they are the same entry.

richgalloway · ‎03-13-2018

How is this related to Splunk?

---
If this reply helps you, Karma would be appreciated.

simon21 · ‎03-13-2018

The data is being populated into Splunk using DB Connect, so all of this has to happen in Splunk. The comparison of field values has to be coded in SPL.

richgalloway · ‎03-13-2018

Data indexed in Splunk cannot be "cleaned up". Indexed data remains unchanged until it ages out.


simon21 · ‎03-13-2018

I am not talking about cleaning up the indexed data like that. I just need a status field appended that indicates whether a user's entry has been made multiple times in the various combinations mentioned above. I need custom logic that helps determine this.

Sukisen1981 · ‎03-13-2018

Well, you are probably trying something in Splunk that is not really a Splunk problem. For example, we know (because you say so, and by intuition) that Bhakti Sakore, B Sakore and Bhakti S are the same person, but what if there is also someone called Bhakti Singh whose alias is present in the DB as Bhakti S? How do you know which person this is?
To me this question sounds unrelated to Splunk; it is just that you are trying it out on Splunk. You might build something after many hours, only to find that some data combination you had not considered breaks your use case, since in this case there really is no logic that captures what you intend.

simon21 · ‎03-13-2018

This is a probability use case. You very correctly pointed out that Bhakti S may be Bhakti Sakore or Bhakti Singh. To be honest, I just need the grouping. A status field will just state that Bhakti S may be Bhakti Sakore or Bhakti Singh.
But from the looks of it, this seems a rather complex task in Splunk. Maybe I need an alternative way to execute this scenario.

efavreau · ‎03-13-2018

There's no explanation of what has been attempted, and no question here. A LOT more detail is needed.

###

If this reply helps you, an upvote would be appreciated.

simon21 · ‎03-13-2018

Let's consider that the db has only one field, i.e. Name (for simplicity).

Name
Afreen Hemani
Afreen Rafai
Arjun T
Arjun Thesia
Arjun Taher
Akshay Trimbake
Akshay Kadam
Ammy Virk
Bhakti S
Bhakti Sakore
B Sakore

My aim is to state that Bhakti Sakore, B Sakore and Bhakti S are the same user, entered in the db multiple times with mistaken inputs. Can this be done in Splunk?

Okay, so the query I have used so far is this:
base query
| sort +Name
| eval Name1=split(Name," ")
| eval FirstName=mvindex(Name1,0)
| eval SecondName=mvindex(Name1,1)
| eval ThirdName=mvindex(Name1,2)
| table Name FirstName SecondName ThirdName
| stats list(SecondName) as SName by FirstName
| nomv SName
| eval surname=split(SName," ")
| eval Surname1=mvindex(surname,0)
| eval Surname2=mvindex(surname,1)
| eval Surname3=mvindex(surname,2)
| table FirstName Surname1 Surname2 Surname3
| eval resultS1andS2=if(like(Surname2,"%".Surname1."%"),"Surname1 and Surname2 could be same","Surname1 and Surname2 Different")
| eval resultS2andS3=if(like(Surname3,"%".Surname2."%"),"Surname2 and Surname3 could be same","Surname2 and Surname3 Different")
| eval resultS1andS3=if(like(Surname3,"%".Surname1."%"),"Surname1 and Surname3 could be same","Surname1 and Surname3 Different")
| strcat resultS1andS3 " / " resultS2andS3 " / " resultS1andS2 Status
| fields - resultS1andS3, resultS2andS3, resultS1andS2

Output looks like this:
| FirstName | Surname1 | Surname2 | Surname3 | Status |
| --- | --- | --- | --- | --- |
| Afreen | Hemani | Rafai | | Surname1 and Surname3 Different / Surname2 and Surname3 Different / Surname1 and Surname2 Different |
| Akshay | Kadam | Trimbake | | Surname1 and Surname3 Different / Surname2 and Surname3 Different / Surname1 and Surname2 Different |
| Ammy | Virk | | | Surname1 and Surname3 Different / Surname2 and Surname3 Different / Surname1 and Surname2 Different |
| Arjun | T | Taher | Thesia | Surname1 and Surname3 could be same / Surname2 and Surname3 Different / Surname1 and Surname2 could be same |
| Bhakti | S | Sakore | | Surname1 and Surname3 Different / Surname2 and Surname3 Different / Surname1 and Surname2 could be same |

Here, Name is a single field consisting of the complete name of the user.

This works fine when I want to identify the same users based on the first name. I am having trouble doing it both ways: I am able to determine that Bhakti Sakore and Bhakti S may be the same user, but I am having trouble determining that B Sakore is the same as Bhakti S/Bhakti Sakore.
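
One possible way to cover both directions, offered as an untested sketch under stated assumptions rather than a confirmed solution: compare every name with every other name, and treat two name parts as compatible when one is a prefix of the other. Checking the parts in straight order catches Bhakti S / B Sakore, and checking them in swapped order catches Sam Smith / Smith Sam. The field names norm, fname, lname, other, ofname, olname, straight, swapped, dup, PossibleDuplicates and Status are made up for illustration. The makeresults, split and mvexpand lines at the top only fabricate sample rows so the search is self-contained; they would be replaced by the base query.

```spl
| makeresults
| eval Name=split("Bhakti S,Bhakti Sakore,B Sakore,Arjun T,Arjun Thesia,Arjun Taher,Sam Smith,Smith Sam,Ammy Virk", ",")
| mvexpand Name
| eval norm=lower(trim(Name))
| eval fname=mvindex(split(norm," "),0)
| eval lname=mvindex(split(norm," "),-1)
| eventstats values(norm) as other
| mvexpand other
| eval ofname=mvindex(split(other," "),0)
| eval olname=mvindex(split(other," "),-1)
| eval straight=if((like(fname,ofname."%") OR like(ofname,fname."%")) AND (like(lname,olname."%") OR like(olname,lname."%")),1,0)
| eval swapped=if((like(fname,olname."%") OR like(olname,fname."%")) AND (like(lname,ofname."%") OR like(ofname,lname."%")),1,0)
| eval dup=if(norm!=other AND (straight==1 OR swapped==1),other,null())
| stats values(dup) as PossibleDuplicates by Name
| eval Status=if(isnull(PossibleDuplicates),"No similar entry found","Possibly the same user as: ".mvjoin(PossibleDuplicates,", "))
```

Under this rule Bhakti S would be flagged against both Bhakti Sakore and a hypothetical Bhakti Singh, which matches the "may be" grouping discussed earlier in the thread. The eventstats plus mvexpand step builds every pair of names, so the row count grows with the square of the number of distinct names and large tables may hit mvexpand memory limits. Names containing % or _ would be read as wildcards by like.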
