You do that by normalizing the data.
... | eval problem_abstract=case(problem_abstract="SAP Password Reset", problem_abstract, problem_abstract="Reset SAP Password", "SAP Password Reset", problem_abstract="SAP Reset Password", "SAP Password Reset", 1=1, problem_abstract) | ...
but that means having a case entry for each possible problem.
Letting Splunk do that for you may work better, depending on your data.
... | cluster showcount=true countfield=count field=problem_abstract match=termset | sort - count | head 10 | table problem_abstract count
@richgalloway that is exactly where I was going with the cluster command. In fact, if the patterns are not known, there are other options to try, such as TF-IDF, NLP, etc.
| makeresults | fields - _time | eval problem_abstract="SAP reset Password=10,reset SAP Password=20,Password reset SAP=20,Other=100,Something Else=50" | makemv problem_abstract delim="," | mvexpand problem_abstract | makemv problem_abstract delim="=" | eval count=mvindex(problem_abstract,1),problem_abstract=mvindex(problem_abstract,0) | table problem_abstract count | cluster field=problem_abstract t=0.3 | fields - cluster_label
Play around with t (the clustering threshold) until you get the cluster granularity you need. Refer to the cluster command documentation.
@richgalloway and @niketnilay - clustering is definitely an interesting option. It has to be termset or ngramset though; termlist, which is the default for the 'match' parameter, will yield inferior results.
But there is a risk. I tested with 'reset sap password' and 'sap password reset', plus dummy text like 'i care' and 'i don't care'. It works well with termset and ngramset. But then I added a fourth phrase, 'please reset my sap password'. Now the game changes and the clustering fails to yield proper results.
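A run-anywhere sketch of the test described above, for anyone who wants to reproduce it. The phrases are the ones quoted in the post; t=0.5 is only an assumed starting point, not a recommended value, and no particular grouping is claimed here.

```
| makeresults
| eval problem_abstract="reset sap password,sap password reset,i care,i don't care,please reset my sap password"
| makemv problem_abstract delim=","
| mvexpand problem_abstract
| cluster showcount=true field=problem_abstract match=termset t=0.5
| table problem_abstract cluster_count
```

Rerun it with different t values, and with match=ngramset, to see at which point the longer phrase joins or leaves the 'sap password reset' cluster.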
@chinkeeparco - please go ahead with the clustering as suggested by rich and niket. You will have to play around with the t value and the match option to see what suits you best.
I did explicitly mention match=termset in my answer as well as "depending on your data". You may have to combine the two approaches I offered - normalize some outliers, then let cluster do the rest. Then again, some experimenting with various cluster options may yield acceptable results.
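A minimal sketch of combining the two approaches, normalizing first and then clustering. The filler words stripped by replace ("please", "my") and the t=0.5 threshold are hypothetical choices for illustration only; adjust both to your own data.

```
... | eval problem_abstract=lower(problem_abstract)
| eval problem_abstract=replace(problem_abstract, "^please\s+|\bmy\s+", "")
| cluster showcount=true countfield=count field=problem_abstract match=termset t=0.5
| sort - count
| table problem_abstract count
```

The idea is that removing known noise words before clustering leaves fewer stray terms to pull otherwise similar phrases apart, so cluster has less work to do.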
@Sukisen1981 yes, indeed, I mentioned that TF-IDF and NLP could be tried as well. But as @richgalloway said, the solution should be chosen to fit the use case.