Solved: Re: compare values inside mv field in a table

UnivLyon2 · ‎08-31-2020

Hello,

I've have an alert that returns by email suspicious login attempts in the form of a table with client_ip, number of different logins used, list of logins used, continent and country.

Basically, the table is created by this search (time window 60 minutes):

index=webauth sourcetype=cas login!="audit:unknown" | eval login=lower(login)
| stats dc(login) AS number, values(login) AS "logins list" by client_ip| iplocation allfields=true client_ip
| fields - City,MetroCode,Region,lat,lon,Timezone
| search number>1

Sample result:

client_ip	number	logins list	Continent	Country
192.168.0.6	3	foo bar baz	Somewhere	Here

We have many false positive alert because some users make typos when they try to log in. I would like to clean up this table from any login that is not very different from the first one.

If a result line lists those logins: myself, myselv, then the Levenshtein distance would be 1, then I would like to ditch the line (ie. number would fall from 2 to 1, and result would be excluded).

If a result line lists: myself, myselv, yourself, then the second login would be excluded, but the result should be kept in the final table because yourself is very different from myself.

I hope it makes sense.

I've studied the solution https://community.splunk.com/t5/Splunk-Search/Is-there-any-way-to-compare-multivalue-fields-to-singl... for hours, but my result is so ugly that I can't believe it's the only solution:

index=webauth sourcetype=cas login!="audit:unknown"
| eval login=lower(login)
| fields client_ip,login
| dedup client_ip,login
| mvcombine login
| eval n=mvcount(login), llogs=mvdedup(login)
| search n>1
| iplocation allfields=true client_ip
| fields - City,MetroCode,Region,lat,lon,Timezone,_raw,_time
| mvexpand login
| table *
| map maxsearches=100 search="
     | makeresults
     | eval login=\"$login$\", llogs=\"$llogs$\", number=\"$n$\" , Continent=\"$Continent$\" , Country=\"$Country$\" , client_ip=\"$client_ip$\" 
     |makemv delim=\"
\" llogs
     | mvexpand llogs
     |table *"
| `ut_levenshtein(login,llogs)`
| search ut_levenshtein>3
| fields - _time, llogs
| mvcombine login
| eval logins=mvdedup(login)
| eval number=mvcount(logins)
| fields - login
| dedup client_ip,logins
| table client_ip,number,logins,Continent,Country,ut_levenshtein

Any idea to design something better?

Thanks

niketn · ‎08-31-2020

@UnivLyon2 if you want to handle typos and find out similar names you may have to try some clustering algorithm. You can definitely try the built in cluster command. Following is one run anywhere example SPL based on the sample data and use case provided. However, this may take care of only miss spelled names. For complex scenarios like name with missing letters, special characters, spaces etc you may have to further change the logic of pre-processing the data before feeding to cluster command.

| makeresults
| fields - _time
| eval client_ip="10.10.10.10", number="3", login_list="Myself,Yourself,Myselv", continent="Somewhere", country="Here"
| eval login_list=split(login_list,",")
| fields - data 
|  mvexpand login_list
| eval splitLetters=mvjoin(split(login_list,"")," ")
| eval key=client_ip."-".continent."-".country."-".login_list."-".number."- ".splitLetters
| fields key 
| cluster field=key t=0.6 showcount=true 
| rex field=key "(?<client_ip>[^\-]+)\-(?<continent>[^\-]+)\-(?<country>[^\-]+)\-(?<login_list>[^\-]+)"
| stats list(login_list) as login_list count as number by client_ip continent country

Following is a run anywhere dashboard which explains every step of above search as to how it arrives from Myself, Yourself, Myselv as 3 login list to only 2 i.e. Myself and Yourself. You can play around with cluster command sensitivity threshold and other names in the text box as comma separated list to see what level actually fits the need.
PS: Since this is clustering algorithm based correlation, there may be situations where actually two or more different valid names which are similar as per configured threshold, may get clustered as 1.

Following is the Simple XML dashboard run anywhere example from the screenshot above.

<form theme="dark">
  <label>Cluster to find similar names</label>
  <!-- Independent search to dynamically set count of names in the login_list -->
  <search>
    <query>| makeresults
    | eval number=mvcount(split("$login_list$",","))
    </query>
    <earliest>-1s</earliest>
    <latest>now</latest>
    <done>
      <condition match="$job.resultCount$==0">
        <set token="number">0</set>
      </condition>
      <condition>
        <set token="number">$result.number$</set>
      </condition>
    </done>
  </search>
  <fieldset submitButton="false"></fieldset>
  <row>
    <panel id="panel_title">
      <title>Find Similar Names for Grouping using Cluster Command</title>
      <html>
        <style depends="$alwaysHideCSSPanel$">
          div#panel_title h2.panel-title{
            text-align:center;color:#7ED2FF;font-weight:bold;
          }
          div.html-container{
            display:flex;
          }
          div.html-container .html-header h3{
            text-align:center;color:#7ED2FF;padding-right:10px;
          }
          div.html-container  div.html-description{
            padding-top: 10px;
          }
          div#panel_step2_input div.dashboard-panel,
          div#panel_step5_output div.dashboard-panel{
            border-style: solid;
            border-color: green;
          }
        </style>
      </html>
    </panel>
  </row>
  <row>
    <panel>
      <html>
        <div class="html-container">
          <div class="html-header">
            <h3>Step 1 - Add data:</h3>
          </div>
          <div class="html-description">
            <code>
              *login_list has comma separated names. **Threshold value is for Cluster command where 0.1 is highest sensitivity and 0.9 is lowest.
            </code>
          </div>
        </div>
      </html>
    </panel>
  </row>
  <row>
    <panel>
      <input type="text" token="client_ip" searchWhenChanged="true">
        <label>Client IP</label>
        <default>10.10.10.10</default>
      </input>
      <input type="text" token="continent" searchWhenChanged="true">
        <label>Continent</label>
        <default>Somewhere</default>
      </input>
      <input type="text" token="country" searchWhenChanged="true">
        <label>country</label>
        <default>Here</default>
      </input>
      <input type="text" token="login_list" searchWhenChanged="true">
        <label>login_list</label>
        <default>foo,bar,baz</default>
      </input>
      <input type="dropdown" token="threshold" searchWhenChanged="true">
        <label>Cluster sensitivity (0.1 - 0.9)</label>
        <choice value="0.1">0.1</choice>
        <choice value="0.2">0.2</choice>
        <choice value="0.3">0.3</choice>
        <choice value="0.4">0.4</choice>
        <choice value="0.5">0.5</choice>
        <choice value="0.6">0.6</choice>
        <choice value="0.7">0.7</choice>
        <choice value="0.8">0.8</choice>
        <choice value="0.9">0.9</choice>
        <default>0.6</default>
      </input>
    </panel>
  </row>
  <row>
    <panel id="panel_step2_input">
      <html>
        <div class="html-container">
          <div class="html-header">
            <h3>Step 2 - Genrate data set:</h3>
          </div>
          <div class="html-description">
            <code>
              Generate data with login_list having comma separated names.
            </code>
          </div>
        </div>
      </html>
      <table>
        <search id="sSampleData">
          <query>| makeresults
| fields - _time
| eval client_ip="$client_ip$", number="$number$", login_list="$login_list$", continent="$continent$", country="$country$"
| eval login_list=split(login_list,",")
| fields - data</query>
          <earliest>-1s</earliest>
          <latest>now</latest>
          <sampleRatio>1</sampleRatio>
        </search>
        <option name="count">100</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">none</option>
        <option name="percentagesRow">false</option>
        <option name="refresh.display">progressbar</option>
        <option name="rowNumbers">false</option>
        <option name="totalsRow">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
  </row>
  <row>
    <panel>
      <html>
        <div class="html-container">
          <div class="html-header">
            <h3>Step 3 - Prepare Data:</h3>
          </div>
          <div class="html-description">
            <code>
              Split names in login_list as letters. Full names with spaces, special characters, too many mistakes etc. may need extra logic.
            </code>
          </div>
        </div>
      </html>
      <table>
        <search id="sGenKey" base="sSampleData">
          <query>| mvexpand login_list
| eval splitLetters=mvjoin(split(login_list,"")," ")
| eval key=client_ip."-".continent."-".country."-".login_list."-".number."- ".splitLetters
| fields key</query>
        </search>
        <option name="count">100</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">none</option>
        <option name="percentagesRow">false</option>
        <option name="refresh.display">progressbar</option>
        <option name="rowNumbers">false</option>
        <option name="totalsRow">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
  </row>
  <row>
    <panel>
      <html>
        <div class="html-container">
          <div class="html-header">
            <h3>Step 4 - Clustered Events based on similarity of login_list names:</h3>
          </div>
          <div class="html-description">
            <code>
              Threshold and login_list name play role here.
            </code>
          </div>
        </div>
      </html>
      <table>
        <search id="sCluster" base="sGenKey">
          <query>| cluster field=key t=$threshold$ showcount=true</query>
        </search>
        <option name="count">100</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">none</option>
        <option name="percentagesRow">false</option>
        <option name="refresh.display">progressbar</option>
        <option name="rowNumbers">false</option>
        <option name="totalsRow">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
  </row>
  <row>
    <panel id="panel_step5_output">
      <html>
        <div class="html-container">
          <div class="html-header">
            <h3>Step 5 - Final Result:</h3>
          </div>
          <div class="html-description">
            <code>
               Extract required fields client_ip, continent, country and show final count login_list
            </code>
          </div>
        </div>
      </html>
      <table>
        <search base="sCluster">
          <query>| cluster field=key t=0.6 showcount=true
| rex field=key "(?&lt;client_ip&gt;[^\-]+)\-(?&lt;continent&gt;[^\-]+)\-(?&lt;country&gt;[^\-]+)\-(?&lt;login_list&gt;[^\-]+)"
| stats list(login_list) as login_list count as number by client_ip continent country</query>
        </search>
        <option name="count">100</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">none</option>
        <option name="percentagesRow">false</option>
        <option name="refresh.display">progressbar</option>
        <option name="rowNumbers">false</option>
        <option name="totalsRow">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
  </row>
</form>

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

View solution in original post

UnivLyon2 · ‎08-31-2020

WHOUA.

Ok, thank you very much for this incredibly detailed and enlightening answer. Just a note: is that a typo on line 8 of your search example? The cluster command is duplicated.
Anyway, the result you got is almost perfect, I've just made some modifications on the cluster command arguments so that results are even better and more appropriate than those from my own search.

I've replaced

…|cluster field=key t=0.6 showcount=true | cluster field=key t=0.6 showcount=true

with

…| cluster field=key t=0.8 showcount=true match=ngramset

Now the result is perfect, it can even detect "firstname.lastname" and "flastname" as duplicates (happens a lot here), Levenshtein distance can't detect that.

niketn · ‎09-01-2020

Thanks for correction, I had duplicate cluster command due to copy/paste. I have corrected the same.

For the couple of examples you had shared 0.6 worked but that is sensitivity based on the data of your use case. I feel first name and last name, it would be better to keep same length and padded spaces for better results. But if you got it working that's a great news.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

niketn · ‎08-31-2020

@UnivLyon2 if you want to handle typos and find out similar names you may have to try some clustering algorithm. You can definitely try the built in cluster command. Following is one run anywhere example SPL based on the sample data and use case provided. However, this may take care of only miss spelled names. For complex scenarios like name with missing letters, special characters, spaces etc you may have to further change the logic of pre-processing the data before feeding to cluster command.

| makeresults
| fields - _time
| eval client_ip="10.10.10.10", number="3", login_list="Myself,Yourself,Myselv", continent="Somewhere", country="Here"
| eval login_list=split(login_list,",")
| fields - data 
|  mvexpand login_list
| eval splitLetters=mvjoin(split(login_list,"")," ")
| eval key=client_ip."-".continent."-".country."-".login_list."-".number."- ".splitLetters
| fields key 
| cluster field=key t=0.6 showcount=true 
| rex field=key "(?<client_ip>[^\-]+)\-(?<continent>[^\-]+)\-(?<country>[^\-]+)\-(?<login_list>[^\-]+)"
| stats list(login_list) as login_list count as number by client_ip continent country

Following is a run anywhere dashboard which explains every step of above search as to how it arrives from Myself, Yourself, Myselv as 3 login list to only 2 i.e. Myself and Yourself. You can play around with cluster command sensitivity threshold and other names in the text box as comma separated list to see what level actually fits the need.
PS: Since this is clustering algorithm based correlation, there may be situations where actually two or more different valid names which are similar as per configured threshold, may get clustered as 1.

Following is the Simple XML dashboard run anywhere example from the screenshot above.

<form theme="dark">
  <label>Cluster to find similar names</label>
  <!-- Independent search to dynamically set count of names in the login_list -->
  <search>
    <query>| makeresults
    | eval number=mvcount(split("$login_list$",","))
    </query>
    <earliest>-1s</earliest>
    <latest>now</latest>
    <done>
      <condition match="$job.resultCount$==0">
        <set token="number">0</set>
      </condition>
      <condition>
        <set token="number">$result.number$</set>
      </condition>
    </done>
  </search>
  <fieldset submitButton="false"></fieldset>
  <row>
    <panel id="panel_title">
      <title>Find Similar Names for Grouping using Cluster Command</title>
      <html>
        <style depends="$alwaysHideCSSPanel$">
          div#panel_title h2.panel-title{
            text-align:center;color:#7ED2FF;font-weight:bold;
          }
          div.html-container{
            display:flex;
          }
          div.html-container .html-header h3{
            text-align:center;color:#7ED2FF;padding-right:10px;
          }
          div.html-container  div.html-description{
            padding-top: 10px;
          }
          div#panel_step2_input div.dashboard-panel,
          div#panel_step5_output div.dashboard-panel{
            border-style: solid;
            border-color: green;
          }
        </style>
      </html>
    </panel>
  </row>
  <row>
    <panel>
      <html>
        <div class="html-container">
          <div class="html-header">
            <h3>Step 1 - Add data:</h3>
          </div>
          <div class="html-description">
            <code>
              *login_list has comma separated names. **Threshold value is for Cluster command where 0.1 is highest sensitivity and 0.9 is lowest.
            </code>
          </div>
        </div>
      </html>
    </panel>
  </row>
  <row>
    <panel>
      <input type="text" token="client_ip" searchWhenChanged="true">
        <label>Client IP</label>
        <default>10.10.10.10</default>
      </input>
      <input type="text" token="continent" searchWhenChanged="true">
        <label>Continent</label>
        <default>Somewhere</default>
      </input>
      <input type="text" token="country" searchWhenChanged="true">
        <label>country</label>
        <default>Here</default>
      </input>
      <input type="text" token="login_list" searchWhenChanged="true">
        <label>login_list</label>
        <default>foo,bar,baz</default>
      </input>
      <input type="dropdown" token="threshold" searchWhenChanged="true">
        <label>Cluster sensitivity (0.1 - 0.9)</label>
        <choice value="0.1">0.1</choice>
        <choice value="0.2">0.2</choice>
        <choice value="0.3">0.3</choice>
        <choice value="0.4">0.4</choice>
        <choice value="0.5">0.5</choice>
        <choice value="0.6">0.6</choice>
        <choice value="0.7">0.7</choice>
        <choice value="0.8">0.8</choice>
        <choice value="0.9">0.9</choice>
        <default>0.6</default>
      </input>
    </panel>
  </row>
  <row>
    <panel id="panel_step2_input">
      <html>
        <div class="html-container">
          <div class="html-header">
            <h3>Step 2 - Genrate data set:</h3>
          </div>
          <div class="html-description">
            <code>
              Generate data with login_list having comma separated names.
            </code>
          </div>
        </div>
      </html>
      <table>
        <search id="sSampleData">
          <query>| makeresults
| fields - _time
| eval client_ip="$client_ip$", number="$number$", login_list="$login_list$", continent="$continent$", country="$country$"
| eval login_list=split(login_list,",")
| fields - data</query>
          <earliest>-1s</earliest>
          <latest>now</latest>
          <sampleRatio>1</sampleRatio>
        </search>
        <option name="count">100</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">none</option>
        <option name="percentagesRow">false</option>
        <option name="refresh.display">progressbar</option>
        <option name="rowNumbers">false</option>
        <option name="totalsRow">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
  </row>
  <row>
    <panel>
      <html>
        <div class="html-container">
          <div class="html-header">
            <h3>Step 3 - Prepare Data:</h3>
          </div>
          <div class="html-description">
            <code>
              Split names in login_list as letters. Full names with spaces, special characters, too many mistakes etc. may need extra logic.
            </code>
          </div>
        </div>
      </html>
      <table>
        <search id="sGenKey" base="sSampleData">
          <query>| mvexpand login_list
| eval splitLetters=mvjoin(split(login_list,"")," ")
| eval key=client_ip."-".continent."-".country."-".login_list."-".number."- ".splitLetters
| fields key</query>
        </search>
        <option name="count">100</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">none</option>
        <option name="percentagesRow">false</option>
        <option name="refresh.display">progressbar</option>
        <option name="rowNumbers">false</option>
        <option name="totalsRow">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
  </row>
  <row>
    <panel>
      <html>
        <div class="html-container">
          <div class="html-header">
            <h3>Step 4 - Clustered Events based on similarity of login_list names:</h3>
          </div>
          <div class="html-description">
            <code>
              Threshold and login_list name play role here.
            </code>
          </div>
        </div>
      </html>
      <table>
        <search id="sCluster" base="sGenKey">
          <query>| cluster field=key t=$threshold$ showcount=true</query>
        </search>
        <option name="count">100</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">none</option>
        <option name="percentagesRow">false</option>
        <option name="refresh.display">progressbar</option>
        <option name="rowNumbers">false</option>
        <option name="totalsRow">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
  </row>
  <row>
    <panel id="panel_step5_output">
      <html>
        <div class="html-container">
          <div class="html-header">
            <h3>Step 5 - Final Result:</h3>
          </div>
          <div class="html-description">
            <code>
               Extract required fields client_ip, continent, country and show final count login_list
            </code>
          </div>
        </div>
      </html>
      <table>
        <search base="sCluster">
          <query>| cluster field=key t=0.6 showcount=true
| rex field=key "(?&lt;client_ip&gt;[^\-]+)\-(?&lt;continent&gt;[^\-]+)\-(?&lt;country&gt;[^\-]+)\-(?&lt;login_list&gt;[^\-]+)"
| stats list(login_list) as login_list count as number by client_ip continent country</query>
        </search>
        <option name="count">100</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">none</option>
        <option name="percentagesRow">false</option>
        <option name="refresh.display">progressbar</option>
        <option name="rowNumbers">false</option>
        <option name="totalsRow">false</option>
        <option name="wrap">true</option>
      </table>
    </panel>
  </row>
</form>

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

compare values inside mv field in a table

fields

Splunk Decoded: Service Maps vs Service Analyzer Tree View vs Flow Maps

What’s New in Splunk Observability – September 2025

Fun with Regular Expression - multiples of nine

Are you a member of the Splunk Community?

compare values inside mv field in a table

fields

Splunk Decoded: Service Maps vs Service Analyzer Tree View vs Flow Maps

What’s New in Splunk Observability – September 2025

Fun with Regular Expression - multiples of nine