Splunk Search

How to discover fields that have the same values?

automayt
Explorer

I have sourcetype=apple and sourcetype=orange. They are both network related sourcetypes. Is there an automated way of finding redundancies in the two (or more) sourcetypes? For instance apple has "sip" and orange has "sourceip". I want to automate the discovery of the redundant fields. While I don't know how to do this, I had considered flipping the values and fields such that a result might look like 

value field
12.23.34.45

apple-sip

orange-sourceip 

 

However, I am open to anything that accomplishes the goal of auto-discovering redundant fields across multiple sourcetypes. I thought foreach * might possible do the trick as well. 

Labels (1)
Tags (1)
0 Karma
1 Solution

yuanliu
SplunkTrust
SplunkTrust

I'm glad you found a method that fit your need.  Would you mind explain the actual logic?  Does this mean that any match would be considered duplicates?  Or are you just trying to fish out duplicate candidates?  Note this method depends a lot on how closely events from the two sourcetypes are mingled from index search returns, which is a factor that you have little control.

Here is a demo with my emulation, i.e.,

dipsipsourceipsourcetype
  12.23.34.45orange
  23.34.45.56orange
12.23.34.4512.23.34.45 apple
23.34.45.5645.56.67.89 apple

First, direct application

 

 

| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append 
    [makeresults
    | eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
    | mvexpand sip
    | fromjson sip]
``` the above emulates
index=fruit sourcetype IN ("apple","orange")
```
| transpose
| eval allvalues=mvappend('row 1', 'row 2') 
| stats dc(column) AS dc_column values(column) AS columns BY allvalues 
| where dc_column>1

 

 

This gives no output.

That's because the events that match are separated too "far" away by events that do not.  Extend to row 3 and row 4,

 

 

| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append 
    [makeresults
    | eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
    | mvexpand sip
    | fromjson sip]
``` the above emulates
index=fruit sourcetype IN ("apple","orange")
```
| transpose
| eval allvalues=mvappend('row 1', 'row 2', 'row 3', 'row 4') 
| stats dc(column) AS dc_column values(column) AS columns BY allvalues 
| where dc_column>1

 

 

allvaluesdc_column
columns
12.23.34.453
dip
sip
sourceip
23.34.45.562
dip
sourceip

A more robust way to do this  would be to put all values in one row with stats, i.e.,

 

| fields - sourcetype
| stats values(*) as *
| transpose
| rename "row 1" as ip
| stats values(column) as fieldname by ip
| where mvcount(fieldname) > 1

 

Using the same emulation, the output would still be

ip
fieldname
12.23.34.45
dip
sip
sourceip
23.34.45.56
dip
sourceip

but the order and interleave of events no longer matter.  Additionally, there will be fewer rows to transpose, reducing cost.

View solution in original post

automayt
Explorer

This ended up getting the job done, but it still doesn't reference the sourcetype automagically. I'm aware I could use header_name=sourcetype to change row 1 and row 2, but ended up going with this for more general ease for someone following.

index=fruit sourcetype IN ("apple","orange")
| transpose
| eval allvalues=mvappend('row 1', 'row 2')
| stats dc(column) AS dc_column values(column) AS columns BY allvalues
| where dc_column>1

This flips the column names into a value, and then if 2 or more columns match the same value, it will suggest that they might be duplicates and list the duplicate fields.

yuanliu
SplunkTrust
SplunkTrust

I'm glad you found a method that fit your need.  Would you mind explain the actual logic?  Does this mean that any match would be considered duplicates?  Or are you just trying to fish out duplicate candidates?  Note this method depends a lot on how closely events from the two sourcetypes are mingled from index search returns, which is a factor that you have little control.

Here is a demo with my emulation, i.e.,

dipsipsourceipsourcetype
  12.23.34.45orange
  23.34.45.56orange
12.23.34.4512.23.34.45 apple
23.34.45.5645.56.67.89 apple

First, direct application

 

 

| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append 
    [makeresults
    | eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
    | mvexpand sip
    | fromjson sip]
``` the above emulates
index=fruit sourcetype IN ("apple","orange")
```
| transpose
| eval allvalues=mvappend('row 1', 'row 2') 
| stats dc(column) AS dc_column values(column) AS columns BY allvalues 
| where dc_column>1

 

 

This gives no output.

That's because the events that match are separated too "far" away by events that do not.  Extend to row 3 and row 4,

 

 

| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append 
    [makeresults
    | eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
    | mvexpand sip
    | fromjson sip]
``` the above emulates
index=fruit sourcetype IN ("apple","orange")
```
| transpose
| eval allvalues=mvappend('row 1', 'row 2', 'row 3', 'row 4') 
| stats dc(column) AS dc_column values(column) AS columns BY allvalues 
| where dc_column>1

 

 

allvaluesdc_column
columns
12.23.34.453
dip
sip
sourceip
23.34.45.562
dip
sourceip

A more robust way to do this  would be to put all values in one row with stats, i.e.,

 

| fields - sourcetype
| stats values(*) as *
| transpose
| rename "row 1" as ip
| stats values(column) as fieldname by ip
| where mvcount(fieldname) > 1

 

Using the same emulation, the output would still be

ip
fieldname
12.23.34.45
dip
sip
sourceip
23.34.45.56
dip
sourceip

but the order and interleave of events no longer matter.  Additionally, there will be fewer rows to transpose, reducing cost.

automayt
Explorer

Your final solution is not only scales to more situations, but doesn't force it to be a 2 sourcetype situation. Thank you. This mostly satisfies everything (although i'd be great to know the sourcetype with the field in the lists, but that is probably impractical). TLDR, here was @yuanliu solution that is better than mine.

search_here_that_has_two_or_more_sourcetypes
| fields - sourcetype
| stats values(*) as *
| transpose
| rename "row 1" as ip
| stats values(column) as fieldname by ip
| where mvcount(fieldname) > 1

 

0 Karma

yuanliu
SplunkTrust
SplunkTrust

to know the sourcetype with the field in the lists,

I haven't found a scalable method to do this.  But if

  1. combinations of sourcetypes and fields of interest are few and known, and
  2. there is no identical field name in different sourcetypes,

you can rename each field to contain info about its sourcetype.

0 Karma

jotne
Builder

I do not think there is an easy solution for this.  What you can do is to examine your data and set field alias so that thy will respond to same field name

0 Karma

yuanliu
SplunkTrust
SplunkTrust

You may want to clarify the conditions and logic more, maybe illustrate your data and desired output, too.

For example, given that apple and orange are different sourcetypes, you could have data like this:

dipsipsourceipsourcetype
  12.23.34.45orange
  23.34.45.56orange
12.23.34.4512.23.34.45 apple
23.34.45.5645.56.67.89 apple

Here, dip contains all values in sourceip, while sip contains partial overlap.  What is your intention with this dataset?

Here is an emulation for this dataset.  You may want to play with it and compare with your real data, and give us a realistic illustration

 

| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append 
    [makeresults
    | eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
    | mvexpand sip
    | fromjson sip]
``` the above emulates data, only runs in Splunk 8 or later ```

 

0 Karma

automayt
Explorer

Thanks for the ideas so far. I will clarify more. Lets say I have the following data that I am certain represents the same communication (orange is a device that sees traffic and apple is another device, I have confirmed that it is the identical communication, but they log differently). 

destipvlandipsipsourceipsourcetype
5.6.7.8   12.23.34.45orange
5.22.66.11   45.56.67.89orange
 1115.6.7.812.23.34.45 apple
 1115.22.66.1145.56.67.89 apple

 

I can look at this and say "oh, they are the same traffic but it looks like orange and apple use a different taxonomy! I will combine sourceip and sip into src_ip. I will combine destip and dip into dest_ip!". I want to do this programatically without having to manually identify the common fields. Using JSON or something, I could transpose the values and the fields and say "show me the listing of the fields BY the values. Programmatically my result would look something akin to the following with a transpose;

45.56.67.8912.23.34.455.22.66.115.6.7.8111

sourceip

sip

sourceip

sip

destip

dip

destip

dip

vlan

orange

apple

orange

apple

 orange

apple

orange

apple

apple

 

Admittedly that wouldn't be the prettiest version of the result, but it would get the job done. Any version of that result (one that says "here are redundant field names from semi-redundant sources) that lists the fields and their sourcetypes) would work.

0 Karma

VatsalJagani
SplunkTrust
SplunkTrust

@automayt - You can try the fieldsummary command to get summary of fields you are getting including common values.

https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/fieldsummary

 

And you could run it over multiple sourcetypes with map command.

| makeresults | eval sourcetype="abc"
| append [| makeresults | eval sourcetype="xyz"]

| map search="index=* sourcetype=$sourcetype|s$ | fieldsummary | eval sourcetype=$sourcetype|s$"

 

This is just a starting point, and I hope you get to take it to the next level.

0 Karma
Get Updates on the Splunk Community!

How to Monitor Google Kubernetes Engine (GKE)

We’ve looked at how to integrate Kubernetes environments with Splunk Observability Cloud, but what about ...

Index This | How can you make 45 using only 4?

October 2024 Edition Hayyy Splunk Education Enthusiasts and the Eternally Curious!  We’re back with this ...

Splunk Education Goes to Washington | Splunk GovSummit 2024

If you’re in the Washington, D.C. area, this is your opportunity to take your career and Splunk skills to the ...