I have sourcetype=apple and sourcetype=orange. They are both network related sourcetypes. Is there an automated way of finding redundancies in the two (or more) sourcetypes? For instance apple has "sip" and orange has "sourceip". I want to automate the discovery of the redundant fields. While I don't know how to do this, I had considered flipping the values and fields such that a result might look like
value | field |
12.23.34.45 |
apple-sip orange-sourceip |
However, I am open to anything that accomplishes the goal of auto-discovering redundant fields across multiple sourcetypes. I thought foreach * might possible do the trick as well.
I'm glad you found a method that fit your need. Would you mind explain the actual logic? Does this mean that any match would be considered duplicates? Or are you just trying to fish out duplicate candidates? Note this method depends a lot on how closely events from the two sourcetypes are mingled from index search returns, which is a factor that you have little control.
Here is a demo with my emulation, i.e.,
dip | sip | sourceip | sourcetype |
12.23.34.45 | orange | ||
23.34.45.56 | orange | ||
12.23.34.45 | 12.23.34.45 | apple | |
23.34.45.56 | 45.56.67.89 | apple |
First, direct application
| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append
[makeresults
| eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
| mvexpand sip
| fromjson sip]
``` the above emulates
index=fruit sourcetype IN ("apple","orange")
```
| transpose
| eval allvalues=mvappend('row 1', 'row 2')
| stats dc(column) AS dc_column values(column) AS columns BY allvalues
| where dc_column>1
This gives no output.
That's because the events that match are separated too "far" away by events that do not. Extend to row 3 and row 4,
| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append
[makeresults
| eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
| mvexpand sip
| fromjson sip]
``` the above emulates
index=fruit sourcetype IN ("apple","orange")
```
| transpose
| eval allvalues=mvappend('row 1', 'row 2', 'row 3', 'row 4')
| stats dc(column) AS dc_column values(column) AS columns BY allvalues
| where dc_column>1
allvalues | dc_column | columns |
12.23.34.45 | 3 | dip sip sourceip |
23.34.45.56 | 2 | dip sourceip |
A more robust way to do this would be to put all values in one row with stats, i.e.,
| fields - sourcetype
| stats values(*) as *
| transpose
| rename "row 1" as ip
| stats values(column) as fieldname by ip
| where mvcount(fieldname) > 1
Using the same emulation, the output would still be
ip | fieldname |
12.23.34.45 | dip sip sourceip |
23.34.45.56 | dip sourceip |
but the order and interleave of events no longer matter. Additionally, there will be fewer rows to transpose, reducing cost.
This ended up getting the job done, but it still doesn't reference the sourcetype automagically. I'm aware I could use header_name=sourcetype to change row 1 and row 2, but ended up going with this for more general ease for someone following.
index=fruit sourcetype IN ("apple","orange")
| transpose
| eval allvalues=mvappend('row 1', 'row 2')
| stats dc(column) AS dc_column values(column) AS columns BY allvalues
| where dc_column>1
This flips the column names into a value, and then if 2 or more columns match the same value, it will suggest that they might be duplicates and list the duplicate fields.
I'm glad you found a method that fit your need. Would you mind explain the actual logic? Does this mean that any match would be considered duplicates? Or are you just trying to fish out duplicate candidates? Note this method depends a lot on how closely events from the two sourcetypes are mingled from index search returns, which is a factor that you have little control.
Here is a demo with my emulation, i.e.,
dip | sip | sourceip | sourcetype |
12.23.34.45 | orange | ||
23.34.45.56 | orange | ||
12.23.34.45 | 12.23.34.45 | apple | |
23.34.45.56 | 45.56.67.89 | apple |
First, direct application
| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append
[makeresults
| eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
| mvexpand sip
| fromjson sip]
``` the above emulates
index=fruit sourcetype IN ("apple","orange")
```
| transpose
| eval allvalues=mvappend('row 1', 'row 2')
| stats dc(column) AS dc_column values(column) AS columns BY allvalues
| where dc_column>1
This gives no output.
That's because the events that match are separated too "far" away by events that do not. Extend to row 3 and row 4,
| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append
[makeresults
| eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
| mvexpand sip
| fromjson sip]
``` the above emulates
index=fruit sourcetype IN ("apple","orange")
```
| transpose
| eval allvalues=mvappend('row 1', 'row 2', 'row 3', 'row 4')
| stats dc(column) AS dc_column values(column) AS columns BY allvalues
| where dc_column>1
allvalues | dc_column | columns |
12.23.34.45 | 3 | dip sip sourceip |
23.34.45.56 | 2 | dip sourceip |
A more robust way to do this would be to put all values in one row with stats, i.e.,
| fields - sourcetype
| stats values(*) as *
| transpose
| rename "row 1" as ip
| stats values(column) as fieldname by ip
| where mvcount(fieldname) > 1
Using the same emulation, the output would still be
ip | fieldname |
12.23.34.45 | dip sip sourceip |
23.34.45.56 | dip sourceip |
but the order and interleave of events no longer matter. Additionally, there will be fewer rows to transpose, reducing cost.
Your final solution is not only scales to more situations, but doesn't force it to be a 2 sourcetype situation. Thank you. This mostly satisfies everything (although i'd be great to know the sourcetype with the field in the lists, but that is probably impractical). TLDR, here was @yuanliu solution that is better than mine.
search_here_that_has_two_or_more_sourcetypes
| fields - sourcetype
| stats values(*) as *
| transpose
| rename "row 1" as ip
| stats values(column) as fieldname by ip
| where mvcount(fieldname) > 1
to know the sourcetype with the field in the lists,
I haven't found a scalable method to do this. But if
you can rename each field to contain info about its sourcetype.
I do not think there is an easy solution for this. What you can do is to examine your data and set field alias so that thy will respond to same field name
You may want to clarify the conditions and logic more, maybe illustrate your data and desired output, too.
For example, given that apple and orange are different sourcetypes, you could have data like this:
dip | sip | sourceip | sourcetype |
12.23.34.45 | orange | ||
23.34.45.56 | orange | ||
12.23.34.45 | 12.23.34.45 | apple | |
23.34.45.56 | 45.56.67.89 | apple |
Here, dip contains all values in sourceip, while sip contains partial overlap. What is your intention with this dataset?
Here is an emulation for this dataset. You may want to play with it and compare with your real data, and give us a realistic illustration
| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append
[makeresults
| eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
| mvexpand sip
| fromjson sip]
``` the above emulates data, only runs in Splunk 8 or later ```
Thanks for the ideas so far. I will clarify more. Lets say I have the following data that I am certain represents the same communication (orange is a device that sees traffic and apple is another device, I have confirmed that it is the identical communication, but they log differently).
destip | vlan | dip | sip | sourceip | sourcetype |
5.6.7.8 | 12.23.34.45 | orange | |||
5.22.66.11 | 45.56.67.89 | orange | |||
111 | 5.6.7.8 | 12.23.34.45 | apple | ||
111 | 5.22.66.11 | 45.56.67.89 | apple |
I can look at this and say "oh, they are the same traffic but it looks like orange and apple use a different taxonomy! I will combine sourceip and sip into src_ip. I will combine destip and dip into dest_ip!". I want to do this programatically without having to manually identify the common fields. Using JSON or something, I could transpose the values and the fields and say "show me the listing of the fields BY the values. Programmatically my result would look something akin to the following with a transpose;
45.56.67.89 | 12.23.34.45 | 5.22.66.11 | 5.6.7.8 | 111 |
sourceip sip | sourceip sip | destip dip | destip dip | vlan |
orange apple | orange apple | orange apple | orange apple | apple |
Admittedly that wouldn't be the prettiest version of the result, but it would get the job done. Any version of that result (one that says "here are redundant field names from semi-redundant sources) that lists the fields and their sourcetypes) would work.
@automayt - You can try the fieldsummary command to get summary of fields you are getting including common values.
- https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/fieldsummary
And you could run it over multiple sourcetypes with map command.
| makeresults | eval sourcetype="abc"
| append [| makeresults | eval sourcetype="xyz"]
| map search="index=* sourcetype=$sourcetype|s$ | fieldsummary | eval sourcetype=$sourcetype|s$"
This is just a starting point, and I hope you get to take it to the next level.