Solved: How to discover fields that have the same values?

automayt · ‎08-04-2023

I have sourcetype=apple and sourcetype=orange. They are both network related sourcetypes. Is there an automated way of finding redundancies in the two (or more) sourcetypes? For instance apple has "sip" and orange has "sourceip". I want to automate the discovery of the redundant fields. While I don't know how to do this, I had considered flipping the values and fields such that a result might look like

value

field

12.23.34.45

apple-sip

orange-sourceip

However, I am open to anything that accomplishes the goal of auto-discovering redundant fields across multiple sourcetypes. I thought foreach * might possible do the trick as well.

yuanliu · ‎08-04-2023

I'm glad you found a method that fit your need. Would you mind explain the actual logic? Does this mean that any match would be considered duplicates? Or are you just trying to fish out duplicate candidates? Note this method depends a lot on how closely events from the two sourcetypes are mingled from index search returns, which is a factor that you have little control.

Here is a demo with my emulation, i.e.,

dip	sip	sourceip	sourcetype
		12.23.34.45	orange
		23.34.45.56	orange
12.23.34.45	12.23.34.45		apple
23.34.45.56	45.56.67.89		apple

First, direct application

| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append 
    [makeresults
    | eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
    | mvexpand sip
    | fromjson sip]
``` the above emulates
index=fruit sourcetype IN ("apple","orange")
```
| transpose
| eval allvalues=mvappend('row 1', 'row 2') 
| stats dc(column) AS dc_column values(column) AS columns BY allvalues 
| where dc_column>1

This gives no output.

That's because the events that match are separated too "far" away by events that do not. Extend to row 3 and row 4,

| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append 
    [makeresults
    | eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
    | mvexpand sip
    | fromjson sip]
``` the above emulates
index=fruit sourcetype IN ("apple","orange")
```
| transpose
| eval allvalues=mvappend('row 1', 'row 2', 'row 3', 'row 4') 
| stats dc(column) AS dc_column values(column) AS columns BY allvalues 
| where dc_column>1

allvalues	dc_column	columns
12.23.34.45	3	dip sip sourceip
23.34.45.56	2	dip sourceip

A more robust way to do this would be to put all values in one row with stats, i.e.,

| fields - sourcetype
| stats values(*) as *
| transpose
| rename "row 1" as ip
| stats values(column) as fieldname by ip
| where mvcount(fieldname) > 1

Using the same emulation, the output would still be

ip	fieldname
12.23.34.45	dip sip sourceip
23.34.45.56	dip sourceip

but the order and interleave of events no longer matter. Additionally, there will be fewer rows to transpose, reducing cost.

View solution in original post

automayt · ‎08-04-2023

This ended up getting the job done, but it still doesn't reference the sourcetype automagically. I'm aware I could use header_name=sourcetype to change row 1 and row 2, but ended up going with this for more general ease for someone following.

index=fruit sourcetype IN ("apple","orange")
| transpose
| eval allvalues=mvappend('row 1', 'row 2') 
| stats dc(column) AS dc_column values(column) AS columns BY allvalues 
| where dc_column>1

This flips the column names into a value, and then if 2 or more columns match the same value, it will suggest that they might be duplicates and list the duplicate fields.

yuanliu · ‎08-04-2023

I'm glad you found a method that fit your need. Would you mind explain the actual logic? Does this mean that any match would be considered duplicates? Or are you just trying to fish out duplicate candidates? Note this method depends a lot on how closely events from the two sourcetypes are mingled from index search returns, which is a factor that you have little control.

Here is a demo with my emulation, i.e.,

dip	sip	sourceip	sourcetype
		12.23.34.45	orange
		23.34.45.56	orange
12.23.34.45	12.23.34.45		apple
23.34.45.56	45.56.67.89		apple

First, direct application

| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append 
    [makeresults
    | eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
    | mvexpand sip
    | fromjson sip]
``` the above emulates
index=fruit sourcetype IN ("apple","orange")
```
| transpose
| eval allvalues=mvappend('row 1', 'row 2') 
| stats dc(column) AS dc_column values(column) AS columns BY allvalues 
| where dc_column>1

This gives no output.

That's because the events that match are separated too "far" away by events that do not. Extend to row 3 and row 4,

| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append 
    [makeresults
    | eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
    | mvexpand sip
    | fromjson sip]
``` the above emulates
index=fruit sourcetype IN ("apple","orange")
```
| transpose
| eval allvalues=mvappend('row 1', 'row 2', 'row 3', 'row 4') 
| stats dc(column) AS dc_column values(column) AS columns BY allvalues 
| where dc_column>1

allvalues	dc_column	columns
12.23.34.45	3	dip sip sourceip
23.34.45.56	2	dip sourceip

A more robust way to do this would be to put all values in one row with stats, i.e.,

| fields - sourcetype
| stats values(*) as *
| transpose
| rename "row 1" as ip
| stats values(column) as fieldname by ip
| where mvcount(fieldname) > 1

Using the same emulation, the output would still be

ip	fieldname
12.23.34.45	dip sip sourceip
23.34.45.56	dip sourceip

but the order and interleave of events no longer matter. Additionally, there will be fewer rows to transpose, reducing cost.

automayt · ‎08-09-2023

Your final solution is not only scales to more situations, but doesn't force it to be a 2 sourcetype situation. Thank you. This mostly satisfies everything (although i'd be great to know the sourcetype with the field in the lists, but that is probably impractical). TLDR, here was @yuanliu solution that is better than mine.

search_here_that_has_two_or_more_sourcetypes
| fields - sourcetype
| stats values(*) as *
| transpose
| rename "row 1" as ip
| stats values(column) as fieldname by ip
| where mvcount(fieldname) > 1

yuanliu · ‎08-10-2023

to know the sourcetype with the field in the lists,

I haven't found a scalable method to do this. But if

combinations of sourcetypes and fields of interest are few and known, and
there is no identical field name in different sourcetypes,

you can rename each field to contain info about its sourcetype.

jotne · ‎08-04-2023

I do not think there is an easy solution for this. What you can do is to examine your data and set field alias so that thy will respond to same field name

yuanliu · ‎08-04-2023

You may want to clarify the conditions and logic more, maybe illustrate your data and desired output, too.

For example, given that apple and orange are different sourcetypes, you could have data like this:

dip	sip	sourceip	sourcetype
		12.23.34.45	orange
		23.34.45.56	orange
12.23.34.45	12.23.34.45		apple
23.34.45.56	45.56.67.89		apple

Here, dip contains all values in sourceip, while sip contains partial overlap. What is your intention with this dataset?

Here is an emulation for this dataset. You may want to play with it and compare with your real data, and give us a realistic illustration

| makeresults
| eval sourcetype="orange", sourceip = mvappend("12.23.34.45", "23.34.45.56")
| mvexpand sourceip
| append 
    [makeresults
    | eval sourcetype="apple", sip = mvappend(json_object("sip", "12.23.34.45", "dip", "12.23.34.45"), json_object("sip", "45.56.67.89", "dip", "23.34.45.56"))
    | mvexpand sip
    | fromjson sip]
``` the above emulates data, only runs in Splunk 8 or later ```

automayt · ‎08-04-2023

Thanks for the ideas so far. I will clarify more. Lets say I have the following data that I am certain represents the same communication (orange is a device that sees traffic and apple is another device, I have confirmed that it is the identical communication, but they log differently).

destip	vlan	dip	sip	sourceip	sourcetype
5.6.7.8				12.23.34.45	orange
5.22.66.11				45.56.67.89	orange
	111	5.6.7.8	12.23.34.45		apple
	111	5.22.66.11	45.56.67.89		apple

I can look at this and say "oh, they are the same traffic but it looks like orange and apple use a different taxonomy! I will combine sourceip and sip into src_ip. I will combine destip and dip into dest_ip!". I want to do this programatically without having to manually identify the common fields. Using JSON or something, I could transpose the values and the fields and say "show me the listing of the fields BY the values. Programmatically my result would look something akin to the following with a transpose;

45.56.67.89

12.23.34.45

5.22.66.11

5.6.7.8

111

sourceip

sip

sourceip

sip

destip

dip

destip

dip

vlan

orange

apple

orange

apple

orange

apple

orange

apple

Admittedly that wouldn't be the prettiest version of the result, but it would get the job done. Any version of that result (one that says "here are redundant field names from semi-redundant sources) that lists the fields and their sourcetypes) would work.

VatsalJagani · ‎08-04-2023

@automayt - You can try the fieldsummary command to get summary of fields you are getting including common values.

- https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/fieldsummary

And you could run it over multiple sourcetypes with map command.

| makeresults | eval sourcetype="abc"
| append [| makeresults | eval sourcetype="xyz"]

| map search="index=* sourcetype=$sourcetype|s$ | fieldsummary | eval sourcetype=$sourcetype|s$"

This is just a starting point, and I hope you get to take it to the next level.

How to discover fields that have the same values?

fields

How to Monitor Google Kubernetes Engine (GKE)

Index This | How can you make 45 using only 4?

Splunk Education Goes to Washington | Splunk GovSummit 2024