Splunk Search

What is the Optimal Solution for Creating a Match of Values from Two Data Sets

_gkollias
SplunkTrust
SplunkTrust

I have two results from two separate searches that give me a list of invoices that came in to our systems and a list of invoices that left our systems.

I want to create a match between the lists to show how many of the invoices came in and went out.

So far what I have done is exported the results using outputcsv due to the results being greater than 10,000.

Any insight on the best way to create a match using these results would be greatly appreciated.

Thanks in Advance.

Tags (3)
1 Solution

_gkollias
SplunkTrust
SplunkTrust

Thank you all for your responses! Rather than use the two data sets I exported from Splunk, I did this:

index=contract_gateway sourcetype=esb_audit bp_bp_name=InvoiceAudit SourceSystem=ESB OR SourceSystem=ESB2 OR SourceSystem=CTG 
| append 
[ 
search index=gentran sourcetype=gentran_audit Invoice=* source="source for 20150415" | rex max_match=0 "(?im)^(?:[^:\n]*:){2}(?P<InvoiceNumber>[^,]+)" | table _time, InvoiceNumber | mvexpand InvoiceNumber | eval SourceSystem="ViperInbound"
] 
| stats latest(Amount) as Amount values(SourceSystem) as SourceSystem_values, latest(status) as exit_status by InvoiceNumber 
| dedup InvoiceNumber 
| outputcsv viper_ctg_invoice_list_4152015

What I am doing essentially is appending the data from the ESB2 system (this is new and not included in what I initially wanted above, but an additional requirement afterwards) to the original search of searching all of SourceSystems to correlate the invoices that came in and went out. I also added in the exit_statsu to determine whether it was actually a SUCCESS or some sort of failure.

Thank you all again!
-kolgr001

View solution in original post

0 Karma

_gkollias
SplunkTrust
SplunkTrust

Thank you all for your responses! Rather than use the two data sets I exported from Splunk, I did this:

index=contract_gateway sourcetype=esb_audit bp_bp_name=InvoiceAudit SourceSystem=ESB OR SourceSystem=ESB2 OR SourceSystem=CTG 
| append 
[ 
search index=gentran sourcetype=gentran_audit Invoice=* source="source for 20150415" | rex max_match=0 "(?im)^(?:[^:\n]*:){2}(?P<InvoiceNumber>[^,]+)" | table _time, InvoiceNumber | mvexpand InvoiceNumber | eval SourceSystem="ViperInbound"
] 
| stats latest(Amount) as Amount values(SourceSystem) as SourceSystem_values, latest(status) as exit_status by InvoiceNumber 
| dedup InvoiceNumber 
| outputcsv viper_ctg_invoice_list_4152015

What I am doing essentially is appending the data from the ESB2 system (this is new and not included in what I initially wanted above, but an additional requirement afterwards) to the original search of searching all of SourceSystems to correlate the invoices that came in and went out. I also added in the exit_statsu to determine whether it was actually a SUCCESS or some sort of failure.

Thank you all again!
-kolgr001

View solution in original post

0 Karma

stephane_cyrill
Builder

Hi I think an other way to do you matching is:

1-after building you first search as you did,

2-you pipe it and do an evaluation with if(match(...).......) like this:

3- your_first_search |eval inbound_equal_outbound=if(match(InvoiceNumber,your_regex),InvoiceNumber, "") |table inbound_equal_outbound

4- such a procedure takes a value from first search result and if the same value appears in the second search result you put it in a new field inbound_equal_outbound.

0 Karma

chimell
Motivator

Hi KolGr001
As the difference between your searches is SourceSystem, i advise you to use the search code below

|set union [search Inbound][search Outbound]|table InvoiceNumber, Amount, SourceSystem|outputcsv inbound_invoice_list_4152015.csv

Use it like this

 |set union [ search index=contract_gateway sourcetype=esb_audit bp_bp_name=InvoiceAudit SourceSystem=ESB OR SourceSystem=ESB2 earliest=-1d@d | stats latest(Amount) as Amount, latest(InvoiceNumber) as InvoiceNumber, latest(SourceSystem) as SourceSystem by bp_context_id | sort 0 SourceSystem,InvoiceNumber | dedup InvoiceNumber | fields InvoiceNumber, Amount, SourceSystem ] [ search index=contract_gateway sourcetype=esb_audit bp_bp_name=InvoiceAudit SourceSystem=CTG earliest=-1d@d | stats latest(Amount) as Amount, latest(InvoiceNumber) as InvoiceNumber, latest(SourceSystem) as SourceSystem by bp_context_id | sort 0 SourceSystem,InvoiceNumber | dedup InvoiceNumber | fields InvoiceNumber, Amount, SourceSystem ] | table InvoiceNumber, Amount, SourceSystem | outputcsv   inbound_invoice_list_4152015.csv
0 Karma

Runals
Motivator

You latest search is something similar that I was going to recommend as it seems, at a high level, the difference between your original two queries was the SourceSystem. The use case itself is pretty interesting and I could see a whole dashboard with multiple panels showing the number of new invoices hitting the system that haven't been paid, total money coming in and out, a drop down or text box where people can choose an invoice and see the history etc. All that said where I might likely start would be something like this

index=contract_gateway sourcetype=esb_audit bp_bp_name=InvoiceAudit SourceSystem=ESB OR SourceSystem=ESB2 OR SourceSystem=CTG | eval InvoiceState = case(SourceSystem="ESB" OR SourceSystem="ESB2", "inbound", SourceSystem="CTG", "outbound", 1=1, "Undefined SourceSystem") | stats latest(Amount) as Amount by InvoiceNumber InvoiceState

You would probably want to change the InvoiceState values to something more business oriented like paid (collected?), pending, etc. The 1=1 bit is because if somehow a SourceSystem event came in not covered by the first couple items in the case statement - maybe someone monkeyed with your query - the stats command wouldn't choke on there not being a value and you would know where to fix it (I normally use "fixme" =).

Of course then you could do all sorts of cool things with the data like figuring out how long an invoice was in the system (you might have to play with the where statement)

index=contract_gateway sourcetype=esb_audit bp_bp_name=InvoiceAudit SourceSystem=ESB OR SourceSystem=ESB2 OR SourceSystem=CTG | eval InvoiceState = case(SourceSystem="ESB" OR SourceSystem="ESB2", "inbound", SourceSystem="CTG", "outbound", 1=1, "Undefined SourceSystem") | transaction InvoiceNumber InvoiceState | where InvoiceState="inbound" and InvoiceState="outbound" | stats avg(duration) as avg_duration

If you want to do the above but over a longer period of time for trending you could do something like. I've moved away from transaction as depending on factors like data volume, hardware, general Splunk performance transaction can be a more 'expensive' command.

index=contract_gateway sourcetype=esb_audit bp_bp_name=InvoiceAudit SourceSystem=ESB OR SourceSystem=ESB2 OR SourceSystem=CTG | eval InvoiceState = case(SourceSystem="ESB" OR SourceSystem="ESB2", "inbound", SourceSystem="CTG", "outbound", 1=1, "Undefined SourceSystem") | stats min(_time) as entered max(_time) as left by InvoiceNumber | eval delta = left - entered | table delta left | timechart avg(delta) as "Avg Time in System"

Which invoices have come in but not gone out sorted by the ones that have been in the longest

index=contract_gateway sourcetype=esb_audit bp_bp_name=InvoiceAudit SourceSystem=ESB OR SourceSystem=ESB2 OR SourceSystem=CTG | eval InvoiceState = case(SourceSystem="ESB" OR SourceSystem="ESB2", "inbound", SourceSystem="CTG", "outbound", 1=1, "Undefined SourceSystem") | stats min(_time) as entered values(InvoiceState) as InvoiceState by InvoiceNumber | where NOT like(InvoiceState,"%outbound") | table InvoiceNumber entered | sort entered

Sorry to go off on several tangents. I saw your post come through, kicked it over to my college student workers, and we hashed out a couple scenarios 😃

vganjare
Builder

Hi,

You can try using multisearch command. Have different set of field names for input and output. Finally, take the stats for all the fields by bp_context_id. Something like:

| multisearch [search input | rename field names] [search output | rename field names] | stats latest(inputFields) as inputFields, latest(outputFields) as outputFields by bp_context_id 

Thanks !!

0 Karma

_gkollias
SplunkTrust
SplunkTrust

What about something like this?

| multisearch [search index=contract_gateway sourcetype=esb_audit bp_bp_name=InvoiceAudit SourceSystem=ESB OR SourceSystem=ESB2] [search index=contract_gateway sourcetype=esb_audit bp_bp_name=InvoiceAudit SourceSystem=CTG] | stats latest(Amount) as Amount values(SourceSystem) as SourceSystem_values by InvoiceNumber

where I would get the invoice number, dollar amount, and which systems it came in and out from? Would this provide accurate results?

0 Karma

_gkollias
SplunkTrust
SplunkTrust

actually this would be better since both base searches are pretty similar:

index=contract_gateway sourcetype=esb_audit bp_bp_name=InvoiceAudit SourceSystem=ESB OR SourceSystem=ESB2 OR SourceSystem=CTG | stats latest(Amount) as Amount values(SourceSystem) as SourceSystem_values by InvoiceNumber
0 Karma

somesoni2
Revered Legend

Can you share your current searches (remove sensitive portions)? We might be able to live with just merging both the searches into one and using stats to get the information you need.

_gkollias
SplunkTrust
SplunkTrust

Sure thing! Here are the searches:

Inbound
index=contract_gateway sourcetype=esb_audit bp_bp_name=InvoiceAudit SourceSystem=ESB OR SourceSystem=ESB2 earliest=-1d@d
| stats latest(Amount) as Amount, latest(InvoiceNumber) as InvoiceNumber, latest(SourceSystem) as SourceSystem by bp_context_id
| table InvoiceNumber, Amount, SourceSystem
| sort 0 SourceSystem,InvoiceNumber
| dedup InvoiceNumber
| outputcsv inbound_invoice_list_4152015.csv

Outbound
index=contract_gateway sourcetype=esb_audit bp_bp_name=InvoiceAudit SourceSystem=CTG earliest=-1d@d
| stats latest(Amount) as Amount, latest(InvoiceNumber) as InvoiceNumber, latest(SourceSystem) as SourceSystem by bp_context_id
| table InvoiceNumber, Amount, SourceSystem
| sort 0 SourceSystem,InvoiceNumber
| dedup InvoiceNumber
| outputcsv inbound_invoice_list_4152015.csv

The SourceSystem fields are how we differentiate what comes in and what comes out. Also we group the other fields by bp_context_id as this is the unique identifier for each transaction.

Let me know if I can provide any other information - Thanks!

0 Karma
.conf21 Now Fully Virtual!
Register for FREE Today!

We've made .conf21 totally virtual and totally FREE! Our completely online experience will run from 10/19 through 10/20 with some additional events, too!