Solved: how to only show events with certain like values f...

johnmvang · ‎03-03-2017

Hello All,

The business process is that every day a file will come and the name of the file will always change, however there is a chance on a new day the vendor pushes out the same 'duplicate' data which will come in as a new file name. I'm trying to find duplicates data from a source that is always going to be different. This data could come in the next day, or 5 days later or even a month later. I want to build a process to catch this.

This is what i have:

index=e2e sourcetype=e2e_st source=/opt/splunksa/spl-pkg/scripts/data/201604*161726151_X12.dat
| stats count as Event_Count by source
| append
[ search index=e2e sourcetype=e2e_st
source=/opt/splunksa/spl-pkg/scripts/data/201604*161726151_X12.dat ]
| stats values(deposit_date) AS Deposit_Date,sum(payment_amount) AS Payment_Amount, list(Event_Count) AS Event_Count by source

the result i get are below:

No.# source, Deposit_Date, Payment_Amount, Event_Count
1- /opt/splunksa/spl-pkg/scripts/data/20160427161726151_X12.dat, 20160427, 296935.64, 6
2- /opt/splunksa/spl-pkg/scripts/data/20160428161726151_X12.dat, 20160427, 296936.64, 6
3- /opt/splunksa/spl-pkg/scripts/data/20160429161726151_X12.dat, 20160427, 296935.64, 6
4- /opt/splunksa/spl-pkg/scripts/data/20160430161726151_X12.dat, 20160430, 296929.64, 6

However, since the source will always have a different file name these items are all 'unique' and i cannot show the duplicate entries them AND show the source names. I need to show that the file from line#1 is the same as line#3 and i want to only show those two lines PER the matching Deposit_Date and Payment_Amount and Event_Count with their corresponding source since it has the date in the filename.

The alternative search i came up with is below:

index=e2e sourcetype=e2e_st source=/opt/splunksa/spl-pkg/scripts/data/201604*161726151_X12.dat
| stats count as Event_Count by source
| append
[ search index=e2e sourcetype=e2e_st
source=/opt/splunksa/spl-pkg/scripts/data/201604*161726151_X12.dat]

| stats values(deposit_date) AS
Deposite_Date,sum(payment_amount) AS Payment_Amount, list(Event_Count) AS Event_Count by source
| top limit=5 Payment_Amount,Event_Count by Deposit_Date
| where count > 1
| fields - percent

my results below for the alternative search:

Line No.# Deposit_Date, Payment_Amount, Event_Count, count
1- 20160427, 296935.64, 6, 2

The results show there is only one line item, with the deposit date of of 20160427 which has the duplicate deposit amount of 296935.64 and the count '2' showing it is a duplicate.

This doesn't show the source names so our support personnel will not know what files are conflicting.

Any help is appreciated.

Thanks,

John

DalJeanis · ‎03-03-2017

This gets you any fully duplicated files.

index=e2e sourcetype=e2e_st source=/opt/splunksa/spl-pkg/scripts/data/201604*161726151_X12.dat
| stats count as event_count sum(payment_amount) as payment_amount max(deposit_date) as deposit_date by source
| stats count as source_count, values(source) as source by event_count payment_amount deposit_date
| where source_count > 1

Once the above appears to be working, you can add an additional check on the detailed transactions in the file like this. This grabs the first 5 records from each file -- probably the LAST five records, knowing splunk's preferences for the newest stuff -- and then compares them the same way.

| map  search="search index=e2e sourcetype=e2e_st source=$source$ | head 5" maxsearches=10
| stats count values(source) as source by account_number, deposit_date, payment_amount, and whatever other distinguishing fields there are
| sort 0 account_number, deposit_date

View solution in original post

DalJeanis · ‎03-03-2017

This gets you any fully duplicated files.

index=e2e sourcetype=e2e_st source=/opt/splunksa/spl-pkg/scripts/data/201604*161726151_X12.dat
| stats count as event_count sum(payment_amount) as payment_amount max(deposit_date) as deposit_date by source
| stats count as source_count, values(source) as source by event_count payment_amount deposit_date
| where source_count > 1

Once the above appears to be working, you can add an additional check on the detailed transactions in the file like this. This grabs the first 5 records from each file -- probably the LAST five records, knowing splunk's preferences for the newest stuff -- and then compares them the same way.

| map  search="search index=e2e sourcetype=e2e_st source=$source$ | head 5" maxsearches=10
| stats count values(source) as source by account_number, deposit_date, payment_amount, and whatever other distinguishing fields there are
| sort 0 account_number, deposit_date

johnmvang · ‎03-03-2017

This is it. I will analyze your search to review your approach. Thanks again for your help.

DalJeanis · ‎03-03-2017

You were pretty much there, just a little bit overcomplicating your life by joining searches. That second piece is just a triple-check on the details.

I had to do this in a mainframe shop many moons ago. My suggestion is to have the transmission arrive in a staging area, and have a scheduled job compare its byte-count to the sizes of the last 30-60 prior files and move it either to the splunk input area or a holding area based on whether it's different.

how to only show events with certain like values from stats command (or should my approach change?)

Exciting News: The AppDynamics Community Joins Splunk!

The All New Performance Insights for Splunk

Good Sourcetype Naming

Are you a member of the Splunk Community?

how to only show events with certain like values from stats command (or should my approach change?)

Exciting News: The AppDynamics Community Joins Splunk!

The All New Performance Insights for Splunk

Good Sourcetype Naming