Splunk Search

how to only show events with certain like values from stats command (or should my approach change?)

johnmvang
Path Finder

Hello All,

The business process is that every day a file will come and the name of the file will always change, however there is a chance on a new day the vendor pushes out the same 'duplicate' data which will come in as a new file name. I'm trying to find duplicates data from a source that is always going to be different. This data could come in the next day, or 5 days later or even a month later. I want to build a process to catch this.

This is what i have:

index=e2e sourcetype=e2e_st source=/opt/splunksa/spl-pkg/scripts/data/201604*161726151_X12.dat
| stats count as Event_Count by source
| append
[ search index=e2e sourcetype=e2e_st
source=/opt/splunksa/spl-pkg/scripts/data/201604*161726151_X12.dat ]
| stats values(deposit_date) AS Deposit_Date,sum(payment_amount) AS Payment_Amount, list(Event_Count) AS Event_Count by source

the result i get are below:

No.# source, Deposit_Date, Payment_Amount, Event_Count
1- /opt/splunksa/spl-pkg/scripts/data/20160427161726151_X12.dat, 20160427, 296935.64, 6
2- /opt/splunksa/spl-pkg/scripts/data/20160428161726151_X12.dat, 20160427, 296936.64, 6
3- /opt/splunksa/spl-pkg/scripts/data/20160429161726151_X12.dat, 20160427, 296935.64, 6
4- /opt/splunksa/spl-pkg/scripts/data/20160430161726151_X12.dat, 20160430, 296929.64, 6

However, since the source will always have a different file name these items are all 'unique' and i cannot show the duplicate entries them AND show the source names. I need to show that the file from line#1 is the same as line#3 and i want to only show those two lines PER the matching Deposit_Date and Payment_Amount and Event_Count with their corresponding source since it has the date in the filename.

The alternative search i came up with is below:

index=e2e sourcetype=e2e_st source=/opt/splunksa/spl-pkg/scripts/data/201604*161726151_X12.dat
| stats count as Event_Count by source
| append
[ search index=e2e sourcetype=e2e_st
source=/opt/splunksa/spl-pkg/scripts/data/201604*161726151_X12.dat]

| stats values(deposit_date) AS
Deposite_Date,sum(payment_amount) AS Payment_Amount, list(Event_Count) AS Event_Count by source
| top limit=5 Payment_Amount,Event_Count by Deposit_Date
| where count > 1
| fields - percent

my results below for the alternative search:

Line No.# Deposit_Date, Payment_Amount, Event_Count, count
1- 20160427, 296935.64, 6, 2

The results show there is only one line item, with the deposit date of of 20160427 which has the duplicate deposit amount of 296935.64 and the count '2' showing it is a duplicate.

This doesn't show the source names so our support personnel will not know what files are conflicting.

Any help is appreciated.

Thanks,

John

0 Karma
1 Solution

DalJeanis
SplunkTrust
SplunkTrust

This gets you any fully duplicated files.

index=e2e sourcetype=e2e_st source=/opt/splunksa/spl-pkg/scripts/data/201604*161726151_X12.dat
| stats count as event_count sum(payment_amount) as payment_amount max(deposit_date) as deposit_date by source
| stats count as source_count, values(source) as source by event_count payment_amount deposit_date
| where source_count > 1

Once the above appears to be working, you can add an additional check on the detailed transactions in the file like this. This grabs the first 5 records from each file -- probably the LAST five records, knowing splunk's preferences for the newest stuff -- and then compares them the same way.

| map  search="search index=e2e sourcetype=e2e_st source=$source$ | head 5" maxsearches=10
| stats count values(source) as source by account_number, deposit_date, payment_amount, and whatever other distinguishing fields there are
| sort 0 account_number, deposit_date

View solution in original post

0 Karma

DalJeanis
SplunkTrust
SplunkTrust

This gets you any fully duplicated files.

index=e2e sourcetype=e2e_st source=/opt/splunksa/spl-pkg/scripts/data/201604*161726151_X12.dat
| stats count as event_count sum(payment_amount) as payment_amount max(deposit_date) as deposit_date by source
| stats count as source_count, values(source) as source by event_count payment_amount deposit_date
| where source_count > 1

Once the above appears to be working, you can add an additional check on the detailed transactions in the file like this. This grabs the first 5 records from each file -- probably the LAST five records, knowing splunk's preferences for the newest stuff -- and then compares them the same way.

| map  search="search index=e2e sourcetype=e2e_st source=$source$ | head 5" maxsearches=10
| stats count values(source) as source by account_number, deposit_date, payment_amount, and whatever other distinguishing fields there are
| sort 0 account_number, deposit_date
0 Karma

johnmvang
Path Finder

This is it. I will analyze your search to review your approach. Thanks again for your help.

0 Karma

DalJeanis
SplunkTrust
SplunkTrust

You were pretty much there, just a little bit overcomplicating your life by joining searches. That second piece is just a triple-check on the details.

I had to do this in a mainframe shop many moons ago. My suggestion is to have the transmission arrive in a staging area, and have a scheduled job compare its byte-count to the sizes of the last 30-60 prior files and move it either to the splunk input area or a holding area based on whether it's different.

0 Karma
Get Updates on the Splunk Community!

Welcome to the Splunk Community!

(view in My Videos) We're so glad you're here! The Splunk Community is place to connect, learn, give back, and ...

Tech Talk | Elevating Digital Service Excellence: The Synergy of Splunk RUM & APM

Elevating Digital Service Excellence: The Synergy of Real User Monitoring and Application Performance ...

Adoption of RUM and APM at Splunk

    Unleash the power of Splunk Observability   Watch Now In this can't miss Tech Talk! The Splunk Growth ...