Splunk Search

find the duplicate files from particular source in splunk search query

Communicator

Hello All,

I need to find from particular source how many we have duplicate files in last 7 days.

I have used this method to indexed duplicate files in Splunk.

here the definition of duplicate file is first line and last line of file is matches to second file first and last line then its called as duplicate.

I can able to achieve the duplicate files if will matches only first line or last line as below

index=main sourcetype=sampledata Header* | eventstats count by _raw | where count>1 | table source, _raw

In this query its giving me the result of files which has the same header. and in below query am getting the result of files where "Trailer" 0r last line of the file is common

index=main sourcetype=sampledata TRL* | eventstats count by _raw | where count>1 | table source, _raw 

So here we need to compare two query and pick the result where we have header and trailer common.

Can any one please help me on this.

Thanks in Adavance

1 Solution

SplunkTrust
SplunkTrust

This will do it more efficiently, and will even work if there ever are more than one match for Header* or TRL* in a given file (fairly easy to imagine that this could happen sometimes).

index=main sourcetype=sampledata (Header* OR TRL*) | stats earliest(_raw) as first latest(_raw) as last by source | stats dc(source) as fileCount values(source) as files by first last | sort - fileCount | where fileCount>1

Breaking it down - we get the events off disk, and we try and only get Header* OR TRL* to avoid getting intermediate events that are of no use to us. The parens are unnecessary here, but I often like them for clarity.

The stats command will take the earliest line and the latest line for each source. NOTE - if events near the start and/or end end up with the same _time value, you'll have a problem here, and we'll need to some additional matching on the Header* and TRL*. Here I've assumed all events near start and end get a different timestamp and there's no ambiguity.

The next stats command now just counts up the number of source (files) it's seen per row dc(source) as fileCount, the actual values of the paths values(source) as files, and it does it for every unique combination of first and last by first last

the rest is just sorting and filtering to the ones that are actually duplicates.

View solution in original post

SplunkTrust
SplunkTrust

This will do it more efficiently, and will even work if there ever are more than one match for Header* or TRL* in a given file (fairly easy to imagine that this could happen sometimes).

index=main sourcetype=sampledata (Header* OR TRL*) | stats earliest(_raw) as first latest(_raw) as last by source | stats dc(source) as fileCount values(source) as files by first last | sort - fileCount | where fileCount>1

Breaking it down - we get the events off disk, and we try and only get Header* OR TRL* to avoid getting intermediate events that are of no use to us. The parens are unnecessary here, but I often like them for clarity.

The stats command will take the earliest line and the latest line for each source. NOTE - if events near the start and/or end end up with the same _time value, you'll have a problem here, and we'll need to some additional matching on the Header* and TRL*. Here I've assumed all events near start and end get a different timestamp and there's no ambiguity.

The next stats command now just counts up the number of source (files) it's seen per row dc(source) as fileCount, the actual values of the paths values(source) as files, and it does it for every unique combination of first and last by first last

the rest is just sorting and filtering to the ones that are actually duplicates.

View solution in original post

Communicator

Thank you, its working for me. and good explanation as well. once again thank you so much!!

0 Karma

Esteemed Legend

Like this:

index=main sourcetype=sampledata Header* OR TRL*
| eval MyType=if(searchmatch(Header*), "HDR", "TRL")
| stats first(HDR) AS HDR first(TRL) AS TRL BY source
| stats values(source) count BY HDR TRL
| search count > 1

Path Finder

Try this:

index=main sourcetype=sampledata "TRL*" OR "Header*"
| EVAL mytype=CASE(MATCH(_raw,"TRL"), "TRAILER",MATCH( _raw,"Header"), "HEADER")
| chart count by source, mytype
| SEARCH TRAILER>1 AND HEADER>1

It should show you every source with duplicated "TRL" and "Header"

Communicator

Hello Haley,

Its not displaying any result.. but i remove the eval command then events are coming

0 Karma

Path Finder

Oh, it seems that you need to convert your _raw to upper case like this:

index=main sourcetype=sampledata "TRL*" OR "Header*"
| EVAL mytype=CASE(MATCH(upper(_raw),"TRL"), "TRAILER",MATCH( upper(_raw),"HEADER"), "HEADER")
| stats count by source, mytype, _raw | WHERE count>1
| CHART first(_raw) by source, mytype

You should be able to see the duplicated _raw in header and trailer for each source

0 Karma

Champion

Please check this one -

 index=main sourcetype=sampledata Header* | eventstats count by _raw | where count>1 | table source, _raw
| append [index=main sourcetype=sampledata TRL* | eventstats count by _raw | where count>1 | table source, _raw]

Thanks and regards,
Sekar

Communicator

Hello Sekar,

Thanks for response, i have updated my queries, please check and let me know on this.

0 Karma

Communicator

Hello Sekar,

Here its resulting header matched files, ( eg: its has file which has header same but different trailer)

0 Karma