Hello All,
I need to find from particular source how many we have duplicate files in last 7 days.
I have used this method to indexed duplicate files in Splunk.
here the definition of duplicate file is first line and last line of file is matches to second file first and last line then its called as duplicate.
I can able to achieve the duplicate files if will matches only first line or last line as below
index=main sourcetype=sampledata Header* | eventstats count by _raw | where count>1 | table source, _raw
In this query its giving me the result of files which has the same header. and in below query am getting the result of files where "Trailer" 0r last line of the file is common
index=main sourcetype=sampledata TRL* | eventstats count by _raw | where count>1 | table source, _raw 
So here we need to compare two query and pick the result where we have header and trailer common.
Can any one please help me on this.
Thanks in Adavance
 
		
		
		
		
		
	
			
		
		
			
					
		This will do it more efficiently, and will even work if there ever are more than one match for Header* or TRL* in a given file (fairly easy to imagine that this could happen sometimes).
index=main sourcetype=sampledata (Header* OR TRL*) | stats earliest(_raw) as first latest(_raw) as last by source | stats dc(source) as fileCount values(source) as files by first last | sort - fileCount | where fileCount>1
Breaking it down - we get the events off disk, and we try and only get Header* OR TRL* to avoid getting intermediate events that are of no use to us.  The parens are unnecessary here, but I often like them for clarity. 
The stats command will take the earliest line and the latest line for each source. NOTE - if events near the start and/or end end up with the same _time value, you'll have a problem here, and we'll need to some additional matching on the Header* and TRL*. Here I've assumed all events near start and end get a different timestamp and there's no ambiguity.
The next stats command now just counts up the number of source (files) it's seen per row dc(source) as fileCount,  the actual values of the paths values(source) as files, and it does it for every unique combination of first and last by first last
the rest is just sorting and filtering to the ones that are actually duplicates.
 
		
		
		
		
		
	
			
		
		
			
					
		This will do it more efficiently, and will even work if there ever are more than one match for Header* or TRL* in a given file (fairly easy to imagine that this could happen sometimes).
index=main sourcetype=sampledata (Header* OR TRL*) | stats earliest(_raw) as first latest(_raw) as last by source | stats dc(source) as fileCount values(source) as files by first last | sort - fileCount | where fileCount>1
Breaking it down - we get the events off disk, and we try and only get Header* OR TRL* to avoid getting intermediate events that are of no use to us.  The parens are unnecessary here, but I often like them for clarity. 
The stats command will take the earliest line and the latest line for each source. NOTE - if events near the start and/or end end up with the same _time value, you'll have a problem here, and we'll need to some additional matching on the Header* and TRL*. Here I've assumed all events near start and end get a different timestamp and there's no ambiguity.
The next stats command now just counts up the number of source (files) it's seen per row dc(source) as fileCount,  the actual values of the paths values(source) as files, and it does it for every unique combination of first and last by first last
the rest is just sorting and filtering to the ones that are actually duplicates.
Thank you, its working for me. and good explanation as well. once again thank you so much!!
 
					
				
		
Like this:
index=main sourcetype=sampledata Header* OR TRL*
| eval MyType=if(searchmatch(Header*), "HDR", "TRL")
| stats first(HDR) AS HDR first(TRL) AS TRL BY source
| stats values(source) count BY HDR TRL
| search count > 1
 
					
				
		
Try this:
index=main sourcetype=sampledata "TRL*" OR "Header*"
| EVAL mytype=CASE(MATCH(_raw,"TRL"), "TRAILER",MATCH( _raw,"Header"), "HEADER")
| chart count by source, mytype
| SEARCH TRAILER>1 AND HEADER>1
It should show you every source with duplicated "TRL" and "Header"
Hello Haley,
Its not displaying any result.. but i remove the eval command then events are coming
 
					
				
		
Oh, it seems that you need to convert your _raw to upper case like this:
index=main sourcetype=sampledata "TRL*" OR "Header*"
| EVAL mytype=CASE(MATCH(upper(_raw),"TRL"), "TRAILER",MATCH( upper(_raw),"HEADER"), "HEADER")
| stats count by source, mytype, _raw | WHERE count>1
| CHART first(_raw) by source, mytype
You should be able to see the duplicated _raw in header and trailer for each source
 
		
		
		
		
		
	
			
		
		
			
					
		Please check this one -
 index=main sourcetype=sampledata Header* | eventstats count by _raw | where count>1 | table source, _raw
| append [index=main sourcetype=sampledata TRL* | eventstats count by _raw | where count>1 | table source, _raw]
Thanks and regards,
Sekar
Hello Sekar,
Thanks for response, i have updated my queries, please check and let me know on this.
Hello Sekar,
Here its resulting header matched files, ( eg: its has file which has header same but different trailer)
