Splunk Search

Trying to identify duplicate logs

Path Finder

I've found some logs in our Splunk environment that appear to be duplicates (they differ only by their srcip field, which means one copy is coming directly from a client while the other comes through a syslog server). So far, the only way I've found to determine whether the entries are actually duplicates is to export the results into separate files based on srcip, remove the srcip field, and diff the resulting files. I'd really like to do this comparison within Splunk, but I haven't managed to so far. Does anyone have ideas about how to do this?

EDIT: Here's an example of what I'm dealing with (redacting some stuff, of course).

Aug 19 09:34:36 A.B.C.D srcip=A.B.C.D fac=authpriv pri=notice sudo:      USER : TTY=pts/8 ; PWD=/var/log ; USER=root ; COMMAND=/bin/grep ssh messages
Aug 19 09:34:36 A.B.C.D srcip=W.X.Y.Z fac=authpriv pri=notice sudo:      USER : TTY=pts/8 ; PWD=/var/log ; USER=root ; COMMAND=/bin/grep ssh messages

These are clearly the same event, but the log reaches Splunk from both A.B.C.D (the client) and W.X.Y.Z (a syslog server).

I initially hypothesized that everything with facility authpriv was being duplicated, but that doesn't seem to be the case, or at least I haven't been able to verify it.

So, again, what I'm looking for is a way to find events like this. A plain diff won't work because the events differ slightly, and I need to find all of our duplicates so I can take steps to eliminate the second instance of each log.

1 Solution

Splunk Employee

I see. Then this might do it:

... | rex "^(?<text1>.*?srcip=)(?<srcip>\S+)(?<text2>.*)" | eval text=text1.text2 | stats count(srcip) as c values(srcip) by text | where c>1
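
Building on that, a sketch of how one might suppress the duplicates rather than just count them, once the search above confirms where they occur. It reuses the same hypothetical text1/text2/text field names; dedup keeps the first event for each distinct value of text, so each duplicated pair collapses to one copy:

... | rex "^(?<text1>.*?srcip=)(?<srcip>\S+)(?<text2>.*)" | eval text=text1.text2 | dedup text | fields - text text1 text2

Note that this only hides the duplicates at search time; the second copy is still being indexed until the syslog forwarding path is fixed.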


Super Champion

The transaction approach can work, but don't use maxpause=1s; use maxspan=1s instead. The difference is that maxpause limits the time between consecutive events, while maxspan=1s limits the total duration of the transaction to one second.
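
For example, a sketch of that transaction search (assuming the default host field identifies the machine; substitute whatever field ties your paired events together). Groups with more than one event are candidate duplicates:

... | transaction host maxspan=1s | where eventcount > 1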



Path Finder

I've determined that duplicate lines exist, and I'm trying to find out how many duplicates I have, or any information about them that could help me reduce them. I'm certain they are duplicates because the timestamps are identical and they log the same activity on the same machine (for example, two logs of a user su'ing to root).


Splunk Employee

I suppose I also don't understand: do the individual events have timestamps that differ by a second? I should note, too, that log lines are inherently very similar, often differing only by a field or two. So are there other fields in your data (a GUID or session ID, for example) that indicate the events are the same? If so, it seems more productive to focus on the identifying field values than on the differing ones.


Splunk Employee

I don't understand your question. Are you trying to find duplicate lines (and it sounds to me like you've already determined that there are some), or are you trying to group sets of lines together and then see whether one entire set matches another?


Path Finder

I tried piping my search to transaction with maxpause=1s, since the duplicates seem to come in at the same time. But that produced enormous transactions that didn't really alleviate the situation.