We have a bug in our software that is spamming out identical log messages (different timestamps) - when it's only supposed to log changes in value. The application is reading a field device (Modbus protocol) and parsing the message into registers - each register has 16 discrete "bits". The application takes the "bits" and separates them into a single log message (one entry per bit). The problem is, if only a single bit changes, it is logging the entire register (16 bits worth) - what it's suppose to do is log only the bit that changed from it's previous state. I'm trying to design a Splunk filter so our support guys can use that until the bug-fix is deployed.
For example:
timestamp 40010/1 = 1 (message)
timestamp 40010/2 = 0 (message)
timestamp 40010/3 = 1 (message)
timestamp 40010/4 = 1 (message)
timestamp 40010/5 = 0 (message)
timestamp 40010/6 = 0 (message)
timestamp 40010/7 = 0 (message)
timestamp 40010/8 = 0 (message)
...
timestamp 40010/16 = 0 (message)
on the next scan, let's say bit #3 changes to '1':
timestamp+1 40010/1 = 1 (message)
timestamp+1 40010/2 = 0 (message)
timestamp+1 40010/3 = 1 (message)
timestamp+1 40010/4 = 1 (message)
timestamp+1 40010/5 = 0 (message)
timestamp+1 40010/6 = 0 (message)
timestamp+1 40010/7 = 0 (message)
timestamp+1 40010/8 = 0 (message)
...
timestamp+1 40010/16 = 0 (message)
As you can see, this makes the logs super verbose, and terrible to troubleshoot. I'm trying to find a way to strip out all the "same values" - which still retaining the timestamps (and if possible, full original "_raw" message). I can't use "dedup" as it would strip out future "toggles" (I need all "changes", not just the first one... if bit #3 goes back to '0', I need the timestamp when that occurs).
So in my previous example I would get only:
timestamp+1 40010/1 = 1 (message)
timestamp+1 40010/2 = 0 (message)
timestamp+1 40010/3 = 1 (message)
timestamp+1 40010/4 = 1 (message)
timestamp+1 40010/5 = 0 (message)
timestamp+1 40010/6 = 0 (message)
timestamp+1 40010/7 = 0 (message)
timestamp+1 40010/8 = 0 (message)
...
timestamp+1 40010/16 = 0 (message)
next scan:
timestamp+1 40010/3 = 1 (message)
future scan (where bit #3 changes back to '0'):
timestamp+10 40010/3 = 0 (message)
future scan (where bit #5 changes to '1'):
timestamp+20 40010/5 = 1 (message)
etc.
Starting with an initial "baseline", I just want to see the changes in that register - with the newer timestamp of when that occurred.
Compounding the problem is there are 50+ different registers, all logging every scan (5 seconds) - regardless of if the "value" of the register changed.
Not an answer because it's not quite an answer yet, more testing.
If you have the field "register" with values like 40010/3 and "value" for the 0/1 values, then how far does something like the below get you?
myBaseSearch ...
| stats count list(value) as ValueList, list(_time) as ReportedTime by register.
I have a suspicion this, or something like it, will be the base of the answer.
Also try
myBaseSearch ...
| stats count last(value) as LatestValue by register.
Though it might be first(value)
because I can never remember which way around that is.
Last, if you need to rex out those, add this line in (assuming your registers are 5 digits exactly followed by a forward slash and one more digit, and assuming I don't know your timestamp format so I can't make this easier):
myBaseSearch ...
| rex "(?<register>\d{5}/\d)\s*=\s*(?<value>\d+)"
| stats ...
Filtering to a single (known register... 40015)
First one:
count = 3625
ValueList is 99.9% zero
ReportedTime - looks like Epoc timestamps...?
base query... | rex "ID: (?<REG_w_BIT>\d{5}\/\d+)" | rex "Summary: \w+ (?<BIT_VALUE>\d+)" | stats count list(BIT_VALUE) as ValueList, list(_time) as ReportedTime by REG_w_BIT
Second one:
count=3625 (identical id "first(value)" is used)
LatestValue=0
Third:
Yup, I'be got regex for
* the full message (sans timestamp)
* the "register & bit pair" (the part I'm trying to filter by)
* and most recently, the value of the register & pair
Sample "full messages":
40015/13 Summary: CLEAR 0 : Words, text, description of what bit #13 is
40015/13 Summary: SET 1 : Words, text, description of what bit #13 is
autoregress maybe...?
I found this answer - which is looking for a count of seqential values:
https://answers.splunk.com/answers/396654/is-there-a-way-to-count-the-series-of-consecutive.html
I tried following it directly, but my issue is my values are never going to be exactly sequential (the "next" value will never match)... unless I sort them first... duh!
(edit) - I have to sort by both the "host" and the "Register & bit", before sorting by time... otherwise it's no different than a dedup...
... | regex to strip timestamp and "junk" from beginning of log into new MESSAGE field | rex "ID: (?<REGISTER_w_BIT>\d{5}\/\d+)" | sort host, REGISTER_w_BIT, _time |autoregress MESSAGE | eval sameAsNext=if(MESSAGE=MESSAGE_p1,1,0) | search sameAsNext = 0 | sort _time
This is "kind of" working... if I pre-filter by a single register (40010), I can see all the transitions if it... but as soon as I try a generic search, those ones disappear... which makes no sense, as they are valid right up to before "search sameAsNext=0" (and the value of sameAsNext is zero).
I think there's a memory limitation to 'autoregress'... if I limit the amount of results (either by time range, or more complex "base" filter - the results exactly match my expectations.
But if I try to run a broad query (say 7 days... ~ 150k events) it starts "stripping out" known good values... entire registers worth in fact. Very odd.