Solved: How to create field extraction from a spliced mess...

Jdtoney · ‎04-11-2023

So far I haven't been able to find an answer that quite fits my problem.

I have events for an app coming in as deliminated text, and have extracted the fields based on the deliminators. For example, my events look like this (deliminated by '@'):

@ Field1 @ Field2 @ Field3 @ Field4

I am getting these events over UDP from an old system with a proprietary OS. Out of my control, the software sending the information is splicing one of the deliminated values if the value has more than 256 characters. This value is a log message from the system. In cases where this happens, I always get two events; one event with the spliced value containing the first 256 characters, and another event with the rest of the message.

So, very simply, the events coming into Splunk look a lot like this:

@ Timestamp @ ProcessInformation.Process @ <Null> @ Message: Help. Reason: This is going to be cut out at the 256th c

@ Timestamp @ ProcessInformation.Process @ 1 @ haracter.

I've extracted fields based on the deliminators, but I also need to extract fields from the spliced message. This is making it tricky when the message is larger than 256 characters, because a field I need to extract is sometimes spliced across 2 messages.

When the value is spliced, both events contain the same timestamp exactly, to 6 digits of a second. Also, since I am extracting fields based on the deliminator, the spliced message is always extracted as the same field, whether it's the first or second part of the message.

Is there any way I can extract fields from this message across these two events, such that I get two fields:

Message: Help

Reason: This is going to be cut out at the 256th character.

Even better, is there any way I can combine these events so that I can get something like this?

@ Timestamp @ ProcessInformation.Process @ 1 @ Message: Help. Reason: This is going to be cut out at the 256th character.

Any advice is appreciated, thank you.

woodcock · ‎04-20-2023

| makeresults
| eval spliced=mvappend(
"@ Timestamp1 @ ProcessInformation.Process @ @ Message: Help. Reason: This is going to be cut out at the 256th c",
"@ Timestamp2 @ ProcessInformation.Process @ 2 @ h a tota",
"@ Timestamp2 @ ProcessInformation.Process @ 3 @ l of 5 out-o",
"@ Timestamp1 @ ProcessInformation.Process @ 1 @ haracter",
"@ Timestamp2 @ ProcessInformation.Process @ 1 @ age wit",
"@ Timestamp2 @ ProcessInformation.Process @ 4 @ f-sequence parts",
"@ Timestamp2 @ ProcessInformation.Process @ @ Message: This is a mess")
| mvexpand spliced
| rename spliced as _raw
| rex "^@ (?<timestamp>[^@]+) @ (?<ProcessId>[^@]+) @ (?<sequence>[^@]*)\s*@ (?<message>.+)"
| eval sequence = if(sequence == "", 0, sequence) ``` change to <Null> if that's literal ```
| sort 0 sequence ProcessId timestamp
| streamstats count(eval(sequence==0)) AS sessionID by ProcessId timestamp
| stats min(_time) AS _time list(message) AS message BY sessionID ProcessId timestamp
| rex field=message mode=sed "s/$/:::/"
| nomv message
| rex field=message mode=sed "s/:::\s*//g"

View solution in original post

woodcock · ‎04-19-2023

| makeresults
| eval spliced=mvappend(
"@ Timestamp @ ProcessInformation.Process @ 1 @ haracter.",
"@ Timestamp @ ProcessInformation.Process @ @ Message: Help. Reason: This is going to be cut out at the 256th c",
"@ Timestamp @ ProcessInformation.Process @ 3 @ l of 4 parts",
"@ Timestamp @ ProcessInformation.Process @ 2 @ h a tota",
"@ Timestamp @ ProcessInformation.Process @ 1 @ age wit",
"@ Timestamp @ ProcessInformation.Process @ @ Message: This is a mess")
| mvexpand spliced
| rename spliced as _raw
| reverse
| rex "^@ (?<timestamp>[^@]+) @ (?<ProcessId>[^@]+) @ (?<sequence>[^@]*) @ (?<message>.+)"
| eval sequence = if(sequence == "", 0, sequence) ``` change to <Null> if that's literal ```
| streamstats count(eval(sequence==0)) AS sessionID by ProcessId
| stats min(_time) AS _time list(message) AS message BY sessionID ProcessId
| rex field=message mode=sed "s/$/:::/"
| nomv message
| rex field=message mode=sed "s/:::\s*//g"

Jdtoney · ‎04-19-2023

Thank you all for your effort to help me solve this issue.

@yuanliu I tried using your updated solution but still could not get it to work. I was able to get _raw to appear after I used "fillnull" on the sequence field, but now _raw only includes the first part of the message, and the message field only seems to contain the first part of every message, per event. I also can't find a line that contains a second half of a message. I don't think it's grouping them together.

In any case, jumping down rabbit holes to your response led my to get something that's working ok - though it has some limitations. Biggest contributor to this search is the ""streamstats" command which I don't understand fully, but seems to evaluate values adjacent to each other, which works ok for my data.

Ideally I would like these events to be coming into Splunk fully whole, but I'm thinking that's a pipe dream.

Here's what I put together:

index=foo
| fillnull value="-01" sequence
| eval message_part1=if(sequence=="-01",message,"") 
| eval message_part2=if(sequence=="00",message,"") 
| eval message_part3=if(sequence=="01",message,"") 
| eval message_part4=if(sequence=="02",message,"")
| streamstats window=2 current=t latest(timestamp) AS last earliest(timestamp) AS first 
| transaction endswith=eval(first!=last) keeporphans=true
| strcat message_part1 message_part2 message_part3 message_part4 message
| rex field=message (...regex stuff to get my fields...)

There's four sizable, albeit tolerable, problems with this:

1. This doesn't account for messages larger than 4 parts long (256 + 256 + 256 +256 +...+n characters) so is not dynamic

2. If the most recent event is one that was spliced, then that event will not be grouped with all parts of its spliced message. I think this is has to do with streamstats.

3. (probably the most annoying problem) strcat adds a whitespace in between message parts even though I have not given a delimiter and even though there are no whitespaces in my data originally. Strcat won't take "" as a delimiter either - it must be a value.

4. There was no consideration given to processing speed . . .

It's cumbersome, but I think it'll mostly work. I worry a little bit about the full messages having random whitespaces in them, making further field extraction that more annoying, but I think it's doable.

Thanks for all your help. As always would appreciate any feedback, further instruction on making these fields easier to extract, or any miraculous guidance on getting these events ingested into Splunk fully whole.

woodcock · ‎04-20-2023

The "transaction" command will silently fail and give false results at any scale at all. You should keep trying with "streamstats".

yuanliu · ‎04-19-2023

I cannot answer what streamstats does in your search. But if you could answer the question about whether timestamp or _time can serve as unique identifier, that can advance this remote diagnosis. I cannot overemphasize the importance of constructing a unique identifier. It doesn't have be be in one single field, but you must have a method to construct a unique identifier. Think of yourself as the computer. How can you tell that a collection of events belong to the same sequence? You mentioned that the message were transmitted in UDP. This means that there is a good chance time of arrival (_time) is not sequential.

In fact, your latest search sample implies the existence of a field named "sequence" in raw events that is not in your original illustration. Why don't you start by sharing some real (anonymized) events with real Timestamp and real (anonymized) ProcessInformation.Process, real (anonymized) _time, anonymized message, and so on?

Short of realistic data, let me modify @woodcock's emulation to include an out-of-sequence message, like this

	_raw
1	@ Timestamp1 @ ProcessInformation.Process @ @ Message: Help. Reason: This is going to be cut out at the 256th c
2	@ Timestamp2 @ ProcessInformation.Process @ 2 @ h a tota
3	@ Timestamp2 @ ProcessInformation.Process @ 3 @ l of 5 out-o
4	@ Timestamp1 @ ProcessInformation.Process @ 1 @ haracter.
5	@ Timestamp2 @ ProcessInformation.Process @ 1 @ age wit
6	@ Timestamp2 @ ProcessInformation.Process @ 4 @ f-sequence parts
7	@ Timestamp2 @ ProcessInformation.Process @ @ Message: This is a mess

As you can see, not only time of arrive is mixed up within each group (assuming they group by timestamp), but groups also interleave. (The first column is not in the event - it just signifies the returned event order - usually that's in reverse time order.)

Now apply my search

| eval data = split(_raw, "@")
| eval timestamp = mvindex(data, 1), ProcessId = mvindex(data, 2), sequence = mvindex(data, 3), message = trim(mvindex(data, 4), " ")
| sort sequence
| stats list(message) as message max(sequence) as sequence min(_time) as _time by timestamp ProcessId
| eval _raw = "@" . timestamp . "@" . ProcessId . "@ " . max(sequence) . " @ " . mvjoin(message, "")

This results in

timestamp	ProcessId	message	sequence	_raw
Timestamp1	ProcessInformation.Process	Message: Help. Reason: This is going to be cut out at the 256th c haracter.	1	@ Timestamp1 @ ProcessInformation.Process @ 1 @ Message: Help. Reason: This is going to be cut out at the 256th character.
Timestamp2	ProcessInformation.Process	Message: This is a mess age wit h a tota l of 5 out-o f-sequence parts	4	@ Timestamp2 @ ProcessInformation.Process @ 4 @ Message: This is a message with a total of 5 out-of-sequence parts

In short, even though UDP messages are fragmented and out of sequence, this should be able to reconstruct messages regardless of how many parts are in one message.

Here is the complete emulation. You should play with it and compare with real data.

| makeresults
| eval spliced=mvappend(
"@ Timestamp1 @ ProcessInformation.Process @  @ Message: Help. Reason: This is going to be cut out at the 256th c",
"@ Timestamp2 @ ProcessInformation.Process @ 2 @ h a tota",
"@ Timestamp2 @ ProcessInformation.Process @ 3 @ l of 5 out-o",
"@ Timestamp1 @ ProcessInformation.Process @ 1 @ haracter.",
"@ Timestamp2 @ ProcessInformation.Process @ 1 @ age wit",
"@ Timestamp2 @ ProcessInformation.Process @ 4 @ f-sequence parts",
"@ Timestamp2 @ ProcessInformation.Process @  @ Message: This is a mess")
| mvexpand spliced
| rename spliced as _raw
``` data emulation above ```

woodcock · ‎04-20-2023

| makeresults
| eval spliced=mvappend(
"@ Timestamp1 @ ProcessInformation.Process @ @ Message: Help. Reason: This is going to be cut out at the 256th c",
"@ Timestamp2 @ ProcessInformation.Process @ 2 @ h a tota",
"@ Timestamp2 @ ProcessInformation.Process @ 3 @ l of 5 out-o",
"@ Timestamp1 @ ProcessInformation.Process @ 1 @ haracter",
"@ Timestamp2 @ ProcessInformation.Process @ 1 @ age wit",
"@ Timestamp2 @ ProcessInformation.Process @ 4 @ f-sequence parts",
"@ Timestamp2 @ ProcessInformation.Process @ @ Message: This is a mess")
| mvexpand spliced
| rename spliced as _raw
| rex "^@ (?<timestamp>[^@]+) @ (?<ProcessId>[^@]+) @ (?<sequence>[^@]*)\s*@ (?<message>.+)"
| eval sequence = if(sequence == "", 0, sequence) ``` change to <Null> if that's literal ```
| sort 0 sequence ProcessId timestamp
| streamstats count(eval(sequence==0)) AS sessionID by ProcessId timestamp
| stats min(_time) AS _time list(message) AS message BY sessionID ProcessId timestamp
| rex field=message mode=sed "s/$/:::/"
| nomv message
| rex field=message mode=sed "s/:::\s*//g"

Jdtoney · ‎04-26-2023

Finally got around to trying these solutions. This is working very well for me!

| eval data = split(_raw, "@")
| eval timestamp = mvindex(data, 9), ProcessId = mvindex(data, 1), sequence = mvindex(data, 8), message = trim(mvindex(data, 16), " ")
| fillnull value="-01" sequence
| sort 0 sequence ProcessId timestamp
| stats list(message) as message max(sequence) as sequence min(_time) as _time by timestamp ProcessId
| eval _raw = "@" . timestamp . "@" . ProcessId . "@ " . max(sequence) . " @ " . mvjoin(message, "")

Either solution (the stats way or the streamstats way) work as long as I use the line "sort 0 sequence ProcessId timestamp". This line I think was the issue before. It wasn't enough to just sort by sequence.

@yuanliu to answer your question about unique identification, every spliced message of the same group should have the same timestamp and Process ID.

I tried only giving an "example" format of the data because I'm dealing with 16 fields and I'm not even sure what they all are or how they're correlated. In fact, as you all were helping me I found that there is a value (the sequence # being used now, but not the original I was referencing - it's weird) that does iterate by 1 integer for each subsequent splice. It starts at "00", not "1". That's what the "fillnull" is doing in my solution. I can see that the dumbed-down version of my events only made it harder to troubleshoot, for that I apologize.

There doesn't seem to be a major difference, in this case, between stats and streamstats, and there doesn't seem to be a difference between "| eval sequence = if(sequence == "", -01, sequence) and "| fillnull value="-01" sequence".

I think I have what I need - from the newly created or extracted field I can rex my other fields out of there. Unfortunate I can't go through the field extractor but hey, what you gonna do. Thanks for all your help!!

woodcock · ‎04-16-2023

This problem is impossible to solve at the Splunk level. I *suspect* that it is maybe possible to solve at the syslog level, if you only have one service.

Jdtoney · ‎04-19-2023

I was afraid of that. Unfortunately, the system's documentation says this 256-char cutoff is by design, and I can't access the system myself. I realistically only have Splunk to work with here.

yuanliu · ‎04-12-2023

As your data is strictly formatted, it is easier to simply reconstruct the message. To do this, I assume that the second field is a unique process ID by which you can group these messages. I will further assume that "@ <Null> @" in your data illustration means "@ @". (If it is a literal string "<Null>", it can be handled similarly.)

| rex "^@ (?<timestamp>[^@]+) @ (?<ProcessId>[^@]+) @ (?<sequence>[^@]*) @ (?<message>.+)"
| eval sequence = if(sequence == "", 0, sequence) ``` change to <Null> if that's literal ```
| stats list(message) as message max(sequence) as sequence earliest(timestamp) as timestamp min(_time) as _time by ProcessId
| eval sequence = if(sequence = 0, "", sequence) ``` change to <Null> if that's literal ```
| eval _raw = "@ " . timestamp . " @ " . ProcessId . " @ " . max(sequence) . " @ " . mvjoin(message, "")

To put this to test, I use the following emulation:

| makeresults
| eval spliced=mvappend("@ Timestamp @ ProcessInformation.Process @  @ Message: Help. Reason: This is going to be cut out at the 256th c","@ Timestamp @ ProcessInformation.Process @ 1 @ haracter.")
| mvexpand spliced
| rename spliced as _raw
``` data emulation above ```

The result is

ProcessId	message	sequence	timestamp	_time	_raw
ProcessInformation.Process	Message: Help. Reason: This is going to be cut out at the 256th c haracter.	0 1	Timestamp	2023-04-12 02:35:06	@ Timestamp @ ProcessInformation.Process @ 1 @ Message: Help. Reason: This is going to be cut out at the 256th character.

Jdtoney · ‎04-14-2023

@yuanliu thanks for the reply!

I'm having trouble viewing the edited _raw in the statistics table generated by your solution. It does generate a table, and the "message" column contains both parts of the message (albeit in reverse), but no column for _raw. I wonder if I do not have permissions to edit _raw? I only have some access after all; a different team at my company manages the Splunk backend.

I should say a few things in regards to your response:

1. You had a correct assumption about <Null> being ""

2. It is incorrect to group by the process ID - although the two events with the spliced message do share a process ID, there are other messages that share the same process ID in a short time span as well. I have been ordering by "timestamp" instead

3. "timestamp" is a field taken from a delimited field in each event and is technically different than _time, and "timestamp" is more accurate. There are wonky, old-proprietary-system-related reasons for this.

4. I should have ordered my example events in the reverse; the event with the 256 character part of the message appears in splunk first, followed by the second part as overflow

Do any of these factors also change the search you provided?

yuanliu · ‎04-14-2023

Yes, not having a unique ID to group by will be a deal breaker. Generally speaking, _time for every event is different, therefore you cannot group by _time unless

events in each group are spaced such that the duration of the group is much shorter than the interval between groups and
the interval between groups can be determined on calendar clock (as opposed to elapse clock), e.g., by X minutes.

The second condition is a particularly precarious one because events do not occur according to human calendar. Now, you mention that timestamp field is not the same as _time. If this value is identical for each event in a group, this can be used as a unique identifier.

| rex "^@ (?<timestamp>[^@]+) @ (?<ProcessId>[^@]+) @ (?<sequence>[^@]*) @ (?<message>.+)"
| sort sequence
| stats list(message) as message max(sequence) as sequence min(_time) as _time by timestamp ProcessId
| eval _raw = "@ " . timestamp . " @ " . ProcessId . " @ " . max(sequence) . " @ " . mvjoin(message, "")

Yes, the method depends on sequence order. (And it can handle more than one 256-character sequences.) So, a sort command is added. (Because you confirmed that the <Null> is really just blank, I also removed manipulation.) You can also use reverse if events are retrieved in perfect reverse-time order, or sort _time. Unfortunately, none of these are cheap.

Here's the emulation with reverse sequence order.

| makeresults
| eval spliced=mvappend("@ Timestamp @ ProcessInformation.Process @ 1 @ haracter.","@ Timestamp @ ProcessInformation.Process @  @ Message: Help. Reason: This is going to be cut out at the 256th c")
| mvexpand spliced
| rename spliced as _raw
``` data emulation above ```

The above search gives

timestamp	ProcessId	message	sequence	_time	_raw
Timestamp	ProcessInformation.Process	Message: Help. Reason: This is going to be cut out at the 256th c haracter.	1	2023-04-14 22:16:05	@ Timestamp @ ProcessInformation.Process @ 1 @ Message: Help. Reason: This is going to be cut out at the 256th character.

Play with the emulation and compare with your real data.

The final assembly uses string concatenation. SPL is pretty forgiving in concatenation but there are still some situations it will output null. When that happens, you won't see _raw in the table. (There is no permission in regard to an assembled field. If you can search raw events, you should see _raw.) Notable suspects include

One of fields is multivalued. (Not treated as string.)
One of components is itself null. (E.g., if sequence is null, max(sequence) is also null, hence _raw will be null.

Side note: If there are too many events, you can improve performance by using split instead of rex. (Although compared with the cost of sort, improvement may not be meaningful.)

| eval data = split(_raw, "@")
| eval timestamp = mvindex(data, 1), ProcessId = mvindex(data, 2), sequence = mvindex(data, 3), message = trim(mvindex(data, 4), " ")
| sort sequence
| stats list(message) as message max(sequence) as sequence min(_time) as _time by timestamp ProcessId
| eval _raw = "@ " . timestamp . " @ " . ProcessId . " @ " . max(sequence) . " @ " . mvjoin(message, "")

efavreau · ‎04-11-2023

@Jdtoney My mind went to regular expressions, and I went to sanity check my thoughts against the following articles:
https://community.splunk.com/t5/Splunk-Search/Multi-Line-field-extraction/m-p/331355
https://community.splunk.com/t5/Splunk-Search/Multi-line-field-Extraction/m-p/73920

with one of them mentioning using the field extractor: https://docs.splunk.com/Documentation/Splunk/latest/Knowledge/ExtractfieldsinteractivelywithIFX

Using what you provided, I was able to craft a regular expression that gets close to what you want as two fields, and then you can use an eval to glue the two fields together. YMMV, for what you want to capture and not, and based on your actual logs.

Regular Expression:

Message: Help\. Reason: (?<firstpart>.*)\n\n.*?@ 1 @ (?<secondpart>.*).$

In context: https://regex101.com/r/29jHcy/1

Good luck! If this was helpful at all, please consider giving some karma.

###

If this reply helps you, an upvote would be appreciated.

Jdtoney · ‎04-14-2023

@efavreau thank you for the response!!

If I'm understanding you right, you're suggesting I extract both "parts" as fields using the field extractor, and then get fields with eval using both parts.

Unfortunately, each of these lines are separate logs, not multiline events. When I extract the fields using the field extractor, I can't use regex to gather segments of other events, as far as I can tell.

Am I missing something from any of the links in your response, do you know what I might need to do to extract those "parts" from 2 different, single-line events with the field extractor?

How to create field extraction from a spliced message across separate events?

field extraction

fields

regex

Enter the Splunk Community Dashboard Challenge for Your Chance to Win!

.conf24 | Session Scheduler is Live!!

Introducing the Splunk Community Dashboard Challenge!