Getting Data In

How to split a TCP input on STX/ETX (0x02/0x03), no line breaks, into separate events?

arkadyz1
Builder

Hello,
I'm trying to accept TCP input from a device which wraps each transmission into STX/ETX pair (ASCII 002/003), with no line breaks ('\n'). The text inside is XML, which is handled by KV_MODE=xml rather nicely - I tried importing a file with line feeds instead of STX/ETX and both that and my TIMESTAMP_FIELDS= setting also worked as expected.
However, I can't figure out how to break the input into events when the delimiters are STX/ETX instead of newlines. I tried LINE_BREAKER = [\x02\x03]+ with SHOULD_LINEMERGE=false, without success. Admittedly, I tested my sourcetype only by uploading a one-line file with STX/ETX thrown in, but it still loads as one huge event and the timestamp cannot be parsed out.

Am I escaping those hex codes (can switch to octal if necessary) improperly? Is there a known problem parsing LINE_BREAKER? Should I use SHOULD_LINEMERGE=true with BREAK_ONLY_BEFORE=\x02 and MUST_BREAK_AFTER=\x03 instead? In the latter case, I'll have to strip those control characters some other way before Splunk can parse the XML inside.

If everything else fails, I can try switching to a scripted input, but that seems an unnecessary hurdle.

0 Karma
1 Solution

arkadyz1
Builder

So, to summarize: I was missing a very simple thing - a capturing group (parentheses) around my regex. Here is how it should read:

LINE_BREAKER = ([\x02\x03]+)

See those enclosing parentheses? They were the reason. The documentation is clear on that (just look for LINE_BREAKER in the admin manual and read the description carefully).

Special thanks to somesoni2 for pointing it out.

dfrankekcg
Explorer

For a log file that was separating lines using the hex 0A character, I was able to use LINE_BREAKER = (\x0A). I viewed the log file in a hex editor to find the line separator.

0 Karma

bmunson_splunk
Splunk Employee

You need to tell Splunk that it is using a different line breaker. On your indexer, create a props.conf stanza something like this:

[source::my/source/file.log]
LINE_BREAKER = [\x02\x03]+

You may want to replace the source with your sourcetype.

See http://docs.splunk.com/Documentation/Splunk/latest/Data/Indexmulti-lineevents for more details.

0 Karma

arkadyz1
Builder

But that's exactly what I did initially. What I hadn't done, however, was enclose that regex in parentheses, as pointed out by somesoni2. Is it documented anywhere?

0 Karma

arkadyz1
Builder

Oh, never mind - it is documented, just buried deep enough that it's easy to miss.

The docs say this:

The regex must contain a capturing group -- a pair of parentheses which defines an identified subcomponent of the match

Something I missed initially. They also explain that the characters matched by LINE_BREAKER are stripped from the resulting events - something that I wanted all along :).
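To make the quoted rule concrete, here is a rough Python simulation of how I understand the capturing-group behaviour: the previous event ends where the first group starts, the next event begins where it ends, and the group's text is thrown away. This is only a sketch of the documented semantics, not Splunk's actual implementation; split_events and the sample stream are made-up names and data for illustration.

```python
import re

def split_events(raw, line_breaker):
    """Simulate LINE_BREAKER: break at each match, discard group 1 (hypothetical helper)."""
    pattern = re.compile(line_breaker)
    if pattern.groups < 1:
        # Mirrors the documented requirement that the regex contain a capturing group.
        raise ValueError("LINE_BREAKER needs a capturing group")
    events, start = [], 0
    for m in pattern.finditer(raw):
        if m.start(1) == -1:
            continue  # group 1 did not take part in this match
        events.append(raw[start:m.start(1)])  # previous event ends where group 1 starts
        start = m.end(1)                      # next event begins where group 1 ends
    events.append(raw[start:])
    return [e for e in events if e]           # drop empty leading/trailing pieces

# Illustrative stream: two transmissions framed by STX (0x02) and ETX (0x03).
stream = "\x02<a>1</a>\x03\x02<b>2</b>\x03"
events = split_events(stream, r"([\x02\x03]+)")
print(events)
```

Calling split_events(stream, r"[\x02\x03]+"), i.e. the same regex without parentheses, raises the ValueError in this sketch, which is the analogue of the problem in my original setting.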

0 Karma

somesoni2
Revered Legend

You should be able to remove the STX/ETX using SEDCMD, and then you should be able to use BREAK_ONLY_BEFORE/MUST_BREAK_AFTER. Would you be able to provide some sample events that you might receive?
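For what it's worth, the SEDCMD idea boils down to a sed-style substitution along the lines of s/[\x02\x03]//g applied to each event. The Python sketch below only mimics the effect of such a substitution on one event; the exact SEDCMD expression and the helper name strip_framing are assumptions, not something tested against Splunk.

```python
import re

def strip_framing(event):
    """Remove STX (0x02) and ETX (0x03) bytes, like a global sed substitution would."""
    return re.sub(r"[\x02\x03]", "", event)

# Illustrative event with the framing characters still attached.
cleaned = strip_framing("\x02<heartbeatdata/>\x03")
print(repr(cleaned))
```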

0 Karma

arkadyz1
Builder

Something along the lines of:

STX<objectdata><general oid="4"><timestamp>2015-09-18T02:00:13</timestamp></general></objectdata>ETX

where STX and ETX are 0x02 and 0x03 respectively. There may be many such XML structures, all surrounded by STX/ETX.
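To show the shape of one transmission, here is a small Python sketch that strips the framing from the sample above and flattens the XML into dotted paths, which is where a field name like objectdata.general.timestamp comes from. The flatten helper is hypothetical and only covers element text; it is not meant to reproduce everything KV_MODE=xml does (attributes, for instance, are ignored).

```python
import xml.etree.ElementTree as ET

def flatten(elem, prefix=""):
    """Map each element that has text to a dotted path (illustrative helper)."""
    path = f"{prefix}.{elem.tag}" if prefix else elem.tag
    fields = {}
    text = (elem.text or "").strip()
    if text:
        fields[path] = text
    for child in elem:
        fields.update(flatten(child, path))
    return fields

# The sample transmission, with STX/ETX written as \x02 and \x03.
raw = ('\x02<objectdata><general oid="4">'
       '<timestamp>2015-09-18T02:00:13</timestamp>'
       '</general></objectdata>\x03')

payload = raw.strip("\x02\x03")   # control characters are not valid XML, so drop them first
fields = flatten(ET.fromstring(payload))
print(fields)
```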

0 Karma

arkadyz1
Builder

As you can see, the data inside are pure XML. TIMESTAMP_FIELDS will include objectdata.general.timestamp for sure.

Here is my full sourcetype definition (props.conf stanza) as of now:

[tcpInputTest]
SHOULD_LINEMERGE = false
category = Custom
pulldown_type = true
DATETIME_CONFIG = NONE
KV_MODE = xml
disabled = false
TIMESTAMP_FIELDS = objectdata.general.timestamp, tracedata.timestamp, heartbeatdata.timestamp
LINE_BREAKER = \x03?\x02
TRUNCATE = 0
0 Karma

somesoni2
Revered Legend

Give this a try for your LINE_BREAKER attribute:

LINE_BREAKER = (\x02)(?=\<objectdata\>)

arkadyz1
Builder

Oh, not every record starts with objectdata tag - some are others. But using just \x02 and stripping \x03 with SEDCMD is what I'm going to try next.

0 Karma

somesoni2
Revered Legend

Ok... Try this as well...

LINE_BREAKER = (\x02)(?=\<\S+\>)

arkadyz1
Builder

Yes, this worked! The XML is parsed even without stripping the trailing ETX. I had to remove the trailing 'greater than' sign because some of the records have xmlns:xsi and other attributes. I'm wondering why it didn't work for me with LINE_BREAKER = \x02. What exactly did that lookahead add?

Timestamp extraction is my next problem - the events are broken into fields just fine, and, for example, I do find objectdata.general.timestamp field in the resulting event - but timestamp is not extracted properly. I realize that timestamp extraction is done at index time while most fields are extracted at search time, so I'm not sure how to solve that. The problem is that there are a few timestamps in the XML data, and the first one in the most important record type - objectdata - is not what I want. I'll have to seriously play with timestamp prefix, it seems.

0 Karma

somesoni2
Revered Legend

Well, I'm guessing you didn't have your STX enclosed in parentheses (a capturing group), which would have caused it not to work.
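A quick way to convince yourself of this is to compare the patterns in Python's re module, which treats groups and lookaheads much like the PCRE-style regex Splunk uses. The sketch below is an illustration only; the sample stream (including the xmlns:xsi value) is invented. It shows that the lookahead is zero-width, so the first group marks the same break points with or without it, while a bare \x02 has no capturing group at all.

```python
import re

# Illustrative stream: two framed records, the second with an attribute on its root tag.
stream = ('\x02<objectdata>a</objectdata>\x03'
          '\x02<tracedata xmlns:xsi="x">b</tracedata>\x03')

with_lookahead = re.compile(r"(\x02)(?=<\S+)")
plain_group = re.compile(r"(\x02)")
bare = re.compile(r"\x02")

def group1_spans(pattern, text):
    """Return the (start, end) span of group 1 for every match."""
    return [m.span(1) for m in pattern.finditer(text)]

spans_lookahead = group1_spans(with_lookahead, stream)
spans_plain = group1_spans(plain_group, stream)

# The lookahead consumes nothing: the whole match is just the STX character.
matched_text = [m.group(0) for m in with_lookahead.finditer(stream)]

print(spans_lookahead == spans_plain)   # same break points either way
print(bare.groups)                      # number of capturing groups in the bare pattern
```

So the lookahead only restricts where a break may happen (an STX followed by something that looks like a tag); the parentheses are what make the setting usable in the first place.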

For timestamp recognition, I would suggest going traditional and providing attributes like TIME_PREFIX and TIME_FORMAT.
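As a rough illustration of that approach: TIME_PREFIX is a regex that says where the timestamp starts, and TIME_FORMAT is a strptime-style format for parsing it. The Python sketch below mimics that pairing. The prefix regex, the 19-character slice, and the extra <header> timestamp in the sample event are all assumptions made up for this example (the header stands in for "an earlier timestamp that is not the one you want"); adjust them to the real data.

```python
import re
from datetime import datetime

# Assumed values, modelled on the sample event posted earlier in this thread.
TIME_PREFIX = r"<general\b[^>]*>\s*<timestamp>"   # skip ahead to the timestamp inside <general>
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"
TIMESTAMP_LENGTH = 19                             # len("2015-09-18T02:00:13")

def extract_timestamp(event):
    """Find the prefix, then parse the text right after it (illustrative helper)."""
    m = re.search(TIME_PREFIX, event)
    if m is None:
        return None
    candidate = event[m.end():m.end() + TIMESTAMP_LENGTH]
    try:
        return datetime.strptime(candidate, TIME_FORMAT)
    except ValueError:
        return None

# Hypothetical event: the <header> element and its timestamp are invented for illustration.
event = ('<objectdata><header><timestamp>2015-09-18T01:59:00</timestamp></header>'
         '<general oid="4"><timestamp>2015-09-18T02:00:13</timestamp></general></objectdata>')

parsed = extract_timestamp(event)
print(parsed)
```

Anchoring the prefix on the enclosing <general> tag is one way to skip an earlier timestamp in the same record; record types with a different layout would need their own prefix or a more general regex.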
