I'm trying to accept TCP input from a device which wraps each transmission in an STX/ETX pair (ASCII 002/003), with no line breaks ('\n'). The text inside is XML, which is handled by
KV_MODE=xml rather nicely - I tried importing a file with line feeds instead of STX/ETX and both that and my
TIMESTAMP_FIELDS= setting also worked as expected.
However, I can't figure out how to break it into events in case of STX/ETX instead of newlines. I tried
LINE_BREAKER = [\x02\x03]+ and
SHOULD_LINEMERGE=false to no avail. Admittedly, I tested my sourcetype by uploading a one-line file with STX/ETX thrown in. It still loads as one huge event, and the timestamp cannot be parsed out.
Am I escaping those hex codes (can switch to octal if necessary) improperly? Is there a known problem parsing LINE_BREAKER? Should I use SHOULD_LINEMERGE=true
and MUST_BREAK_AFTER=\x03 instead? In the latter case, I'll have to strip those control characters some other way before Splunk can parse the XML inside.
If everything else fails, I can try switching to a scripted input, but that seems an unnecessary hurdle.
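For reference, the framing logic such a scripted input would need is small. The following is a minimal sketch, not a tested Splunk input: the function name `extract_frames` and the idea of feeding it bytes read from the TCP socket are assumptions made for illustration.

```python
STX = b"\x02"  # start-of-text marker sent by the device
ETX = b"\x03"  # end-of-text marker sent by the device


def extract_frames(buffer: bytes):
    """Split a byte buffer into complete STX...ETX payloads.

    Returns (frames, remainder). The remainder holds an incomplete
    trailing frame, to be prepended to the next chunk read from the socket.
    """
    frames = []
    pos = 0
    while True:
        start = buffer.find(STX, pos)
        if start == -1:
            # No further frame start: anything left over is discarded.
            return frames, b""
        end = buffer.find(ETX, start + 1)
        if end == -1:
            # Frame has started but not finished: keep it for later.
            return frames, buffer[start:]
        frames.append(buffer[start + 1:end])
        pos = end + 1
```

A scripted input built on this would write each frame to stdout followed by a newline, so that the default line breaking applies and the STX/ETX characters never reach the indexer.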
You should be able to remove the STX/ETX using SEDCMD, and you should be able to use BREAK_ONLY_BEFORE/MUST_BREAK_AFTER. Would you be able to provide some sample events that you might receive?
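To illustrate what stripping the control characters amounts to, here is a rough Python equivalent of a sed-style substitution such as s/[\x02\x03]//g. It only demonstrates the regex; the function name `strip_framing` is hypothetical, and the exact expression to put in SEDCMD is an assumption that should be checked against the props.conf documentation.

```python
import re

# Character class matching STX (0x02) and ETX (0x03).
CONTROL_CHARS = re.compile(r"[\x02\x03]")


def strip_framing(event: str) -> str:
    """Remove every STX/ETX character from an event, leaving the XML intact."""
    return CONTROL_CHARS.sub("", event)
```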
Something along the lines of:
where STX and ETX are 0x02 and 0x03 respectively. There may be many such XML structures, all surrounded by STX/ETX.
As you can see, the data inside are pure XML. TIMESTAMP_FIELDS will include objectdata.general.timestamp for sure.
Here is my full index definition as of now:
[tcpInputTest]
SHOULD_LINEMERGE = false
category = Custom
pulldown_type = true
DATETIME_CONFIG = NONE
KV_MODE = xml
disabled = false
TIMESTAMP_FIELDS = objectdata.general.timestamp, tracedata.timestamp, heartbeatdata.timestamp
LINE_BREAKER = \x03?\x02
TRUNCATE = 0
Give this a try for your LINE_BREAKER attribute:
LINE_BREAKER = (\x02)(?=\<objectdata\>)
Oh, not every record starts with objectdata tag - some are others. But using just \x02 and stripping \x03 with SEDCMD is what I'm going to try next.
Yes, this worked! The XML is parsed even without stripping the trailing ETX. I had to remove the trailing 'greater than' sign because some of the records have xmlns:xsi and other attributes. I'm wondering why it didn't work for me with
LINE_BREAKER = \x02. What exactly did that lookahead add?
Timestamp extraction is my next problem - the events are broken into fields just fine, and, for example, I do find objectdata.general.timestamp field in the resulting event - but timestamp is not extracted properly. I realize that timestamp extraction is done at index time while most fields are extracted at search time, so I'm not sure how to solve that. The problem is that there are a few timestamps in the XML data, and the first one in the most important record type - objectdata - is not what I want. I'll have to seriously play with timestamp prefix, it seems.
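Well, I'm guessing you didn't have your STX enclosed within parentheses (a regular capturing group), which would have caused it not to work. LINE_BREAKER needs a capturing group: the text matched by the first group is discarded, and the event breaks at that point.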
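The role of the capturing group can be illustrated with a rough emulation. This is only a sketch using Python's re module rather than Splunk's own regex engine, and `break_events` is a hypothetical helper; it assumes the documented behaviour that the start of the first capturing group ends the previous event and the end of that group starts the next one.

```python
import re


def break_events(raw: str, line_breaker: str):
    """Emulate LINE_BREAKER: split raw text, discarding capture group 1."""
    pattern = re.compile(line_breaker)
    if pattern.groups < 1:
        # Without a capturing group there is nothing to break on.
        raise ValueError("LINE_BREAKER needs at least one capturing group")
    events = []
    start = 0
    for match in pattern.finditer(raw):
        # Text before group 1 (including any part of the match that
        # precedes the group) stays with the previous event.
        events.append(raw[start:match.start(1)])
        start = match.end(1)
    events.append(raw[start:])
    # Drop empty events, e.g. the one before the very first STX.
    return [event for event in events if event]


raw = "\x02<a/>\x03\x02<b/>\x03"
print(break_events(raw, r"(\x02)"))          # each event keeps its trailing ETX
print(break_events(raw, r"([\x02\x03]+)"))   # both control characters are discarded
```

With (\x02) alone, each event still ends in ETX, which matches the observation that the XML parsed even without stripping it.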
For timestamp recognition, I would suggest going traditional and providing attributes like TIME_PREFIX and TIME_FORMAT.
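As a sketch of how a prefix regex plus a strptime-style format selects the right timestamp when an event contains several, consider the emulation below. Everything in it is an assumption for illustration: the sample XML, the `created` element, the timestamp format, and the helper `extract_timestamp`; the device's real format was not posted. The `lookahead` parameter loosely mirrors MAX_TIMESTAMP_LOOKAHEAD.

```python
import re
from datetime import datetime


def extract_timestamp(event, time_prefix, time_format, lookahead=30):
    """Parse the timestamp that immediately follows the time_prefix match."""
    match = re.search(time_prefix, event)
    if match is None:
        return None
    candidate = event[match.end():match.end() + lookahead]
    # strptime needs the exact timestamp text, so try the longest
    # prefix of the candidate first and shrink until one parses.
    for length in range(len(candidate), 0, -1):
        try:
            return datetime.strptime(candidate[:length], time_format)
        except ValueError:
            continue
    return None


# Hypothetical event: the first timestamp is not the one we want.
event = (
    "<objectdata><created>2013-05-01 00:00:00</created>"
    "<general><timestamp>2013-05-01 12:34:56</timestamp></general>"
    "</objectdata>"
)
stamp = extract_timestamp(event, r"<general>\s*<timestamp>", "%Y-%m-%d %H:%M:%S")
print(stamp)
```

In props.conf terms, the prefix regex would go into TIME_PREFIX and the format string into TIME_FORMAT, adjusted to the actual data.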
You need to tell Splunk that it is using a different line breaker. On your indexer, create a props.conf stanza something like this:
[source::my/source/file.log]
LINE_BREAKER = [\x02\x03]+
You may want to replace the source with your sourcetype.
See http://docs.splunk.com/Documentation/Splunk/latest/Data/Indexmulti-lineevents for more details.