Getting Data In

Why events are not breaking/extracted properly?

SplunkDash
Motivator

Hello,

I completed a few UF based data ingestions and SPLUNK is getting events from those ingestions but have some issues with breaking event.

I have 2 types of files: 1)   text files with header and Pile Delimiters, 2) XML files

In the case of text files, header info is showing up within the SPLUNK events, and also events are not breaking as expected at all, most of the cases, one SPLUNK event contains more than one source events

In the case of XML files, info within one source file considers as one SPLUNK event, but it should be considered number of events based on the XML tag.

Any thoughts/recommendations to resolve these issues would be highly appreciated. Thank you!

props/input configuration files and source files are given below:

For Text Files:

props

[ds:audit]

SHOULD_LINEMERGE=false
LINE_BREAKER=([\r\n]+)
HEADERFIELD_LINE_NUMBER=1
INDEXED_EXTRACTIONS=psv
TIME_FORMAT=%Y-%m-%dT%H:%M:%S.%Q%z
TIMESTAMP_FIELDS=TimeStamp

inputs

[monitor:///opt/audit/DS/DS_EVENTS*.txt]
sourcetype=ds:audit
index=ds_test

sample

serID|UserType|System|EventType|EventId|Subject|SessionID|SrcAddr|EventStatus|ErrorMsg|TimeStamp|Additional Application Data |Device
p22bb4r|TEST|DS|USER| VIEW_NODE |ELEMENT<843006481>|131e9d5b-e84e-567d-a6b1-775f58993f68|null|00||2022-06-14T09:01:55.001+0000||NA
p22bbs1|TEST|DS|USER| FULL_SEARCH |ELEMENT<843006481>|121e7d5b-f84e-467d-a6b1-775f58993f68|null|00||2021-06-14T09:01:50.001+0000||NA
p22bbw3|TEST|DS|USER| FULL_SEARCH | ELEMENT< 343982854>|5b8fb22e-eeed-4802-8b07-8559dbfe1e45|null|00||2021-06-14T08:54:08.054+0000||NA
ts70sbr4|TEST|DS|USER|VIEW_NODE| ELEMENT< 35382854>|5b8fb22e-eeed-4802-8b07-8559dbfe1e45|null|00||2021-06-14T08:54:16.054+0000||NA
ts70sbd3|TEST|DS|USER|FULL_SEARCH|ELEMENT<933982854>|5b8fb22e-eeed-4802-8b07-8559dbfe1e45|null|00||2021-06-14T08:53:54.053+0000||NA

For XML Files:

[secops:audit]

SHOULD_LINEMERGE=false
LINE_BREAKER=([\r\n]*)<MODTRANSL>
TIME_PREFIX=<TIMESTAMP>
TIME_FORMAT=%Y%m%d%H%M%S
MAX_TIMESTAMP_LOOKAHEAD=14
TRUNCATE=2500

Input

[monitor:///opt/app/secops/logs/audit_secops_log*.XML]
sourcetype=secops:audit
index=secops_test

Sample Data

<?xml version="x.1" encoding="UTF-8"?><DSDATA><MODTRANSL><TIMESTAMP>20190621121321</TIMESTAMP><USERID>d23bsrb</USERID><USERTYPE>SECOPS</USERTYPE><SYSTEM>DS</SYSTEM><EVENTTYPE>ADMIN</EVENTTYPE><EVENTID>SYS</EVENTID><ID>0300001</ID><SRCADDR>10.210.135.108</SRCADDR><RETURNCODE>00</RETURNCODE><VARDATA> Initiated New Entity Status: AP</VARDATA></MODTRANSL><MODTRANSL><TIMESTAMP>20190621121416</TIMESTAMP><USERID> d23bsrb </USERID><USERTYPE>SECOPS</USERTYPE><SYSTEM>DSI</SYSTEM><EVENTTYPE>ADMIN</EVENTTYPE><EVENTID>SYS</EVENTID><ID>000000000</ID><SRCADDR>10.210.135.120</SRCADDR><RETURNCODE>00</RETURNCODE><VARDATA> Entity Status: Approved New Entity Status: TI</VARDATA></MODTRANSL><MODTRANSL><TIMESTAMP>20190621121809</TIMESTAMP><USERID>sj45yrs</USERID><USERTYPE>SECOPS</USERTYPE><SYSTEM>DSI</SYSTEM><EVENTTYPE>ADMIN</EVENTTYPE><EVENTID>DS_OPD</EVENTID><ID>2192345</ID><SRCADDR>10.212.25.19</SRCADDR><RETURNCODE>00</RETURNCODE><VARDATA> 43ded7433b314eb58d2307e9bc536bd3</VARDATA > <DURATION>124</DURATION> </MODTRANSL</DSDATA>

Labels (1)
Tags (1)
0 Karma
1 Solution

yeahnah
Communicator

Hi @SplunkDash

I've tried the following on a Splunk v8.0.5 instance and it worked OK for me.

[ __auto__learned__ ]
NO_BINARY_CHECK=true
SHOULD_LINEMERGE=false
LINE_BREAKER=([\r\n]+)
INDEXED_EXTRACTION=PSV
HEADER_FIELD_LINE_NUMBER=1
HEADER_FIELD_DELIMITER=|
FIELD_DELIMITER=|
TIMESTAMP_FIELDS=TimeStamp
TIME_FORMAT=%FT%T.%3Q%z

And yes, it is strange that I need to specify FIELD_DELIMITER=|, as you would think that specifying PSV implies this anyway.  Might be a bug. 
yeahnah_1-1652337462790.png

Note, as mentioned before, this config should work on a either UF or HF, though generally INDEXED_EXTRACTIONS are defined on the UF, which means the structured  data is parsed at source by the UF.   No further modifications can be made to this forwarded data, by a Splunk HF/IDX at least, once it parsed like this.  

If this solves your problem then please mark this as answered. 

View solution in original post

PickleRick
Ultra Champion

Your PSV definition looks mostly OK. But if it doesn't work, there must surely be something wrong with it 😉

But about your XML source settings - if you put those into props.conf on UF, they won't work. You need them on indexers/HFs (whatever is the first "heavy" layer you're hitting). And with this LINE_BREAKER you won't get proper XML. You'd need to break the events using a non-capturing group so that the XML tag you're breaking at is not consumed.

0 Karma

SplunkDash
Motivator

Hello,

Thank you so much for your response, truly appreciate it. Just wondering are there any alternate solution to resolve these issues. Any additional/alternate recommendation would be highly appreciated. Thank you again.

0 Karma

yeahnah
Communicator

Hi @SplunkDash 

1. For the text files you look to have a typo with Splunk docs showing it should be

HEADER_FIELD_LINE_NUMBER = <integer>

Also, the docs show the INDEXED_EXTRACTIONS as capitalized, e.g.

INDEXED_EXTRACTIONS = PSV

 Not 100% sure capitalization would make a difference but worth a go. 

As INDEX_EXTRACTIONS is used, the props.conf can live on either the UF or HF.

2. Assuming you do not want the <?xml version=...> part then the event break config would be.

LINE_BREAKER = ([\r\n>]*.+)<DSDATA>

Note, I kept the <DSDATA> tag to keep the XML valid with the closing <\DSDATA> tag.

If you did want to get rid of the DSDATA fields too then this should work...

LINE_BREAKER = ((?:</DSDATA>)?[\r\n>]*.+<DSDATA>|</DSDATA>)

Note the OR statement (|</DSDATA>) is to try and strip the last entry in an event that may not have a newline or carriage return.  It will result an empty event.

If it was me, I think I'd keep the DSDATA field in the LINE_BREAKER and then use SEDCMD to strip <.?DSDATA> from the events.
  
For the XML extraction the props.conf file must live on the HF, or indexer, depending on your setup.

Restarts may be needed once configs are updated, depending on how you deploy this.

Hope it helps 

0 Karma

SplunkDash
Motivator

Hello,

Thank you for your quick response. In regard to XML file <MODTRANSL> should be the event breaking point, sample XML file I provided should have 3 events. Do you think,  

LINE_BREAKER = ([\r\n>]*.+)<DSDATA>

going to work for that. Thank you again. 

0 Karma

yeahnah
Communicator

Hi @SplunkDash 

OK that is clearer and it can be done.  Give this a try in the heavy forwarders props.conf file.

LINE_BREAKER = ([\r\n]*<\?xml.+<DSDATA>|</MODTRANSL>|</DSDATA>|[\n\r])
# to make XML valid again, re-append the </MODTRANSL> tag stripped in line breaking capture group
SEDCMD-re-append-trailing-tag =s/$/<\/MODTRANSL>/

A good way to play with this is with a test file uploaded via the Settings > Add Data wizard.

0 Karma

SplunkDash
Motivator

Hello @yeahnah,

Thank you so much, truly appreciated....it's working for me. 

Is there any recommendation for the issues with text files? Anything will help. Thank you so much again.

Tags (1)
0 Karma

yeahnah
Communicator

Hi @SplunkDash

I've tried the following on a Splunk v8.0.5 instance and it worked OK for me.

[ __auto__learned__ ]
NO_BINARY_CHECK=true
SHOULD_LINEMERGE=false
LINE_BREAKER=([\r\n]+)
INDEXED_EXTRACTION=PSV
HEADER_FIELD_LINE_NUMBER=1
HEADER_FIELD_DELIMITER=|
FIELD_DELIMITER=|
TIMESTAMP_FIELDS=TimeStamp
TIME_FORMAT=%FT%T.%3Q%z

And yes, it is strange that I need to specify FIELD_DELIMITER=|, as you would think that specifying PSV implies this anyway.  Might be a bug. 
yeahnah_1-1652337462790.png

Note, as mentioned before, this config should work on a either UF or HF, though generally INDEXED_EXTRACTIONS are defined on the UF, which means the structured  data is parsed at source by the UF.   No further modifications can be made to this forwarded data, by a Splunk HF/IDX at least, once it parsed like this.  

If this solves your problem then please mark this as answered. 

Get Updates on the Splunk Community!

Synthetic Monitoring: Not your Grandma’s Polyester! Tech Talk: DevOps Edition

Register today and join TekStream on Tuesday, February 28 at 11am PT/2pm ET for a demonstration of Splunk ...

Instrumenting Java Websocket Messaging

Instrumenting Java Websocket MessagingThis article is a code-based discussion of passing OpenTelemetry trace ...

Announcing General Availability of Splunk Incident Intelligence!

Digital transformation is real! Across industries, companies big and small are going through rapid digital ...