Can I skip specific lines while indexing data?

andrewtrobec · ‎03-07-2017

Hello,

I am trying to index a csv log file that looks like this:

Description,NumJobWaitEvents,ReturnCode,RunEnd,RunStart,ScheduledStartTime,Status
Job.Description,Job.NumJobWaitEvents,Job.ReturnCode,Job.RunEnd,Job.RunStart,Job.ScheduledStartTime,Job.Status
String,Integer,Integer,DateTime,DateTime,DateTime,enum.JobStatus
Auto Start,0,null,"2017/03/05 06:03:39,441","2017/03/05 06:01:39,269","2017/03/05 06:01:39,065",Completed
Auto Start,0,null,"2017/03/05 06:09:04,493","2017/03/05 06:06:23,915","2017/03/05 06:06:23,743",Completed
AG43_542_TINA_CODE_AGB - Checking,1,null,"2017/03/05 06:32:18,908","2017/03/05 06:23:15,148","2017/03/05 06:23:14,822",Completed
DATA SANITY CHECK,0,null,"2017/03/05 09:02:23,997","2017/03/05 09:00:44,073","2017/03/05 09:00:42,959",Completed

The first line always contains the header, the second and third lines always contain object and type information, and the log data always starts from the fourth line.

When I index the file as it is, it only indexes the first two lines even though there are thousands. My question is: how can I skip the second and third lines so I can index the actual log data?

Thank you and best regards,

Andrew

woodcock · ‎03-08-2017

Try this:

[MY_SOURCETYPE]
FIELD_DELIMITER = ,
HEADER_FIELD_LINE_NUMBER = 1
INDEXED_EXTRACTIONS = CSV
PREAMBLE_REGEX = (^|[\r\n])(Job\.Description[^\r\n]+|String[^\r\n]+)
TIMESTAMP_FIELDS = RunStart
category = Structured
description = Comma-separated value format. Set header and other settings in "Delimited Settings"

Also, IMHO, events that are "durationful" (i.e. contain start and end time details) should always use the end time as the timestamp. One reason: think about what your timechart would look like if your system crashed and all events ended at the same time.
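To make the start/end distinction concrete, here is a minimal Python sketch that parses the RunStart and RunEnd values from the first data row and computes the job's duration. The format string is an assumption inferred from the sample (the comma before the milliseconds is treated as the fractional-seconds separator), and the helper name parse_ts is made up for illustration. In the stanza above, following this advice would presumably mean pointing TIMESTAMP_FIELDS at RunEnd instead of RunStart.

```python
from datetime import datetime

# Assumed layout of the sample timestamps: "2017/03/05 06:03:39,441"
# (%f accepts the three-digit millisecond part when parsing).
FMT = "%Y/%m/%d %H:%M:%S,%f"


def parse_ts(value):
    """Parse one timestamp string from the sample CSV (hypothetical helper)."""
    return datetime.strptime(value, FMT)


# Values copied from the first data row of the sample file.
run_start = parse_ts("2017/03/05 06:01:39,269")
run_end = parse_ts("2017/03/05 06:03:39,441")

# Seconds between start and end of the job.
duration = (run_end - run_start).total_seconds()
print(run_end.isoformat(), duration)
```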

andrewtrobec · ‎03-08-2017

Thanks for the suggestion, but unfortunately it doesn't work. I think I see what you're getting at, though: you're trying to create one expression that covers both lines, right? I'm not too proficient with regexes.

I'll keep playing with it, thanks!

Andrew

woodcock · ‎03-08-2017

Yes, and make it flexible enough to work whether it is presented with the entire event or just a single line. That really should have done it.
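The "entire event or single line" point can be checked outside Splunk. The sketch below runs the suggested pattern both ways against the first four lines of the sample file. It uses Python's re module as a stand-in for Splunk's regex engine, which is an assumption; it says nothing about how Splunk actually feeds text to PREAMBLE_REGEX.

```python
import re

# Pattern copied from the suggested stanza.
PREAMBLE = re.compile(r"(^|[\r\n])(Job\.Description[^\r\n]+|String[^\r\n]+)")

# First four lines of the sample file from the question.
LINES = [
    "Description,NumJobWaitEvents,ReturnCode,RunEnd,RunStart,ScheduledStartTime,Status",
    "Job.Description,Job.NumJobWaitEvents,Job.ReturnCode,Job.RunEnd,Job.RunStart,Job.ScheduledStartTime,Job.Status",
    "String,Integer,Integer,DateTime,DateTime,DateTime,enum.JobStatus",
    'Auto Start,0,null,"2017/03/05 06:03:39,441","2017/03/05 06:01:39,269","2017/03/05 06:01:39,065",Completed',
]

# Case 1: the pattern is applied to one line at a time.
per_line = [PREAMBLE.search(line) is not None for line in LINES]

# Case 2: the pattern is applied to the whole text at once;
# keep the first field of every matched line.
whole_text = "\n".join(LINES)
matched = [m.group(2).split(",")[0] for m in PREAMBLE.finditer(whole_text)]

print(per_line)  # [False, True, True, False]
print(matched)   # ['Job.Description', 'String']
```

In both cases only the second and third lines match, so the header and the data rows are left alone.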

andrewtrobec · ‎03-08-2017

Just to make sure that I'm following the right procedure I'm going to list out the steps I've followed:

1. Edit props.conf located in SPLUNKHOME\etc\apps\MY_APP\local to contain:

[MY_SOURCETYPE]
FIELD_DELIMITER = ,
HEADER_FIELD_LINE_NUMBER = 1
INDEXED_EXTRACTIONS = csv
PREAMBLE_REGEX = (^|[\r\n])(Job.Description[^\r\n]+|String[^\r\n]+)
TIMESTAMP_FIELDS = RunStart
category = Structured
description = Comma-separated value format. Set header and other settings in "Delimited Settings"

2. Restart splunkd via cmd: net stop splunkd / net start splunkd
3. Once up, log into Splunk (6.5.2 btw) and enter my app
4. From the Settings menu, select Add Data
5. Select upload
6. Select the csv that contains the data above
7. Select Next
8. From the Source type list, select MY_SOURCETYPE
9. At this point, the first two lines of the event list are as follows

If the regex works as planned, would I see those two lines at that point?

Best regards,

Andrew

spisiakmi · ‎09-09-2019

Hi Andrew,

Many greetings. We were colleagues and shared a lot of Splunk information. I have a very similar problem to the one you described. Did you solve it in the meantime?
I wish you all the best.
Michal Spisiak

woodcock · ‎03-09-2017

If everything is working, you should not see those lines. HOWEVER, I have never used the Add Data wizard with INDEXED_EXTRACTIONS before.

dmaislin_splunk · ‎03-07-2017

Check out: http://docs.splunk.com/Documentation/Splunk/6.5.2/Admin/Propsconf

The section: Structured Data Header Extraction and configuration

PREAMBLE_REGEX =
* Some files contain preamble lines. This attribute specifies a regular
expression which allows Splunk to ignore these preamble lines, based on
the pattern specified.

andrewtrobec · ‎03-07-2017

Thanks, I'll take a look. One question: will this allow me to read the first line as the headers and only ignore the second and third lines?

woodcock · ‎03-07-2017

Yes, exactly.
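As a rough model of the intended outcome (not of Splunk's implementation), the Python sketch below drops lines that match a preamble pattern and then reads what remains as CSV, with the first surviving line as the header. The pattern and the function name parse_skipping_preamble are assumptions made up for this illustration.

```python
import csv
import io
import re

# Header, the two preamble lines, and two data rows from the sample file.
RAW = "\n".join([
    "Description,NumJobWaitEvents,ReturnCode,RunEnd,RunStart,ScheduledStartTime,Status",
    "Job.Description,Job.NumJobWaitEvents,Job.ReturnCode,Job.RunEnd,Job.RunStart,Job.ScheduledStartTime,Job.Status",
    "String,Integer,Integer,DateTime,DateTime,DateTime,enum.JobStatus",
    'Auto Start,0,null,"2017/03/05 06:03:39,441","2017/03/05 06:01:39,269","2017/03/05 06:01:39,065",Completed',
    'DATA SANITY CHECK,0,null,"2017/03/05 09:02:23,997","2017/03/05 09:00:44,073","2017/03/05 09:00:42,959",Completed',
])

# Hypothetical preamble pattern: the object line and the type line.
PREAMBLE = re.compile(r"^(?:Job\.Description|String,)")


def parse_skipping_preamble(text):
    """Drop preamble lines, then parse the rest with line 1 as the header."""
    kept = [line for line in text.splitlines() if not PREAMBLE.match(line)]
    return list(csv.DictReader(io.StringIO("\n".join(kept))))


rows = parse_skipping_preamble(RAW)
print(len(rows))            # 2
print(rows[0]["RunStart"])  # 2017/03/05 06:01:39,269
```

The header line survives because it starts with "Description", not "Job.Description", and the quoted timestamps keep their embedded commas.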

andrewtrobec · ‎03-08-2017

Thanks @woodcock

I've been experimenting but I can't get it to work. I've added PREAMBLE_REGEX = ^Job\.Description.*|String.* (which works on https://regex101.com/) and HEADER_FIELD_LINE_NUMBER = 1 but it doesn't seem to be working. I am performing a manual import, selecting MY_SOURCETYPE which is defined in my props.conf as follows:

[MY_SOURCETYPE]
AUTO_KV_JSON = 1
DATETIME_CONFIG = 
FIELD_DELIMITER = ,
HEADER_FIELD_LINE_NUMBER = 1
INDEXED_EXTRACTIONS = csv
KV_MODE = none
NO_BINARY_CHECK = true
PREAMBLE_REGEX = ^Job\.Description.*|String.*
SHOULD_LINEMERGE = false
TIMESTAMP_FIELDS = RunStart
category = Structured
description = Comma-separated value format. Set header and other settings in "Delimited Settings"
disabled = false
pulldown_type = true

Are there any other configurations that I should be aware of?

Best regards,

Andrew
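One detail of the pattern above: alternation binds more loosely than the anchor, so ^ applies only to the Job\.Description branch and String.* can match anywhere in a line. The Python sketch below compares it with a grouped variant. The grouped pattern and the "Check Connection String" row are hypothetical, made up for illustration, and Python's re module stands in for Splunk's engine. This is probably not why the import failed, since in the sample "String" appears only in the type line.

```python
import re

# Pattern as posted: the anchor covers only the first alternative.
AS_POSTED = re.compile(r"^Job\.Description.*|String.*")

# Hypothetical tightened variant: both alternatives anchored at line start.
ANCHORED = re.compile(r"^(?:Job\.Description|String,).*")

# Type line from the sample file.
type_line = "String,Integer,Integer,DateTime,DateTime,DateTime,enum.JobStatus"

# Made-up data row whose description happens to contain "String".
data_line = 'Check Connection String,0,null,"2017/03/05 09:02:23,997","2017/03/05 09:00:44,073","2017/03/05 09:00:42,959",Completed'

results = {
    "posted/type": AS_POSTED.search(type_line) is not None,
    "posted/data": AS_POSTED.search(data_line) is not None,
    "anchored/type": ANCHORED.search(type_line) is not None,
    "anchored/data": ANCHORED.search(data_line) is not None,
}
print(results)
```

The posted pattern flags the made-up data row as preamble, while the grouped one flags only the type line.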
