Can I skip specific lines while indexing data?

andrewtrobec
Motivator

Hello,

I am trying to index a csv log file that looks like this:

Description,NumJobWaitEvents,ReturnCode,RunEnd,RunStart,ScheduledStartTime,Status
Job.Description,Job.NumJobWaitEvents,Job.ReturnCode,Job.RunEnd,Job.RunStart,Job.ScheduledStartTime,Job.Status
String,Integer,Integer,DateTime,DateTime,DateTime,enum.JobStatus
Auto Start,0,null,"2017/03/05 06:03:39,441","2017/03/05 06:01:39,269","2017/03/05 06:01:39,065",Completed
Auto Start,0,null,"2017/03/05 06:09:04,493","2017/03/05 06:06:23,915","2017/03/05 06:06:23,743",Completed
AG43_542_TINA_CODE_AGB - Checking,1,null,"2017/03/05 06:32:18,908","2017/03/05 06:23:15,148","2017/03/05 06:23:14,822",Completed
DATA SANITY CHECK,0,null,"2017/03/05 09:02:23,997","2017/03/05 09:00:44,073","2017/03/05 09:00:42,959",Completed

The first line always contains the header, the second and third lines always contain object and type information, and the log data always starts from the fourth line.
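
To make the layout described above concrete, here is a minimal Python sketch (outside Splunk) that treats line 1 as the header, discards lines 2 and 3, and parses the rest. The name `read_jobs` and the shortened sample are illustrative assumptions, not part of the original post. Note that the quoted timestamps contain commas, so a CSV-aware parser is needed rather than a plain split on commas.

```python
import csv
import io

# Shortened copy of the sample from the question (two data rows only).
SAMPLE = '''Description,NumJobWaitEvents,ReturnCode,RunEnd,RunStart,ScheduledStartTime,Status
Job.Description,Job.NumJobWaitEvents,Job.ReturnCode,Job.RunEnd,Job.RunStart,Job.ScheduledStartTime,Job.Status
String,Integer,Integer,DateTime,DateTime,DateTime,enum.JobStatus
Auto Start,0,null,"2017/03/05 06:03:39,441","2017/03/05 06:01:39,269","2017/03/05 06:01:39,065",Completed
DATA SANITY CHECK,0,null,"2017/03/05 09:02:23,997","2017/03/05 09:00:44,073","2017/03/05 09:00:42,959",Completed
'''


def read_jobs(text):
    """Return one dict per data row, keyed by the header on line 1."""
    rows = csv.reader(io.StringIO(text))
    header = next(rows)  # line 1: field names
    next(rows)           # line 2: object information, discarded
    next(rows)           # line 3: type information, discarded
    return [dict(zip(header, row)) for row in rows]


jobs = read_jobs(SAMPLE)
for job in jobs:
    print(job["Description"], "|", job["RunStart"], "|", job["Status"])
```

This only illustrates the result being asked for; it does not show how Splunk itself processes the file.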

When I index the file as it is, it only indexes the first two lines even though there are thousands. My question is: how can I skip the second and third lines so I can index the actual log data?

Thank you and best regards,

Andrew

woodcock
Esteemed Legend

Try this:

[MY_SOURCETYPE]
FIELD_DELIMITER = ,
HEADER_FIELD_LINE_NUMBER = 1
INDEXED_EXTRACTIONS = CSV
PREAMBLE_REGEX = (^|[\r\n])(Job\.Description[^\r\n]+|String[^\r\n]+)
TIMESTAMP_FIELDS = RunStart
category = Structured
description = Comma-separated value format. Set header and other settings in "Delimited Settings"
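
As a rough sanity check of the PREAMBLE_REGEX above, the following Python sketch runs the same pattern against lines from the sample file. It uses Python's `re` module, not Splunk's regex engine, and it makes no claim about how or when Splunk applies the setting; the helper name `is_preamble` is an assumption made for illustration.

```python
import re

# Same pattern as the PREAMBLE_REGEX in the stanza above.
PREAMBLE = re.compile(r"(^|[\r\n])(Job\.Description[^\r\n]+|String[^\r\n]+)")

LINES = [
    "Description,NumJobWaitEvents,ReturnCode,RunEnd,RunStart,ScheduledStartTime,Status",
    "Job.Description,Job.NumJobWaitEvents,Job.ReturnCode,Job.RunEnd,Job.RunStart,Job.ScheduledStartTime,Job.Status",
    "String,Integer,Integer,DateTime,DateTime,DateTime,enum.JobStatus",
    'Auto Start,0,null,"2017/03/05 06:03:39,441","2017/03/05 06:01:39,269","2017/03/05 06:01:39,065",Completed',
]


def is_preamble(line):
    """True if the pattern matches when given a single line."""
    return PREAMBLE.search(line) is not None


# Case 1: the pattern is presented with one line at a time.
per_line = [is_preamble(line) for line in LINES]
print(per_line)

# Case 2: the pattern is presented with the whole text at once.
whole_text_hits = [m.group(2) for m in PREAMBLE.finditer("\n".join(LINES))]
print(len(whole_text_hits))
```

In both cases only the object line and the type line match; the header and the data line do not.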

Also, IMHO, events that are "durationful" (i.e. contain both start and end time details) should always use the end time as the timestamp. To give just one reason: think about what your timechart would look like if your system crashed and all events ended at the same time.
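
For reference, the start and end values in the sample can be parsed and compared as in the Python sketch below. The format string is an assumption inferred from the sample rows (a comma before the milliseconds), and `parse_ts` is a hypothetical helper. Following the suggestion above in props.conf would mean pointing TIMESTAMP_FIELDS at RunEnd instead of RunStart.

```python
from datetime import datetime

# Assumed layout, inferred from values such as "2017/03/05 06:03:39,441".
FMT = "%Y/%m/%d %H:%M:%S,%f"


def parse_ts(value):
    """Parse one RunStart/RunEnd value from the sample file."""
    return datetime.strptime(value, FMT)


# First data row of the sample.
run_start = parse_ts("2017/03/05 06:01:39,269")
run_end = parse_ts("2017/03/05 06:03:39,441")

duration = run_end - run_start
print(duration.total_seconds())
```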

andrewtrobec
Motivator

Thanks for the suggestion, but unfortunately it doesn't work. I think I see what you're getting at, though: you're trying to create one expression that covers both lines, right? I'm not too proficient with regexes.

I'll keep playing with it, thanks!

Andrew

woodcock
Esteemed Legend

Yes, and to make it flexible enough to work whether it is presented with the entire event or just a single line. That really should have done it.

andrewtrobec
Motivator

Just to make sure that I'm following the right procedure I'm going to list out the steps I've followed:

  1. Edit props.conf located in %SPLUNK_HOME%\etc\apps\MY_APP\local to contain

    [MY_SOURCETYPE]
    FIELD_DELIMITER = ,
    HEADER_FIELD_LINE_NUMBER = 1
    INDEXED_EXTRACTIONS = csv
    PREAMBLE_REGEX = (^|[\r\n])(Job\.Description[^\r\n]+|String[^\r\n]+)
    TIMESTAMP_FIELDS = RunStart
    category = Structured
    description = Comma-separated value format. Set header and other settings in "Delimited Settings"

  2. Restart splunkd from cmd: net stop splunkd, then net start splunkd

  3. Once up, log into Splunk (6.5.2 btw) and enter my app

  4. From the Settings menu, select Add Data

  5. Select upload

  6. Select the csv that contains the data above

  7. Select Next

  8. From the Source type list, select MY_SOURCETYPE

  9. At this point, the first two lines of the event list are as follows:

[screenshot: first two lines of the event list]

If the regex works as planned, would I see those two lines at that point?

Best regards,

Andrew

spisiakmi
Communicator

Hi Andrew,

Many greetings. We were colleagues and shared a lot of Splunk information. I have a problem very similar to the one you described. Did you solve it in the meantime?
I wish you all the best.
Michal Spisiak

woodcock
Esteemed Legend

If everything is working, you should not see those lines. HOWEVER, I have never used the Add Data wizard with INDEXED_EXTRACTIONS before.

dmaislin_splunk
Splunk Employee

Check out: http://docs.splunk.com/Documentation/Splunk/6.5.2/Admin/Propsconf

The section: Structured Data Header Extraction and configuration

PREAMBLE_REGEX =
* Some files contain preamble lines. This attribute specifies a regular
expression which allows Splunk to ignore these preamble lines, based on
the pattern specified.

andrewtrobec
Motivator

Thanks, I'll take a look. One question: will this allow me to read the first line as the headers and ignore only the second and third lines?

woodcock
Esteemed Legend

Yes, exactly.

andrewtrobec
Motivator

Thanks @woodcock

I've been experimenting but I can't get it to work, even after adding PREAMBLE_REGEX = ^Job\.Description.*|String.* (which works on https://regex101.com/) and HEADER_FIELD_LINE_NUMBER = 1. I am performing a manual import, selecting MY_SOURCETYPE, which is defined in my props.conf as follows:

[MY_SOURCETYPE]
AUTO_KV_JSON = 1
DATETIME_CONFIG = 
FIELD_DELIMITER = ,
HEADER_FIELD_LINE_NUMBER = 1
INDEXED_EXTRACTIONS = csv
KV_MODE = none
NO_BINARY_CHECK = true
PREAMBLE_REGEX = ^Job\.Description.*|String.*
SHOULD_LINEMERGE = false
TIMESTAMP_FIELDS = RunStart
category = Structured
description = Comma-separated value format. Set header and other settings in "Delimited Settings"
disabled = false
pulldown_type = true

Are there any other configurations that I should be aware of?

Best regards,

Andrew
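
One property of the pattern ^Job\.Description.*|String.* is worth checking: alternation binds more loosely than the anchor, so ^ applies only to the first branch and String.* can match anywhere in a line. The Python sketch below illustrates this with Python's `re` module rather than Splunk's engine. The data line and the stricter `ANCHORED` pattern are hypothetical, made up for this illustration, and nothing in the thread confirms that this is why the import fails.

```python
import re

# Pattern from the post above: "^" anchors only the first alternative.
POSTED = re.compile(r"^Job\.Description.*|String.*")

# Hypothetical stricter variant: both alternatives anchored at line start.
ANCHORED = re.compile(r"^(?:Job\.Description|String,).*")

TYPE_LINE = "String,Integer,Integer,DateTime,DateTime,DateTime,enum.JobStatus"

# Made-up data line whose description happens to contain "String".
HYPOTHETICAL_DATA = (
    'Export String Table,0,null,"2017/03/05 10:00:00,000",'
    '"2017/03/05 09:59:00,000","2017/03/05 09:58:59,000",Completed'
)

posted_hits_data = POSTED.search(HYPOTHETICAL_DATA) is not None
anchored_hits_data = ANCHORED.search(HYPOTHETICAL_DATA) is not None
anchored_hits_type = ANCHORED.search(TYPE_LINE) is not None

print(posted_hits_data, anchored_hits_data, anchored_hits_type)
```

On the sample rows in this thread, both patterns select the same two lines; the difference only shows up if a data row contains the text "String".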
