Getting Data In

Time extraction in CSV

redc
Builder

I have a CSV file that has a rather large number of fields (189, to be exact). The timestamp is in field #47 (CREATEDATE) and is in %Y%m%d format (field is limited to 8 numeric characters, so 20130101, for example). I have tried numerous TIME_FORMAT and TIME_PREFIX, extractions to no avail; I also can't seem to get Splunk (v 6.0) to recognize that the first line is the header when using HEADER_FIELD_LINE_NUMBER (which, if I understand correctly, would allow me to use TIMESTAMP_FIELDS to identify the timestamp as the CREATEDATE field rather than using a regex or anything like that).

The props.conf example below seems to be the setup that should work, or is closest to what should work, based on various other Answers posts about CSVs and timestamp extraction, such as:

What am I doing wrong?

props.conf

MAX_TIMESTAMP_LOOKAHEAD=8
NO_BINARY_CHECK=1
TIME_FORMAT=%Y%m%d
TIME_PREFIX=^([^,]*,){46}
# set by detected source type
INDEXED_EXTRACTIONS=csv
KV_MODE=none
SHOULD_LINEMERGE=false
pulldown_type=true

Sample data (including header - probably easier to see if you copy and paste into Excel and do a text-to-columns using only commas):

RECNUM,ORDERNUM,CUSTNUM,SUBTYPE,SHIPTO,PREFIX,FNAME,LNAME,SUFFIX,SALUTATION,SEX,TITLE,COMPANY,DEPT,STREET1,STREET2,CITY,STATE,ZIP,ZIP2,CARRT,CNTRYCODE,COUNTRY,TAXJR,PHONE,CELLPHONE,FAX,EMAIL,EMAIL2,USERID,PASSWORD,TAXABLE,RENTPOSTAL,RENTEMAIL,RENTPHONE,RENTFAX,BADSEED,DEMCODEA,DEMCODEB,DEMCODEC,DEMCODED,DEMDATA1,DEMDATA2,EXTCUSTID,SELFLAGS,CASSDATE,CREATEDATE,POSTALDATE,EMAILDATE,PHONEDATE,FAXDATE,SHIPCLASS,SHIPCOPIES,PUB,SUBDATE,RENEWALS,TRACK,CHANNEL,SALESPRSN,RENEWFLAG,GIFTCOMP,CANREASON,EXPIREDATE,TOTALDIC,TOTALDIA,BILLMETHOD,BILLDELAY,BILLDELDT,CBILLSENT,PREMFLAG,AUTORENEW,ORDDATE1,ORDBATCH1,STATUS1,SHIPCLASS1,COPIES1,TERM1,FREEISS1,SERVEFRST1,PREMIUM1,SERVICE1,PRICE1,TAX1,SHIPPING1,PAID1,PAYDATE1,PAYTYPE1,PAYNUM1,REFUND1,PONUM1,CONTROL1,REQTYPE1,SUBCLASS1,BILLSER1,BILLDATE1,BILLEFF1,RESPBILL1,RENSER1,ORDSOURCE1,ORDCHAN1,ORDSALES1,MISC1,AGENCY1,DISCRATE1,EXTORDID1,RENDATE1,RENEFF1,RESPRNUM1,SERVED1,LASTISS1,NEXTISS1,FRSTISSDT1,LASTISSDT1,ORDDATE2,ORDBATCH2,STATUS2,SHIPCLASS2,COPIES2,TERM2,FREEISS2,SERVEFRST2,PREMIUM2,SERVICE2,PRICE2,TAX2,SHIPPING2,PAID2,PAYDATE2,PAYTYPE2,PAYNUM2,REFUND2,PONUM2,CONTROL2,REQTYPE2,SUBCLASS2,BILLSER2,BILLDATE2,BILLEFF2,RESPBILL2,RENSER2,ORDSOURCE2,ORDCHAN2,ORDSALES2,MISC2,AGENCY2,DISCRATE2,EXTORDID2,RENDATE2,RENEFF2,RESPRNUM2,SERVED2,LASTISS2,NEXTISS2,FRSTISSDT2,LASTISSDT2,ORDDATE3,ORDBATCH3,STATUS3,SHIPCLASS3,COPIES3,TERM3,FREEISS3,SERVEFRST3,PREMIUM3,SERVICE3,PRICE3,TAX3,SHIPPING3,PAID3,PAYDATE3,PAYTYPE3,PAYNUM3,REFUND3,PONUM3,CONTROL3,REQTYPE3,SUBCLASS3,BILLSER3,BILLDATE3,BILLEFF3,RESPBILL3,RENSER3,ORDSOURCE3,ORDCHAN3,ORDSALES3,MISC3,AGENCY3,DISCRATE3,EXTORDID3
1,2968,839,S,,Mr.,Clyde,Barrow,,,M,,,,1 Hideaway Lane,,Dunaway,NM,88401,,,USA,U.S.A.,NM,,,,,,,,Y,Y,Y,Y,Y,,10,0,0,0,,,,,,20030202,,,,,,,AIC,20041115,0,AIC102,MAIL,,,,,SUMMER 05,"     0.00","   120.00",,10,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,20041115,160,A,F,1,4,0,L,,,"   240.00","     0.00","     0.00","     0.00",,,,"     0.00",,N,OTH,NQP,STD,20041125,1,,STDAIC,AIC102,MAIL,,,," 0.00",,,,,2,2,3,20041115,20041125,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2979,841,S,,Ms.,Bonnie,Parker,,,F,,,,15 Hideaway Lane,,Dunaway,NM,88401,,,USA,U.S.A.,NM,,,,,,,,Y,Y,Y,Y,Y,,10,0,0,0,,,,,,20030202,,,,,,,AIC,20041115,0,AIC102,MAIL,,,,,SUMMER 05,"     0.00","   120.00",,10,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,20041115,160,A,F,1,4,0,L,,,"   240.00","     0.00","     0.00","     0.00",,,,"     0.00",,N,OTH,NQP,STD,20041125,1,,STDAIC,AIC102,MAIL,,,," 0.00",,,,,2,2,3,20041115,20041125,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2981,1026,S,,Mr.,Francis,Robinson,,,M,,,,Tortoise Lane,,Shark Island,ZZ,,,,GN,GUINEA,,,,,,,,,Y,Y,Y,Y,Y,,10,0,0,0,,,,,,20030930,,,,,,,AIC,20041115,0,AIC101,MAIL,,,,,SUMMER 05,"   124.50","   124.50",,10,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,20041115,160,A,A,1,4,0,L,,,"   240.00","     0.00","     9.00","   249.00",20041115,C,,"     0.00",,N,OTH,NQP,STD,,,,STDAIC,AIC101,MAIL,,,," 0.00",,,,,2,2,3,20041115,20041125,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2990,861,S,,Mr.,Lou,Costello,,,M,,,,10 Laurel Drive,,Hoboken,NJ,07001,,,USA,U.S.A.,NJ,,,,,,,,Y,Y,Y,Y,Y,,10,0,0,0,,,,,,20030615,,,,,,,AIC,20041122,0,AIC101,MAIL,,,,,FALL 05,"   180.00","   180.00",,0,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,20041122,164,A,F,1,4,0,L,,,"   240.00","     0.00","     0.00","   240.00",20041122,C,,"     0.00",,N,OTH,NQP,STD,,,,STDAIC,AIC101,MAIL,,,," 0.00",,,,,1,2,3,20041125,20041125,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1 Solution

redc
Builder

The MAX_TIMESTAMP_LOOKAHEAD, according to the documentation, is "timestamp never extends more than ___ chars past the pattern", so it is from the end of TIME_PREFIX.

However, setting MAX_DAYS_AGO to 5000 days dealt with the issue.

View solution in original post

0 Karma

redc
Builder

The MAX_TIMESTAMP_LOOKAHEAD, according to the documentation, is "timestamp never extends more than ___ chars past the pattern", so it is from the end of TIME_PREFIX.

However, setting MAX_DAYS_AGO to 5000 days dealt with the issue.

0 Karma

sideview
SplunkTrust
SplunkTrust

My guess is that the culprit is

MAX_TIMESTAMP_LOOKAHEAD=8

this says that Splunk should look only 8 characters into each event and if it doesn't find a timestamp by that point, give up looking for one.

and HEADER_FIELD_LINE_NUMBER is designed for cases where you have a csv or tsv file or what-have-you where the header isn't technically on the first line but rather on the second or third. In most cases since it defaults to 0 you should just leave this unset and not think about it.

As to why INDEXED_EXTRACTIONS doesn't seem to be working, I'm not sure. Things look fine. Maybe the wonky timestamp was putting things in very wrong times and causing confusion.

0 Karma

sideview
SplunkTrust
SplunkTrust

Right. You also ALMOST CERTAINLY will be running into problems if you're tailing these files where it only indexes a couple of them at all. This is because your header is so big. You'll want to set something like:

initCrcLength = 5000

that will tell splunk to look 5000 chars into the start of each tailed file when it's determining whether it's looking at a rolled copy of a file already indexed, or a new file. The default initCrcLength value is far too low for this sourcetype and splunk will never check past the header row, so it'll conclude that all files after the first are dupes.

0 Karma

lukejadamec
Super Champion

To add to this, MAX_TIMESTAMP_LOOKAHEAD default is 150 characters, and you'll likely exceed this limit at times. MAX_TIMESTAMP_LOOKAHEAD is not from the end of the TIMESTAMP_PREFIX but from the start of the event.
MAX_DAYS_AGO defaults to 2000 days, and that is about 5.5 years, so your timestamps (2004 ish) will get ignored.

Get Updates on the Splunk Community!

Splunk Custom Visualizations App End of Life

The Splunk Custom Visualizations apps End of Life for SimpleXML will reach end of support on Dec 21, 2024, ...

Introducing Splunk Enterprise 9.2

WATCH HERE! Watch this Tech Talk to learn about the latest features and enhancements shipped in the new Splunk ...

Adoption of RUM and APM at Splunk

    Unleash the power of Splunk Observability   Watch Now In this can't miss Tech Talk! The Splunk Growth ...