I have a CSV file that has a rather large number of fields (189, to be exact). The timestamp is in field #47 (CREATEDATE) and is in %Y%m%d format (field is limited to 8 numeric characters, so 20130101, for example). I have tried numerous TIME_FORMAT and TIME_PREFIX, extractions to no avail; I also can't seem to get Splunk (v 6.0) to recognize that the first line is the header when using HEADER_FIELD_LINE_NUMBER (which, if I understand correctly, would allow me to use TIMESTAMP_FIELDS to identify the timestamp as the CREATEDATE field rather than using a regex or anything like that).
The props.conf example below seems to be the setup that should work, or is closest to what should work, based on various other Answers posts about CSVs and timestamp extraction, such as:
What am I doing wrong?
props.conf
MAX_TIMESTAMP_LOOKAHEAD=8
NO_BINARY_CHECK=1
TIME_FORMAT=%Y%m%d
TIME_PREFIX=^([^,]*,){46}
# set by detected source type
INDEXED_EXTRACTIONS=csv
KV_MODE=none
SHOULD_LINEMERGE=false
pulldown_type=true
Sample data (including header - probably easier to see if you copy and paste into Excel and do a text-to-columns using only commas):
RECNUM,ORDERNUM,CUSTNUM,SUBTYPE,SHIPTO,PREFIX,FNAME,LNAME,SUFFIX,SALUTATION,SEX,TITLE,COMPANY,DEPT,STREET1,STREET2,CITY,STATE,ZIP,ZIP2,CARRT,CNTRYCODE,COUNTRY,TAXJR,PHONE,CELLPHONE,FAX,EMAIL,EMAIL2,USERID,PASSWORD,TAXABLE,RENTPOSTAL,RENTEMAIL,RENTPHONE,RENTFAX,BADSEED,DEMCODEA,DEMCODEB,DEMCODEC,DEMCODED,DEMDATA1,DEMDATA2,EXTCUSTID,SELFLAGS,CASSDATE,CREATEDATE,POSTALDATE,EMAILDATE,PHONEDATE,FAXDATE,SHIPCLASS,SHIPCOPIES,PUB,SUBDATE,RENEWALS,TRACK,CHANNEL,SALESPRSN,RENEWFLAG,GIFTCOMP,CANREASON,EXPIREDATE,TOTALDIC,TOTALDIA,BILLMETHOD,BILLDELAY,BILLDELDT,CBILLSENT,PREMFLAG,AUTORENEW,ORDDATE1,ORDBATCH1,STATUS1,SHIPCLASS1,COPIES1,TERM1,FREEISS1,SERVEFRST1,PREMIUM1,SERVICE1,PRICE1,TAX1,SHIPPING1,PAID1,PAYDATE1,PAYTYPE1,PAYNUM1,REFUND1,PONUM1,CONTROL1,REQTYPE1,SUBCLASS1,BILLSER1,BILLDATE1,BILLEFF1,RESPBILL1,RENSER1,ORDSOURCE1,ORDCHAN1,ORDSALES1,MISC1,AGENCY1,DISCRATE1,EXTORDID1,RENDATE1,RENEFF1,RESPRNUM1,SERVED1,LASTISS1,NEXTISS1,FRSTISSDT1,LASTISSDT1,ORDDATE2,ORDBATCH2,STATUS2,SHIPCLASS2,COPIES2,TERM2,FREEISS2,SERVEFRST2,PREMIUM2,SERVICE2,PRICE2,TAX2,SHIPPING2,PAID2,PAYDATE2,PAYTYPE2,PAYNUM2,REFUND2,PONUM2,CONTROL2,REQTYPE2,SUBCLASS2,BILLSER2,BILLDATE2,BILLEFF2,RESPBILL2,RENSER2,ORDSOURCE2,ORDCHAN2,ORDSALES2,MISC2,AGENCY2,DISCRATE2,EXTORDID2,RENDATE2,RENEFF2,RESPRNUM2,SERVED2,LASTISS2,NEXTISS2,FRSTISSDT2,LASTISSDT2,ORDDATE3,ORDBATCH3,STATUS3,SHIPCLASS3,COPIES3,TERM3,FREEISS3,SERVEFRST3,PREMIUM3,SERVICE3,PRICE3,TAX3,SHIPPING3,PAID3,PAYDATE3,PAYTYPE3,PAYNUM3,REFUND3,PONUM3,CONTROL3,REQTYPE3,SUBCLASS3,BILLSER3,BILLDATE3,BILLEFF3,RESPBILL3,RENSER3,ORDSOURCE3,ORDCHAN3,ORDSALES3,MISC3,AGENCY3,DISCRATE3,EXTORDID3
1,2968,839,S,,Mr.,Clyde,Barrow,,,M,,,,1 Hideaway Lane,,Dunaway,NM,88401,,,USA,U.S.A.,NM,,,,,,,,Y,Y,Y,Y,Y,,10,0,0,0,,,,,,20030202,,,,,,,AIC,20041115,0,AIC102,MAIL,,,,,SUMMER 05," 0.00"," 120.00",,10,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,20041115,160,A,F,1,4,0,L,,," 240.00"," 0.00"," 0.00"," 0.00",,,," 0.00",,N,OTH,NQP,STD,20041125,1,,STDAIC,AIC102,MAIL,,,," 0.00",,,,,2,2,3,20041115,20041125,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2979,841,S,,Ms.,Bonnie,Parker,,,F,,,,15 Hideaway Lane,,Dunaway,NM,88401,,,USA,U.S.A.,NM,,,,,,,,Y,Y,Y,Y,Y,,10,0,0,0,,,,,,20030202,,,,,,,AIC,20041115,0,AIC102,MAIL,,,,,SUMMER 05," 0.00"," 120.00",,10,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,20041115,160,A,F,1,4,0,L,,," 240.00"," 0.00"," 0.00"," 0.00",,,," 0.00",,N,OTH,NQP,STD,20041125,1,,STDAIC,AIC102,MAIL,,,," 0.00",,,,,2,2,3,20041115,20041125,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2981,1026,S,,Mr.,Francis,Robinson,,,M,,,,Tortoise Lane,,Shark Island,ZZ,,,,GN,GUINEA,,,,,,,,,Y,Y,Y,Y,Y,,10,0,0,0,,,,,,20030930,,,,,,,AIC,20041115,0,AIC101,MAIL,,,,,SUMMER 05," 124.50"," 124.50",,10,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,20041115,160,A,A,1,4,0,L,,," 240.00"," 0.00"," 9.00"," 249.00",20041115,C,," 0.00",,N,OTH,NQP,STD,,,,STDAIC,AIC101,MAIL,,,," 0.00",,,,,2,2,3,20041115,20041125,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2990,861,S,,Mr.,Lou,Costello,,,M,,,,10 Laurel Drive,,Hoboken,NJ,07001,,,USA,U.S.A.,NJ,,,,,,,,Y,Y,Y,Y,Y,,10,0,0,0,,,,,,20030615,,,,,,,AIC,20041122,0,AIC101,MAIL,,,,,FALL 05," 180.00"," 180.00",,0,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,20041122,164,A,F,1,4,0,L,,," 240.00"," 0.00"," 0.00"," 240.00",20041122,C,," 0.00",,N,OTH,NQP,STD,,,,STDAIC,AIC101,MAIL,,,," 0.00",,,,,1,2,3,20041125,20041125,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
The MAX_TIMESTAMP_LOOKAHEAD, according to the documentation, is "timestamp never extends more than ___ chars past the pattern", so it is from the end of TIME_PREFIX.
However, setting MAX_DAYS_AGO to 5000 days dealt with the issue.
The MAX_TIMESTAMP_LOOKAHEAD, according to the documentation, is "timestamp never extends more than ___ chars past the pattern", so it is from the end of TIME_PREFIX.
However, setting MAX_DAYS_AGO to 5000 days dealt with the issue.
My guess is that the culprit is
MAX_TIMESTAMP_LOOKAHEAD=8
this says that Splunk should look only 8 characters into each event and if it doesn't find a timestamp by that point, give up looking for one.
and HEADER_FIELD_LINE_NUMBER
is designed for cases where you have a csv or tsv file or what-have-you where the header isn't technically on the first line but rather on the second or third. In most cases since it defaults to 0 you should just leave this unset and not think about it.
As to why INDEXED_EXTRACTIONS
doesn't seem to be working, I'm not sure. Things look fine. Maybe the wonky timestamp was putting things in very wrong times and causing confusion.
Right. You also ALMOST CERTAINLY will be running into problems if you're tailing these files where it only indexes a couple of them at all. This is because your header is so big. You'll want to set something like:
initCrcLength = 5000
that will tell splunk to look 5000 chars into the start of each tailed file when it's determining whether it's looking at a rolled copy of a file already indexed, or a new file. The default initCrcLength value is far too low for this sourcetype and splunk will never check past the header row, so it'll conclude that all files after the first are dupes.
To add to this, MAX_TIMESTAMP_LOOKAHEAD default is 150 characters, and you'll likely exceed this limit at times. MAX_TIMESTAMP_LOOKAHEAD is not from the end of the TIMESTAMP_PREFIX but from the start of the event.
MAX_DAYS_AGO defaults to 2000 days, and that is about 5.5 years, so your timestamps (2004 ish) will get ignored.