Getting Data In

How to index CSV data files with no timestamp that are more than 100K lines?

Builder

Hi,

We're currently indexing a number of CSV files that are all generated output from someone else's script. These files have no timestamps for "events" and are truly CSV data. Several of them have more than 100K lines, and since Splunk creates the timestamp at the time it reads the file, I regularly get:

WARN DateParserVerbose - The same timestamp has been used for 100K consecutive times. If more than 200K events have the same timestamp, not all events may be retrieveable. Context: source=/

Every file must be between 100K and 200K lines because, as far as I can tell, no events are unsearchable so far. I don't really expect these files to grow, but I think it would be good to plan as if they will.

This is Splunk 7.1, by the way.

I don't think I could track down the owner of the scripts generating these CSV data files and then get them to modify their scripts to add a fake but incrementing timestamp to each row. Even I wouldn't want to do that...

The only other solution I can think of would be to change to a scripted input that read the file and generated bogus timestamps as it read the rows into Splunk. I'm not crazy about that solution either.
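For reference, the scripted-input idea could look something like the sketch below. It is only an illustration under stated assumptions: the function names `stamp_rows` and `emit`, the `rows_per_second` cap, and the choice of a leading epoch-seconds field are all hypothetical, and it assumes one CSV row per physical line (no quoted fields containing newlines). A header row, if present, would be stamped like any other row.

```python
import sys
import time
from typing import Iterable, Iterator, TextIO


def stamp_rows(
    rows: Iterable[str],
    start_epoch: int,
    rows_per_second: int = 50_000,  # hypothetical cap, well under 100K
) -> Iterator[str]:
    """Prefix each row with a synthetic epoch timestamp.

    The timestamp advances by one second after every `rows_per_second`
    rows, so no single second is shared by more rows than that.
    """
    if rows_per_second < 1:
        raise ValueError("rows_per_second must be at least 1")
    for index, row in enumerate(rows):
        line = row.rstrip("\r\n")
        epoch = start_epoch + index // rows_per_second
        yield f"{epoch},{line}"


def emit(path: str, out: TextIO = sys.stdout) -> None:
    """Write the stamped rows of the CSV at `path` to `out`.

    A scripted input's standard output is what gets indexed, so this
    writes to stdout by default.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        for line in stamp_rows(handle, start_epoch=int(time.time())):
            out.write(line + "\n")
```

Calling `emit` with the path of one of the CSV files would print every row with a bogus but incrementing timestamp in front. The sourcetype would still need timestamp extraction configured for that leading field, which is not shown here.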

Is there some other potential solution that members of the community might suggest?

Thank you


Motivator

Hello,

You could try setting "DATETIME_CONFIG=CURRENT" in your props.conf. Doing so should prevent all events from a large file having the same timestamp.

I am wondering about your use case, though. Do you only need the latest version of each CSV file? If so, instead of indexing the CSV files you could place them in etc/apps/yourapp/lookups (or etc/apps/search/lookups or etc/system/lookups). Then in Splunk search you can retrieve them via:

| inputlookup file.csv

I haven't tested whether that works with very large files, though.


Builder

Wouldn't setting it to "CURRENT" just amount to hoping that the file isn't read and indexed too quickly? If there is, say, a 210K-line CSV file that Splunk sees and slurps up as a batch, it's going to go from the UF to the indexer pretty quickly. I wouldn't expect, say, 5 seconds of latency in there unless I had a few heavy forwarders between the UF and the indexer.

As it is, Splunk is wasting some time attempting to parse a timestamp out of each line. It fails, then uses the time it read the line. That time wasted would potentially increase the "spread" of time the events were seen even if it's very slight.

Unfortunately, they do a lot of delta calculations in their reports, so looking only at the current day's information isn't enough, and a lookup table wouldn't really work.

Thanks


Motivator

Yes, that means hoping that the file doesn't get indexed too quickly. I understand that this approach is anything but ideal.

The universal forwarder is conducting the input phase before sending the data to the heavy forwarder. So that will cause some delay.

For testing purposes, I just indexed a 100,000-line CSV file using DATETIME_CONFIG=CURRENT. I see roughly 10,000 events per second. You might want to test it in your environment.
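To put that rate next to the thresholds in the warning, here is a rough back-of-the-envelope sketch. It assumes a steady indexing rate and one-second timestamp resolution, neither of which is guaranteed in a real environment, and the function name `timestamp_spread` is made up for illustration.

```python
# Thresholds taken from the DateParserVerbose warning quoted in the question.
WARN_THRESHOLD = 100_000
LIMIT_THRESHOLD = 200_000


def timestamp_spread(total_lines: int, events_per_second: int) -> tuple[int, int]:
    """Return (seconds_spanned, max_events_sharing_one_second).

    Assumes events are stamped with the current time at a steady rate
    and that timestamps have one-second resolution.
    """
    if total_lines < 0 or events_per_second <= 0:
        raise ValueError("total_lines must be >= 0 and events_per_second > 0")
    # Integer ceiling division: how many distinct seconds the file spans.
    seconds = (total_lines + events_per_second - 1) // events_per_second
    per_timestamp = min(total_lines, events_per_second)
    return seconds, per_timestamp


seconds, per_timestamp = timestamp_spread(210_000, 10_000)
print(seconds, per_timestamp, per_timestamp < WARN_THRESHOLD)
```

Under those assumptions, a 210K-line file at 10,000 events per second would spread over 21 seconds with about 10,000 events per timestamp. The same file indexed faster than 200K events per second could still land on a single timestamp, which is why the rate is worth measuring in the actual environment.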


Builder

I was thinking of this mostly in the abstract so I forgot that DATETIME_CONFIG is already set to CURRENT for this particular file.

We have no Splunk infrastructure between the UF and the indexer(s), so there is no intermediate forwarder hop before events hit the indexer.
