topic Re: Why Splunk can't index very large csv files in Getting Data In

Why Splunk can't index very large csv files

bonnlbbelandres — Wed, 02 Aug 2017 13:20:52 GMT

I am loading data into my local Splunk Enterprise instance from a CSV file.
The file is very large, around 100 MB.

My CSV file contains the following number of events per month:
January: 36,055
February: 37,613
March: 41,521
April: 33,697
May : 39,980
June: 36,994
July: 31,963

After loading the file into Splunk, the index contains the following number of events per month:
January: 29,416
February: 32,042
March: 37,516
April: 33,458
May : 39,975
June: 15,935
July: 22,766

Note: My index usage is only 243MB/488.28GB

I tried cutting my csv file to only May June and July data and uploaded it to Splunk.
csv count:
May : 39,980
June: 36,994
July: 31,963

Splunk count:
May : 39,980
June: 36,994
July: 31,963

So this means I have no problem with the formatting of the timestamp in my csv file.
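
For reference, here is a minimal sketch of how per-month counts like the CSV-side ones above could be produced. The column name `timestamp` and the format `%Y-%m-%d %H:%M:%S` are assumptions for illustration only; the real file may use a different field name and timestamp format.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def count_events_by_month(csv_file, time_field="timestamp",
                          time_format="%Y-%m-%d %H:%M:%S"):
    """Count rows per calendar month.

    time_field and time_format are assumed values; adjust them to the real file.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        ts = datetime.strptime(row[time_field], time_format)
        counts[ts.strftime("%Y-%m")] += 1
    return counts


# Tiny in-memory sample standing in for the real 100 MB file.
sample = io.StringIO(
    "timestamp,value\n"
    "2017-05-01 00:00:00,a\n"
    "2017-05-02 00:00:00,b\n"
    "2017-06-01 00:00:00,c\n"
)
monthly_counts = count_events_by_month(sample)
for month, n in sorted(monthly_counts.items()):
    print(month, n)
```

For the real file, an open file handle (opened with `newline=""`) would be passed in place of the in-memory sample.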

Could you help me find the configuration that causes this truncation,
or at least help me work out how to investigate it?
I would appreciate any response on this matter.

Re: Why Splunk can't index very large csv files

DalJeanis — Wed, 02 Aug 2017 14:29:54 GMT

Hmmm. Those May and June numbers are bizarrely out of whack with the rest. May got near 100% indexed, and June about 43%. That's probably NOT a clue, but I'd keep it in mind while looking at everything else.
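
A quick way to see the ratio for each month is to divide the Splunk count by the CSV count. This sketch uses only the numbers posted in the question; the variable names are mine.

```python
# Counts copied from the question: events in the CSV vs. events found in Splunk.
csv_counts = {
    "January": 36055, "February": 37613, "March": 41521, "April": 33697,
    "May": 39980, "June": 36994, "July": 31963,
}
splunk_counts = {
    "January": 29416, "February": 32042, "March": 37516, "April": 33458,
    "May": 39975, "June": 15935, "July": 22766,
}

# Percentage of each month's events that made it into the index.
indexed_pct = {
    month: 100 * splunk_counts[month] / csv_counts[month]
    for month in csv_counts
}

for month, pct in indexed_pct.items():
    missing = csv_counts[month] - splunk_counts[month]
    print(f"{month}: {pct:.1f}% indexed, {missing} events missing")
```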

I'd run both loads again (the full file and the May-July subset), putting the results into two different temporary indexes. If the resultant load numbers for the full file are not identical to the first results, then I'd look at memory usage and so on.

Next, I'd diff the full results against the partial load results to see which records were dropped.
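
As a rough sketch of that diff, assuming the events from each load have already been exported to text with one raw event per line (how you export them is up to you), a multiset difference shows which records are missing. The function name and sample lines are made up for illustration.

```python
from collections import Counter


def find_dropped(complete_lines, incomplete_lines):
    """Return lines present in the complete load but missing from the other.

    Uses a multiset difference so that legitimately duplicated events
    are counted rather than collapsed.
    """
    complete = Counter(line.rstrip("\r\n") for line in complete_lines)
    incomplete = Counter(line.rstrip("\r\n") for line in incomplete_lines)
    return sorted((complete - incomplete).elements())


# Hypothetical sample data: the partial load kept everything, the full load did not.
partial_load = ["2017-06-01,a\n", "2017-06-02,b\n", "2017-06-02,b\n", "2017-06-03,c\n"]
full_load = ["2017-06-01,a\n", "2017-06-02,b\n"]

dropped = find_dropped(partial_load, full_load)
for line in dropped:
    print(line)
```

If the dropped records cluster around particular lines of the source file, that points at the data; if they look random, that points at the load process.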

Finally, I might set up two different sourcetypes, and set one to send any records before April 1 to the null queue, and the other to send any after March 31 to the null queue, and see whether they successfully loaded all the appropriate records.

The TRUNCATE setting in props.conf applies to each line, not to the file as a whole, so it's not relevant here.

See this answer for notes on the TRUNCATE setting:

https://answers.splunk.com/answers/80146/splunk-search-of-indexed-csv-file-does-not-pull-out-all-the-fields.html

max_mem_usage_mb in limits.conf affects searches, apparently not indexing, so that's probably not it.

Re: Why Splunk can't index very large csv files

woodcock — Wed, 02 Aug 2017 14:35:06 GMT

My suspicion is that you have a malformed CSV (missing/extra commas, merged lines, etc.). How are you sending this CSV to Splunk? Why are you not using it as a lookup instead (how often does it change)?
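
One way to check for that outside Splunk is to compare each row's field count against the header. This is only a sketch using Python's standard `csv` module; the sample data and function name are invented, and it assumes the first line of the file is the header.

```python
import csv
import io


def find_malformed_rows(csv_file):
    """Return (line_number, field_count) for rows whose field count
    differs from the header's. Assumes the first row is the header."""
    reader = csv.reader(csv_file)
    expected = len(next(reader))
    bad = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the physical line where this row ended.
            bad.append((reader.line_num, len(row)))
    return bad


# Hypothetical sample: line 3 is missing a field, line 4 has an extra one.
sample = io.StringIO(
    "time,host,value\n"
    "2017-06-01,a,1\n"
    "2017-06-02,b\n"
    "2017-06-03,c,3,extra\n"
)
bad_rows = find_malformed_rows(sample)
for line_num, n_fields in bad_rows:
    print(f"line {line_num}: {n_fields} fields")
```

An unbalanced quote usually shows up here as one row that swallows many physical lines, which would also explain whole stretches of events going missing.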