Why can't Splunk index very large CSV files?

bonnlbbelandres
Path Finder

I am using a CSV file to load data into my local Splunk Enterprise instance.
The file is very large, around 100 MB.

The CSV file contains the following number of events per month:
January: 36,055
February: 37,613
March: 41,521
April: 33,697
May: 39,980
June: 36,994
July: 31,963

After loading the file, Splunk shows the following number of events per month:
January: 29,416
February: 32,042
March: 37,516
April: 33,458
May: 39,975
June: 15,935
July: 22,766

Note: My index usage is only 243 MB of 488.28 GB.

I tried cutting the CSV file down to only the May, June, and July data and uploaded that to Splunk.
CSV count:
May: 39,980
June: 36,994
July: 31,963

Splunk count:
May: 39,980
June: 36,994
July: 31,963

So this suggests that the timestamp formatting in my CSV file is not the problem.

Could you help me find the configuration that causes this truncation?
Or at least help me work out how to investigate it?
I would appreciate any response on this.

woodcock
Esteemed Legend

My suspicion is that you have a malformed CSV (missing/extra commas, merged lines, etc.). How are you sending this CSV to Splunk? Why are you not using it as a lookup instead (how often does it change)?
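One way to test the malformed-CSV theory before touching Splunk is to check every row's field count against the header. Below is a minimal Python sketch of that idea. The function name `find_malformed_rows`, the column names, and the sample rows are made up for illustration, and it assumes the file has a header row and uses commas as the delimiter.

```python
import csv
import io

def find_malformed_rows(lines, delimiter=","):
    """Return (line_number, field_count) for each row whose field count
    differs from the header's. `lines` is any iterable of text lines,
    such as an open file object."""
    reader = csv.reader(lines, delimiter=delimiter)
    expected = len(next(reader))  # the header row defines the expected width
    bad = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the physical line where this row ended,
            # so quoted multi-line fields are still reported sensibly
            bad.append((reader.line_num, len(row)))
    return bad

# Hypothetical sample: one row is missing a field, one has an extra field.
sample = (
    "time,user,action\n"
    "2017-01-01 00:00:01,alice,login\n"
    "2017-01-02 00:00:02,bob\n"
    "2017-01-03 00:00:03,carol,logout,extra\n"
)
print(find_malformed_rows(io.StringIO(sample)))  # [(3, 2), (4, 4)]
```

For the real file, pass a file object opened with `open(path, newline="")` instead of the `StringIO` sample. Blank lines are reported too, since they parse as rows with zero fields.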

DalJeanis
Legend

Hmmm. Those May and June numbers are bizarrely out of whack with the rest. May got near 100% indexed, and June about 43%. That's probably NOT a clue, but I'd keep it in mind while looking at everything else.
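For reference, the per-month ratios can be worked out directly from the counts in the question. This is only a quick Python sketch of that arithmetic; the variable names are arbitrary.

```python
# Event counts copied from the original post.
csv_counts = {
    "January": 36055, "February": 37613, "March": 41521, "April": 33697,
    "May": 39980, "June": 36994, "July": 31963,
}
splunk_counts = {
    "January": 29416, "February": 32042, "March": 37516, "April": 33458,
    "May": 39975, "June": 15935, "July": 22766,
}

# Percentage of each month's CSV events that ended up indexed.
indexed_pct = {
    month: 100.0 * splunk_counts[month] / csv_counts[month]
    for month in csv_counts
}

for month, pct in indexed_pct.items():
    missing = csv_counts[month] - splunk_counts[month]
    print(f"{month}: {pct:.2f}% indexed, {missing} missing")
```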

I'd do the same thing again, putting the results into two different temporary indexes. If the resultant load numbers for the full file are not identical to the first results, then I'd look at memory usage and so on.

Next, I'd diff the full results against the partial load results to see which records were dropped.
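That diff could be done outside Splunk with something like the Python sketch below. It assumes you can export the raw events from each load to a text file with one event per line, and that each indexed event's raw text matches its CSV line exactly. Both are assumptions about your setup, and `dropped_records` is just an illustrative name.

```python
from collections import Counter

def dropped_records(source_lines, indexed_lines):
    """Return a Counter of lines present in source_lines but missing from
    indexed_lines. Duplicates are respected: a line that appears twice in
    the source but once in the index is reported with a count of 1."""
    source = Counter(line.rstrip("\r\n") for line in source_lines)
    indexed = Counter(line.rstrip("\r\n") for line in indexed_lines)
    return source - indexed  # Counter subtraction keeps only positive counts

# Hypothetical example with tiny in-memory inputs.
source = ["a,1\n", "b,2\n", "b,2\n", "c,3\n"]
indexed = ["a,1\n", "b,2\n"]
print(dropped_records(source, indexed))
```

A Counter is used rather than a set so that legitimately repeated rows in the CSV are not silently collapsed. Looking at where the dropped lines sit in the file (clustered together, or all following one odd row) may say more than the counts do.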

Finally, I might set up two different sourcetypes, and set one to send any records before April 1 to the null queue, and the other to send any after March 31 to the null queue, and see whether they successfully loaded all the appropriate records.

The TRUNCATE setting in props.conf applies to each line individually, so it isn't relevant here.

See this thread for notes on the TRUNCATE setting:

https://answers.splunk.com/answers/80146/splunk-search-of-indexed-csv-file-does-not-pull-out-all-the...


max_mem_usage_mb in limits.conf affects searches, apparently not indexing, so that's probably not it.
