Why can't Splunk index very large CSV files?

bonnlbbelandres
Path Finder

I am using a CSV file to load data into my local Splunk Enterprise instance.
The CSV file is very large, around 100 MB.

The CSV file contains the following event counts per month:
January: 36,055
February: 37,613
March: 41,521
April: 33,697
May: 39,980
June: 36,994
July: 31,963

After loading the file, Splunk reports the following event counts per month:
January: 29,416
February: 32,042
March: 37,516
April: 33,458
May: 39,975
June: 15,935
July: 22,766

Note: My index usage is only 243 MB of 488.28 GB.

I tried cutting my CSV file down to only the May, June, and July data and uploaded that to Splunk.
CSV count:
May: 39,980
June: 36,994
July: 31,963

Splunk count:
May: 39,980
June: 36,994
July: 31,963

So this means there is no problem with the timestamp formatting in my CSV file.

Could you help me find the configuration that causes this truncation?
Or at least help me work out how to investigate it?
I would appreciate any response on this matter.

woodcock
Esteemed Legend

My suspicion is that you have a malformed CSV (missing/extra commas, merged lines, etc.). How are you sending this CSV to Splunk? Why are you not using it as a lookup instead (how often does it change)?
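One way to test the malformed-CSV theory outside Splunk is to count the fields in every record and flag the ones that disagree with the header. The following is only a minimal sketch: the function name `find_malformed_rows`, the sample rows, and the file name in the comment are invented for illustration, and it assumes the first record is a header and that the file is comma-delimited.

```python
import csv
import io


def find_malformed_rows(lines, delimiter=","):
    """Return (expected_field_count, [(record_number, field_count), ...]).

    `lines` is any iterable of text lines, such as an open file.
    The header record defines the expected number of fields.
    Record numbers count parsed CSV records (header = 1), so a quoted
    field that spans several physical lines still counts as one record.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    bad = []
    for record_number, row in enumerate(reader, start=2):
        if len(row) != expected:
            bad.append((record_number, len(row)))
    return expected, bad


# Tiny in-memory demo with made-up rows. For a real file, something like:
#   with open("events.csv", newline="", encoding="utf-8") as f:  # hypothetical path
#       expected, bad = find_malformed_rows(f)
sample = io.StringIO(
    "time,user,action\n"
    "2017-06-01 10:00:00,alice,login\n"
    "2017-06-01 10:05:00,bob\n"
    "2017-06-01 10:06:00,carol,login,extra\n"
)
expected, bad = find_malformed_rows(sample)
print(expected, bad)
```

If the list of bad records is empty, the file is at least structurally consistent, and the cause is more likely on the Splunk side.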

DalJeanis
Legend

Hmmm. Those May and June numbers are bizarrely out of whack with the rest. May got near 100% indexed, and June about 43%. That's probably NOT a clue, but I'd keep it in mind while looking at everything else.
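For reference, the per-month share of events that made it into Splunk can be worked out directly from the counts in the question. This is just the arithmetic as a short sketch; the variable and function names are made up here.

```python
# Counts copied from the original question.
csv_counts = {
    "January": 36055, "February": 37613, "March": 41521, "April": 33697,
    "May": 39980, "June": 36994, "July": 31963,
}
splunk_counts = {
    "January": 29416, "February": 32042, "March": 37516, "April": 33458,
    "May": 39975, "June": 15935, "July": 22766,
}


def indexed_share(source, indexed):
    """Percentage of source events that were indexed, per month."""
    return {month: 100.0 * indexed[month] / source[month] for month in source}


shares = indexed_share(csv_counts, splunk_counts)
for month, pct in shares.items():
    missing = csv_counts[month] - splunk_counts[month]
    print(f"{month:<9} {pct:6.2f}% indexed, {missing} missing")
```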

I'd run both loads again (the full file and the May-July subset), putting the results into two different temporary indexes. If the resulting counts for the full file are not identical to the first results, then I'd look at memory usage and so on.

Next, I'd diff the full results against the partial load results to see which records were dropped.
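The diff could be done outside Splunk once both result sets are exported. The sketch below assumes each export is a sequence of raw event strings with one event per entry and identical text in both sets; the function name `dropped_records` and the sample values are hypothetical. It uses a counter rather than a plain set so that genuinely duplicated events are not hidden.

```python
from collections import Counter


def dropped_records(reference_records, suspect_records):
    """Return records found in the reference set but not in the suspect set.

    Duplicates are respected: if a record appears twice in the reference
    and once in the suspect set, it is reported once as dropped.
    """
    remaining = Counter(r.strip() for r in suspect_records)
    dropped = []
    for record in reference_records:
        key = record.strip()
        if remaining[key] > 0:
            remaining[key] -= 1   # matched one copy in the suspect set
        else:
            dropped.append(key)   # no unmatched copy left
    return dropped


# Made-up example: the partial load is the reference, the full load is suspect.
partial_load = ["a,1", "b,2", "b,2", "c,3"]
full_load = ["b,2", "a,1"]
print(dropped_records(partial_load, full_load))
```

Looking at what the dropped records have in common (same day, same hour, a particular field value) is usually more informative than the count alone.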

Finally, I might set up two different sourcetypes, and set one to send any records before April 1 to the null queue, and the other to send any after March 31 to the null queue, and see whether they successfully loaded all the appropriate records.
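A rough equivalent of that last test can also be run outside Splunk by splitting the CSV at the cutoff date before loading, instead of using null-queue routing. This is a different technique from the one described above, shown only as a sketch. The column name `timestamp`, the time format, the year, and the sample rows are all assumptions that would need to match the real file.

```python
import csv
import io
from datetime import datetime


def split_by_cutoff(lines, cutoff, time_field="timestamp",
                    time_format="%Y-%m-%d %H:%M:%S"):
    """Split CSV rows into (fieldnames, rows_before_cutoff, rows_from_cutoff_on).

    `time_field` and `time_format` are assumptions about the file layout.
    """
    reader = csv.DictReader(lines)
    before, after = [], []
    for row in reader:
        when = datetime.strptime(row[time_field], time_format)
        (before if when < cutoff else after).append(row)
    return reader.fieldnames, before, after


# Made-up sample; the year is invented for the demo.
sample = io.StringIO(
    "timestamp,user\n"
    "2017-03-31 23:59:59,alice\n"
    "2017-04-01 00:00:00,bob\n"
    "2017-06-15 12:00:00,carol\n"
)
fields, before, after = split_by_cutoff(sample, datetime(2017, 4, 1))
print(len(before), len(after))
```

Each half can then be written back out with `csv.DictWriter` and loaded separately to see whether either half loses events on its own.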


The TRUNCATE setting in props.conf applies to each line, so it's not relevant here.

See this thread for notes on the TRUNCATE setting:

https://answers.splunk.com/answers/80146/splunk-search-of-indexed-csv-file-does-not-pull-out-all-the...


max_mem_usage_mb in limits.conf affects searches, apparently not indexing, so that's probably not it.
