I want to ingest CSV files containing numeric data. Each file will have between 500 and 150,000 fields (yes, that's right, 150K :). The first line of the CSV will have the column names (headers). Each CSV file can have different field names and a different number of fields. If anyone is acquainted with esxtop batch mode output (from VMware hypervisors), you know what I'm talking about. I'm relatively new to Splunk, but what I eventually want to accomplish is to write a dashboard that can manipulate the data found in these fields. If you've seen the ESXplot Python tool (which I wrote), you will get the idea. Any help on how I might begin to look at this would be appreciated.
Not to be all, whatever, but 150k field names? How might these be generated or even used? I find it hard to imagine that there could not be some normalization performed on these to make them rather more manageable and meaningful.
Also, how many files? Are the files at all patterned by name, e.g., files under a certain path have one common set of fields, while files under another path share a different set?
As far as indexing the files, Splunk should be okay with that. All you might need is to increase the TRUNCATE setting in props.conf, which cuts off lines after 10,000 bytes by default. Splunk can handle lines of a few million characters at least, though I'm not sure how the UI will do in certain browsers.
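To pick a TRUNCATE value, it helps to know how long the widest line in a file actually is. Here is a minimal Python sketch that measures the longest line in bytes and prints a props.conf stanza with some headroom. The sourcetype name esxtop_csv, the helper names, and the 25% headroom are my own assumptions, not anything Splunk prescribes.

```python
import io


def longest_line_bytes(stream):
    """Return the length in bytes of the longest line in a binary stream."""
    longest = 0
    for line in stream:
        # Ignore the line terminator when measuring.
        longest = max(longest, len(line.rstrip(b"\r\n")))
    return longest


def truncate_stanza(sourcetype, max_bytes, headroom=1.25):
    """Build a props.conf stanza with TRUNCATE set above the observed maximum.

    The sourcetype name and the headroom factor are hypothetical choices.
    """
    limit = int(max_bytes * headroom)
    return "[{}]\nTRUNCATE = {}\n".format(sourcetype, limit)


# Small in-memory demo; for a real file use open(path, "rb") instead.
sample = io.BytesIO(b"a,b,c\n1,2,3\n10,20,30\n")
widest = longest_line_bytes(sample)
print(truncate_stanza("esxtop_csv", widest))
```

If you would rather not measure at all, setting TRUNCATE = 0 turns truncation off for that sourcetype, at the cost of losing the safety net against runaway lines.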
However, for pulling fields out, you can go a couple of ways. You can either have Splunk generate the field extraction configs from the file contents as it reads and indexes them, or you can generate them yourself into the props.conf and transforms.conf files.
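If you go the generate-it-yourself route, something along these lines could write the stanzas from a file's header line. This is only a sketch: it uses a delimiter-based search-time extraction (DELIMS and FIELDS in transforms.conf, tied to the sourcetype with a REPORT- line in props.conf). The sourcetype name, the transform name, and the rule for cleaning header text into field names are all my assumptions, and the sample header is made up.

```python
import csv
import io
import re


def clean_field_name(header):
    """Reduce a raw CSV header to letters, digits and underscores.

    This cleaning rule is an assumption; distinct headers can collide
    after cleaning, which a real script would need to detect.
    """
    name = re.sub(r"[^A-Za-z0-9]+", "_", header).strip("_")
    return name or "field"


def build_configs(header_line, sourcetype, transform_name):
    """Return (props_text, transforms_text) for one CSV header line."""
    headers = next(csv.reader(io.StringIO(header_line)))
    fields = ",".join('"{}"'.format(clean_field_name(h)) for h in headers)
    transforms = '[{}]\nDELIMS = ","\nFIELDS = {}\n'.format(transform_name, fields)
    props = "[{}]\nREPORT-{} = {}\n".format(sourcetype, transform_name, transform_name)
    return props, transforms


# Made-up header with counter-path style names, for illustration only.
header = r'"Time","\\hostA\Memory\Free MBytes","\\hostA\Cpu(0)\% Used"'
props_text, transforms_text = build_configs(header, "esxtop_csv", "esxtop_fields")
print(props_text)
print(transforms_text)
```

Since every file can have a different header, you would need one sourcetype (or source) stanza per distinct header layout, which is part of why I asked how the files are patterned.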
I have no idea at what point, if any, a large number of fields (and I am suspicious about the necessity and meaningfulness of a table with purportedly 150k fields) will cause either the generation of configs or the extraction of fields using those configs to fail.
If not all the fields are populated for all events, then it may make sense to use a Python input script to convert the CSV-style input into a list of key=value pairs instead. There are pros and cons to either approach, but it may be worth considering some kind of pre-processing.
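A rough sketch of that kind of pre-processing in Python is below. It reads the header, then emits one line of key="value" pairs per row and skips empty cells, so sparse rows stay short. The key-cleaning rule and the function names are my own assumptions, and it does not try to escape double quotes inside values, which should be fine for purely numeric data.

```python
import csv
import io
import re


def clean_key(header):
    """Turn a raw header into a key of letters, digits and underscores (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", header).strip("_") or "field"


def csv_to_kv(stream):
    """Yield one 'key="value" ...' line per CSV row, skipping empty cells."""
    reader = csv.reader(stream)
    keys = [clean_key(h) for h in next(reader)]
    for row in reader:
        pairs = [
            '{}="{}"'.format(key, value)
            for key, value in zip(keys, row)
            if value.strip() != ""
        ]
        if pairs:
            yield " ".join(pairs)


# Tiny in-memory example; a scripted input would read the real file
# and print each line to stdout for Splunk to index.
sample = io.StringIO("time,vm1 cpu,vm2 cpu\n10:00:00,12.5,\n10:00:02,,7.25\n")
for line in csv_to_kv(sample):
    print(line)
```

The upside is that no per-file extraction config is needed, since key=value pairs carry their own field names. The downside is that every event repeats the field names, so the indexed volume grows.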
This is the way performance data comes out of VMware ESX hypervisors. To see how it might be used, see www.durganetworks.com/esxplot. I wrote a Python program that allows the user to navigate through this "sea of data". I'd like to make a Splunkable version.
To answer your second question, each distinct file can have a different set of field names. Part of the field name will be the hostname of the machine writing it. If there are, say, 50 VMs running, there may be 100 to 1,000 metrics for each of those VMs. Under each field there will be data, but a 15-minute run can only have 750 samples, so you can end up with very wide but somewhat shallow CSV files.