So, say we had a dataset with 5 fields, 20 trillion rows.
I assume the csv file would be smaller when indexed, but maybe not?
More importantly, if I were to do a search over the whole index, something like "index=foo | stats sum(column3)", which would run faster, the key/value dataset or the csv dataset with a search-time field extraction?
For either question, is the difference negligible on such a large dataset?
Let me add a few thoughts to what Jrodman said.
I think the first and most important question is whether your field list will be completely static or will change over time. If it changes, the k=v option is the better and more flexible approach. If you have a few variations, it's best to make one sourcetype per set of fields.
Also keep in mind that with the k=v approach you can always do a direct search against "v3=c", which will be an indexed term in the kv data but NOT in the CSV data.
If you're new to splunk, I should point out that searching for "v3=c" (with quotes) is in fact different from searching for v3=c (without quotes). The first is an exact string match, and with normal segmentation rules it will be a faster lookup if the term c commonly occurs in your data outside the v3 field. The second is a field search, which requires field extraction at search time. Again, it all depends on the frequency of terms and how you need to search your data. This doesn't matter so much for pure extraction speed.
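To make that concrete, here are the two search forms side by side (using the field and value names from this thread; this sketch assumes the kv-formatted data and default segmentation):

```
index=foo "v3=c"    <- exact string match; can hit the indexed term directly
index=foo v3=c      <- field search; requires field extraction at search time
```

In the CSV-formatted data there is no literal string "v3=c" in the raw events, so the quoted form would find nothing there; only the field-search form would work.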
So if finding specific records is your goal, there are some advantages to the kv approach. But if your actual search requirements are more date-based, for example, and you're looking to do statistical analysis, then either approach will probably be about as fast, and the CSV approach may be more applicable for you. (Jrodman, do you know whether FIELDS-based extraction is faster or slower than the automatic kv extraction? It seems like CSV could be faster from a parsing perspective, but a more definitive answer on this would be great.)
Bottom line: if this is really such a big deal, then I would recommend testing out both methods. Create two temporary indexes and feed the same data into both (obviously formatted in your two different ways). Also be sure to make a different sourcetype for each test. Make sure that in the CSV sourcetype you disable automatic kv extraction in props.conf (KV_MODE = none) and rely only on a delimiter-based extraction in transforms.conf (you don't want the overhead of extracting KV pairs from your CSV-formatted sourcetype). Then compare the speeds. This shouldn't take too long to set up. Then pick the fastest, most ideal solution for you. (Note that comparing bucket sizes is tricky. You would want to rotate your buckets, manually run splunk-optimize, and then manually gzip all of the files in the rawdata folder that don't already end with `.gz` ... and there could be more involved in a truly fair comparison.)
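A minimal sketch of what the CSV sourcetype's configuration might look like (the stanza and field names here are hypothetical; the point is disabling automatic kv extraction with KV_MODE and defining a delimiter-based extraction instead):

```
# props.conf
[my_csv_test]
KV_MODE = none
REPORT-csvfields = my_csv_extract

# transforms.conf
[my_csv_extract]
DELIMS = ","
FIELDS = column1,column2,column3,column4,column5
```

The kv-formatted sourcetype needs no such stanza, since the automatic kv extraction handles it.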
For a speed test, you may also want to try uncompressing all your
.gz files in the
rawdata folder. I've heard this can sometimes improve performance, but it could also backfire. You'd have to try it to know for sure.
Also, if your data is really static (well defined and strongly typed), then you may be better off feeding your data into a database rather than splunk. Don't get me wrong, I love splunk, it does some really awesome stuff, but I strongly believe in using the right tool for the job. (Hope I don't get in trouble for saying this.) You can probably find a way to pull data back out of your database and feed it back into splunk if you want to leverage splunk's visualization features. It's certainly feasible to write a small (data-generating) search script to pull data from your database and use splunk to post-process or visualize it. Just a thought. There are lots of pros/cons for either approach, but only you know the whole picture.
Good point about the time-based inverted index stuff. My point is not to debate splunk vs relational DBs, just to note that there are options. (Sometimes people use a screwdriver to hammer in nails and then somehow think it's the screwdriver's fault when it's not optimal... not that this is a perfect analogy. But we certainly don't want people throwing out their screwdrivers because of an initial bad experience.)
Continuing... if your job has no time axis at all, then there might be more performant datastores, but sometimes splunk is still worth using for its interface, utility, accessibility, etc.
Regarding splunk vs nonsplunk, where splunk really shines is when you have some amount of interest in selecting the data based on time. This is nearly always true for our use cases. People may care about historical windows, but they very, very frequently care about the recent window -- data in the last day/hour/etc. In splunk there's no cost for having 80 terabytes of historical data when you only care about the most recent 40MB, which is very much not true in most other data stores.
Agreed, I considered talking about the other issues, but didn't. The stability of data identification is probably the much more important factor in this case, given that the performance difference is likely to be very small.
I'm not certain I know what you're asking. Is this a question of comparing storage for
a,b,c,d,e,f
versus
v1=a v2=b v3=c v4=d v5=e v6=f
on each line? Assuming so, my answer follows:
The first entry will require 6 postings in the tsidx files, while the second will require 12. These only cost a few bits apiece, perhaps on the order of a byte per event. I'm less familiar with how gzip (which is how we store the raw text) will behave over the two datasets; I suspect the second will again be slightly larger there too. The exact cost relative to your total dataset will depend on the frequency of terms, the length of events, and other factors. It's unlikely to be a huge difference, but you may want to measure both if total storage is very important to you. I would guess the difference would be single percentage points.
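To illustrate where the 6 vs. 12 postings come from (assuming default segmentation, where characters like `,` and `=` act as minor breakers, so each key and each value is indexed as its own term):

```
CSV event:   a,b,c,d,e,f
  indexed terms (6):   a b c d e f

k=v event:   v1=a v2=b v3=c v4=d v5=e v6=f
  indexed terms (12):  v1 a v2 b v3 c v4 d v5 e v6 f
```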
The smaller dataset will have a small advantage, with less data to pull off disk and less data to uncompress. That particular search takes no advantage of our term index (you are pulling every event for the time window), so it will be entirely proportional to the size of the rawdata files, in the rawdata directories of your buckets, if you want to undertake the measurement.