We have a fairly large indexing cluster. We tend to get about a 2x compression rate on our raw data; in other words, we can store about 2 GB of raw data plus indexes on 1 GB of disk. We were hoping for much better compression rates. Given that our data is text, we believe 5 to 10x is possible. It was suggested to us that the reason we are only getting 2x is the number of indexes created, and that we could reduce that number if we knew for a fact that some portion of the index is never used in searching.

For example, if we have a URL like "http://www.someurl/path1/path2/path3/path4" in our data, the indexing algorithm automatically stores each path segment in a separate index, and combinations of the segments as well. So we end up storing path1, path2, path1/path2, etc., in indexes. Yet if we know our logs are well defined, then folks won't be searching for these sorts of things; rather, they will search for URL=path1. Hence we could, if Splunk lets us, significantly reduce the size of the index. Is this possible? If so, is there any documentation on it?
I am kind of at a loss. Are your logs not broken by timestamp, but rather by pathing? So each segment of the path is broken by a transform into separate events?
The logs are broken by timestamp. Here is an example event:
Date=11-10-2012 00:00:00, URL="http://www.someurl/path1/path2/path3/path4"
The question is: how can I reduce the size/number of indexes created for this sourcetype? I'm using "index" here in the common sense (i.e., an index on a relational database table), whereas when Splunk uses the word "index" they mean a set of files, some of which are raw data files and some of which are index files.
I believe what you're talking about is Splunk's segmentation. You can tweak this to your liking in props.conf, but as you probably will have guessed, optimizing segmentation settings for storage efficiency will have an impact on performance. More information is available in the Splunk documentation on segmentation and on props.conf/segmenters.conf.
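As a starting point, a minimal sketch of what the props.conf change might look like, assuming one of Splunk's built-in segmenter stanzas is coarse enough for your needs (the sourcetype name "weblog" here is hypothetical, not from your logs):

```ini
# props.conf (sketch -- the sourcetype name "weblog" is hypothetical)
[weblog]
# Use the built-in "outer" segmenter: index only major segments,
# skipping minor sub-tokens such as individual URL path components.
SEGMENTATION = outer
```

Note that, as far as I know, segmentation settings only affect data indexed after the change takes effect; buckets that were already written keep their original segmentation.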
The Splunk concept you're after is segmentation. The segmentation documentation explains the general case, how Splunk does it, and leads you to other articles on adjusting its configuration.
Note: I could make a case for searching for a given token in the path, like the subdirectory "path3" as given above.
I should note that typically this is not necessary, as Splunk gets pretty good compression of data even when you factor in the size of the index files. I've seen data with low entropy compress at 19:1. YMMV.
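If the built-in segmenters are too coarse or too fine, segmenters.conf lets you define your own stanza of major/minor breakers and reference it from props.conf. A sketch, assuming you want "/" to stop producing minor sub-segments (the stanza name "url-coarse" is made up, and the breaker lists below are illustrative, not Splunk's exact defaults):

```ini
# segmenters.conf (sketch -- stanza name and breaker lists are illustrative)
[url-coarse]
# Major breakers split raw text into the top-level tokens that get indexed.
MAJOR = [ ] < > ( ) { } | ! ; , ' " \n \r \t &
# Leaving MINOR empty means tokens like the URL are not further split
# on characters such as "/" "=" ":" into indexed sub-segments.
MINOR =
```

You would then point the sourcetype at it in props.conf with `SEGMENTATION = url-coarse`. Test on a scratch index first and compare bucket sizes before and after, since this trades search flexibility for storage.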
I guess I could say the same thing about common nomenclature... At any rate, is there a name for the indexes inside a Splunk index? If there is, I'm happy to use it. Thanks for the answers below; I will investigate and get back to you.