I've been evaluating Splunk against a custom application that consists of a cluster of Tomcat instances running two separate applications (partially sharing classes), plus some front-end and back-end Apache servers. I imported a month's worth of log data (all at once) and have been playing around with it.
First, it seems that Splunk reduces the data to about half its original size. If possible, I'd like the indexing to be even more efficient, as a lot of the data in the logs is duplicate information.
Looking at how Splunk has processed the incoming data from the Tomcat application (log4j), it seems to have parsed only the timestamp and nothing further (it's possible I'm missing something), so a log line is pretty much processed as a string. I later used field extraction (from the search) to extract fields such as the log level, the actual Java class, etc., but the concept is still a bit foreign to me (despite reading through a lot of documentation).
This last point is especially confusing, as the documentation mentions that field extraction during import can actually increase the size of the indexes. If I have 500 Java classes producing a million log lines a day, wouldn't separating the class name from the bulk log line during indexing actually reduce the size of the indexes (especially if the alternative is having a stored search producing a dashboard out of the data anyway)?
Once you create your field extractions, either through the UI or in the conf files, the fields will be extracted automatically at search time. This requires no additional disk space.
You don't want to extract at index time: it makes the indexes bigger and indexing slower.
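To make the idea of search-time extraction more concrete, here is a minimal Python sketch (not Splunk code). The raw event stays stored as a plain string, and the fields are derived with a regular expression only when the event is read. The log4j layout, the sample event, and the names LOG4J_LINE and extract_fields are assumptions made up for illustration; in Splunk the same kind of regex would typically go into an EXTRACT- setting in props.conf.

```python
import re

# Assumed log4j layout (hypothetical): "%d{ISO8601} %-5p [%t] %c - %m"
LOG4J_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<logger>[\w.$]+)\s+-\s+"
    r"(?P<message>.*)$"
)


def extract_fields(raw_event):
    """Parse one raw event into a dict of fields; return {} if it does not match."""
    match = LOG4J_LINE.match(raw_event)
    return match.groupdict() if match else {}


# The stored data is just the raw string; nothing extra is written to disk.
raw = ("2013-02-11 09:15:42,123 ERROR [http-8080-3] "
       "com.example.shop.CartService - Checkout failed for order 4711")

# Fields only come into existence when the event is read ("search time").
fields = extract_fields(raw)
print(fields["level"], fields["logger"])
```

Because the fields are computed on demand, changing the regex later changes the fields for all existing events without re-indexing anything.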
To add a few points to what dmaislin wrote:
1) Splunk stores the data on disk in compressed flat files, hence the reduction in size compared with your uncompressed log4j source logs.
2) In props.conf you can use EXTRACT and REPORT to perform search-time field extraction rather than index-time field extraction.
3) By default, Splunk looks 128 characters into each log event (the MAX_TIMESTAMP_LOOKAHEAD setting in props.conf) to find a parseable timestamp to use as the indexed _time field.
I don't think nullQueue is what he's referring to, as that would remove the events completely.
Thanks for the answers!
Yeah, I don't think nullQueue is what I'm after, since I don't want to remove the data. I was thinking Splunk would work more like data deduplication: if a field has only a few possible values, each value would be stored once and then referred to by an index (thus reducing the amount of data stored on disk).
However, if this is not the case, then clearly extracting during the import will not improve compression, which was my main goal.
Splunk reduces the size of raw events by compression alone. Attempts to do field-based "deduplication" to save disk space would (likely) result in much lower indexing throughput. Splunk would have to do field extractions at index time (which, as dmaislin mentioned, is not ideal), then look for existing values of fields, update references, etc. At search time, it would have to do a lot more random I/O to fetch the values of the various fields in order to reassemble the original event. Practically speaking, it would be a lot of work for (perhaps) not a huge gain.
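One way to get a feel for this trade-off is to compare plain compression with field-level "deduplication" on synthetic data. The Python sketch below is only an illustration under stated assumptions: the log lines are made up, zlib stands in for whatever compression Splunk applies to its raw data, and replacing each class name with a short "#id" reference is a hypothetical form of deduplication. It prints the sizes rather than claiming a particular result, since real logs will behave differently.

```python
import random
import zlib

random.seed(0)  # deterministic synthetic data

# 500 hypothetical Java class names, as in the question.
classes = [f"com.example.app.module{i % 20}.Handler{i}" for i in range(500)]
levels = ["DEBUG", "INFO", "WARN", "ERROR"]

# One shared event sequence so both renderings describe the same events.
events = [(random.randrange(len(classes)), random.choice(levels), n)
          for n in range(5000)]


def render(events, use_ids):
    """Render events as log text, optionally replacing class names with ids."""
    lines = []
    for idx, level, n in events:
        logger = f"#{idx}" if use_ids else classes[idx]
        lines.append(
            f"2013-02-11 09:15:{n % 60:02d},{n % 1000:03d} {level} "
            f"[http-8080-{n % 10}] {logger} - Processed request {n}"
        )
    return "\n".join(lines).encode("utf-8")


raw_full = render(events, use_ids=False)  # class names stored in every line
raw_ids = render(events, use_ids=True)    # class names "deduplicated" to ids

z_full = zlib.compress(raw_full, 6)
z_ids = zlib.compress(raw_ids, 6)

print("raw, full names:       ", len(raw_full))
print("raw, ids only:         ", len(raw_ids))
print("compressed, full names:", len(z_full))
print("compressed, ids only:  ", len(z_ids))
```

The id-based variant would also need a lookup table and an extra step to rebuild the original line, which is the additional index-time and search-time work described above.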
In addition to the (compressed) raw events, Splunk also maintains a keyword index. Basically, each event is tokenized into a series of tokens based on common separators. Tokens are typically "words" (separated by spaces), things separated by commas, colons, slashes, or periods, etc. Each of these unique tokens is then stored in the keyword index, with a reference to each of the raw event(s) that contains it. This process of tokenization is called event segmentation and is documented at docs.splunk.com/Documentation/Splunk/latest/Admin/Configuresegmentationtomanagediskusage.
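The keyword index described here is essentially an inverted index. The following Python sketch shows the general technique only; it is not Splunk's actual segmentation logic. The separator set, the lowercasing, and the names SEPARATORS, tokenize and build_index are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical separator set: whitespace, commas, colons, slashes, periods,
# brackets, parentheses, equals signs and hyphens.
SEPARATORS = re.compile(r"[\s,:/.\[\]()=-]+")


def tokenize(event):
    """Split one raw event into a set of unique lowercase tokens."""
    return {tok.lower() for tok in SEPARATORS.split(event) if tok}


def build_index(events):
    """Map each unique token to the ids of the events that contain it."""
    index = defaultdict(set)
    for event_id, event in enumerate(events):
        for token in tokenize(event):
            index[token].add(event_id)
    return index


events = [
    "2013-02-11 09:15:42,123 ERROR [http-8080-3] com.example.shop.CartService - Checkout failed",
    "2013-02-11 09:15:43,001 INFO [http-8080-1] com.example.shop.CartService - Cart updated",
]

index = build_index(events)
print(sorted(index["cartservice"]))  # ids of events containing this token
print(sorted(index["error"]))
print(len(index), "unique tokens")
```

In this sketch, a keyword search only has to look up the token and then read the referenced events, and the index grows with the number of unique tokens the separator rules produce.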
You can't make Splunk store the raw events any smaller than their compressed form. But you can adjust the segmentation rules to make the keyword index smaller. This may give a net saving in disk space, at the cost of search performance and utility.
This is one of those areas where, unless you are extremely cost-constrained, "disk space is (relatively) cheap": a little extra disk usage matters less than search performance.