This could easily become a very long discussion of the differences between Splunk, which is built to index time-series, machine-generated data, and Lucene, which was originally designed to index human-generated text documents. Let's begin with your questions.
Splunk has no notion of stop words. By default, Splunk indexes all keywords found in events, as defined by the segmentation rules.
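To give a rough sense of where those segmentation rules live, here is a sketch of the relevant configuration; the stanza names and breaker lists below are illustrative, not the shipped defaults:

    # segmenters.conf -- controls how raw event text is broken into indexed terms
    # (illustrative stanza; values are not the defaults Splunk ships with)
    [my_segmenter]
    MAJOR = [ ] < > ( ) { } | ! ; ,
    MINOR = / : = @ . - $ # %

    # props.conf -- apply that segmenter to a hypothetical sourcetype
    [my_sourcetype]
    SEGMENTATION = my_segmenter

Major breakers split the event into terms, and minor breakers split those terms further, so every resulting keyword is indexed rather than filtered against a stop-word list.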
Splunk provides wildcard searches and phrase searches, but the index doesn't provide native proximity searches or regex searches. For those, we rely on subsequent commands in the search processing pipeline.
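As a quick illustration, in a search like the one below the first line (a trailing wildcard plus a quoted phrase) is resolved against the index, while the regex filter runs as a pipeline command over the retrieved events; the terms and pattern here are made up:

    error* "connection reset by peer"
    | regex _raw="session_id=\d{8,}"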
Splunk aggressively compresses the rawdata we store, and we put a lot of effort into keeping the indexes as small as possible through explicit compression and other low-footprint data structures. Typically, you can expect the compressed rawdata to be about 10% of the size of the original data and the index files 20-40%, depending on the entropy of the data. Together, Splunk typically requires 30-50% of the original raw data's size in storage.
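A quick worked example of those ratios, using round numbers and assuming the typical percentages above:

    1 TB of original data
      compressed rawdata   ~10%      ->  ~100 GB
      index files          ~20-40%   ->  ~200-400 GB
      total on disk        ~30-50%   ->  ~300-500 GB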
The index itself doesn't provide synonym support, since that's fundamentally a problem for human text. We do provide an analogous concept, however, in eventtypes, which can be used to represent meaningful classes of queries, including synonym-like groupings.
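For example, here is a hedged sketch of an eventtype that groups several "synonymous" phrasings of the same condition; the stanza name and search terms are hypothetical:

    # eventtypes.conf -- one named class covering several equivalent phrasings
    [failed_login]
    search = "failed password" OR "authentication failure" OR "invalid user"

Once defined, a search such as eventtype=failed_login | stats count by host matches any of those phrasings without the user having to remember each variant.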