What are the merged_lexicon.lex files in my buckets?

Splunk Employee

What is the purpose of these files? Some get to be quite large:

1179300775 Feb 9 16:45 merged_lexicon.lex

1 Solution

Splunk Employee

These files are part of the search index. They are mostly used to support typeahead. I would not consider them large. They are usually quite a bit smaller than the .tsidx files that constitute the main part of the index.

Champion

If you don't need typeahead and are looking to save some space on your Splunk partition, deleting these files can save you about 10% on your total index size.

Champion

Apparently they can take up anywhere from 5% to 20% of the total index size.

Splunk Employee

I think there's been some optimization to the merged_lexicon files. They're currently under 5% for me.

Splunk Employee

In a bit more detail, a tsidx file consists of two parts: a lexicon and a set of postings. The lexicon is a list of terms in alphabetical order, each followed by a pointer to its posting list. The posting list maps that term to the events (in the rawdata files) that contain it.

So essentially you have something like this:

tsidxfile 1:
lexicon:  a   b   c
          |   |   |
          v   v   v
postings: 2 4|1 5|2

tsidxfile 2: (smaller)
lexicon:  d
          |
          v
postings: 2 8
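
As a rough illustration of the two diagrams, here is a toy in-memory model in Python. It is only a sketch and not Splunk's actual on-disk format: the `TsidxFile` class and its `lookup` method are hypothetical names, and the event numbers are simply copied from the diagrams.

```python
from bisect import bisect_left

class TsidxFile:
    """Toy model of one tsidx file: a sorted lexicon plus posting lists."""

    def __init__(self, postings_by_term):
        # Lexicon: the terms in alphabetical order.
        self.lexicon = sorted(postings_by_term)
        # Postings: one list of event numbers per term, in lexicon order.
        self.postings = [postings_by_term[t] for t in self.lexicon]

    def lookup(self, term):
        """Return the events containing term, or [] if the term is absent."""
        i = bisect_left(self.lexicon, term)  # binary search in the sorted lexicon
        if i < len(self.lexicon) and self.lexicon[i] == term:
            return self.postings[i]
        return []

# The two files from the diagrams (hypothetical in-memory stand-ins).
tsidx1 = TsidxFile({"a": [2, 4], "b": [1, 5], "c": [2]})
tsidx2 = TsidxFile({"d": [2, 8]})

print(tsidx1.lookup("b"))  # [1, 5]
print(tsidx2.lookup("b"))  # []
```

Keeping the lexicon sorted is what makes the binary search possible; the position found in the lexicon plays the role of the pointer into the postings.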

The lexicon tells us what terms exist and the postings tell us where to find them. However, we have to look in every tsidx file to find out all the terms. So if there are 20 tsidx files and you type in 'gromblhyozorktooks', which doesn't exist, splunkd has to open all 20 tsidx files to figure out you're crazy.
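
To make the cost concrete, here is a small Python sketch of that worst case. It assumes each tsidx file can be modelled as a plain dict from term to event numbers; `search_all` and the 20 generated files are hypothetical, not anything splunkd actually exposes.

```python
def search_all(tsidx_files, term):
    """Check every file's lexicon for term; return (hits, files_opened)."""
    hits = {}
    opened = 0
    for name, lexicon in tsidx_files.items():
        opened += 1  # each file has to be opened to consult its lexicon
        if term in lexicon:
            hits[name] = lexicon[term]
    return hits, opened

# 20 hypothetical tsidx files, each holding one distinct term.
tsidx_files = {f"tsidx{i}": {f"term{i}": [i]} for i in range(20)}

hits, opened = search_all(tsidx_files, "gromblhyozorktooks")
print(hits, opened)  # {} 20
```

Even though the term exists nowhere, all 20 files are touched before the search can say so.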

The merged_lexicon.lex is just a file that contains all the lexicons, which are much smaller than the full tsidx files. It looks more like this:

a b c d 

This allows typeahead to answer its questions much more quickly (what terms exist), and allows negative lookups to fail much faster. The typical case for this is that some buckets have your term, and some do not, so the merged lexicon allows buckets to be completely ruled out much faster.
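
A minimal Python sketch of the idea, assuming the merged lexicon can be treated as one sorted list of terms. The function names (`merge_lexicons`, `term_exists`, `typeahead`) and the sample terms are made up for illustration and say nothing about the real file layout.

```python
from bisect import bisect_left

def merge_lexicons(lexicons):
    """Combine per-file term lists into one sorted, de-duplicated list."""
    return sorted(set().union(*lexicons))

def term_exists(merged, term):
    """Negative lookups fail after one binary search, with no tsidx file opened."""
    i = bisect_left(merged, term)
    return i < len(merged) and merged[i] == term

def typeahead(merged, prefix):
    """Return all terms starting with prefix, relying on the sort order."""
    i = bisect_left(merged, prefix)
    matches = []
    while i < len(merged) and merged[i].startswith(prefix):
        matches.append(merged[i])
        i += 1
    return matches

merged = merge_lexicons([["a", "b", "c"], ["d"]])
print(merged)                                     # ['a', 'b', 'c', 'd']
print(term_exists(merged, "gromblhyozorktooks"))  # False

# Hypothetical terms to show prefix completion.
words = merge_lexicons([["error", "event"], ["err", "warn"]])
print(typeahead(words, "er"))                     # ['err', 'error']
```

Only when `term_exists` returns True would the individual tsidx files need to be opened to fetch the postings.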

Engager

Isn't what you just described a bloomfilter file, and not a lexicon?

