Solved: What are the merged_lexicon.lex files in my bucket...

Chris_R_ · ‎03-04-2010

What is the purpose of these files? Some get to be quite large:

1179300775 Feb 9 16:45 merged_lexicon.lex

gkanapathy · ‎03-05-2010

These files are part of the search index. They are mostly used to support typeahead. I would not consider them large. They are usually quite a bit smaller than the .tsidx files that constitute the main part of the index.


the_wolverine · ‎03-06-2010

If you don't need typeahead and are looking to save some space on your Splunk partition, deleting these files can save you about 10% on your total index size.

the_wolverine · ‎03-30-2010

Apparently they can take up anywhere from 5% to 20% of the total index size.

jrodman · ‎03-11-2010

I think there's been some optimization to the merged_lexicon files. They're currently under 5% for me.

jrodman · ‎03-05-2010

In a bit more detail, a tsidx file consists of two parts: a lexicon and a set of postings. The lexicon is a list of terms in alphabetical order, each followed by a pointer to its posting list. The posting list for a term records which events (in the rawdata files) contain that term.

So essentially you have something like this:

tsidxfile 1:
lexicon:  a  b  c
          |  |  | 
          |  |  +-+
          |  ++   |
          V   v   v
postings: 2 4|1 5|2

tsidxfile 2: (smaller)
lexicon:  d
          |
          V
postings: 2 8

The lexicon tells us what terms exist and the postings tell us where to find them. However, we have to look in every tsidx file to find out all the terms. So if there are 20 tsidx files and you type in 'gromblhyozorktooks', which doesn't exist, splunkd has to open all 20 tsidx files to figure out you're crazy.
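As a toy illustration of that point, here is a rough Python sketch. It is a simplified in-memory model, not Splunk's actual tsidx format, and the names `ToyTsidx` and `search_all` are made up for the example. Each file is a sorted lexicon plus posting lists, and without anything else to consult, a search has to probe every file even when the term exists nowhere:

```python
from bisect import bisect_left


class ToyTsidx:
    """Hypothetical stand-in for one tsidx file: sorted lexicon + posting lists."""

    def __init__(self, postings_by_term):
        # Lexicon: the terms, in alphabetical order.
        self.lexicon = sorted(postings_by_term)
        # Postings: one sorted list of event ids per term, in lexicon order.
        self.postings = [sorted(postings_by_term[t]) for t in self.lexicon]

    def lookup(self, term):
        """Binary-search the lexicon; return the event ids, or [] if absent."""
        i = bisect_left(self.lexicon, term)
        if i < len(self.lexicon) and self.lexicon[i] == term:
            return self.postings[i]
        return []


def search_all(tsidx_files, term):
    """Probe every file for the term; also report how many files were opened."""
    files_opened = 0
    hits = []
    for f in tsidx_files:
        files_opened += 1
        hits.extend(f.lookup(term))
    return hits, files_opened


# The two files from the diagram above.
file1 = ToyTsidx({"a": [2, 4], "b": [1, 5], "c": [2]})
file2 = ToyTsidx({"d": [2, 8]})

hits, opened = search_all([file1, file2], "gromblhyozorktooks")
print(hits, opened)
```

Even for a term that matches nothing, `opened` equals the number of files, which is the cost described above.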

The merged_lexicon.lex is just a file that contains all the lexicons, which are much smaller than the tsidx files themselves. It looks more like this:

a b c d

This allows typeahead to answer its questions much more quickly (what terms exist), and allows negative lookups to fail much faster. The typical case for this is that some buckets have your term, and some do not, so the merged lexicon allows buckets to be completely ruled out much faster.
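To make that concrete, here is a minimal Python sketch of the idea. It assumes the merged lexicon can be modeled as one sorted, de-duplicated list of terms per bucket; the real on-disk layout of merged_lexicon.lex is not described here, and the function names are hypothetical. A single binary search answers "does this bucket contain the term at all?", and a prefix scan from the same position gives typeahead candidates:

```python
from bisect import bisect_left


def build_merged_lexicon(lexicons):
    """Merge the per-file term lists into one sorted, de-duplicated list."""
    return sorted(set().union(*lexicons))


def has_term(merged, term):
    """Negative lookups fail after one binary search, with no tsidx file opened."""
    i = bisect_left(merged, term)
    return i < len(merged) and merged[i] == term


def typeahead(merged, prefix, limit=10):
    """Return up to `limit` terms that start with `prefix`, in sorted order."""
    matches = []
    i = bisect_left(merged, prefix)
    while i < len(merged) and len(matches) < limit and merged[i].startswith(prefix):
        matches.append(merged[i])
        i += 1
    return matches


# Lexicons of the two example tsidx files in one bucket.
merged = build_merged_lexicon([["a", "b", "c"], ["d"]])
print(merged)
print(has_term(merged, "gromblhyozorktooks"))  # bucket can be ruled out

# A slightly more realistic (made-up) set of terms for typeahead.
terms = build_merged_lexicon([["error", "host"], ["errno", "http", "host"]])
print(typeahead(terms, "err"))
```

If `has_term` is false for a bucket, that bucket can be skipped entirely, which is the "ruled out much faster" case.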

pradeepkr13 · ‎02-20-2018

Isn't what you just described a bloomfilter file, and not a lexicon?
