Solved: Large lookup files in a distributed environment: how can we force the .tsidx indexes to build on all search heads?

rsouth · ‎04-20-2017

Splunk automagically builds .tsidx index files for large lookup files.
The build is triggered the first time someone runs a query against the large lookup.
For really large lookups (MB to GB) the .tsidx build takes a while, so we schedule reports in the early morning to force Splunk to build the files ahead of time.

Here's the problem: in a distributed environment, that appears to build the .tsidx files (or build them correctly) on only one of the search heads. I haven't done enough testing to tell whether this happens all the time or only sometimes.

Is this a bug which I should report? Is this the expected behavior? I'm not sure what the expected behavior is for sharing the .tsidx files/indexes.

If it is expected behavior, is there a way to force the "prebuild" on each of the search heads?
Right now I'm remotely logging into each server, running splunk as localhost, and running the query which forces the .tsidx build. Not ideal.

I'm considering KVStore but the regular Lookup files appear to handle queries on "non-key" fields better. These Lookups have a variety of fields people may be interested in searching/looking-up on.

sowings · ‎04-20-2017

Because you've indicated that you're working within a Search Head Cluster (SHC), the options for this aren't exactly straightforward. The scheduled search you've built to trigger the rebuild of the lookup table is dispatched to one member of the SHC, not all of them. This is expected behavior, and it matches what you're observing in your environment. An alternative to the "login by hand" option is to trigger the search remotely on each member via REST / curl. I always refer back to this answer for triggering saved searches.
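
As a rough illustration of the REST route, here is a minimal Python sketch that POSTs to the saved search `dispatch` endpoint on each search head's management port in turn. The host names, the saved search name, the app/owner namespace, port 8089 and basic authentication are all assumptions for illustration; substitute whatever your deployment uses. Whether a dispatch sent directly to one member always builds the lookup index on that same member is worth verifying in your own environment.

```python
import base64
import ssl
import urllib.parse
import urllib.request


def dispatch_url(host, saved_search, app="search", owner="nobody", port=8089):
    """Build the REST URL that dispatches a saved search on one search head."""
    name = urllib.parse.quote(saved_search, safe="")
    return (
        f"https://{host}:{port}/servicesNS/{owner}/{app}"
        f"/saved/searches/{name}/dispatch"
    )


def build_request(host, saved_search, username, password, **kwargs):
    """Prepare an authenticated POST request for the dispatch endpoint."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        dispatch_url(host, saved_search, **kwargs),
        data=b"",  # empty form body; the endpoint expects a POST
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


def dispatch_on_all(hosts, saved_search, username, password, verify_tls=True):
    """Dispatch the saved search on every host; map host -> HTTP status or error."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Assumption: only needed if the management port uses a self-signed cert.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    results = {}
    for host in hosts:
        request = build_request(host, saved_search, username, password)
        try:
            with urllib.request.urlopen(request, context=context, timeout=30) as resp:
                results[host] = resp.status
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            results[host] = str(exc)
    return results
```

A call might look like `dispatch_on_all(["sh1.example.com", "sh2.example.com"], "Prebuild big lookup", "admin", "changeme")`, where every value is a placeholder. Run from cron or any scheduler, this would replace logging into each server by hand.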

Another complication is that the lookup table is (typically) distributed down to the indexers, too, so that they can perform the enrichment in the "Map" part of "map reduce", meaning that they're doing work on behalf of the search heads. This may also trigger that "create index file from large lookup" behavior on those hosts.

You might consider the KV store as an approach. The lookup abstraction will still play nice, and the KV store members will handle the replication for you.

However, since you've indicated that some of your lookups are GB-sized, I might cherry-pick the fields that are most commonly used and build a smaller lookup from those. In the worst case you can double-jump: look up against the small table first, then use its result as the key into another lookup.
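
To make the cherry-picking idea concrete, here is a minimal Python sketch that copies only selected columns from a large lookup CSV into a smaller one. The field names in the usage note are hypothetical, and this runs outside Splunk as a plain file transformation; a scheduled search that writes the slimmed table would be another way to get the same result.

```python
import csv


def slim_lookup(src, dst, keep_fields):
    """Copy only keep_fields from the CSV stream src to dst; return rows written."""
    reader = csv.DictReader(src)
    missing = [f for f in keep_fields if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"fields not in source lookup: {missing}")
    # extrasaction="ignore" silently drops every column not listed in keep_fields.
    writer = csv.DictWriter(dst, fieldnames=keep_fields, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for row in reader:  # streams row by row, so GB-sized files are not loaded whole
        writer.writerow(row)
        count += 1
    return count
```

With real files you would open both with `newline=""`, for example `slim_lookup(open("big_lookup.csv", newline=""), open("slim_lookup.csv", "w", newline=""), ["host", "owner"])`, where the file and field names are placeholders. One of the kept fields can then serve as the key into the full lookup for the double-jump.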

sowings · ‎04-20-2017

Sounds like you're using a search head cluster. Can you confirm?

rsouth · ‎04-20-2017

Yes, we have 4 Search Heads in a Search Head Cluster.
