Splunk Search

How could I optimize distributed replication of large lookup tables

Dan
Splunk Employee

Say I have a distributed environment with 1 search head and 4 indexers. On the search head, I am updating a lookup table as often as every 5 minutes (a MAC-to-IP lookup generated from the latest DHCP logs). The lookup tables can approach 500MB in size.

Every time the lookup is updated, it takes 1-2 minutes to replicate to the indexers (local 100 Mbps connection). During this time the UI on the search head is completely unresponsive. Is this a bug?

What are my options for minimizing the replication task?

1 Solution

the_wolverine
Champion

Dan,

This is a known bug where large bundles (lookup tables) in a distributed search environment cause searches to take a long time to start. In version 4.1.4 we changed the way we handle these and included some config parameters that significantly improve performance.

(SPL-30907)


rphillips_splun
Splunk Employee

example config:

You can also use [replicationBlacklist] to reduce the size of the knowledge bundle. Since bin directories, jar files, and lookup files do not need to be replicated to search peers, you can blacklist these in distsearch.conf.
on each Search Head:

$SPLUNK_HOME/etc/system/local/distsearch.conf
[replicationBlacklist]
noBinDir = (.../bin/*)
nojavabin = apps/splunk_archiver/java-bin/...

Note: Blacklist settings will override Whitelist settings

haraksin
Path Finder

Is there a source for bin directories and lookup files not being needed in the knowledge bundle? I thought they were needed for search commands and non-automatic lookups. Am I wrong?


Dan
Splunk Employee

In which I offer more work-arounds for SPL-30907:

First, let me describe the current state:

The search head is responsible for making sure that the search peers used during a search have an up-to-date set of configuration files (bundles). Each search peer has two bundle stores:
- local bundles ($SPLUNK_HOME/etc) - used for indexing and for searches originating from the peer itself
- search head bundles ($SPLUNK_HOME/var/run/searchpeers/) - used during searches dispatched by another machine

During search dispatch the search head (in a rate-controlled manner) contacts all the search peers to determine their bundle version, identified by 8 bytes of the MD5 of the tarred bundles shipped by the search head. If the versions differ, the search head uploads the most current bundle version. While bundle replication/synchronization is in progress, all searches are blocked (by default; 4.1.4 introduces async bundle replication).
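The version check described above can be sketched roughly like this (a minimal illustration; the exact tar layout and digest truncation are internal to Splunk, so treat the 8-byte hex version string as an assumption):

```python
import hashlib

def bundle_version(bundle_tar: bytes) -> str:
    # Identify a bundle by 8 bytes of the MD5 of its tarred
    # contents, as described above (truncation detail assumed).
    return hashlib.md5(bundle_tar).digest()[:8].hex()

def needs_upload(current_bundle_tar: bytes, peer_reported_version: str) -> bool:
    # The search head re-uploads only when the peer's reported
    # version differs from its own current bundle version.
    return bundle_version(current_bundle_tar) != peer_reported_version
```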
The major problem with the current implementation is that a small config change can cause the entire etc/ directory, which can be quite large (primarily due to large lookup files), to be shipped to all peers. Another big issue is that the search head URI-encodes the contents (which can triple the upload size), only for the peer to decode it again.

Now for some details on SPL-30907: it is resolved, is waiting for the 4.1.4 release, and contains the following changes to the replication protocol:
  1. thread pool vs. serial replication: 48% improvement
  2. no compression: 58% improvement
  3. new encoding: 17% improvement
  4. encoding cache: 11% improvement
  5. asynchronous replication

4.2 may also bring some more significant changes to bundle replication, perhaps allowing indexers to access bundles read-only from shared storage on the search heads.

In the meantime, the recommended work-arounds (in no particular order) are:

A) Turn off SSL compression for approximately a 50% speed improvement by setting useClientSSLCompression = false in server.conf. This is not recommended in WAN settings or over slow links between the indexers and the search head.
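For reference, the setting would look like this in server.conf (the [sslConfig] stanza placement is my assumption; check server.conf.spec for your version):

# $SPLUNK_HOME/etc/system/local/server.conf
[sslConfig]
useClientSSLCompression = false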

OR

B) "Disable" bundle replication and share the bundles using some shared storage mechanism.

Search head changes:
(1) put the search head's etc directory in a shared location
(2) add the following stanza in etc/system/local/distsearch.conf (i.e. don't ship anything during bundle replication):
[replicationWhitelist]
conf  = 
other = 
searchscripts = 

Search peer changes (make these changes on all indexers):
(1) create the following directory $SPLUNK_HOME/var/run/searchpeers/-9999999999/
(2) inside the dir created in step (1) create symlinks to the search head's shared etc/system/ etc/apps/ and etc/users/

Basically, the directory we just created overrides all the bundles sent by the search head because its timestamp is more recent.
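The peer-side steps above can be sketched as follows; all paths here are scratch-directory placeholders standing in for the real $SPLUNK_HOME and the NFS-mounted search head etc/:

```python
import os
import tempfile

# Placeholder roots; on a real indexer these would be
# $SPLUNK_HOME/var/run/searchpeers/ and the search head's
# shared etc/ on NFS.
root = tempfile.mkdtemp()
shared_etc = os.path.join(root, "nfs", "searchhead", "etc")
override_dir = os.path.join(root, "var", "run", "searchpeers", "-9999999999")

# Step (1): create the override bundle directory; its name makes
# it sort as the most recent bundle, so it shadows anything the
# search head ships.
os.makedirs(override_dir)

# Step (2): symlink the search head's shared system/, apps/ and
# users/ directories into it.
for sub in ("system", "apps", "users"):
    os.makedirs(os.path.join(shared_etc, sub))
    os.symlink(os.path.join(shared_etc, sub), os.path.join(override_dir, sub))
```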

OR

C) Place the lookup in a separate bundle that is not replicated (using the whitelist in distsearch.conf) and adjust how you use the lookup so that it is only used on the search head (e.g. if you are using the lookup command, give it the option local=true).
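A sketch of option C, with a hypothetical lookup file apps/search/lookups/mac_to_ip.csv (blacklisting the single file is one way to keep it out of the bundle; the stanza follows the blacklist example shown elsewhere in this thread):

# $SPLUNK_HOME/etc/system/local/distsearch.conf (search head)
[replicationBlacklist]
nolargelookup = apps/search/lookups/mac_to_ip.csv

Then force the lookup to run only on the search head, e.g.:

... | lookup local=true mac_to_ip mac OUTPUT ip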


Dan
Splunk Employee

I have looked at modifying the lookup generation process so that the table itself is stored on an NFS share and only a symlink is placed in the apps/lookups/ directory. However, it looks like the replication logic follows the symlink and copies the original data.

I am also looking at replication whitelist options in distsearch.conf to exclude the lookup table from being replicated. However, I would then have to script the replication manually.

Both of these sound like sub-optimal solutions.

#******************************************************************************
# REPLICATION WHITELIST OPTIONS
# These options may be set under a [replicationWhitelist] stanza
#****************************************************************************** 
<name> = <whitelist_regex>
* A pattern that, if it matches a candidate file for replication (i.e. a file under $SPLUNK_HOME/etc), causes that file to be replicated.
* Note on wildcards and replication:
*   You can use wildcards to specify your path for replicated files. Use '...' for paths and '*' for files.
*   '...' recurses through directories until the match is met. This means that /foo/.../bar will match foo/bar, foo/1/bar, foo/1/2/bar, etc., but only if bar is a file.
*   To recurse through a subdirectory, use another '...'. For example, /foo/.../bar/....
*   '*' matches anything in that specific path segment. It cannot be used inside of a directory path; it must be used in the last segment of the path. For example, /foo/*.log matches /foo/bar.log but not /foo/bar.txt or /foo/bar/test.log.
*   Combine '*' and '...' for more specific matches: foo/.../bar/* matches any file in the bar directory within the specified path.
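The wildcard rules above can be modeled with an illustrative translation to regular expressions (my own sketch, not Splunk's actual matcher):

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    # '/...' recurses through directories (and may match nothing);
    # '*' matches within a single path segment. Illustrative only.
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("/...", i):
            out.append("(?:/.*)?")
            i += 4
        elif pattern.startswith("...", i):
            out.append(".*")
            i += 3
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "^" + "".join(out) + "$"

def matches(pattern: str, path: str) -> bool:
    return re.match(wildcard_to_regex(pattern), path) is not None
```

With this model, foo/.../bar matches foo/bar and foo/1/2/bar, while foo/*.log matches foo/bar.log but not foo/bar/test.log, as described above.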