Splunk Search

Large lookup caused the bundle replication to fail. What are my options?

rbal_splunk
Splunk Employee

My searches are failing. I have one Search Head and 26 indexers. In the Search Head's splunkd.log, I see the following errors:

08-10-2015 20:37:24.501 +0000 INFO NetUtils - SSL_write failed. Connection reset by peer
08-10-2015 20:37:24.501 +0000 ERROR DistributedBundleReplicationManager - Unexpected problem while uploading bundle: Unknown write error
08-10-2015 20:37:24.501 +0000 ERROR DistributedBundleReplicationManager - Unable to upload bundle to peer named xxx04 with uri=https://xxx04.corp.intranet:8089.
08-10-2015 20:37:24.502 +0000 WARN DistributedBundleReplicationManager - Asynchronous bundle replication to 26 peer(s) succeeded; however it took too long (longer than 10 seconds): elapsed_ms=32287, tar_elapsed_ms=20891, bundle_file_size=819380KB, replication_id=1439239012, replication_reason="async replication allowed"

1 Solution

rbal_splunk
Splunk Employee

We have taken the following steps to debug this situation.

Based on the above error, the search bundle size is 800+MB and as a result, bundles are not getting downloaded to the indexers, causing searches to fail.

On the search head, the knowledge bundles reside under the $SPLUNK_HOME/var/run directory. The bundles have the extension .bundle for full bundles or .delta for delta bundles. They are tar files, so you can run tar tvf against them to see the contents.

The knowledge bundle gets distributed to the $SPLUNK_HOME/var/run/searchpeers directory on each search peer. The search peers use the search head's knowledge bundle to execute queries on its behalf. When executing a distributed search, the peers are ignorant of any local knowledge objects. They have access only to the objects in the search head's knowledge bundle.
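To confirm what a peer actually received, you can look under that directory on an indexer; the extracted bundle directory name below (search head name plus a timestamp) is only illustrative:

ls -lh $SPLUNK_HOME/var/run/searchpeers/
ls -l $SPLUNK_HOME/var/run/searchpeers/<search_head_name>-<epoch>/apps/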

Bundles typically contain a subset of files (configuration files and assets) from $SPLUNK_HOME/etc/system, $SPLUNK_HOME/etc/apps, and $SPLUNK_HOME/etc/users.

The process of distributing knowledge bundles means that peers, by default, receive nearly the entire contents of the search head's apps. If an app contains large binaries or CSV files that do not need to be shared with the peers, you can eliminate them from the bundle and thus reduce the bundle size.

Next, we checked the contents of the bundle on the search head:

cd $SPLUNK_HOME/var/run
tar -tvf sh604-1409261525.bundle

We noticed that the bundle had many lookup files, some as big as 100MB.

One of the options we have is to filter lookups out of the bundle, as described at: http://docs.splunk.com/Documentation/Splunk/6.2.2/DistSearch/Limittheknowledgebundlesize
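As a rough sketch of that approach (the stanza key and lookup path below are placeholders; a concrete, working example appears later in this thread), a blacklist in distsearch.conf on the search head looks like this:

$SPLUNK_HOME/etc/system/local/distsearch.conf
[replicationBlacklist]
<some_name> = apps/<app_name>/lookups/<large_lookup_file>.csv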

The questions that came up were:

What is the recommendation on filtering the lookups on the Search Head?
When is the lookup required on the Search Head versus the indexers?

We used the following guidelines to determine which lookups can be filtered.

i) The lookup is only needed on the Search Head when the output fields from the lookup table are required only post-reporting (that is, after the first reporting command). For example, in this scenario the lookup is only needed on the SH:

index=test | stats count by clientip, domain | lookup domain2datacenter domain OUTPUT datacenter 

ii) Here's an example where the lookup is needed on the indexers:

index=test | lookup domain2datacenter domain OUTPUT datacenter | stats count by clientip, datacenter 

Note: The stats count is the point at which the map/reduce split happens and results are sent to the search head. This typically happens with the first reporting command, so what matters is: do I need the lookup before or after the first reporting command? That is the determining factor for whether the lookup is needed on the indexers.

In the 2nd example, I use a field produced by the lookup ("datacenter") in my first reporting command. Clearly, my indexers are going to need access to the lookup in order to run that stats.

In the 1st example, that is not the case.
You need local=true if you want the indexers not to attempt to run the lookup. So, the 1st example should actually be:

index=test | stats count by clientip, domain | lookup local=true domain2datacenter domain OUTPUT datacenter 

Here are some relevant Answers:
http://answers.splunk.com/answers/13942/big-lookup-file-replication-too-slow
http://answers.splunk.com/answers/88894/external-lookup-script-on-search-head
http://answers.splunk.com/answers/66064/using-localtrue-in-automatic-lookups

iii) In addition, if a lookup is used only to populate dashboard drop-downs (selections), it does not need to be sent to the indexers.
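For instance, a drop-down populated by a search like the following (reusing the domain2datacenter lookup from the earlier examples) runs entirely on the search head, so the CSV behind it never has to reach the indexers:

| inputlookup domain2datacenter | dedup datacenter | fields datacenter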

iv) If the lookup is defined as an automatic lookup in props.conf, as shown below, it applies globally and will be required on the indexers.

[my_lookuptype] 
LOOKUP-foo = mylookuptable userid AS myuserid OUTPUT username AS myusername
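If you are not sure whether any automatic (props.conf-based) lookups like this are defined, one quick way to check on the search head is with btool, for example:

$SPLUNK_HOME/bin/splunk btool props list --debug | grep LOOKUP-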


rbal_splunk
Splunk Employee

Another issue with large lookups in a distributed environment:

When a lookup larger than 10MB (the default value of max_memtable_bytes in limits.conf) is replicated from the Search Head to the indexers, it gets indexed on each indexer once the search bundle is downloaded. This adds overhead due to the lookup indexing and can lead to timeouts.

In this situation, from a search performance perspective you would want to raise max_memtable_bytes to encompass your large lookups, but realize that memory will start to swap if you start utilizing too much memory system-wide.

Also refer to: https://answers.splunk.com/answers/177494/is-my-big-csv-lookup-file-indexed-in-memory-by-spl.html
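If you do raise it, the setting lives in the [lookup] stanza of limits.conf and is expressed in bytes; a minimal sketch (100MB here is only an example value; size it to your largest lookup and your available RAM):

$SPLUNK_HOME/etc/system/local/limits.conf
[lookup]
max_memtable_bytes = 104857600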


rbal_splunk
Splunk Employee

Here is an example that I recently worked on to filter a large lookup.

My knowledge bundle has the following CSV files:

cd $SPLUNK_HOME/var/run
tar -tvf 2cm262-1445532650.bundle | grep -i .csv

-rw------- splunk/splunk 675 2015-10-14 10:22 apps/splunk_management_console/lookups/assets.csv
-r-xr-xr-x splunk/splunk 3396 2015-09-09 19:26 apps/splunk_management_console/bin/splunklib/searchcommands/csv/dict_writer.py
-r-xr-xr-x splunk/splunk 894 2015-09-09 19:26 apps/splunk_management_console/bin/splunklib/searchcommands/csv/dialect.py
-r-xr-xr-x splunk/splunk 935 2015-09-09 19:26 apps/splunk_management_console/bin/splunklib/searchcommands/csv/__init__.py
-r-xr-xr-x splunk/splunk 2859 2015-09-09 19:26 apps/splunk_management_console/bin/splunklib/searchcommands/csv/dict_reader.py
-r--r--r-- splunk/splunk 832 2015-09-15 23:27 apps/search/lookups/geo_attr_us_states.csv
-r--r--r-- splunk/splunk 18241 2015-09-15 23:27 apps/search/lookups/geo_attr_countries.csv
-rw-r--r-- splunk/splunk 2685769 2015-10-22 09:44 apps/search/lookups/STVAsomefilenamethatismadeup_RBAL_TEST.csv

I applied the following blacklist on the search head, and restarted the SH after this change.

$SPLUNK_HOME/etc/system/local/distsearch.conf
[replicationBlacklist]
staylocal = apps/search/lookups/STVAsomefilenamethatismadeup_RBAL_TEST.csv

This was followed by a restart and then deleting the old bundles from $SPLUNK_HOME/var/run.
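A minimal sketch of those two steps, assuming you really do want to clear out all previously generated bundles as described above (double-check the directory contents before deleting):

$SPLUNK_HOME/bin/splunk restart
rm $SPLUNK_HOME/var/run/*.bundle $SPLUNK_HOME/var/run/*.delta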

A new bundle was created, and it no longer contains the blacklisted lookup, as shown below.

[root@centos65-64sup02 run]# tar -tvf 2cm262-1445533204.bundle | grep -i .csv

-rw------- splunk/splunk 675 2015-10-22 09:51 apps/splunk_management_console/lookups/assets.csv
-r-xr-xr-x splunk/splunk 3396 2015-09-09 19:26 apps/splunk_management_console/bin/splunklib/searchcommands/csv/dict_writer.py
-r-xr-xr-x splunk/splunk 894 2015-09-09 19:26 apps/splunk_management_console/bin/splunklib/searchcommands/csv/dialect.py
-r-xr-xr-x splunk/splunk 935 2015-09-09 19:26 apps/splunk_management_console/bin/splunklib/searchcommands/csv/__init__.py
-r-xr-xr-x splunk/splunk 2859 2015-09-09 19:26 apps/splunk_management_console/bin/splunklib/searchcommands/csv/dict_reader.py
-r--r--r-- splunk/splunk 832 2015-09-15 23:27 apps/search/lookups/geo_attr_us_states.csv
-r--r--r-- splunk/splunk 18241 2015-09-15 23:27 apps/search/lookups/geo_attr_countries.csv

In the case of a Search Head Cluster deployment, this blacklist will need to be deployed from the Deployer to the Search Head Cluster members, following Splunk's recommendations at: http://docs.splunk.com/Documentation/Splunk/6.4.2/DistSearch/PropagateSHCconfigurationchanges
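A hedged sketch of that Deployer workflow (the app name and member host are placeholders):

mkdir -p $SPLUNK_HOME/etc/shcluster/apps/<my_app>/local
cp distsearch.conf $SPLUNK_HOME/etc/shcluster/apps/<my_app>/local/
$SPLUNK_HOME/bin/splunk apply shcluster-bundle -target https://<sh_member>:8089 -auth admin:<password>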

rbal_splunk
Splunk Employee

This question has come up a few times from Splunk users: in a distributed search environment, I was under the impression a lookup only needed to exist on the search heads, not the indexers. However, I created a lookup on a search head and get warning messages when I use it in a search:

[splunk-idx-01] Search process did not exit cleanly, exit_code=255, description="exited with code 255". Please look in search.log for this peer in the Job Inspector for more info.
[splunk-idx-01] Streamed search execute failed because: Error in 'lookup' command: The lookup table 'showDiskType' does not exist.
[splunk-idx-02] Search process did not exit cleanly, exit_code=255, description="exited with code 255". Please look in search.log for this peer in the Job Inspector for more info.
[splunk-idx-02] Streamed search execute failed because: Error in 'lookup' command: The lookup table 'showDiskType' does not exist.
[splunk-idx-03] Search process did not exit cleanly, exit_code=255, description="exited with code 255". Please look in search.log for this peer in the Job Inspector for more info.
[splunk-idx-03] Streamed search execute failed because: Error in 'lookup' command: The lookup table 'showDiskType' does not exist.

To avoid this issue and force the search to use the lookup only on the Search Head, here are two options:

1) Using local=t on the lookup:

If you have a distributed environment and your lookup file is big, you can also add local=t.

For example:

[|inputlookup local=t blacklist.csv]

Or here is another example:

<your spl search> | lookup local=t  <lookup_name> <lookup_key_field> OUTPUTNEW <output_field_1> ... <output_field_n>

2) Preceding the lookup with a table command that lists the subsequently needed fields, as shown in the sketch below.
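As a sketch of option 2, reusing the domain2datacenter example from above: table is not a distributable streaming command, so everything after it, including the lookup, runs on the search head:

index=test | table clientip, domain | lookup domain2datacenter domain OUTPUT datacenter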


reed_kelly
Contributor

Thanks for the detailed analysis. Unfortunately, we use very large lookup files referenced by accelerated datamodels. By the very nature of accelerated datamodels, these lookups have to take place on the indexers. Other types of lookups are either too slow or get replicated in the same way.
We are just below the point that breaks the bundle replication, but would love an intermediate solution.
