Deployment Architecture

How to troubleshoot a bundle replication failure issue?

Motivator

Hi All,

We are seeing the message below pop up on the search heads. We recently upgraded the indexer and search head instances from 6.6.1 to 7.0.4, but we have not yet upgraded the heavy forwarders. Could this be causing the issue?

 Error Message: 

 Distributed Bundle Replication Manager: The current bundle directory contains a large lookup file that might cause bundle replication fail. The path to the directory is /opt/splunk/var/run/C090FDA2-105E-4875-A110-3F13FF986151-1531313061-1531313166.delta.

Kindly guide me to troubleshoot this issue.

1 Solution

Motivator

Hi Woodcock, we fixed this issue with the solution below.

Problem details: Distributed Bundle Replication Manager: The current bundle directory contains a large lookup file that might cause bundle replication fail. The path to the directory is /opt/splunk/var/run/C090FDA2-105E-4875-A110-3F13FF986151-1531313061-1531313166.bundle

Impact: Because of this, the file system was almost 100% utilized, which in turn caused searches to fail and prevented all users from searching the data.

Solution:

The link below gave us some ideas for fixing this issue.

link text

1) First, we checked which CSV files were consuming the most space in the apps folder on the search head by running the command below from this directory:

 /opt/splunk/etc/apps/

find . -name '*.csv' -exec du -sh {} \; | grep "M" | less

2) We narrowed it down to one .csv file that was consuming 660 MB: /opt/splunk/etc/apps/servicenow/lookup/cmdb_ci_list_lookup.csv

3) On troubleshooting, we found that the lookup file was broken: the fields in the lookup table contained data that was not relevant to ServiceNow. So we corrected the lookup field details and uploaded the corrected CSV file to the search head cluster via the deployer.

4) The corrected CSV file was 49 MB, and uploading it fixed the issue.

Note: Since cmdb_ci_list_lookup.csv is configured in props.conf and defined as Global, this lookup file is needed on the indexers, so we did not use the replicationBlacklist stanza in distsearch.conf as mentioned in the link.

Communicator

This might be a better command for it:
find . -type f -name '*.csv' -size +10M -exec ls -lh {} \;
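
A minimal sketch that combines the two find commands in this thread into one reusable function: it lists CSV files above a size threshold, largest first. The function name largest_csvs and its defaults (the apps path and a 10 MB threshold) are illustrative assumptions, not Splunk settings, and the -size suffix and du -k behavior assume GNU find and coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list CSV files larger than a threshold, largest first.
# Arguments: directory to scan (default is the apps path used in this thread),
# and a minimum size in MiB (default 10, an arbitrary illustrative value).
largest_csvs() {
    local apps_dir="${1:-/opt/splunk/etc/apps}"
    local min_mb="${2:-10}"
    # The quoted '*.csv' stops the shell from expanding the glob before find sees it.
    # du -k prints disk usage in KiB so that sort -rn can order the rows numerically.
    find "$apps_dir" -type f -name '*.csv' -size +"${min_mb}"M -exec du -k {} + \
        | sort -rn
}

# Example call (not executed here):
#   largest_csvs /opt/splunk/etc/apps 10 | head -n 20
```

Sorting on KiB avoids the grep "M" filter, which would also match any path that merely contains the letter M.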


Esteemed Legend

The message is warning you that you may be nearing a problem, NOT that you actually have one. It is safe to ignore, but you can also monitor it with a dashboard panel like this one:

<panel>
  <title>Lookup table details (beware "Bundle too large" replication errors) - WARNING: may take a long time to complete; sizes and percentages are APPROXIMATE</title>
  <table>
    <title>The warning will be this: Bundle Replication: Problem replicating config (bundle) to search peer ' &lt;hostname&gt;:8089 ', HTTP response code 413 (HTTP/1.1 413 Content-Length of &lt;size here&gt; too large (maximum is 838860800)). Content-Length of &lt;size here&gt; too large (maximum is 838860800) (Unknown write error)</title>
    <search>
      <query>|rest/services/data/lookup-table-files splunk_server=local
| search eai:acl.app="$env:app$"
| rename dispatch.* AS *
| rename eai:acl.* AS *
| map maxsearches=99 search="
| inputlookup $$title$$
| rename COMMENT1of3 AS \"Some field names have single-quotes which will cause this error:\"
| rename COMMENT2of3 AS \"{map}: Failed to parse templatized search for field 'Bad Field's Name Here'\"
| rename COMMENT3of3 AS \"So rename those fields before we process them to replace ' with _\"
| rename *'*'*'*'* AS *_*_*_*_*, *'*'*'* AS *_*_*_*, *'*'* AS *_*_*, *'* AS *_*
| eval T3MpJuNk_bytes=0, T3MpJuNk_cols=0, T3MpJuNk_field_names=\",\"
| foreach _*
    [ eval T3MpJuNk_bytes = T3MpJuNk_bytes + coalesce(len('&lt;&lt;FIELD&gt;&gt;'), 0)
    | eval T3MpJuNk_cols = T3MpJuNk_cols + 1
    | eval T3MpJuNk_field_names = T3MpJuNk_field_names . \"&lt;&lt;FIELD&gt;&gt;\"]
| rename _* AS *, T3MpJuNk_* AS _T3MpJuNk_*
| foreach *
    [ eval _T3MpJuNk_bytes = _T3MpJuNk_bytes + coalesce(len('&lt;&lt;FIELD&gt;&gt;'), 0)
    | eval _T3MpJuNk_cols = _T3MpJuNk_cols + 1
    | eval _T3MpJuNk_field_names = _T3MpJuNk_field_names . \"&lt;&lt;FIELD&gt;&gt;\"]
| rename COMMENT AS \"Account for the commas, too!\"
| eval _T3MpJuNk_bytes = _T3MpJuNk_bytes + (_T3MpJuNk_cols - 1)
| stats sum(_T3MpJuNk_bytes) AS bytes count AS lines first(_T3MpJuNk_cols) AS cols first(_T3MpJuNk_field_names) AS field_names
| rename COMMENT AS \"Account for the header line, too!\"
| eval bytes = bytes + (len(field_names) - 1)
| eval title=\"$$title$$\"
| eval owner=\"$$owner$$\"" 
| eval bytes = coalesce(bytes, 0)
| addtotals row=false col=true labelfield=title label="$TOTAL_FIELD_VALUE$" 
| eval "bytes/line" = if(title=="$TOTAL_FIELD_VALUE$", "N/A", round(coalesce(bytes/lines, 0), 2))
| eval owner = if(title=="$TOTAL_FIELD_VALUE$", "N/A", owner)
| eval cols  = if(title=="$TOTAL_FIELD_VALUE$", "N/A", coalesce(cols, "N/A"))
| eval MB = round(bytes / 1024 / 1024, 2)
| eval bundlePct = round(100 * bytes / 838860800, 2)
| eval status=case(
   title=="$TOTAL_FIELD_VALUE$", if((bundlePct < 90),                         "OK", "DANGEROUS TERRITORY"),
   true(),                       if((bundlePct < 25 AND lines < 10000000), "OK", "Consider KVStore"))
| sort 0 - bytes
| table title status bundlePct owner bytes MB lines cols bytes*line
| eval _drilldown  = if(title=="$TOTAL_FIELD_VALUE$", "*", title)</query>
      <earliest>0</earliest>
      <latest></latest>
      <sampleRatio>1</sampleRatio>
    </search>
    <option name="count">100</option>
    <option name="dataOverlayMode">none</option>
    <option name="drilldown">cell</option>
    <option name="percentagesRow">false</option>
    <option name="rowNumbers">false</option>
    <option name="totalsRow">false</option>
    <option name="wrap">true</option>
    <drilldown target="_blank">
      <link>/manager/$env:app$/data/lookup-table-files?app=$env:app$&amp;app_only=1&amp;count=100&amp;search=$row._drilldown$</link>
    </drilldown>
  </table>
</panel>
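
For a rough sense of the numbers behind bundlePct in the panel above, here is a small sketch of the same arithmetic in shell. The function name bundle_pct is made up for illustration; the 838860800-byte limit is the value quoted in the panel's HTTP 413 message, and treating the "660 MB" reported in this thread as 660 MiB is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: what percentage of the bundle limit a byte count represents.
# The default limit of 838860800 bytes (800 MiB) is the value from the 413 message.
bundle_pct() {
    local bytes="$1"
    local limit="${2:-838860800}"
    # awk does the floating-point division; output is rounded to two decimals.
    awk -v b="$bytes" -v l="$limit" 'BEGIN { printf "%.2f\n", 100 * b / l }'
}

# 660 MiB, assumed to be the size of the oversized lookup reported in this thread.
bundle_pct $((660 * 1024 * 1024))   # prints 82.50
```

By this estimate, a single lookup of that size would take up most of the limit on its own, before anything else in the bundle is counted.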

Motivator

Hi Woodcock, good evening. We started receiving this message again on the search heads, but I am not sure how to find which lookup table is actually taking the space. When I ran the query from the XML above, I got no output, and the Job Inspector showed the messages below.

No Matching field exists 
REST Processor: Restricting results of the "rest" operator to the local instance because you do not have the "dispatch_rest_to_indexers" capability.

Error message:

Search peer searchhead01 has the following message: Distributed Bundle Replication Manager: The current bundle directory contains a large lookup file that might cause bundle replication fail. The path to the directory is /opt/splunk/var/run/C090FDA2-105E-4875-A110-3F13FF986151-1531761965-1531762072.delta. (7/16/2018, 1:28:25 PM)
Search peer searchhead02 has the following message: Distributed Bundle Replication Manager: The current bundle directory contains a large lookup file that might cause bundle replication fail. The path to the directory is /opt/splunk/var/run/C090FDA2-105E-4875-A110-3F13FF986151-1531334559-1531334668.delta. (7/16/2018, 1:28:25 PM)

Could you please guide me on how to troubleshoot this issue?


Esteemed Legend

Change the |rest/services/data/lookup-table-files to |rest/services/data/lookup-table-files splunk_server=local (I updated the answer). This should show you the files. In order to best investigate/remedy this, you need CLI access to your Search Head so you can go into the directories listed in the logs and see what is there.


Motivator

Hi Woodcock, I tried running the Splunk query above but still get the same message in the Job Inspector. As advised, I accessed the search head via the CLI, went to the location of /opt/splunk/var/run/C090FDA2-105E-4875-A110-3F13FF986151-1531313061-1531313166.delta, and ran this command to check which lookup files are present:

tar -tvf C090FDA2-105E-4875-A110-3F13FF986151-1531313061-1531313166.delta | grep -i '\.csv'

I could see lots of CSV files but could not find the lookup file that is taking the space. Can you guide me on this?


Esteemed Legend

Try this; the last one is your guy:

tar -tvf C090FDA2-105E-4875-A110-3F13FF986151-1531313061-1531313166.delta | sort -k3,3n
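
A hedged sketch of the same idea as a small function, limited to CSV entries. It assumes GNU tar, whose verbose listing puts the size in bytes in the third column; the function name largest_in_bundle and the default of five rows are illustrative. The sort has to be numeric (the n in -k3,3n), because a plain text sort would place a 9-byte file after a 50000-byte one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show the largest CSV entries inside a bundle or delta archive.
# Assumes GNU tar, where column 3 of "tar -tvf" output is the size in bytes.
largest_in_bundle() {
    local bundle="$1"
    local count="${2:-5}"
    # Keep only .csv entries, sort by size numerically, and keep the last rows;
    # the final row printed is the largest file.
    tar -tvf "$bundle" | grep -i '\.csv$' | sort -k3,3n | tail -n "$count"
}

# Example call (not executed here):
#   largest_in_bundle C090FDA2-105E-4875-A110-3F13FF986151-1531313061-1531313166.delta
```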

Motivator

Woodcock, I can see that the three lookups below are consuming some space, and every five minutes the message pops up with a different .delta file. When I extracted one using tar -xvf C090FDA2-105E-4875-A110-3F13FF986151-1531313061-1531313166.delta, I could see that the three CSV files below are causing the replication problem.

-rw------- 1 splunk splunk 5.9M Jul 18 06:15 SEC-CND-INCIDENTS-SPLUNK.csv
-rw------- 1 splunk splunk 8.2M Jul 18 06:15 SEC-CND-INCIDENTS-SNOW.csv
-rw------- 1 splunk splunk 35K Jul 18 06:14 SEC-CND-INCIDENTS-SNOW-NOT-SECURITY.csv

When I checked where these CSV files reside, I found them in the Search & Reporting app under this path: /opt/splunk/etc/apps/search/lookups/

These lookup files are defined as Private by the user.

Question:

1) How do I fix this issue? Based on the Splunk documentation, do I need to configure replicationBlacklist in $SPLUNK_HOME/etc/system/local/distsearch.conf?

link text

Kindly guide me on this.


Path Finder

Either add the lookup files to the replication blacklist or make use of the KV store.
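
For the blacklist option, a rough sketch of what the distsearch.conf stanza might look like, generated by a small shell function so it can be reviewed before anything is written to disk. The function name blacklist_stanza and the entry names (nobundle_0, nobundle_1) are arbitrary labels invented here, and the assumption that patterns are given relative to $SPLUNK_HOME/etc should be checked against the distsearch.conf spec for your version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a [replicationBlacklist] stanza for the given paths.
# Entry names are arbitrary labels; paths are ASSUMED to be relative to
# $SPLUNK_HOME/etc, so verify this against the distsearch.conf spec.
blacklist_stanza() {
    local i=0 path
    printf '[replicationBlacklist]\n'
    for path in "$@"; do
        printf 'nobundle_%d = %s\n' "$i" "$path"
        i=$((i + 1))
    done
}

# The two larger lookups named in this thread; review the output, then add it
# to $SPLUNK_HOME/etc/system/local/distsearch.conf by hand.
blacklist_stanza \
    'apps/search/lookups/SEC-CND-INCIDENTS-SPLUNK.csv' \
    'apps/search/lookups/SEC-CND-INCIDENTS-SNOW.csv'
```

As the accepted solution notes, a lookup that searches need on the indexers should not be excluded this way, since it would no longer be replicated to them.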


Influencer

This message is not related to the upgrade. It says that a large lookup was created, which might cause bundle replication issues. Once the bundle exceeds the default maximum size, search heads will no longer be able to push the bundle to the indexers.

On the file system, search for any .csv files that stand out with respect to size. You will have to remove that lookup in order to resolve the issue.


Motivator

Hey Pradeep, thanks. How and where do I find the default maximum lookup size in limits.conf? And how do I find the lookup file that has the largest size?


Motivator

@Hemnaath, since you're seeing this message on search heads (which I assume are part of a search head cluster), run the Linux command below from the deployer to get an idea of the size of the lookup files.

From your deployer: find $SPLUNK_HOME/etc/shcluster/apps/ -type f -name '*.csv' -exec du -sh {} +
The default lookup file size limit may be 10 MB; I am not 100% sure about this. More details here, under the lookup stanza.
