That seems to have done the trick. Funnily enough, it now shows as 3/3 replicas (based on the cluster-wide replication factor, I guess). If I switch back to a site replication factor of 3 then the problem comes back. Thanks for your help.
I moved to a multi-site cluster yesterday and I'm not entirely sure that replication is actually working within the cluster. It may genuinely be broken, or it may be that the Splunk commands aren't playing nicely with the new multi-site cluster feature.
This is my clustering stanza in server.conf on the master:
[clustering]
mode = master
multisite = true
available_sites = site1,site2
site_replication_factor = origin:2, site1:1, site2:1, total:3
site_search_factor = origin:1, site1:1, site2:1, total:2
pass4SymmKey = <REDACTED>
search_factor = 2
replication_factor = 3
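For what it's worth, my reading of those two lines is that every bucket should end up with two copies in its originating site and one copy in the other site (three total), plus one searchable copy in each site (two total).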
I have 2 peers in each of 2 sites, with 1 search head in each site. All of the Splunk servers in the cluster are assigned sites in server.conf. I want a full searchable copy in each site for search affinity, hence the site_search_factor above.
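In case it matters, each peer has something like this in its server.conf (the master_uri host is just illustrative):
[general]
site = site1
[clustering]
mode = slave
master_uri = https://splunk-master:8089
pass4SymmKey = <REDACTED>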
I suppose the first thing I should say is that I'm getting my information from the splunk show cluster-status --verbose command or from the cluster settings page on the master.
When all 4 peers are up, my search factor is met, but all indices except 2 only reach 2/3 for the replication factor. The others each have between 1 and 8 buckets missing their third copy, and they never catch up. If I take down 1 peer in either site, the search factor drops to 1/2 for some portion of the indices and never recovers. The replication factor in that case either stays at 2/3 or drops to 1/3; it varies.
What makes me think this may be the tools reporting strangely is that it never recovers, despite there being no replication errors in splunkd.log (although I'm not sure whether replication fix-up messages would appear there at all). Also, if I bring the downed node back up and then take down the other node in that site, I get the same result.
Maintenance mode is off on the master.
If it's any help: when I upgraded to a multi-site cluster I made a mistake and didn't enable maintenance mode on the master before bringing up each peer (I ran the command but didn't notice that it asked for a login). I'm not sure whether that broke something.
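For reference, the sequence I believe I should have followed on the master was roughly:
splunk enable maintenance-mode
(restart/upgrade each peer, one at a time)
splunk disable maintenance-mode
Both commands prompt for a login, which is the prompt I missed the first time.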
While I'm at it, does anyone know when the splunk remove excess-buckets command will be enabled for multi-site clusters? I think I've pretty much got a searchable copy on every peer by now.
Great, that seems to have done the trick. Thanks.
It confused me for a little bit, as it takes some time for the action to run after you issue the command. By the way, the nouns are switched around in your example; this is the path I ended up using: /services/cluster/master/buckets/_internal~148~307D1B57-3D07-45F3-A0FC-A6BB94644886/remove_all
Is there a command reference for that? The API guide only shows GET commands for this path: http://docs.splunk.com/Documentation/Splunk/6.0.3/RESTAPI/RESTcluster
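In case it helps anyone else, this is roughly the call I ended up making (username, password, and hostname are placeholders):
curl -k -u admin:changeme -X POST https://splunk-master:8089/services/cluster/master/buckets/_internal~148~307D1B57-3D07-45F3-A0FC-A6BB94644886/remove_all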
I've been getting a few errors like this recently, reported by various nodes (they show up in the master's server messages):
Search peer s2splunk02 has the following message: Failed to make bucket = _internal~148~307D1B57-3D07-45F3-A0FC-A6BB94644886 searchable, retry count = 106.
I've also been failing to reach my search factor (2) for this index in our cluster; it always shows one bucket that is not replicated.
I tried repairing the bucket while the node was offline (and the cluster was in maintenance mode), but I received this error:
[splunk@s2splunk02 ~/var/lib/splunk/_internaldb/db]$ splunk fsck repair --one-bucket --bucket-path=/opt/splunk/var/lib/splunk/_internaldb/db/rb_1397557401_1397547003_148_307D1B57-3D07-45F3-A0FC-A6BB94644886
Not loading indexes.conf; will proceed with all defaults
Operating on: idx= bucket='/opt/splunk/var/lib/splunk/_internaldb/db/rb_1397557401_1397547003_148_307D1B57-3D07-45F3-A0FC-A6BB94644886'
Error reading compressed journal while streaming: gzip data truncated, provider=/opt/splunk/var/lib/splunk/_internaldb/db/rb_1397557401_1397547003_148_307D1B57-3D07-45F3-A0FC-A6BB94644886/rawdata/journal.gz
Repair (entire bucket) idx= bucket='/opt/splunk/var/lib/splunk/_internaldb/db/rb_1397557401_1397547003_148_307D1B57-3D07-45F3-A0FC-A6BB94644886' failed: (entire bucket) Rebuild for bkt='/opt/splunk/var/lib/splunk/_internaldb/db/rb_1397557401_1397547003_148_307D1B57-3D07-45F3-A0FC-A6BB94644886' failed: Error reading compressed journal while streaming: gzip data truncated, provider=/opt/splunk/var/lib/splunk/_internaldb/db/rb_1397557401_1397547003_148_307D1B57-3D07-45F3-A0FC-A6BB94644886/rawdata/journal.gz
At this stage we're still developing and testing our cluster, so I'm not too worried about the data; however, I have no idea how to get rid of the bucket. I tried stopping Splunk on all our peers and deleting the bucket, but it just reappears. So how can I delete a bucket in a cluster?
In this thread they solved it by taking the index offline and then online again, but I can't work out how to do that for a whole cluster. Doing it on an individual peer just complains that this is not a valid command for a cluster index node. Any help?
We use log4net for a bunch of our Windows services and web applications. Currently I set the sourcetype for each of the log4net logs in props.conf based on the source file, giving it a meaningful name like sourcetype=log4net:service:dispatcher or sourcetype=log4net:web:portal, something along those lines. The log files are rolled daily, with the date appended to each file name.
Because of the date rolling I use wildcards in my props.conf to match the source. We also have multiple hosts that run the same applications with the same log format, so three or so hosts could have identical log file names.
As almost all of the log4net logs share a common format, I'd like to change the sourcetype to plain log4net, which lets me easily put in a bunch of field extractions and a few other things, and then just tag the logs with the meaningful information rather than carrying it in the sourcetype. This seems like a better way to search for things, among other advantages.
I was envisioning having tags in key value pairs, so I'd have something like
log4netfamily=service or log4netfamily=web
then drilling down into the actual application names, like
log4netapp=dispatcher or log4netapp=portal
As far as I can tell there is no way to do this easily with tags, except by mucking about with props and transforms and appending a key-value string to every log entry as it gets indexed. Seems like a hassle, no? What's the best way to get extra key-value tags (essentially an additional field) based on source?
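To be concrete, the props/transforms route I'm trying to avoid would look something like this, done as an indexed field rather than rewriting the raw event (the file pattern, stanza names, and regex here are invented):
props.conf:
[source::...dispatcher*.log]
TRANSFORMS-log4net = log4netapp_dispatcher
transforms.conf:
[log4netapp_dispatcher]
SOURCE_KEY = MetaData:Source
REGEX = dispatcher
FORMAT = log4netapp::dispatcher
WRITE_META = true
That would give every event from a matching source an indexed field log4netapp=dispatcher (plus a fields.conf entry marking log4netapp as indexed), but it means one stanza pair per application, which is exactly the hassle I'd like to skip.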
I was also thinking that single-value tags would be okay (tag=log4netservice or tag=log4netweb is fine with me), but I'm having problems getting that going too.
What I was hoping I could do is just add a source stanza to my tags.conf based on the file name, so, for example, in tags.conf:
log4netservice = enabled
log4netdispatcher = enabled
(I also tried an alternative way of writing the tag stanza; I'm not sure which form is correct.)
But neither seems to apply the tags (I'm searching for tag::source="log4netservice", is that right?). From what I can see and read, wildcards don't work in tags.conf. Is that so? What are my options here?
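The only form I'd expect to work without wildcard support is an exact-match stanza per source value, something like (path invented):
[source=/opt/logs/DispatcherService.log.2014-05-12]
log4netservice = enabled
log4netdispatcher = enabled
With the date rolling daily that means a new stanza per file per day, which obviously doesn't scale, hence the question.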
If this all seems a bit pointless I'll just put in a wildcard extraction stanza as per this Splunkbase article, but that seems like a bit of a no-no based on the comments.