Howdy all,
We just rolled out the Splunk for VMware suite to our test VMware environment for evaluation and I'm getting much higher throughput than expected. The docs for the VMware suite say to expect 800MB - 1GB of data indexed per day per ESXi host. I'm seeing 4x that much data being indexed and I'm wondering where things went pear-shaped.
Our test environment consists of two ESXi hosts and one vCenter installation running on a VM. We have a total of 16 running VMs in the environment and a few more that are turned off. This is not a busy system, so when I saw 8GB/day going into the vmware index I started to get worried. We've had this running since Monday, I've been checking daily, and the 8GB/day indexing rate has held steady.
So, are there any thoughts on where things might be going wrong? Or why we are seeing so much data getting indexed?
It does get a lot of data, but I was able to get it under 500MB a day per ESX host. The number of guests will change it, but 18 is not a high number.
Look into the next couple of things, which you can easily do without going too deep:
1) Log level on the ESX servers. We have ESXi 5, but our hosts were upgraded from ESXi 4.1, and on ESXi 4.1 the default log level is Verbose. That collected a huge amount of data, and the nullQueue transforms are disabled by default on the FA. So either check the log level or enable the 'verbose|trivia' nullQueues on the FA; they are already there, just commented out.
2) On the VC the amount of logging was big too, especially from c:\programdata\application data\vmware\vmware virtualcenter\logs\vpxd-profiler-xxx.log.
I had to disable the vpxd-profiler output to file at https://myvcenter/vob/index.html
3) Also, I had to reduce the VC logging level to Error/Warning to reduce the size of the vpxd file.
4) My https://myvcenter URL uses the standard VMware certificate, so the FA generates a lot of errors on the SSL handshake. I had to create nullQueues on the FA for it like this:
[vmnull]
REGEX=SSL\sHandshake\sfailed
DEST_KEY=queue
FORMAT=nullQueue
[vmnull-2]
REGEX=SSL_accept\sfailed
DEST_KEY=queue
FORMAT=nullQueue
But the proper solution would be to install valid certificates and trusted authorities. You should probably provide this info in the post-install steps or something.
Another big part was inventory.
I increased the interval of collecting inventory on the Splunk VM appliance. In file:
/home/splunkadmin/opt/splunk/etc/apps/Splunk_TA_vmware/local/enginehierarchy.conf
hierarchyExpiration = 1800
In file:
/home/splunkadmin/opt/splunk/etc/apps/Splunk_TA_vmware/local/engineinvvc1.conf
inventoryExpiration = 7200
action = InventoryDiscovery
interval = 3600
After setting the inventory interval to 3600, inventoryExpiration to 7200, and hierarchyExpiration to 1800, I reduced the data collection size by more than 50%.
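For reference, a minimal sketch of what the complete inventory stanza in engineinvvc1.conf might look like, modeled on the PerfDiscovery stanzas shown further down in this thread (the stanza name, url, and credentials are placeholders, and I'm assuming the same key layout as the other engine*.conf stanzas):
[vc1]
# stanza name is arbitrary; url/credentials are placeholders for your own vCenter
url = https://myvcenter/sdk/webService
username = splunkforvmuser
password = xxxxxxxxxx
action = InventoryDiscovery
# collect inventory hourly and let it expire after two hours
interval = 3600
inventoryExpiration = 7200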
If you are seeing this following an upgrade to v3.3.1, it is likely due to the change in instance versus aggregated collection.
This post explains more.
https://answers.splunk.com/answers/470088/on-the-vmware-app-following-upgrade-from-v32x-to-v.html
Given that this question is from 2012, that isn't the case here.
However, it might be helpful for people searching for causes of generally higher data volumes.
I did not have an error like this. I had:
Encountered other certificate error: 27
So I used this regex, but for your case you need to modify the REGEX to something like the following, or add a new one. Just make sure that you are actually getting data and that it is not a real error:
REGEX=SSL_accept\sfailed\swith\sUnexpected\sEOF
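For completeness, a full stanza built around that regex might look like this (the stanza name is arbitrary; the DEST_KEY and FORMAT lines follow the same pattern as the other nullQueue stanzas in this thread):
[vmnull-eof]
# drop the "SSL_accept failed with Unexpected EOF" handshake noise
REGEX=SSL_accept\sfailed\swith\sUnexpected\sEOF
DEST_KEY=queue
FORMAT=nullQueue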
Thanks, so this will stop me from seeing the SSL errors (SSLStreamImpl::DoServerHandshake (7f9e0a98) SSL_accept failed with Unexpected EOF) from the hosts and VC in Splunk?
On the FA, in file: /home/splunkadmin/opt/splunk/etc/apps/Splunk_TA_vmware/default/transforms.conf
Uncomment everything after #NullQueues (you can copy this file to local and uncomment it there; that would be the Splunk way to do it):
#NullQueues
[vmware_vpxd_level_null]
DEST_KEY = queue
FORMAT = nullQueue
REGEX = ^\[?\d{4}-\d{2}-\d{2}[T\s][\d\:\.]{8,12}(?:[\+\-\s][\d\:]{5}|Z)?\s\[?\w+\s(verbose|trivia)
[vmware_vpxd_retrieveContents_null]
DEST_KEY = queue
FORMAT = nullQueue
REGEX = ^\[?\d{4}-\d{2}-\d{2}[T\s][\d\:\.]{8,12}(?:[\+\-\s][\d\:]{5}|Z)?\s\[?\w+\sinfo.*?task-internal.*?vmodl\.query\.PropertyCollector\.retrieveContents
[vmware_vpxd_null]
DEST_KEY = queue
FORMAT = nullQueue
REGEX = ^\[?\d{4}-\d{2}-\d{2}[T\s][\d\:\.]{8,12}(?:[\+\-\s][\d\:]{5}|Z)?\s\[?\w+\s(verbose|trivia|info.*?task-internal.*?vmodl\.query\.PropertyCollector\.retrieveContents)
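Note: for these transforms to actually drop events, they have to be referenced from a TRANSFORMS- entry in props.conf for the relevant sourcetype. The TA may already ship those references, so check default/props.conf first. If they are missing or also commented out, a local/props.conf wiring might look roughly like this (a sketch only, reusing the vpxd sourcetype and transform name that appear elsewhere in this thread):
[vmware:vclog:vpxd]
# apply the uncommented nullQueue transform to vCenter vpxd events
TRANSFORMS-vpxd_null = vmware_vpxd_null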
The rest is actually done on the indexer.
Sorry for the original confusion. When I rolled out the VMware app I built a VM guest dedicated to indexing just VMware. This way I had better control over the license: I just created a new license pool with a size limit and assigned it to this new indexer.
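If you go the dedicated-indexer route, the license pool is easiest to create in the Licensing page of the UI, but for reference a pool with a size limit lives in server.conf on the license master and looks roughly like this (pool name, quota, and GUID are placeholders):
[lmpool:vmware_pool]
description = dedicated pool for the VMware indexer
# quota is in bytes; 1073741824 = 1 GB/day
quota = 1073741824
# GUID of the indexer allowed to draw from this pool
slaves = <guid-of-vmware-indexer>
stack_id = enterprise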
If you have one indexer (or search head, I think they call it), you need to do it there. In my case it is Windows.
So file location: C:\Program Files\Splunk\etc\apps\Splunk_TA_vcenter\local\props.conf
Content:
[host::<indexer-host-name>]
TZ = America/Toronto
TRANSFORMS-vm = vmnull
TRANSFORMS-vm2 = vmnull-2
TRANSFORMS-vm4 = vmnull-4
TRANSFORMS-vm3a = vmnull-3a
TRANSFORMS-vm5 = vmnull-5
[vmware:esxlog:hostd]
TRANSFORMS-vmd = vmnull
TRANSFORMS-vmd2 = vmnull-2
[source::/var/log/vpxa.log]
TRANSFORMS-vpxa3a = vmnull-3a
And in file C:\Program Files\Splunk\etc\apps\Splunk_TA_vcenter\local\transforms.conf:
[vmnull]
REGEX=SSL\sHandshake\sfailed
DEST_KEY=queue
FORMAT=nullQueue
[vmnull-2]
REGEX=SSL_accept\sfailed
DEST_KEY=queue
FORMAT=nullQueue
[vmnull-3]
REGEX=info\s\'Default
DEST_KEY=queue
FORMAT=nullQueue
[vmnull-3a]
REGEX=info
DEST_KEY=queue
FORMAT=nullQueue
[vmnull-4]
REGEX=Not\scollecting\sstats\sthis\stime
DEST_KEY=queue
FORMAT=nullQueue
[vmnull-5]
REGEX=Encountered\sother\scertificate\serror:\s27
DEST_KEY=queue
FORMAT=nullQueue
Also on the indexer I uncommented the NullQueues. Open: C:\Program Files\Splunk\etc\apps\Splunk_TA_vcenter\default\transforms.conf
You can copy the content to local and modify it there, or modify it in place, but then an update might overwrite it.
Anyway, I uncommented #NullQueues in it the same as on the FA.
I also had one typo in step 2.
It should be vod instead of vob: https://myvcenter/vod/index.html
This question is for nazdryna. For step number 4 (I had to create nullQueue on FA for the vCenter SSL error) in your first post, which file and path on the FA do you put that info in?
A couple more things I did. I now collect 500MB a day for the whole vmware index, which includes 4 hosts, one vCenter, and about 80 VM guests.
These steps will reduce data without loss of existing functionality:
/home/splunkadmin/opt/splunk/etc/apps/Splunk_TA_vmware/local/engineinvvc1.conf
action = InventoryDiscovery
inventoryLevel=Required
interval = 3600
This will further reduce inventory data.
On the indexer, add nullQueues for some log events. In apps/Splunk_TA_vcenter/local/props.conf:
[host::myindexerhost_changeit]
TZ = America/Toronto
TRANSFORMS-vm = vmnull
TRANSFORMS-vm2 = vmnull-2
[vmware:esxlog:hostd]
TRANSFORMS-vm = vmnull
TRANSFORMS-vm2 = vmnull-2
[vmware:esxlog:vpxa]
TRANSFORMS-vm3 = vmnull-3
[vmware:vclog:vpxd]
TRANSFORMS-vm4 = vmnull-4
In transforms.conf (same directory):
[vmnull]
REGEX=SSL\sHandshake\sfailed
DEST_KEY=queue
FORMAT=nullQueue
[vmnull-2]
REGEX=SSL_accept\sfailed
DEST_KEY=queue
FORMAT=nullQueue
[vmnull-3]
REGEX=info\s\'Default
DEST_KEY=queue
FORMAT=nullQueue
[vmnull-4]
REGEX=Not\scollecting\sstats\sthis\stime
DEST_KEY=queue
FORMAT=nullQueue
The next steps will reduce the amount of performance data you are getting, with the loss of some functionality.
I have performance monitoring done by other agents; with Splunk I only want to monitor the VM infrastructure, host performance, and VM guest disk latency. You can use a similar technique to fine-tune what data you collect. More details at: http://docs.splunk.com/Documentation/VMW/latest/Install/engine.confsettings
The next steps will help you achieve this.
In each engineperf.conf file (you might have multiple), make the following changes:
To every stanza with action=PerfDiscovery, add this setting:
perfLevel=2
(You can set level 3 if you notice something missing; the difference is described here: http://www.vmware.com/support/developer/vc-sdk/visdk41pubs/ApiReference/vim.HistoricalInterval.html.)
perfManagedEntityWhitelist will disable collection of perf data from guests and virtual appliances, but will still collect full perf info from hosts, datastores, clusters, etc.:
perfManagedEntityWhitelist = ClusterComputeResource|ResourcePool|HostSystem
Now, to enable collection of specific metrics from guests, copy the stanza you just modified (add something to the stanza name, like -disk) and replace
perfManagedEntityWhitelist = ClusterComputeResource|ResourcePool|HostSystem
with VirtualMachine (you can add |VirtualApp if you need it), then use perfTypeWhitelist to add the perf type you want to collect.
For example this:
[esx4host1]
url = https://host1/sdk/webService
username = splunkforvmuser
password = xxxxxxxxxx
action = PerfDiscovery
perfLevel=2
perfInstanceData = OFF
interval = 60
perfManagedEntityWhitelist = ClusterComputeResource|ResourcePool|HostSystem
will become this:
[esx4host1-disk]
url = https://host1/sdk/webService
username = splunkforvmuser
password = xxxxxxxxxx
action = PerfDiscovery
perfLevel=2
perfInstanceData = OFF
interval = 60
perfManagedEntityWhitelist = VirtualMachine
perfTypeWhitelist=disk
With this you can separate what performance data you collect from hosts and from guests. In addition to disk, you can add any of the other available perf types.
From what I see, Splunk's big data approach is to grab all the data it can and then figure out what to do with it. This is a valid approach if you have an unlimited license; if not, you are paying the price for indexing useless data. We cannot afford that, so I had to do all this tweaking to get under our license limit while still keeping the data that matters for analysis.
This is great! Given how much data this app is likely to collect, I would suggest a tuning document forthwith.