
Splunk App for VMware throughput much higher than expected

colinj
Path Finder

Howdy all,

We just rolled out the Splunk for VMware suite to our test VMware environment for evaluation and I'm getting much higher throughput than expected. The docs for the VMware suite say to expect 800MB - 1GB of data indexed per day per ESXi host. I'm seeing 4x that much data being indexed and I'm wondering where things went pear-shaped.

Our test environment consists of two ESXi hosts and one vCenter installation running on a VM. We have a total of 16 running VMs in the environment and a few more that are turned off. This is not a busy system, so when I saw 8GB/day going into the vmware index I started to get worried. We've had this running since Monday, I've been checking daily, and the indexing rate has held steady at 8GB/day.
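
For anyone wanting to double-check their own numbers, a search along these lines breaks the daily volume down by sourcetype (fields are from the standard license_usage.log on the license master; adjust if your setup differs):

index=_internal source=*license_usage.log* type="Usage" idx="vmware"
| eval GB = b / 1024 / 1024 / 1024
| timechart span=1d sum(GB) AS GB_indexed by st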

So, are there any thoughts on where things might be going wrong? Or why we are seeing so much data getting indexed?

1 Solution

nazdrynau
Explorer

It does get a lot of data, but I was able to get it under 500MB a day per ESX host. The number of guests will change it, but 18 is not a high number.
Look into the next couple of things you can easily do without going too deep:

1) Log level on the ESX servers. We have ESXi 5, but the hosts were upgraded from ESXi 4.1, and on ESXi 4.1 the default log level is Verbose. That collected a huge amount of data, and the nullQueue transforms are disabled by default on the FA. So either check the log level or enable the 'verbose|trivia' nullQueue on the FA (the stanzas are already there, just commented out).

2) On the VC the amount of logging was big too, especially from c:\programdata\application data\vmware\vmware virtualcenter\logs\vpxd-profiler-xxx.log.
I had to disable the vpxd-profiler output to file at https://myvcenter/vob/index.html

3) I also had to reduce the VC logging level to Error/Warning to reduce the size of the vpxd log.

4) My https://myvcenter URL uses the standard VMware certificate, so the FA generates a lot of errors on the SSL handshake. I had to create a nullQueue on the FA for it like this:

[vmnull]
REGEX=SSL\sHandshake\sfailed
DEST_KEY=queue
FORMAT=nullQueue

[vmnull-2]
REGEX=SSL_accept\sfailed
DEST_KEY=queue
FORMAT=nullQueue

But the normal solution would be to install proper certificates and trusted authorities. This info should probably be provided in the post-install steps or something.

Another big part was inventory.
I increased the inventory collection interval on the Splunk VM appliance. In /home/splunkadmin/opt/splunk/etc/apps/Splunk_TA_vmware/local/enginehierarchy.conf:

hierarchyExpiration = 1800

In /home/splunkadmin/opt/splunk/etc/apps/Splunk_TA_vmware/local/engineinvvc1.conf:

inventoryExpiration = 7200
action = InventoryDiscovery
interval = 3600

After setting the inventory interval to 3600, inventoryExpiration to 7200, and hierarchyExpiration to 1800, I reduced the data collection size by more than 50%.
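
To see where the savings actually come from, a rough per-sourcetype size estimate like the following can help (len(_raw) only approximates indexed volume, and I'm assuming your data goes to the vmware index like the original poster's):

index=vmware
| eval bytes=len(_raw)
| timechart span=1d sum(bytes) AS raw_bytes by sourcetype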


bohanlon_splunk
Splunk Employee

If you are seeing this after upgrading to v3.3.1, it is likely due to the change in instance versus aggregated collection.
This post explains more:
https://answers.splunk.com/answers/470088/on-the-vmware-app-following-upgrade-from-v32x-to-v.html

Given this post is from 2012, that isn't the case here.
However, it might be helpful for people googling for generally higher data volumes.

nazdrynau
Explorer

I did not have an error like this. I had:

Encountered other certificate error: 27

So I used that regex, but in your case you will need to modify the REGEX to something like the one below, or add a new one. Just make sure that you are actually still getting data and that it is not a real error:

REGEX=SSL_accept\sfailed\swith\sUnexpected\sEOF

idsersupport
Explorer

Thanks, so this will stop me from seeing the SSL errors (SSLStreamImpl::DoServerHandshake (7f9e0a98) SSL_accept failed with Unexpected EOF) from the hosts and VC in Splunk?


nazdrynau
Explorer

On the FA, in the file /home/splunkadmin/opt/splunk/etc/apps/Splunk_TA_vmware/default/transforms.conf, uncomment everything after #NullQueues (you can copy this file to local and uncomment it there, which would be the Splunk way to do it):

#NullQueues
[vmware_vpxd_level_null]
DEST_KEY = queue
FORMAT = nullQueue
REGEX = ^\[?\d{4}-\d{2}-\d{2}[T\s][\d\:\.]{8,12}(?:[\+\-\s][\d\:]{5}|Z)?\s\[?\w+\s(verbose|trivia)

[vmware_vpxd_retrieveContents_null]
DEST_KEY = queue
FORMAT = nullQueue
REGEX = ^\[?\d{4}-\d{2}-\d{2}[T\s][\d\:\.]{8,12}(?:[\+\-\s][\d\:]{5}|Z)?\s\[?\w+\sinfo.*?task-internal.*?vmodl\.query\.PropertyCollector\.retrieveContents

[vmware_vpxd_null]
DEST_KEY = queue
FORMAT = nullQueue
REGEX = ^\[?\d{4}-\d{2}-\d{2}[T\s][\d\:\.]{8,12}(?:[\+\-\s][\d\:]{5}|Z)?\s\[?\w+\s(verbose|trivia|info.*?task-internal.*?vmodl\.query\.PropertyCollector\.retrieveContents)
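
If you go the local route, the copy looks something like this (adjust the path to your install, and restart Splunk on the FA afterwards so the change takes effect):

cd /home/splunkadmin/opt/splunk/etc/apps/Splunk_TA_vmware
mkdir -p local
cp default/transforms.conf local/transforms.conf
# then uncomment the NullQueues stanzas in local/transforms.conf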

The rest is actually done on the indexer.
Sorry for the original confusion. When I rolled out the VMware app I built a VM guest dedicated to indexing just VMware. This gave me better control over the license; I just created a new license pool with a size limit and assigned it to the new indexer.
If you have one indexer (or search head, I think they call it) you need to do it there. In my case it is Windows.
So the file location is: C:\Program Files\Splunk\etc\apps\Splunk_TA_vcenter\local\props.conf

Content:

[host::<indexer-host-name>]
TZ = America/Toronto
TRANSFORMS-vm = vmnull
TRANSFORMS-vm2 = vmnull-2
TRANSFORMS-vm4 = vmnull-4
TRANSFORMS-vm3a = vmnull-3a
TRANSFORMS-vm5 = vmnull-5

[vmware:esxlog:hostd]
TRANSFORMS-vmd = vmnull
TRANSFORMS-vmd2 = vmnull-2
[source::/var/log/vpxa.log]
TRANSFORMS-vpxa3a = vmnull-3a

In the file C:\Program Files\Splunk\etc\apps\Splunk_TA_vcenter\local\transforms.conf:

[vmnull]
REGEX=SSL\sHandshake\sfailed
DEST_KEY=queue
FORMAT=nullQueue

[vmnull-2]
REGEX=SSL_accept\sfailed
DEST_KEY=queue
FORMAT=nullQueue

[vmnull-3]
REGEX=info\s\'Default
DEST_KEY=queue
FORMAT=nullQueue

[vmnull-3a]
REGEX=info
DEST_KEY=queue
FORMAT=nullQueue


[vmnull-4]
REGEX=Not\scollecting\sstats\sthis\stime
DEST_KEY=queue
FORMAT=nullQueue


[vmnull-5]
REGEX=Encountered\sother\scertificate\serror:\s27
DEST_KEY=queue
FORMAT=nullQueue

I also uncommented the NullQueues on the indexer. Open C:\Program Files\Splunk\etc\apps\Splunk_TA_vcenter\default\transforms.conf.
You can copy the content to local and modify it there, or modify it in place, but then an update might overwrite it.
Either way, I uncommented #NullQueues there, same as on the FA.
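
To double-check which props and transforms actually end up applied after all this layering, btool (run from the Splunk bin directory on the box in question) is handy:

cd "C:\Program Files\Splunk\bin"
splunk btool props list --app=Splunk_TA_vcenter --debug
splunk btool transforms list --app=Splunk_TA_vcenter --debug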

I also had one typo in step 2.
It should be "vod" instead of "vob", i.e. https://myvcenter/vod/index.html


idsersupport
Explorer

This question is for nazdrynau. For step 4 in your first post (I had to create a nullQueue on the FA for the vCenter SSL error), which file and path do you put that info in on the FA?


nazdrynau
Explorer

A couple more things I did. I now collect 500MB a day for the whole vmware index, and that includes 4 hosts, one vCenter, and about 80 VM guests.

These steps will reduce data without loss of existing functionality:

/home/splunkadmin/opt/splunk/etc/apps/Splunk_TA_vmware/local/engineinvvc1.conf

action = InventoryDiscovery
inventoryLevel=Required
interval = 3600

This will further reduce the inventory data.
On the indexer, add nullQueues for some log events. In apps/Splunk_TA_vcenter/local/props.conf:

[host::myindexerhost_changeit]
TZ = America/Toronto
TRANSFORMS-vm = vmnull
TRANSFORMS-vm2 = vmnull-2

[vmware:esxlog:hostd]
TRANSFORMS-vm = vmnull
TRANSFORMS-vm2 = vmnull-2

[vmware:esxlog:vpxa]
TRANSFORMS-vm3 = vmnull-3


[vmware:vclog:vpxd]
TRANSFORMS-vm4 = vmnull-4

In transforms.conf:

[vmnull]
REGEX=SSL\sHandshake\sfailed
DEST_KEY=queue
FORMAT=nullQueue

[vmnull-2]
REGEX=SSL_accept\sfailed
DEST_KEY=queue
FORMAT=nullQueue

[vmnull-3]
REGEX=info\s\'Default
DEST_KEY=queue
FORMAT=nullQueue

[vmnull-4]
REGEX=Not\scollecting\sstats\sthis\stime
DEST_KEY=queue
FORMAT=nullQueue

The next steps will reduce the amount of performance data you are getting, at the cost of some functionality.
I have performance monitoring done by different agents; with Splunk I only want to monitor the VM infrastructure, host performance, and VM guest disk latency. You can use a similar technique to fine-tune what data is collected. More details at: http://docs.splunk.com/Documentation/VMW/latest/Install/engine.confsettings

The next steps will help achieve that.
In each engineperf.conf file (you might have multiple), make the following changes:
For every stanza with action=PerfDiscovery, add this setting:

perfLevel=2

(You can set level 3 if you notice something missing; the differences are described here: http://www.vmware.com/support/developer/vc-sdk/visdk41pubs/ApiReference/vim.HistoricalInterval.html.)

Setting perfManagedEntityWhitelist as below deactivates perf data collection from guests and virtual appliances, but still collects full perf info from hosts, datastores, clusters, etc.:

perfManagedEntityWhitelist = ClusterComputeResource|ResourcePool|HostSystem

Now, to enable collection of specific metrics from guests, copy the stanza you just modified (add something to the stanza name, like -disk), change
perfManagedEntityWhitelist = ClusterComputeResource|ResourcePool|HostSystem
to VirtualMachine (you can add |VirtualApp if you need it), and use perfTypeWhitelist to add the perf types you want to collect.

For example this:

[esx4host1]
url = https://host1/sdk/webService
username = splunkforvmuser
password = xxxxxxxxxx
action = PerfDiscovery
perfLevel=2
perfInstanceData = OFF
interval = 60
perfManagedEntityWhitelist = ClusterComputeResource|ResourcePool|HostSystem

will become this:

[esx4host1-disk]
url = https://host1/sdk/webService
username = splunkforvmuser
password = xxxxxxxxxx
action = PerfDiscovery
perfLevel=2
perfInstanceData = OFF
interval = 60
perfManagedEntityWhitelist = VirtualMachine
perfTypeWhitelist=disk

With this you can separate what performance data you collect from hosts and from guests. Besides disk, you can add any of the following (see the combined example after the list):

  • cpu
  • disk
  • net
  • mem
  • power
  • ds (datastore)
  • cl (cluster services)
  • ma (management agent)
  • sa (storage adapter)
  • spth (storage path)
  • rcpu (resource scheduler)
  • vdsk (virtual disk)
  • vcdbg (vc debug info)
  • vcres (vc resources)
  • sys (system)
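
For instance, to also pull virtual disk and network counters from guests, the stanza above could end with something like this (I'm assuming perfTypeWhitelist accepts a pipe-delimited list the same way perfManagedEntityWhitelist does; double-check against the engine.conf docs linked above):

perfTypeWhitelist=disk|vdsk|net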

From what I see, Splunk's Big Data approach is to grab all the data it can and then figure out what to do with it. That is a valid approach if you have an unlimited license; if not, you are paying the price for indexing useless data. We cannot afford that, so I had to do all this tweaking to get under our license limit while still keeping the important data for analysis.
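
If you created a dedicated license pool for the VMware data (as I did), a search along these lines tracks how close you are to its limit day by day (the pool name here is just a placeholder, use whatever you called yours):

index=_internal source=*license_usage.log* type="Usage" pool="vmware_pool"
| eval GB = b / 1024 / 1024 / 1024
| timechart span=1d sum(GB) AS GB_used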


colinj
Path Finder

This is great! Given how much data this app is likely to collect, I would suggest a tuning document forthwith.
