We are using the JMS Messaging TA to pull messages from IBM MQ queues. Recently, a couple of these queues started receiving production load and the TA is falling behind during peak hours (7a-3p). It does seem to eventually catch up late at night around 10p, but of course that kinda defeats the purpose of "real time" monitoring.
One queue does about 120k messages an hour during peak hours and we're behind by about an hour. On the other queue we're typically about 4 hours behind, which I understand is about 600k messages sitting on the queue. The payloads can be quite large as well (HL7 messages).
I don't have access to the queues themselves, as another team manages all of that - I just point to where they tell me and use the bindings file they provide. But they want this resolved too, so I can pass questions on to them if needed. I did notice some errors in the log like the following - not all that often, but they are there.
04-12-2018 13:34:37.867 -0400 ERROR ExecProcessor - message from "python /opt/splunk/etc/apps/jms_ta/bin/jms.py" Stanza jms://queue/:SOME_QUEUE : Error running message receiver : com.ibm.msg.client.jms.DetailedJMSException: JMSWMQ2002: Failed to get a message from destination 'SOME_QUEUE'
The add-on is installed on a heavy forwarder (6.5.2) and here is an example of how one of these queues is configured:
[jms://queue/:Some_Queue]
browse_mode = stats
browse_queue_only = 0
destination_pass = gobbledygook
destination_user = some_queue_user
durable = 0
hec_batch_mode = 0
hec_https = 0
index = some_queue_index
index_message_header = 1
index_message_properties = 1
init_mode = jndi
jms_connection_factory_name = SomeFactoryName
jndi_initialcontext_factory = com.sun.jndi.fscontext.RefFSContextFactory
jndi_provider_url = File:/<path to bindings file>
output_type = stdout
sourcetype = queue:message
strip_newlines = 1
I did increase the Java heap to 512MB on line 97 of the jms.py script, but I have no idea if that should make much of a difference (it didn't seem to).
Any other suggestions for increasing performance? Or any way to determine whether the problem is on the queue itself and not Splunk? Or could it be that the HF is parsing too slowly (though I'm not sure the messages would still be on the queue in that case)?
Also, I'm assuming this is the code used by the add-on, is that correct?
https://github.com/damiendallimore/SplunkModularInputsJavaFramework/blob/master/jms/src/com/splunk/m...
Thanks!
To achieve scale, try the following in order:
1) Add/clone more JMS input stanzas pulling from the same queue. This effectively runs multiple consumers in multiple threads in the same JMS Modular Input JVM instance, which also takes advantage of any increased JVM heap limit.
2) Add more JMS Modular Inputs deployed out horizontally across multiple Universal Forwarders.
3) A combination of 1 and 2.
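To illustrate why approach 1 helps, here is a minimal Python sketch of the general "competing consumers" pattern: several threads draining one queue, with each message handled by exactly one of them. This is not the add-on's code (the add-on runs in a JVM against the real MQ queue); the `drain` function, the in-memory `queue.Queue`, and the message names are hypothetical stand-ins for illustration only.

```python
import queue
import threading


def drain(source, consumer_count):
    """Drain `source` using `consumer_count` competing consumer threads.

    Returns a dict mapping consumer index -> list of messages it handled.
    Each message is taken off the queue by exactly one consumer.
    """
    handled = {i: [] for i in range(consumer_count)}

    def consume(idx):
        while True:
            try:
                msg = source.get_nowait()  # take the next message, if any
            except queue.Empty:
                return  # queue is drained, this consumer stops
            handled[idx].append(msg)  # stand-in for "index the message"

    threads = [
        threading.Thread(target=consume, args=(i,))
        for i in range(consumer_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled


# Hypothetical workload: 1000 small messages shared by 4 consumers.
source = queue.Queue()
for n in range(1000):
    source.put(f"msg-{n}")

handled = drain(source, consumer_count=4)
all_messages = [m for msgs in handled.values() for m in msgs]
print(len(all_messages), "messages handled,", len(set(all_messages)), "unique")
```

The point of the sketch is that consumers on a point-to-point queue split the work between them rather than each receiving a copy, so cloned stanzas should raise throughput without duplicating events.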
Please contact us if you require formal support: www.baboonbones.com
Still trying to determine how many inputs we may need for each queue in total, but adding additional stanzas has been working to improve performance. I may end up deploying across multiple heavy forwarders at some point as well.
The answer is "it depends" on the specifics of your environment, message throughput/message size/any pre-processing/available compute resources etc...
The way you are going about it is just fine: scale up incrementally by using approach (1) first.
See what performance improvements you get, and keep an eye on CPU and memory usage.
When you start to max out the performance achievable with approach (1), look at approaches (2) and (3) to continue scaling horizontally as far as you need to reach your performance SLAs.
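As a rough starting point for the "how many inputs" question, a back-of-the-envelope estimate can be sketched as below. The 120k messages/hour arrival rate comes from this thread; the per-consumer throughput of 60k messages/hour is a made-up placeholder that you would need to measure in your own environment, and the function name `consumers_needed` is hypothetical. It also assumes throughput scales linearly with the number of consumers, which will stop being true once CPU, memory, or the queue manager becomes the bottleneck.

```python
import math


def consumers_needed(arrival_per_hour, per_consumer_per_hour,
                     backlog=0, catch_up_hours=None):
    """Estimate how many consumers are needed to keep up with a queue.

    arrival_per_hour      -- messages arriving per hour at peak
    per_consumer_per_hour -- measured throughput of ONE input stanza
    backlog               -- messages already waiting (optional)
    catch_up_hours        -- time allowed to clear the backlog (optional)
    """
    required_rate = arrival_per_hour
    if backlog and catch_up_hours:
        # Extra throughput needed to work off the existing backlog in time.
        required_rate += backlog / catch_up_hours
    return math.ceil(required_rate / per_consumer_per_hour)


# 120k/hour is from the thread; 60k/hour per consumer is an ASSUMED figure.
steady_state = consumers_needed(120_000, 60_000)
with_backlog = consumers_needed(120_000, 60_000,
                                backlog=120_000, catch_up_hours=2)
print(steady_state, with_backlog)
```

Treat the result as a first guess only: add stanzas one at a time, measure the real per-stanza throughput, and re-run the estimate with the measured number.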
thanks...number 1 seems like a pretty obvious answer now that you say it. I'll give that a go today!