
Splunk not usable for desktop app analytics service (performance issues)?

MatMeredith
Path Finder

Over the past couple of weeks we’ve been trialling the free version of Splunk with a view to using Splunk Enterprise as the basis for an analytics system for a desktop VoIP app. Unfortunately, what I’m seeing so far suggests that performance is ~1000x short of what we need. At the moment it looks to me like this means Splunk is just not suitable for this purpose. Is that true?

  • I’m currently running Splunk Free, and I’ve tried it on both a Linux VM (1 CPU, 2GB RAM, 40GB HD) and on a Windows 7 PC (Core i5 M560 CPU, 4GB RAM and 128 GB SSD). Performance was similar on both.

  • Following the advice here, our clients generate JSON logs that look like this:

{"timestamp":1362482851648,"name":"User opening voicemail window","timezone":"Greenwich Mean Time","session_id":"dd619139-6ec7-4c25-b442-5a563118b2c7","service_provider":"Test SP","os_platform":"x86","app_version":"0.8.0.build.by.SVN","debug_mode":true,"uuid":"60a15079-c403-4e29-9fbd-7b76a4ff9cb6","os_version":"Windows7"}

  • To test scalability I’ve loaded Splunk with ~1.2 GB of event data comprising ~5 million events.
  • Even a very simple search to determine how many unique users we have, e.g.

index="test_large" | stats dc(uuid)

is taking ~1 minute to complete. More complex searches take ~4 minutes, and searches involving e.g. “transaction” are too slow to be runnable. Furthermore, search times seem to be increasing at best linearly with the volume of data.

  • This volume of data is what I’d expect to be generated by about 2500 users in 30 days, and I need a solution that’s going to scale to e.g. 1 million users whilst supporting dashboards that load in seconds rather than minutes. Based on the performance I'm seeing so far, with a million users even the simplest of searches on just the past 30 days of data are going to take over 6 hours to run (1 million users is ~400x my 2500-user test population, so a search that takes ~1 minute today becomes ~400 minutes).

Does this mean that Splunk is simply not a suitable product for our purposes (in which case I'm back to building a solution on a traditional database), or is there some way I can get the kind of performance improvements necessary here? A couple of things to note...

  • I'm aware I'm clearly not running on top-end dedicated hardware, but I'm doubtful that any realistic hardware upgrade is going to come even close to making the difference necessary here.

  • As above, I'm already only ingesting pretty minimal logs that contain just the information necessary for analytics, so I don't think I can reduce the volume of data massively.


jonuwz
Influencer

The 2nd piece of hardware has a 128GB SSD.

If this is moderately recent it will easily best the 1200 IOPS reference recommendation.
For an "all time" search on 1.2 GB of data it should suck it up in < 5 seconds.

So if you're getting similar performance on the VM as the i5 box, something is seriously bottlenecked in the software (or you're doing it wrong).

You have 4x the cores, 2x the RAM, infinity x the disk I/O - that's just not right.

In my experience transaction and spath are hideously slow - I bet you use both.

So, rework your searches to not use transaction - it's possible 95% of the time (sketch below).
If you can change the log format to key=value pairs instead of JSON, do that too.
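
For instance, a transaction-style question like "how long are sessions?" can usually be answered with stats instead (a sketch only -- session_id and the other field names come from your sample event, and I haven't run this against your data):

index="test_large" | transaction session_id | stats avg(duration)

becomes

index="test_large" | stats range(_time) as duration by session_id | stats avg(duration)

And your sample event as key=value pairs would look something like:

timestamp=1362482851648 name="User opening voicemail window" timezone="Greenwich Mean Time" session_id="dd619139-6ec7-4c25-b442-5a563118b2c7" service_provider="Test SP" os_platform="x86" app_version="0.8.0.build.by.SVN" debug_mode=true uuid="60a15079-c403-4e29-9fbd-7b76a4ff9cb6" os_version="Windows7"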

Host the scrubbed sample data and your queries and lay down the challenge.

Now, at the risk of being downvoted to oblivion...

Splunk is a Swiss army knife. It stores your data, searches your data, creates pretty charts, alerts, creates PDFs etc etc etc.

However, its biggest advantage is its approach to storing and retrieving data.

With Splunk you just throw logs at it and worry about parsing them into fields later.

A classic RDBMS is the opposite. You define your data, design a schema, then throw your data at it.

So... if

a) your data is highly structured

b) you know exactly what you need to present from your data already

c) you have the in-house experience to design a performant RDBMS

d) you have the in-house experience to present it nicely.

e) you already have support contracts in place with the alternative vendors.

Rolling out Apache with mod_<insert_language_of_choice>, OpenLDAP and MySQL/PostgreSQL will probably give you better performance on a single node. And you can spend the license cost on better kit.

</end_flame_bait>

Drainy
Champion

Splunk is perfect for your needs, but I suspect you haven't looked at the requirements?

Splunk locks a core each time it runs a search. If you have a single CPU you have to allow for the fact that it will be handling the OS, other processes, any searches you are running, and any scheduled searches you may have set up or have from other apps, along with monitoring files. Add to this that the OS scheduler will be swapping it in and out, and you will find it creak to a halt.

Secondly, IOPS are critical: try running bonnie++ to see how fast you can read/write to disk; this will dictate how quickly Splunk can run searches and process data (example below).
Reference Hardware - http://docs.splunk.com/Documentation/Splunk/latest/Installation/Referencehardware
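
For example, something like this (the path, size and user are assumptions -- point -d at the disk your Splunk indexes live on, and make the test file bigger than RAM so the page cache doesn't flatter the numbers):

bonnie++ -d /opt/splunk/var -s 8g -u splunk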

What you want to achieve is perfect for Splunk; the way it stores and searches data makes it considerably more agile than a normal relational database, but you do have to give it some hardware to get it off the ground.

jonuwz
Influencer

@lguinn - 2nd test system has an SSD - this in all likelihood demolishes 1200 IOPS.


lguinn2
Legend

Matt - I think you need to seriously look at your I/O speed. A VM is fine to test functionality, but very poor for testing speed. Not only are you way undersized on CPU and memory, my experience with VMs suggests that they often run at about 35-50 IOPS. Splunk wants 1200. That's a 24-34x difference, not even considering the CPU/memory requirements. (I'm not going to comment on the PC, except that I doubt it has good I/O.)

Would you consider running a production DBMS on your test server? If not, there is no way that your environment is a reasonable performance testing platform for Splunk.

Drainy
Champion

http://docs.splunk.com/Documentation/Splunk/5.0.2/Knowledge/Usesummaryindexing
Furthermore, you can then expand Splunk horizontally. Its biggest strength is in its distributed setup: the more indexers you have, the more efficient the search is, as it can start to utilise map/reduce functions to increase the speed of statistical searches. I've deployed Splunk in a telecoms environment where we are monitoring switches and routers across Europe on well-specced machines with no performance issues.


Drainy
Champion

I wouldn't necessarily trust the number of events per second; this will vary wildly depending on timestamps, bucket settings, structure of the data etc. Also, it's worth pointing out that that is the bare minimum for running Splunk; we wouldn't run it on anything like that in production. Now, it's true that running a statistical command will cause a hit on the system, but then you need to think about how to architect your system. Firstly, I would say that for a 30-day statistical analysis I would run an hourly or daily scheduled search to build summary data which you can then search much more quickly (sketch below).
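
As a rough sketch of the summary-indexing pattern (the saved-search name and summary index here are hypothetical; the uuid field is from the sample event in the question): schedule a search like

index="test_large" | sistats dc(uuid)

to run hourly or daily with summary indexing enabled, writing into index=summary, then point the dashboard at the (much smaller) summary:

index=summary source="daily_unique_users" | stats dc(uuid)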


MatMeredith
Path Finder

(FYI: I only have a single user, I don't have any scheduled searches, and Splunk isn't doing any indexing whilst running these tests.)


MatMeredith
Path Finder

Like I say -- I need these searches to run 1000x faster, not just a couple of times faster...

If Splunk has to search all the events to produce stats like these (and what I'm seeing in the UI suggests that it does?) then it seems like it's never going to deliver the performance I require. 1m users are going to produce 500GB of data in 30 days, and Splunk will never be able to load that fast enough -- let alone search it. The only option is surely to store the data in a structured form that allows the necessary stats to be generated without having to search all events.


MatMeredith
Path Finder

Thanks. Are you claiming that running on the reference hardware would deliver the orders of magnitude performance increase I need? I'm still doubtful...

  • The reference hardware says that it delivers search performance of "Up to 50,000 events per second for dense searches".
  • My current test (with data from just 2500 clients) gives 5 million events in 30 days. If Splunk has to search all the events to produce a basic stat like the number of unique users, then that's going to take 100 seconds (5,000,000 / 50,000) on the reference hardware -- not massively different to the 170s I'm seeing in my testing...