Monitoring Splunk

Displaying real-time values with an auto-refresh rate of 0.1 sec.

marvinlee93
Explorer

Is it possible to display real-time values with an auto-refresh rate of 0.1sec on a timechart/single-value display?

I have a data generator that flushes data into a CSV file (every second, or at millisecond intervals) which Splunk is monitoring. It works fine at a refresh rate of 1s, but when I change the refresh from 1s to 0.1s (or to milliseconds), the timechart stops working properly.
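For reference, here's a minimal sketch of the generator setup in Python - the file name, field layout, and rate here are placeholders, not my actual generator:

```python
import csv
import time

# Minimal sketch of the setup (placeholder file name, fields, and rate -
# the actual generator differs): append one timestamped row per interval
# and flush immediately so the file monitor sees each write right away.
def generate(path: str, interval_s: float = 0.1, count: int = 50):
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for i in range(count):
            writer.writerow([f"{time.time():.3f}", i])  # epoch time, ms precision
            f.flush()
            time.sleep(interval_s)

generate("metrics.csv")
```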

Am I doing anything wrong here? Or is this due to a limitation of Splunk?

Richfez
SplunkTrust

There are a lot of places I can think of where merely having ms precision in the timestamps wouldn't let you accurately get ms refresh rates in the UI or in any charting/graphing. Nearly all of these points apply to "real time" searching in general, let alone searches with attempted ms-level accuracy, which is why I think this is a really useful answer to have at hand to point others to.

Let me just go through a few I can think of off the top of my head, even though a lot of these don't apply (or don't apply a lot) to eventgen'ed data. I think the trip through the data flow might be useful to many people, so I'll just write it up.

The originating data may have milliseconds timestamped in them, and Splunk will use them, but in many cases the data isn't guaranteed to come into Splunk in order. In your eventgen'ed method this may not be a big consideration, but it's certainly there. Many products (Cisco firewalls, for one) use syslog as a sort of lazy dump of data, and individual events can come in far later than their timestamps would indicate they should have. If the firewall's busy, it won't even bother sending until it slows down, and then it could very well start with the most recent events first and flush its buffers of older events later.
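To make that concrete, here's a tiny illustration (made-up numbers, not real firewall output) of arrival order disagreeing with timestamp order:

```python
# Made-up (timestamp, message) pairs illustrating a busy device that sends
# its newest events first and flushes older buffered events afterwards.
arrival_order = [(100.9, "newest"), (100.2, "older"), (100.5, "middle")]

inversions = sum(
    1 for a, b in zip(arrival_order, arrival_order[1:]) if a[0] > b[0]
)
print(inversions, "adjacent pair(s) arrived out of timestamp order")

# An indexer can re-sort by timestamp after the fact...
print(sorted(arrival_order))
# ...but anything that reacted at arrival time already saw the wrong order.
```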

Once the data hits the syslog endpoint (again, not a consideration for eventgen'ed data, but I'm speaking generically), syslog should write events in the order they were received (there are probably even settings to turn this on or off), but it caches writes at least to a small extent. So if 387 events come in during the same second, it's highly likely you won't get 387 writes to the log file; I'd expect one. Who knows, maybe it'll surprise me and write 30 times, but I'm pretty sure it's not 387 separate writes for that situation.
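Here's a rough sketch of what I mean - a hypothetical write cache, not any real syslog daemon's code:

```python
import time

# Illustrative sketch (hypothetical, not a real syslog daemon): many events
# received within one interval are coalesced into a single buffered write,
# which is why 387 arrivals rarely mean 387 separate writes to the log file.
class BatchingWriter:
    def __init__(self, path: str, flush_interval_s: float = 1.0):
        self.path = path
        self.flush_interval_s = flush_interval_s
        self.buffer = []
        self.next_flush = time.monotonic() + flush_interval_s

    def receive(self, line: str):
        self.buffer.append(line + "\n")
        if time.monotonic() >= self.next_flush:
            self.flush()

    def flush(self):
        if self.buffer:
            with open(self.path, "a") as f:
                f.write("".join(self.buffer))  # one write() for the batch
            self.buffer.clear()
        self.next_flush = time.monotonic() + self.flush_interval_s

w = BatchingWriter("syslog.txt")
for i in range(387):
    w.receive(f"event {i}")  # 387 events in well under a second...
w.flush()                    # ...end up in a single write
```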

Now you have minor amounts of delay in the Universal Forwarder picking up the changes. fs_notify (or whatever that mechanism is) is likely pretty fast, but that doesn't mean it's instant; more ms of delay. The UF reads the file - I'm just about positive that it reads from the current file pointer to the end when it gets notified of changes. So if 380 events come in, written in 10 batches of 38 each, and 4 of those batches were written by the time the UF actually gets around to looking at the file, it'll read 38x4 events in one go and send them in. Which means accurate "millisecond response times" for things downstream are thrown out the window - you haven't lost the ms-level timestamps, but you've lost the ms-level timing between events as they come in. Also think about having more than one file you are reading... maybe one's not even on the same fast UF, or it's on a different disk, or it's in a folder with 10,000 other files making it slower...
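A sketch of that read-to-EOF behavior (my own toy tail reader, not the UF's actual code):

```python
import time

# Toy tail-style reader (not the actual UF code): on each poll it reads from
# the saved offset to EOF, so every event in that batch gets the same
# arrival time no matter how the writes were spaced out.
def tail_batches(path: str, poll_interval_s: float = 0.5):
    offset = 0
    while True:
        with open(path, "r") as f:
            f.seek(offset)
            chunk = f.read()
            offset = f.tell()
        if chunk:
            arrival = time.time()  # one arrival time for the whole batch
            for line in chunk.splitlines():
                yield arrival, line  # ms spacing between events is gone
        time.sleep(poll_interval_s)

# Usage: for arrival, line in tail_batches("metrics.csv"): ...
```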

Now, network latency. Of course you have the regular latency at this level, but being TCP you are at least guaranteed that it'll come in. Heh. "Guaranteed" I guess should be put in quotes. This also applied to the original data feeding in over syslog, but that was UDP, where even getting the data wasn't a "guarantee". 🙂

Then the biggie - what happens as the data travels through the various indexing, parsing and whatever queues to go from "something I just got sent from a UF" to actual indexed data? There are quite a few places where things slow down on occasion, sometimes by mere ms, sometimes by seconds, occasionally by tens of seconds or even minutes.

So, after all that, we have indexed data, and we haven't lost millisecond accuracy in the data, but we've lost millisecond timing. The "nearly real time" data we're getting has been clumped into chunks of arrival time by the many processes that have handled it up to this point. What I'm saying is that even if you have that ms timestamp in the event, it's only useful for comparing against the timestamps in other events - you can't expect the data to be coming in precisely at those timestamps, or even precisely X seconds after them, consistently. So you are already effectively prevented from doing a millisecond-accurate display of the data.

Let's handwave all that away. Suppose that somehow you manage to get ms-precision events in, with ms precision and ms timing - let's pretend they're all delayed exactly one second from their original times, so they're in order, consistent, and all that. Now what could go wrong?

So your search runs. First off, many people now just disable RT searching completely, for a couple of reasons: one is that it induces a much, much larger load on the system than a non-RT search; another is that it's generally useless, because the response time of the people handling whatever problem the RT search shows isn't instant, and fixing it isn't instant either, so a 1-minute delay really isn't a big deal. When they don't disable RT searching completely, they make it indexed real-time, which means there are yet more delays in reading it (up to 30-ish seconds normally, longer if the system's busy).

More important than those considerations is that there are frequent, small delays going on all the time inside the plumbing of Splunk. Splunk is doing more than one thing; your search can't always be the highest-priority thing happening. And even if it were, it wouldn't help, because your load being super-high-priority would actually slow down the other parts of Splunk that are required behind the scenes to display your data. There's only so much CPU - if you use it all, there's none left for actually feeding your data into your search.
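If you want to see how big those delays actually are, you can compare each event's timestamp to the time it was indexed - in Splunk those are the _time and _indextime fields. A rough sketch of the arithmetic in Python (the export of those two fields is assumed, not shown):

```python
import math
import statistics

# Sketch: quantify the pipeline delay described above from
# (event_time, index_time) pairs in epoch seconds - e.g. Splunk's
# _time and _indextime fields, exported by whatever means you like.
def lag_report(events):
    lags = sorted(round(idx - ev, 3) for ev, idx in events)
    p95_rank = min(len(lags) - 1, math.ceil(0.95 * len(lags)) - 1)
    return {
        "median_lag_s": statistics.median(lags),
        "p95_lag_s": lags[p95_rank],  # nearest-rank 95th percentile
        "max_lag_s": max(lags),
    }

# Made-up numbers: each event is indexed some time after it occurred.
print(lag_report([(100.0, 100.8), (100.1, 102.3), (100.2, 104.9)]))
```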

Now, how are you displaying this? I'll give you a hint: it's not in some super-fast, highly-optimized charting framework. Nope, there are a lot of layers between your data and your eyes. Splunk is a rather generic platform capable of all sorts of fancy shenanigans, but that general-purpose ability means no single pathway the data can take can be optimized too far. So search results come in, they get passed around the internals of Splunk as they wend their way from the search, get search-time transformations and evals done to them, and finally get passed into the layer that shows them to you.

That brings up the charting frameworks, which I think it's safe to say are pretty neat things, but not really built for ms-level absolute speed. They have to deal with running on different browsers, through different JS engines... and I'm no expert, but there are lots of places this could go wrong.

Of course, the display layer is, if you hadn't noticed, a web browser. Just think about that for a second - you want a web browser to accurately show you millisecond-accurate things. Again, I'm not an expert, but this seems fairly silly right on the face of it. When things are done with subsecond accuracy - like games - they're not interactively reading data from disk to display it. They're preloading their data because... well, that just makes sense. In browser-based games with interaction between different players, I'd expect the only data being passed around is the coordinates of the opponent and which direction they're facing. Maybe a little more, but not much. And they do a lot of prediction of where things are going, IIRC, so they only have to send that information a few times per second, not once per millisecond. And they often use Flash or some other framework with tools that let them be far more interactive.

Lastly, let's get to the ultimate display layer - your monitor. A 60 Hz refresh rate means one "chunk" of displaying every 16.7 ms at best. Against 1,000 millisecond-level updates per second, at most 60 can ever reach the screen, so 94% of your ms accuracy is gone right there.

The positives:

So, far from being critical, I think it's pretty amazing that you can consistently get about 1-second resolution out of data traveling through this long and tortuous path and coming out through the browser - at least when your servers and network are all working optimally. (Although I think the more typical 95th-percentile delay through that path to display is more like 3-5 seconds.) I see no real reason to be particularly upset about the lack of millisecond or microsecond abilities - we're trading that sort of speed for the ability to do so much more than just that one thing.

And also, none of this precludes having a timechart that DOES show milliseconds, accurately, on a chart. You just have to do it after you've collected all the data, a process that, as described above, takes a few seconds. And you can't refresh that data on screen with anything remotely close to ms-level accuracy or timing, so you have to use a more leisurely once-per-second, or once-per-10-seconds, update speed. So you can't do either of these in "real time" (which I think Splunk should rename to "less delayed" anyway, because it's not really real time - it's only real time at a single layer of that entire stack I went through above).
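For example, once all the data has arrived, bucketing it into millisecond bins for an accurate (but not real-time) chart is straightforward - a sketch with made-up timestamps:

```python
from collections import Counter

# Sketch: after all the data has arrived, bucket epoch-second floats into
# 1 ms bins for an accurate, after-the-fact millisecond chart.
def ms_histogram(timestamps):
    bins = Counter(int(t * 1000) for t in timestamps)
    return sorted(bins.items())  # [(ms_bucket, count), ...]

# Four made-up events; two land in the same millisecond bucket.
print(ms_histogram([12.0001, 12.0012, 12.0012, 12.0027]))
```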

Anyway, I'm sorry the answer isn't "Sure, just turn this knob", and maybe if you are lucky some Splunker will pop in with a "well, you might be able to get a little better by doing X", but I think the answer is what it is.

Regardless - Happy Splunking,
Rich

Richfez
SplunkTrust

I don't believe Splunk understands real time faster than 1 second, but I'm sure someone else knows more definitively. I mean, it certainly understands timestamps at subsecond intervals; it's just not built to refresh like that.

Heck, I'd be surprised if the UI could refresh in under a second. 🙂

In any case, could you share what it is you are doing? Sounds like an interesting use case.

marvinlee93
Explorer

Precisely - if Splunk timestamps so precisely, why can't it refresh on the scale of ms? I'm just creating an interactive dashboard to display real-time data from various sources.

Richfez
SplunkTrust

I was going to write up a big comment, but then realized it's probably actually the answer to the question, so I'm moving it down there. Look for it in a bit, but it's basically that Splunk has been engineered as a general-purpose truck, not a race car. There's a lot of power, but that power is focused on being able to do a lot of things reasonably quickly instead of a few things really fast.

Anyway, I feel it's a terrible answer, but it is what it is: it's what the product was designed for and around.
