
DBConnect Input Rising Column on Last Modified Date

fvegdom
Path Finder

I am currently exploring the possibilities of hooking up a production database to Splunk. I noticed the Rising Column input type is advised for getting data that only gets created and is never updated.
In my application, though, every table has a last modified date. What if I made the last_updated column the rising column? Would I be able to index all updates to my tables (perhaps with some duplicates if they are indexed at the same time), or am I forgetting something?

I realise that you also have the problem that if a record is updated more than once before Splunk pulls from it you might lose data. Am I forgetting anything else?
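
For reference, the rising-column query I have in mind would look roughly like this (table name made up; as I understand it, DB Connect swaps the stored checkpoint - the highest rising-column value from the previous run - in for the ? placeholder):

SELECT *
FROM MyTable
WHERE last_updated > ?
ORDER BY last_updated ASC

Each run would then pick up rows whose last_updated is newer than the checkpoint, which is also why a row that changes several times between runs would only be indexed in its latest state.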

1 Solution

Richfez
SplunkTrust

This will probably work reasonably well, with some caveats.

The reason date columns tend to be bad is that it's hard to get exactly one copy of all the data in all circumstances, due to overlapping requests, too coarse a granularity, and so on. If your timestamps have a fine granularity (ms or better), it's probably going to be OK, or if they have a low update rate (a relative term 🙂 ) it could be pretty usable.

My rule of thumb: if you tend to have one or fewer records per timestamp granularity value, you'll probably be reasonably OK. So if your timestamp granularity is seconds, you won't usually have big problems as long as your data doesn't generate more than one new row per second.

There IS a way around some of this, though. One technique I use is to grab only slightly older data in the SQL I'm using to pull it. Often I can get by with a few minutes' delay on the data. I may run a particular search every 5 minutes, right? Having a 6 or 10 minute delay is totally fine; I only do 5 minutes because it's convenient (see below for a trick I found). So as part of my SQL statement I write (pseudocode, fix to your own date math for your DB type) something like

SELECT Field1, Field2, ...,  DATEADD(mi, -1, MyTimeStamp) AS ShiftedTimeStamp ...

Then I use the ShiftedTimeStamp field as the rising column. I have not convinced myself this solves all the problems. Indeed, I've not convinced myself it actually solves any problems, but I have no other choice in a few particular DBs, so I do this to soothe my soul. 🙂
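
Put together with the rising-column checkpoint, the whole statement ends up looking roughly like this (still pseudocode - MyTable is a stand-in name, and the ? is the placeholder DB Connect fills with the last checkpoint value):

SELECT Field1, Field2,
       DATEADD(mi, -1, MyTimeStamp) AS ShiftedTimeStamp
FROM MyTable
WHERE DATEADD(mi, -1, MyTimeStamp) > ?
ORDER BY ShiftedTimeStamp ASC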

So the tip: in most everything in Splunk, at least in our case, the exact timing of something doesn't matter. 5 minute, 15 minute, or 1 minute intervals are only for my own convenience - a few seconds one way or the other makes no difference. So instead of using 120 seconds or 300 seconds or 60 minutes or whatever, I use prime numbers. It doesn't matter which primes I use; I just pick one near the time value I want. So instead of 120 seconds I'll schedule things at 131 or 109 or whatever. For 5 minutes I'll use 347 or some prime near there.

What this does is spread out the requests so that as few as possible overlap. If you have all your stuff going off every 300 seconds, lots of things will trigger at nearly the same time. You can do better by just adjusting them up or down a few seconds, but the way to ensure they fall on separate seconds most often is to make the intervals prime. (Technically, I think they only need to be co-prime, but when you have to do "co-prime" with a LOT of numbers, it's just easier to pick them from an actual prime list. 🙂 )
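
To put rough numbers on that: two inputs that both run every 300 seconds line up every 300 seconds, while inputs scheduled at 347 and 353 seconds only line up every 347 x 353 = 122,491 seconds, roughly 34 hours.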


fvegdom
Path Finder

Thank you for your answer; your prime number interval idea is very interesting.

I do not quite understand the benefit of the shifted timestamp, though. If you make the shifted timestamp the rising column, the checkpoint value that the records in the DB are compared against will also be shifted 5 minutes back, right? The fact that the last modified date is also tied to the real world, in that it is set to the current time when the row is written, has no effect here.
It would be as if you had a rising column on a primary key (identity) column and shifted it back by 5: instead of comparing 8 with 7 you would be comparing 3 with 2. Or am I misunderstanding you?


Richfez
SplunkTrust

Also keep in mind the "inserts per timestamp granularity" ratio - as long as that's OK, you will probably be more or less fine. It's only where you have multiple inserts that can't be told apart by the timestamps alone that the issue becomes, IMO, significant.

Unless you HAVE to have every record, perfectly, every time - in which case I know of no way around it except to grab only historical data that we know to be complete, never grabbing the "current" timeframe.


fvegdom
Path Finder

I see now, thank you for your clarification.


Richfez
SplunkTrust

The idea of using a shifted interval came up in a DB (vendor supplied and in which I can't make the changes I'd like) that uses a smalldatetime instead of a datetime for the row timestamps.

That means it has no seconds. Granularity is minute-by-minute, and at peak it does a few hundred rows per minute. I believe the logic I outline below is valid even for second- or millisecond-based timestamps, though obviously not to the same extent.

So the logic of shifting your timeframe is simple: without this, each time you collect data the precise timing matters. When new events get written to the DB, I have no control over whether they get written with a timestamp of the previous minute or the next minute. So when I pull the records FROM the DB, DB Connect just grabs everything that existed at that time. Then next run it'll grab events that are newer than that time. So in the case of minute-granularity timestamps, an average of 30 seconds' worth of events may come in that you won't grab. If you were to grab the past 10 minutes of events each time, that's 5% you'll miss (30 seconds out of each 10 minutes).

By shifting the time periods back, you should at least only grab data from minutes which are entirely completed. This should sidestep the problem.

If your timestamps are more granular you still have this problem. The scenario above still applies, but it depends a lot on the number of rows expected per granularity period. If you have a DB with one-second granularity which gets 100 rows per second written to it, you'd expect to miss about 50 of those each update (the ones that come in later during that second). By shifting backwards even just one second, you've changed it to where you are only asking for "seconds" that are fully written. (Hopefully - which is why I suggest a minute back as an easy margin, though one would think 5 seconds would be enough!)

Unfortunately, with the finer-grained timestamps you run into another issue: if the SQL side of things isn't perfectly fast, then depending on your DBMS you could run into a situation where it has a record it hasn't written yet, but which will be written with a timestamp of a few milliseconds ago (or even exactly "now" at the available granularity). But your SELECT has already pulled everything up to a "now" from a millisecond ago, missing that one.

Or in other words, these are all precise-timing issues, which are hopefully mitigated somewhat by only requesting data that's slightly historical and thus, hopefully, already complete.
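
As a concrete sketch of that (hypothetical names, T-SQL-style date math you'd adapt to your own DBMS), the upper bound keeps the still-open minute out of the results while the ? checkpoint keeps the lower bound rising as usual:

SELECT Field1, Field2, MyTimeStamp
FROM MyTable
WHERE MyTimeStamp > ?                          -- rising-column checkpoint from the previous run
AND MyTimeStamp < DATEADD(mi, -1, GETDATE())   -- only minutes that are fully complete
ORDER BY MyTimeStamp ASC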
