I can see that Splunk can load data into indexes in two modes (batch vs. rising column), the latter of which loads only new rows. However, Splunk does not react to deletions or updates of data that has already been indexed. Is there a way to configure the data source so that changes I make to the data are reflected in the index?
Yes, but possibly only if you have control over the source data and how it's being written into the DB in the first place.
Create a "last_updated" column in the DB, set up so that it updates to the current timestamp any time any information in the row changes. Here's a blog post on doing it in MySQL. You can use this in addition to a "row created" type column, which holds the date/time of when the row was inserted.
Once you have that column available you can use it as your rising column. Then, whenever a row gets updated it'll be grabbed the next time DBX grabs data from that table.
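In MySQL, for instance, such a column can be declared so the database maintains it automatically on every update. A minimal sketch (the table and column names here are just illustrative):

```
-- Add an automatically-maintained "last updated" column.
-- DEFAULT CURRENT_TIMESTAMP sets it on insert;
-- ON UPDATE CURRENT_TIMESTAMP refreshes it on every row update.
ALTER TABLE orders
  ADD COLUMN last_updated TIMESTAMP NOT NULL
    DEFAULT CURRENT_TIMESTAMP
    ON UPDATE CURRENT_TIMESTAMP;
```

With that in place, "last_updated" becomes a natural candidate for the rising column in the DBX input.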
Thank you Rich7177. A problem with this approach is that it will append new data, resulting in duplicates: if I modify a record (SQL UPDATE), Splunk will index a new event but also retain the old one. As for deletes, although the rows no longer exist in the database, Splunk will still retain their events.
Yes, so the situation as you have further described it has several solutions. Which one is best in your case may be apparent, or you might need to test a bit. In no particular order:
You can take a hit on your license and not "tail" the DB, but instead grab full copies each time by using the "dump" option on the input. This is probably your best/easiest method. You could also do this for now while researching/playing with the other options.
You can take a hit on your DBA's time and ask them to build an audit table for this table, so all inserts, updates and deletes are recorded, then pull in that audit table. This will involve somewhat more complexity in Splunk searches as well, but when done right this can be amazingly flexible. Imagine:
... action!="DELETE" | other stuff ...
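Spelling that idea out a bit more (the index, sourcetype, and field names here are invented - adjust them to whatever your DBA's audit table actually looks like), a search that reconstructs the current state of each row while dropping rows whose most recent action was a delete might look like:

```
index=db_audit sourcetype=orders_audit
| stats latest(action) as action latest(status) as status by order_id
| where action!="DELETE"
```

Note that taking latest(action) per key and filtering afterwards is what lets a later DELETE "win" over earlier inserts and updates for the same row.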
You can work around the deletes and old records in Splunk with some specific searches - using stats latest() and other functions, and possibly subsearches - but that's a fairly big topic to handle in one answer. Without deletes it's much easier. Either way, if you decide to go this route, I'd give it a try (do some searching), and if you get stuck on one particular problem (like "I can't make the deleted rows not show up"), ask about that. Answers are generally easier to get when you have a short, specific problem.
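As a rough sketch of the stats latest() approach (again, all names invented), keeping only the most recent version of each row:

```
index=mydb sourcetype=orders
| stats latest(status) as status latest(amount) as amount by order_id
```

An alternative for keeping whole events is | dedup order_id, which retains only the newest event per key (events arrive newest-first by default). Neither handles deletes on its own - that's where the audit-table idea above earns its keep.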
Hmm, I had another thought but I can't recall what it is right now.
Anyway, give those some thought and maybe a quick test/trial given those hints. I'd be happy to help more if you need it once you figure out in which direction you would like to head.
Hi rich7177. Sorry for hijacking a question, but can you say more about dumping an index and getting a full copy each time data is retrieved from the db?
I'm using Splunk cloud, with dbx version 3. Lmk if you want me to post a separate question.
Three things here. First, doing an actual "dump and reload" of an index, meaning "delete the index and then reload the data":
I may have accidentally implied something that doesn't really work - deleting the index automatically and specifically. What I had in mind was data that only needs a fresh copy every now and then: you could manually clean the event data in that index and then, when you start Splunk again, reload a fresh copy. I do WAY too much testing where that's exactly what I do. Stop Splunk, clean eventdata -index blah, then start Splunk and re-run my input. Obviously this ONLY works if it's an occasional refresh you need and if it's the only data in that index.
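The sequence I mean, on a Splunk Enterprise instance where you have CLI access (so not directly applicable on Splunk Cloud), run from $SPLUNK_HOME/bin with an illustrative index name; clean eventdata is destructive, and -f just skips the confirmation prompt:

```
./splunk stop
./splunk clean eventdata -index mydb_index -f
./splunk start
```

After the restart, re-running the DBX input repopulates the index with a fresh copy of the table.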
Second, you COULD schedule a special search, run as a special user that's been given | delete rights, to clean out older data from that index. In this idea, if you are running your input once an hour, you could | delete the data that's older than your refresh. Even better would be to remove only genuinely outdated data this way, but it'll take a bit of work to define exactly what "outdated data" means in your specific case. Keep in mind that | delete doesn't really delete the data; it just makes it unsearchable, so no one will see it or be able to search on it. It will not save disk space or anything like that.
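A sketch of such a scheduled cleanup search (index name invented; the user running it needs the can_delete capability, and as noted this only hides events rather than freeing disk):

```
index=mydb_index earliest=0 latest=-1h
| delete
```

Scheduled shortly after each hourly input run, this hides everything older than the most recent refresh.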
Third, instead of fooling around with all that, you could just use only the latest data in your searches. There are a zillion techniques for this, from using variants of
| stats latest(blah) all the time, to subsearches using tstats to grab the latest event time, with a relative_time eval to do a bit of math on it and then drive the time range of your real search, so that it only uses the latest data. (Say you have a DB that gets pulled in every 10 minutes. If it takes a few seconds to actually pull it in, you could use a subsearch that uses tstats to find the latest time in that index and sourcetype, relative_time that minus one minute (or something), then feed that back into your base search as your "earliest".)
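Putting that together (all names invented): a subsearch finds the newest event time via tstats, backs it off by one minute, and returns it as the outer search's earliest, so only the most recent load is searched:

```
index=mydb sourcetype=orders
    [| tstats latest(_time) as latest_time where index=mydb sourcetype=orders
     | eval earliest=relative_time(latest_time, "-1m")
     | return earliest]
| stats latest(status) as status by order_id
```

The | return earliest trick works because the subsearch's output, earliest=<value>, is interpreted by the outer search as a time modifier.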
Another option is to create a macro that handles most of that. Yet one more idea is to craft your subsearch to handle the case where some data isn't there any more (for instance, to remove delete statements:
... | [ search blah NOT delete ], or even just
... NOT delete right in the root search).
The point being, there are many ways to take care of your specific situation once we know what that situation is.
If those help a lot, great. If not, I'd create a new question and ask - lots of folks hang out checking for questions just like this because they're also good at this. Otherwise I'm a bit busy this weekend, but I'll try to pop on and check for your question!