Sorry in advance, this is a long post, so I'll try to describe it in a sentence or two first in case this is easy enough that you don't need to read the short novel I wrote below to figure it out.
Q. I need Splunk to help me figure out "what changed" by ingesting all Windows files, directories, and registry keys, then have Splunk compare that indexed data against the next ingest and trigger notifications, via a query, dashboard, ITSI service, or ???? An example: someone changed a hosts file and "broke the internet," but of course there are no change-control records showing any work was done the previous night.
This may go a bit beyond what Splunk was designed for, but I've also learned that Splunk can do about anything you can dream up. Here are a few use cases for why I need this (and so do you).
Use case 1: I need to track changes and differences between large clusters of application servers, since change control is something my organization doesn't believe in. Here are some differences I've found that caused outages: modified hosts files, missing hosts files, differences in application or OS patching, registry hives, configuration files, .DLL/file versions, file sizes, you get the idea.
Use case 2: Once we figure out use case 1, I'd like to set up triggers so that when specific changes we define are detected, Splunk sends an alert/email to notify some poor guy in our NOC. For example, if someone changes a hosts file or patches one of our Prod servers, notify someone.
What I've Tried
In the past I've used PowerShell DSC to automate "fixing" when changes happen, but this is a slightly larger hammer than I want since, as with patching, we often want those changes. But when something is broken and the inevitable "what changed?" comes up, nobody knows, or they don't know all the files, registry updates, etc. that happened.
I've also looked at the Security event logs, which will list which files were changed or updated and who made the changes, but if you install a patch, for example, they won't tell you exactly which files were changed, which registry keys were updated, etc.
I've used SMS and ZENworks back in the '90s, which worked great for identifying "what changed," but those tools don't really work on modern systems since, just sitting idle, your system now makes about a million updates and changes, which wasn't always the case.
SysInternals Process Explorer is a great tool, but if you use the snapshot feature to see "what changed," even without installing or changing a single file, thousands of files and registry keys get updated, and the longer you wait, the more changes pile up.
What I'm Doing Today
So what I currently do is run some basic "tree" and "dir" commands, pipe the output to a .txt file with variables that name the file after the server it was run on, timestamp it, and copy the output to a NAS share with all the other files. Then, when there's an issue, I use WinMerge to load 2 or 3 files from suspicious systems; it automatically highlights the differences in my output files, and very slowly and manually I find "what changed" from the night before and can usually solve the mystery much faster than anyone not using this approach.
What I'm Hoping is Possible
If you're still following my rambling question, here's what I'm hoping is possible with Splunk to make this either way easier, less manual, or, ideally, plain old automated in a query, dashboard, or ITSI thingy.
I'd like to think that since we're capturing perfmon, IIS logs, and the normal event logs (and whatever comes with your standard Windows TA), our Splunk indexes somehow already have a complete listing of system files and directories (registry data would be a bonus). Then, with that data, run some sort of sysdiff each morning, notify me that something changed, and output the differences (or at least the file/directory that changed) to some location we can use for further investigation.
Then, once that's working, set triggers on certain files so that, proactively, when a specific file, app, or registry key is changed, I get an email before it's even reported as a problem.
I know that's a lot, but I also know it's possible, or at the very least scriptable; I'm just too new to Splunk and too dim to have the skill set needed to make this happen. And if there's some totally better way of doing this outside of Splunk, I'm fine buying or learning another product and having Splunk call/index that other "thing" to get our notifications out, but I'm really trying to keep Splunk as my central hub for all our notifications one way or another. Thanks for the time!
Long text, good use cases, nice work.
I noticed there was no mention of auditd for Linux and Sysmon for Windows. You should definitely look into those, since they are made for exactly this purpose and let you set conditions and generate alerts based on changes: permission modifications, file/directory removal, etc. They would make your life much, much easier and would replace all the manual scripting for tracking changes and indexing an "ls" of entire directories.
EDIT: both are free to use as well!
Hope that helps.
Thanks David. Once I figure this out, I'd imagine everyone here would want to use something very similar for the "what changed" outages. It was long, so I don't blame ya for missing the part toward the end where I suggest the solution is hopefully part of Sysmon, since we already ingest that. Good point on auditd, though, since we do have some Linux boxes here, and once I figure out how to do this on Windows I'm for sure going to copy the process to our Linux boxes.
I'm also working with this rockstar consultant who's given me some ideas, but if you're up to the task, here are my suggested next steps. Feel free to dream up a solution and share your findings, and I'll do the same. His suggestion is to create an index called "directory" and use the monitoring function "monitor://C:/*" to watch the data in question. Then use a lookup (vlookup, maybe?) to check for differences and generate a report that triggers when something changes. In theory this looks easy, but I don't have the Splunk background to understand the exact steps or commands needed to make it happen, so again, if you have real examples and syntax to help with, I'm happy to try anything you can dream up. Thanks again!
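For concreteness, the monitor stanza the consultant is describing would live in `inputs.conf` on the forwarder. This is a hedged sketch scoped to a single directory rather than all of C:\ (the path, index name, and sourcetype here are illustrative examples, not the consultant's actual config):

```ini
# inputs.conf on the universal forwarder (path/index/sourcetype are examples)
[monitor://C:\Windows\System32\drivers\etc]
index = directory
sourcetype = file_listing
disabled = 0
```

Note that `monitor://` ingests the contents of matching files, so scoping it tightly matters for license usage.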
Glad I could help, and great to have someone helping you out building use-cases.
Be careful about "monitor://C:/*" in Splunk, though; this will end up indexing all your files under C:\, which means bye-bye license and hello data noise. Instead, either stick to monitoring directory changes and last-modified dates, which will give you a good idea of when files have been modified.
Or, since you mentioned you have Sysmon, you're better off monitoring Sysmon Event ID 2: this will automatically tell you what changed and which process changed it.
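As a rough sketch, searching those Sysmon events might look something like the following SPL. The sourcetype and field names depend on which Sysmon add-on/TA is deployed, so treat them as assumptions to verify against your own data:

```
source="XmlWinEventLog:Microsoft-Windows-Sysmon/Operational" EventCode=2
| table _time host Image TargetFilename
| sort - _time
```

`Image` is the process that made the change and `TargetFilename` is the file it touched.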
Thanks David, one of our Splunk guys was quick to point that out too, although not quite as constructively, so thanks for the heads-up. I'm curious how the consultant responds and what his plan was to minimize the impact for these use cases.
And using existing data was my original plan from the beginning, but the event logs only tracked actual installs like MSIs. Sysmon Event ID 2 looks way more promising and is something I'm going to bring up to see if we can leverage it. Really, it sounds like he's got the rest figured out for how we'd look up the changes, do comparisons, and trigger alerts, so getting the data in seems to be the hardest part. And technically, I can create the output I need manually by dumping the registry and filesystem to a text file, which comes out to about 400 MB, which isn't great but is better than the alternative.
Thanks again for the help. I'm sure there are a dozen guys out there trying to learn this between our real jobs with no training, and it's replies from someone with nothing to gain, like you, that give me a little more hope that not everyone out there is a huge d0uc4e when training or the trial-and-error method isn't cutting it for issues you need to solve yesterday 🙂
Most welcome @mariog2000, if you get time, let me know how it goes for you and which path you follow for this sort of monitoring. It's always interesting to hear how people tackle these kinds of problems, especially since there is no "one" solution for this kind of challenge.
Will do, thanks. As I was thinking about this, I feel like I might run into the same problem with Sysmon or the event logs, which is: how would I index those details in a way that lets me compare 2 systems? Hopefully the consultant has some ideas, but what I started doing after your comment was looking at manually outputting some data for my baseline. Let me know your thoughts on this.
Assuming we can't find the data already buried somewhere in the event logs or Sysmon, this seems like it might serve the same purpose, since dir /s, ls -la, etc. output the file, date, and size fields, so if something was overwritten, corrupted, etc., it would at least be called out so we could further investigate what triggered the change.
That leaves me with my final question: is there an ideal way to ingest semi-structured text files into Splunk? Thanks again, I hope to have some new updates for ya today 🙂
"how would I index those details in a way to compare 2 systems? "
It's easy: each host sends its data to Splunk tagged with a host field, so you can easily differentiate between the two of them and simply use the host field to be sure you're comparing the right systems together.
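As a hedged sketch of that idea, assuming the directory listings are indexed with a sourcetype like `file_listing` and fields `path` and `size` already extracted (all of these names are assumptions), a search that surfaces entries not shared by both hosts could look like:

```
sourcetype=file_listing (host=server01 OR host=server02)
| stats dc(host) AS host_count values(host) AS hosts BY path size
| where host_count < 2
```

Any `path`/`size` pair seen on only one of the two hosts comes back with the host it was seen on, which is exactly the "what's different between these servers" view.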
Also, be sure to check out Splunk's native capabilities for monitoring registry changes and directory listing changes before you start scripting. The registry change monitoring documentation is here: https://docs.splunk.com/Documentation/Splunk/7.3.1/Data/MonitorWindowsregistrydata
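Per that documentation, native registry monitoring uses a `WinRegMon` stanza in `inputs.conf`. A rough example follows; the stanza name, hive regex, and index are illustrative values you'd adapt, not a drop-in config:

```ini
# inputs.conf: native Windows registry monitoring (example values)
[WinRegMon://hklm-run-keys]
hive = \\REGISTRY\\MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run\\.*
proc = .*
type = set|create|delete|rename
index = directory
disabled = 0
```

`hive` and `proc` are regexes over the registry path and the process making the change, and `type` filters which operations get indexed.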
Finally, reading your three steps gave me a good idea that can reduce license usage and still get you what you're looking for. Instead of indexing 400 MB daily, you can simply index the "diff" of daily changes. So on your system all you'll need is the output of "yesterday" and the output of "today"; then, with a scripted input in Splunk, you simply fetch the diff of the two files. This should drop the indexed data from 400 MB to less than 10 MB.
Let me know what you think 😄
Great stuff David, thanks again. I'm looking into the first part now. As for the second part, my initial problem was figuring out how to "diff" the 2 files, which is where I'm heavily leaning on my consultant to help via Splunk.
I could try using vlookup to get the diff of each file, but if I script capturing the output and I script doing the vlookup, at that point I may as well just script email notifications when certain changes are triggered (which I might end up doing anyway). But this is why I want to get a good POC done, since I'm guessing the best solution will bubble to the top after playing with a few test cases once we have the data to play with. I'll keep you posted with updates as I clumsily hack through these things. Thanks again!
Yeah, you're right! The best way to find the right solution is to test and see which one gives you the best outcome.
Well, keep me posted, and don't forget to up-vote and accept if the answer was helpful 🙂