Getting Data In

Working with an existing Splunk Environment, how should I start making this CIM compliant (normalization)?


Hello all,

In a current project, I have to work with an existing Splunk environment that has been in use for about 2 years (about 250 billion events indexed so far). No TAs were used for indexing: about 70% of the events have eventtype syslog (from a LOT of different devices), and the rest is mostly syslog as well (but with a different eventtype), plus some Windows Event Logs and some logs from internal applications. Events are kept apart by port, e.g. the port the incoming syslog arrives on.

My job is, or will be, to make all this CIM compliant (we call it normalization). The problem is that I am not really sure how to start.

What I plan to do is identify the different sources (e.g., Firewall xyz) and the port each one sends syslog to. Then I'd search Splunkbase for an app providing CIM-compliant field extractions, event types, and sourcetypes for that appliance and use those. If I can't find a dedicated TA, I might be able to extract the config files from an app (no need for that many dashboards), or I'd have to write the config files myself.
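To make the inventory step concrete, here is a minimal sketch of a port-to-device mapping that renders one `inputs.conf` network-input stanza per port. The ports, device names, sourcetypes, and index names are all made-up placeholders; the real values would come from the environment and from whichever TA matches each device.

```python
# Hypothetical inventory: syslog port -> device and the sourcetype a matching TA expects.
# Every value here is an illustrative assumption, not taken from a real environment.
INVENTORY = {
    5514: {"device": "Firewall xyz", "sourcetype": "vendor:firewall", "index": "netfw"},
    5515: {"device": "Proxy abc", "sourcetype": "vendor:proxy", "index": "netproxy"},
}


def render_inputs_conf(inventory, protocol="udp"):
    """Render one inputs.conf network-input stanza per port, sorted by port."""
    stanzas = []
    for port in sorted(inventory):
        entry = inventory[port]
        stanzas.append(
            f"# {entry['device']}\n"
            f"[{protocol}://{port}]\n"
            f"sourcetype = {entry['sourcetype']}\n"
            f"index = {entry['index']}\n"
        )
    return "\n".join(stanzas)


if __name__ == "__main__":
    print(render_inputs_conf(INVENTORY))
```

Keeping the inventory in one place makes it easier to see which ports still lack a matching TA before any input is changed.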

Is this a sound way to start with this task?

Thanks a lot!



I don't know if this would be the best way, but it is how I've been proceeding with a huge cleanup. We had a tangled mess of inputs that needed to be separated from one another and to have standard TAs applied to them. It was made easier (though by no means "easy") by having a small user base, but I think the technique may still be valid for you.

First, I created a smallish VM to use as a deployment server (DS), then I built an entirely new Splunk server to replace my rather aged existing Splunk server. I did NOT cluster the two, but did tell the new server to use the old server as the license master. I also created a syslog server (actually, syslog-ng installed on the new Splunk box) to make that easier and better.

Then, one by one I would:

Duplicate inputs to both boxes temporarily, using a temporary index on the new side, and using the DS to deploy an app with just the inputs if possible, or making the changes by hand (e.g. in syslog) where not. Then confirm data was now getting to the new Splunk instance.

Once I had data, I would find all the parsing and input-side configuration on the old server and create a small app to deploy it to the new Splunk instance. Once I had it parsing correctly, I would then redeploy into/with a "permanent" index. This way, if I had messed stuff up too badly earlier on, I could easily wipe that temporary index out and start over.

Once I had data going in and being properly parsed, I'd find all the pieces of a particular app/dashboard/whatever and start compiling them into a deployable app as well. In some cases I just rebuilt the app, using about half of what I had and new ideas for the remaining half. It's amazing how much fun that can be when you know you already have the problem solved elsewhere and you are just copying and enhancing it as you go.

Once I had it working well enough, I would notify users of the change and then finally drop the old inputs.
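For the "parsing correctly" check in the steps above, one rough way to judge CIM readiness is to measure how often the expected fields actually show up in a sample of extracted events. The sketch below assumes the events were exported from a search as a list of dicts; the field list is a trimmed, illustrative subset of what the CIM Network Traffic data model expects, and the sample events are invented.

```python
# Illustrative subset of fields expected by the CIM Network Traffic data model.
EXPECTED_FIELDS = ["action", "src", "dest", "src_port", "dest_port", "transport"]


def field_coverage(events, expected=EXPECTED_FIELDS):
    """Return, per expected field, the fraction of events where it is present and non-empty."""
    if not events:
        return {name: 0.0 for name in expected}
    total = len(events)
    return {
        name: sum(1 for event in events if event.get(name) not in (None, "")) / total
        for name in expected
    }


# Invented sample events, standing in for an export from the temporary index.
sample = [
    {"action": "allowed", "src": "10.0.0.1", "dest": "10.0.0.2", "dest_port": "443"},
    {"action": "blocked", "src": "10.0.0.3", "dest": "10.0.0.4", "dest_port": "22",
     "transport": "tcp"},
]

coverage = field_coverage(sample)
# Fields that are not populated in every event still need extraction work.
missing = [name for name, share in coverage.items() if share < 1.0]
```

Fields that land in `missing` are the ones where the TA (or a hand-written extraction) still needs attention before moving to the permanent index.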

That last step was the sticking point a few times, but it always turned out that the longest I needed to run the two in parallel was about 3-4 weeks. Once there was nearly a month's worth of information in the new system, the inputs going to the old server could be turned off, leaving its history intact as a place to search if users needed it.

Do watch your license if you proceed like this. We had no particular issues, but our largest input at the time was only about 15% of our license amount. If I had had any larger ones, I might have been forced to find an accelerated or more "disruptive" way to handle that one input. One option I had thought of: if an input comes from multiple hosts, I might have been able to redirect just one or a small sampling of the hosts to the new server to get enough data to work with, then make the cut-over more drastic when the time came.
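The license arithmetic is simple enough to sketch. This assumes that data sent to both servers counts twice against the same license while the inputs are duplicated, which is how I understood the setup with the old server as license master. The license size and per-input volumes are invented numbers.

```python
def parallel_run_usage(daily_gb_by_input, duplicated_inputs):
    """Projected daily volume when the listed inputs are indexed on both servers."""
    base = sum(daily_gb_by_input.values())
    extra = sum(daily_gb_by_input[name] for name in duplicated_inputs)
    return base + extra


LICENSE_GB = 100.0  # hypothetical daily license volume
usage = {"firewall": 15.0, "windows": 10.0, "apps": 40.0}  # hypothetical daily GB

# Duplicating only the 15 GB input: 65 GB normal volume + 15 GB duplicate.
projected = parallel_run_usage(usage, ["firewall"])
fits = projected <= LICENSE_GB
```

Running the same calculation with a large input in `duplicated_inputs` shows quickly whether a full parallel run is affordable or whether only a sample of hosts should be redirected.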



Just to clarify: the data is already in use as it comes in now, i.e. someone is already running searches on it or powering dashboards and alerts with it? I'm asking because some TAs will change your metadata to more appropriate and meaningful values, e.g. change the sourcetype, source and/or host fields based on the content of the raw data. While this is generally in your interest, it might be a good idea to look at the existing knowledge objects and check whether they will still work if the metadata changes.
Other than that consideration, I'd say using existing TAs is a good idea to save you the hassle of making your indexed data CIM compliant yourself. If you can, you might also want to change the method of getting data in, from grabbing syslog directly to having a daemon write syslog to files and indexing those files (if I interpreted your situation correctly and that's what you're doing at the moment), see here for some details on why.
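As a starting point for the knowledge-object check mentioned above, something like this could flag saved searches that reference a sourcetype a TA is about to rename. It is only a sketch: it assumes the search strings have already been collected into a dict (for example from `savedsearches.conf`), the search names and SPL are invented, and it will not see sourcetypes hidden behind macros or eventtypes.

```python
import re

# Matches sourcetype=foo, sourcetype = foo and sourcetype="foo bar".
SOURCETYPE_RE = re.compile(r'sourcetype\s*=\s*(?:"([^"]+)"|([^\s|)]+))', re.IGNORECASE)


def searches_using(searches, sourcetypes):
    """Return {search name: sorted matching sourcetypes} for searches that reference any of them."""
    wanted = {name.lower() for name in sourcetypes}
    hits = {}
    for name, spl in searches.items():
        # Each match is a (quoted, unquoted) tuple; exactly one side is non-empty.
        found = {(quoted or unquoted).lower() for quoted, unquoted in SOURCETYPE_RE.findall(spl)}
        matched = sorted(found & wanted)
        if matched:
            hits[name] = matched
    return hits


# Invented saved searches, standing in for the real knowledge objects.
saved = {
    "Failed logins": 'index=main sourcetype=syslog "authentication failure" | stats count by host',
    "Firewall drops": 'index=main sourcetype="fw_raw" action=drop | timechart count',
    "License usage": "index=_internal source=*license_usage.log | stats sum(b)",
}

affected = searches_using(saved, ["syslog"])
```

Anything that ends up in `affected` would need its search updated, or a compatibility eventtype added, before the new sourcetype goes live.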

Good luck anyway!



Yes, the data is in use, actually heavily in use. Thanks for this hint; I had not thought about the effect on existing knowledge objects when using new TAs.
Btw: I like this George Starcher. His blog has already helped me a lot 🙂