
What are best practices for handling data in a Splunk staging environment that needs to go to production?

Builder

All,

We use a Splunk staging environment to test system upgrades and fine-tune props and transforms before deploying new indexing configuration into production. That has brought the temptation of letting non-production Universal Forwarders, syslog sources, etc. continue sending data to staging (and not to production) after testing is complete. On one hand, it's great to have data coming in continuously for testing; on the other hand, end users understandably want to search their non-prod data, which means we're managing those indexes, permissions, and apps, and need to be careful about interrupting service in staging. Nothing is technically wrong, but something about it doesn't feel right, and the same question about when to use staging is coming up with Deployment Server, too.

I'm curious what others do with non-production Splunk environments and/or non-production data that needs to go into Splunk. Any particular approaches that seem to work well (or not) when building a centralized Splunk service shared by many app teams? Thanks for your thoughts!

1 Solution

Motivator

It depends on your architecture, but one option is to use distributed search (distsearch) groups, which allow users to pick and choose their data source from a single set of search heads.

By doing this you can use the same field extractions, eventtypes, macros, lookups, etc., with exactly the same authentication, permissions, and so on, while still being able to examine two sets of data.

The workflow might look like this:

1. New data is onboarded via your staging forwarders. Alternatively, use an F5 VIP with a staging pool behind it.
2. Use this staging data to work out your line breaking, timestamping, and other parsing issues.
3. Once your parsing settings are proven correct, copy them to your production indexers.
4. When ready, switch your F5 VIP's forwarding pool to the production one.
5. Search correctly parsed production data from day one.
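Step 2 usually means iterating on parsing settings in props.conf on the staging indexers until events break and timestamp correctly. A minimal sketch, where the sourcetype name, regexes, and time format are hypothetical placeholders for your own data:

```ini
# props.conf on the staging indexers (or heavy forwarders)
# [my_app:log] and the patterns below are illustrative only
[my_app:log]
SHOULD_LINEMERGE = false
# break before each line that starts with an ISO date
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N
# "2024-01-31 12:34:56.789" is 23 characters
MAX_TIMESTAMP_LOOKAHEAD = 23
```

Once these settings look right against the staging data, they are exactly what gets copied to the production indexers in step 3.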

The question you probably have at this point is: "How do you look at these two sets of data, on different groups of search peers, at the same time?"

Easy.

Distributed search groups.

With this option you can set a default set of peers to search (I'd suggest your production ones) and reach your staging peers via a qualifying statement in your query. This way, people fixing their staging data know what to do, while your other users don't need to do anything different.

How do we set it up?

Say, for example, we have a total of four peers: two staging and two production.

Your distsearch.conf (which goes on your search heads) would look something like this:

[distributedSearch:production]
servers = prod1:8089,prod2:8089
default = true

[distributedSearch:staging]
servers = staging1:8089,staging2:8089

What happens now is that all searches by default will use your production search peers. Normal users will see their normal production data.

If you want to check the data that is on the staging servers, just add splunk_server_group=staging to your base search. Simple!
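For example, using the documented form of the qualifier, splunk_server_group, with the group names defined above (the index and sourcetype here are made up):

```
index=web sourcetype=access_combined splunk_server_group=staging
| stats count by host
```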

You could then update dashboards with a checkbox that inserts that qualifier into the base search, giving users a friendly way to toggle between production and staging data on the same dashboard.
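One way to sketch that in Simple XML. The token name, group names, and search here are illustrative; a dropdown is shown instead of a checkbox because an unchecked checkbox leaves its token unset:

```xml
<form>
  <fieldset submitButton="false">
    <input type="dropdown" token="env">
      <label>Data set</label>
      <!-- the choice values feed straight into the base search -->
      <choice value="splunk_server_group=production">Production</choice>
      <choice value="splunk_server_group=staging">Staging</choice>
      <default>splunk_server_group=production</default>
    </input>
  </fieldset>
  <row>
    <panel>
      <table>
        <search>
          <query>index=web $env$ | stats count by host</query>
        </search>
      </table>
    </panel>
  </row>
</form>
```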

More info: http://docs.splunk.com/Documentation/Splunk/latest/Admin/Distsearchconf


Path Finder

If you don't mind increasing your license volume, you can use a dual-outputs configuration where data from staging hosts goes to both the production and staging environments. Users access the data in the production environment, and you keep a small non-prod environment to play with the data in.

[tcpout]
defaultGroup = staging,production

[tcpout:staging]
server = stagingserver:stagingport

[tcpout:production]
server = prodserver:prodport

You can keep the license usage low by putting that configuration in an app, deploying it only to servers that have the sourcetype you care about, and turning it off once you're done.
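Another way to scope the duplication, assuming a variant where outputs.conf's defaultGroup is set to production only, is per-input routing with _TCP_ROUTING in inputs.conf, so only the inputs under test are cloned to staging. The monitor path and sourcetype below are hypothetical:

```ini
# inputs.conf in the deployed app
# clone only this input to both environments
[monitor:///var/log/myapp/app.log]
sourcetype = my_app:log
_TCP_ROUTING = staging,production

# inputs without _TCP_ROUTING follow defaultGroup (production only)
```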


Builder

Thanks! I've had a few test systems configured like this for a while and I like it. The only technical issue I can find is that if production is down, due to an outage or perhaps a cluster upgrade that requires a full shutdown, the UFs won't block as long as staging is still up, so production will never receive some events. That should be rare, though, and as long as production hosts are configured to point only to production Splunk, which I think is how we would approach this, there's no risk of losing "important" data.

I like the idea of things like Windows event logs, disk space logs, etc. always going to production, including from non-production hosts, so our ITOC can monitor that stuff, and then having a small number of non-production test bed machines also send that same data to staging so we always have data to work with. Thanks again!


Ultra Champion

A standard best practice for any software is to treat staging exactly like production from the release point of view. Development and initial testing happen in a lower environment, such as dev. When work in dev is done, best practice dictates that deployment to staging and to production be identical, which verifies that both the code changes and the deployment mechanism are safe and sound.


Builder

Thanks! It sounds like the sweet spot may be three Splunk environments (in addition to local dev environments for each of the Splunk admins): a test environment to prototype new props/transforms and validate that things work properly in a distributed environment; a staging environment to test the deployment strategy against the cluster architecture and maybe test performance; and then production. I'm liking the idea of conditionally sending data to each of these three environments depending on the nature of the application and the integration complexity. Windows event logs? Probably just send everything to production. Custom app logs where the app also uses the REST API to do reporting? Maybe have that data go to all three environments so everyone can be comfortable with Splunk upgrades and architecture changes. Thanks again!


Ultra Champion

It's a bit of a foreign concept for me to send the same data to different environments. From my point of view, each system that Splunk supports should normally have its own stack of environments...


Builder

I agree with that (to a point, since we can't control how many environments our various app teams create), but I'm struggling with how to satisfy groups like Security and the ITOC who are charged with monitoring all systems regardless of their non-prod or prod status. We could use the distributed search group feature that was mentioned and have prod Splunk search data from the non-prod Splunks; we could send certain commodity data to prod regardless of where it came from; or we could make those groups set up their dashboards and alerts on multiple Splunk environments to achieve full separation. It's extremely tempting to allow crosstalk between environments to avoid that last, inconvenient option, at the risk of unintended consequences. Thanks for the great discussion!


Ultra Champion

That's a good one: "It's extremely tempting to allow crosstalk between environments" ;-)


Motivator

"I'm struggling with how to satisfy groups like Security and the ITOC who are charged with monitoring all systems regardless of their non-prod or prod status."

Yes, that's exactly the situation we have. All of our systems log to our production Splunk environment.

In our non-production Splunk environments we need the ability to restart, clear indexes, test apps and upgrades and do other things that could impact searching.

We can temporarily redirect a forwarder to a non-prod environment, or have the forwarder send to both environments. Part of the problem, though, is knowing what data to replicate when you have thousands of servers; we can't build a non-production environment anywhere near as big as our production one. Replicating data from a forwarder to non-prod provides data going forward, but it doesn't solve the need for historical data as well. I've had both types of requests from Splunk users: a live stream of their data, or historical data (say, three months' worth) in non-prod.



Influencer

How do distributedSearch groups work with SHC? Do they work with SHC?


Builder

Thanks! That's a clever way to simplify the experience for end users and app developers.
