Splunking continuous delivery (blue/green deployment) applications

Builder

Continuous delivery is a method of developing and releasing software that minimises downtime. It allows software to be released to production swiftly, outside the normal downtime window, and to be tested in exactly the configuration it will have in prod/live. You run two instances of the same application, with access to the live (light) instance directed through a router (in our case, an Apache redirect). New code is released to the non-live (dark) instance and testing is conducted there. Once testing has passed, the router is changed to direct traffic to the dark instance, promoting it to live. This can happen many times a day, at any time.

A more detailed explanation of continuous delivery can be found here: http://martinfowler.com/bliki/BlueGreenDeployment.html

We have Splunk dashboards monitoring our application logs, providing views into things like the numbers and types of errors, transaction durations, etc. We are planning to move our apps from normal single instances to the design outlined above. However, this causes a problem for us in Splunk, because we need to maintain charts for only the live (light) instance of the application, automatically (the dark instance will be filled with errors as it's mostly untested, and testing traffic will pollute the useful live info). There are three problems:

  • The mechanism of determining which instance is live needs to be automatic because the instances can be switched to live at any time by our application teams (who do not have any Splunk access).
  • Because the instances need to be identical, there is no way to tell from a log event which instance is acting in the live role.
  • I can't restart Splunk many times a day as it will affect other users, so any solution that requires restarting the Splunk server is not practical (restarting the forwarder on the app server is OK).

This seems like a deployment mechanism that many web-based companies with high-availability web applications might use. Does anyone have any ideas on how we can achieve continuous monitoring of only our live instance from within our dashboards?

0 Karma
1 Solution

Builder

There is a file on the filesystem within the blue and green app instances which says whether that instance is live or dark.

I needed a way to set an index-time field containing the state of the instance, which I could then use on the search head to filter for only the live logs. Because Universal Forwarders offer no way to "tag" a source at its origin, I had to use the only index-time field that I am able to change from the forwarder side: sourcetype.

I wrote a bash script to read the live/dark state files and determine which instance was in which state. The script then runs a "splunk edit monitor" command to set the sourcetype of the log to either "mysourcetype-live" or "mysourcetype-dark", and logs its activity. I packaged the script into a Splunk app, which is pushed out via the deployment server. The inputs.conf of this app, as well as monitoring the app logs we are interested in, sets my script to run every minute. Since the script writes to stdout, its output is also indexed by Splunk, which provides a nice way to monitor the changing states of the instances.
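The original script was not posted, so the following is only a minimal sketch of the approach described above, not the actual implementation. The state file path, its LIVE/DARK contents, the log path, and the forwarder install path are all hypothetical. It also assumes the Splunk CLI on the forwarder is already authenticated (otherwise credentials would have to be supplied to the "splunk edit monitor" call).

```shell
#!/usr/bin/env bash
# Sketch only: every path and name below is an assumption for illustration.
STATE_FILE="${STATE_FILE:-/opt/myapp/instance.state}"    # hypothetical; assumed to contain LIVE or DARK
LOG_PATH="${LOG_PATH:-/opt/myapp/logs/app.log}"          # hypothetical monitored log
BASE_SOURCETYPE="${BASE_SOURCETYPE:-mysourcetype}"
SPLUNK_BIN="${SPLUNK_BIN:-/opt/splunkforwarder/bin/splunk}"

# Print the sourcetype matching the role stored in the given state file.
# Fails (non-zero) if the file is unreadable or holds an unknown role.
sourcetype_for_state() {
  [ -r "$1" ] || return 1
  state="$(tr -d '[:space:]' < "$1" | tr '[:upper:]' '[:lower:]')"
  case "$state" in
    live|dark) printf '%s-%s\n' "$BASE_SOURCETYPE" "$state" ;;
    *) return 1 ;;
  esac
}

# Re-point the monitor input at the current role and log the outcome to
# stdout, so that a scripted input can index the state changes.
main() {
  now="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  if ! new_st="$(sourcetype_for_state "$STATE_FILE")"; then
    echo "$now action=skip reason=bad_state_file file=$STATE_FILE"
    return 0
  fi
  if "$SPLUNK_BIN" edit monitor "$LOG_PATH" -sourcetype "$new_st" >/dev/null 2>&1; then
    echo "$now action=set sourcetype=$new_st log=$LOG_PATH"
  else
    echo "$now action=error sourcetype=$new_st log=$LOG_PATH"
  fi
}

main
```

To run something like this every minute, the app's inputs.conf would carry a scripted input stanza pointing at the script with "interval = 60"; the stanza name and the script's location inside the app depend on how the app is laid out.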

Now I can construct dashboards on the search head that filter for sourcetype=mysourcetype-live and have charts that cover only live instances, as required.

0 Karma

Communicator

You don't appear to have provided any method of detecting that a new environment has gone live. It sounds like you need logs from the load balancer, or from the build scripts, to detect this event?

If you had those logs, you could use tags to separate your blue/green environments, and then populate the dashboard based on whichever environment is the live one.

0 Karma

Builder

There is a file within the application instance that specifies the role of the instance at the current time. The problem with tags is that they are applied at search time. When looking at logs from the past, before the role was changed, a tag=live that maps to a specific instance will not bring back the correct instance. The only index-time field I can find that can be set by a universal forwarder is sourcetype. If anyone knows a way to set some other index-time searchable field on the forwarder, please let me know.

0 Karma

Splunk Employee

Conceptually, you're promoting by switching the inbound stream at the router or reverse-proxy level. Above that, it sounds like Staging and Prod are indistinguishable.

Have you explored ways to apply this same low-level-routing pattern to outbound logging traffic?

0 Karma

Builder

That idea makes sense, but in our case I think it won't work. I've now found out how the routing works: a request parameter (e.g. http://blah/?instanceRole=DARK) is added to the webserver requests of anyone wanting to look at the dark/non-live instance (requests go to the light/live instance by default if the request parameter is omitted). I don't think it's possible to use this mechanism for forwarder traffic.

0 Karma