I was talking with someone who may have assets with the same IP across multiple data centers. In other words, the same IP ranges are allocated in different data centers, so any given IP may appear associated with assets from two different data centers.
Anyone dealt with this? Any recommended best practice (other than not overlapping IP ranges) for distinguishing said IPs within Splunk?
One approach I considered was a host override in transforms to fold the datacenter into the host value alongside the IP. I'm curious whether other folks have found other creative solutions.
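For what it's worth, a minimal sketch of that host override might look like the following. The sourcetype name, the transform name, and the dc-east suffix are placeholders I made up. It assumes each datacenter has its own parsing tier (indexers or heavy forwarders), since index-time transforms don't run on universal forwarders.

```ini
# props.conf (on the parsing tier in the east datacenter)
# "my_sourcetype" is a hypothetical sourcetype name
[my_sourcetype]
TRANSFORMS-dc_host = append_dc_to_host

# transforms.conf
# Reads the current host value and rewrites it with a datacenter suffix.
# "dc-east" is a hypothetical label; the west datacenter would use its own.
[append_dc_to_host]
SOURCE_KEY = MetaData:Host
REGEX = ^host::(.+)$
FORMAT = host::$1.dc-east
DEST_KEY = MetaData:Host
```

With something like this in place, 10.1.2.3 seen in the east datacenter would be indexed as 10.1.2.3.dc-east, so the two copies of the same IP no longer collide in the host field.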
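You can also hard-code a unique name on each box. In server.conf, serverName sets the name the Splunk instance reports (which is what shows up in splunk_server). If you want a custom host value on the logs that server collects, set host in inputs.conf as well.
[general]
serverName = your-custom-name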
Yea, but I wonder if that would confuse end users because the hostname doesn't match the actual hostname. You could append the datacenter name and then rip it off with a calculated field. It also doesn't address data already in the system, which matters if the overlapping IP or hostname issue was only just discovered.
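A rough sketch of the "rip it off with a calculated field" idea, assuming the host was rewritten at index time to end in a made-up .dc-east or .dc-west suffix. The sourcetype and the field name orig_host are hypothetical.

```ini
# props.conf (on the search head)
# "my_sourcetype" and "orig_host" are hypothetical names
[my_sourcetype]
# Strip the datacenter suffix so users still see the real hostname or IP
EVAL-orig_host = replace(host, "[.]dc-(east|west)$", "")
```

Users could then search on orig_host when they want the name the box actually has, and on host when they need the datacenter-qualified value.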
Probably better to create new indexed fields rather than replace the host field with a value that is different from the host/ip.
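One way this could be sketched, assuming every forwarder in a datacenter can carry its own inputs.conf: stamp a datacenter field on everything at input time with _meta, and tell the search head that the field is indexed. The field name datacenter and the value dc-east are my own placeholders.

```ini
# inputs.conf (on each forwarder in the east datacenter)
# Adds an indexed field to every event from this forwarder;
# "datacenter" and "dc-east" are hypothetical names
[default]
_meta = datacenter::dc-east

# fields.conf (on the search head)
# Marks the field as indexed so datacenter=dc-east searches hit the index
[datacenter]
INDEXED = true
```

The host value stays untouched, so nobody investigating an alert sees a hostname that doesn't match the box.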
Yea, because if someone wants to investigate and the hostname doesn't match, that could cause a ton of confusion.
Depends on what you want to do with the data. I've recommended a combination of the above in the past where each dataset needed to be treated independently. It helped if the data was easy to identify as soon as it started coming in, and easy to work with once ingested.
Host overrides or tagging at ingest time will identify it early. Routing each datacenter to a separate index gives you further advantages, again tailored to this use case. It's easy to report on each simply by specifying the index name, which you're going to do anyway. You don't have to do extra work to contain your correlations to a particular datacenter, which I'd guess is more common than correlating across DCs. You also get index-level RBAC: if you have requirements around who can access the data from a particular datacenter, you can encode that policy as index-level permissions.
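As a sketch of the index-per-datacenter idea with index-level permissions, something along these lines could work. The index name, monitored path, and role name are all hypothetical, and the role shown only sets index access (capabilities are left out).

```ini
# indexes.conf (on the indexers)
# "dc_east" is a hypothetical index name
[dc_east]
homePath   = $SPLUNK_DB/dc_east/db
coldPath   = $SPLUNK_DB/dc_east/colddb
thawedPath = $SPLUNK_DB/dc_east/thaweddb

# inputs.conf (on forwarders in the east datacenter)
# The monitored path is only an example
[monitor:///var/log/messages]
index = dc_east

# authorize.conf (on the search head)
# Hypothetical role limited to the east datacenter's index
[role_dc_east_analyst]
srchIndexesAllowed = dc_east
srchIndexesDefault = dc_east
```

One thing to watch: roles inherited through importRoles add to the allowed indexes, so a role that imports a broader role would see more than dc_east.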
Whichever method you choose, I like the idea of the datacenter being a first-class citizen if it's important here. Instead of relying on the fact that the domain portion of one field (host) disambiguates another field (IP address), you could explicitly tag all data as belonging to a particular DC. That's clearer in my mind, and if for some reason domains do start overlapping one day, it doesn't matter.
Good point raising the detail of RBAC on data center. If that's a requirement then that could be a huge thing to consider. Clutch!
So in a scenario where you have duplicate hostnames and IP addresses, I would look at using FQDNs in the host field. You could key off the domain in the host field to search by data center. Then your searches for, say, PAN logs would be restricted to a particular data center.
Yea def. Although I thought FQDN was only for a friendly DNS name, not an IP? Correct me here?
I'm thinking of what I have seen at some DR sites where you have overlapping IP address space and overlapping hostnames. So for example something like:
Then you could create a second field out of the host field, say datacenter, and search for datacenter=dc-east or datacenter=dc-west. And even if the IP addressing doesn't overlap, this would still deal with the duplicate hostnames.
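A minimal sketch of deriving that second field at search time, assuming FQDN host values shaped like web01.dc-east.example.com (that hostname pattern and the sourcetype are made up for illustration):

```ini
# props.conf (on the search head)
# "my_sourcetype" is a hypothetical sourcetype name
[my_sourcetype]
# Take the second dot-separated label of the host as the datacenter,
# e.g. web01.dc-east.example.com gives dc-east
EVAL-datacenter = mvindex(split(host, "."), 1)
```

Because this is a calculated field, it also applies to data that is already indexed, as long as the host values already carry the FQDN.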
You'd probably have one or more indexers in each data center and could search by splunk_server, or make a dc-east and a dc-west index, too. Even if you did all of that, though, I still think capturing the FQDN in the host field is worthwhile.
I just realized: I think splunk_server says which indexer returned the results, not which indexer indexed them. That means that with multisite indexer replication, you wouldn't know where the data originated, right?
Yes, you're right. I don't much like the idea of using that field for this use case anyway.
Depending on how the data is ingested, a few ideas come to mind. I'm thinking in terms of search time.
Just spitballing here.
I like the search-time idea because I hate making things permanent.
At first I didn't get what you meant, but I think I'm catching on. Let me know if I'm on the right page here: you're saying that the data can have different sourcetypes, be sent to different indexers, or use different tags based on where it came in.
I think the splunk indexer is a great catch: if there are two data centers with different indexers getting only their own data center's data then the splunk_server can be used as the differentiator. Then, generating user friendly tags off that becomes a trivial effort!
Yes, that's exactly what I was saying. The splunk_server field, if it exists, is the easiest way to separate your searches. I do like what the others are saying about FQDN, though. I wouldn't use index or sourcetype unless I absolutely had to. It's best to keep all data of a specific sourcetype under the same nomenclature.
Tags are simple to apply, but hard to set permissions on. They are flexible and can be changed later. So if you can tag hosts from one DC, and assuming all those hosts report to the same splunk_server, you'll have a way to differentiate. Tags are optional if the hosts in one DC ALWAYS ALWAYS report to the same indexer. But tags are kind of fun, so I won't discourage you from using them.
We customized sourcetype names to track applications and license utilization, and it definitely adds overhead. I wouldn't do it unless I had to.
Tags are flexible, but how do you keep them up to date when you have hundreds or thousands of servers? At that point it seems like a lookup is a better option.
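If a lookup is the route taken, an automatic lookup with wildcard matching might keep the list short. This sketch assumes FQDN host values and a made-up CSV called dc_by_host.csv with two columns, host and datacenter, containing rows such as *.dc-east.example.com,dc-east. The stanza and sourcetype names are hypothetical.

```ini
# transforms.conf (on the search head)
# "dc_by_host" and the CSV file name are hypothetical
[dc_by_host]
filename = dc_by_host.csv
# Let the host column in the CSV contain wildcards
match_type = WILDCARD(host)

# props.conf (on the search head)
# "my_sourcetype" is a hypothetical sourcetype name
[my_sourcetype]
LOOKUP-datacenter = dc_by_host host OUTPUT datacenter
```

A wildcard row per datacenter means the file only changes when a datacenter is added, not every time a server is.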
I'm thinking you could use tags to make it easy for end users but the tags are defined through eventtypes that just pivot off the datacenter's splunk_server value. That way you're not managing a list (or even a wildcarded list) of hostnames.
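Roughly, that eventtype-plus-tag setup could be sketched like this, assuming the east indexers share a naming pattern such as idx-east* (the indexer names, eventtype name, and tag name are all placeholders):

```ini
# eventtypes.conf (on the search head)
# Matches anything returned by the east datacenter's indexers;
# "idx-east*" is a hypothetical indexer naming pattern
[dc_east]
search = splunk_server=idx-east*

# tags.conf (on the search head)
# Gives end users a friendly tag to search on: tag=dc_east
[eventtype=dc_east]
dc_east = enabled
```

This inherits the caveat raised earlier in the thread: it only holds if each datacenter's data is returned exclusively by that datacenter's indexers.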
Are the IP addresses appearing in the host field? Or do the IPs AND the hostnames overlap?
Let's try both scenarios. I'm betting that in some cases the IP or hostname is the host field, while in others the IP or hostname might just be in _raw (as with PAN logs). I'm sure for the latter the host value itself could provide some answers, but regardless, I'm still interested in what folks have found to be the strongest solution.