Segmenting/ignoring selective data from a server we don't own

Builder

I have a question that I'm looking for some guidance on.

Our division has a team that's interested in data that sits on another division's (let's call them "ABC") server. That data (info on dialup users) is intermingled with info on users from many other divisions. As such, ABC is hesitant to give us access to it, as we'd see everyone else's dialup data.

Additionally, ABC does not run Splunk at all (although this is turning out to be a great introduction for them) and isn't likely to in the near term. So the discussion here is to let us put a light-weight forwarder (LWF) on their Windows server to slurp up this one particular log file that has everyone's dialup data in it.

ABC has asked whether, if they let us do this, we can index only the data for our division. I haven't seen the log file in question, so I don't know how one would distinguish our data from theirs, but I'm assuming it's possible. I'm hoping it's not a question of "here's 4,000 users, only index events that are from one of them". Obviously this would be much easier if our data were in a separate log file, but I don't think that's an option.

I realize that the LWF isn't built to differentiate events, as that particular engine isn't turned on in order to keep it light. So presumably the indexer has to do the heavy lifting here. Even if we can come up with some handy way to differentiate the events, our data may be only a minority of the full stream. We already filter out some very static events from being indexed, and I'm assuming that we can do this on a larger scale, but I worry that this will be a performance killer if we end up having to do it based on a large list of exclusions/inclusions.

In a perfect world, ABC would have their own Splunk indexers and pull the data in themselves, then potentially give us access to run a distributed search for this data against their servers. A question I would have there is: assuming there's some field we can use to differentiate our users from theirs, can they restrict a distributed search based on a field? That is, could we search a specific index on their server, but only for events matching a certain field value?

Thanks!

1 Solution

Splunk Employee

If I have understood your situation correctly, you can do one of two things:

Hope this is at least a good start for ya!
.gz


Splunk Employee

I would say that you have two choices:

  • Run a standard (not light) forwarder, which can parse out and presumably select the events via a regex, and only route that data. Hopefully there's some field with a value that's unique to your data to make that easy.

  • Forward all the data, and have your indexer discard everything but your data.

Both require the same amount of processing and almost the same configuration. The difference, of course, is that in the first case the processing has to take place on the forwarder on their machine. In the second case, the processing takes place on your indexer and all the data moves over to your machine, so they have to trust you to discard what isn't yours. (Then again, in the first case they have to trust your configuration of the forwarder to discard the same set of data. But mathematical equivalence isn't the relevant factor.) You could, I suppose, introduce an intermediate forwarder at their location that collects the data from the LWF and filters it there (taking the load off their machine), then forwards it to your indexer. All these solutions are the same except for where the filtering is done (and hence how far the unwanted data travels before it gets discarded).
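As a rough sketch of what that shared configuration might look like: the usual "keep only matching events, discard the rest" pattern sends everything to nullQueue first and then routes the events you want back to indexQueue. The sourcetype name `abc_dialup` and the pattern `division=OURDIV` are placeholders I've made up, since nobody here has seen the log format yet. The same stanzas would go on the standard forwarder (first option) or on the indexer (second option).

```ini
# props.conf
# "abc_dialup" is a hypothetical sourcetype name for the dialup log.
# Transforms run left to right, so the drop-everything rule must come first.
[abc_dialup]
TRANSFORMS-keep_ours = dialup_drop_all, dialup_keep_ours

# transforms.conf
# Step 1: send every event to the nullQueue (discard it).
[dialup_drop_all]
REGEX = .
DEST_KEY = queue
FORMAT = nullQueue

# Step 2: events matching our division are routed back to the indexQueue.
# "division=OURDIV" is an assumed marker; replace with whatever actually
# identifies your events once you have seen the data.
[dialup_keep_ours]
REGEX = division=OURDIV
DEST_KEY = queue
FORMAT = indexQueue
```

Note that this works one event at a time, so it only helps if each event you want carries the identifying string itself.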

They really can't restrict distributed search based on a field.


Builder

Yes, it would be better not to index it, but my concern is that nullQueue filtering works on individual events. I think the events they wouldn't want the local app team to see are part of a transaction. In other words, the "bookend" parts of the transaction might be easier to identify and route to the nullQueue, but the middle parts might not be. If we can filter locally based on search terms, that might work better. I have yet to see the data though.


Splunk Employee

Well, you can restrict a search based on any search term, including indexes or any other base search term. However, they cannot force the restriction on you, and you will still have indexed the other data. Presumably everyone would prefer that you not even index data you shouldn't have, rather than simply refrain from searching it?


Builder

We want to keep the footprint light on the agent side. I'm not sure why I thought we could restrict a search based on something other than indexes. So I guess the best approach here would be to have our indexer either drop data that isn't ours or write it to a separate index. Unfortunately, I think that identification would have to be done at the transaction level, not the event level.
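For the "write it to a separate index" variant, a minimal sketch might look like the following. It assumes the input's default index is a holding index for everyone's data (called `abc_other` here) and that events identifiable as ours get rerouted to our own index (`our_dialup`). Both index names, the sourcetype `abc_dialup`, and the pattern `division=OURDIV` are made up for illustration, and both indexes would need to exist on the indexer.

```ini
# props.conf
# "abc_dialup" is a hypothetical sourcetype name for the dialup log.
[abc_dialup]
TRANSFORMS-route_ours = dialup_route_ours

# transforms.conf
# Events matching our division are written to our own index; everything
# else stays in the input's default index (assumed here to be abc_other).
# "division=OURDIV" and "our_dialup" are assumed names.
[dialup_route_ours]
REGEX = division=OURDIV
DEST_KEY = _MetaData:Index
FORMAT = our_dialup
```

Like nullQueue filtering, this is decided per event, so it does not by itself solve the case where only the first event of a session identifies whose data it is.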


Builder

Thanks. I have yet to actually see the data, but I'm being told that some things may not be identifiable from a single event. That is, they may be more transaction-oriented. For example, a session is opened and you can identify that the session-open event is our data, but subsequent events about that session don't contain a string with our data in it. You'd have to tie them back to the original session ID via a transaction.

Anyway, if and when I get access to the data, I'll consider these ideas. Thanks!
