I work for a certain agency which maintains a list of names of individuals who are on a "no-fly" list. Every day, sometimes several times per day, we get updates to this list, which can have tens of thousands of names on it.
At the same time, we receive lists of passenger ticket bookings from around the globe. This list, as you can imagine, is absolutely massive: at the moment it's around 150 gigabytes per day, as it contains passenger names, dates of birth, locations of departure, destinations, payment details, travel agency (if used) and much, much more.
The question is, what's the most efficient way, in Splunk, to cross-reference these lists and find only those people from our no-fly list who appear in the much larger all-passenger list, on a given day or even within a given hour (happy with a sliding search and summary-index, if needed)?
Making a query like so is hardly efficient:
index=travellers [ search index=nofly | fields name]
As that would result in Splunk "parsing" the search and realising it's essentially being asked to do:
index=travellers (jane OR bill OR eve OR bob OR ... )
Where the OR clause could be tens of thousands of names long.
I've pondered creating a summary-index of just the traveller names from the travellers index, which would be a smaller amount of data, but would still take considerable time to extract, and would also need at least the Date of Birth and likely other fields as it turns out a lot of humans share the same name, and that can get rather embarrassing when adding people to the no-fly list, let me tell you.
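If I go down that route, I imagine a scheduled hourly search along these lines populating the summary index (the index and field names here are my guesses, not necessarily our real ones):

```
index=travellers earliest=-1h@h latest=@h
| stats count by Name, DOB
| collect index=summary_travellers source="traveller_names_hourly"
```

That would at least leave a much slimmer name/DOB record to run the comparison against.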
The typical approaches that I've used for other queries work great when the data set is small, but at the scale we're looking at here, it becomes a painfully slow process. Obviously, we could throw more hardware at the problem, as our agency budget is unrestricted and no one is allowed to question it, but I'm hoping that there's a much better way to do this and ensure that we have optimal tracking of people on the no-fly list with all potential travellers.
NB: (I jest; this isn't really what I'm trying to compare, it's just a great example of the type and scale of data)
A use case like this would be a good application for the KV store lookups introduced in Splunk 6.2. The NoSQL-backed collection holding the 'no fly list' could be applied to the travellers index in a real-time or summary-based search.
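For example, the collection and lookup definition could look something like this (names are illustrative - adjust to your environment):

collections.conf:
```ini
[nofly_list]
field.Name   = string
field.DOB    = string
field.status = string
```

transforms.conf:
```ini
[nofly_kvlookup]
external_type = kvstore
collection    = nofly_list
fields_list   = _key, Name, DOB, status
```

The collection can then be populated from a search with `| outputlookup nofly_kvlookup` and applied with `| lookup nofly_kvlookup Name DOB OUTPUT status`, or updated record-by-record through the REST API as the daily list changes arrive.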
That is definitely a concern at scale; I would only look at that approach for real-time search or summary generation.
The KV store provides some additional flexibility for populating and updating the incoming data compared to very large file-based lookups.
First, I will assume that the nofly index has been de-duped - there are no duplicate name/birthdate tuples in it. Second, on any given day, many passengers will fly multiple segments, so that list needs to be de-duped as well.
The nofly index is not time-series data, so I would prefer to put it in a lookup table. The fields could be as simple as
Name,DOB,status
John Doe,3/19/1968,nofly
etc.
Then the search could be
index=travelers | dedup Name DOB | lookup nofly_lookup Name DOB output status | where status=="nofly"
I assumed that the names of the fields in the travelers index are also "Name" and "DOB" just because it made it easy for me. The first use of the lookup will be rather slow, because Splunk will build an index file if the lookup is over about 10 MB. Lookups should work fine as long as you are talking about less than a million rows of data - after that, things can deteriorate.
Of course, one of the interesting things that your question doesn't cover is - names can be spelled, hyphenated and capitalized in a number of different ways. Hopefully you aren't just doing a simple string match for the names - but of course this is an illustration, not the real thing...
I thought about using a lookup table, and may go back to that. The no-fly list does have date/time sensitivity, as we track the specific date someone was added to the list. When cross-referencing that with passenger bookings, we sometimes analyse the gap between when someone was added to the list, when they last made a booking, and when they make their next.
You're right about names and the variations. That's something we encounter daily and since in many cases the input is user-driven, as they fill out a form at their favourite online flight booking company, it's sometimes error-filled. But we have to try anyway, and have ways to try to normalise and enhance that.
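For what it's worth, the normalisation is mostly string hygiene applied before the match, roughly along these lines (field name assumed; it obviously doesn't catch every variation):

```
| eval Name_norm=upper(trim(replace(Name, "[-'.]", " ")))
| rex field=Name_norm mode=sed "s/\s+/ /g"
```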
In order for the lookup table approach to work, however, we'd need a unique field added to the list to match against, like your status=="nofly" - that might work, but unfortunately the data as we get it is more in the format of:
Entry_Date,Name,DOB,Reason,Agency
2014-11-21,"Jane Doe",1985-02-11,Knife,TSA
2013-10-03,"John Smith",1972-03-18,Drugs,FBI
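Given that format, I could presumably skip the status field entirely - every row in the file is on the list by definition - refresh the lookup on a schedule, and test for a non-null output field instead (lookup name illustrative):

```
index=nofly
| table Entry_Date, Name, DOB, Reason, Agency
| outputlookup nofly_lookup.csv
```

and then match with `| lookup nofly_lookup Name DOB OUTPUT Agency | where isnotnull(Agency)`.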
There may or may not be other fields, but, that's the format we've got.
By way of example of how big the data is, and this is just for a 12 hour window:
Number of events: 77,203,622
Number of unique traveller names: 116,990
The no-fly list is just over 41,000 entries long at the moment.
To come up with the count of events and names took 1,100.173 seconds, again, just for that 12 hour window of traveller data.
I probably should have mentioned that our agency doesn't just get plane flight traveller data, it's also car ferries, trains and other forms of transportation. Thus the size of the index.
It turns out there was another field in the data which I could treat as your sample "status" entry - however, the query is still painfully slow. The traveller index is simply massive, so having to perform the lookup for each entry is taking considerable time. My gut reaction is that I'll have to do this with a summary-index which has an hour-by-hour breakdown of just the names/DOB of the travellers, and then do the lookup, or use the KV store, as suggested by @eddit0r - still reading up on that.
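The shape of what I'm considering, with the hourly summary feeding the lookup (all names here assumed):

```
index=summary_travellers source="traveller_names_hourly"
| dedup Name DOB
| lookup nofly_lookup Name DOB OUTPUT status
| where status=="nofly"
```

Still not certain that buys enough, given the volumes above.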
It may be that there's simply no fast way to do this, given the quantity of data.
Another thing that I was thinking about was - is the lookup distributed or is it only performed on the search head? If lookups could run on the indexing tier, then you could scale them. If not, then the lookup becomes a terrific bottleneck. Joins (and similar commands) have the same problem, as you can't complete the join operation until you have all the data in one place.
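One way to push the work toward the indexing tier, if your CSV lookup replicates in the knowledge bundle, would be an automatic lookup in props.conf (the sourcetype name is a placeholder):

```ini
[travel_bookings]
LOOKUP-nofly = nofly_lookup Name DOB OUTPUT status
```

Then a search like `index=travellers status=nofly` lets the lookup and the filter both run as part of the distributed search, rather than dragging every event back to the search head before the lookup is applied.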
Here is an interesting answer about that