Splunk Search

Very large lookup tables (exceeding the 2 GB bundle size)

jshaynes
Explorer

We're in a situation where we need lookup tables that are larger than the 2 GB bundle size.

For example, creating a lookup table from the Ironport logs with "mid,mailfrom,mailto,subject" for just one day works out to around 20 million entries after deduplication. Stripping out the _.* fields and using the gz format brings that to 285 MB for a single day. We need the lookup table's timeframe to go back at least a month, and ideally to Splunk's typical 90-day optimal performance window.

I can think of two different possibilities:

  1. an external data store that gets queried via a script during search commands
  2. pointing Splunk at the lookup table files and having it index them

Are there any other options? And is there a recommended approach for this situation?

hexx
Splunk Employee

If you need to stick with large lookups, I would suggest evaluating the following options:

  • Exclude the large lookups from bundle replication using a blacklist or a whitelist in distsearch.conf. In all likelihood, if your search is wired the right way, only the search head needs access to the lookups (see the distsearch.conf sketch after this list).
  • If it turns out that your indexers really need access to the lookups, you could use a mounted knowledge bundle to avoid replicating the large lookups.
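
For illustration, a minimal distsearch.conf sketch for the blacklist approach on the search head might look like the following; the stanza key name and the lookup path are placeholders for your own app and file, and the exact pattern syntax should be checked against the distsearch.conf documentation for your version:

[replicationBlacklist]
# key name and path are placeholders; point this at your own large lookup file
noLargeIronportLookup = apps/search/lookups/my_ironport_table.csv

The search head can still read the lookup locally; it is simply never pushed to the indexers as part of the knowledge bundle.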

sdwilkerson
Contributor

jshaynes,

The Splunk engineers say that you can use a static lookup of up to one million rows without problems. Honestly, I have done several million, though at that size the initial lookup takes a while to run, since Splunk builds a lookup index of the data. Even so, such a lookup file would be a few MB in size, not GB; a GB-sized table well exceeds what lookup tables were intended for. If it even worked, it could take a very long time for that initial query to return, and once the data changed, you would take that wait hit again while the lookup index re-populates.

Many mailers make analysis difficult by putting useful data on different log lines. Your pain with this is not unique to Splunk.

One solution is to run an ongoing summary index (populated, say, every 10 minutes) that summarizes the last 10 minutes of logs, does a transaction on MID, and then outputs the specific fields you want summarized onto the same line. If you had that, then this would be just a simple search in Splunk, e.g.:

index=summary search_name=ironport_summary status=bounce | stats count(mailfrom)

Or, if you wanted the top senders by volume, you could end with | top limit=25 mailfrom instead of the stats.

Note: status=bounce should be replaced by whatever field and value you actually search on to identify a bounce. If such a field doesn't exist, make one.

Note: This is not the intended use of a summary index. Summary indexes are designed to report on statistics about specific events (e.g. top 10 sources by count, or top 15 destination hosts by bytes). The method above uses a summary index to preserve a portion of your data in a form that didn't exist in the original dataset.
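
If it helps, a sketch of the populating search for such a summary might look like this, assuming the field names used in this thread (mid, mailfrom, mailto, subject) plus the hypothetical status field from the note above; it would be saved as a scheduled search (named, say, ironport_summary) with summary indexing enabled, running every 10 minutes over the previous 10 minutes:

eventtype=cisco_esa | transaction mid | table _time, mid, mailfrom, mailto, subject, status

Each run then writes one flattened event per MID transaction into the summary index with search_name=ironport_summary, which is what the report searches above key off.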

jshaynes
Explorer

Thanks! That did it. Conveniently, it turns out you can store the whole transaction in the summary index without even having to squish all of the desired fields onto one line.

eventtype=cisco_esa [search | fields mid] | transaction fields=mid mvlist=t | collect index=summary

This creates the index, and running a search on "index=summary" actually returns the whole transaction for all of the search matches.

Unfortunately, it seems the field extractions are lost in the process. I need to see about getting those re-run on the index somehow, but at least all of the data is there.

sdwilkerson
Contributor

jshaynes,
Ahh, sorry, a thinko on my part. In the first part of my answer, you create a summary that contains a transacted version of your original logs, which gets all of the desired fields onto one line. With that in place, your target search only needs to search those fields to get your answer. I edited my answer accordingly.

jshaynes
Explorer

I'm not sure how I would be able to use the index to accomplish the same thing without going back to requiring a subsearch.

In the search command you have listed, where does the index come into play? Also, I don't see how it would work, because the search for "Bounced message..." won't return any mailfrom fields:

eventtype=cisco_esa "Bounced message..." | stats count(mailfrom)

I would need something conceptually like this (although this is not valid syntax):

eventtype=cisco_esa "Bounced message..." | fields mid | index=my_ironport_summary mid | table mid,mailfrom,mailto,subject

jshaynes
Explorer

Sure! I missed the comment box here... My response with the problem specifics got posted below as one of the answers...

jshaynes
Explorer

Sure, to answer your question, sdwilkerson: there is a class of search queries that I keep running into that conceptually require a subsearch. Since a subsearch has limits on how many results can be string-interpolated and inserted into the main search command, I am using lookup tables instead. Here is one example:

I want to create a list of the top 10 offenders sending mails with the most "Bounced" errors. (This is a good way to fingerprint someone who is using their company email address to spam external people, as the spammer's list is usually of poor quality, with many old or illegitimate email addresses.)

To calculate this, I need to know the mailfrom field associated with each occurrence of the "Bounced" error. The way I initially approached this was to run a subsearch for the "Bounced" message and pipe that into fields mid to return a list of MIDs. The outer search then looked for all events related to that list of MIDs and ran a transaction over the MID field. That let me know the mailto, mailfrom, and subject of the mails that had the "Bounced" errors.
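
For clarity, that subsearch-based approach would have looked roughly like this (a sketch using the "Bounced message" text and field names from this thread):

eventtype=cisco_esa [search eventtype=cisco_esa "Bounced message" | fields mid] | transaction mid | table mid, mailfrom, mailto, subject

The subsearch returns the MIDs of bounced messages, and the outer search pulls back every event for those MIDs so the transaction can reassemble mailfrom, mailto, and subject on one line; that is exactly where the subsearch result limits start to hurt.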

Since I ran into the subsearch limits on this, the solution was to create a lookup table. This removes the need for a subsearch to find the context of each "Bounced" mail, since I can just look up the mailfrom by mid in the table:

eventtype=cisco_esa "Bounced message..." | lookup my_ironport_table mid OUTPUT mailfrom,mailto,subject | stats count(mailfrom)

There are a good number of other queries similar to this that I need to run. They all follow the pattern of searching for something that is part of a conceptual transaction, then needing to find the other events associated with the rest of that transaction in order to complete the query.

sdwilkerson
Contributor

jshaynes,
I have done many engagements involving mail integration (including Ironport), but I can't understand why you want to put the logs into a lookup table. This seems like a misuse of the tool. Can you better explain your use case? Maybe there is an alternative solution.
Specifically, what problem are you trying to solve by doing this?

bchen
Splunk Employee

There will be an app that provides the option of storing lookup tables in MySQL rather than flat files, to avoid the need to replicate the data via bundle replication (this would be option 1 from the post above).

Damien_Dallimor
Ultra Champion

If I understand correctly, you are creating a large lookup file based on fields in Ironport log events that you already index... therefore, scheduled summary indexes might be a better option for you.
