I need to parse Apache web logs that can run into the billions of requests per month. I need to correlate and aggregate all this data and be able to display the results for up to a year back. I have two approaches and, being new to Splunk, can't decide which is better.
Problem: for a given web event, I need to create a custom search expression to extract several fields that are not cleanly parsed by the default field extraction. Then I take the extracted ISBN field and run it through a Perl or Python ISBN converter script to get normalized 13-digit values. I then need to call an external Oracle DB through Perl scripts to obtain the title and publisher. Finally, I need to aggregate the occurrences of each ISBN per client (the client is another field in the event). The user would then want to view this data by client, by top clients, across all clients, by ISBN, or by publisher. There could be thousands of unique clients. This data would need to be viewed by marketing separately from the views used by the IT dept.
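For the normalization step described above, a minimal ISBN-10 to ISBN-13 converter in Python might look like this (just a sketch; it assumes the input is already a valid ISBN-10 or ISBN-13, with hyphens or spaces allowed):

```python
def to_isbn13(isbn):
    """Normalize an ISBN-10 or ISBN-13 string to a plain 13-digit ISBN-13."""
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if len(digits) == 13:
        return digits  # already ISBN-13
    if len(digits) != 10:
        raise ValueError("not an ISBN-10 or ISBN-13: %r" % isbn)
    # Drop the old ISBN-10 check digit, prefix 978, recompute the check digit.
    core = "978" + digits[:9]
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)
```

This kind of function could live in either a daily batch script (choice 1) or a scripted search command (choice 2).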
Choice 1: I have proposed doing all the field extraction, ISBN normalization, title/publisher lookups, and aggregation in Perl scripts, producing a CSV file that represents one day's worth of event parsing:

date=1274428800,client=user1,isbn=9789004106802,title=A,publisher=X,hits_tday=2
date=1274428800,client=user1,isbn=9789004106895,title=B,publisher=Y,hits_tday=105
date=1274428800,client=user2,isbn=9789004107328,title=C,publisher=Z,hits_tday=1
date=1274428800,client=user2,isbn=9789004115620,title=D,publisher=W,hits_tday=1
This parser would run as a cron job once a day. The CSV data would then be fed into Splunk, and I would use the search and stats commands to build views of this tabular data. This of course means Splunk now has to keep both the original event logs, used by the IT guys, and my new CSV tables, used by marketing. I would build a dashboard for the marketing team with custom searches: by-client, all-clients, by-publisher, etc.
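The daily aggregation half of that cron job might be sketched in Python like this (the regex and field names are hypothetical; a real version would match your actual Apache log format and would also call the ISBN normalizer and Oracle lookup before writing):

```python
import csv
import re
from collections import Counter

# Hypothetical pattern: pulls a client and a raw ISBN out of each log line.
LINE_RE = re.compile(r"client=(?P<client>\w+).*isbn=(?P<isbn>[\dXx-]+)")

def aggregate(lines):
    """Count hits per (client, isbn) pair for one day's worth of events."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            counts[(m.group("client"), m.group("isbn"))] += 1
    return counts

def write_csv(counts, date_epoch, out):
    """Write the aggregated counts in the CSV layout shown above."""
    writer = csv.writer(out)
    writer.writerow(["date", "client", "isbn", "hits_tday"])
    for (client, isbn), hits in sorted(counts.items()):
        writer.writerow([date_epoch, client, isbn, hits])
```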
Choice 2: They want me to try to do this all in a custom search command in real time, without pre-parsing/staging the data as described above. That means rewriting all the Perl in Python (unless Perl can be used?). I have seen conflicting examples: some say only Python scripts can be used, while in other cases Perl is used. Can this be done?
Can Perl be used in a search command? When I retrieve the normalized ISBN, title, and publisher, will this data be added to the original indexed events so it doesn't have to be retrieved every time a user views the data? Should I use summary indexing and aggregate daily to speed up user reports?
We have a licensed version of Splunk so I have access to all its capabilities.
What is the suggested approach?
It seems that a custom search command that parses billions of events, normalizes the ISBN, and retrieves the title/publisher for every event, then produces a roll-up analysis per client for the selected time frame, would take a long time. If the user then clicks by-publisher, would the entire data set be processed again? Shouldn't all this data be saved into a new table somehow, so the next user request doesn't go through it all again?
I would suggest you use the lookup feature that Splunk provides. Basically, you can have a CSV file containing fields such as log_isbn, normalized_isbn, title, <other fields>; then at search time you can enrich your results with data from this lookup table.
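As a sketch, assuming a lookup file named isbn_info.csv has been uploaded to Splunk (the stanza name and field names here are made up for illustration), the lookup could be defined in transforms.conf and then applied at search time:

```
# transforms.conf
[isbn_lookup]
filename = isbn_info.csv

# Search-time usage (SPL):
#   sourcetype=access_combined
#     | lookup isbn_lookup log_isbn OUTPUT normalized_isbn title publisher
#     | stats count by client, publisher
```

You can also make the lookup automatic with a LOOKUP- stanza in props.conf, so the enriched fields appear without an explicit lookup command in every search.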
Documentation on lookups: http://www.splunk.com/base/Documentation/latest/Knowledge/Addfieldsfromexternaldatasources
A similar question by another user http://answers.splunk.com/questions/1884/lookups-using-them-to-replace-the-host-field
OK, this sounds good. Once I enrich the results with the data from the CSV file, is that data now part of the indexed data?