I've been looking at Splunk's external lookup features and they sound ideal for several of my logs. For example, I've got a log with a user ID where I'd like to be able to do counts based on their sales region. I can easily get the region name or ID from the user by calling a simple URL like the illustrative example below:
http://www.foo.com/splunk/lookup/GetSalesRegionNameFromUserID?user_id=1b4545x5
With a result of an empty string or an ID.
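To illustrate what I mean (the region value below is made up), a call for a known user comes back with just the value in the response body, and an unknown user gets an empty body:

http://www.foo.com/splunk/lookup/GetSalesRegionNameFromUserID?user_id=1b4545x5
  -> NorthEast

http://www.foo.com/splunk/lookup/GetSalesRegionNameFromUserID?user_id=some_unknown_id
  -> (empty body)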
Can anyone point me to an example that uses HTTP GET or POST to call a URL to satisfy an external lookup? As far as the various configuration file modifications and packaging requirements go, I figure I'll use the example from the following article by Nimish Doshi as a starting point:
http://blogs.splunk.com/2009/09/14/enriching-data-with-db-lookups-part-2/
That article uses SQL calls but, otherwise, the idea is much the same. It might be relevant to mention that I'm pretty clueless about Python, but I have total control over the Web interface (the URL, inputs, and outputs). I've already written and tested the Web+DB side of this; I just need to figure out how to call the URL from Python and get the inputs/outputs sorted out with Splunk.
Getting a working example would be incredibly helpful to me as, starting from that, I could build out many other similar functions.
Thanks in advance for any help, pointers, or suggestions.
P.S. I looked at GET workflow actions but they seem to be about adding interactivity to search results. What I'm after right now is the ability to do statistics based on external lookup field values as well as improve the meaningfulness of log entries in results.
Here is a quick example of a lookup script. Change the URL it calls and the value returned by the lookup() function according to what your web page's response looks like. The field returned will be called "SalesRegion".
#!/usr/bin/env python
########################################################
# Translates an ID to a value returned by a URL lookup #
########################################################

import csv
import sys
import urllib2

def main():
    # Get the name of the column holding the ID field to translate
    if len(sys.argv) < 2:
        print "incorrect argument for script lookup - pass the name of the column holding the ID as a parameter"
        sys.exit(1)
    idColumn = sys.argv[1]

    # Set the name of the column that will be created to hold the
    # translated value (the Sales Region, basically)
    salesRegionColumn = 'SalesRegion'

    reader = csv.reader(sys.stdin)
    writer = None
    header = []
    first = True

    for line in reader:
        if first:
            header = line
            # Append the new column name to the list of headers
            header.append(salesRegionColumn)
            if idColumn not in header:
                print "Id field must be submitted"
                sys.exit(1)
            csv.writer(sys.stdout).writerow(header)
            writer = csv.DictWriter(sys.stdout, header)
            first = False
            continue

        # Map each header name to the corresponding value in this row,
        # padding short rows with empty strings
        result = {}
        i = 0
        while i < len(header):
            if i < len(line):
                result[header[i]] = line[i]
            else:
                result[header[i]] = ''
            i += 1

        # If SalesRegion already exists for any particular reason,
        # just output the existing value
        if len(result[salesRegionColumn]):
            writer.writerow(result)
        # If the ID has a value, use it to translate to a SalesRegion
        elif len(result[idColumn]):
            # Debug output goes to stderr; stdout carries the CSV results
            sys.stderr.write('looking up id %s\n' % result[idColumn])
            salesRegion = lookup("http://www.foo.com/splunk/lookup/GetSalesRegionNameFromUserID?user_id=", result[idColumn])
            result[salesRegionColumn] = salesRegion
            writer.writerow(result)
        # Otherwise pass the row through untranslated
        else:
            writer.writerow(result)

def lookup(url, id):
    fullUrl = url + id
    try:
        response = urllib2.urlopen(fullUrl)
    except urllib2.URLError:
        return 'cannot translate ID due to URLError'
    except:
        return 'cannot translate ID'
    data = response.read()
    # Return the first eight characters of the response. You could change
    # this to whatever you want (apply a regex, for example).
    return data[:8]

main()
Your transforms.conf stanza should be something like this (where the name of the field holding the ID to translate is called "user_id"):
[your_lookup_name]
external_cmd = the_script_name.py user_id
fields_list = user_id, SalesRegion
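Once that stanza is in place (and the script is deployed where Splunk can find it), you should be able to exercise the lookup straight from the search bar and run statistics on the returned field, along these lines (the sourcetype here is just a placeholder):

sourcetype=your_sourcetype
| lookup your_lookup_name user_id OUTPUT SalesRegion
| stats count by SalesRegion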
This is great information. Thank you to everyone.
I have a question regarding saving the retrieved SalesRegion to the Splunk index. What I'm finding is that my searches have slowed down considerably, since the Python script makes a separate call to the web to resolve each user ID. Moreover, the SalesRegion is never actually written to the index, so every time I do a search, run a report, create a pivot, etc., the Python script is re-invoked and, with it, the HTTP calls for the entire search scope.
Can anyone help here?
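One thing that can help within a single search is memoizing the lookups inside the script, so each distinct user ID triggers at most one HTTP call per invocation. Here is a minimal sketch, reusing the lookup() function from the answer above (the cache only lasts for one run of the script, so it won't persist anything to the index; for that you'd want to look at writing the results out to a local lookup file):

# Cache of id -> region, shared by all rows in one invocation of the script
cache = {}

def cached_lookup(url, id):
    # Only go out to the web the first time we see a given id
    if id not in cache:
        cache[id] = lookup(url, id)
    return cache[id]

Then call cached_lookup() instead of lookup() in main().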
Tested it. As long as your search returns a field named user_id and your script is written to return a column named MarketID (header.append("MarketID")), it should work with this transforms.conf:
[marketIDLookup]
external_cmd=user_to_market_id.py user_id MarketID
external_type=python
fields_list=user_id, MarketID
props.conf:
[int_web_log]
LOOKUP-table=marketIDLookup user_id OUTPUT MarketID AS market_id
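With that props.conf entry, the lookup runs automatically at search time for that sourcetype, so a quick way to verify the wiring is a search like:

sourcetype=int_web_log | stats count by market_id

If market_id comes back populated, everything is connected correctly.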
Thanks for the code! That's a huge help. I've adapted what you posted and Splunk is now calling my Web interface correctly. I can see the requests coming in with the right arguments and am able to respond. Right now, nothing is showing up in Splunk, but I can see that the data is going over the wire, so it looks like I still have a Splunk configuration step done incorrectly. I've been going through samples and the docs and have done the following:

Added a stanza to props.conf tied to the sourcetype, like:

[int_web_log]
lookup_table = marketIDLookup user_id OUTPUT MarketID AS market_id

Added a stanza to transforms.conf, like:

[marketIDLookup]
external_cmd = user_to_market_id.py user_id MarketID
external_type = python
fields_list = user_id, MarketID

Restarted Splunk.

When I load data from the int_web_log sourcetype, the new field does not show up in the list. I've experimented with trying to do the lookup directly from the search bar using a pipe, but get no results there either. My guesses as to what's wrong:

1) Something is wrong in my configuration statements, as the field does not show up in the field list or picker.

2) I may be returning the data in the wrong format. I've been returning just the raw value, but have also tried:

MarketID+CRLF+data+CRLF

Getting as far as seeing the requests come in and the responses go out is a huge win, but I'd still be grateful for help figuring out how to get everything wired into Splunk properly.

P.S. I don't see that it matters, but I put my script into $SPLUNK_HOME/etc/searchscripts/user_to_market_id.py
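One debugging suggestion on the return format: the external lookup contract is just CSV in on stdin and CSV out on stdout, so you can test the script outside Splunk before wiring it in. Pipe it a two-line CSV and check what comes back (the region value here is made up for illustration):

Input piped to the script:

user_id
1b4545x5

Expected output, with the header echoed back and the new column filled in:

user_id,MarketID
1b4545x5,NorthEast

If that round trip works, the remaining problem is on the Splunk configuration side rather than in the script or your web endpoint.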