Splunk Search

Simulating a SQL JOIN in Splunk

Contributor

I have indexed the contents of a relational database along with a log file. My log contains these fields:

  • cost - this is an integer that I want to report on
  • userid - this is an ID into the Users table that I've indexed from my database

My users table, now indexed in Splunk, contains these fields:

  • userid - same ID as above
  • username - human-readable text of their name

I want to end up with a chart which shows the top 20 users ordered by how much they've spent (sum of cost field) over a specific time range (e.g. last 2 weeks). I want to show them by username, not by ID. What's the best way to do this in Splunk?

1 Solution

SplunkTrust

It is easy to look at this problem and think that since this would definitely be a `join` in SQL, therefore you need to use the `join` command in Splunk.

However, you do not need the join command here, and you will be much better off without it. Not only is join slower (it has to run a second search, with a second process pulling events off disk), but its subsearches will quietly truncate at, I think, 50,000 events, and will quietly self-finalize after some number of seconds set in limits.conf, whichever comes first. Both of these problems will bite you, and it's usually easy to avoid join by thinking about how to solve the problem with lookups, or with the plain old stats command and some grouping.

In your particular case I'll echo Gerald's answer that the lookup is probably the way to go, particularly if the userids and usernames don't change very much relative to the schedule of the search maintaining the lookup table.

But here's how to do it just with search and stats. One big OR clause in search mashes the two data sets together and then we do a little bit of stats command to merge it all into what we want.

Run this search over the last 2 weeks:

source=costtable OR source=usertable | stats first(username) as username sum(cost) as total by userid | sort - total | fields username total
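To see why the OR clause plus stats works as a join, here is a hedged Python sketch of the same idea: events from both sources land in one stream, stats groups them by userid, keeps the first username it sees, and sums cost. The event dicts and values are illustrative stand-ins, not real Splunk data.

```python
from collections import defaultdict

# Stand-in for the combined event stream from "source=costtable OR source=usertable"
events = [
    {"source": "usertable", "userid": 1, "username": "alice"},
    {"source": "usertable", "userid": 2, "username": "bob"},
    {"source": "costtable", "userid": 1, "cost": 30},
    {"source": "costtable", "userid": 1, "cost": 12},
    {"source": "costtable", "userid": 2, "cost": 5},
]

# "stats first(username) as username sum(cost) as total by userid"
groups = defaultdict(lambda: {"username": None, "total": 0})
for e in events:
    g = groups[e["userid"]]
    if g["username"] is None and "username" in e:
        g["username"] = e["username"]   # first(username)
    g["total"] += e.get("cost", 0)      # sum(cost); cost events carry no username

# "sort - total"
rows = sorted(groups.values(), key=lambda g: -g["total"])
```

The point is that no second search is needed: a single pass over the merged stream produces one row per userid with both the name and the total.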



Splunk Employee

You might want to read "Splunk for SQL Users"...

If you are familiar with SQL, this quick comparison might be helpful for jump-starting your use of Splunk.

http://www.innovato.com/splunk/SQLSplunk.html

Builder

Alternatively, you could use the join command. Assuming this first search returns the usernames:

source=usertable | dedup username | fields + userid username

and your cost data has the common field userid, you could use the following to join the outputs:

source=costtable | join userid [source=usertable | fields + userid username]

Bob


Splunk Employee

The best way is to set up your users table as a lookup. Periodically (or whenever you update it, or whenever you need it), run:

source=usertable | outputlookup usertable.csv

You could also skip putting the user table into the Splunk index entirely: just export it as CSV and place it into a lookup table directly. If you define the lookup table in transforms.conf, you can use that name rather than usertable.csv.

Then:

source=logfile | lookup usertable.csv userid OUTPUT username | stats sum(cost) by username | sort 20 -sum(cost)
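Conceptually, the lookup is just a dictionary applied per event, which is why it is cheap. Here is an illustrative Python sketch (the userids, names, and costs are made up): usertable.csv becomes an in-memory map keyed by userid, each log event gets a username added, and totals are summed per name. Note the left-join behavior: an event with no match keeps flowing, just with a null username.

```python
from collections import Counter

# Stand-in for usertable.csv loaded as userid -> username
lookup = {1: "alice", 2: "bob"}

# Stand-in for events from "source=logfile"
log = [
    {"userid": 1, "cost": 30},
    {"userid": 2, "cost": 5},
    {"userid": 3, "cost": 7},  # no matching row in the lookup table
]

totals = Counter()
for event in log:
    # "lookup usertable.csv userid OUTPUT username" (left join: misses get None)
    event["username"] = lookup.get(event["userid"])
    # "stats sum(cost) by username"
    totals[event["username"]] += event["cost"]

# "sort 20 -sum(cost)"
top20 = totals.most_common(20)
```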

You could also set the lookup as an automatic lookup in props.conf so it runs every time you view your log source (or sourcetype or whatever).

Splunk Employee

Splunk 4.0 executes lookups more efficiently than join. Making the lookup automatic lets you search efficiently by username (e.g. source=costtable username=myname), whereas with join you must execute the join on the entire costtable. If there is no match on, say, userid, join will drop your costtable row while lookup will keep it (join behaves like an inner join; lookup behaves like a left join). Lookups can also be transparently backed by a script. Note that with join you need to specify the option max=0, or you'll only get one result per row in the usertable.
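The inner-join vs left-join distinction described above can be shown in a few lines of Python (the user and cost values are hypothetical): the inner version drops the cost row whose userid has no user record, while the left version keeps it with a null name.

```python
# userid -> username; userid 2 deliberately has no user row
users = {1: "alice"}
# (userid, cost) pairs from the cost data
costs = [(1, 30), (2, 5)]

# join-style (inner): unmatched cost rows are dropped
inner = [(users[uid], c) for uid, c in costs if uid in users]

# lookup-style (left): unmatched cost rows survive with username=None
left = [(users.get(uid), c) for uid, c in costs]
```

If the spend report must account for every cost event, including ones whose userid is missing from the user table, that left-join behavior is exactly what you want.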

Contributor

How does the lookup approach differ (in performance, ease of management, etc.) from @Equalis's suggestion above to use the join command?
