Splunk Search
Highlighted

Query to find correlated items

Explorer

Working with data of the form:

OrderNumber=x, ItemNumber=y

I am trying to see if Splunk is suitable for doing analysis that is fairly straightforward in the SQL realm. I'd like to calculate which items in an order appear together most frequently. I started by calculating the most frequently occuring item numbers, and then have been trying to use that as the basis for a subquery. In SQL I would likely generate summary tables and then join them together, and I am finding it hard to break out of the SQL mindset for this.

So far I have:
index="yyy" | stats count(ItemNumber) as ItemCount by ItemNumber | sort 10 -ItemCount

to get the most popular items. Using that, I am trying to get a count of the top 3 ItemNumbers that appear in an Order with that list of ItemNumbers.

Since the OrderNumber is present in all records, does using a transaction make this more straightforward? I am having difficult trying to visualize how the data would be piped together in a Splunk query. Any guidance would be greatly appreciated.

Edit:
Let me clarify what the data looks like:

OrderNumber=1234,LineNumber=1,ItemNumber=1200804,Description=info,Quantity=2,UnitPrice=200,LineItemTotal=400 OrderNumber=1234,LineNumber=2,ItemNumber=1269083,Description=info,Quantity=1,UnitPrice=612,LineItemTotal=612 OrderNumber=12345,LineNumber=1,ItemNumber=1270005,Description=blah,Quantity=2,UnitPrice=313,LineItemTotal=626 OrderNumber=12345,LineNumber=2,ItemNumber=1485227,Description=blah,Quantity=1,UnitPrice=70,LineItemTotal=70

So each row can either be an order with a single line item or be a part of a multi-line order. That is why I am trying to group them together as a transaction (using the OrderNumber). The goal is to get a table with a count of the frequency in descending order of the occurrence of two ItemNumbers (within the same OrderNumber / transaction) across all OrderNumbers. Think of it as Item #2 is often purchased when Item #1 is in a order (planning for inventory availability).

I have updated my earlier query to this: index="yyy" | transaction OrderNumber | stats values(ItemNumber) as Items, max(LineNumber) as OrderLines by OrderNumber | table OrderNumber, Items, OrderLines | where OrderLines > 1 | sort -OrderLines,Items

This gives me a table by OrderNumber with a list of all of the ItemNumbers within a transaction. Ideally I would like to be able to run a function on the Items column to get a histogram or count of the frequency of any two values from that column across all orders.

In the SQL world I would likely create a summary table of the most popular items, and then use that as the basis for creating potential pairs of items. Filter the orders to those that contain one of the most popular items, and find out which other items appear most frequently in those orders.

Update 2:
The output I am trying to generate is something like this:
Bundle (ItemNumber) : Number of occurrences
155900,155950 : 500
160110, 160112 : 387
etc
where the bundle is the two most frequently occurring pairs within an order across all orders

The counterpart would something like this:
Item : Recommended Item : Sales : Linked : Percentage
149183 : 194182 : 26 : 25 : 96%
101171 : 101172 : 26 : 24 : 93%
etc
where linked is the number of times they appeared in the same order, sorted by the percentage together

Tags (2)
0 Karma
Highlighted

Re: Query to find correlated items

Motivator

Hello

There is an existing command to do what you need: CORRELATE

http://docs.splunk.com/Documentation/Splunk/5.0.4/SearchReference/Correlate

Regards

0 Karma
Highlighted

Re: Query to find correlated items

Explorer

Thanks for the answer. I'm not sure that correlate applies directly, since I know that field exists in every record. I am trying to correlate the values of the field rather than the field itself.

0 Karma
Highlighted

Re: Query to find correlated items

Explorer

The solution for this issue ended up being a combination of scripting outside of Splunk and some lookups within Splunk. I had worked with some possible approaches using multivalued lists the mvzip function to generate value pairs, but it wasn't generating a complete set. The limits.conf also needs to be adjusted to make sure that results are not truncated during the processing.

Having said that, here is how the problem was addressed

  1. From the existing data in Splunk, use the transaction command to generate a multivalued list of all item numbers by order number
  2. Export this list to CSV, strip out the field names and keep only a list of comma separated item numbers on each line of the file, each line representing a single order
  3. Parse the file with an external (python was what i used) script that generates all possible item combinations for each order and dumps the order pairs into another file. The key is to make sure the orders items are sorted so that the lower numbered items is always first in each pair so you can avoid issues when aggregating results.
  4. After all ordered order item pairs have been generated, you can either go ahead and import back into splunk using a different sourcetype (or index), but I chose to go ahead and do some pre-processing - sort the file, run it through uniq, and then sort again by count (sort | uniq -c | sort -rn) and then capture that output. You then have an item pair and the count of occurrences.
  5. Import this ordered count of item pairs back into Splunk as a new source type and query directly. You can join on item number to get the description or use a lookup table.
  6. This external approach could be scripted to create triples by adding another nested loop.

Hope this helps others working on a similar effort to replicate standard BI queries in Splunk.

View solution in original post

0 Karma