Working with data of the form:
I am trying to see whether Splunk is suitable for analysis that is fairly straightforward in the SQL realm. I'd like to calculate which items in an order most frequently appear together. I started by calculating the most frequently occurring item numbers, and have been trying to use that as the basis for a subquery. In SQL I would likely generate summary tables and then join them together, and I am finding it hard to break out of the SQL mindset for this.
So far I have:
index="yyy" | stats count(ItemNumber) as ItemCount by ItemNumber | sort 10 -ItemCount
to get the most popular items. Using that list, I am trying to count, for each of those ItemNumbers, the top 3 other ItemNumbers that appear in the same Order.
Since the OrderNumber is present in all records, does using a transaction make this more straightforward? I am having difficulty visualizing how the data would be piped together in a Splunk query. Any guidance would be greatly appreciated.
Let me clarify what the data looks like:
OrderNumber=1234,LineNumber=1,ItemNumber=1200804,Description=info,Quantity=2,UnitPrice=200,LineItemTotal=400
OrderNumber=1234,LineNumber=2,ItemNumber=1269083,Description=info,Quantity=1,UnitPrice=612,LineItemTotal=612
OrderNumber=12345,LineNumber=1,ItemNumber=1270005,Description=blah,Quantity=2,UnitPrice=313,LineItemTotal=626
OrderNumber=12345,LineNumber=2,ItemNumber=1485227,Description=blah,Quantity=1,UnitPrice=70,LineItemTotal=70
So each row is either an order with a single line item or part of a multi-line order. That is why I am trying to group them together as a transaction (using the OrderNumber). The goal is a table counting, in descending order of frequency, how often two ItemNumbers occur together within the same OrderNumber / transaction across all OrderNumbers. Think of it as: Item #2 is often purchased when Item #1 is in an order (planning for inventory availability).
I have updated my earlier query to this:
index="yyy" | transaction OrderNumber | stats values(ItemNumber) as Items, max(LineNumber) as OrderLines by OrderNumber | table OrderNumber, Items, OrderLines | where OrderLines > 1 | sort -OrderLines,Items
This gives me a table by OrderNumber with a list of all of the ItemNumbers within a transaction. Ideally I would like to be able to run a function on the Items column to get a histogram or count of the frequency of any two values from that column across all orders.
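For reference, the pair-frequency count being described here is straightforward to sketch outside of Splunk. This is a minimal illustration in Python, assuming the per-order item lists (the Items column above) have already been extracted; the sample data and variable names are hypothetical:

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample: one list of ItemNumbers per OrderNumber,
# mirroring the Items column produced by the stats query above.
orders = {
    "1234": ["1200804", "1269083"],
    "12345": ["1270005", "1485227"],
    "12346": ["1200804", "1269083", "1270005"],
}

pair_counts = Counter()
for items in orders.values():
    # Sort and de-duplicate so (a, b) and (b, a) count as the same pair.
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

# Most frequent pairs first.
for (a, b), n in pair_counts.most_common(10):
    print(f"{a},{b} : {n}")
```

Each order contributes every unordered pair of its distinct items exactly once, which is the "histogram of any two values from that column across all orders" described above.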
In the SQL world I would likely create a summary table of the most popular items, and then use that as the basis for creating potential pairs of items. Filter the orders to those that contain one of the most popular items, and find out which other items appear most frequently in those orders.
The output I am trying to generate is something like this:
Bundle (ItemNumber) : Number of occurrences
155900,155950 : 500
160110,160112 : 387
where each bundle is a pair of ItemNumbers, and the rows are the most frequently occurring pairs within an order across all orders
The counterpart would be something like this:
Item : Recommended Item : Sales : Linked : Percentage
149183 : 194182 : 26 : 25 : 96%
101171 : 101172 : 26 : 24 : 93%
where Linked is the number of times they appeared in the same order, with rows sorted by that percentage
Thanks for the answer. I'm not sure that correlate applies directly, since I know the field exists in every record. I am trying to correlate the values of the field rather than the presence of the field itself.
The solution for this issue ended up being a combination of scripting outside of Splunk and some lookups within Splunk. I had worked through some possible approaches using multivalue fields and the mvzip function to generate value pairs, but that wasn't generating a complete set. limits.conf also needs to be adjusted to make sure that results are not truncated during processing.
Having said that, here is how the problem was addressed.
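The original script was not included in the post, but the described approach (script outside Splunk, feed the result back in as a lookup) can be sketched roughly like this. The filenames, column names, and the build_pair_lookup function are assumptions for illustration; the input is assumed to be a CSV export of the raw events with OrderNumber and ItemNumber columns:

```python
import csv
from collections import Counter
from itertools import combinations

def build_pair_lookup(export_path, lookup_path):
    """Count item pairs per order and write a CSV usable as a Splunk lookup.

    Hypothetical sketch: export_path is a CSV export of the order events
    (OrderNumber, ItemNumber columns); lookup_path is the lookup file to
    upload back into Splunk.
    """
    # Group the distinct ItemNumbers by OrderNumber.
    orders = {}
    with open(export_path, newline="") as f:
        for row in csv.DictReader(f):
            orders.setdefault(row["OrderNumber"], set()).add(row["ItemNumber"])

    # Count each unordered pair once per order.
    pairs = Counter()
    for items in orders.values():
        for pair in combinations(sorted(items), 2):
            pairs[pair] += 1

    # Write the lookup table, most frequent pairs first.
    with open(lookup_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ItemNumber1", "ItemNumber2", "PairCount"])
        for (a, b), n in pairs.most_common():
            writer.writerow([a, b, n])
```

The resulting CSV can then be uploaded as a lookup table file in Splunk and joined back against searches with the lookup command.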
Hope this helps others working on a similar effort to replicate standard BI queries in Splunk.