Help on calculating statistics

Sukisen1981 · ‎02-11-2018

Hi, I have a CSV with something like the one shown. first field is order id and second field is product code.

ordr    prdcd
a              1
a              2
b              3
c             4
c             1
c             5
d             6
d             7
e             2
f             1
g             1
g             5

Ask- Look at order id 'a', it has 3 related products 1, 2 & 3. Ask is to find out if an order has a product , then what else could be the products that are usually purchased along with the same product?
For example, order ids c & g also have product code 1 , in addition both c and g order codes have linked product code 5, so for order id a, and product code 1 (first row) , we should be able to suggest to include product code 5 to be bundled / offered. However, when we come to the second row, there should be 0/no recommendations since the product code 2 is only being used in order id e and e in itself does not have any other linked product codes.
Another example are rows with order id f and order id c(row order id c product code 1). Here the recommendation should be 50% of order ids having product code 1 also have product code 5 AND 25% of order ids having product code 1 also have product code 2 & 3. Since order id a has linked products 2 & 3
I need this kind of recommendations in a third column for each row

sashraf · ‎02-11-2018

Try appending this to the search that generates the data you have above:

| stats values(prdcd) as itemsordered by ordr | stats values(itemsordered) as recommendations by itemsordered | rename itemsordered as prdcd | nomv recommendations

Without seeing your original source it would be difficult to say for sure, however you would most likely be best to write this to a lookup file (using outputlookup) and keep the file updated via a scheduled search, that way you will be able to add an auto lookup to your original search.

Sukisen1981 · ‎02-13-2018

Hi @sashraf thanks for the reply.
I had reached until this point earlier, I guess the stats is the easier part. However, the difficulty is I was not able to add a split by clause to get a percentage of the top linked products.

For example , in my above example look at product 1, it is present in order types a,c,f,g. Now, look at the OTHER products (except product 1) in these 4 orders. 'a' has product type 2 in addition to product type 1, 'c' has product types 4& 5 linked in addition to type 1, 'f' has no other products except type 1, lastly 'g' has type 5 linked other than one.
Hence, the expected recommendation in this case should be something like - 50% of order ids having product code 1 also have product code 5 AND 25% of order ids having product code 1 also have product code 2 & 3. Since order id a has linked products 2 & 4. I am not too worried about the text, but I do need an associated occurrence based count/percentage for the linked products, something like the preceding statement or at least something like
prd ordr linkedprd linkedprdcount 1 a 5 2 c 2 1 f 3 1 g

Help on calculating statistics

Enterprise Security Content Update (ESCU) | New Releases

Why am I not seeing the finding in Splunk Enterprise Security Analyst Queue?

Index This | What are the 12 Days of Splunk-mas?