I can't get any output. My test dataset includes two fields, f1 and f2:
| inputcsv tmp1030.csv | arules f1 f2
How do I do it? Thanks.
tmp1030.csv:
f1 f2
a 1
a 2
a 3
a 4
b 1
b 2
c 2
c 3
c 4
d 2
d 3
e 1
e 2
e 4
f 3
f 4
g 2
g 4
Updated:
I found that using table fails, but using fields works. So I added tmp1030.csv to the test index. Then:
index=test source="/opt/splunk/var/spool/splunk/dd7e0d3b0d032b1a_events.stash_new" | fields + f1 f2 | arules f1 f2 sup=2 conf=.3
Result:
Given fields Implied fields Strength Given fields support Implied fields support
1 a 0.333333 3 1
1 b 0.333333 3 1
1 e 0.333333 3 1
b 1 0.500000 2 1
b 2 0.500000 2 1
c 2 0.333333 3 1
c 3 0.333333 3 1
c 4 0.333333 3 1
d 2 0.500000 2 1
d 3 0.500000 2 1
e 1 0.333333 3 1
e 2 0.333333 3 1
e 4 0.333333 3 1
f 3 0.500000 2 1
f 4 0.500000 2 1
g 2 0.500000 2 1
g 4 0.500000 2 1
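For reference, the numbers in this table can be reproduced outside Splunk. The following is a minimal Python sketch, assuming (inferred from the output above, not from Splunk's documentation) that Strength is the pair count divided by the count of the given value, that "Given fields support" is the count of the given value, and that sup and conf are minimum thresholds on those two numbers. The function name simple_rules is hypothetical.

```python
from collections import Counter

# Rows of tmp1030.csv as (f1, f2) pairs
rows = [
    ("a", "1"), ("a", "2"), ("a", "3"), ("a", "4"),
    ("b", "1"), ("b", "2"),
    ("c", "2"), ("c", "3"), ("c", "4"),
    ("d", "2"), ("d", "3"),
    ("e", "1"), ("e", "2"), ("e", "4"),
    ("f", "3"), ("f", "4"),
    ("g", "2"), ("g", "4"),
]


def simple_rules(rows, sup=2, conf=0.3):
    """Return (given, implied, strength, given_support, pair_support) tuples.

    Assumption: this mirrors what the arules output above appears to show;
    it is not Splunk's actual implementation.
    """
    pair_count = Counter(rows)
    f1_count = Counter(f1 for f1, _ in rows)
    f2_count = Counter(f2 for _, f2 in rows)
    rules = []
    for (f1, f2), n in pair_count.items():
        # Consider the rule in both directions: f1 -> f2 and f2 -> f1
        for given, implied, given_n in ((f1, f2, f1_count[f1]), (f2, f1, f2_count[f2])):
            strength = n / given_n
            if given_n >= sup and strength >= conf:
                rules.append((given, implied, round(strength, 6), given_n, n))
    return sorted(rules)


rules = simple_rules(rows)
for rule in rules:
    print(*rule)
```

Under these assumptions, the value a produces no rules as a given field because each of its pairs has strength 1/4, which is below conf=.3.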
Please excuse my English.
@inventsekar
I want to use Splunk to do arules analysis based on the data in http://www.salemmarafi.com/code/market-basket-analysis-with-r/.
First:
I downloaded the Groceries data, which looks like this. But Splunk does not support arules analysis on a single field.
id items
1 {citrus fruit,semi-finished bread,margarine,ready soups}
2 {tropical fruit,yogurt,coffee}
3 {whole milk}
4 {pip fruit,yogurt,cream cheese ,meat spreads}
5 {other vegetables,whole milk,condensed milk,long life bakery product}
Second:
I created a Splunk custom command.
combin.py:
import itertools
import traceback

import splunk.Intersplunk


def combinations(results):
    try:
        # Get the list of field names (keywords) and the hash of options
        fields, argvals = splunk.Intersplunk.getKeywordsAndOptions()
        # For each result, write every 2-item combination of "items" into the given fields
        for r in results:
            items = r["items"].split(",")
            pairs = itertools.combinations(items, 2)
            joined = '; '.join(','.join(p) for p in pairs)
            for f in fields:
                r[f] = joined
    except Exception:
        # Replace the results with an error row so the failure is visible in Splunk
        stack = traceback.format_exc()
        results = splunk.Intersplunk.generateErrorResults("Error : Traceback: " + str(stack))
    # Return the results (or the error) to Splunk
    splunk.Intersplunk.outputResults(results)


results, dummyresults, settings = splunk.Intersplunk.getOrganizedResults()
combinations(results)
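To see what the command writes into the target field, the pairing step can be run on its own. This is a small standalone sketch using one row of the sample data; the helper name pair_items is hypothetical and it uses no Splunk modules.

```python
import itertools


def pair_items(items):
    """Build the '; '-separated list of 2-item combinations, as combin.py does."""
    parts = items.split(",")
    pairs = itertools.combinations(parts, 2)
    return "; ".join(",".join(p) for p in pairs)


sample = "citrus fruit,semi-finished bread,margarine,ready soups"
result = pair_items(sample)
print(result)
```

A basket of n items yields n*(n-1)/2 pairs, so this 4-item basket gives 6 pairs, and a single-item basket gives an empty string.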
Third:
In Splunk:
| inputcsv Groceries.csv
| eval top1item=if(match(items,"whole milk"),"whole milk",null()) | search top1item="whole milk"
| eval items=replace(items,"{([^}]*)}","\1")
| eval items=replace(items,"whole milk,","")
| eval items=replace(items,",whole milk","")
| combin item2c
| makemv delim=";" item2c | fields - _time items
| mvexpand item2c
| collect index=test marker="id=t3"
Final:
index=test id=t3
| arules item2c top1item sup=1
| sort 20 - "Given fields support"
Result:
Given fields Implied fields Strength Given fields support Implied fields support
root vegetables,other vegetables whole milk 1.000000 228 228
other vegetables,yogurt whole milk 1.000000 219 219
other vegetables,rolls/buns whole milk 1.000000 176 176
tropical fruit,other vegetables whole milk 1.000000 168 168
yogurt,rolls/buns whole milk 1.000000 153 153
tropical fruit,yogurt whole milk 1.000000 149 149
other vegetables,whipped/sour cream whole milk 1.000000 144 144
root vegetables,yogurt whole milk 1.000000 143 143
other vegetables,soda whole milk 1.000000 137 137
pip fruit,other vegetables whole milk 1.000000 133 133
citrus fruit,other vegetables whole milk 1.000000 128 128
root vegetables,rolls/buns whole milk 1.000000 125 125
other vegetables,domestic eggs whole milk 1.000000 121 121
tropical fruit,root vegetables whole milk 1.000000 118 118
other vegetables,butter whole milk 1.000000 113 113
tropical fruit,rolls/buns whole milk 1.000000 108 108
yogurt,whipped/sour cream whole milk 1.000000 107 107
other vegetables,bottled water whole milk 1.000000 106 106
other vegetables,pastry whole milk 1.000000 104 104
other vegetables,fruit/vegetable juice whole milk 1.000000 103 103
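Every Strength above is 1.000000, which appears to follow from the search itself: only baskets matching top1item="whole milk" were collected, so "whole milk" is present in every event by construction. A minimal Python sketch of that effect, using made-up baskets (not the Groceries data) and a hypothetical helper named confidence:

```python
def confidence(baskets, given, implied):
    """Fraction of baskets containing all of `given` that also contain `implied`."""
    with_given = [b for b in baskets if given <= b]
    if not with_given:
        return 0.0
    return sum(1 for b in with_given if implied in b) / len(with_given)


# Made-up baskets, for illustration only
baskets = [
    {"whole milk", "yogurt", "rolls/buns"},
    {"yogurt", "rolls/buns"},
    {"whole milk", "yogurt", "rolls/buns", "soda"},
    {"yogurt", "rolls/buns", "soda"},
]
pair = {"yogurt", "rolls/buns"}

# Over all baskets, only some of those containing the pair also contain whole milk
all_conf = confidence(baskets, pair, "whole milk")

# After keeping only whole-milk baskets (as the search above does), every match has it
milk_only = [b for b in baskets if "whole milk" in b]
filtered_conf = confidence(milk_only, pair, "whole milk")

print(all_conf, filtered_conf)
```

If that reading is right, meaningful strengths would need the pairs from all baskets, not only the ones that contain whole milk.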
Conclusion: arules analysis in Splunk is not mature yet, so I have temporarily abandoned it.
The above information is for reference only.
Well, this is not much related to the Splunk arules command, but it is an interesting read on the arules topic.
As the arules command documentation says: "Implements arules algorithm as discussed in Michael Hahsler, Bettina Gruen and Kurt Hornik (2012). arules: Mining Association Rules and Frequent Itemsets. R package version 1.0-12" (http://docs.splunk.com/Documentation/Splunk/6.4.2/SearchReference/arules)
I googled it and found this:
http://www.salemmarafi.com/code/market-basket-analysis-with-r/
A little bit of Math
We already discussed the concept of Items and Item Sets.
We can represent our items as an item set as follows:
I = { i_1, i_2, …, i_n }
Therefore a transaction is represented as follows:
t_n = { i_j, i_k, …, i_n }
This gives us our rules which are represented as follows:
{ i_1, i_2 } => { i_k }
Which can be read as “if a user buys an item in the item set on the left hand side, then the user will likely buy the item on the right hand side too”.
A more human readable example is:
{coffee,sugar} => {milk}
If a customer buys coffee and sugar, then they are also likely to buy milk.
With this we can understand three important ratios; the support, confidence and lift. We describe the significance of these in the following bullet points, but if you are interested in a formal mathematical definition you can find it on wikipedia.
Support: The fraction of transactions in our dataset in which our item set occurs.
Confidence: The probability that a rule is correct for a new transaction with items on the left.
Lift: The ratio by which the confidence of a rule exceeds the expected confidence.
Note: if the lift is 1 it indicates that the items on the left and right are independent.
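As a rough illustration of these three ratios, here is a short Python sketch on a made-up set of transactions. It uses the usual textbook definitions (support as a fraction of transactions, confidence as support of both sides divided by support of the left side, lift as confidence divided by support of the right side); the function names and the data are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Support of lhs and rhs together, relative to the support of lhs."""
    return support(transactions, lhs | rhs) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Confidence of the rule relative to how often rhs occurs anyway."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


# Made-up transactions, for illustration only
transactions = [
    {"coffee", "sugar", "milk"},
    {"coffee", "sugar", "milk"},
    {"coffee", "sugar"},
    {"milk"},
    {"bread"},
]
lhs, rhs = {"coffee", "sugar"}, {"milk"}

rule_support = support(transactions, lhs | rhs)
rule_confidence = confidence(transactions, lhs, rhs)
rule_lift = lift(transactions, lhs, rhs)
print(rule_support, rule_confidence, rule_lift)
```

Here the rule {coffee,sugar} => {milk} holds in 2 of the 3 transactions that contain coffee and sugar, and its lift is slightly above 1, meaning milk is a little more likely in those baskets than in the data overall.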