Similarity "score" of similar values of fields

Explorer

Howdy Folks,

I have data in a chart similar to this, with particular scored values per attribute (many attributes, maybe up to several hundred) for a given item.

The attribute values are average or sum scores based on a previous search, something like chart avg(score) by attribute, item. So the attributes are not extracted fields or anything like that. The attribute names themselves are pulled from a lookup table and used as the column headers, and each value is the average or sum of a score field that the lookup maps to a particular attribute name. I guess this may be important because the list of headers is essentially dynamic and may change, which could complicate things if I have to pass in field names without a wildcard...

Item | Attribute1 | Attribute2 | Attribute3 | ... | Attribute99
dog  | 5.5        | 10.0       | 0          | ... | 100
cat  | 6.5        | 2.0        | 100        | ... | 0
bird | 0.2        | 4          | 10         | ... | 25

What I'm after: is there a way I can determine how similar dog is to cat based on the likeness / similarity of their attribute scores?

Thanks!

SplunkTrust

@oclumbertruck - have you made any progress on this one?

Explorer

Yes, we have made some progress, and I want you to know that the content provided helped tremendously, both from a technical and a non-technical perspective.

What we basically did was take the minimum value for each attribute and use it as the origin (offset) in a Gaussian decay function. Then, on a per-attribute level, we are able to evaluate how "close" a particular attribute is to "perfect", that is, to the item that is best in class for a given feature. By then multiplying these per-feature scores together, we get a score for the combined features.

In our scenario, all the features are essentially bad-things-to-begin-with, so the larger the number, the worse off the item is. With the decay we are able to see how things compare by feature and drive scores into the ground as they move further and further from the "perfect" score.
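For anyone reading later, here is a minimal Python sketch of the approach described above, not our actual search. The function names, the sample data (taken from the table in the question), and especially the `scale` parameter (the width of the Gaussian) are assumptions made for illustration; the right width would have to be tuned per attribute.

```python
import math

def gaussian_decay(value, origin, scale):
    """Return 1.0 at the origin and decay toward 0 as value moves away from it."""
    return math.exp(-((value - origin) ** 2) / (2 * scale ** 2))

def combined_scores(items, scale=10.0):
    """Multiply per-attribute decay scores together for each item.

    The origin for each attribute is its minimum across all items,
    because larger values are assumed to be worse. `scale` is an
    assumed, hand-picked width for the decay.
    """
    attributes = sorted({a for scores in items.values() for a in scores})
    origins = {
        a: min(scores[a] for scores in items.values() if a in scores)
        for a in attributes
    }
    result = {}
    for name, scores in items.items():
        product = 1.0
        for a in attributes:
            if a in scores:
                product *= gaussian_decay(scores[a], origins[a], scale)
        result[name] = product
    return result

# Sample data borrowed from the table in the question.
items = {
    "dog": {"Attribute1": 5.5, "Attribute2": 10.0, "Attribute3": 0.0},
    "cat": {"Attribute1": 6.5, "Attribute2": 2.0, "Attribute3": 100.0},
    "bird": {"Attribute1": 0.2, "Attribute2": 4.0, "Attribute3": 10.0},
}
scores = combined_scores(items)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, score)
```

Because the per-attribute scores are multiplied, a single attribute far from its best-in-class value (cat's Attribute3 here) pushes the combined score toward zero.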

So, not entirely what we had set out for, but a valid workaround that seems to fit our needs.

Thanks again for your help and insight.

SplunkTrust

Answering this question in general (as opposed to answering it for a specific application) requires roughly two semesters of graduate statistics.

Basically, you have to define and measure Similarity, which also requires that you define and measure Difference, or Variability. All of which requires some kind of scoring methodology, which usually would be determined in conjunction with understanding what the underlying measures are.

As a first, awfully simplistic way of looking at the question, you could take the measures that the two items have in common, and calculate the stdev for the entire population of items on each measure, and then calculate how many stdevs away from each other the two are. You could initially do that in terms of z score or percentile or whatever... the "right" choice will have to use successive approximation until the answers are coming in sensibly based on reference items you KNOW to be similar and items you KNOW to be different. The only requirement is that all the measures are scored the same, relative to their baselines (which is why you use zscore or stdevs or percentile rather than gross score difference).

Any measures that the two do NOT have in common must be treated as differences, and assigned some arbitrarily high distance/zscore/percentile.

Their gross geometric difference score then becomes the square root of the sum of the squares of their differences... which may yield some information or may not.
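As a rough illustration of the steps above, here is a Python sketch under stated assumptions: each attribute is scaled by the population standard deviation across all items that have it, and `missing_penalty` is an arbitrary stand-in for the "arbitrarily high" distance assigned to measures the two items do not share. The function name and sample data are made up for the example.

```python
import math
import statistics

def zscore_distance(items, a, b, missing_penalty=3.0):
    """Euclidean distance between items a and b, measured in stdevs per attribute."""
    attributes = sorted({attr for scores in items.values() for attr in scores})
    total = 0.0
    for attr in attributes:
        in_a, in_b = attr in items[a], attr in items[b]
        if not in_a and not in_b:
            continue  # neither item has this measure, so it says nothing
        if in_a and in_b:
            population = [s[attr] for s in items.values() if attr in s]
            sd = statistics.pstdev(population)
            # If every item has the same value, the measure cannot separate them.
            diff = 0.0 if sd == 0 else (items[a][attr] - items[b][attr]) / sd
        else:
            # Measure present on only one side: treat it as a difference.
            diff = missing_penalty
        total += diff ** 2
    return math.sqrt(total)

# Sample data borrowed from the table in the question.
items = {
    "dog": {"Attribute1": 5.5, "Attribute2": 10.0, "Attribute3": 0.0, "Attribute99": 100.0},
    "cat": {"Attribute1": 6.5, "Attribute2": 2.0, "Attribute3": 100.0, "Attribute99": 0.0},
    "bird": {"Attribute1": 0.2, "Attribute2": 4.0, "Attribute3": 10.0, "Attribute99": 25.0},
}
print("dog vs cat:", zscore_distance(items, "dog", "cat"))
print("dog vs bird:", zscore_distance(items, "dog", "bird"))
```

A smaller distance means "more similar", but as noted, the number only becomes meaningful once it is calibrated against items known to be similar and items known to be different.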


One of the basic problems with the strategy behind your source data is that you've ALREADY extracted various statistical information which identifies relationships between things, but then you've deleted the metaknowledge that relates those statistics to each other. Assuming items were different models of car, you have twelve numbers that represent, in no particular order and in no particular standard of measurement, the car's wheel base, miles per gallon, horsepower, weight of car, number of passengers, recommended mileage for first maintenance, sticker price, overall length, turning radius, number of cylinders, customer satisfaction rating, number of such cars produced and sold per year, and so on.

A proper treatment analyzing differences between car models would have to be cognizant of which variables were expected to move together. Smaller cars get better gas mileage, so as weight drops, MPG goes up and length and wheel base drop. A car which violates this rule is likely to be an outlier of some sort, and "different" from those that track the rule. However, following the rule does not make two cars at different points on the curve "similar" to each other; they are just exemplars of their portion of the weight-performance curve.

SplunkTrust

Here's an interesting semi-standard approach - using some arbitrary yardstick of stdevs, calculate whether or not the items are similar on each measurement they share in common, and assign by fiat that they are not similar on any measurement that they do not share. Then calculate the Sørensen–Dice (DSC) coefficient for the two items.

https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient

Let's say that dog has 24 measurements, cat has 26, they share 20 measurements, and allowing readings within 1 stdev to be considered similar, they are found similar on 12 of those shared measurements.

DSC = 2*12 / (24+26) = 24/50 = 48% similar
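The same arithmetic as a small Python sketch. It assumes each item is represented as a dict of measurement name to z-score (a made-up representation for the example), and the function names are hypothetical.

```python
def count_similar(z_a, z_b, threshold=1.0):
    """Count shared measurements whose z-scores differ by at most `threshold` stdevs."""
    shared = z_a.keys() & z_b.keys()
    return sum(1 for m in shared if abs(z_a[m] - z_b[m]) <= threshold)

def dice_coefficient(count_a, count_b, similar_count):
    """Sørensen–Dice coefficient: 2 * |similar| / (|A| + |B|)."""
    if count_a + count_b == 0:
        return 0.0  # no measurements at all, so nothing to compare
    return 2 * similar_count / (count_a + count_b)

def dice_similarity(z_a, z_b, threshold=1.0):
    """Measurements missing from either side never count as similar."""
    return dice_coefficient(len(z_a), len(z_b), count_similar(z_a, z_b, threshold))

# The worked example: 24 and 26 measurements, similar on 12.
print(dice_coefficient(24, 26, 12))  # 0.48
```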

Or, you could reverse the process and ask: how many stdevs of variance do I need to allow as similar in order for the two items to be considered 50% (or some other yardstick) similar? This is only slightly more complicated to calculate. In the example above, 50% similar requires 2k/50 = 0.5, or k = 12.5 similar measurements, so the threshold falls between the 12th and 13th smallest number of stdevs found among the measurements, and you need the 13th to actually reach 50%.
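A sketch of that reverse calculation in Python, assuming you already have the absolute z-score differences for the shared measurements as a list. The function name and the sample differences are invented for illustration.

```python
import math

def min_threshold(shared_diffs, count_a, count_b, target=0.5):
    """Smallest stdev threshold at which the Dice coefficient reaches `target`.

    shared_diffs: absolute z-score differences on the shared measurements.
    Returns None if the target cannot be reached even when every shared
    measurement counts as similar.
    """
    # Dice = 2k / (count_a + count_b) >= target  =>  k >= target * total / 2
    needed = math.ceil(target * (count_a + count_b) / 2)
    if needed <= 0:
        return 0.0
    if needed > len(shared_diffs):
        return None
    # The needed-th smallest difference is the tightest threshold that works.
    return sorted(shared_diffs)[needed - 1]

# 20 shared measurements with made-up differences of 1, 2, ..., 20 stdevs.
diffs = [float(i) for i in range(20, 0, -1)]
print(min_threshold(diffs, 24, 26, target=0.5))  # 13.0
```

With 24 and 26 measurements, 12.5 similar measurements are needed for 50%, which rounds up to 13, so the result is the 13th smallest difference.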
