The Splunk best practices document recommends:
Use clear key-value pairs
key1=value1, key2=value2, key3=value3 . . .
This makes sense for simple data that can be represented in key-value format, but what about nested data structures? For example, what's the best way of representing the following log data using key-value format?
{
  "categories": [
    "Restaurants",
    "American (New)",
    "Southern"
  ],
  "attributes": {
    "BusinessParking": {
      "street": false,
      "garage": true
    },
    "WheelchairAccessible": true,
    "GoodForKids": false
  },
  "stars": 4.5,
  "city": "Las Vegas",
  "name": "Yardbird Southern Table & Bar"
}
I can represent the attributes and top-level keys using dotted notation:
attributes.BusinessParking.street="false",
attributes.BusinessParking.garage="true",
attributes.WheelchairAccessible="true",
attributes.GoodForKids="false",
stars="4.5",
city="Las Vegas",
name="Yardbird Southern Table & Bar",
I'm not sure this is optimal, though.
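To make the dotted-notation idea concrete, here is a minimal Python sketch of how a logger might flatten the nested record into the key-value line shown above. The function names flatten and to_kv_line are hypothetical, not part of any Splunk tooling, and the sketch deliberately skips arrays, since how to represent those is the open question.

```python
def flatten(record, prefix=""):
    """Yield (dotted_key, value) pairs; nested dicts are walked, lists are yielded as-is."""
    for key, value in record.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, dotted)
        else:
            yield dotted, value


def to_kv_line(record):
    """Render scalar fields as key="value" pairs (hypothetical helper).

    Assumption: values contain no double quotes, so no escaping is done.
    """
    parts = []
    for key, value in flatten(record):
        if isinstance(value, list):
            continue  # arrays are skipped here; their encoding is undecided
        if isinstance(value, bool):
            value = str(value).lower()  # match JSON's true/false spelling
        parts.append(f'{key}="{value}"')
    return ", ".join(parts)


record = {
    "attributes": {
        "BusinessParking": {"street": False, "garage": True},
        "WheelchairAccessible": True,
        "GoodForKids": False,
    },
    "stars": 4.5,
    "city": "Las Vegas",
    "name": "Yardbird Southern Table & Bar",
}

print(to_kv_line(record))
```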
However, my main question is: how should I represent the categories array?
I need to be able to search the above data and return all records that have more than N categories, so how should my data be structured to make such a query as efficient as possible?
I'm asking because we currently store our logs in JSON format, and I can indeed perform the above query on JSON data using spath. However, some people in my organization believe that spath is very slow and that key-value is much faster, and they want to change our logging format from JSON to key-value. I'd like to compare both log structures, JSON and key-value, to understand which format is more efficient for querying (if, in fact, there is any difference at all). At the moment, I can't even figure out how best to structure the key-value logs so that I can query array data.
@adamcohen - what did you end up doing?
I am in the same situation as you. If Splunk recommends key-value pairs (which I also prefer to JSON), why doesn't it recommend a way to represent searchable arrays?
If your data is in JSON, keep it that way and just set KV_MODE = json on your sourcetype.
Thanks for the response @starcher. However, I'm not trying to solve this problem for JSON-formatted logs; I already know how to do that, and it works well. The problem is how to solve it for key-value formatted logs, since my organization wants a clear comparison of JSON-formatted logs versus key-value. That is why I'm trying to figure out the best way to store a nested data structure in key-value format: I want to run the same queries against both JSON and key-value formatted data, see how the two formats differ, and summarise the advantages and disadvantages of each approach.
For example, to return all restaurants that have more than 15 categories, I can use the following query on JSON-formatted data:
source="business.json" | spath categories{} | where mvcount('categories{}') > 15
The above query requires spath, which can be slow. To compare this to key-value, I first need to understand how to store the nested data (including the categories array) in key-value format, so that I can then construct an equivalent query.
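As a starting point for that comparison, here is one candidate encoding for the array, sketched in Python. It is only an assumption on my part, not something the Splunk best practices document prescribes: the array becomes a single delimited field plus a precomputed count field. The helper encode_array, the ";" delimiter, and the categories_count field name are all hypothetical.

```python
def encode_array(key, values, delim=";"):
    """Render a list as one delimited field plus a count field (hypothetical scheme).

    Assumption: no element contains the delimiter or a double quote.
    """
    for value in values:
        if delim in value:
            raise ValueError(f"value {value!r} contains the delimiter {delim!r}")
    joined = delim.join(values)
    return f'{key}="{joined}", {key}_count={len(values)}'


print(encode_array("categories", ["Restaurants", "American (New)", "Southern"]))
```

With logs in this shape, the search could filter on the count field directly (categories_count>15), or split the delimited field with makemv delim=";" categories and then apply mvcount(categories). Whether either is actually faster than the spath query is exactly what would need to be measured.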