I'm new to Splunk and really struggle with its documentation. Every time I try to do something, it does not work as documented.
I'm pretty fluent with the free tool jq, but that requires downloading the data from Splunk to process it, which is very inconvenient to do across the globe. I have a query producing JSON events. I'd like to do this trivial thing: extract data from the field json.msg (a trivial projection), parse it as JSON, then proceed further. In jq this is as easy as: '.json.msg | fromjson'. Done. Can someone advise how to do this in Splunk?
I tried:
… | spath input=json.msg output=msg_raw path=json.msg
and multiple variants of that, but it either does not compile (say, if path is missing) or does nothing.
… spath input=json.msg output=msg_raw path=json.msg | table msg_raw
prints empty lines.
I need to do much more complex things with it (reductions/aggregations/deduplications), all trivial in jq, but even this much is not doable in a Splunk query. How do I do it? Or where is valid documentation showing things that work?
SPL can present a steeper learning curve compared with non-streaming languages. But once you get some basics, it is very rewarding because it gives you so much freedom. That said, SPL's JSON path notation takes some getting used to. The JSON functions are actually fine once you understand the notation. Before I give my suggestions, let's examine your original attempt.
| spath input=json.msg output=msg_raw path=json.msg
This will not give you the desired output because the embedded JSON object in json.msg does not contain a path named json.msg. The object that does contain this path is _raw. If you try
| spath ``` input=_raw implied ``` output=msg_raw path=json.msg
you would have extracted a field named msg_raw that duplicates the value of json.msg:
| json.msg | msg_raw |
|---|---|
| {"name":"", "connection":22234743, "time":20000, "success":false, "type":"Prepared", "batch":false, "querySize":1, "batchSize":0, "query":["select * from whatever.whatever w where w.whatever in (?,?,?) "], "params":[["1","2","3"]]} | {"name":"", "connection":22234743, "time":20000, "success":false, "type":"Prepared", "batch":false, "querySize":1, "batchSize":0, "query":["select * from whatever.whatever w where w.whatever in (?,?,?) "], "params":[["1","2","3"]]} |
Of course, this is not what you wanted. What did we learn here? That the path option in spath refers to a path inside the JSON object being read.
But if you try
| spath input=json.msg
you will get these fields from json.msg:
| batch | batchSize | connection | name | params{}{} | querySize | query{} | success | time | type |
|---|---|---|---|---|---|---|---|---|---|
| false | 0 | 22234743 |  | 1 2 3 | 1 | select * from whatever.whatever w where w.whatever in (?,?,?) | false | 20000 | Prepared |
What did we learn here? Put the field whose value is itself a valid JSON object directly in spath's input option to extract from that field. Additionally, Splunk uses {} to denote fields extracted from a JSON array and turns them into a multivalue field.
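For example, a quick way to confirm that the {} fields really are multivalue is mvcount. (A minimal sketch on top of the same search; the n_* field names are only for illustration.)
| spath input=json.msg
| eval n_queries = mvcount('query{}'), n_params = mvcount('params{}{}')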
In your other comment, you said you want the equivalent of `jq '.json.msg|fromjson|.query[0]'`. Such would be trivial from the above result. Add
| eval jq_equivalent = mvindex('query{}', 0)
| fields query{} jq_equivalent
you get
| query{} | jq_equivalent |
|---|---|
| select * from whatever.whatever w where w.whatever in (?,?,?) | select * from whatever.whatever w where w.whatever in (?,?,?) |
What did we learn here? 1. mvindex selects a value from a multivalue field (query{} here), using a 0-based index; 2. use single quotes to dereference the value of a field whose name contains special characters.
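To make point 2 concrete, here is a quick sketch against the extraction above (the second eval only illustrates what goes wrong without the single quotes; the field names on the left are made up):
| eval first_param = mvindex('params{}{}', 0)  ``` single quotes: the value of the field params{}{} ```
| eval not_a_field = mvindex("params{}{}", 0)  ``` double quotes: just the literal string "params{}{}" ```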
A word of caution: if all you want from params{}{} is a single multivalue field, the above can be sufficient. But params[[]] is an array of arrays. To complicate things, your developer doesn't do you the best of service by throwing the query[] array into the same flat structure. As the JSON array query can have more than one element, my speculation is that your developer intended each element of the top-level params array to hold the params for the corresponding element of query[].
What if, instead of
{\"name\":\"\", \"connection\":22234743, \"time\":20000, \"success\":false, \"type\":\"Prepared\", \"batch\":false, \"querySize\":1, \"batchSize\":0, \"query\":[\"select * from whatever.whatever w where w.whatever in (?,?,?) \"], \"params\":[[\"1\",\"2\",\"3\"]]}
your raw data contained json.msg with this value?
"{\"name\":\"\", \"connection\":22234743, \"time\":20000, \"success\":false, \"type\":\"Prepared\", \"batch\":false, \"querySize\":2, \"batchSize\":0, \"query\":[\"select * from whatever.whatever w where w.whatever in (?,?,?) \", \"select * from whatever.whatever2 w where w.whatever2 in (?,?) \"], \"params\":[[\"1\",\"2\",\"3\"],[\"4\",\"5\"]]}"
i.e., query[] and params[] each contain two elements? (For convenience, I assume that querySize represents the number of elements in these arrays. We could live without this external count, but why complicate our lives in a tutorial.) Using the above search, you will find query{} and params{}{} to contain
| querySize | query{} | params{}{} |
|---|---|---|
| 2 | select * from whatever.whatever w where w.whatever in (?,?,?) select * from whatever.whatever2 w where w.whatever2 in (?,?) | 1 2 3 4 5 |
This is one of the shortcomings of flattening structured data like JSON. It is not unique to SPL, but here the shortcoming becomes more obvious. On top of the flattened structure, the spath command also cannot handle arrays of arrays correctly. Now what?
Here is what I would use to get past this barrier. (This is not the only way, but the JSON functions introduced in 8.2 work really well while preserving semantic context.)
| spath input=json.msg
| eval params_array = json_array_to_mv(json_extract('json.msg', "params"))
| eval idx = mvrange(0, querySize) ``` assuming querySize is size of query{} ```
| eval query_params = mvmap(idx, json_object("query", mvindex('query{}', idx), "params", mvindex(params_array, idx)))
| fields - json.msg params* query{} idx
| mvexpand query_params
With this, the output contains
| batch | batchSize | connection | name | querySize | query_params | success | time | type |
|---|---|---|---|---|---|---|---|---|
| false | 0 | 22234743 |  | 2 | {"query":"select * from whatever.whatever w where w.whatever in (?,?,?) ","params":"[\"1\",\"2\",\"3\"]"} | false | 20000 | Prepared |
| false | 0 | 22234743 |  | 2 | {"query":"select * from whatever.whatever2 w where w.whatever2 in (?,?) ","params":"[\"4\",\"5\"]"} | false | 20000 | Prepared |
I think you know where I am going by now. What did we learn here? To compensate for the unfortunate implied semantics your developer forces on you, first construct an intermediary JSON object that binds each query to its array of params. Then use mvexpand to separate the elements. (Admittedly, json_array_to_mv is an oddball function at first glance. But once you understand how Splunk uses multivalue fields, you'll get used to the concept. Hopefully you will also find many merits in a multivalue representation.)
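If json_array_to_mv and mvmap still feel opaque, here is a minimal run-anywhere sketch on toy data showing what each step does (it assumes the 8.2+ JSON functions; the field names j, queries, params_array, and pairs are made up for illustration):
| makeresults
| eval j = "{\"query\":[\"q1\",\"q2\"], \"params\":[[\"1\",\"2\"],[\"3\"]]}"
| eval queries = json_array_to_mv(json_extract(j, "query"))        ``` multivalue: q1, q2 ```
| eval params_array = json_array_to_mv(json_extract(j, "params"))  ``` multivalue of JSON arrays: ["1","2"] and ["3"] ```
| eval idx = mvrange(0, mvcount(queries))
| eval pairs = mvmap(idx, json_object("query", mvindex(queries, idx), "params", mvindex(params_array, idx)))
| mvexpand pairs
| table pairs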
From here, you could use spath again to get the desired results, but I find the JSON functions simpler AND more semantic, considering there are only two keys in this intermediary JSON. Add the following to the main search above:
| eval query = json_extract(query_params, "query")
| eval params = json_array_to_mv(json_extract(query_params, "params"))
With this, you get the final result
| batch | batchSize | connection | name | params | query | querySize | success | time | type |
|---|---|---|---|---|---|---|---|---|---|
| false | 0 | 22234743 |  | 1 2 3 | select * from whatever.whatever w where w.whatever in (?,?,?) | 2 | false | 20000 | Prepared |
| false | 0 | 22234743 |  | 4 5 | select * from whatever.whatever2 w where w.whatever2 in (?,?) | 2 | false | 20000 | Prepared |
Hope this is a useful format for your further processing.
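Since you mentioned reductions/aggregations/deduplications (and, in your other comment, counting and min/max time), a hedged sketch of one possible next step on top of the result above could be as simple as the following; it assumes the query and time fields extracted above and is only meant to show the stats idiom, not your actual aggregation:
| stats count as executions min(time) as min_time max(time) as max_time by query
A plain | dedup query would be the deduplication analogue.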
Below is an emulation of the 2-query mock data above, adapted from @ITWhisperer's original emulation. Play with it and compare with your real data.
| makeresults
| eval _raw="{
\"time\": \"2024-09-19T08:03:02.234663252Z\",
\"json\": {
\"ts\": \"2024-09-19T15:03:02.234462341+07:00\",
\"logger\": \"<anonymized>\",
\"level\": \"WARN\",
\"class\": \"net.ttddyy.dsproxy.support.SLF4JLogUtils\",
\"method\": \"writeLog\",
\"file\": \"<anonymized>\",
\"line\": 26,
\"thread\": \"pool-1-thread-1\",
\"arguments\": {},
\"msg\": \"{\\\"name\\\":\\\"\\\", \\\"connection\\\":22234743, \\\"time\\\":20000, \\\"success\\\":false, \\\"type\\\":\\\"Prepared\\\", \\\"batch\\\":false, \\\"querySize\\\":2, \\\"batchSize\\\":0, \\\"query\\\":[\\\"select * from whatever.whatever w where w.whatever in (?,?,?) \\\", \\\"select * from whatever.whatever2 w where w.whatever2 in (?,?) \\\"], \\\"params\\\":[[\\\"1\\\",\\\"2\\\",\\\"3\\\"],[\\\"4\\\",\\\"5\\\"]]}\",
\"scope\": \"APP\"
},
\"kubernetes\": {
\"pod_name\": \"<anonymized>\",
\"namespace_name\": \"<anonymized>\",
\"labels\": {
\"whatever\": \"whatever\"
},
\"container_image\": \"<anonymized>\"
}
}"
| spath
``` data emulation ```
Hope this helps.
Thanks for your help, I really appreciate the time you put in. I asked in good faith that I can learn this.
Streaming languages, like the jq I mentioned, are harder, yes; functional languages are harder, yes; but all of it is doable. Yes, the JSON shown is tricky, but I happen to know that there will be just one query in there, and if not, I can call one more reduce in jq and I'm good, still a one-liner.
Regarding SPL: the SPL solution is just crazy. I just cannot see what each individual part does and why. And the documentation makes it harder instead of easier.
But I'm probably missing some basic premise of Splunk. OK, simpler sample, simpler task.
| makeresults
| eval _raw="{
\"json\": {
\"class\": \"net.ttddyy.dsproxy.support.SLF4JLogUtils\",
},
}"
Use regex to remove the word ttddyy and return .json.class. Should be exceptionally trivial.
If I run just the thing above, I see the JSON. OK, now a projection:
… | table _raw.json.class
no.
… | table json.class
no.
... | table _raw
yes. OK, so maybe it's not parsed or whatever. spath to the rescue.
… | spath _raw.json.path | table json.path // please consider this as standing for all combinations of subpaths of _raw.json.path, and likewise for table
no! OK(!!), maybe it's the input parameter:
... | spath input=_raw.json path=json.class | table json.class //all cartesian products again.
no.
OK... so maybe we need output for something, I don't know what:
| spath input=_raw path=json.class output=aaaaa | table aaaaa //all cartesian products of subpaths again.
30 minutes passed... It's just a simple projection. No luck, and I didn't even get to the regex, which will be the true struggle (all regex flavors are really easy, that's not the crux of the problem).
ChatGPT thinks this is the solution:
| makeresults
| eval _raw="{\"json\": {\"class\": \"net.ttddyy.dsproxy.support.SLF4JLogUtils\"}}"
| spath input=_raw path=json.class
| table json.class
but it isn't. It does not print anything (it's so confusing that even a trained ChatGPT cannot do a projection, or it just does not work on our heavily paid software).
Can you please explain how a simple projection works in Splunk and what steps the engine really performs internally, since it's a real mystery?
In jq, for comparison:
jq -n '{ "json": { "class": "net.ttddyy.dsproxy.support.SLF4JLogUtils", } } | .json | .class | sub("ttddyy";"goddamm-easy")'
To explain: 1) declare the JSON, 2) take the .json subtree, 3) take the .class subtree, 4) do the replacement. 40 seconds, straightforward.
UPDATE: if I export the data returned for the table and then download it, I get different results in the file than on screen. So I guess that what is shown on screen differs from what the data of the query actually is, which is probably the source of my confusion, since I'm working with something I can see, while Splunk probably works with some data structure I'm not aware of.
{
"preview": false,
"result": {
"json.msg": //:|
...
Splunk SPL works on a pipeline of events, so, in a way, they are streamed from one command to the next:
| makeresults ``` Start an event pipeline with one event ```
``` Set the _raw field to the JSON string ```
| eval _raw="{\"json\": {\"class\": \"net.ttddyy.dsproxy.support.SLF4JLogUtils\"}}"
``` Extract the JSON from the _raw field, specifically the json.class field ```
| spath input=_raw path=json.class
``` Replace value in field and store in new field ```
| eval class=replace('json.class',"ttddyy", "goddamm-easy")
``` Show this as a table (effectively removes all other fields from the events in the pipeline) ```
| table class
Please share the raw event that you are working on, anonymised and in a code block to preserve formatting.
{
"time": "2024-09-19T08:03:02.234663252Z",
"json": {
"ts": "2024-09-19T15:03:02.234462341+07:00",
"logger": "<anonymized>",
"level": "WARN",
"class": "net.ttddyy.dsproxy.support.SLF4JLogUtils",
"method": "writeLog",
"file": "<anonymized>",
"line": 26,
"thread": "pool-1-thread-1",
"arguments": {},
"msg": "{\"name\":\"\", \"connection\":22234743, \"time\":20000, \"success\":false, \"type\":\"Prepared\", \"batch\":false, \"querySize\":1, \"batchSize\":0, \"query\":[\"select * from whatever.whatever w where w.whatever in (?,?,?) \"], \"params\":[[\"1\",\"2\",\"3\"]]}",
"scope": "APP"
},
"kubernetes": {
"pod_name": "<anonymized>",
"namespace_name": "<anonymized>",
"labels": {
"whatever": "whatever"
},
"container_image": "<anonymized>"
}
}
To begin with, I'd like to do the equivalent of `jq '.json.msg|fromjson|.query[0]'`. After that, eventually, do the actual parameter substitutions, deduplication, counting, min/max time, but that's way beyond the scope of this question.
| makeresults
| eval _raw="{
\"time\": \"2024-09-19T08:03:02.234663252Z\",
\"json\": {
\"ts\": \"2024-09-19T15:03:02.234462341+07:00\",
\"logger\": \"<anonymized>\",
\"level\": \"WARN\",
\"class\": \"net.ttddyy.dsproxy.support.SLF4JLogUtils\",
\"method\": \"writeLog\",
\"file\": \"<anonymized>\",
\"line\": 26,
\"thread\": \"pool-1-thread-1\",
\"arguments\": {},
\"msg\": \"{\\\"name\\\":\\\"\\\", \\\"connection\\\":22234743, \\\"time\\\":20000, \\\"success\\\":false, \\\"type\\\":\\\"Prepared\\\", \\\"batch\\\":false, \\\"querySize\\\":1, \\\"batchSize\\\":0, \\\"query\\\":[\\\"select * from whatever.whatever w where w.whatever in (?,?,?) \\\"], \\\"params\\\":[[\\\"1\\\",\\\"2\\\",\\\"3\\\"]]}\",
\"scope\": \"APP\"
},
\"kubernetes\": {
\"pod_name\": \"<anonymized>\",
\"namespace_name\": \"<anonymized>\",
\"labels\": {
\"whatever\": \"whatever\"
},
\"container_image\": \"<anonymized>\"
}
}"
| spath json.msg output=msg
| spath input=msg query{}
Yes, this:
spath input=msg query{}
is probably just some cheating to see some data, but it does not actually work. It is just some overcomplicated syntax to see the whole array, not the individual items; if the array had more items, we wouldn't get the first item. If I need to get the first item of the array to work with further, this just does not work.
As I said above, there is a steep learning curve with SPL's JSON flattening schema. But it is learnable, and the syntax is reasonably logical. (Logical, not intuitive or self-explanatory.)
First, the easiest way to examine each individual array element is with mvexpand. Like
| spath path=json.msg
| spath input=json.msg path=query{}
| mvexpand query{}
| rename query{} as query_single
Here, xxx{} is SPL's explicit notation for an array that has been flattened from a structure; an array is most commonly known in SPL as a multivalue field. You will see a lot of this term in the documentation.
Second, if you only want the first element, simply take the first element using mvindex.
| spath path=json.msg
| spath input=json.msg path=query{}
| eval first_query = mvindex('query{}', 0)
Test these over any of the emulations @ITWhisperer and I supplied above, and compare with your real data.
I have no idea why it printed this. And it's surprising that adding
| spath json.msg output=msg | spath input=msg query{}
in this context yields different results than if I place it after my query in our system. And that the query for some reason also extracts the query[] part of the JSON (in your context, not in mine). Why? Who asked for that? But even then, I still cannot access the 'parsed' field named 'batch'... I think the query is some generic function doing some guess-what-is-important extractions.
I've got an idea which will force us to avoid clever functions doing random data extractions. Can you please show me how I could transform this little JSON sample into the string:
select * from whatever.whatever w where w.whatever in (1,2,3)
Equivalent for this in jq would look like:
jq -r '.json.msg|fromjson|.query[0] as $q| .params[]|reduce .[] as $param ($q; sub("\\?";$param))'
full bash command including input being:
echo '{"time":"2024-09-19T08:03:02.234663252Z","json":{"ts":"2024-09-19T15:03:02.234462341+07:00","logger":"<anonymized>","level":"WARN","class":"net.ttddyy.dsproxy.support.SLF4JLogUtils","method":"writeLog","file":"<anonymized>","line":26,"thread":"pool-1-thread-1","arguments":{},"msg":"{\"name\":\"\", \"connection\":22234743, \"time\":20000, \"success\":false, \"type\":\"Prepared\", \"batch\":false, \"querySize\":1, \"batchSize\":0, \"query\":[\"select * from whatever.whatever w where w.whatever in (?,?,?) \"], \"params\":[[\"1\",\"2\",\"3\"]]}","scope":"APP"},"kubernetes":{"pod_name":"<anonymized>","namespace_name":"<anonymized>","labels":{"whatever":"whatever"},"container_image":"<anonymized>"}}' | jq -r '.json.msg|fromjson|.query[0] as $q| .params[]|reduce .[] as $param ($q; sub("\\?";$param))';