Splunk Search

With regular expression how to auto extract JSON elements?

Communicator

I've created a Field Transform that attempts to extract all JSON key-value pairs, via the following regex:

(?:\"|\')([^"]*)(?:\"|\')(?=:)(?::\s)(?:\")?(true|false|[0-9a-zA-Z()\@:\,\/!+-.\$\ \\']*)(?:\")?

It extracts ALL JSON key-value pairs, except for arrays.

I'm okay not capturing arrays for now.

The problem is the one-size-fits-all approach of this regex: I need to include commas within the value match for some of our error logging, but that results in a trailing comma being captured after non-quoted numerical fields, as shown here:

[Screenshot: regex101 capture matches]

Without the \, in the second capture group, I can't get the entire 'About' message, which includes a comma.

With it, I pick up the commas on non-quoted numerical fields.

I haven't given up, but I thought I'd crowdsource an answer if possible; I'm a couple of hours deep into this now, and maybe someone knows what's missing.

Note: We can't use the KV_Mode Json Auto-extractions because JSON data is embedded within other log data in unexpected places, so this is a simple "catch all" match we apply to a handful of sourcetypes.

Thanks for your help! Feel free to head to the URL in that image to play with the expression directly.


Communicator

The solution is this Regular Expression:

(?:\"|\')([^"]*)(?:\"|\')(?=:)(?:\:\s*)(?:\")?(true|false|[-0-9]+[\.]*[\d]*(?=,)|[0-9a-zA-Z\(\)\@\:\,\/\!\+\-\.\$\ \\\']*)(?:\")?
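As a sanity check outside Splunk, here's a minimal Python sketch that applies the same expression to a log line with embedded JSON (the log line and field names are made up for illustration):

```python
import re

# The accepted expression from this thread, verbatim
JSON_KV = re.compile(
    r"""(?:\"|\')([^"]*)(?:\"|\')(?=:)(?:\:\s*)(?:\")?"""
    r"""(true|false|[-0-9]+[\.]*[\d]*(?=,)|"""
    r"""[0-9a-zA-Z\(\)\@\:\,\/\!\+\-\.\$\ \\\']*)(?:\")?"""
)

# JSON embedded mid-event, as described in the question
log = ('2018-08-07 12:00:01 ERROR MyApp - payload='
       '{"code": 500, "msg": "Timeout, retrying", "fatal": false}')

for key, value in JSON_KV.findall(log):
    print(key, "=", value)
# → code = 500
# → msg = Timeout, retrying
# → fatal = false
```

Note how the `[-0-9]+[\.]*[\d]*(?=,)` alternative stops the numeric match before the trailing comma, while the final alternative still allows commas inside quoted string values.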


Path Finder

Doing this at search time is pretty difficult with only regex available to you. Doing this at ingestion time is a better approach. Can you use SEDCMD in transforms to clean up the data to extract just the JSON?
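A sketch of that idea (the sourcetype name is hypothetical, and note that SEDCMD is actually a props.conf setting; this assumes each event carries a single {...} block):

```ini
[my_embedded_json_sourcetype]
# Hypothetical: at ingestion time, keep only the {...} block,
# discarding the surrounding log preamble and trailer
SEDCMD-strip_to_json = s/^[^{]*(\{.*\})[^{}]*$/\1/
# With the events reduced to clean JSON, search-time
# auto-extraction becomes possible
KV_MODE = json
```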

This is a pretty common use case for a product we are building that helps you work with data in Splunk at ingestion time. We could easily extract the JSON out of the log, parse it, and emit a new event with just that data, or transform the event to be just the JSON. We'd love to talk to you about your use case. If you're interested, you can fill out the form at https://www.cribl.io/.


Path Finder

RegEx is powerful but limited. Can't really comment on what "should be" doable in Splunk itself, only what is. Your other options at Search Time without third party products would be to build a custom search command. This could probably be done in 10 lines of python. The use case you're describing really requires being able to programmatically process the data.
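For what it's worth, the extraction core of such a command really can be small. Here's a hedged sketch (just the parsing logic, not a full Splunk custom search command; the brace matching is naive and will mis-count braces that appear inside quoted strings, in which case json.loads simply rejects that candidate):

```python
import json

def extract_embedded_json(event):
    """Pull every parseable {...} block out of a raw event string."""
    results = []
    i = 0
    while i < len(event):
        if event[i] != '{':
            i += 1
            continue
        # Find the matching closing brace by depth counting
        depth = 0
        j = i
        while j < len(event):
            if event[j] == '{':
                depth += 1
            elif event[j] == '}':
                depth -= 1
                if depth == 0:
                    break
            j += 1
        if depth == 0:
            try:
                results.append(json.loads(event[i:j + 1]))
                i = j + 1   # skip past the parsed block
                continue
            except ValueError:
                pass        # not valid JSON; keep scanning
        i += 1
    return results

# Example (made-up event):
# extract_embedded_json('ERROR at step 3: {"a": 1, "b": "x,y"} done')
# → [{'a': 1, 'b': 'x,y'}]
```

Unlike the regex approach, this handles nesting and arrays for free, since the real JSON parser does the work once the candidate span is found.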


Communicator

One of the things I love most about Splunk is its ability to give you access to hard-to-reach data.

It's not always possible to pre-process log files; that's the whole point of digging into "Machine Data", which Splunk touts so readily as its focus...

In a perfect world, Application Developers would work alongside Data Architects and Report Developers to produce meaningful events from the start. That's not how software usually gets built in my experience, so we're often left picking up the pieces of what's leftover; in this case, embedded JSON within other events.

I liked your suggestion of stripping out the JSON with a SEDCMD and then passing that field into the auto parser; I'll explore it, but for now we have the data in the system as it is, and I didn't configure it, though I'm expected to work with it.

A rock and a hard place is where successful IT people flourish.


Path Finder

I agree with you wholeheartedly. A few friends of mine in the industry are pushing the idea that everything should be an event, encoded in JSON, very wide, with lots of high-cardinality fields. We find that most of the time the person who needs to consume the logs and the person who writes them are different, so there's a need to sit in the middle and make the logs smarter through transformation and enrichment. You're confirming the need for such a thing. Perhaps Splunk should provide this out of the box, but after years of trying to convince them to do that from the inside, we decided to build it on the outside :).


Communicator

You've nailed it right on the head --- the people writing logs aren't the people reading logs, and it takes coordination from the beginning to make it happen.

Splunk's toolkit gives us, I believe, most of what we need to get at the data, but I don't believe Splunk is positioning itself in the market correctly to tackle the problem it so elegantly solves, which is big-data reporting...

As soon as you start dealing with micro-services and decoupled architecture, all sorts of new problems come into play that Splunk handles really well. Industry isn't quite there yet, as most companies are still only beginning their big-data journey, but I'm hoping we see a shift in Splunk's focus toward the enterprise reporting space, which desperately lacks the functionality we expected in the RDBMS days and which goes away once you start working with cluster storage.


Communicator

@coccyx it's working; I suppose it would make better use of the built-in JSON extraction to pre-clean the data source and output the JSON payload as its own field, but that's a whole other difficult task in itself when dealing with multiple sourcetypes that all bury JSON (or in some cases XML) in various fields...

I have automatic XML extraction working flawlessly. JSON should be equally achievable without third party products.

Motivator

Hey @dijikul,

You can try using KV_MODE=json in your props.conf, which will enable search-time extraction.

KV_MODE = [none|auto|auto_escaped|multi|json|xml]
* Used for search-time field extractions only.
* Specifies the field/value extraction mode for the data.
* Set KV_MODE to one of the following:
* none: if you want no field/value extraction to take place.
* auto: extracts field/value pairs separated by equal signs.
* auto_escaped: extracts fields/value pairs separated by equal signs and
honors \" and \ as escaped sequences within quoted
values, e.g field="value with \"nested\" quotes"
* multi: invokes the multikv search command to expand a tabular event into
multiple events.
* xml : automatically extracts fields from XML data.
* json: automatically extracts fields from JSON data.
* Setting to 'none' can ensure that one or more user-created regexes are not
overridden by automatic field/value extraction for a particular host,
source, or source type, and also increases search performance.
* Defaults to auto.
* The 'xml' and 'json' modes will not extract any fields when used on data
that isn't of the correct format (JSON or XML).
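For example, assuming a hypothetical sourcetype whose events are pure JSON, the props.conf stanza would be just:

```ini
[my_json_sourcetype]
KV_MODE = json
```

(As the spec above notes, this only extracts fields when the event actually is well-formed JSON.)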

Let me know if this helps!!


Communicator

hey @deepashri_123,

You may note in my original post, I state:

"Note: We can't use the KV_Mode Json Auto-extractions because JSON data is embedded within other log data in unexpected places".

We have JSON embedded in Log4j and IIS data; there is no clean JSON-only event to match. The approach I've found resolves this issue, albeit with a little bit of search-time overhead, but it's WAY cleaner and WAY more automatic than what Splunk provides out of the box with the KV_MODE matching.

Using KV_MODE as you've proposed isn't helpful in our scenario.


Communicator

I think I got it in this version:

[regex101 link: Version 5]

Time to target those arrays...


Motivator

I appreciate your efforts! Can you provide some sample data so we can assist you further?

Also, as a side note, did you try the props.conf setting INDEXED_EXTRACTIONS = JSON, as described here?


Communicator

@nittala_sirya,

Sample data has been provided in the regex101 mockup in the screenshot - I'm not permitted to paste links, so you'll need to type the URL seen in the screenshots to pull up both my regular expression and the sample data shown in the example.

I've resolved the issue, but what doesn't work is using INDEXED_EXTRACTIONS and/or KV_MODE=JSON, as I tried to make clear in the post, because the events aren't purely JSON data; there's metadata before / after within the event that prevents Splunk's built-in automatic extraction methods from working.

My approach doesn't capture arrays yet, but those aren't required for my use case; what matters is that once I applied my custom field transform to a sourcetype, every single JSON key-value pair is extracted, regardless of where it sits within the event.

I don't understand why this isn't something offered out of the box.


Motivator

Glad you've resolved it. Please post your solution in the "Answers" section to help others with a similar issue.


Communicator

I'm new here. How do I tag my reply to my question as the correct response?

...or are you asking me to RE-post this as an Answers article?


Motivator

Hover over your comment (the one you feel is the solution) and you will see a little gear wheel labeled More. Clicking it will give you the option to Convert to answer, which moves your comment to an answer. Additionally, you can accept the answer for your question. More info here.


Communicator

It's not giving me that option


Motivator

Ah, bummer. You'll have to answer your own question with the regex that worked for you.

As a side note: I'm not sure if you've already seen this, but please take a look at the spath Splunk command. For starters, you can just append | spath at the end of your search query. More info: http://docs.splunk.com/Documentation/Splunk/7.1.2/SearchReference/Spath
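For the embedded-JSON case discussed in this thread, spath can be combined with rex to carve the payload out first (the sourcetype and field names here are hypothetical):

```spl
sourcetype=my_app_logs
| rex field=_raw "(?<json_payload>\{.+\})"
| spath input=json_payload
```

rex pulls the brace-delimited span into json_payload, and spath then expands it into fields without needing KV_MODE set on the sourcetype.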


Communicator

spath's great for users familiar with SPL, and I've made great use of it, but I'm trying to get extractions for business users so they don't have to code.

Thanks for the tip!


Communicator

Feels like there's a bug in the forums. I can convert a reply to an answer, but only the most recent reply and only immediately after posting:

[Screenshot: "Why won't you let me mark other posts as answers"]
