Hello all,
Currently we have the following event, which contains both JSON and non-JSON data. Please help me remove the non-JSON part, and tell me where I need to set INDEXED_EXTRACTIONS or KV_MODE so that all the JSON fields are extracted automatically.
Nov 9 17:34:28 128.160.82.28 [local0.warning] <132>1 2024-11-09T17:34:28.436542Z AviVantage v-epswafhic2-wdc.hc.cloud.uk.hc-443 NILVALUE NILVALUE - {"adf":true,"significant":0,"udf":false,"virtualservice":"virtualservice-4583863f-48a3-42b9-8115-252a7fb487f5","report_timestamp":"2024-11-09T17:34:28.436542Z","service_engine":"GB-DRN-AB-Tier2-se-vxeuz","vcpu_id":0,"log_id":10181,"client_ip":"128.12.73.92","client_src_port":44908,"client_dest_port":443,"client_rtt":1,"http_version":"1.1","method":"HEAD","uri_path":"/path/to/monitor/page/","host":"udg1704n01.hc.cloud.uk.hc","response_content_type":"text/html","request_length":93,"response_length":94,"response_code":400,"response_time_first_byte":1,"response_time_last_byte":1,"compression_percentage":0,"compression":"","client_insights":"","request_headers":3,"response_headers":12,"request_state":"AVI_HTTP_REQUEST_STATE_READ_CLIENT_REQ_HDR","significant_log":["ADF_HTTP_BAD_REQUEST_PLAIN_HTTP_REQUEST_SENT_ON_HTTPS_PORT","ADF_RESPONSE_CODE_4XX"],"vs_ip":"128.160.71.14","request_id":"61e-RDl6-OZgZ","max_ingress_latency_fe":0,"avg_ingress_latency_fe":0,"conn_est_time_fe":1,"source_ip":"128.12.73.92","vs_name":"v-epswafhic2-wdc.hc.cloud.uk.hc-443","tenant_name":"admin"}
And where do I need to apply these configurations?
We have syslog servers with a Universal Forwarder (UF) installed that send the data, and a deployment server (DS) that pushes apps to the cluster manager and the deployer; from there the configurations are pushed out to the indexers and search heads.
As of now we have a props.conf on the cluster manager, which is pushed to the indexers.
Hi all, I have added the below stanza to props.conf and pushed it to the indexers. The JSON fields are being extracted, but the logs are getting duplicated. Please help me.
Verify in splunkd.log whether your Universal Forwarder (UF) or Heavy Forwarder (HF) is sending duplicate events.
Check inputs.conf and make sure crcSalt = <SOURCE> is set to avoid duplicate ingestion.
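For reference, this is the kind of monitor stanza to review on the UF; a minimal sketch only, where the path, index and sourcetype names are assumptions to adjust to your environment:

[monitor:///var/log/avi/waf.log]
sourcetype = sony_waf
index = waf
# <SOURCE> adds the full file path into the CRC calculation, so files with
# identical beginnings are tracked as separate files
crcSalt = <SOURCE>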
Please check this solution.
Solved: Re: Why would INDEXED_EXTRACTIONS=JSON in props.co... - Splunk Community
Hi @kiran_panchavat, can you please guide me on where to add your stanza? Indexers or search heads?
@splunklearner Yes, KV_MODE is for search time field extractions.
KV_MODE = [none|auto|auto_escaped|multi|multi:<multikv.conf_stanza_name>|json|xml]
* Used for search-time field extractions only.
* Specifies the field/value extraction mode for the data.
* Set KV_MODE to one of the following:
  * none - Disables field extraction for the host, source, or source type.
  * auto - Extracts field/value pairs separated by equal signs.
  * auto_escaped - Extracts field/value pairs separated by equal signs and honors \" and \\ as escaped sequences within quoted values. For example: field="value with \"nested\" quotes"
  * multi - Invokes the 'multikv' search command, which extracts fields from table-formatted events.
  * multi:<multikv.conf_stanza_name> - Invokes a custom multikv.conf configuration to extract fields from a specific type of table-formatted event. Use this option in situations where the default behavior of the 'multikv' search command is not meeting your needs.
  * xml - Automatically extracts fields from XML data.
  * json - Automatically extracts fields from JSON data.
* Setting to 'none' can ensure that one or more custom field extractions are not overridden by automatic field/value extraction for a particular host, source, or source type. You can also use 'none' to increase search performance by disabling extraction for common but nonessential fields.
* The 'xml' and 'json' modes do not extract any fields when used on data that isn't of the correct format (JSON or XML).
* If you set 'KV_MODE = json' for a source type, do not also set 'INDEXED_EXTRACTIONS = JSON' for the same source type. This causes the Splunk software to extract the JSON fields twice: once at index time and again at search time.
* When KV_MODE is set to 'auto' or 'auto_escaped', automatic JSON field extraction can take place alongside other automatic field/value extractions. To disable JSON field extraction when 'KV_MODE' is set to 'auto' or 'auto_escaped', add 'AUTO_KV_JSON = false' to the stanza.
* Default: auto
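As a minimal sketch of the search-time-only approach (assuming the [sony_waf] sourcetype name used later in this thread), the props.conf that would go to the search heads is just:

[sony_waf]
# search-time JSON extraction; do not combine with INDEXED_EXTRACTIONS = JSON
KV_MODE = json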
So, should I put this stanza on the Deployer or on the Cluster Manager?
Hi @splunklearner
To have this processed at ingest time you can do a simple INGEST_EVAL on your indexers.
== props.conf ==
[yourStanzaName]
TRANSFORMS = stripNonJSON
== transforms.conf ==
[stripNonJSON]
INGEST_EVAL = _raw:=replace(_raw, ".*- ({.*})", "\1")
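If you want to sanity-check the replace() before deploying, the same expression can be tested at search time against a trimmed copy of your sample event (just a sketch; the event below is shortened):

| makeresults
| eval _raw="Nov 9 17:34:28 128.160.82.28 [local0.warning] <132>1 2024-11-09T17:34:28.436542Z AviVantage v-epswafhic2-wdc.hc.cloud.uk.hc-443 NILVALUE NILVALUE - {\"adf\":true,\"client_ip\":\"128.12.73.92\",\"response_code\":400}"
| eval stripped=replace(_raw, ".*- ({.*})", "\1")
| table _raw stripped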
Please let me know how you get on and consider upvoting/karma this answer if it has helped.
Regards
Will
@splunklearner
If you go down the ingest-time approach then you will add the props.conf/transforms.conf within an app in your manager-apps folder on your Cluster Manager and then push it out to your indexers.
No changes should be required on your search heads if you go down that route, but feel free to evaluate the alternatives provided in this post too.
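For illustration, the layout on the Cluster Manager would look something like this (the app name here is just an example):

$SPLUNK_HOME/etc/manager-apps/avi_waf_parsing/local/props.conf
$SPLUNK_HOME/etc/manager-apps/avi_waf_parsing/local/transforms.conf

and the bundle is then pushed to the indexer peers from the Cluster Manager:

$SPLUNK_HOME/bin/splunk apply cluster-bundle --answer-yes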
I hope this helps.
Please let me know how you get on and consider upvoting/karma this answer if it has helped.
Regards
Will
Hi @livehybrid ,
I heard that search-time extractions are better than index-time extractions due to performance issues? Is that so? Please clarify.
Hi @splunklearner ,
I guess the answer really is "it depends"; however, in this scenario we are overwriting the original data with just the JSON, rather than adding an additional extracted field.
Search-time field extractions/evals/changes are executed every time you search the data, and in some cases they need to be evaluated before the search can be filtered down. For example, if you search for "uri=/test", you may find that at search time Splunk has to process every event to determine the uri field before it can filter the results. Being able to search against the URI without having to modify every event means it should be faster.
The disadvantage of index-time extractions is that they don't apply retrospectively to data you already have, whereas search-time extractions apply to everything currently indexed.
@splunklearner I have a standalone server, so you can try these settings on your heavy forwarder or indexers.
I don't have access to the UI; I need to do it from the backend only. Where can I put this props.conf? On the cluster manager or the deployer? Is this an index-time extraction or a search-time one?
@splunklearner I tried this using your sample data; please have a look.
[syslogtest]
SHOULD_LINEMERGE=false
LINE_BREAKER=([\r\n]+)
NO_BINARY_CHECK=true
CHARSET=UTF-8
category=Custom
pulldown_type=true
SEDCMD-removeheader=s/^[^\{]*//g
KV_MODE=json
AUTO_KV_JSON=true
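Once the props are in place, you can sanity-check the extractions with a quick search using field names taken from your sample event (the index name is a placeholder):

index=your_index sourcetype=syslogtest
| head 10
| table uri_path response_code client_ip vs_name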
Hi @kiran_panchavat ,
Thanks for the answer.
But I read that KV_MODE = json is a search-time extraction, i.e. it should go on the search heads... but you are saying to put it on the indexers or heavy forwarders. Will that help? Please clarify.
Hi @kiran_panchavat ,
This is what is present in my current props.conf on the Cluster Manager for this sourcetype (it was copied from another sourcetype):
[sony_waf]
TIME_PREFIX = ^
MAX_TIMESTAMP_LOOKAHEAD = 25
TIME_FORMAT = %b %d %H:%M:%S
SEDCMD-newline_remove = s/\\r\\n/\n/g
SEDCMD-formatxml =s/></>\n</g
LINE_BREAKER = ([\r\n]+)[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s
SHOULD_LINEMERGE = False
TRUNCATE = 10000
Now, do I need to add your settings to this props.conf and push it to the indexers? Or should I create a new props.conf on the Deployer that includes your stanza and push it to the search heads?
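For what it's worth, based on the answers above, one possible split (a rough sketch only, using @kiran_panchavat's SEDCMD approach and assuming the sourcetype name stays [sony_waf]) would be:

== Cluster Manager, manager-apps, pushed to the indexers ==
[sony_waf]
# existing index-time settings (LINE_BREAKER, TIME_*, TRUNCATE, ...) stay here
# strip everything before the first { so only the JSON is indexed
SEDCMD-removeheader = s/^[^\{]*//g

== Deployer, shcluster/apps, pushed to the search heads ==
[sony_waf]
KV_MODE = json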