Getting Data In

How to parse this kind of log

Na_Kang_Lim
Explorer

I have this kind of log:

Mar 18 02:32:19 MachineName python3[948]: DEBUG:root:... Dispatching: {'id': '<id>', 'type': 'threat-detection', 'entity': 'threat', 'origin': '<redacted>', 'nature': 'system', 'user': 'system', 'timestamp': '2025-03-17T19:32:17.974Z', 'threat': {'id': '<redacted_uuid>', 'maGuid': '<redacted_guid>', 'detectionDate': '2025-03-17T19:32:17.974Z', 'eventType': 'Threat Detection Summary', 'threatType': 'non-pe-file', 'threatAttrs': {'name': '<filename>.ps1', 'path': 'C:\\Powershell\\Report\\<filename>.ps1', 'md5': '<redacted_hash>', 'sha1': '<redacted_hash>', 'sha256': '<redacted_hash>'}, 'interpreterFileAttrs': {'name': 'powershell.exe', 'path': 'C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe', 'md5': '097CE5761C89434367598B34FE32893B', 'sha1': '044A0CF1F6BC478A7172BF207EEF1E201A18BA02', 'sha256': 'BA4038FD20E474C047BE8AAD5BFACDB1BFC1DDBE12F803F473B7918D8D819436'}, 'severity': 's1', 'rank': '100', 'score': '50', 'detectionTags': ['@ATA.Discovery', '@ATA.Execution', '@ATE.T1083', '@ATE.T1059.001', '@MSI._apt_file_psgetfiles', '@ATA.CommandAndControl', '@ATE.T1102.003', '@MSI._process_PS_public_repos', '@MSI._process_ps_getchilditem', '@ATE.T1105', '@ATE.T1071.001', '@MSI._process_pswebrequest_remotecopy', '@ATA.DefenseEvasion', '@ATE.T1112', '@MSI._reg_ep0029_intranet'], 'contentVersion': None}, 'firstDetected': '2025-03-17T19:32:17.974Z', 'lastDetected': '2025-03-17T19:32:17.974Z', 'tenant-id': '<redacted_tenant_id>', 'transaction-id': '<redacted_transaction_id>'}

I want "Dispatching" to be required text, so that the transform only applies to logs containing this keyword.
I want to parse the JSON part so I can use its fields, like json_data.threatAttrs.name.
Any suggestions? I tried the field extractor UI, but it broke down because it couldn't differentiate between the two "name" fields, since the same field name appears more than once. So I am thinking of using props.conf and transforms.conf, but I don't know how.
Any help would be appreciated!

1 Solution

gargantua
Explorer

Hi,

One option would be to:

1 - Get rid of whatever data comes before the valid JSON.

For the example you posted, we can ask Splunk to delete this:

Mar 18 02:32:19 MachineName python3[948]: DEBUG:root:... Dispatching: 


I'd use this in props.conf:
SEDCMD-removeheader=s/.*DEBUG:root:\.\.\. Dispatching: //g

2 - Replace single quotes with double quotes.
Still in props.conf:
SEDCMD-replace_simple_quotes=s/'/"/g


3 - Set KV_MODE=json so the JSON fields are extracted at search time:

KV_MODE=json


The props.conf could look like this:

[custom_sourcetype]
SHOULD_LINEMERGE=false
LINE_BREAKER=([\r\n]+)
NO_BINARY_CHECK=true
CHARSET=UTF-8
category=Custom
pulldown_type=true
SEDCMD-removeheader=s/.*DEBUG:root:\.\.\. Dispatching: //g
SEDCMD-replace_simple_quotes=s/'/"/g
KV_MODE=json

 

It works in my lab.

Best,
Ch.


PickleRick
SplunkTrust

Unfortunately, your data is of the "ugly" kind - JSON content wrapped in additional non-JSON elements. So you cannot use native JSON parsing.

There is an idea - https://ideas.splunk.com/ideas/EID-I-208 - currently in a "future prospect" state, so we can hope this behaviour will change and it will become possible to manipulate such data easily. But for now you have more or less three ways of handling such data:

1) Strip the non-JSON part so that what's left of the event is a complete, well-formed JSON structure (kinda what @gargantua suggested). Of course, this way you're bound to lose some data.

2) Do manual regex-based extractions. Hacking at structured data with regexes is rarely a good idea and usually ends in tears sooner or later.

3) Use explicit SPL to parse out the JSON part into a field and then throw spath at that field so the JSON gets parsed. Unfortunately, this complicates your search and makes it much worse performance-wise, since you have to parse all events to find the matching ones.
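Option 3 could look something like this (a minimal sketch - the index and sourcetype are placeholders, json_data is an arbitrary field name, and the regex assumes the payload runs to the end of the event):

```
index=your_index sourcetype=your_sourcetype "Dispatching"
| rex field=_raw "Dispatching:\s*(?<json_data>\{.*\})"
| eval json_data = replace(json_data, "'", "\"")
| spath input=json_data
```

The quote replacement here carries the same caveats as the SEDCMD approach - it only works if the original values contain no embedded quotes.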


PickleRick
SplunkTrust

Oh, unless you can make very strong assumptions about your data, you're in for a treat.

1. You will replace any escaped single quotes which might be in the original data (and no, a single-backslash negative lookbehind will not cut it).

2. You will not replace any unescaped double quotes from the original data (and again, finding them and properly escaping them is not so easy in the general case - see point 1).

Long story short - don't manipulate structured data with regexes!
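A quick illustration of point 2, using a made-up event value (a sketch you can paste into a search):

```
| makeresults
| eval json_data = "{'comment': 'say \"hi\"'}"
| eval json_data = replace(json_data, "'", "\"")
| spath input=json_data
```

After the replace, json_data becomes {"comment": "say "hi""}, which is not valid JSON, so spath extracts nothing. Point 1 is the mirror case: {'comment': 'it\'s'} would become {"comment": "it\"s"}, which parses but silently changes the value.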


kiran_panchavat
Influencer

If the JSON is valid, you can proceed with extracting the required fields in Splunk using spath:

| makeresults 
| eval _raw="Mar 18 02:32:19 MachineName python3[948]: DEBUG:root:... Dispatching: {'id': '<id>', 'type': 'threat-detection', 'entity': 'threat', 'origin': '<redacted>', 'nature': 'system', 'user': 'system', 'timestamp': '2025-03-17T19:32:17.974Z', 'threat': {'id': '<redacted_uuid>', 'maGuid': '<redacted_guid>', 'detectionDate': '2025-03-17T19:32:17.974Z', 'eventType': 'Threat Detection Summary', 'threatType': 'non-pe-file', 'threatAttrs': {'name': '<filename>.ps1', 'path': 'C:\\Powershell\\Report\\<filename>.ps1', 'md5': '<redacted_hash>', 'sha1': '<redacted_hash>', 'sha256': '<redacted_hash>'}, 'interpreterFileAttrs': {'name': 'powershell.exe', 'path': 'C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe', 'md5': '097CE5761C89434367598B34FE32893B', 'sha1': '044A0CF1F6BC478A7172BF207EEF1E201A18BA02', 'sha256': 'BA4038FD20E474C047BE8AAD5BFACDB1BFC1DDBE12F803F473B7918D8D819436'}, 'severity': 's1', 'rank': '100', 'score': '50', 'detectionTags': ['@ATA.Discovery', '@ATA.Execution'], 'contentVersion': null}, 'firstDetected': '2025-03-17T19:32:17.974Z', 'lastDetected': '2025-03-17T19:32:17.974Z', 'tenant-id': '<redacted_tenant_id>', 'transaction-id': '<redacted_transaction_id>'}"
| rex field=_raw "Dispatching:\s*(?<json_data>{.*})"
| eval json_data = replace(json_data, "'", "\"")
| eval json_data = replace(json_data, "\\\\", "\\\\\\\\")
| spath input=json_data path=threat.threatAttrs.name output=threat_filename
| spath input=json_data path=threat.threatAttrs.path output=threat_filepath
| spath input=json_data path=threat.severity output=threat_severity
| spath input=json_data path=threat.score output=threat_score
| table threat_filename, threat_filepath, threat_severity, threat_score


@Na_Kang_Lim

I hope this helps, if any reply helps you, you could add your upvote/karma points to that reply, thanks.

Na_Kang_Lim
Explorer

Hi @kiran_panchavat ,
Your solution works for extracting the data, but can this be scaled up by using props.conf and transforms.conf? With this approach, if I want to extract all the fields, I will need one search line per field, which may work, but gets really long.

kiran_panchavat
Influencer

@Na_Kang_Lim 

Yes, you can definitely use props.conf and transforms.conf to scale this more broadly and make your field extractions more manageable.
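For example (a sketch under assumed names - the sourcetype and stanza names are placeholders, and the regexes would need testing against your real data), you could route only events containing "Dispatching" to a dedicated sourcetype at index time, then apply the cleanup and JSON extraction to that sourcetype:

```
# props.conf - on the original sourcetype
[python_syslog]
TRANSFORMS-route_dispatch = route_dispatching_events

# transforms.conf - retag only matching events
[route_dispatching_events]
REGEX = Dispatching:
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::dispatching_json

# props.conf - cleanup + extraction on the new sourcetype
[dispatching_json]
SEDCMD-removeheader = s/.*DEBUG:root:\.\.\. Dispatching: //g
SEDCMD-replace_simple_quotes = s/'/"/g
KV_MODE = json
```

This way all fields are extracted automatically at search time, with no per-field rex/spath lines, at the cost of the quote-replacement caveats already discussed in this thread.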

I hope this helps, if any reply helps you, you could add your upvote/karma points to that reply, thanks.

Na_Kang_Lim
Explorer

Do you know how to do that? I just know that I can; I don't know how.


kiran_panchavat
Influencer

@Na_Kang_Lim 

First, check whether json_data is correctly extracted:


 

| makeresults 
| eval _raw="Mar 18 02:32:19 MachineName python3[948]: DEBUG:root:... Dispatching: {'id': '<id>', 'type': 'threat-detection', 'entity': 'threat', 'origin': '<redacted>', 'nature': 'system', 'user': 'system', 'timestamp': '2025-03-17T19:32:17.974Z', 'threat': {'id': '<redacted_uuid>', 'maGuid': '<redacted_guid>', 'detectionDate': '2025-03-17T19:32:17.974Z', 'eventType': 'Threat Detection Summary', 'threatType': 'non-pe-file', 'threatAttrs': {'name': '<filename>.ps1', 'path': 'C:\\Powershell\\Report\\<filename>.ps1', 'md5': '<redacted_hash>', 'sha1': '<redacted_hash>', 'sha256': '<redacted_hash>'}, 'interpreterFileAttrs': {'name': 'powershell.exe', 'path': 'C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe', 'md5': '097CE5761C89434367598B34FE32893B', 'sha1': '044A0CF1F6BC478A7172BF207EEF1E201A18BA02', 'sha256': 'BA4038FD20E474C047BE8AAD5BFACDB1BFC1DDBE12F803F473B7918D8D819436'}, 'severity': 's1', 'rank': '100', 'score': '50', 'detectionTags': ['@ATA.Discovery', '@ATA.Execution'], 'contentVersion': null}, 'firstDetected': '2025-03-17T19:32:17.974Z', 'lastDetected': '2025-03-17T19:32:17.974Z', 'tenant-id': '<redacted_tenant_id>', 'transaction-id': '<redacted_transaction_id>'}"
| rex field=_raw "Dispatching:\s*(?<json_data>{.*})"
| eval json_data = replace(json_data, "'", "\"")
| eval json_data = replace(json_data, "\\\\", "\\\\\\\\")
| eval json_data = replace(json_data, "'null'", "null")
| table json_data


Output:- 

 

{
  "id": "<id>",
  "type": "threat-detection",
  "entity": "threat",
  "origin": "<redacted>",
  "nature": "system",
  "user": "system",
  "timestamp": "2025-03-17T19:32:17.974Z",
  "threat": {
    "id": "<redacted_uuid>",
    "maGuid": "<redacted_guid>",
    "detectionDate": "2025-03-17T19:32:17.974Z",
    "eventType": "Threat Detection Summary",
    "threatType": "non-pe-file",
    "threatAttrs": {
      "name": "<filename>.ps1",
      "path": "C:\\Powershell\\Report\\<filename>.ps1",
      "md5": "<redacted_hash>",
      "sha1": "<redacted_hash>",
      "sha256": "<redacted_hash>"
    },
    "interpreterFileAttrs": {
      "name": "powershell.exe",
      "path": "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe",
      "md5": "097CE5761C89434367598B34FE32893B",
      "sha1": "044A0CF1F6BC478A7172BF207EEF1E201A18BA02",
      "sha256": "BA4038FD20E474C047BE8AAD5BFACDB1BFC1DDBE12F803F473B7918D8D819436"
    },
    "severity": "s1",
    "rank": "100",
    "score": "50",
    "detectionTags": [
      "@ATA.Discovery",
      "@ATA.Execution"
    ],
    "contentVersion": null
  },
  "firstDetected": "2025-03-17T19:32:17.974Z",
  "lastDetected": "2025-03-17T19:32:17.974Z",
  "tenant-id": "<redacted_tenant_id>",
  "transaction-id": "<redacted_transaction_id>"
}

 

I hope this helps, if any reply helps you, you could add your upvote/karma points to that reply, thanks.