Getting Data In

How to parse this kind of log

Na_Kang_Lim
Explorer

I have this kind of log:

Mar 18 02:32:19 MachineName python3[948]: DEBUG:root:... Dispatching: {'id': '<id>', 'type': 'threat-detection', 'entity': 'threat', 'origin': '<redacted>', 'nature': 'system', 'user': 'system', 'timestamp': '2025-03-17T19:32:17.974Z', 'threat': {'id': '<redacted_uuid>', 'maGuid': '<redacted_guid>', 'detectionDate': '2025-03-17T19:32:17.974Z', 'eventType': 'Threat Detection Summary', 'threatType': 'non-pe-file', 'threatAttrs': {'name': '<filename>.ps1', 'path': 'C:\\Powershell\\Report\\<filename>.ps1', 'md5': '<redacted_hash>', 'sha1': '<redacted_hash>', 'sha256': '<redacted_hash>'}, 'interpreterFileAttrs': {'name': 'powershell.exe', 'path': 'C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe', 'md5': '097CE5761C89434367598B34FE32893B', 'sha1': '044A0CF1F6BC478A7172BF207EEF1E201A18BA02', 'sha256': 'BA4038FD20E474C047BE8AAD5BFACDB1BFC1DDBE12F803F473B7918D8D819436'}, 'severity': 's1', 'rank': '100', 'score': '50', 'detectionTags': ['@ATA.Discovery', '@ATA.Execution', '@ATE.T1083', '@ATE.T1059.001', '@MSI._apt_file_psgetfiles', '@ATA.CommandAndControl', '@ATE.T1102.003', '@MSI._process_PS_public_repos', '@MSI._process_ps_getchilditem', '@ATE.T1105', '@ATE.T1071.001', '@MSI._process_pswebrequest_remotecopy', '@ATA.DefenseEvasion', '@ATE.T1112', '@MSI._reg_ep0029_intranet'], 'contentVersion': None}, 'firstDetected': '2025-03-17T19:32:17.974Z', 'lastDetected': '2025-03-17T19:32:17.974Z', 'tenant-id': '<redacted_tenant_id>', 'transaction-id': '<redacted_transaction_id>'}

I want "Dispatching" to be required text, so that the transform only applies to logs containing this keyword.
I want to parse the JSON part so I can use its fields, like json_data.threatAttrs.name.
Any suggestions? I tried the field extractor UI, but it broke down because it couldn't differentiate between the two "name" fields, since the same field name appears more than once. So I am thinking of using props.conf and transforms.conf, but I don't know how.
Any help would be appreciated!

1 Solution

gargantua
Explorer

Hi,

One option would be to:

1 - Get rid of whatever data comes before the valid JSON.

For the example you posted, we can ask Splunk to delete this:

Mar 18 02:32:19 MachineName python3[948]: DEBUG:root:... Dispatching: 


I'd use this in props.conf:
SEDCMD-removeheader=s/.*DEBUG:root:\.\.\. Dispatching: //g

2 - Replace single quotes with double quotes.
Still in props.conf:
SEDCMD-replace_simple_quotes=s/'/"/g


3 - Set KV_MODE=json so the JSON fields are extracted at search time:

KV_MODE=json


The props.conf could look like this:

[custom_sourcetype]
SHOULD_LINEMERGE=false
LINE_BREAKER=([\r\n]+)
NO_BINARY_CHECK=true
CHARSET=UTF-8
category=Custom
pulldown_type=true
SEDCMD-removeheader=s/.*DEBUG:root:\.\.\. Dispatching: //g
SEDCMD-replace_simple_quotes=s/'/"/g
KV_MODE=json

 

It works in my lab.

Best,
Ch.


PickleRick
SplunkTrust

Unfortunately, your data is of the "ugly" kind - JSON content wrapped in additional non-JSON elements. So you cannot use native JSON parsing.

There is an idea - https://ideas.splunk.com/ideas/EID-I-208 - currently in a "future prospect" state, so we can hope this behaviour will change and it will become possible to manipulate such data easily. But for now you have more or less three ways of handling such data:

1) Strip the non-JSON part so that what's left of the event is a complete, well-formed JSON structure (kinda what @gargantua suggested). Of course, this way you're bound to lose some data.

2) Do manual regex-based extractions. Hacking at structured data with regexes is rarely a good idea and usually ends in tears sooner or later.

3) Use explicit SPL to parse out the JSON part into a field and then throw spath at that field so the JSON gets parsed. Unfortunately, this complicates your search and makes it much worse performance-wise, since you have to parse all events to find the matching ones.
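Option 3 could look something like this (a minimal sketch - the index and sourcetype are placeholders, json_data is an arbitrary field name, and the regex assumes the payload runs to the end of the event):

```
index=your_index sourcetype=your_sourcetype "Dispatching"
| rex field=_raw "Dispatching:\s*(?<json_data>\{.*\})"
| eval json_data = replace(json_data, "'", "\"")
| spath input=json_data
```

The quote replacement here carries the same caveats as the SEDCMD approach - it only works if the original values contain no embedded quotes.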


PickleRick
SplunkTrust

Oh, unless you can make very strong assumptions about your data, you're in for a treat.

1. You will replace any escaped single quotes which might be in the original data (and no, a single-backslash negative lookbehind will not cut it).

2. You will not replace any unescaped double quotes from the original data (and again, finding them and properly escaping them is not so easy in the general case - see point 1).

Long story short - don't manipulate structured data with regexes!
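A quick illustration of point 2, using a made-up event value (a sketch you can paste into a search):

```
| makeresults
| eval json_data = "{'comment': 'say \"hi\"'}"
| eval json_data = replace(json_data, "'", "\"")
| spath input=json_data
```

After the replace, json_data becomes {"comment": "say "hi""}, which is not valid JSON, so spath extracts nothing. Point 1 is the mirror case: {'comment': 'it\'s'} would become {"comment": "it\"s"}, which parses but silently changes the value.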


kiran_panchavat
Influencer

If the JSON is valid, you can proceed with extracting the required fields in Splunk using spath:

| makeresults 
| eval _raw="Mar 18 02:32:19 MachineName python3[948]: DEBUG:root:... Dispatching: {'id': '<id>', 'type': 'threat-detection', 'entity': 'threat', 'origin': '<redacted>', 'nature': 'system', 'user': 'system', 'timestamp': '2025-03-17T19:32:17.974Z', 'threat': {'id': '<redacted_uuid>', 'maGuid': '<redacted_guid>', 'detectionDate': '2025-03-17T19:32:17.974Z', 'eventType': 'Threat Detection Summary', 'threatType': 'non-pe-file', 'threatAttrs': {'name': '<filename>.ps1', 'path': 'C:\\Powershell\\Report\\<filename>.ps1', 'md5': '<redacted_hash>', 'sha1': '<redacted_hash>', 'sha256': '<redacted_hash>'}, 'interpreterFileAttrs': {'name': 'powershell.exe', 'path': 'C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe', 'md5': '097CE5761C89434367598B34FE32893B', 'sha1': '044A0CF1F6BC478A7172BF207EEF1E201A18BA02', 'sha256': 'BA4038FD20E474C047BE8AAD5BFACDB1BFC1DDBE12F803F473B7918D8D819436'}, 'severity': 's1', 'rank': '100', 'score': '50', 'detectionTags': ['@ATA.Discovery', '@ATA.Execution'], 'contentVersion': null}, 'firstDetected': '2025-03-17T19:32:17.974Z', 'lastDetected': '2025-03-17T19:32:17.974Z', 'tenant-id': '<redacted_tenant_id>', 'transaction-id': '<redacted_transaction_id>'}"
| rex field=_raw "Dispatching:\s*(?<json_data>{.*})"
| eval json_data = replace(json_data, "'", "\"")
| eval json_data = replace(json_data, "\\\\", "\\\\\\\\")
| spath input=json_data path=threat.threatAttrs.name output=threat_filename
| spath input=json_data path=threat.threatAttrs.path output=threat_filepath
| spath input=json_data path=threat.severity output=threat_severity
| spath input=json_data path=threat.score output=threat_score
| table threat_filename, threat_filepath, threat_severity, threat_score


@Na_Kang_Lim

I hope this helps, if any reply helps you, you could add your upvote/karma points to that reply, thanks.

Na_Kang_Lim
Explorer

Hi @kiran_panchavat ,
Your solution works for extracting the data, but can this be scaled up by using props.conf and transforms.conf? With this approach, if I want to extract all the fields, I will need one search line per field, which may work, but gets really long.

kiran_panchavat
Influencer

@Na_Kang_Lim 

Yes, you can definitely use props.conf and transforms.conf to scale this more broadly and make your field extractions more manageable.
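For example (a sketch under assumed names - the sourcetype and stanza names are placeholders, and the regexes would need testing against your real data), you could route only events containing "Dispatching" to a dedicated sourcetype at index time, then apply the cleanup and JSON extraction to that sourcetype:

```
# props.conf - on the original sourcetype
[python_syslog]
TRANSFORMS-route_dispatch = route_dispatching_events

# transforms.conf - retag only matching events
[route_dispatching_events]
REGEX = Dispatching:
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::dispatching_json

# props.conf - cleanup + extraction on the new sourcetype
[dispatching_json]
SEDCMD-removeheader = s/.*DEBUG:root:\.\.\. Dispatching: //g
SEDCMD-replace_simple_quotes = s/'/"/g
KV_MODE = json
```

This way all fields are extracted automatically at search time, with no per-field rex/spath lines, at the cost of the quote-replacement caveats already discussed in this thread.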

I hope this helps, if any reply helps you, you could add your upvote/karma points to that reply, thanks.

Na_Kang_Lim
Explorer

Do you know how to do that? I just know that I can; I don't know how.


kiran_panchavat
Influencer

@Na_Kang_Lim 

First, check whether json_data is correctly extracted:


 

| makeresults 
| eval _raw="Mar 18 02:32:19 MachineName python3[948]: DEBUG:root:... Dispatching: {'id': '<id>', 'type': 'threat-detection', 'entity': 'threat', 'origin': '<redacted>', 'nature': 'system', 'user': 'system', 'timestamp': '2025-03-17T19:32:17.974Z', 'threat': {'id': '<redacted_uuid>', 'maGuid': '<redacted_guid>', 'detectionDate': '2025-03-17T19:32:17.974Z', 'eventType': 'Threat Detection Summary', 'threatType': 'non-pe-file', 'threatAttrs': {'name': '<filename>.ps1', 'path': 'C:\\Powershell\\Report\\<filename>.ps1', 'md5': '<redacted_hash>', 'sha1': '<redacted_hash>', 'sha256': '<redacted_hash>'}, 'interpreterFileAttrs': {'name': 'powershell.exe', 'path': 'C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe', 'md5': '097CE5761C89434367598B34FE32893B', 'sha1': '044A0CF1F6BC478A7172BF207EEF1E201A18BA02', 'sha256': 'BA4038FD20E474C047BE8AAD5BFACDB1BFC1DDBE12F803F473B7918D8D819436'}, 'severity': 's1', 'rank': '100', 'score': '50', 'detectionTags': ['@ATA.Discovery', '@ATA.Execution'], 'contentVersion': null}, 'firstDetected': '2025-03-17T19:32:17.974Z', 'lastDetected': '2025-03-17T19:32:17.974Z', 'tenant-id': '<redacted_tenant_id>', 'transaction-id': '<redacted_transaction_id>'}"
| rex field=_raw "Dispatching:\s*(?<json_data>{.*})"
| eval json_data = replace(json_data, "'", "\"")
| eval json_data = replace(json_data, "\\\\", "\\\\\\\\")
| eval json_data = replace(json_data, "'null'", "null")
| table json_data


Output:- 

 

{
  "id": "<id>",
  "type": "threat-detection",
  "entity": "threat",
  "origin": "<redacted>",
  "nature": "system",
  "user": "system",
  "timestamp": "2025-03-17T19:32:17.974Z",
  "threat": {
    "id": "<redacted_uuid>",
    "maGuid": "<redacted_guid>",
    "detectionDate": "2025-03-17T19:32:17.974Z",
    "eventType": "Threat Detection Summary",
    "threatType": "non-pe-file",
    "threatAttrs": {
      "name": "<filename>.ps1",
      "path": "C:\\Powershell\\Report\\<filename>.ps1",
      "md5": "<redacted_hash>",
      "sha1": "<redacted_hash>",
      "sha256": "<redacted_hash>"
    },
    "interpreterFileAttrs": {
      "name": "powershell.exe",
      "path": "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe",
      "md5": "097CE5761C89434367598B34FE32893B",
      "sha1": "044A0CF1F6BC478A7172BF207EEF1E201A18BA02",
      "sha256": "BA4038FD20E474C047BE8AAD5BFACDB1BFC1DDBE12F803F473B7918D8D819436"
    },
    "severity": "s1",
    "rank": "100",
    "score": "50",
    "detectionTags": [
      "@ATA.Discovery",
      "@ATA.Execution"
    ],
    "contentVersion": null
  },
  "firstDetected": "2025-03-17T19:32:17.974Z",
  "lastDetected": "2025-03-17T19:32:17.974Z",
  "tenant-id": "<redacted_tenant_id>",
  "transaction-id": "<redacted_transaction_id>"
}

 

I hope this helps, if any reply helps you, you could add your upvote/karma points to that reply, thanks.