Splunk Search

Replace non-alphanumeric characters in a multivalue field

DEADBEEF
Path Finder

I have a multivalue field, and I'm hoping to get help replacing all the non-alphanumeric characters in a specific part of each value of the mvfield.  I am taking this multivalue field and creating a new field, but my regex simply ignores entries whenever there is a special character.  I need to get rid of these characters, so I'm trying to find a way to remove them before they reach my eval statement that creates the new field.

I know the problem is the capture group around the "name" value as it only allows \w and \s.

name\x22\x3a(?:\s+)?\x22([\w\s]+)\x22.

But I'm not sure how to fix it.  I've tried extracting the name field first and using sed to remove the characters, but then I don't know how to "re-inject" it back into the mv-field, or how to build my new field while referencing the now-clean name field. Any ideas?
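For reference, this is roughly what I attempted (names is just an illustrative field name; I tried both sed and a plain replace for the clean-up step):

| eval names = mvmap(old_field, replace(old_field, "^.*?\"name\":\s*\"([^\"]+)\".*$", "\1"))
| eval names = mvmap(names, replace(names, "[^a-zA-Z0-9 ]", ""))

That gives me a clean multivalue names field, but then I'm stuck on how to get those cleaned values back into old_field, or how to reference them when building the new field below.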

Sample Data

{"bundle": "com.servicenow.blackberry.ful", "name": "ServiceNow Agent\u00ae - BlackBerry", "name_version": "ServiceNow Agent\u00ae - BlackBerry-17.2.0", "sw_uid": "faa5c810a2bd2d5da418d72hd", "version": "17.2.0", "version_raw": "0000000170000000200000000"}

{"bundle": "com.penlink.pen", "name": "PenPoint", "name_version": "PenPoint-1.0.1", "sw_uid": "cba7d3601855e050d8new0f34", "version": "1.0.1", "version_raw": "0000000010000000000000001"}

 

SPL to create new field

| eval new = if(sourcetype=="custom:data", mvmap(old_field,replace(old_field,"\x7b.*?\x22bundle\x22\x3a\s+\x22((?:net|jp|uk|fr|se|org|com|gov)\x2e(\w+)\x2e.*?)\x22.*?name\x22\x3a(?:\s+)?\x22([\w\s]+)\x22.*?\x22sw_uid\x22\x3a(?:\s+)?\x22(?:([a-fA-F0-9]+)|[\w_:]+)\x22.*?\x22version\x22\x3a(?:\s+)?\x22(.*?)\x22.*$","cpe:2.3:a:\2:\3:\5:*:*:*:*:*:*:* - \1 - \4")),new)

 

This creates one good and one bad entry

{"bundle": "com.servicenow.blackberry.ful", "name": "ServiceNow Agent\u00ae - BlackBerry", "name_version": "ServiceNow Agent\u00ae - BlackBerry-17.2.0", "sw_uid": "faa5c810a2bd2d5da418d72hd", "version": "17.2.0", "version_raw": "0000000170000000200000000"}

cpe:2.3:a:penlink:PenPoint:1.0.1:*:*:*:*:*:*:* - com.penlink.penpoint - cba7d3601855e050d8new0f34

 


yuanliu
SplunkTrust

Before delving into regex details, could you explain what "badness" in the sample data you are trying to rectify?  What are the expected results? (Also, please use a code section that auto-wraps.)  In the output of your sample code, the "good" entry is exactly unchanged from the original entry. (By the way, the alternative value in the if function cannot be new; it should be old_field.)

To be clear, your sample code does not replace non-alphanumeric characters at all; it executes an extremely complex, purpose-built match.  If the sole goal is to replace non-alphanumeric characters globally, replace(old_field, "\W", "__non_alphanumeric__") suffices.  Here is a simple example of doing this when old_field is the only field of interest.

| makeresults
| fields - _time
| eval old_field = mvappend("{\"bundle\": \"com.servicenow.blackberry.ful\", \"name\": \"ServiceNow Agent\\u00ae - BlackBerry\", \"name_version\": \"ServiceNow Agent\\u00ae - BlackBerry-17.2.0\", \"sw_uid\": \"faa5c810a2bd2d5da418d72hd\", \"version\": \"17.2.0\", \"version_raw\": \"0000000170000000200000000\"}",
"{\"bundle\": \"com.penlink.pen\", \"name\": \"PenPoint\", \"name_version\": \"PenPoint-1.0.1\", \"sw_uid\": \"cba7d3601855e050d8new0f34\", \"version\": \"1.0.1\", \"version_raw\": \"0000000010000000000000001\"}")
| eval sourcetype="custom:data"
``` data emulation above ```
| mvexpand old_field
| spath input=old_field
| fields - old_field
| foreach version *
    [eval <<FIELD>> = if(sourcetype == "custom:data", replace(<<FIELD>>, "\W", "__non_alphanumeric__"), <<FIELD>>)]
| tojson output_field=new
| stats values(new) as new

The result is a two-value field

{"bundle":"com__non_alphanumeric__penlink__non_alphanumeric__pen","name":"PenPoint","name_version":"PenPoint__non_alphanumeric__1__non_alphanumeric__0__non_alphanumeric__1","sourcetype":"custom__non_alphanumeric__data","sw_uid":"cba7d3601855e050d8new0f34","version":"1__non_alphanumeric__0__non_alphanumeric__1","version_raw":"0000000010000000000000001"}
{"bundle":"com__non_alphanumeric__servicenow__non_alphanumeric__blackberry__non_alphanumeric__ful","name":"ServiceNow__non_alphanumeric__Agent__non_alphanumeric____non_alphanumeric____non_alphanumeric____non_alphanumeric__BlackBerry","name_version":"ServiceNow__non_alphanumeric__Agent__non_alphanumeric____non_alphanumeric____non_alphanumeric____non_alphanumeric__BlackBerry__non_alphanumeric__17__non_alphanumeric__2__non_alphanumeric__0","sourcetype":"custom__non_alphanumeric__data","sw_uid":"faa5c810a2bd2d5da418d72hd","version":"17__non_alphanumeric__2__non_alphanumeric__0","version_raw":"0000000170000000200000000"}

Are you trying to replace, say "." with one alphanumeric string (e.g., "dot"), ":" with a different alphanumeric string (e.g., "colon") and so on and so forth?  If so, what are the rules?
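For illustration only: if the rules turn out to be simple one-to-one mappings, chained replace calls inside your mvmap would do, e.g.

| eval new = mvmap(old_field, replace(replace(old_field, "\.", "dot"), ":", "colon"))

But the right approach really depends on what those rules are.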

Simply put: forget about regex for now.  Could you explain the logic connecting the sample data and the desired results?  Also, is the end goal to form a JSON field, or do you expect to extract JSON nodes into fields?


ITWhisperer
SplunkTrust

If I understand correctly, you want to extract the value with the special characters into new_field, so that you can replace the special characters more easily?

Try something like this

| eval new = if(sourcetype=="custom:data", mvmap(old_field,replace(old_field,"\x7b.*?\x22bundle\x22\x3a\s+\x22((?:net|jp|uk|fr|se|org|com|gov)\x2e(\w+)\x2e.*?)\x22.*?name\x22\x3a(?:\s+)?\x22([^\x22]+)\x22.*?\x22sw_uid\x22\x3a(?:\s+)?\x22(([a-fA-F0-9]+)|[\w_:]+)\x22.*?\x22version\x22\x3a(?:\s+)?\x22(.*?)\x22.*$","cpe:2.3:a:\2:\3:\5:*:*:*:*:*:*:* - \1 - \4")),new)

Note that there was also a mistake in the fourth group as this should not have been a non-capture group.


DEADBEEF
Path Finder

I see that you set the 3rd capture group to simply grab everything except ".  The problem with that is that sometimes there is a colon in that field, and as you can see, I am using replace to build the new field as colon-separated.  What I'm trying to do is find a way to remove all non-alphanumeric characters in that "section" of the log before running the eval.  Then I could use your solution.

I tried extracting that section into an MV field, then used sed to eliminate all the characters, but wasn't sure how to go further.  I could use your solution, but when there is a colon character (:) it would definitely break the building of the new field.

I thought about doing what you suggested and then using lookaheads/lookbehinds to count the number of (:) and then sed anything non-alphanumeric [^a-zA-Z0-9], but wasn't sure how to go about that either.


PickleRick
SplunkTrust

Ok, why not do

s/[^\X]/_/g

or something similar?


DEADBEEF
Path Finder

Because I need to create the new field for every entry.  If I were to implement that, it would simply not match that entry and move on, effectively ignoring it.  So I'm trying to find a way to clean this portion of the data such that I can capture every entry.  Right now I'm only matching on entries that contain [\w\s], so I'm missing a bunch.  True, [\X] would ignore less, but I'm trying to ignore none and capture everything, while avoiding problems by cleaning/manipulating the text before it reaches the capture group.


PickleRick
SplunkTrust

Ok. Honestly, you lost me here. What does filtering characters have to do with extracting fields? Either you filter characters and parse the resulting event, or you parse out the fields and then filter each field on its own. Or am I missing something?


DEADBEEF
Path Finder

Apologies, this is difficult to explain via text.

I have an MV field; I am iterating through it and using a regex to create multiple capture groups, then creating a new field using some of those capture groups.  That new field is colon-separated.

Currently, I've noticed that within my 3rd capture group, the values in the MV field can sometimes have non-alphanumeric characters, which causes the regex not to match (since the capture group is [\w\s]).

So... modify the regex to capture everything!  But... what about when the special character is a colon ( : )?  In that scenario, it will add an additional colon to my new colon-separated field, which will make that entry invalid because it no longer conforms to the pattern.

I thought, why not just get rid of every non-alphanumeric character that will land in the 3rd capture group before I create the new field, so there aren't issues?  That is what brought me here, as I cannot seem to find a way to do that.

Instead, I am now thinking it may be better to simply capture everything and then clean up the new field instead, as that will not be an MV field.  Maybe I can use regex and sed to eliminate any special characters in the new field; I just need to figure out how to account for the case when that character is a colon.  Since it's the 3rd capture group, I would need the pattern to have 4 colons before that part of the field and 7 colons after it.

cpe:2.3:a:\2:\3:\5:*:*:*:*:*:*:* - \1 - \4
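The closest I've gotten is something like the sketch below, keying off the fixed colon structure: the first four colon-separated pieces are cpe, 2.3, a and the vendor, and the last eight are the version plus the seven asterisks (with the " - bundle - sw_uid" tail hanging off the last piece), so whatever is left in the middle is the name, and any stray colons or other specials get cleaned there.  This assumes the regex has already been loosened (e.g. with [^\x22]+) so every value is in the cpe format, and that new has been expanded to one value per row (otherwise the whole expression would need to be wrapped in an mvmap).  Not sure how robust it is.

| mvexpand new
| eval parts = split(new, ":")
| eval n = mvcount(parts)
| eval name_clean = replace(mvjoin(mvindex(parts, 4, n-9), ":"), "[^a-zA-Z0-9 ]", "")
| eval new = mvjoin(mvindex(parts, 0, 3), ":") . ":" . name_clean . ":" . mvjoin(mvindex(parts, n-8, n-1), ":")
| fields - parts n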

 


PickleRick
SplunkTrust

Ok, now I think I understand. (I had 5 consecutive nights of less than 4 hours of sleep so I'm not my best self :)).

Honestly, your main problem is that you have structured data and are trying to approach it with simple text extractions. What will happen if you get a quote inside one of those fields?

I'd do a completely different thing - mvexpand the mvfield, throw spath on it, then collect the resulting fields into a new field and be done with it (if needed, recombine the results back to mvfields).
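A rough sketch of that route, using the field names from your sample events and building the cpe string with plain concatenation instead of one giant regex (the final stats is just one way to recombine into a multivalue field, if you still need that):

| mvexpand old_field
| spath input=old_field
| eval clean_name = replace('name', "[^a-zA-Z0-9 ]", "")
| eval vendor = replace(bundle, "^(?:net|jp|uk|fr|se|org|com|gov)\.(\w+)\..*$", "\1")
| eval cpe = "cpe:2.3:a:" . vendor . ":" . clean_name . ":" . version . ":*:*:*:*:*:*:* - " . bundle . " - " . sw_uid
| stats values(cpe) as new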

But if you insist on doing the regexes, don't do it all in one pass. Do one mvmap with replace to "clean up" your data, then extract the fields to your cpe record in another mvmap pass.
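For the two-pass route, here is a minimal sketch of the cleanup pass. It assumes the troublesome colon sits inside the quoted name value and only handles the first stray colon per value (a smarter pattern or an extra pass would be needed if several can appear):

| eval cleaned = if(sourcetype=="custom:data", mvmap(old_field, replace(old_field, "(\"name\":\s*\"[^\"]*?):", "\1")), old_field)

Then run your original cpe-building mvmap against cleaned instead of old_field.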

BTW, why don't you use normal symbols instead of those escape codes? It's confusing 🙂


PickleRick
SplunkTrust

The first question is whether you indeed have special characters which are displayed this way, or whether they were rendered before/on ingest and are stored as literal "\xsomething" strings, because that will change the way you must match them.

 


DEADBEEF
Path Finder

Sometimes there are unicode characters (e.g.: \u00e3) and sometimes there are other characters like ', :, #, etc...

I don't have an issue with the unicode characters, but occasionally one of the other characters is a colon (:) which breaks the new field as I am building it to be colon separated.
