
How to parse fields properly?

Charlie5 — Fri, 26 May 2023 12:21:18 GMT

Hello,

I am trying to get a field extraction working and have written a regex for it that the field extractor seems to accept. The raw logs are a list of quote-enclosed fields separated by commas:

"field1","field2","field3",...

Certain fields can have multiple values, in which case the values are separated only by commas and a single pair of quotes encloses the entire list of values. For example:

"field1","field2","field3value1,field3value2,field3value3",...

To complicate matters, values that belong to a certain field can contain multiple words separated by other characters, such as "Software/Technology" or "Business and Industry" so that the entire field may look something like this:

"Software/Technology,Business Services,Application,Business and Industry,Computers and Internet"

That field needs to be extracted and displayed exactly as shown. The regexes I have attempted for this are as follows:

"(?<categories>[^\"]+|)
"(?<categories_again>[\w\s\/\,]+|)

Although the field extractor, the rex command, and regex101 all accept both of these extractions, and they work exactly as expected there, when I search I get each word from within the field as its own independent value, which is not what I need:

Software
Technology
Business
Services
Application
and
Industry

At this point I'm out of ideas as to regex modifications or other work-arounds that can be applied to fix this. Has anyone else encountered this problem and if so, were you able to fix it and how? Otherwise I think I have to bring this to Splunk support.

Thank you

Re: Parsing Fields Properly

richgalloway — Fri, 26 May 2023 00:16:24 GMT

Please share some sanitized example events for us to test with. Are you trying to parse the fields at search time or index time? If the former, please share the SPL you're using; otherwise, share the relevant props.conf stanza.

Re: Parsing Fields Properly

ITWhisperer — Fri, 26 May 2023 05:52:15 GMT

It is not entirely clear what your expected results are. For example, are you looking for the extract to produce a multi-value field like this

Software/Technology Business Services Application Business and Industry Computers and Internet

or a single field like this

Software/Technology,Business Services,Application,Business and Industry,Computers and Internet

or in the more generic case a multi-value field like this

field1 field2 field3value1,field3value2,field3value3

or is this three fields

field1

field2

field3value1,field3value2,field3value3

or, in the case of the last field

field3value1 field3value2 field3value3
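The difference between the single-field and multi-value outcomes above can be checked with a small run-anywhere sketch. It assumes only the standard makeresults command and the eval functions split and mvcount; the field names single and multi are made up for illustration.

```spl
| makeresults
| eval single="field3value1,field3value2,field3value3"
| eval multi=split(single, ",")
| eval single_count=mvcount(single), multi_count=mvcount(multi)
| table single multi single_count multi_count
```

Here single stays one comma-separated string (count of 1), while multi should come back as a multi-value field holding three separate values.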

Re: Parsing Fields Properly

Charlie5 — Fri, 26 May 2023 23:30:48 GMT

Thanks for the responses thus far; they are much appreciated. Here are some sanitized examples of logs:

"2023-04-25 13:14:27","QZ-NewYork_DMZ","QZ-NewYork_DMZ","80.20.59.143","80.20.59.143","Allowed","28 (AAAA)","NOERROR","webdefence.global.whitespider.com","Software/Technology,Application,Computers and Internet","Networks","Networks",""

"2022-10-23 11:34:59","Charlie Five (cfive@workplace.com)","Charlie Five (cfive@workplace.com),QZ-NewYork_Verizon_VPN_NAT,QZ-845310891334","172.32.5.8","8.8.8.8","Allowed","1 (A)","NOERROR","outlook.office365.com","Software/Technology,Webmail,Business Services,Organizational Email,Application,Web-based Email,Online Document Sharing and Collaboration","AD Users","AD Users,Networks,Anyconnect Roaming Client",""

In the first example, I would want the values for the categories field to be as follows; each line represents one complete field value as it would display in a search:

Software/Technology
Application
Computers and Internet

Alternatively, this would also suffice, which is the entire string exactly as it displays in the log:

Software/Technology,Application,Computers and Internet

The same applies to the second example. Here I will display the values as if I had clicked on the field in the event drop-down and selected "view events"; this is what would be added to the search bar:

categories="Software/Technology,Webmail,Business Services,Organizational Email,Application,Web-based Email,Online Document Sharing and Collaboration"

Or (I'll only show 1 here for the sake of brevity):

categories="Online Document Sharing and Collaboration"

Hope this helps you more, and thank you again for your assistance.

Re: Parsing Fields Properly

richgalloway — Sat, 27 May 2023 00:34:11 GMT

Are you trying to parse the fields at search time or index time? If the former, please share the SPL you're using; otherwise, share the relevant props.conf stanza.

Re: Parsing Fields Properly

ITWhisperer — Sat, 27 May 2023 06:18:12 GMT

Depending on whether the final field is important, you could do something like this

| rex max_match=0 "(?<field>\"[^\"]*\"),?" | eval categories=split(trim(mvindex(field,9),"\""),",")
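For anyone who wants to try this without indexed data, here is a run-anywhere sketch of the same approach. It assumes makeresults to fabricate a single event and uses the second sanitized sample from this thread as _raw; the mvcount line is only there as a sanity check and is not part of the suggestion itself.

```spl
| makeresults
| eval _raw="\"2022-10-23 11:34:59\",\"Charlie Five (cfive@workplace.com)\",\"Charlie Five (cfive@workplace.com),QZ-NewYork_Verizon_VPN_NAT,QZ-845310891334\",\"172.32.5.8\",\"8.8.8.8\",\"Allowed\",\"1 (A)\",\"NOERROR\",\"outlook.office365.com\",\"Software/Technology,Webmail,Business Services,Organizational Email,Application,Web-based Email,Online Document Sharing and Collaboration\",\"AD Users\",\"AD Users,Networks,Anyconnect Roaming Client\",\"\""
| rex max_match=0 "(?<field>\"[^\"]*\"),?"
| eval categories=split(trim(mvindex(field,9),"\""),",")
| eval category_count=mvcount(categories)
| table categories category_count
```

The rex call collects every quoted field into the multi-value field named field, mvindex(field,9) picks the tenth one (the index is zero-based), trim removes the surrounding quotes, and split breaks the list on commas. With this sample, categories should hold seven values, with "Business Services" and "Online Document Sharing and Collaboration" each kept intact as one value.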

Re: Parsing Fields Properly

Charlie5 — Tue, 30 May 2023 21:40:22 GMT

@richgalloway Search time, here is the SPL for manual extraction:

index=my_index sourcetype=proxy_sourcetype
| rex field=_raw "^(\"([^\"]+)\",){9}\"(?<categories>[^\"]+)\""

Re: Parsing Fields Properly

richgalloway — Wed, 31 May 2023 02:02:51 GMT

I think you're most of the way there. To separate the categories, use the split function.

index=my_index sourcetype=proxy_sourcetype | rex field=_raw "^(\"([^\"]+)\",){9}\"(?<categories>[^\"]+)\"" | eval categories=split(categories,",")
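A run-anywhere sketch of this rex-plus-split approach, for testing outside the real index. It assumes makeresults and the first sanitized sample event from this thread. Two details are choices made for the sketch rather than part of the suggestion: every literal quote inside the rex pattern is escaped with a backslash so the pattern survives SPL string parsing, and the groups use (?:...) with * instead of + so that an empty field ("") among the first nine would not break the match.

```spl
| makeresults
| eval _raw="\"2023-04-25 13:14:27\",\"QZ-NewYork_DMZ\",\"QZ-NewYork_DMZ\",\"80.20.59.143\",\"80.20.59.143\",\"Allowed\",\"28 (AAAA)\",\"NOERROR\",\"webdefence.global.whitespider.com\",\"Software/Technology,Application,Computers and Internet\",\"Networks\",\"Networks\",\"\""
| rex field=_raw "^(?:\"[^\"]*\",){9}\"(?<categories>[^\"]*)\""
| eval categories=split(categories, ",")
| eval category_count=mvcount(categories)
| table categories category_count
```

With this sample, categories should contain three values (Software/Technology, Application, Computers and Internet). Appending | search categories="Computers and Internet" should still return the event, because a search term on a multi-value field matches when any one of its values matches.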