Solved: Re: My regular expression is working fine but why is my search not retrieving fields?

premraj_vs · ‎06-13-2017

Hi All,

I am a newbie and I am trying to extract fields from a raw log. I followed the steps below.

1) Using https://regex101.com/, I created a regex matching my log.

The regex is as follows:

(?P<DP_Date_Time>\w+\s+\d+\s+\d+\s+\d+:\d+:\d+)\s+(?P<DP_Error_Code>[[^ ]\w+])\[\w+]\[\w+]\s(?P<DP_Service_Name>\w+\(\w+\)):\s+\w+\((?P<DP_Transaction_ID>\d+)\)\[(?P<DP_TID>\d+.\d+.\d+.\d+)\]\s+\w+\((?P<DP_GTID>\d+)\):\s+Latency:\s+(?P<DP_LATENCY_TIME_REQ_HDR_READ>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_REQ_HDR_SENT>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_FSTB>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_FSTC>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_ENTIRE_REQ_TRS>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_FS_SYTLE_READY>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_FS_PARSING_COM>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_RES_HDR_RECVD>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_RES_HDR_SENT>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_BSTB>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_BSTC>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_RES_TRS>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_BS_STYLE_READ>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_BSPC>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_BSCA>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_BSCC>[0-9]*) \[(?P<DP_Backside_URL>.*)\]

2) When I imported the log into Splunk, I selected the default source type.

3) I am trying the search query below, and it returns no fields.

source="latency_0612.log" host="******" index="idx-integrations-test" sourcetype="dpower-latency" | rex _raw="^(?P\w+\s+\d+\s+\d+\s+\d+:\d+:\d+)\s+(?P[[^ ]\w+])\[\w+]\[\w+]\s(?P\w+\(\w+\)):\s+\w+\((?P\d+)\)\[(?P\d+.\d+.\d+.\d+)\]\s+\w+\((?P\d+)\):\s+Latency:\s+(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*)[ ]*(?P[0-9]*) \[(?P.*)\]"

What am I missing? Why am I not seeing these fields after the search?

DalJeanis · ‎06-13-2017

It looks like you are missing only two things: something to skip the day of the week at the beginning, and the proper syntax for rex. Oh, one other thing: you can make Splunk's job easier if you do not use [ ]* for the spaces between numbers. Explanation after the code.

 | rex field=_raw "^\w+\s+(?P<DP_Date_Time>\w+\s+\d+\s+\d+\s+\d+:\d+:\d+)\s+(?P<DP_Error_Code>[[^ ]\w+])\[\w+]\[\w+]\s(?P<DP_Service_Name>\w+\(\w+\)):\s+\w+\((?P<DP_Transaction_ID>\d+)\)\[(?P<DP_TID>\d+.\d+.\d+.\d+)\]\s+\w+\((?P<DP_GTID>\d+)\):\s+Latency:\s+(?P<DP_LATENCY_TIME_REQ_HDR_READ>\d+)\s+(?P<DP_LATENCY_TIME_REQ_HDR_SENT>\d+)\s+(?P<DP_LATENCY_TIME_FSTB>\d+)\s+(?P<DP_LATENCY_TIME_FSTC>\d+)\s+(?P<DP_LATENCY_TIME_ENTIRE_REQ_TRS>\d+)\s+(?P<DP_LATENCY_TIME_FS_SYTLE_READY>\d+)\s+(?P<DP_LATENCY_TIME_FS_PARSING_COM>\d+)\s+(?P<DP_LATENCY_TIME_RES_HDR_RECVD>\d+)\s+(?P<DP_LATENCY_TIME_RES_HDR_SENT>\d+)\s+(?P<DP_LATENCY_TIME_BSTB>\d+)\s+(?P<DP_LATENCY_TIME_BSTC>\d+)\s+(?P<DP_LATENCY_TIME_RES_TRS>\d+)\s+(?P<DP_LATENCY_TIME_BS_STYLE_READ>\d+)\s+(?P<DP_LATENCY_TIME_BSPC>\d+)\s+(?P<DP_LATENCY_TIME_BSCA>\d+)\s+(?P<DP_LATENCY_TIME_BSCC>\d+) \[(?P<DP_Backside_URL>.*)\]"
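To see why the day of the week matters once the pattern is anchored with ^, here is a minimal Python sketch (not Splunk; it assumes Python's re module behaves like Splunk's regex engine for these simple constructs). It runs only the date portion of the pattern against the start of the sample event posted in this thread. The variable names are made up for illustration.

```python
import re

# Start of the sample event posted in this thread.
event = "Thu Apr 20 2017 13:42:09 [0x80e00073][latency][info]"

date_group = r"(?P<DP_Date_Time>\w+\s+\d+\s+\d+\s+\d+:\d+:\d+)"

# Anchored at the start: "Thu" is followed by "Apr", not digits, so this fails.
anchored = re.search(r"^" + date_group, event)

# Unanchored: the engine slides forward and finds the date starting at "Apr".
unanchored = re.search(date_group, event)

# Anchored, but skipping the day of the week first.
skip_day = re.search(r"^\w+\s+" + date_group, event)

print(anchored)
print(unanchored.group("DP_Date_Time"))
print(skip_day.group("DP_Date_Time"))
```

This is probably why the unanchored pattern looked fine in regex101 while the anchored one in the search found nothing.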

Remember that * can match ZERO of something. With the spaces after numbers, that means that [0-9]*[ ]*[0-9]* matches a zero-length string, as well as a huge number of substrings made of any succession of digits and spaces.

With this tiny piece of string...

243 254

the regex (?<item1>[0-9]*)(?<space2>[ ]*)(?<item3>[0-9]*) will match roughly 4*3^3 different ways, including...

1) the zero-length string before the first character, where item1, space2 and item3 are all empty.
2) the 1-length string "2" that has item1 and space2 empty and item3 as "2".
3) the 1-length string "2" that has item1 as "2" and space2 and item3 empty.
4) the 2-length string "24" that has item1 and space2 empty and item3 as "24".
5) the 2-length string "24" that has item1 as "2", space2 empty and item3 as "4".
6) the 2-length string "24" that has item1 as "24" and space2 and item3 empty.
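The list above can be generated mechanically. The following is a rough Python sketch that enumerates every (start, item1, space2, item3) combination the three-part pattern could accept somewhere in "243 254". The counting convention (every start position, every split) is my own assumption, so the total is an order-of-magnitude illustration rather than an exact reproduction of the estimate above. The helper names are hypothetical.

```python
DIGITS = "0123456789"

def run_length(text, pos, allowed):
    """Number of consecutive characters from pos that are in allowed."""
    end = pos
    while end < len(text) and text[end] in allowed:
        end += 1
    return end - pos

def enumerate_matches(text):
    """Every way [0-9]*[ ]*[0-9]* can match a substring of text."""
    ways = []
    for start in range(len(text) + 1):
        for a in range(run_length(text, start, DIGITS) + 1):
            p1 = start + a
            for b in range(run_length(text, p1, " ") + 1):
                p2 = p1 + b
                for c in range(run_length(text, p2, DIGITS) + 1):
                    ways.append((start, text[start:p1], text[p1:p2], text[p2:p2 + c]))
    return ways

ways = enumerate_matches("243 254")
print(len(ways))
for way in ways[:6]:
    print(way)
```

Even for this seven-character string the candidates number in the dozens, and each extra number/space pair in the real pattern multiplies the possibilities.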

Since you were using the greedy *, those alternatives will not get tested until after the version where item1 gets "243" and item3 gets "254", so you will be okay as long as the overall pattern matches. However, the minute that your overall pattern somehow fails, your search is going bye-bye with way too many potential backtracks to ever come back from.

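This is easily solved. In each of these cases you want one or more digits and one or more spaces, so you can use + instead (\d+ and \s+). Then each number and each gap has to consume at least one character, and a failed match is abandoned after a handful of backtracks instead of an explosion of them.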

To see this in action, take your original rex string, go over to regex101, and plop it in the tester. Copy your sample into the test string box and see that the match is found in 144 steps or so.

Now add some bad data late in the event - for example change one of the 36 to 36U. Up above to the right, after a short while, you will see the words "catastrophic backtracking". Now copy our version of the rex up there, and the message will instead be that it failed with no match after perhaps 136 steps.
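If you would rather see the effect without regex101, here is a toy backtracking matcher in Python. It is only a sketch under simplifying assumptions: it handles nothing but greedy character-class repeats, it uses six numbers instead of sixteen to keep the run short, and its "steps" are recursive calls, not the steps regex101 or Splunk would report. All names are hypothetical.

```python
DIGITS = "0123456789"

def count_steps(tokens, text):
    """tokens: list of (allowed_chars, min_repeat), each a greedy unbounded repeat.
    Returns (matched, steps) for a match anchored at position 0."""
    steps = 0

    def match(i, pos):
        nonlocal steps
        steps += 1
        if i == len(tokens):
            return True
        allowed, min_repeat = tokens[i]
        end = pos
        while end < len(text) and text[end] in allowed:
            end += 1
        # Greedy: try the longest run first, then give characters back.
        for stop in range(end, pos + min_repeat - 1, -1):
            if match(i + 1, stop):
                return True
        return False

    return match(0, 0), steps

# Six "[0-9]*[ ]*" pairs versus six "\d+\s+" pairs, each followed by a literal "[".
star_tokens = [(DIGITS, 0), (" ", 0)] * 6 + [("[", 1)]
plus_tokens = [(DIGITS, 1), (" ", 1)] * 6 + [("[", 1)]

good = "0 36 0 36 36 32 ["
bad = "0 36 0 36 36U 32 ["   # one value corrupted, as in the experiment above

for name, tokens in (("star", star_tokens), ("plus", plus_tokens)):
    print(name, "good:", count_steps(tokens, good), "bad:", count_steps(tokens, bad))
```

On the good line both versions succeed in about the same number of steps. On the bad line the + version gives up almost immediately, while the * version tries every way of sharing the digits and spaces among the groups before failing.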

premraj_vs · ‎06-13-2017

Finally it worked.

I had to use Field Extractions option and use the same regex expression there. It returned me all the fields with correct values.

Thanks for all the help.

woodcock · ‎06-13-2017

Please do elaborate with steps; I am not sure what you mean here.

premraj_vs · ‎06-13-2017

Sample Event

Thu Apr 20 2017 13:42:09 [0x80e00073][latency][info] mpgw(ORD_Gateway_Policy_02): tid(134637607)[YY:UU:UU:OO] gtid(134637607): Latency: 0 36 0 36 36 32 24 243 254 243 254 255 251 243 36 36 [https://XX.XX.XX.XX:10005/services/ORD/v2]
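For anyone who wants to sanity-check the field order against this event outside Splunk, here is a minimal Python sketch. It is not the rex from this thread: it grabs the whole run of Latency numbers with one short regex and pairs them, in order, with the field names used in the thread. The variable names are my own, and it assumes the sixteen values always appear in this order.

```python
import re

event = ("Thu Apr 20 2017 13:42:09 [0x80e00073][latency][info] "
         "mpgw(ORD_Gateway_Policy_02): tid(134637607)[YY:UU:UU:OO] "
         "gtid(134637607): Latency: 0 36 0 36 36 32 24 243 254 243 254 "
         "255 251 243 36 36 [https://XX.XX.XX.XX:10005/services/ORD/v2]")

# Field names copied from the regex in this thread, in order of appearance.
LATENCY_FIELDS = [
    "DP_LATENCY_TIME_REQ_HDR_READ", "DP_LATENCY_TIME_REQ_HDR_SENT",
    "DP_LATENCY_TIME_FSTB", "DP_LATENCY_TIME_FSTC",
    "DP_LATENCY_TIME_ENTIRE_REQ_TRS", "DP_LATENCY_TIME_FS_SYTLE_READY",
    "DP_LATENCY_TIME_FS_PARSING_COM", "DP_LATENCY_TIME_RES_HDR_RECVD",
    "DP_LATENCY_TIME_RES_HDR_SENT", "DP_LATENCY_TIME_BSTB",
    "DP_LATENCY_TIME_BSTC", "DP_LATENCY_TIME_RES_TRS",
    "DP_LATENCY_TIME_BS_STYLE_READ", "DP_LATENCY_TIME_BSPC",
    "DP_LATENCY_TIME_BSCA", "DP_LATENCY_TIME_BSCC",
]

m = re.search(r"Latency:\s+(?P<values>\d+(?:\s+\d+)*)\s+\[(?P<url>.*)\]", event)
values = m.group("values").split()
fields = dict(zip(LATENCY_FIELDS, values))
fields["DP_Backside_URL"] = m.group("url")

for name, value in fields.items():
    print(name, "=", value)
```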

horsefez · ‎06-13-2017

Hi,

1) Could you provide a sample event?
2) The correct rex syntax is | rex field=_raw "yourregex"
3) Your capture groups do not have field names.

Do something like (?<field_name>...) instead of (?P...)
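The point about names can be illustrated with Python's re module, which uses the same (?P<name>...) syntax. This is only an analogy for how rex turns named groups into fields, not Splunk itself, and the fragment is cut from the sample event in this thread.

```python
import re

fragment = "tid(134637607)[YY:UU:UU:OO] gtid(134637607)"

# Named group: the captured text is available under the field name.
named = re.search(r"tid\((?P<DP_Transaction_ID>\d+)\)", fragment)

# Unnamed group: the text is captured, but there is no name to file it under.
unnamed = re.search(r"tid\((\d+)\)", fragment)

print(named.groupdict())    # {'DP_Transaction_ID': '134637607'}
print(unnamed.groupdict())  # {}
print(unnamed.group(1))     # 134637607
```

A group without a name still matches, but there is nothing to call the resulting field, which is why rex needs every capture group to be named.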

DalJeanis · ‎06-13-2017

Yes, per point 2, it looks like premraj_vs is mixing the syntax for rex and regex.

premraj_vs · ‎06-13-2017

Added sample event

horsefez · ‎06-13-2017

Hi premraj_vs,

I don't know if I get it correctly, but how about using your first Regex-Statement in the query?

Like...

source="latency_0612.log" host="******" index="idx-integrations-test" sourcetype="dpower-latency" | rex field=_raw "(?P<DP_Date_Time>\w+\s+\d+\s+\d+\s+\d+:\d+:\d+)\s+(?P<DP_Error_Code>[[^ ]\w+])\[\w+]\[\w+]\s(?P<DP_Service_Name>\w+\(\w+\)):\s+\w+\((?P<DP_Transaction_ID>\d+)\)\[(?P<DP_TID>\d+.\d+.\d+.\d+)\]\s+\w+\((?P<DP_GTID>\d+)\):\s+Latency:\s+(?P<DP_LATENCY_TIME_REQ_HDR_READ>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_REQ_HDR_SENT>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_FSTB>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_FSTC>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_ENTIRE_REQ_TRS>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_FS_SYTLE_READY>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_FS_PARSING_COM>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_RES_HDR_RECVD>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_RES_HDR_SENT>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_BSTB>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_BSTC>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_RES_TRS>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_BS_STYLE_READ>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_BSPC>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_BSCA>[0-9]*)[ ]*(?P<DP_LATENCY_TIME_BSCC>[0-9]*) \[(?P<DP_Backside_URL>.*)\]"

premraj_vs · ‎06-13-2017

I am doing that already

