I am digging through some records and trying to extract q (the query) from the raw data, but I keep getting data back that includes a backslash after the requested field (mostly as a Unicode escape sequence, \u0026, which is an &).
For example, I have this search query to capture the page from which a search is being made (i.e., "location"):
index="xxxx-data" | regex query="location=([a-zA-Z0-9_]+)+[^&]+" | rex field=_raw "location=(?<location>[a-zA-Z0-9%-]+).*" | rex field=_raw "q=(?<q>[a-zA-Z0-9%-_&+/]+).*"| table location,q
This mostly works when viewing the Statistics tab, except that it occasionally also returns the next URL parameter, e.g.:
location | q |
home_page | hello+world // this is ok |
about_page | goodbye+cruel+world\u0026anotherparam=anotherval // not ok |
The second result should just be goodbye+cruel+world without the following parameter.
I have tried adding variations on regex NOT [^\\] for a backslash character but everything I've tried has either resulted in an error of the final bracket being escaped, or the backslash character ignored like so:
rex field=_raw ...
regex attempt | result |
"q=(?<q>[a-zA-Z0-9%-_&+/]+[^\\\]).*" | goodbye+cruel+world\u0026param=val |
"q=(?<q>[a-zA-Z0-9%-_&+/]+[^\\]).*" | Error in 'rex' command: Encountered the following error while compiling the regex 'q=(?<q>[a-zA-Z0-9%-_&+/]+[^\]).*': Regex: missing terminating ] for character class. |
"q=(?<q>[a-zA-Z0-9%-_&+/]+[^\]).*" | Error in 'rex' command: Encountered the following error while compiling the regex 'q=(?<q>[a-zA-Z0-9%-_&+/]+[^\]).*': Regex: missing terminating ] for character class. |
"q=(?<q>[a-zA-Z0-9%-_&+/]+[^\\u0026]).*" | Error in 'rex' command: Encountered the following error while compiling the regex 'q=(?<q>[a-zA-Z0-9%-_&+/]+[^\u0026]).*': Regex: PCRE does not support \L, \l, \N{name}, \U, or \u. |
"q=(?<q>[a-zA-Z0-9%-_&+/]+[^u0026]).*" | goodbye+cruel+world\u0026param=val" |
"q=(?<q>[a-zA-Z0-9%-_&+/]+[^&]).*" | goodbye+cruel+world\u0026param=val" |
"q=(?<q>[a-zA-Z0-9%-_&+/]+).*" | goodbye+cruel+world\u0026param=val |
Events tab data is like:
Event
apple: honeycrisp
ball: baseball
car: Ferrari
query: param1=val1&param2=val2&param3=val3&q=goodbye+cruel+world&param=val
status: 200
... etc ...
So, how can I get q to return just its own value, stopping at the first \ or & that follows it?
And please, if you would be so kind, include an explanation of why what you suggest works?
Thanks
Hi @isxtn,
There are probably a few ways to tackle this. Here's one that may work for you:
| rex field=_raw "q=(?<q>.+?)(&|\\\u\d)"
That breaks down like this:
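Create a field called "q" that uses up all characters until it sees either a literal ampersand (&) or a backslash followed by "u" and a digit (the start of an encoded character such as \u0026).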
This should match when things are correctly separated by an ampersand, but also if the ampersand is character encoded.
The question mark after the .+ in the regex tells Splunk to not use greedy matching, so it will stop looking at the first "&" or "\u" that it sees.
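As a rough illustration outside Splunk, the greedy versus non-greedy difference can be reproduced with Python's re module. This is only a sketch under a few assumptions: Python spells the named group (?P<q>...) rather than (?<q>...), the pattern is written the way the regex engine is assumed to receive it (a literal backslash, then u, then a digit), and the sample strings and the extract helper are made up to mirror the events in the question.

```python
import re

# Sample query strings modelled on the events in the question (hypothetical data).
# The raw string keeps a literal backslash followed by "u0026".
encoded = r"param1=val1&q=goodbye+cruel+world\u0026param=val"
plain = "param1=val1&q=goodbye+cruel+world&param=val&last=1"

# Non-greedy: .+? stops at the first "&" or backslash-u-digit it can reach.
lazy = re.compile(r"q=(?P<q>.+?)(&|\\u\d)")
# Greedy: .+ runs as far as it can, then backs up to the LAST delimiter.
greedy = re.compile(r"q=(?P<q>.+)(&|\\u\d)")


def extract(pattern, text):
    """Return the captured q value, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group("q") if m else None


print(extract(lazy, encoded))   # stops before the encoded ampersand
print(extract(lazy, plain))     # stops before the first plain ampersand
print(extract(greedy, plain))   # overshoots into the following parameter
```

With the non-greedy pattern both samples yield only the q value; the greedy variant keeps going until the last ampersand in the string, which is the same kind of overshoot the question describes.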
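To avoid the "Regex: PCRE does not support \L, \l, \N{name}, \U, or \u" error, I've escaped the backslash itself. Judging by the error messages in your attempts, Splunk collapses \\ to a single \ inside the quoted string, so the three backslashes in \\\u should reach the regex engine as \\u, which matches a literal backslash followed by the letter u instead of being read as the unsupported \u escape.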
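To see what the escaped delimiter matches once it reaches the regex engine, here is a small Python check. It assumes, based on the error messages quoted in the question, that the engine ends up with \\u\d (an escaped backslash, a plain u, then a digit). Python's re module stands in for PCRE here, and the sample text is invented to mirror the question's data.

```python
import re

# The delimiter as the regex engine is assumed to see it:
# "\\" matches one literal backslash, "u" matches itself, "\d" matches one digit.
delimiter = re.compile(r"\\u\d")

# Raw string, so the text holds a literal backslash followed by "u0026".
text = r"goodbye+cruel+world\u0026param=val"

match = delimiter.search(text)
matched = match.group(0)            # backslash, "u", and the first digit
before = text[:match.start()]       # everything ahead of the encoded ampersand

print(repr(matched))
print(before)
```

The match covers only the backslash, the u, and the first digit, which is enough to mark where the q value ends.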
Here's a test search to show it in action:
| makeresults
| eval raw = "apple: honeycrisp
ball: baseball
car: Ferrari
query: param1=val1&param2=val2&param3=val3&q=goodbye+cruel+world\\u0026param=val
status: 200@apple: honeycrisp
ball: baseball
car: Ferrari
query: param1=val1&param2=val2&param3=val3&q=goodbye+cruel+world&param=val
status: 200"
| makemv raw delim="@" | mvexpand raw
| rename raw as _raw
| rex field=_raw "q=(?<q>.+?)(&|\\\u\d)"
| table _raw, q
That results in q being extracted as goodbye+cruel+world for both events.
Cheers,
Daniel