I've been fighting with and researching Splunk regex for the past month, and I cannot get the PCREs produced by another source to work when searching proxy logs in Splunk. I assume there are some syntactic differences, and possibly some missing features, but I haven't been able to find any solid documentation on what those may be.
Can anyone help me get the expression below working properly in a Splunk search? I've been trying variations on vendor = proxyname | regex = "<expression>" but it doesn't work.
^http:\/\/(?!www|forums?)(?:[^\.]+\.[^\.\x2f]+|[^\.]+\.[^\.]+\.(?:[^\.\x2f]+?|[^\.]+\.[^\.]+))\/[^\x3f]+\/(?:index\.php\?PHPSESSID=[^&]+?&action=(?!dlattach)[^&]+?&?|view(?:forum|topic)\.php\?[a-z]=[^&]{1,5}&[a-z]{1,3}=(?![0-9a-f]{32})[0-9a-z\._-]{13,})&?$
In Splunk, the syntax to do regex matching in a search is:
<base search> | rex field=_raw_or_another_field "some regex here (?<extracted_field> regex here for match) some ending regex here" | table extracted_field
Verify that you're using the rex command in this fashion; then we can talk about what is or is not matching.
It would help if you could share some sample data.
regex101.com is a good site for testing regex strings. It is pretty compatible with Splunk regexes.
I'll edit in a sample URL... I HAVE checked it at regex101.com, and it checks out there. But it fails in Splunk.
OK, apparently it won't let me edit the post, so here's the URL, with the disclaimer that it was a live Angler EK link a week or so back. I've defanged it for safety reasons, so you'll have to fix the http and .com parts to check it properly. hxxp://nosprivsliikeradan.pfgfoxriver-localguide2[.]com/boards/viewforum.php?f=5x827&sid=7q0as14.5i4x8
Your regex string matches the URL example, but nothing is extracted because the regex has no capturing groups. What are you attempting to do with the regex?
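To illustrate the point about capturing groups, here is a minimal sketch in Python. It is only an analogy: Python's re module is not Splunk's regex engine, and it writes named groups as (?P<name>...) rather than the (?<name>...) form shown in the rex example above. The URL and the field name host are hypothetical.

```python
import re

# Hypothetical URL, used only for illustration.
url = "http://example.com/boards/viewforum.php"

# No capturing group: the pattern matches, but there is nothing to extract.
plain = re.search(r"http://[^/]+/", url)

# Named capturing group: the same match now yields a value named "host".
named = re.search(r"http://(?P<host>[^/]+)/", url)

print(plain is not None, plain.groups())
print(named.group("host"))
```

The first pattern is enough for a yes/no filter; the named group only matters when a value has to be pulled out into a field.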
I want to be able to search the proxy logs for any and all instances of the regex. If there's a log with a URL matching that regex, I want to see it when I run the search.
So when you enter index=foo | regex "^http:\/\/(?!www|forums?)(?:[^\.]+\.[^\.\x2f]+|[^\.]+\.[^\.]+\.(?:[^\.\x2f]+?|[^\.]+\.[^\.]+))\/[^\x3f]+\/(?:index\.php\?PHPSESSID=[^&]+?&action=(?!dlattach)[^&]+?&?|view(?:forum|topic)\.php\?[a-z]=[^&]{1,5}&[a-z]{1,3}=(?![0-9a-f]{32})[0-9a-z\._-]{13,})&?$" what do you get?
If I do that, I get no results returned, but I just figured out the problem: it's the way the proxy logs are stored in Splunk. Each event is a single line that is more or less a hash-style data structure, with metadata tags and values. So when I search with the regex like that, the ^ and $ anchors at the beginning and end of the regex, while good for regex filtering on the proxy, break the Splunk search, because the URL shows up mid-line surrounded by other garbage.
Removing the anchors was going to be my next suggestion. 😉
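The anchor problem can be reproduced outside Splunk. The following is a rough sketch in Python, assuming a key=value event layout; the actual proxy log format is not shown in this thread, so the event string, its field names, and its values are made up. Python's re module also differs from Splunk's engine, although this particular expression behaves the same way in both respects that matter here.

```python
import re

# The anchored pattern from the thread, copied unchanged.
ANCHORED = r"^http:\/\/(?!www|forums?)(?:[^\.]+\.[^\.\x2f]+|[^\.]+\.[^\.]+\.(?:[^\.\x2f]+?|[^\.]+\.[^\.]+))\/[^\x3f]+\/(?:index\.php\?PHPSESSID=[^&]+?&action=(?!dlattach)[^&]+?&?|view(?:forum|topic)\.php\?[a-z]=[^&]{1,5}&[a-z]{1,3}=(?![0-9a-f]{32})[0-9a-z\._-]{13,})&?$"

# Same pattern with the leading ^ and trailing $ removed.
UNANCHORED = ANCHORED[1:-1]

# Re-fang the defanged sample URL from the thread.
defanged = "hxxp://nosprivsliikeradan.pfgfoxriver-localguide2[.]com/boards/viewforum.php?f=5x827&sid=7q0as14.5i4x8"
url = defanged.replace("hxxp", "http", 1).replace("[.]", ".")

# Hypothetical single-line event; the real log layout is an assumption.
event = f'time=1433160000 src=10.0.0.5 url="{url}" status=200'

# Anchored pattern against the bare URL, then against the whole event.
anchored_on_url = re.search(ANCHORED, url) is not None
anchored_on_event = re.search(ANCHORED, event) is not None

# Unanchored pattern against the whole event.
unanchored_on_event = re.search(UNANCHORED, event) is not None

print(anchored_on_url, anchored_on_event, unanchored_on_event)
```

The anchored pattern matches the bare URL but not the full event, because ^ and $ refer to the start and end of the whole string being searched. Dropping them lets the same expression match the URL wherever it sits in the line.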