Using web logs, I am trying to use rex to extract all text after the last / of the url field and put it into a field called "filename". The catch is that I only want the text if it ends in .zip.
After trying multiple variations of my regex, Splunk keeps returning values that do not match it (I tested the regex on multiple online testers).
index=bc_logs | rex field=url "(?<filename>[^/]+\.zip)" | stats count by filename | sort -count
filename                         count
sprint.zip                       400
message.zip                      31
track.zip                        4
www.zip                          4
Software%20Update                3
signaturerq.png                  2
3po.zip                          1
W2n=41#cb=fb4&domain=www.zip     1
[455DE-DA3-4A-BCE-69F56D4]       1
americaninfidelmiddlefi.jpg      1
Some results end in .zip and some don't... not sure what's going on.
EDIT: added url log samples
url=track.ziprecruiter.com
url=files.getsoftfree.com/get/click/479ymt8s/?uid=6X102VhaCZ&filename=Software%20Update&sid=173652
url=desmond.imageshack.us/Himg62/scaled.php?server=62&filename=americaninfidelmiddlefi.jpg&res=medium
From your log samples it seems likely that Splunk's auto-kv extraction is overwriting your own field extraction in cases where there's a "filename=<something>" as part of a log event. Verify this by calling your field something else and checking whether the results are correct.
EDIT: Or rather, it's the other way around - Splunk's auto-kv will run first, and find some "filename" values. Then you apply your own field extraction which will only write results to the "filename" field if it finds anything that matches your regex. However, for results where it DOESN'T match, but auto-kv has extracted something, that value will not get overwritten and so you're left with matches from both kinds of extractions.
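To make the explanation above concrete, here is a rough Python simulation of the two steps, run against the URL samples from the question. It is only a sketch: `auto_kv` and `apply_rex` are hypothetical stand-ins written for illustration, not Splunk's actual implementation, and the key=value pattern is an assumption about how a "filename=..." pair gets picked up.

```python
import re

# Pattern from the question. Python spells named groups (?P<name>...)
# where rex uses (?<name>...).
ZIP_RE = re.compile(r"(?P<filename>[^/]+\.zip)")

# Hypothetical stand-in for automatic key=value extraction (an assumption,
# not Splunk's real logic): grab every key=value pair in the URL.
KV_RE = re.compile(r"(\w+)=([^&\s]+)")


def auto_kv(url):
    """Step 1: fields found automatically, e.g. filename=Software%20Update."""
    return dict(KV_RE.findall(url))


def apply_rex(fields, url, target="filename"):
    """Step 2: write the capture into `target` only when the regex matches."""
    match = ZIP_RE.search(url)
    if match:
        fields[target] = match.group("filename")
    return fields  # on no match, any existing value in `target` is left alone


urls = [
    "track.ziprecruiter.com",
    "files.getsoftfree.com/get/click/479ymt8s/?uid=6X102VhaCZ&filename=Software%20Update&sid=173652",
    "desmond.imageshack.us/Himg62/scaled.php?server=62&filename=americaninfidelmiddlefi.jpg&res=medium",
]

# Same field name as the key in the URL: leftover auto-extracted values survive.
same_name = [apply_rex(auto_kv(u), u).get("filename") for u in urls]

# A field name that does not collide: only real regex matches remain.
new_name = [apply_rex(auto_kv(u), u, target="blabla").get("blabla") for u in urls]
```

With the colliding name, the second and third URLs keep the values taken from their query strings, which is where "Software%20Update" and "americaninfidelmiddlefi.jpg" in the results come from. With a different field name those rows have no value at all.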
@gpradeepkumarreddy: This kinda works. It works as I intended, but eliminates all logs that already contain 'filename' in the URL. The final solution was combining your addition along with Ayn's.
@rroberts: tried with the anchor, doesn't help. Documentation says, "The rex command matches the value of the specified field against the unanchored regular expression..."
That was it! When I changed 'filename' to 'blabla' and re-ran it, it worked perfectly. Thank you all!
Final query for any future readers:
index=bc_logs url=*.zip | rex field=url "(?<blabla>[^/]+\.zip)" | stats count by blabla | sort -count
Did you try .. index=bc_logs | rex field=url "(?
Putting a $ after zip to declare "ends with zip"?
Can you post some of the values of url?
One more way would be to keep only the results where the url ends in ".zip" prior to your extraction, e.g. url=*.zip
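As a small illustration of what the `$` suggestion changes, here is a Python sketch comparing the unanchored pattern with an anchored one. Python's `re` is used only to show the regex behaviour; the helper `extract` and the URL `example.com/files/sprint.zip` are made up for the example and do not come from the logs.

```python
import re

# Unanchored pattern from the question, and the same pattern with $ appended.
UNANCHORED = re.compile(r"(?P<name>[^/]+\.zip)")
ANCHORED = re.compile(r"(?P<name>[^/]+\.zip)$")


def extract(pattern, url):
    """Return the captured text, or None when the pattern does not match."""
    match = pattern.search(url)
    return match.group("name") if match else None


# ".zip" in the middle of a hostname matches only without the anchor.
host_unanchored = extract(UNANCHORED, "track.ziprecruiter.com")
host_anchored = extract(ANCHORED, "track.ziprecruiter.com")

# A URL that really ends in .zip (hypothetical sample) matches either way.
file_anchored = extract(ANCHORED, "example.com/files/sprint.zip")
```

The anchor removes false hits such as "track.zip" taken from track.ziprecruiter.com, but it cannot clear a "filename" value that was already extracted from the query string, which is consistent with the comment that the anchor alone did not help.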