Here are two events from an Apache log. I have a field extraction regex that works unless the content type contains a "charset" parameter.
[01/Aug/2014:07:43:48 +0100] 1150 xxx.xxx.xxx.xxx xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx xxx.xxx.xxx.xxx IT 200 GET www.URL.com/images/favicon.ico - "Mozilla/5.0 (Linux; U; Android 4.4.2; en-gb; HTC_Desire_610 Build/KOT49H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30" text/plain; charset=UTF-8 2148
[01/Aug/2014:07:43:58 +0100] 293 xxx.xxx.xxx.xxx xxx.xxx.xxx.xxx xxx.xxx.xxx.xxx - 200 GET www.URL.com/robots.txt - "Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +go.mail.ru/help/robots)" text/javascript 118943
The regex that works on the second event is:
(?i)^[^\+]*\+\d+\]\s+(?P<bytes>[^ ]+)\s(?P<clientip>[^ ]+)\s(?P<xforward_ip>[^ ]+)\s(?P<cluster_ip>[^ ]+)\s(?P<lang>[^ ]+)\s(?P<response>[^ ]+)\s(?P<method>[^ ]+)\s(?P<uri>[^ ]+)\s(?P<referer>[^ ]+)\s"(?P<useragent>[^"]*?)"\s(?P<mime_type>[^ ]+)\s(?P<response_time>[^ ]+)
So what I’m trying to do is have the regex match "text/plain" but if it sees "; charset=UTF-8" to also match that in the same group.
So my attempt at the regex is:
(?i)^[^\\+]*\\+\\d+\\]\\s+(?P<bytes>[^ ]+)\\s(?P<clientip>[^ ]+)\\s(?P<xforward_ip>[^ ]+)\\s(?P<cluster_ip>[^ ]+)\\s(?P<lang>[^ ]+)\\s(?P<response>[^ ]+)\\s(?P<method>[^ ]+)\\s(?P<uri>[^ ]+)\\s(?P<referer>[^ ]+)\\s\"(?P<useragent>[^\"]*?)\"\\s?(\\w+\\W\\w+\\W\\s\\S+?)(?P<mime_type>[^;]+)|(?P<mime_type>[^ ]+)\\s(?P<response_time>[^ ]+)
The if-then-else statement is ?(\\w+\\W\\w+\\W\\s\\S+?)(?P<mime_type>[^;]+)|(?P<mime_type>[^ ]+)
but Splunk gives the error "Regex: two named subpatterns have the same name", which I understand.
Unfortunately I'm a regex noob, so this is my understanding...
?(\\w+\\W\\w+\\W\\s\\S+?)
= if(condition)
(?P<mime_type>[^;]+)
= then field is
|(?P<mime_type>[^ ]+)
= else match
Hope that makes sense 🙂
Give this a try
(?i)^[^\+]*\+\d+\]\s+(?P<bytes>[^ ]+)\s(?P<clientip>[^ ]+)\s(?P<xforward_ip>[^ ]+)\s(?P<cluster_ip>[^ ]+)\s(?P<lang>[^ ]+)\s(?P<response>[^ ]+)\s(?P<method>[^ ]+)\s(?P<uri>[^ ]+)\s(?P<referer>[^ ]+)\s\"(?P<useragent>[^\"]*?)\"\s(?P<mime_type>(\w+\/\w+))(.*)\s(?P<response_time>\d+)
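To see what the tail of that pattern captures, here is a minimal sketch in Python that runs only the last part of the regex (everything after the quoted user agent) against the ends of the two sample events. Python's `re` module is used as a stand-in for Splunk's PCRE engine, so treat it as an illustration rather than a guarantee of identical behaviour. `TAIL_A` mirrors the pattern above (with `\/` written as `/`); `TAIL_B` is a hypothetical variant, not part of the answer, that keeps the optional "; charset=..." inside the same named group, which is what the question originally asked for. The names `TAIL_A`, `TAIL_B`, `samples` and `extract` are made up for this example.

```python
import re

# Variant A: same tail as the regex above. mime_type stops before any
# "; charset=..." part, which is swallowed by the unnamed (.*) group.
TAIL_A = re.compile(r'(?P<mime_type>(\w+/\w+))(.*)\s(?P<response_time>\d+)')

# Variant B (assumption, not from the answer): one named group with an
# optional non-capturing "; charset=..." suffix, so no name is duplicated.
TAIL_B = re.compile(
    r'(?P<mime_type>[^ ;]+(?:;\s*charset=[^ ]+)?)\s(?P<response_time>\d+)'
)

# Ends of the two sample events from the question.
samples = [
    "text/plain; charset=UTF-8 2148",
    "text/javascript 118943",
]


def extract(pattern, text):
    """Return (mime_type, response_time) or None if the pattern does not match."""
    m = pattern.search(text)
    return (m.group("mime_type"), m.group("response_time")) if m else None


for s in samples:
    print("A:", extract(TAIL_A, s))
    print("B:", extract(TAIL_B, s))
```

With variant A the charset is dropped from `mime_type`; with variant B it stays in the field. Either way there is only one group named `mime_type`, so the "two named subpatterns have the same name" error does not come up.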
Perfect! Thank you very much 🙂 Looks like I was over-engineering it.