Looking at the posts about User-Agent HTTP header searches, one common thread is that the posters were told to change their log format to Combined Log Format. Unfortunately I cannot do that, but I am still being asked to create dashboard reports showing the most common OS and the most common browser. Here is a sample log line:
XX.XX.XX.XX - - [30/Jul/2013:15:16:40 -0700] 0 "GET /portal-web/images/denied.png HTTP/1.1" 200 882 "htps://ABC.ABC.com/portal-web/stuff/stuff.action" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.0)"
Ultimately I want separate count columns for browser type and OS type. How do I go about extracting the info I want? I believe I need a regex, but I am unsure how to proceed, especially since both the OS and browser portions of the string vary in length.
A pure regex is not going to do it alone. If you are a novice, the interactive field extraction creator can help. It is one of the options in the per-record drop-down.
The difficulty is that there is no defined order or format for the subfields of the UA. I just tried it myself, giving the generator the following sample list culled from recent access logs to weave its magic on:
Windows NT 5.1
Linux x86_64
Windows NT 6.0
Android 4.1.2
Windows Phone OS 7.5
Windows NT 6.1
The resulting sample extractions it offered were:
Linux x86_64 Windows NT 5.1 +http://yandex.com/bots)" RU Windows NT 5.1)" US http://www.majestic12.co.uk/bot.php?+)"; US rv:17.0) Gecko/20130626 Firefox/17.0 Iceweasel/17.0.7" FR +http://www.exabot.com/go/robot)" FR Windows NT 6.2 Mail.RU_Bot/2.0 Windows NT 6.0)" JP Windows NT 6.1 Windows NT 6.0)" CN +http://www.google.com/bot.html)" US Android 4.1.2 +http://www.bing.com/bingbot.htm)" US +http://www.baidu.com/search/spider.html)" CN Windows Phone OS 7.5
Even after some manual refinement it continues to miss the mark more than hit it.
Correct. There is no way to do this just by parsing. UA strings are not strongly-specified, they are mostly suggestive. If you need great accuracy, you must use a lookup that maps known patterns to the item you want. (I mean, technically, you can probably write a regex that includes all the logic of a lookup table, but it would be an impractically enormous regex, so let's just say you can't.)
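To make the "lookup that maps known patterns" idea concrete, here is a minimal Python sketch run outside of Splunk. It assumes the User-Agent is the last double-quoted field on the line, as in the sample log above. The pattern tables, the labels, and the names `OS_PATTERNS`, `BROWSER_PATTERNS`, `classify` and `parse_line` are all made up for illustration and are nowhere near complete; a real table would need many more entries.

```python
import re

# Hypothetical, deliberately incomplete lookup tables. Order matters:
# the first matching pattern wins, so more specific entries come first
# (Android UAs also contain "Linux"; Iceweasel UAs also contain "Firefox").
OS_PATTERNS = [
    (re.compile(r"Windows Phone OS"), "Windows Phone"),
    (re.compile(r"Windows NT 6\.2"), "Windows 8"),
    (re.compile(r"Windows NT 6\.1"), "Windows 7"),
    (re.compile(r"Windows NT 6\.0"), "Windows Vista"),
    (re.compile(r"Windows NT 5\.1"), "Windows XP"),
    (re.compile(r"Windows NT 5\.0"), "Windows 2000"),
    (re.compile(r"Android"), "Android"),
    (re.compile(r"Linux"), "Linux"),
]

BROWSER_PATTERNS = [
    (re.compile(r"bot|spider|crawl", re.IGNORECASE), "bot"),
    (re.compile(r"Iceweasel/"), "Iceweasel"),
    (re.compile(r"Firefox/"), "Firefox"),
    (re.compile(r"MSIE "), "Internet Explorer"),
]

# Assumption: the UA is the last double-quoted field on the log line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def classify(ua, patterns):
    """Return the label of the first pattern found in ua, else 'other'."""
    for rx, label in patterns:
        if rx.search(ua):
            return label
    return "other"


def parse_line(line):
    """Return (os, browser) for one access-log line, or None if no UA field."""
    m = UA_FIELD.search(line)
    if m is None:
        return None
    ua = m.group(1)
    return classify(ua, OS_PATTERNS), classify(ua, BROWSER_PATTERNS)
```

Note that the regexes here only locate known tokens; the actual knowledge (which token means which OS or browser, and which entry takes priority) lives in the table, which is the part that has to keep growing.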
If you want the job done right, you pretty much need an application. There is no simple way to parse a UA string. It requires either a massive lookup, or a combination of complex logic and a slightly-less-massive lookup. If you have a limited number of UA strings, your best bet is to simply enumerate them all into your own lookup, then set any others to "other" or something.
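The enumerate-everything approach for a small, known set of UA strings could look roughly like the following Python sketch. The table `KNOWN_UAS` and the function `count_os_and_browser` are hypothetical names, and the single entry is just the UA from the sample log line in the question; you would fill in the rest by hand from your own logs.

```python
from collections import Counter

# Hypothetical hand-maintained lookup: exact UA string -> (os, browser).
# Only the UA from the sample log line is listed here.
KNOWN_UAS = {
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.0)":
        ("Windows 2000", "Internet Explorer 7"),
}


def count_os_and_browser(user_agents):
    """Tally OS and browser labels; unknown UA strings count as 'other'."""
    os_counts = Counter()
    browser_counts = Counter()
    for ua in user_agents:
        os_name, browser = KNOWN_UAS.get(ua, ("other", "other"))
        os_counts[os_name] += 1
        browser_counts[browser] += 1
    return os_counts, browser_counts
```

Calling `most_common()` on either counter gives the separate count columns asked for, and watching the size of the "other" bucket tells you when the table needs new entries.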