So, fun problem:
We want to do some data enrichment so that we can build good reports. The goal is to take proxy logs, which contain a userid and an IP address, resolve the userid against AD to get the business group, and resolve the IP against DNS to get the proper hostname.
Issues:
1: Millions of proxy events per hour
2: Over 300k accounts
3: Over 200k endpoints
Because of this, subsearches fail (limits reached), an inline ldapsearch fails, and an inline dnslookup fails because of max events. I've thought of chunking it, but I keep coming back to the fact that the subsearch data is too large for the limits. I'd almost like to just append this data to the events in the proxy logs index, but I don't know if that is possible.
Current search logic:
sourcetype=proxy_logs user!=""
| fields user, category, src, dest, http_referrer, url
| join user [search sourcetype="ActiveDirectory" | fields sAMAccountName, displayName, company, department | rename sAMAccountName AS user ]
| stats values(displayName) AS "Display Name" values(company) AS Company values(department) AS Dept values(category) AS Category values(src) AS Source values(dest) AS DEST count(_raw) AS "URL Count" values(http_referrer) AS Referrer values(url) AS URL by user
I've thought about doing the proxy search as the subsearch, but the logs can get just as large and fail on max returns. I figure if I solve the AD problem then the DNS enrichment will use the same logic.
So, does anyone have thoughts on how to solve this Gordian knot? Do I just need to live with increasing the limits? Or am I thinking about the problem wrong? I figure whatever it is, I'll have to chunk it and write it to a summary index to actually do any kind of reporting, but I need to get the fields in there first.
Thanks everyone!
Try this:
(sourcetype=proxy_logs user!="") OR (sourcetype="ActiveDirectory")
| eval user=coalesce(user, sAMAccountName)
| fields user, category, src, dest, http_referrer, url, displayName, company, department
| stats values(displayName) AS "Display Name" values(company) AS Company values(department) AS Dept values(category) AS Category values(src) AS Source values(dest) AS DEST count(_raw) AS "URL Count" values(http_referrer) AS Referrer values(url) AS URL by user
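One thing to watch with this approach: the AD events go through the same stats, so count(_raw) counts them too, and accounts with no proxy traffic still get a row. Here is a rough sketch of one way around that. It assumes you keep sourcetype in the fields list so stats can tell the two kinds of event apart. That addition, the count(eval(...)) form, and the trailing where filter are my own changes to the search above, so treat this as a starting point and check it against your data.

```spl
(sourcetype=proxy_logs user!="") OR (sourcetype="ActiveDirectory")
| eval user=coalesce(user, sAMAccountName)
| fields sourcetype, user, category, src, dest, http_referrer, url, displayName, company, department
| stats values(displayName) AS "Display Name" values(company) AS Company values(department) AS Dept values(category) AS Category values(src) AS Source values(dest) AS DEST count(eval(sourcetype="proxy_logs")) AS "URL Count" values(http_referrer) AS Referrer values(url) AS URL by user
| where 'URL Count' > 0
```

The count(eval(...)) only counts events whose sourcetype is proxy_logs, and the where clause drops users that matched only on the AD side. The single quotes around 'URL Count' are needed because the field name contains a space.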
That works awesome! Thanks!