I'm working on building the lookups and data model for the Splunk App for Web Analytics. I'm supporting an environment with multiple websites (e.g., roadrunner.acme.com, coyote.acme.com, anvil.acme.com) and many web servers (server1, server2, server3, and so on).
I note that the Setup New Website dialog asks for site, host, and source information, and accepts a wildcard in the host field. Two concerns. First, in the real world I might have hundreds of hosts that match a wildcard (server*), yet only a few of them host roadrunner.acme.com. It would be better if I could name multiple specific hosts for a given website rather than rely on a wildcard. The form does let the user enter multiple hosts for a website, but the lookups then fail to build. Second, if my host naming convention (the part before the *) and log file path are the same for coyote.acme.com, I suspect that will also cause a problem in the app, though I've not gotten that far yet.
Finally, the initial lookup build for "Generate User Sessions" and "Generate pages" is excruciatingly slow, perhaps because of the wildcard search on the host name. Tips for speeding this up in the documentation, or at least a clearer idea of what "a long time" means, would be helpful.
I understand that it can get complex if you are running hundreds of websites and hundreds of hosts. You have two options for setting this app up: use absolute links between host, source, and a site, or use wildcards for host/source. If you combine absolute links and wildcards, make sure there are no conflicts, as traffic might otherwise be assigned to the wrong site.
For maximum flexibility you can specify each host and source combination and link them to a site. You can have multiple host and source combinations link to the same site by just adding more entries under site setup. This might not be practical if you have hundreds of host and source combinations in your data.
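If you do mix absolute entries and wildcards, a quick offline check for conflicts may help before you enter them in the app. The following is only a rough Python sketch, not something the app provides: the entry list, `overlaps`, and `find_conflicts` are hypothetical names, and it assumes shell-style wildcard matching, which may differ from how the app evaluates its patterns. It flags any two entries that point to different sites while one host/source pattern covers the other.

```python
from fnmatch import fnmatchcase
from itertools import combinations

# Hypothetical site-setup entries: (site, host pattern, source pattern).
ENTRIES = [
    ("roadrunner.acme.com", "server1", "/var/log/httpd/accesslog"),
    ("roadrunner.acme.com", "server2", "/var/log/httpd/accesslog"),
    ("coyote.acme.com", "server*", "/var/log/httpd/accesslog"),
]

def overlaps(a, b):
    """True if one pattern matches the other when that other is read literally.

    This is a conservative check: two wildcard patterns that overlap only
    partially (for example "a*" and "*b") are not detected.
    """
    return fnmatchcase(a, b) or fnmatchcase(b, a)

def find_conflicts(entries):
    """Return pairs of entries that could claim the same traffic for different sites."""
    conflicts = []
    for first, second in combinations(entries, 2):
        site_a, host_a, src_a = first
        site_b, host_b, src_b = second
        if site_a != site_b and overlaps(host_a, host_b) and overlaps(src_a, src_b):
            conflicts.append((first, second))
    return conflicts

if __name__ == "__main__":
    for first, second in find_conflicts(ENTRIES):
        print("possible conflict:", first, "<->", second)
```

With the sample entries, both roadrunner.acme.com rows are flagged against the `server*` row for coyote.acme.com, which is the situation described in the question.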
For the initial lookup: yes, this can take a long time. It all depends on the amount of data and the performance of your system. The slowness comes not from the site assignment but from the transaction command that generates the session IDs. The lookups are also set to run for "All time". You can change this to a narrower time frame to speed things up. Once the initial lookups have been built, only the incremental changes are added.
The transaction command is heavy, so you don't want to run it every time a dashboard is loaded. The generated lookup is then fed into a data model that powers the dashboards. That's why dashboard performance is great.
[Figure: Data pipeline for the Splunk App for Web Analytics]
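To give a feel for why session generation is the expensive step, here is a minimal Python sketch of session assignment by inactivity gap. It is an illustration of the general technique only, not the app's actual search: the 30-minute timeout, the `(timestamp, client_id)` event shape, and the `assign_sessions` name are all assumptions. The point is that every event has to be grouped per client and ordered in time before a session ID can be assigned, so the cost grows with the full time range you search.

```python
from collections import defaultdict

# Assumed inactivity gap in seconds; not taken from the app's configuration.
SESSION_TIMEOUT = 30 * 60

def assign_sessions(events, timeout=SESSION_TIMEOUT):
    """Assign a session ID to each event.

    events: iterable of (timestamp_seconds, client_id) tuples.
    Returns a list of (timestamp_seconds, client_id, session_id) tuples.
    A new session starts when a client is idle for longer than `timeout`.
    """
    by_client = defaultdict(list)
    for ts, client in events:
        by_client[client].append(ts)

    result = []
    for client, stamps in by_client.items():
        stamps.sort()  # events must be in time order per client
        session_no = 0
        previous = None
        for ts in stamps:
            if previous is not None and ts - previous > timeout:
                session_no += 1
            result.append((ts, client, f"{client}-{session_no}"))
            previous = ts
    return result

if __name__ == "__main__":
    sample = [(0, "10.0.0.1"), (100, "10.0.0.1"), (5000, "10.0.0.1"), (50, "10.0.0.2")]
    for row in assign_sessions(sample):
        print(row)
```

Computing this once and storing the result, rather than recomputing it per dashboard load, is the same trade-off the lookup makes.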
I hope this answers your questions.
Yes, that helps. For the initial lookup and data model build, it sounds like you're saying I can stop the default "All time" build and then rerun it for a shorter time period if desired, after which the model will be updated incrementally every 10 minutes. Is that correct?
In my example above, will it work to specify server1 as host and /var/log/httpd/accesslog as source for roadrunner.com, and then the same input for server2, /var/log/httpd/accesslog, also for roadrunner.com?
Yes, you can modify the initial lookup build and run it for a shorter time frame.
For the site configuration, this will work:
roadrunner.com server1 /var/log/httpd/accesslog
roadrunner.com server2 /var/log/httpd/accesslog