 
					
				
		
I have ~10,000 hosts each generating around 100,000 events of ~100 bytes each day. I ingest all of these events, batched daily from each host. At ~10MB per host, I ingest ~100GB/day of uncompressed data.
Each of the 10,000 hosts is a member of exactly one "group". In total, I have ~1000 groups that contain every single host. For reporting purposes, I want to build reports and dashboards that can filter and slice the data by these groups as the groups are the nominal unit of measure for describing hosts in my organization.
Making things slightly more complicated, hosts move between groups from time to time. I'd like reports and dashboards to understand and reflect this behavior. For example, if host A is in group 2 now, but was in group 1 up until one month ago, it should not appear in reports and dashboards filtered to the last week and group 1.
Sample data:
index=main
10/13/14 hostA event1
10/22/14 hostA event1
10/28/14 hostB event1
index=hostToGroup
10/10/14 hostA group1
10/20/14 hostA group2
10/10/14 hostB group2
Given this sample data, an example request would be to return the count of event1 in group1 during October 2014. The correct answer would be one, since one of the hostA events (the second one) was while hostA was in group2 and the only hostB event was while it was in group2 as well.
I have tried or considered a few different strategies to model this information, but I'm looking for advice from the experts, so here I am. 😉
Model 1: Import information about the host to group relationship into some Splunk index as events and then use joins, "run everywhere" queries or similar to combine this information with the other events to enable filtering. Cons: Seems slow, very complicated queries.
Model 2: Similar to Model 1, but import the host to group relationship information into a time-based lookup table and then use that lookup table during queries to filter. (I have not even prototyped this model.) Cons: Slow performance? Complicated queries? Need for lookup table generation scheduled queries.
Model 3: Simply add the group information to every single event as it is indexed. Cons: Inefficient storage.
Model 4: Encode host's group into file at index time, effectively burying it in the 'source' field. Cons: Non-obvious.
Others?
I have built Model 1, but it is extremely slow. Since I need to effective combine every event (100GB/day) with the group information queries are slow. The queries are very complicated as well. I suspect Model 2 will have similar properties.
For simplicity, I like Model 3. The queries become very clear and simple. On the flip side, I don't really like that every event's _raw field ends up containing the group (as the names are large and cumbersome in practice). Is there a way to "hide" a field in an event to avoid this?
I also like Model 4 since it seems to better model the real world. I don't actually need to store the group on each event. Instead, each host's daily ingest file could simply include a "global" group value that applies to all the events therein. The 'source' and 'host' fields feel like they have this arity, but are they actually stored this way or are they actually stored on each field under the covers? Or perhaps the whole system is more efficient than that anyway (columnar db?) and I really shouldn't be trying to optimize this...?
Sorry for the long question. I wasn't sure how to describe my situation more briefly. A few days and nights of searching here and manual reading and I haven't come up with much, yet. On that last thought, I'd also welcome hints/advice on such terms/terminology that might help me better research this kind of problem...I suspect I may not be the only person with this kind of use case...
Thanks in advance!
 
					
				
		
I would use a summary index for this.  On the first of every month, run a search that calculates all of event/group pairings in that month and save this.  Then do a subsearch to pull this in as a search qualifier and use NOT [search earliest=-1m@m latest=-0m@m from SI].
 
					
				
		
Thanks for the feedback! I am doing something similar to this with run everywhere queries, but on a nightly basis. I end up having to build and maintain tens of queries though to do all the different kinds of nightly aggregations. The upside is that my dashboards are very fast since they only use the nightly, pre-aggregated data. The downsides are that I can't easily support drilldown (the details are aggregated away) and exploratative querying is not sped up really at all.
 
					
				
		
What you need is a time-base lookup setup for host to group relationship and then use the lookup to get the correct group assignment based on _time and TIME field of lookup. See more details on how to configure the same here.
 
					
				
		
Thanks for the advice. I haven't prototyped that path yet, but I will look into it. Do you foresee any problems with a lookup table that contains millions of items? I suspect that over time my lookup table will grow to such sizes...
 
					
				
		
Well, very large lookup are not good. I won't go for this solution if the count is more than 100K. Beside, you've 10,000 hosts and each belongs to only one group at a time, with total group count is 1000, so the count in lookup table should be 10,000 to starts with. It'll grow as hosts get reassigned to different groups, it should go in millions. How frequently the hosts get re-assigned to different groups?
