All,
I am working on a project to "predict" how much Splunk license I may need in order to onboard a new customer. We usually ingest the same information for all customers; the main difference is the number of entries in the logs.
The problem I am having is that I cannot rely on the _internal metrics.log of my indexers. It looks like it does not contain all the information. For example, if I run:
index=ssn host=*xey* earliest="01/31/2019:1:29:00" latest="01/31/2019:2:29:00" | stats count by host
1   lasxeypr01dem01.las.ssnsgs.net  107
2   lasxeypr01slv01.las.ssnsgs.net  120
3   lasxeypr01vmw01.las.ssnsgs.net  28865
4   lasxeypr01vmw02.las.ssnsgs.net  12242 
At the same time:
index="_internal" source="*metrics.log" group="per_host_thruput" earliest="01/31/2019:1:29:00" latest="01/31/2019:2:29:00" series=*xey* | chart sum(kb) by series | sort - sum(kb)
No results found. 
Some data is there:
index="_internal" source="*metrics.log" group="per_host_thruput" earliest="01/31/2019:1:29:00" latest="01/31/2019:2:29:00" | stats count by host
1   arnvtnpr01spl01.arn.ssnsgs.net  117
2   iadphite01spl01.iad.ssnsgs.net  116
3   janentpr01spl01.jan.ssnsgs.net  116
4   lascocpr01mys01.las.ssnsgs.net  116
5   lascocpr01mys02.las.ssnsgs.net  117
6   lascocpr01mys03.las.ssnsgs.net  116
7   lashrmpr01kaf05     117
8   lashrmpr01wor02     116
9   lasssnpr01spl01.las.ssnsgs.net  1160
10  lasssnpr01spl02.las.ssnsgs.net  1160
11  lasssnpr01spl03.las.ssnsgs.net  1160
12  lasssnpr01spl04.las.ssnsgs.net  679
13  lasssnpr01spl05.las.ssnsgs.net  170
14  lasssnpr01spl06.las.ssnsgs.net  116
15  lasssnpr01spl07.las.ssnsgs.net  116
16  lasssnpr01spl08.las.ssnsgs.net  213
17  lasssnspl01app01.las.ssnsgs.net     188
18  lcxfplpr02spl01.fpl.ssnsgs.net  1160
19  litentpr02spl01.lit.ssnsgs.net  117
20  okcogepr02spl01.okc.ssnsgs.net  116
21  pdxpcfte01spl01.pdx.ssnsgs.net  152
22  phlphipr01spl01.phl.ssnsgs.net  117
23  sanssnpoc02slv01.san.ssnsgs.net     116
24  sanssnpr01spl01.san.ssnsgs.net  1160
25  sanssnpr01spl02.san.ssnsgs.net  1160
26  sanssnpr01spl03.san.ssnsgs.net  1160
27  sanssnpr01spl04.san.ssnsgs.net  125
28  sanssnpr01spl05.san.ssnsgs.net  160
29  sanssnpr01spl06.san.ssnsgs.net  195
30  sanssnpr01spl10     1160 
I know for sure that data is being ingested for these hosts.
So, how can I get the exact amount of data that is indexed? Is there some rotation on the _internal index that I am missing?
Thank you,
Gerson
 
Hi @GersonGarcia
The metrics.log can squash or summarise the metrics for a source, sourcetype, or host if there are too many series. If you need exact numbers and you don't mind the query being slow, you can do this:
<search> | eval len = len(_raw) | stats sum(len) as bytes
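For example, with the index and host pattern from your question (just a sketch, reusing your original one-hour window):
index=ssn host=*xey* earliest="01/31/2019:1:29:00" latest="01/31/2019:2:29:00" | eval len = len(_raw) | stats sum(len) as bytes by host
This counts the raw bytes of every event actually sitting in the index, so nothing gets squashed.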
 
@chrisyoungerjds I believe I can find it in a different log:
index=_internal sourcetype=splunkd group=tcpout_connections host=*xey* earliest="01/31/2019:1:29:00" latest="01/31/2019:2:29:00" | chart sum(kb) by host | sort - sum(kb)
1   lasxeypr01vmw01.las.ssnsgs.net  19532.55
2   lasxeypr01vmw02.las.ssnsgs.net  17314.92
3   lasxeypr01nan01.las.ssnsgs.net  1520.58
4   lasxeypr01sla01.las.ssnsgs.net  1393.90
5   lasxeypr01gpl01.las.ssnsgs.net  1360.50
6   lasxeypr01dem01.las.ssnsgs.net  1283.92
7   lasxeypr01vmw03.las.ssnsgs.net  1269.57
8   sanxeyte01dem01.san.ssnsgs.net  1233.25 
 
The problem here is that if I have any transformation before indexing, these numbers will not reflect what actually gets indexed...
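One way I could at least see the size of that gap (a sketch joining the two searches above over the same window; indexed_kb, forwarded_kb, and delta_kb are just names I picked):
index=ssn host=*xey* earliest="01/31/2019:1:29:00" latest="01/31/2019:2:29:00" | eval kb = len(_raw) / 1024 | stats sum(kb) as indexed_kb by host | join type=outer host [ search index=_internal sourcetype=splunkd group=tcpout_connections host=*xey* earliest="01/31/2019:1:29:00" latest="01/31/2019:2:29:00" | stats sum(kb) as forwarded_kb by host ] | eval delta_kb = round(forwarded_kb - indexed_kb, 2)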
 
Hmm, the problem with the len(_raw) approach is that it will take forever to complete the search for all hosts over the past day:
host=*xey* earliest=-1d@d latest=@d | eval len = len(_raw) | stats sum(len) as bytes by index host
1   main    lasxeypr01slv01.las.ssnsgs.net  24430
2   os  lasxeypr01dem01.las.ssnsgs.net  10702044
3   os  lasxeypr01gpl01.las.ssnsgs.net  11615854
4   os  lasxeypr01nan01.las.ssnsgs.net  19561100
5   os  lasxeypr01sla01.las.ssnsgs.net  14134946
6   os  lasxeypr01vmw01.las.ssnsgs.net  111012962
7   os  lasxeypr01vmw02.las.ssnsgs.net  56708985
8   os  lasxeypr01vmw03.las.ssnsgs.net  9954705
9   os  sanxeyte01dem01.san.ssnsgs.net  9743627
10  ssn     lasxeypr01dem01.las.ssnsgs.net  569558
11  ssn     lasxeypr01slv01.las.ssnsgs.net  3102610
12  ssn     lasxeypr01vmw01.las.ssnsgs.net  135302275
13  ssn     lasxeypr01vmw02.las.ssnsgs.net  51478532 
This search has completed and has returned 13 results by scanning 1,724,992 events in 86.817 seconds
 
Yes, that is the downside. The only real solution I can offer is estimation. Basically, run this query over a smaller time range to find out how large events typically are:
host=*xey* earliest=-1h@h latest=@h | eval len = len(_raw) | stats avg(len) as avg_bytes by index host
Then you can run a super-fast tstats command to get the count of events per index and host:
| tstats count where index=_internal sourcetype=splunkd by host index
Then you can multiply the two numbers together to determine approximately how much data each host used.
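Putting the two steps together might look like this (a sketch, not tested; I pointed tstats at the indexes and host pattern from your earlier results, used a one-hour sample for the average, and est_mb is just a name I picked):
| tstats count where (index=ssn OR index=os) host=*xey* earliest=-1d@d latest=@d by index host | join type=left index host [ search (index=ssn OR index=os) host=*xey* earliest=-1h@h latest=@h | eval len = len(_raw) | stats avg(len) as avg_bytes by index host ] | eval est_mb = round(count * avg_bytes / 1024 / 1024, 2)
The tstats part stays fast because it only reads the index metadata; only the one-hour sample touches raw events.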
 
Yeah, I guess I could, but the problem is that log size depends on many factors, and it is never the same on two hosts...
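Maybe I can first check how much event sizes spread per host before trusting the average, something like (sd_bytes is just a name I picked):
(index=ssn OR index=os) host=*xey* earliest=-1h@h latest=@h | eval len = len(_raw) | stats avg(len) as avg_bytes, stdev(len) as sd_bytes by index host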
Thank you for your help.
