Getting Data In

Difference in Size Between Events

Path Finder

I have two indexes that contain different sets of events.

Index 1
Event Count – 23,952
Current Size – 19 MB

Index 2
Event Count – 431,026
Current Size – 20 MB

The size is nearly the same, but the number of events is drastically different. That would make sense except that the events in both indexes are generally the same length. Is there any explanation for the difference in size here?
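A quick back-of-envelope check (assuming the sizes above are in MB, as reported by the Indexes view) shows how far apart the per-event footprint of the two indexes really is:

```
| makeresults
| eval index1_bytes_per_event = round((19 * 1024 * 1024) / 23952, 1)
| eval index2_bytes_per_event = round((20 * 1024 * 1024) / 431026, 1)
| table index1_bytes_per_event index2_bytes_per_event
```

That works out to roughly 830 bytes per event in index 1 versus about 49 bytes per event in index 2, a difference of around 17x, so something beyond raw event length must be at play.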

Index 1 - Event Example

    {"time":"Fri Apr 03 17:57:08 CDT 2015","web_request_response_time":"0.45356011390686035","application":"node_count":"1","DataType":"PurepathData","state":"OK","cpu":"0.448837012052536","System Profile":"c_prodissue","breakdown":"CPU: 0.449 ms, Sync: -, Wait: -, Suspension: -","agent":"_JavaApp06_sin@sin:1547","root_path_thread_name":"http-apr-169.97.17.67-11000-exec-2","time":"Fri Apr 03 17:57:08 CDT 2015","response_time":"0.45356011390686035","execsum":"0.45356011390686035","name":"/SUI/monitoring","exec":"0.45361328125"}

     {"time":"Fri Apr 03 17:57:03 CDT 2015","web_request_response_time":"0.5128860473632812","application":"applic","node_count":"1","DataType":"PurepathData","state":"OK","cpu":"0.5083289742469788","System Profile":"_uat_prodissue","breakdown":"CPU: 0.508 ms, Sync: -, Wait: -, Suspension: -","agent":"UAT_JavaApp05_sin@sin:28893","root_path_thread_name":"http-apr-169.97.17.62-11000-exec-17","time":"Fri Apr 03 17:57:03 CDT 2015","response_time":"0.5128860473632812","execsum":"0.5128860473632812","name":"/UI/monitoring","exec":"0.512939453125"}

Index 2 - Event Example

            System_Profile=Monitoring #document dynatrace version=6.1.0.8054 systemprofile capture=true modifiedby=E745984 repositoryaccess=true incidentrules incidentrule flags=1 id=Host Disk Unhealthy incidentdashboardname=Incident Zero Conf Dashboard timeframe=10 actions actionref bundleversion=0.0.0 execution=begin key=com.dynatrace.diagnostics.plugins.EmailNotification refaction=com.dynatrace.diagnostics.plugins.EmailNotification rolekey=com.dynatrace.diagnostics.plugins.EmailNotificationAction roletype=1 severity=informational smartalert=false type=Email Notification property key=from typeid=string value= 

            System_Profile=Monitoring #document dynatrace version=6.1.0.8054 systemprofile capture=true modifiedby=E745984 repositoryaccess=true incidentrules incidentrule flags=1 id=Host Network Unhealthy incidentdashboardname=Incident Zero Conf Dashboard timeframe=10 actions actionref bundleversion=0.0.0 execution=begin key=com.dynatrace.diagnostics.plugins.EmailNotification refaction=com.dynatrace.diagnostics.plugins.EmailNotification rolekey=com.dynatrace.diagnostics.plugins.EmailNotificationAction roletype=1 severity=informational smartalert=false type=Email Notification property key=bcc typeid=string value= 
1 Solution

SplunkTrust

I'm going to guess that your data in index 1 has INDEXED_EXTRACTIONS=json activated in props.conf. Using more space in that case is expected behaviour: that space is traded for speed when searching those fields, especially in tstats situations.
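For reference, the setting in question looks something like this in props.conf (the sourcetype name here is hypothetical):

```
# props.conf -- stanza/sourcetype name is hypothetical
[purepath:json]
INDEXED_EXTRACTIONS = json
```

With this enabled, every JSON field is written into the index's tsidx structures at index time, which costs disk space but makes those fields available to tstats without parsing _raw at search time.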

To further investigate, run these two searches:

| dbinspect index=index1 | eval rawSizeMB = rawSize / 1048576 | table id eventCount rawSizeMB sizeOnDiskMB

| dbinspect index=index2 | eval rawSizeMB = rawSize / 1048576 | table id eventCount rawSizeMB sizeOnDiskMB

That'll give you the event count, the raw size ingested into each bucket for that index, and how much space each bucket occupies on disk. If you have a few huge rogue events, you should see one bucket behaving differently from the others; if my JSON guess is correct, all buckets for an index should look fairly similar.

As for the events themselves, it seems the data in index 1 has more unique tokens - for example, those high-precision numbers. Lots of unique tokens increase the size of the dictionaries, and hence of Splunk's index structures. The index 2 sample events seem to have lots of repeating tokens in the field values and not many unique ones.


SplunkTrust

By default, Splunk will force an event break after 10,000 characters. You can change that in props.conf using the TRUNCATE setting. In the same spirit, the default will break after 256 lines in one event; see MAX_EVENTS in props.conf.

These default limits exist to mitigate misconfigurations and systems throwing unexpected log data.
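If those defaults ever need raising, the overrides live in props.conf on the relevant sourcetype (the stanza name below is hypothetical):

```
# props.conf -- stanza name is hypothetical
[my:verbose:sourcetype]
# Raise the per-event character limit from the 10,000-character default
TRUNCATE = 50000
# Allow multi-line events of up to 1,000 lines (default is 256)
MAX_EVENTS = 1000
```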


SplunkTrust

The configuration reference is here: http://docs.splunk.com/Documentation/Splunk/6.2.3/Admin/Propsconf (search for INDEXED_EXTRACTIONS).
There's a more human-readable walkthrough here: docs.splunk.com/Documentation/Splunk/6.2.3/Data/Extractfieldsfromfileheadersatindextime

Regular searches should run at similar speeds. What benefits most is something like this:

| tstats avg(cpu) avg(web_request_response_time) where index=index1 by _time span=auto prestats=t | timechart avg(cpu) avg(web_request_response_time)

That should be massively faster than trying to pry the cpu and web_request_response_time fields from the JSON at search time.


Path Finder

Your assumption is correct. So you're saying that the data in index 1 can be searched faster?

This data comes from a custom-made script. If the trade-off for the larger file size is quicker results, then I will leave the formatting as is. Otherwise, if there were no upside to formatting the events this way, I would change them to be simpler.

Thanks for the heads up. Are there any reference docs available related to this?


Communicator

Is it possible there are one or two rogue gigantic events in Index 1? I've never used it personally, but I've read of people using "eval esize" to check this kind of thing.
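A sketch of that kind of check, using len(_raw) to approximate each event's size (the field name esize is just a convention, not a built-in):

```
index=index1
| eval esize = len(_raw)
| stats count avg(esize) max(esize) perc95(esize)
```

If max(esize) dwarfs the average, a handful of oversized events could account for the extra space; if the distribution is tight, the cause lies elsewhere.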

Path Finder

I believe there is a character limit for events, so even if there were a handful of rogue events, that still couldn't account for a tenfold size increase.


Communicator

Ah, I didn't know that actually.


Path Finder

Communicator

Yeah, I immediately looked into that as soon as you mentioned it. That post exactly, actually. Thanks!


Esteemed Legend

How are you calculating "size"?


Path Finder

That is coming from the Indexes view in the Splunk Settings. "Current size in MB"


Path Finder

There are more field extractions occurring in the heavier events, so that could well be the case.
