Splunk Search

Is it normal for the loadjob command in 4.0.10 to load only approx. 150k events with the events=t parameter?

Path Finder

I'm currently experiencing this:

1) Run a query that returns a large number of events (say, 1mil)

2) Save the job

3) Load the events using | loadjob events=t

Step 3 only loads roughly 150k events, whereas clicking the link in the Job Manager directly returns the full result set.

Is this normal? If this is a bug, was this fixed in 4.1?

1 Solution

Contributor

For any given search, Splunk will only retain a limited number of raw events, i.e., the actual event data pulled out of the index by the search command. The results of a search -- what is produced by evaluating the entire search string -- are preserved in full, alongside the field summary and timeline information. The data contained in the *.csv.gz files is capped by settings in limits.conf.
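If you need to raise that cap, the relevant control lives in the [search] stanza of limits.conf. A minimal sketch, assuming the max_count setting documented in limits.conf.spec -- verify the setting name and default against your Splunk version before changing anything:

```
# $SPLUNK_HOME/etc/system/local/limits.conf
# Sketch only -- confirm names/defaults in limits.conf.spec for your version.
[search]
# Cap on the number of events/results a search job retains in its
# dispatch directory.
max_count = 500000
```

Changes here are global; restart (or per the docs, a debug refresh) is needed for them to take effect.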

To illustrate the necessity for this, consider this simple example:

search source="*apache_access.log" 200

If your index contains 2 billion events that match the search, storing all 2 billion events every time you run that search would consume the entire storage system in no time. Generally speaking, the 2 billion row data set is not what you're after -- it's the summarized or transformed version that is of interest.
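For example, a transforming command reduces those 2 billion rows to a small result set that Splunk can keep in full. A sketch, with an illustrative field name (clientip is a common auto-extracted field for access logs, but your data may differ):

```
source="*apache_access.log" 200 | stats count by clientip
```

The stats output is what gets fully preserved in the job, not the 2 billion underlying events.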

Note that the limitation described here does not mean that Splunk cannot handle lots of events. The search language will process all events asked of it, but will abide by these practical safety controls and not cache all of the raw data.

For perspective, search for a word like "computer" on Google. Even though Google reports roughly 705,000,000 hits, you will only ever be able to access up to 1,000 results, because it is impractical to store that result set in its entirety (aside from the fact that nobody is actually going to look at 705M links).


Path Finder

Anyone? I can't offer a bounty yet, since I don't have enough rep to give away 😛



Super Champion

Could someone provide the actual setting name and stanza in limits.conf? I'm also wondering whether it's possible to override this behavior for a specific search instead of changing it globally.

Communicator

I'm curious about overriding this behavior locally too... I have 27 dashboard panels to build that are each a unique filtering of the same underlying search, which returns 80,000+ results. I'd prefer to run that underlying search once, then call loadjob and add the filters. It works fine when the filtered results are under 10,000, but beyond that the output truncates.
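For reference, a sketch of that pattern, with a hypothetical saved-search name (loadjob accepts either a job sid or a savedsearch="owner:app:name" reference):

```
| loadjob savedsearch="admin:search:base_dashboard_search" | search status=500
```

Each panel would vary only the trailing filter; but note the loaded job is still subject to the retention cap discussed above, which would explain the truncation past 10,000.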


Path Finder

Assuming this answer explains the behavior, then the events=t parameter for loadjob loses its meaning (or is misrepresented/misunderstood by me :P), since the documentation says it "Loads the events that were generated by the search job..."

Though there's a strong case for conserving disk space, I'd think that when a job is saved there's an implicit expectation of being able to reference all the raw events later, in addition to the summarized/transformed data, with the appropriate loadjob command. 🙂

Super Champion

Did some digging in the dispatch folder, and now I'm really confused. If I look in the events folder, there are a number of *.csv.gz files which, added up, contain only 10,000 events. But if I pull the job up in the Job Manager, I can see all 200,000 events. So where are the other 190,000 events stored? I must be missing something.
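For anyone else digging, here's one way to tally the events in those slices. The real files live under $SPLUNK_HOME/var/run/splunk/dispatch/&lt;sid&gt;/events; the sketch below fabricates two sample slices in a temp directory so the commands are runnable as-is:

```shell
# Build two sample *.csv.gz slices mimicking a dispatch events folder.
dir=$(mktemp -d)
printf '_time,_raw\n1,eventA\n2,eventB\n' | gzip > "$dir/results_0.csv.gz"
printf '_time,_raw\n3,eventC\n' | gzip > "$dir/results_1.csv.gz"

# Sum the data rows across all slices, skipping each file's CSV header row.
total=0
for f in "$dir"/*.csv.gz; do
  n=$(gzip -dc "$f" | tail -n +2 | wc -l)
  total=$((total + n))
done
echo "total events: $total"
rm -r "$dir"
```

Running the same loop against a real job's events folder should reproduce the 10,000 count you're seeing; whatever the Job Manager shows beyond that presumably comes from somewhere other than these slices.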


Super Champion

I can confirm this behavior in Splunk 4.1. I ran the search * | head 200000, saved the job, and when I try to load it with loadjob <sid> events=t, I only get 10,000 events. (I looked in limits.conf but didn't see any relevant settings there.)
