I have a scheduled search to extract a tiny subset of my data set and attempt to perform a field extraction on the name of the source file:
index=<myindex> host=xyz* <string to find> | eval report="my_summary_detail" | eval logtime=_time | table _time, report, host, source, logtime, _raw | rex field=source "XYZ*-(?<upload_date>\d*-\d\d-\d\d \d\d-\d\d-\d\d)Z.txt" | convert timeformat="%Y-%m-%d %H-%M-%S" mktime(upload_date) as uploadtime
When I run this in Search, it returns exactly what I want. When I schedule the search and tell it to use summary indexing, the fields in my table (report, host, source, logtime, uploadtime) are all lost, and all I see in the summary index is the time and _raw.
I want to extract this data and then do further processing on it, such as building multiple other summaries on top of this one with stats/timecharts. My goal is to perform the rex only once, and to separate these records from the other 99% that don't match my query.
FYI - when I scheduled the search, I used a new index just for this report (not index=summary).
Am I doing something wrong? Is this use of a summary index impossible? Do I have to do something special when I create my separate index?
Thanks,
Brad

Just form a concatenation of the result data and keep it as custom _raw data in the summary index. This is basically a workaround, but it's also a good solution:
....... | eval newraw=newtime . " report=\"" . report. "\",logtime=\"" . logtime . "\",origsource=\"" ...
| eval _raw=newraw | collect index=my_summary

Great... does that mean your original question has been answered?
yes. Thanks for the advice.

Could you explain what you did to add your data to the summary index? Did you end up changing the search string to use a reporting command, or did you somehow use the collect command to place your data in the summary index?
Figured out why collect wasn't performing field extraction: in transforms.conf, the stash sourcetype was configured to split fields on commas, not whitespace, so putting commas between the key/value pairs in the new _raw makes it work:
... | eval newraw=newtime . " report=\"" . report. "\",logtime=\"" . logtime . "\",origsource=\"" ... | eval _raw=newraw | collect index=my_summary
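For reference, a fuller sketch of that comma-delimited version, using the field names that appear elsewhere in this thread and my_summary as a stand-in index name (the eval newtime=_time step is an assumption about where newtime comes from):
... | eval newtime=_time
| eval newraw=newtime . " report=\"" . report . "\", logtime=\"" . logtime . "\", origsource=\"" . source . "\", upload_date=\"" . upload_date . "\", uploadtime=\"" . uploadtime . "\", orighost=\"" . host . "\""
| eval _raw=newraw
| collect index=my_summary
With commas between the key="value" pairs, the stash extraction should pick the fields back up at search time.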

Some difference is to be expected when comparing a very dense search to a slightly less dense one. However, my point is that the biggest benefit from summary indexing comes when you actually do some summarizing.
Say you want to report on counts per uploadtime later. If you calculate preliminary counts per uploadtime, say every hour or every five minutes, and store those in the summary index, you'll see your search on that drop another order of magnitude or three.
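For example, a scheduled search along these lines (a sketch only - the hourly span, the placeholder index and search string from the original post, and the collect target are assumptions) stores one row per host, upload and hour instead of the raw events:
index=<myindex> host=xyz* <string to find>
| rex field=source "XYZ*-(?<upload_date>\d*-\d\d-\d\d \d\d-\d\d-\d\d)Z.txt"
| bin _time span=1h
| stats count by _time, host, upload_date
| collect index=my_summary
Because the stats output has no _raw, collect serializes the result fields themselves, and later reports only have to aggregate those precomputed count values.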
Alternatively, build one of your final searches with a reporting command at the end and enable report acceleration on that - let Splunk worry about making it blazingly fast.
As another alternative, build an accelerated data model on top of your data and use the pivot interface to build your final reports.
In fact, probably start from the bottom of this list... if all these more Splunky ways don't achieve the desired speed then you can start thinking of tinkering.
Report acceleration simply doesn't get the job done. My queries still take hours to run.
I understand that things could be "even faster" if I summarized my output more, but I need the details and _raw. Even if I end my query with stats count by a, b, c, d, for some reason I still don't get those fields in my summary index.
So the point is that I can get a 10x faster query by the simple expedient of creating a new summary index, annotating the input _raw with the set of calculated fields that I need, and feeding that data into the new index.
But no matter what I try, Splunk refuses to feed the calculated fields into the new summary index and retains only the original input events. That's actually worse, because the original source field is lost, and as you can see I need to extract a field from it.
Can we please go back to my original question: How can I feed the output from my query into a new index that retains all my calculated fields?

Sure - if you're including _raw, then you need to add your calculated fields to the _raw text. That way they stop being transient fields in that search and instead get written to the summary index, where they will be search-time extracted, so it's easiest to add them as key="value" pairs.
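A minimal illustration of that idea (makeresults and the my_summary index name are just placeholders, and the commas match the stash-extraction point made above):
| makeresults
| eval report="my_summary_detail", logtime=_time
| eval _raw=logtime . " report=\"" . report . "\", logtime=\"" . logtime . "\""
| collect index=my_summary
After collecting, a search on index=my_summary should show report and logtime as extracted fields on the stash events.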
Thanks for hanging in there with the questions. I generated the results like this:
... | eval newraw=newtime . " report=\"" . report . "\" logtime=\"" . logtime . "\" origsource=\"" . source . "\" upload_date=\"" . upload_date . "\" uploadtime=\"" . uploadtime . "\" orighost=\"" . host . "\"" | eval _raw=newraw | collect index=eng_summary3
The events are in eng_summary3 with sourcetype="stash" as I would expect, but when I query them the fields are not extracted as expected. I'm using query mode "verbose" (and "smart").
Is there something about sourcetype="stash" that changes or blocks search-time field extraction?
P.S. In the actual query string I am escaping the double quotes inside the string constants, but when pasted here the escape \ characters were removed.

In order to find those events in the full index, Splunk doesn't need to scan all the data, just the events that match your keywords. Run your basic search over some time range and compare the events-scanned vs. events-matched counts as it runs. If the two are similar, then Splunk is not loading a lot of irrelevant events.
That's kinda the point of an index: finding stuff quickly.
I tried this with a time range (1w) that returned ~500k events. The events scanned vs. matched are very close (<500 difference in 500k), but pulling those 500k events from the big index took 486 seconds. Pulling them from the "summary" took 30 seconds.
[Except of course that my summary doesn't have the original source field I need]
Is there any good documentation on how Splunk indexes its data? As shown here, it seems to take forever to find things in my large indexes.

Do read this and the following pages for a little background: http://docs.splunk.com/Documentation/Splunk/6.2.0/Knowledge/Aboutsummaryindexing

I see.
a: provided your search string is some kind of token that's well-defined, e.g. not `*foo` with wildcards at the front, I don't see how that's slower to find than finding `my_summary_detail`.
b: when loading the events from the summary index you'd run search-time field extraction again. That doesn't hurt though, because it's blazingly fast anyway - especially on short fields such as source.
c: this won't reduce the amount of data that needs to be loaded because no summarizing is happening. You're just copying stuff elsewhere.
Running the same search over this summary index will be similarly slow, because the bottleneck will likely be loading the raw events off disk.
To actually make this faster you need to use some kind of reporting command such as stats or timechart. What's best here depends on what further processing you're planning to do.
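For instance, if a scheduled search stored hourly counts as sketched earlier, the final report only has to roll those rows up (the my_summary index and the count field come from that sketch, not from this thread):
index=my_summary
| timechart span=1d sum(count) as events by host
That search touches a few rows per host per hour rather than the raw events, which is where the order-of-magnitude savings come from.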
Thanks for the reply - please be patient with one more question. You said: "this won't reduce the amount of data that needs to be loaded because no summarizing is happening. You're just copying stuff elsewhere."
I don't understand this comment. I understand it won't reduce the amount of data I need for the results I want, but I do expect it to reduce the total I/O for the entire search by 98-99%, since I won't be scanning an index full of rows that I know I don't want.
For example, suppose the original source index has 1700m rows and is 90GB, but the "summary" only has ~5m rows and is 100MB. 100MB may not be tiny, but it seems like it should be a lot faster just because there is less data to search. I think my mental model for how Splunk searches an index must be very broken.
How does Splunk pull the right 5m rows out of the original 1.7 billion as quickly as it can scan 5m rows in a separate index?
The string I'm searching for is a multi word string, like this: "crashhandler.sh: crash detected, recovery complete"
(updated counts)

Not quite what you're asking, but what's the point of using a summary index for this? You're not actually summarizing anything... no stats, for example. You're basically copying part of an index into a second index - I'd doubt that'll be much faster than just querying the original index.
As for dropping just the timestamp and _raw text into the summary index, I think that's expected behaviour for summarizing searches that include _raw.
I probably misunderstand the purpose of a summary index, but, here's why I thought this was a good idea. Please correct my misunderstandings.
I expect a summary here to be faster compared to searching over the original data because:
(a) I don't have to run the match on _raw over hundreds of millions of records to get the 1-2% I want, since they will be tagged with a single field (report) in a smaller overall index.
(b) I don't have to re-run the rex for every reporting query I do, since I have already done that conversion.
(c) I want to feed this data set into multiple summarizing queries, and I don't want to incur the large scan costs over and over.
Running just this query over the time horizon I need takes many hours, so I'm looking for anything I can do to make the development of my final summary queries faster.
I don't mind getting the timestamp and _raw - I just don't understand why I get none of my other fields...?
If I look in the results.csv.gz file in the dispatch folder for the scheduled job, they look correct (all the fields are there). But for some reason they are not in the target index when I query it.
