Splunk Search

difference between pipe summaryindex and collect

LearningGuy
Builder

Hello,

1) What is the difference between using "| summaryindex" and "| collect"?
Thank you for your help.

The summaryindex command is generated by a scheduled report. When I clicked "View Recent", the following was appended to the search:

 

| summaryindex     spool=t    uselb=t   addtime=t     index="summary" file="summary_test_1.stash_new"    name="summary_test_1" marker="hostname=\"https://test.com/\",report=\"summary_test_1\""

 




Collect can be used to push data to an index outside of a scheduled report.
2) Can "| summaryindex" also be used to push data outside of a scheduled report?

 

| collect    index=summary_test_1    testmode=false     marker="report=summary_test_1"

 


bowesmana
SplunkTrust

As @PickleRick says, please ignore the generative AI response.

collect is the documented command and it is what you should use when you want to save data to an index from an SPL command 

https://docs.splunk.com/Documentation/SplunkCloud/9.0.2305/SearchReference/Collect

summaryindex is the command that is still used internally by Splunk when you enable summary indexing from within a scheduled saved search and is effectively a synonym for collect. Don't use it - it is not a documented command.

A "summary index" is perhaps a poor name for the concept - collect allows you to push anything you like to an index and there is nothing special about that index. Yes, the original intention is that it should contain "summarised data", but in practice a summary index is just an index.

Note that the behaviour of _time when you collect data to an index is not well documented. It can change depending on what your data looks like and if your search is done from a scheduled report or not.

 


jotne
Builder

I tried to find documentation stating that summaryindex is an alias for collect, but as far as I can see it is not documented. However, if you start typing | summaryin, Splunk shows the info for the collect command. So yes, it's the same command.

(Screenshot attachment: collect.png)

LearningGuy
Builder

Hello,
Thank you for your explanation.
1) I ran the following search ad hoc (not as a scheduled report). It pushes the data from the original index to the summary index:

 

index=originalindex
``` ---- multiple searches ----- ```
| table ID, name, address
| summaryindex spool=t uselb=t addtime=t index="summary" file="summary_test_1.stash_new" name="summary_test_1" marker="hostname=\"https://test.com/\",report=\"summary_test_1\""

 


2) I then ran index=summary report="summary_test_1".
It returned data containing ID, name, and address.

It appears that the first search pushed the data to index=summary report="summary_test_1", so this command is not tied only to a scheduled report as mentioned earlier.

So what is the difference between summaryindex and collect if they provide the same function?

Thanks



kiran_panchavat
Builder

Attention: this is an AI-generated answer and it is wrong.
Moderator

 

@LearningGuy 

Let’s delve into the differences between | summaryindex and | collect in Splunk:

| summaryindex:

Purpose:

| summaryindex is primarily used for creating and managing summary indexes.

A summary index is a pre-aggregated index that stores summarized data from your original events.

It’s useful for speeding up searches and reducing the load on your search infrastructure.

How It Works: When you use | summaryindex, it generates summary data based on existing reports. This means that you can create a summary index only from scheduled reports.

Example Usage: If you have a scheduled report that summarizes data, you can pipe it

| collect:

Purpose: | collect is a versatile command that allows you to push data to a new index. Unlike | summaryindex, it’s not limited to existing reports.

How It Works: You can use | collect to send specific data to an index of your choice. This is particularly useful when you want to extract relevant information from your search results and store it in a separate index. Name of the summary index where the events are added. The index must exist before the events are added. The index is not created automatically.

Example Usage: Suppose you want to create a custom index called “test_summary” to store specific data. You can use | collect index=test_summary to achieve this. The testmode=false ensures that the data is actually indexed.

In summary, while both commands involve indexing data, | summaryindex is tied to scheduled reports, whereas | collect provides more flexibility for pushing data to custom indexes regardless of report schedules.

Remember that creating the summary index (whether through | summaryindex or | collect) requires defining the index specifications in indexes.conf beforehand. Happy Splunking! 🚀

https://docs.splunk.com/Splexicon:Summaryindex 

https://docs.splunk.com/Documentation/Splunk/9.2.0/Knowledge/Usesummaryindexing 

https://docs.splunk.com/Documentation/Splunk/9.2.0/Knowledge/Managesummaryindexgapsandoverlaps 

https://docs.splunk.com/Documentation/SplunkCloud/9.1.2308/SearchReference/Collect 

I hope this helps, if any reply helps you, you could add your upvote/karma points to that reply, thanks.

PickleRick
SplunkTrust

@kiran_panchavat Please stop spreading misinformation (especially misinformation created by generative language models).

The summaryindex command is an alias for the collect command. There is absolutely no difference in behaviour of those two commands since they're the same command which can be called with either name.

This is just my speculation but I suspect the command was originally called summaryindex because it was meant to collect data for summary indexing but was later "generalized" to the "collect" name which is the current command name in docs and the "summaryindex" command name was retained for backward compatibility reasons.

LearningGuy
Builder

Hello @PickleRick   @bowesmana  @jotne 

When I ran the query with the summaryindex command, the data got pushed just fine, as described in my previous response to kiran.

When I ran the query with the collect command, the data did not get pushed.
I could see the _raw data when I used testmode=true, but when I set testmode=false, the search ran and the data didn't show up (even though I searched over All time).

A different issue: I also tried to set _time to info_max_time by setting addtime=false and using the command below, but _time was always set to the current time. (I am aware that with addtime=true it defaults to info_min_time.)
| eval _time=strftime(info_max_time,"%m/%d/%y %I:%M:%S %p")
I can open another post for this if needed.

Please suggest. I appreciate your help. Thank you.


PickleRick
SplunkTrust

As @bowesmana pointed out, _time is a field that holds the event's timestamp as a number (seconds since the epoch). It only gets formatted for display, by default in the web UI. The _time field is treated specially; you can check this for yourself:

| makeresults | eval _time=0

If you render your _time to a string... Honestly, I have no idea what will happen. Splunk will not use the value because it's not a number, but whether it sets it to zero or treats the field as non-existent, I cannot tell. Anyway, the results will definitely not be what you expect.
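To illustrate the number-versus-string point outside of Splunk, here is a minimal Python sketch (not SPL). It formats an epoch value with the same format string used in the question and shows that the result is text, not a number. The helper names format_epoch and is_valid_time_value are hypothetical, and UTC is assumed so the output is reproducible, whereas Splunk renders times in the user's time zone.

```python
from datetime import datetime, timezone


def format_epoch(epoch_seconds: int, fmt: str = "%m/%d/%y %I:%M:%S %p") -> str:
    """Render an epoch timestamp as text, similar in spirit to SPL's strftime()."""
    # Assumption: UTC is used here only to make the result deterministic.
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime(fmt)


def is_valid_time_value(value) -> bool:
    """A _time-style value should be a number of seconds since the epoch."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


epoch = 0                      # same value as in "| makeresults | eval _time=0"
as_text = format_epoch(epoch)  # a formatted string, no longer a number

print(epoch, is_valid_time_value(epoch))
print(as_text, is_valid_time_value(as_text))
```

The formatted value is fine for display, but it is no longer something that can be treated as a timestamp, which is why assigning the output of strftime to _time does not do what was intended.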

bowesmana
SplunkTrust

_time in the data is ignored with collect and _time should only ever be an epoch anyway - it's a Splunk reserved field, so making it a string is a bad idea.

Are you collecting a _raw field or are you collecting fields without _raw?

Are you specifying an index to collect to?

What's your collect command?

Are you running this as an ad-hoc search or as a scheduled saved search?

If you are not specifying _raw, the first value in the line of collected data will be the one parsed for the timestamp, which is why addtime adds the info_* fields to the start of the data.

I find that the safest way to build an event when you want control over _time is the following: construct _raw yourself and collect only _raw, which ensures that your timestamp is the only one in the event.

| eval _raw=printf("_time=%d", your_epoch_time_field)
| foreach "*" 
    [| eval _raw=_raw.case(isnull('<<FIELD>>'),"",
                           mvcount('<<FIELD>>')>1,", <<FIELD>>=\"".mvjoin('<<FIELD>>',"###")."\"", 
                           true(), ", <<FIELD>>=\"".'<<FIELD>>'."\"") 
    | fields - "<<FIELD>>" ] 

This simply builds a _raw field with null fields ignored and the other fields quoted. It also flattens multi-value fields.
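For readers who find the foreach template hard to follow, the same idea can be sketched in Python (not SPL). This is only an illustration of the logic under stated assumptions: the function name build_raw, the sample field event_epoch, and the sample event are hypothetical, and the rule of skipping underscore-prefixed fields is an assumption about how the "*" wildcard behaves.

```python
def build_raw(event: dict, epoch_field: str) -> str:
    """Build a key="value" string that starts with _time=<epoch>.

    Mirrors the idea of the SPL above: null fields are skipped, multi-value
    fields are joined with "###", and each remaining value is double-quoted.
    """
    raw = f"_time={int(event[epoch_field])}"
    for name, value in event.items():
        if name.startswith("_"):
            # Assumption: like foreach "*", internal underscore fields are skipped.
            continue
        if value is None:
            continue  # isnull(...) branch: append nothing
        if isinstance(value, (list, tuple)):
            if not value:
                continue  # an empty multi-value field is treated as null
            value = "###".join(str(v) for v in value)  # mvjoin(...) branch
        raw += f', {name}="{value}"'
    return raw


# Hypothetical sample event; "note" is null and "address" is multi-valued.
sample = {
    "event_epoch": 1700000000,
    "ID": 42,
    "name": "alice",
    "address": ["1 Main St", "2 Side St"],
    "note": None,
}
print(build_raw(sample, "event_epoch"))
```

Because the timestamp is the first key in the string, it is the first value available when the collected event is parsed for a time.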

If you have access to the underlying OS you can use the spool flag so the file is left in the file system and you can go and see the real file that would be ingested to the index.

LearningGuy
Builder

Hello @bowesmana 

Q: Are you collecting a _raw field or are you collecting fields without _raw?
>> I am not sure what you mean. My understanding is that _raw is what got pushed to index=summary.

Q: Are you specifying an index to collect to?   What's your collect command?
I figured out why the collect command didn't push the data: I had put in the wrong index name (summary_test_1 instead of summary). The corrected command is below.
| collect index=summary testmode=false file=summary_test_1.stash_new name="summary_test_1" marker="report=\"summary_test_1\""

Q:Are you running this as an ad-hoc search or as a scheduled saved search?
I ran this as an ad-hoc search as a proof of concept using the past time
Once it's working I will use a scheduled saved search for future time

I added your suggestion to my search below and it worked, although I don't completely understand how.   Note that addtime=true/false didn't make any difference
I appreciate your help.  Thank you.  
If you have an easier way, please suggest 🙂

 

 

index=original_index
``` Query ```
| table ID, name, address
| addinfo
| eval _raw=printf("_time=%d", info_max_time)
| foreach "*" 
    [| eval _raw=_raw.case(isnull('<<FIELD>>'),"",
                           mvcount('<<FIELD>>')>1,", <<FIELD>>=\"".mvjoin('<<FIELD>>',"###")."\"", 
                           true(), ", <<FIELD>>=\"".'<<FIELD>>'."\"") 
    | fields - "<<FIELD>>" ] 
| collect index=summary testmode=false addtime=true file=summary_test_1.stash_new name="summary_test_1" marker="report=\"summary_test_1\""

 


