Solved: difference between pipe summaryindex and collect

LearningGuy · ‎02-29-2024

Hello,

1) What is the difference between using "| summaryindex" and "| collect"?
Thank you for your help.

The summaryindex command is generated by a scheduled report. When I clicked "View Recent", the following had been appended to the search:

| summaryindex     spool=t    uselb=t   addtime=t     index="summary" file="summary_test_1.stash_new"    name="summary_test_1" marker="hostname=\"https://test.com/\",report=\"summary_test_1\""

collect can be used to push data to an index outside of a scheduled report:

| collect index=summary_test_1 testmode=false marker="report=summary_test_1"

2) Can "| summaryindex" also be used to push data outside of a scheduled report?

bowesmana · ‎03-03-2024

As @PickleRick says, please ignore the generative AI response.

collect is the documented command, and it is what you should use when you want to save data to an index from an SPL search.

https://docs.splunk.com/Documentation/SplunkCloud/9.0.2305/SearchReference/Collect

summaryindex is the command that is still used internally by Splunk when you enable summary indexing from within a scheduled saved search and is effectively a synonym for collect. Don't use it - it is not a documented command.

A "summary index" is perhaps a poor name for the concept - collect allows you to push anything you like to an index and there is nothing special about that index. Yes, the original intention is that it should contain "summarised data", but in practice a summary index is just an index.

Note that the behaviour of _time when you collect data to an index is not well documented. It can change depending on what your data looks like and if your search is done from a scheduled report or not.
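
As a rough illustration of the point above, here is a minimal sketch of an ad-hoc collect followed by a search that reads the data back. It reuses the index, field names and marker from the question in this thread. It assumes that an index named summary already exists and that you have permission to write to it.

```
index=originalindex
| table ID, name, address
| collect index=summary testmode=false marker="report=\"summary_test_1\""
```

The marker text is appended to each collected event, so the events can later be found with:

```
index=summary report="summary_test_1"
```

Setting testmode=true instead lets you preview what would be written without indexing anything.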

jotne · ‎03-03-2024

I tried to find documentation stating that summaryindex is an alias for collect, but as far as I can see it is not documented. However, if you start typing | summaryin, Splunk shows the help for the collect command. So yes, it's the same command.

LearningGuy · ‎02-29-2024

Hello,
Thank you for your explanation.
1) I ran the following search ad hoc (not from a scheduled report). It pushes the data from the original index to the summary index:

index=originalindex
``` ---- multiple searches ---- ```
| table ID, name, address
| summaryindex spool=t uselb=t addtime=t index="summary" file="summary_test_1.stash_new" name="summary_test_1" marker="hostname=\"https://test.com/\",report=\"summary_test_1\""

2) I ran index=summary report="summary_test_1"
It gave me the data that contains ID, name, address

So the first search did push the data to index=summary with report="summary_test_1", which means this command is not tied only to scheduled reports, as you mentioned earlier.

So, what is the difference between summaryindex and collect if they provide the same function?

Thanks

kiran_panchavat · ‎02-29-2024

Moderator note: Attention, this is an AI-generated answer and it is wrong.

@LearningGuy

Let’s delve into the differences between | summaryindex and | collect in Splunk:

| summaryindex:

Purpose:

| summaryindex is primarily used for creating and managing summary indexes.

A summary index is a pre-aggregated index that stores summarized data from your original events.

It’s useful for speeding up searches and reducing the load on your search infrastructure.

How It Works: When you use | summaryindex, it generates summary data based on existing reports. This means that you can create a summary index only from scheduled reports.

Example Usage: If you have a scheduled report that summarizes data, you can pipe it

| collect:

Purpose: | collect is a versatile command that allows you to push data to a new index. Unlike | summaryindex, it’s not limited to existing reports.

How It Works: You can use | collect to send specific data to an index of your choice. This is particularly useful when you want to extract relevant information from your search results and store it in a separate index. Name of the summary index where the events are added. The index must exist before the events are added. The index is not created automatically.

Example Usage: Suppose you want to create a custom index called “test_summary” to store specific data. You can use | collect index=test_summary to achieve this. The testmode=false ensures that the data is actually indexed.

In summary, while both commands involve indexing data, | summaryindex is tied to scheduled reports, whereas | collect provides more flexibility for pushing data to custom indexes regardless of report schedules.

Remember that creating the summary index (whether through | summaryindex or | collect) requires defining the index specifications in indexes.conf beforehand. Happy Splunking! 🚀

https://docs.splunk.com/Splexicon:Summaryindex

https://docs.splunk.com/Documentation/Splunk/9.2.0/Knowledge/Usesummaryindexing

https://docs.splunk.com/Documentation/Splunk/9.2.0/Knowledge/Managesummaryindexgapsandoverlaps

https://docs.splunk.com/Documentation/SplunkCloud/9.1.2308/SearchReference/Collect

I hope this helps, if any reply helps you, you could add your upvote/karma points to that reply, thanks.

PickleRick · ‎02-29-2024

@kiran_panchavat Please stop spreading misinformation (especially misinformation created by generative language models).

The summaryindex command is an alias for the collect command. There is absolutely no difference in behaviour of those two commands since they're the same command which can be called with either name.

This is just my speculation but I suspect the command was originally called summaryindex because it was meant to collect data for summary indexing but was later "generalized" to the "collect" name which is the current command name in docs and the "summaryindex" command name was retained for backward compatibility reasons.

LearningGuy · ‎03-04-2024

Hello @PickleRick @bowesmana @jotne

When I ran the query with the summaryindex command, the data from the query got pushed just fine, as described in my previous reply to kiran.

When I ran the query with the collect command, the data from the query did not get pushed.
I could see the _raw data when I used testmode=true, but when I set testmode=false, the search ran and the data didn't show up (even though I set the time range to All time).

Different issue: I also tried to set _time to info_max_time by setting addtime=false and using the command below, but _time was always set to the current time. (I am aware that by default it is set to info_min_time if addtime=true.)
| eval _time=strftime(info_max_time,"%m/%d/%y %I:%M:%S %p")
I can open another post for this if needed.

Please suggest. I appreciate your help. Thank you

PickleRick · ‎03-05-2024

As @bowesmana pointed out, _time is a field that holds the event's timestamp as a number (seconds since the epoch). It is only formatted for display, by default in the web UI, because the _time field is treated specially. You can check this for yourself:

| makeresults | eval _time=0

If you render _time to a string... honestly, I have no idea what will happen. Splunk will not use the value because it's not a number, but whether it sets it to zero or treats the field as non-existent, I cannot tell. Either way, the results will definitely not be what you expect.
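
To make the point concrete, here is a small sketch that keeps _time numeric and puts the formatted string in a separate field. The field name readable_time is made up for illustration. It assumes an ad-hoc search with a bounded time range; with an All time range, info_max_time may not be a usable epoch value.

```
| makeresults
| addinfo ``` adds info_min_time, info_max_time and related fields ```
| eval _time=info_max_time ``` _time stays a number (epoch seconds) ```
| eval readable_time=strftime(_time, "%m/%d/%y %I:%M:%S %p") ``` the string goes in its own field ```
| table _time, readable_time, info_min_time, info_max_time
```

The web UI should still render _time as a date, while readable_time is an ordinary string that Splunk does not treat as a timestamp.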

bowesmana · ‎03-04-2024

_time in the data is ignored with collect and _time should only ever be an epoch anyway - it's a Splunk reserved field, so making it a string is a bad idea.

Are you collecting a _raw field, or are you collecting fields without _raw?

Are you specifying an index to collect to?

What's your collect command?

Are you running this as an ad-hoc search or as a scheduled saved search?

If you are not specifying _raw, the first value in the line of collected data is the one parsed for the timestamp, which is why addtime adds the info_* fields to the start of the data.

If you want control over _time, I find the safest way to build an event is to do the following and collect only _raw. That ensures my timestamp is the only one in the event.

| eval _raw=printf("_time=%d", your_epoch_time_field)
| foreach "*" 
    [| eval _raw=_raw.case(isnull('<<FIELD>>'),"",
                           mvcount('<<FIELD>>')>1,", <<FIELD>>=\"".mvjoin('<<FIELD>>',"###")."\"", 
                           true(), ", <<FIELD>>=\"".'<<FIELD>>'."\"") 
    | fields - "<<FIELD>>" ]

This simply builds a _raw field with null fields ignored and other fields quoted. It also flattens multi-value fields.

If you have access to the underlying OS you can use the spool flag so the file is left in the file system and you can go and see the real file that would be ingested to the index.
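
A self-contained way to try the _raw-building approach above without touching real data is sketched below. The sample values and the event_time field are invented for illustration, and an index named summary is assumed to exist. With testmode=true nothing is written, so the search results show the _raw that would have been indexed.

```
| makeresults
| eval ID=1, name="alice", address="1 Main St" ``` made-up sample fields ```
| eval event_time=relative_time(now(), "-1h") ``` hypothetical epoch field to use as the event time ```
| eval _raw=printf("_time=%d", event_time)
| foreach "*"
    [| eval _raw=_raw.case(isnull('<<FIELD>>'),"",
                           mvcount('<<FIELD>>')>1,", <<FIELD>>=\"".mvjoin('<<FIELD>>',"###")."\"",
                           true(), ", <<FIELD>>=\"".'<<FIELD>>'."\"")
    | fields - "<<FIELD>>" ]
| collect index=summary testmode=true
```

Because foreach "*" folds every remaining field into _raw, event_time itself also ends up in the event as a key-value pair. Remove it before the foreach if you do not want it there.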

LearningGuy · ‎03-05-2024

Hello @bowesmana

Q: Are you collecting a _raw field or are you collecting fields without _raw?
>> I am not sure what you mean. My understanding is that _raw is what gets pushed to index=summary.

Q: Are you specifying an index to collect to? What's your collect command?
I figured out why the collect command didn't push the data: I used the wrong index name. I struck through the incorrect name below.
| collect index=summary ~~summary_test_1~~ testmode=false file=summary_test_1.stash_new name="summary_test_1" marker="report=\"summary_test_1\""

Q:Are you running this as an ad-hoc search or as a scheduled saved search?
I ran this as an ad-hoc search over a past time range, as a proof of concept.
Once it's working, I will use a scheduled saved search going forward.

I added your suggestion to my search below and it worked, although I don't completely understand how. Note that addtime=true/false didn't make any difference
I appreciate your help. Thank you.
If you have an easier way, please suggest 🙂

index=original_index
``` Query ```
| table ID, name, address ``` keep only the fields to summarise before _raw is built ```
| addinfo
| eval _raw=printf("_time=%d", info_max_time)
| foreach "*"
    [| eval _raw=_raw.case(isnull('<<FIELD>>'),"",
                           mvcount('<<FIELD>>')>1,", <<FIELD>>=\"".mvjoin('<<FIELD>>',"###")."\"",
                           true(), ", <<FIELD>>=\"".'<<FIELD>>'."\"")
    | fields - "<<FIELD>>" ]
| collect index=summary testmode=false addtime=true file=summary_test_1.stash_new name="summary_test_1" marker="report=\"summary_test_1\""
