The purpose of this topic is to create a home for legacy diagrams on how indexing works in Splunk, created by the legendary Splunk Support Engineer, Masa! Keep in mind that the information and diagrams in this topic have not been updated since Splunk Enterprise 7.2. They used to live on an old Splunk community Wiki resource page that has been, or soon will be, taken down, but many users have said that they were and still are helpful.
Happy learning!
When we think about the life cycle of log events in Splunk, we can think about how data is collected (Input stage), how it is parsed and ingested into the Splunk database (Indexing stage), and then how it is kept in the database (hot -> warm -> cold -> frozen). In Splunk Docs or presentations, the Input and Indexing stages are often explained together under the topic of Getting Data In.
Splunk processes data through pipelines. A pipeline is a thread, and each pipeline consists of multiple functions called processors. There is a queue between pipelines. With these pipelines and queues, index-time event processing is parallelized.
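To make the thread-and-queue idea concrete, here is a minimal Python sketch of two pipelines connected by queues. It is only an illustration of the concept, not how Splunk is implemented: the processor functions (strip_line, add_length), the SENTINEL marker, and the queue sizes are all made up for this example.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-input marker used only in this sketch


def run_pipeline(processors, in_q, out_q):
    """One pipeline = one thread that applies its processors in order."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)  # pass the shutdown signal downstream
            break
        for proc in processors:
            item = proc(item)
        out_q.put(item)


# Hypothetical stand-ins for real processors.
def strip_line(event):
    return event.strip()


def add_length(event):
    return (event, len(event))


def process(raw_lines):
    # Small bounded queues: a slow downstream pipeline makes the upstream queue fill up.
    first_q = queue.Queue(maxsize=2)
    second_q = queue.Queue(maxsize=2)
    done_q = queue.Queue()  # unbounded so the last pipeline never blocks
    threads = [
        threading.Thread(target=run_pipeline, args=([strip_line], first_q, second_q)),
        threading.Thread(target=run_pipeline, args=([add_length], second_q, done_q)),
    ]
    for t in threads:
        t.start()
    for line in raw_lines:
        first_q.put(line)  # blocks while first_q is full
    first_q.put(SENTINEL)
    for t in threads:
        t.join()
    results = []
    while True:
        item = done_q.get()
        if item is SENTINEL:
            break
        results.append(item)
    return results
```

Because each queue in the sketch is bounded, a slow processor in the second pipeline eventually blocks the first one, which is the same kind of back-pressure you look for when a queue is reported as filling up.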
This flow chart information helps you understand which configuration applies at which processing stage (input, parsing, routing/filtering, or indexing). For troubleshooting, it also helps to know which processors or queues are affected when a queue is filling up or when a processor's CPU time is very high.
Some definitions to start…
What Pipelines do...
Main queues and processors for indexing events
[inputs]
-> parsingQueue
-> [utf8 processor, line breaker, header parsing]
-> aggQueue
-> [date parsing and line merging]
-> typingQueue
-> [regex replacement, punct:: addition]
-> indexQueue
-> [tcp output, syslog output, http output, block signing, indexing, indexing metrics]
-> Disk
*NullQueue could be connected from any queueoutput processor by configuration of outputs.conf
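For troubleshooting, the listing above can be turned into a small lookup table. The following Python sketch simply encodes the queue and processor names from the list; the function names (processors_draining, upstream_queues) are hypothetical helpers written for this example and are not part of any Splunk tool.

```python
# Queue -> processors that read from it, in pipeline order (taken from the listing above).
STAGES = [
    ("parsingQueue", ["utf8 processor", "line breaker", "header parsing"]),
    ("aggQueue", ["date parsing", "line merging"]),
    ("typingQueue", ["regex replacement", "punct:: addition"]),
    ("indexQueue", ["tcp output", "syslog output", "http output",
                    "block signing", "indexing", "indexing metrics"]),
]


def processors_draining(queue_name):
    """Return the processors that consume events from the given queue."""
    for name, processors in STAGES:
        if name == queue_name:
            return processors
    raise KeyError(queue_name)


def upstream_queues(queue_name):
    """Return the queues that sit before the given queue in the flow."""
    names = [name for name, _ in STAGES]
    return names[:names.index(queue_name)]
```

For example, if typingQueue is filling up, processors_draining("typingQueue") points at regex replacement and punct:: addition as the first places to look, and upstream_queues("typingQueue") lists the queues that may start backing up next.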
Data in Splunk moves through the data pipeline in phases. Input data originates from inputs such as files and network feeds. As it moves through the pipeline, processors transform the data into searchable events that encapsulate knowledge.
The following figure shows how input data traverses event-processing pipelines (which are the containers for processors) at index-time. Upstream from each processor is a queue for data to be processed.
The next figure is a different version of how input data traverses pipelines, this time together with the bucket life cycle. It shows hot, warm, cold, and frozen buckets. How data is stored in buckets and indexes is another good topic to learn.
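As a rough illustration of the bucket life cycle only, the hot -> warm -> cold -> frozen progression can be written as a tiny state sequence in Python. The names BucketState and next_state are invented for this sketch, and it says nothing about the conditions under which Splunk actually rolls a bucket.

```python
from enum import Enum


class BucketState(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"
    FROZEN = "frozen"


# Enum members iterate in definition order, which matches the life cycle.
_ORDER = list(BucketState)


def next_state(state):
    """Return the state a bucket rolls to next, or None after frozen."""
    i = _ORDER.index(state)
    return _ORDER[i + 1] if i + 1 < len(_ORDER) else None
```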
Detail Diagram - Standalone Splunk
Detail Diagram - Universal Forwarder to Indexer