
Search capabilities of Splunk: how powerful is it really?

wajihullahbaig
Explorer

I am new to Splunk, just three days into it. I have been using Lucene for indexing and searching raw data, both fielded and unfielded, and I am very impressed with Lucene's search performance. I was wondering if the experienced community could guide me on a few capabilities of Splunk, specifically in comparison with what I already know about Lucene, and not limited to search.

  • How does Splunk handle stop words? These are very common words such as "a", "the", "is", which we can provide manually to Lucene.
  • Does Splunk perform wildcard searches, proximity searches, and regex searches? I know it can do fielded searches.
  • What optimizations are applied to the indices, especially compression?
  • Is it possible to do fuzzy and synonym-based searches in Splunk?

I know this must be a lengthy question, but I would definitely like to hear some points from people experienced with Splunk.

Thank you.


Stephen_Sorkin
Splunk Employee

This is potentially a very long discussion of the differences between Splunk, which seeks to index time-series, machine-generated data, and Lucene, which was originally designed to index human-generated text documents. We can begin with your questions.

  1. Splunk has no notion of stop words. By default, Splunk indexes all keywords found in events, as defined by the segmentation rules.
  2. Splunk provides wildcard searches and phrase searches, but the index doesn't provide native proximity searches or regex searches. For those, we rely on subsequent commands in the search processing pipeline.
  3. Splunk aggressively compresses the rawdata we store, and we put a lot of effort into making the indexes as small as possible, by means of explicit compression and other low-footprint data structures. Typically, you can expect the rawdata to be about 10% of the size of the original data and the indexes to be 20-40% of the size of the original data, depending on entropy. Together, Splunk typically requires 30-50% of the size of the original data as storage.
  4. The index itself doesn't provide synonym support, since that's fundamentally a problem for human text. We provide an analogous concept, however, in eventtypes, which can be used to represent meaningful classes of queries, including synonyms.
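
To make the storage figures in point 3 concrete, here is a rough back-of-the-envelope sketch in Python. It only adds up the typical ratios quoted above (rawdata at about 10%, indexes at 20-40% of the original size). The function name and parameters are made up for illustration and are not part of any Splunk tool or API, and real ratios will vary with the entropy of your data.

```python
# Back-of-the-envelope storage estimate using the typical ratios quoted in
# the answer. The function name and defaults are hypothetical illustrations,
# not a Splunk API; real ratios depend on the entropy of the data.

def estimate_splunk_storage(original_gb, rawdata_ratio=0.10,
                            index_ratio_range=(0.20, 0.40)):
    """Return (low_gb, high_gb): estimated on-disk size for original_gb of input."""
    if original_gb < 0:
        raise ValueError("original_gb must be non-negative")
    # Compressed rawdata: roughly 10% of the original size.
    rawdata_gb = original_gb * rawdata_ratio
    # Index files: roughly 20-40% of the original size.
    index_low_gb = original_gb * index_ratio_range[0]
    index_high_gb = original_gb * index_ratio_range[1]
    return rawdata_gb + index_low_gb, rawdata_gb + index_high_gb


if __name__ == "__main__":
    low, high = estimate_splunk_storage(100.0)
    print(f"100 GB of original data: roughly {low:.0f}-{high:.0f} GB on disk")
```

For 100 GB of original data this gives an estimate of roughly 30 to 50 GB, which is just the 30-50% range stated above. Treat it as a planning aid, not a guarantee.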
