Splunk Search

How does Hunk search data from Hadoop?

Harishma
Communicator
  • Can someone please explain, in simple layman's terms, how Splunk SEARCHES Hadoop data? I understand it doesn't store the data in indexes. This doc was quite confusing for me.

http://docs.splunk.com/Documentation/Hunk/6.4.4/Hunk/Processflow

  • What does it mean that the SEARCH is a "MapReduce Job"? I understand what a MapReduce job is, but why is the search called a MapReduce job here?
  • How does the licensing cost depend on the number of TaskTrackers? Is this licensing cheaper than the cost of storing the data on indexers?

Sorry if my doubts sound very lame, but kindly guide me on how Hunk works in the background.

1 Solution

kschon_splunk
Splunk Employee

For a "streaming" search, Splunk Analytics for Hadoop (which until recently was called Hunk) streams the data files back to the Search Head, and does all the work of examining the files there. In this case, SAH is essentially using your Hadoop cluster as a distributed storage system. For a "reporting" search, SAH uses your cluster as a distributed computing system as well. It launches a Map Reduce job that copies the Splunk code to the compute nodes of the cluster. The compute nodes parse the data files, filter for the events which match the query, and do any additional computation that can be done locally, without seeing data in other files. The results of these steps go back to the Search Head, where only the final compute steps are performed. For example, let's say that you ran this query:

index=my_virtual_index field1=foo | stats count

If you run this in report mode, the SH will launch an MR job which will run on the compute nodes in your Hadoop cluster. The compute nodes will run "tasks", and each task will look at one data "split" (this may be an entire file, or a piece of a file). The task will break the split into events and count the number of events that have a "field1" field with value "foo". It will write the subtotal for this split where the SH can find it. The SH only needs to add all of the subtotals and report the final total to the user. So we have used the full compute power of the Hadoop cluster and minimized the need to send large data files over the network.
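
To make the division of labor concrete, here is a minimal Python sketch of the report-mode flow for the query above. This is not Splunk's actual code; the function names, event format, and the in-process stand-in for the MapReduce job are all invented for illustration.

# Hypothetical sketch of report mode for:
#   index=my_virtual_index field1=foo | stats count
# Not Splunk's implementation; names and data format are invented.

def run_task(split_lines):
    """Runs on a compute node: parse one split into events and
    emit a subtotal of events whose field1 equals "foo"."""
    subtotal = 0
    for line in split_lines:
        event = dict(p.split("=", 1) for p in line.split() if "=" in p)
        if event.get("field1") == "foo":
            subtotal += 1
    return subtotal  # written where the Search Head can find it

def search_head_merge(subtotals):
    """Runs on the Search Head: the only remaining work is summing
    the per-split subtotals into the final `stats count`."""
    return sum(subtotals)

# Three "splits"; each may be a whole file or a piece of one.
splits = [
    ["field1=foo msg=a", "field1=bar msg=b"],
    ["field1=foo msg=c"],
    ["field1=baz msg=d", "field1=foo msg=e"],
]

# In the real system each task runs in parallel on the cluster;
# a plain loop stands in for the MapReduce job here.
subtotals = [run_task(s) for s in splits]
print(search_head_merge(subtotals))  # -> 3

The key point is that only the small subtotals, not the raw data files, cross the network back to the Search Head.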

If you run the same search in streaming mode, the SH will read all the data splits in their entirety, and do all of the work itself. For a large amount of data, this will be much slower.
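
For contrast, a streaming-mode version of the same hypothetical sketch skips the cluster-side tasks entirely (it reuses run_task and splits from the sketch above):

# Streaming mode (hypothetical sketch): the Search Head pulls every
# split over the network and does all parsing/filtering/counting itself.
def streaming_search(splits):
    total = 0
    for split_lines in splits:          # every byte crosses the network
        total += run_task(split_lines)  # computed on the SH, not the cluster
    return total

print(streaming_search(splits))  # same answer; all work done by one node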

As for which is cheaper--storing your data on indexers or storing your data in Hadoop--the answer is "it depends". One license is based on how much data you ingest per day, and the other is based on the number of nodes in your cluster, so they cannot be directly compared. On top of this, your cost to set up and maintain Splunk indexers may be different from your cost to set up and maintain a Hadoop cluster. Which option is right for you will depend on your business needs. In general, Splunk Analytics for Hadoop is likely to be better if:

--A lot of your data is already in Hadoop.
--You have other Hadoop-based systems that you want to use on the same data.

Splunk Enterprise will generally be better if:

--You want low-latency "needle in a haystack" searches.
--You want real-time search and alert capabilities.

One or the other may well be cheaper for you, but which one will depend on your needs and your current setup.


Georgin
Engager

A follow-up question regarding the second paragraph, "If you run this in report mode, the SH will launch an MR job which will run on the compute nodes in your Hadoop cluster."
My question is: can we limit the number of compute nodes in the Hadoop cluster that the MR job will run on? If so, how can it be done?

Thank you.
