Splunk Search

How does Hunk search data from Hadoop?

Harishma
Communicator
  • Can someone please explain, in simple layman's terms, how Splunk SEARCHES Hadoop data? I understand it doesn't store the data in indexes. This doc was quite confusing for me.

http://docs.splunk.com/Documentation/Hunk/6.4.4/Hunk/Processflow

  • What does it mean that the SEARCH is a "MapReduce Job"? I understand what a MapReduce job is, but why is the search called a MapReduce job here?
  • How does the licensing cost depend on the number of TaskTrackers? Is this licensing cheaper than the cost of storing the data on indexers?

Sorry if my doubts sound very lame, but kindly guide me on how Hunk works in the background.

1 Solution

kschon_splunk
Splunk Employee

For a "streaming" search, Splunk Analytics for Hadoop (which until recently was called Hunk) streams the data files back to the Search Head, and does all the work of examining the files there. In this case, SAH is essentially using your Hadoop cluster as a distributed storage system. For a "reporting" search, SAH uses your cluster as a distributed computing system as well. It launches a Map Reduce job that copies the Splunk code to the compute nodes of the cluster. The compute nodes parse the data files, filter for the events which match the query, and do any additional computation that can be done locally, without seeing data in other files. The results of these steps go back to the Search Head, where only the final compute steps are performed. For example, let's say that you ran this query:

index=my_virtual_index field1=foo | stats count

If you run this in report mode, the SH will launch an MR job which will run on the compute nodes in your Hadoop cluster. The compute nodes will run "tasks", and each task will look at one data "split" (this may be an entire file, or a piece of a file). The task will break the split into events and count the number of events that have a "field1" field with value "foo". It will write the subtotal for this split where the SH can find it. The SH only needs to add all of the subtotals and report the final total to the user. So we have used the full compute power of the Hadoop cluster and minimized the need to send large data files over the network.
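
To make the division of labor concrete, here is a minimal Python sketch of the report-mode flow for the query above. This is not Splunk's actual code; the function names, event format, and the in-process stand-in for the MapReduce job are all invented for illustration.

# Hypothetical sketch of report mode for:
#   index=my_virtual_index field1=foo | stats count
# Not Splunk's implementation; names and data format are invented.

def run_task(split_lines):
    """Runs on a compute node: parse one split into events and
    emit a subtotal of events whose field1 equals "foo"."""
    subtotal = 0
    for line in split_lines:
        event = dict(p.split("=", 1) for p in line.split() if "=" in p)
        if event.get("field1") == "foo":
            subtotal += 1
    return subtotal  # written where the Search Head can find it

def search_head_merge(subtotals):
    """Runs on the Search Head: the only remaining work is summing
    the per-split subtotals into the final `stats count`."""
    return sum(subtotals)

# Three "splits"; each may be a whole file or a piece of one.
splits = [
    ["field1=foo msg=a", "field1=bar msg=b"],
    ["field1=foo msg=c"],
    ["field1=baz msg=d", "field1=foo msg=e"],
]

# In the real system each task runs in parallel on the cluster;
# a plain loop stands in for the MapReduce job here.
subtotals = [run_task(s) for s in splits]
print(search_head_merge(subtotals))  # -> 3

The key point is that only the small subtotals, not the raw data files, cross the network back to the Search Head.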

If you run the same search in streaming mode, the SH will read all the data splits in their entirety, and do all of the work itself. For a large amount of data, this will be much slower.
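
For contrast, a streaming-mode version of the same hypothetical sketch skips the cluster-side tasks entirely (it reuses run_task and splits from the sketch above):

# Streaming mode (hypothetical sketch): the Search Head pulls every
# split over the network and does all parsing/filtering/counting itself.
def streaming_search(splits):
    total = 0
    for split_lines in splits:          # every byte crosses the network
        total += run_task(split_lines)  # computed on the SH, not the cluster
    return total

print(streaming_search(splits))  # same answer; all work done by one node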

As for which is cheaper--storing your data on indexers or storing your data in Hadoop--the answer is "it depends". One license is based on how much data you ingest per day, and the other is based on the number of nodes in your cluster, so they cannot be directly compared. On top of this, your cost to set up and maintain Splunk indexers may be different from your cost to set up and maintain a Hadoop cluster. Which option is right for you will depend on your business needs. In general, Splunk Analytics for Hadoop is likely to be better if:

--A lot of your data is already in Hadoop.
--You have other Hadoop-based systems that you want to use on the same data.

Splunk Enterprise will generally be better if:

--You want low-latency "needle in a haystack" searches.
--You want real-time search and alert capabilities.

One or the other may well be cheaper for you, but which one will depend on your needs and your current setup.


Georgin
Engager

A follow-up question regarding the second paragraph, "If you run this in report mode, the SH will launch an MR job which will run on the compute nodes in your Hadoop cluster."
My question is: can we limit the number of compute nodes in the Hadoop cluster that the MR job will run on? If so, how can it be done?

Thank you.
