Splunk Analytics for Hadoop: What is the difference between Hadoop Cluster and Hadoop CLI?

Harishma
Communicator

Hi Team,

I'm trying to set up Splunk Analytics for Hadoop in my DEV environment. I'm setting up a Hadoop cluster on one server and the Splunk Search Head on another.
I understand the basics of Hadoop, so I'm learning further by working on this POC.

The docs say that I need to set up the Hadoop CLI on the Splunk instance.
What is this? Sorry, I read the doc but couldn't understand much. If someone could elaborate on this, it would be great.

In certain places the docs state the following:
"Download and extract the correct Hadoop CLI for each Hadoop cluster"
"test that your Hadoop CLI is set up properly and can connect to your Hadoop cluster"

I'm quite confused, what is this Hadoop CLI? Please guide.

ddrillic
Ultra Champion

Hi Harishma,

CLI is a command-line interface or command language interpreter.

From the Splunk Analytics for Hadoop server you need to be able to connect to HDFS and Hive via the Hadoop CLI commands. With Hadoop MapR we achieve this by installing the MapR client on the Splunk Analytics for Hadoop server.

I hope it helps...

Harishma
Communicator

Hi @ddrillic,

When you say "Splunk Analytics for Hadoop server", are you referring to the Splunk instance (Search Head) that is used to interact with HDFS?

kschon_splunk
Splunk Employee

Correct, the Hadoop client directory needs to be present on the Search Head that is talking to Hadoop. This directory will contain both the executables for things like the CLI, and the jar files (Java libraries) used to connect programmatically to Hadoop.

BTW, the reason that you need to provide these to Splunk Analytics for Hadoop, as opposed to them being provided with Splunk, is that the libraries need to match your Hadoop distribution, i.e. the vendor and version number of Hadoop that you are using.
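
As a rough sketch of what "the client directory is present" means in practice, a small check like the one below can be run on the Search Head before configuring anything in Splunk. The function name check_hadoop_client is made up for illustration; the only thing it assumes about the client layout is the bin/hadoop path that appears elsewhere in this thread.

```shell
#!/bin/sh
# check_hadoop_client is a hypothetical helper, not part of Hadoop or Splunk.
# It only verifies that <dir>/bin/hadoop exists and is executable.
# Usage: check_hadoop_client /path/to/hadoop-client-dir
check_hadoop_client() {
    home="$1"
    if [ -z "$home" ]; then
        echo "HADOOP_HOME is not set" >&2
        return 2
    fi
    if [ ! -x "$home/bin/hadoop" ]; then
        echo "No executable found at $home/bin/hadoop" >&2
        return 1
    fi
    echo "Found Hadoop CLI at $home/bin/hadoop"
}
```

If the check passes, running "$HADOOP_HOME/bin/hadoop version" should print the client's version, which you can then compare against the version of the cluster's distribution.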

Harishma
Communicator

Hi @kschon , @ddrillic ,

I set up the Hadoop YARN CLI on my server ABC (the Splunk Analytics for Hadoop server).

I ran the command below on server ABC to test my connection with my Hadoop cluster server, sl55selappn.tesco.com.

$HADOOP_HOME/bin/hadoop fs -ls hdfs://sl55selappn.tesco.com:9000

I got the below error. Where am I going wrong? Could you please guide?

16/12/02 04:41:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ls: `hdfs://sl55selappn.tesco.com:9000': No such file or directory

kschon_splunk
Splunk Employee

The problem is that you have specified the name-node, but not a directory. You can append a dir to the end of the name-node, e.g. to list the contents of "/foo" you could use:
$HADOOP_HOME/bin/hadoop fs -ls hdfs://sl55selappn.tesco.com:9000/foo

To make this a little easier to read, you can list the name-node separately with the "-fs" option, like so:
$HADOOP_HOME/bin/hadoop fs -fs hdfs://sl55selappn.tesco.com:9000 -ls /foo

To list the contents of the root dir, try this:
$HADOOP_HOME/bin/hadoop fs -fs hdfs://sl55selappn.tesco.com:9000 -ls /

Harishma
Communicator

Hi @kschon ,

Yup, that worked!! Thanks a lot 🙂

But I'm sorry, I'm kind of back to my original doubt. Sorry if it sounds lame.
The YARN CLI is installed on my Splunk Analytics for Hadoop server. Does that mean this server acts like a single-node Hadoop cluster, where the namenode, datanodes, tasktrackers, etc. all exist on this same server?
What I'm trying to understand is the folder $HADOOP_HOME/etc/hadoop on this server: should the XML config files in it reference my actual Hadoop cluster, or the Splunk Analytics for Hadoop server's own parameters?
I hope I have conveyed my doubt.

I don't understand how they are able to communicate. I was able to create a text file in the datanode dir from the Splunk Analytics for Hadoop server.
These two servers have nothing in common. Each one is a separate single-node cluster, so how is the communication between them happening?

Please clarify.

kschon_splunk
Splunk Employee

If your configuration is correct, your SH should be talking to your real Hadoop cluster, not running in local mode, and not running a local single node cluster. Please read the manual for configuring your version of Hadoop.

By setting the locations of the file system, resource manager, and scheduler in your XML files, you can control the default locations that your Hadoop client will point to. You can override these on the command line, for example using the "-fs" option for the filesystem. If you are not sure where you are pointing by default, try doing a "fs -ls" with and without specifying the filesystem and see if you get the same thing. If you can't tell, use "fs -put" to put a marker file that you can then look for.
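
The marker-file check described above could be sketched roughly as follows. The function name same_filesystem and the marker path under /tmp are made up for illustration, and the sketch assumes /tmp exists on whichever filesystem the client points to. It writes a marker with no "-fs" option (so it lands on the default filesystem) and then looks for it on the name-node you pass explicitly.

```shell
#!/bin/sh
# same_filesystem is a hypothetical helper, not part of Hadoop or Splunk.
# Usage: same_filesystem <path-to-hadoop-binary> <name-node-uri>
same_filesystem() {
    hadoop="$1"
    namenode="$2"
    marker="/tmp/splunk_marker_$$"   # assumed marker path; /tmp must exist on the target
    local_file=$(mktemp) || return 2

    # Write the marker WITHOUT -fs, so it goes wherever the XML files point.
    if ! "$hadoop" fs -put "$local_file" "$marker"; then
        rm -f "$local_file"
        return 2
    fi
    rm -f "$local_file"

    # Look for the marker on the explicitly named filesystem.
    if "$hadoop" fs -fs "$namenode" -test -e "$marker"; then
        echo "default filesystem matches $namenode"
        result=0
    else
        echo "default filesystem is NOT $namenode"
        result=1
    fi

    # Best-effort cleanup of the marker on the default filesystem.
    "$hadoop" fs -rm "$marker" >/dev/null 2>&1 || true
    return "$result"
}
```

For example, same_filesystem "$HADOOP_HOME/bin/hadoop" hdfs://sl55selappn.tesco.com:9000 would report a mismatch if the client on the Search Head is still running against its own local filesystem.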

When you configure a Splunk Analytics for Hadoop provider, you can specify parameters such as:
vix.fs.default.name
vix.yarn.resourcemanager.address
vix.yarn.resourcemanager.scheduler.address

These will override the values in your XML files.

Harishma
Communicator

Oh, yup! Got it! I realized my mistake: I had made the SH run in local mode as well.

I re-did the SH configuration and it worked! 🙂 Thank you so much for your patience in helping me with this! 🙂 It really helped me understand the entire flow.

If you don't mind, my last two doubts on this topic:
1) While creating a provider, most of the values were auto-populated. Is there anything I need to modify here?
For example, should I modify the setting below?
vix.splunk.home.datanode = /tmp/splunk/$SPLUNK_SERVER_NAME/
Should I provide SPLUNK_SERVER_NAME here?

2) After I created a virtual index and tried to explore data, it gave the error below in the UI:
[myhadoopprovider] Error in 'ExternalResultProvider': Hadoop CLI may not be set correctly. Please check HADOOP_HOME and Default Filesystem in the provider settings for this virtual index. Running /home/splunkd1/hadoop-2.7.2/bin/hadoop fs -stat hdfs://ABC.tesco.com:8020/ should return successfully, rc=1, error=16/12/06 08:52:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable stat: Call From XYZ.tesco.com/11.199.169.176 to ABC.tesco.com:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

ABC --> Hadoop Server
XYZ --> Splunk SH

/home/splunkd1/hadoop-2.7.2/bin/hadoop fs -ls hdfs://ABC.fmrco.com:9000/ --> This worked, but why not the command below?
Should I change the port number to 8020 in the XML file on the Hadoop cluster?

/home/splunkd1/hadoop-2.7.2/bin/hadoop fs -ls hdfs://ABC.fmrco.com:8020/

kschon_splunk
Splunk Employee

Very glad we could help.

As for the default values, most of them should be fine. Change them if you have a specific purpose in mind (e.g. something is not working, or you want to do performance tuning). If you want to know what any of them do, find them here:
http://docs.splunk.com/Documentation/Splunk/6.5.1/Admin/Indexesconf

As for the error message, it gives you an exact command to run to help you debug:
/home/splunkd1/hadoop-2.7.2/bin/hadoop fs -stat hdfs://ABC.tesco.com:8020/

Try it from the command line. If it does not work, the problem is in your Hadoop configurations and/or network connectivity. If it does work, then the problem is on the Splunk side. It sounds like you configured HDFS to accept connections on port 9000? If so, then your provider configuration needs to match that:
vix.fs.default.name = hdfs://ABC.tesco.com:9000
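
One way to compare the two sides is to reduce each HDFS URI to its host:port part and check that they agree. The helper below is only a sketch; the name hdfs_authority is made up for illustration, and it assumes the URI uses the hdfs:// scheme.

```shell
#!/bin/sh
# hdfs_authority is a hypothetical helper, not part of Hadoop or Splunk.
# It prints the host:port part of an hdfs:// URI.
# Usage: hdfs_authority hdfs://host:port/some/path
hdfs_authority() {
    uri="${1#hdfs://}"   # strip the scheme
    echo "${uri%%/*}"    # strip any path after host:port
}
```

On a cluster node, "hdfs getconf -confKey fs.defaultFS" prints the filesystem URI the cluster is configured with. Passing that value and the provider's vix.fs.default.name through hdfs_authority should give the same host:port; if the ports differ (9000 versus 8020 here), the provider setting is the one to change.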

Harishma
Communicator

Yup, as you rightly said, HDFS was configured on port 9000, but my provider settings pointed to 8020. Changed it to 9000 and it worked!! Thanks a ton @kschon!! 🙂

kschon_splunk
Splunk Employee

Glad it's working!

aaraneta_splunk
Splunk Employee

Hey @Harishma - Looks like your original question was answered 🙂 Don't forget to click "Accept" to close out this question and to also up-vote the answer and any comments that were helpful to you. Thanks!
