Splunk Search

How to configure 6.5.0 data roll to search archived buckets in S3?

heroku_curzonj
Explorer

I followed the instructions in the documentation for archiving to S3 in 6.5.0 (http://docs.splunk.com/Documentation/Splunk/6.5.0/Indexer/ArchivingSplunkindexestoS3),
but Splunk still can't find the jars it wants. How do I properly configure the jars for searching S3 archived buckets?

I ran the | archivebuckets command and it worked fine and archived the buckets, but the search errors out saying it can't find the jars:

  [HadoopProvider] Error in 'ExternalResultProvider': Hadoop CLI may not be set correctly. Please check HADOOP_HOME and Default Filesystem in the provider settings for this virtual index. Running /opt/hadoop/bin/hadoop fs -stat s3a://bucketname/prefix/ should return successfully, rc=255, error=-stat: Fatal internal error java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)

I ran the command from the error message by hand, and I could only get it to work by passing the -libjars option. Setting HADOOP_CLASSPATH instead did not work:

$ /opt/hadoop/bin/hadoop fs -libjars $HADOOP_TOOLS/hadoop-aws-2.7.2.jar,$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar,$HADOOP_TOOLS/jackson-databind-2.2.3.jar,$HADOOP_TOOLS/jackson-core-2.2.3.jar,$HADOOP_TOOLS/jackson-annotations-2.2.3.jar -Dfs.s3a.access.key=value -Dfs.s3a.secret.key=value -stat s3a://bucketname/prefix/
1970-01-01 00:00:00
$ export HADOOP_CLASSPATH=$HADOOP_TOOLS/hadoop-aws-2.7.2.jar,$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar,$HADOOP_TOOLS/jackson-databind-2.2.3.jar,$HADOOP_TOOLS/jackson-core-2.2.3.jar,$HADOOP_TOOLS/jackson-annotations-2.2.3.jar
$ /opt/hadoop/bin/hadoop classpath
/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.2.jar,/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar,/opt/hadoop/share/hadoop/tools/lib/jackson-databind-2.2.3.jar,/opt/hadoop/share/hadoop/tools/lib/jackson-core-2.2.3.jar,/opt/hadoop/share/hadoop/tools/lib/jackson-annotations-2.2.3.jar:/opt/hadoop/contrib/capacity-scheduler/*.jar
$ /opt/hadoop/bin/hadoop fs -Dfs.s3a.access.key=value -Dfs.s3a.secret.key=value -stat s3a://bucketname/prefix/
-stat: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
    ... 16 more

Here is my provider configuration:

[provider:HadoopProvider]
vix.family                  = hadoop
vix.splunk.setup.package    = /opt/splunk_package.tgz

vix.env.JAVA_HOME           = /usr/lib/jvm/java-7-openjdk-amd64
vix.env.HADOOP_HOME         = /opt/hadoop
vix.env.HADOOP_TOOLS        = /opt/hadoop/share/hadoop/tools/lib
vix.splunk.home.datanode    = /opt/splunk
vix.splunk.home.hdfs        = /working-dir

vix.splunk.jars = $HADOOP_TOOLS/hadoop-aws-2.7.2.jar,$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar,$HADOOP_TOOLS/jackson-databind-2.2.3.jar,$HADOOP_TOOLS/jackson-core-2.2.3.jar,$HADOOP_TOOLS/jackson-annotations-2.2.3.jar

vix.mapreduce.framework.name                = yarn
vix.yarn.resourcemanager.address            = <%= ENV['HADOOP_MASTER'] %>:8032
vix.yarn.resourcemanager.scheduler.address  = <%= ENV['HADOOP_MASTER'] %>:8030
vix.fs.s3a.access.key                       = <%= ENV['S3_ARCHIVE_ACCESS_KEY'] %>
vix.fs.s3a.secret.key                       = <%= ENV['S3_ARCHIVE_SECRET_KEY'] %>
vix.fs.default.name                         = s3a://<%= ENV['SPLUNK_HADOOP_BUCKET'] %>/prefix

[main_archive]
vix.provider                    = HadoopProvider
vix.output.buckets.from.indexes = main
vix.output.buckets.older.than   = 1
vix.output.buckets.path         = s3a://<%= ENV['SPLUNK_HADOOP_BUCKET'] %>/prefix

I'm running against a vanilla Apache Hadoop tarball, version 2.7.2. I'm not sure which commands try to run against the Hadoop cluster, but the cluster I'm working against is an AWS EMR cluster with the same Hadoop version.


1 Solution

rdagan_splunk
Splunk Employee

Are you sure the Name Node flag vix.fs.default.name is correct?

Normally you will see vix.fs.default.name = hdfs://<master-private-ip>:8020
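
To see which filesystem scheme a provider stanza currently points at, something like the following could be used. This is only a sketch: the default_fs_scheme function and the sample stanza written to a temp file are made up for illustration, and in practice you would point it at your own indexes.conf.

```shell
#!/bin/sh
# Hypothetical helper: print the URI scheme of vix.fs.default.name
# found in an indexes.conf-style file ($1 = path to the file).
default_fs_scheme() {
    sed -n 's/^[[:space:]]*vix\.fs\.default\.name[[:space:]]*=[[:space:]]*\([A-Za-z0-9]*\):\/\/.*/\1/p' "$1" | head -n 1
}

# Sample stanza for illustration only (values mirror the question).
conf=$(mktemp)
cat > "$conf" <<'EOF'
[provider:HadoopProvider]
vix.family          = hadoop
vix.fs.default.name = s3a://bucketname/prefix
EOF

scheme=$(default_fs_scheme "$conf")
case "$scheme" in
    hdfs) echo "default filesystem is HDFS" ;;
    *)    echo "default filesystem is '$scheme', not hdfs" ;;
esac
rm -f "$conf"
```

If the scheme comes back as anything other than hdfs, that is the setting this answer is asking about.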


heroku_curzonj
Explorer

This Splunk blog post indicated that I could use S3 as the default filesystem, but switching the default to HDFS did solve the problem.

http://blogs.splunk.com/2013/11/13/analyze-data-with-hunk-on-amazon-emr/

For anybody who comes looking, I also had to add the following to my provider config to get Splunk to use the Hadoop 2 compatible SplunkMR jars:

vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.splunk.impersonation = 0

heroku_curzonj
Explorer

I fixed HADOOP_CLASSPATH to be colon-separated instead of comma-separated, and the hadoop command now works without -libjars. So the command that the search error says should work does work, but the search still fails.

$ export HADOOP_CLASSPATH=$HADOOP_TOOLS/hadoop-aws-2.7.2.jar:$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar:$HADOOP_TOOLS/jackson-databind-2.2.3.jar:$HADOOP_TOOLS/jackson-core-2.2.3.jar:$HADOOP_TOOLS/jackson-annotations-2.2.3.jar
$ /opt/hadoop/bin/hadoop classpath
/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.2.jar:/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:/opt/hadoop/share/hadoop/tools/lib/jackson-databind-2.2.3.jar:/opt/hadoop/share/hadoop/tools/lib/jackson-core-2.2.3.jar:/opt/hadoop/share/hadoop/tools/lib/jackson-annotations-2.2.3.jar:/opt/hadoop/contrib/capacity-scheduler/*.jar
$ /opt/hadoop/bin/hadoop fs -Dfs.s3a.access.key=value -Dfs.s3a.secret.key=value -stat s3a://bucketname/prefix/
1970-01-01 00:00:00
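
For anyone hitting the same separator mistake, here is a minimal sketch of turning a comma-separated jar list (the form used with -libjars above) into a colon-separated HADOOP_CLASSPATH. The two-jar list is shortened for illustration, and the loop that flags leftover commas is a hypothetical sanity check, not something Splunk or Hadoop provides.

```shell
#!/bin/sh
# Assumed tools directory, taken from the provider settings in the question.
HADOOP_TOOLS=/opt/hadoop/share/hadoop/tools/lib

# Comma-separated list, as passed to -libjars (shortened to two jars).
LIBJARS="$HADOOP_TOOLS/hadoop-aws-2.7.2.jar,$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar"

# HADOOP_CLASSPATH is a Java classpath, so entries are separated by colons.
HADOOP_CLASSPATH=$(printf '%s' "$LIBJARS" | tr ',' ':')
export HADOOP_CLASSPATH

# Split on ':' and flag any entry that still contains a comma.
bad=0
old_ifs=$IFS
IFS=:
for entry in $HADOOP_CLASSPATH; do
    case "$entry" in
        *,*) echo "suspicious entry: $entry"; bad=1 ;;
    esac
done
IFS=$old_ifs

echo "$HADOOP_CLASSPATH"
```

With the comma-separated value, the whole list is treated as a single classpath entry that does not exist, which is consistent with the ClassNotFoundException shown earlier.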