Getting Data In

Search Archived Buckets in Hadoop

Kieffer87
Communicator

I'm working on rolling out Hadoop Data Roll for archived data to our indexer cluster. The buckets are rolling as expected and I can see them in Hadoop, but I'm not able to search the archived indexes in Hadoop from Splunk. I'm not seeing anything in the Splunk logs; searches just return no events found. Am I missing something on my SH cluster? I've done this in the past on my test box, but I can't for the life of me remember what I changed outside of the indexes.conf on the indexer. Running a simple index=proxy_archive search over All Time returns nothing.

The same indexes.conf is on all the indexers:

[proxy]
homePath = $SPLUNK_DB/$_index_name/db
coldPath = $SPLUNK_DB/$_index_name/colddb
thawedPath = $SPLUNK_DB/$_index_name/thaweddb
maxDataSize = auto_high_volume
maxTotalDataSizeMB = 3250000
repFactor = auto
disabled = false

[proxy_archive]
vix.output.buckets.from.indexes = proxy
vix.output.buckets.older.than = 7776000
vix.output.buckets.path = hdfs://ns/projects/csdc_splunk/cold/proxy_archive
vix.smart.search.cutoff_sec = 7775000
vix.provider = hadoop-lake2

[provider-family:hadoop]
vix.mode          = report
vix.command       = $SPLUNK_HOME/bin/jars/sudobash
vix.command.arg.1 = $HADOOP_HOME/bin/hadoop
vix.command.arg.2 = jar
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-h1.jar
vix.command.arg.4 = com.splunk.mr.SplunkMR
vix.env.MAPREDUCE_USER             =
vix.env.HADOOP_HEAPSIZE            = 512
vix.env.HADOOP_CLIENT_OPTS         = -Xmx4096m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.env.HUNK_THIRDPARTY_JARS       = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive/hive-exec-0.12.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive/hive-metastore-0.12.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive/hive-serde-0.12.0.jar
vix.mapred.job.reuse.jvm.num.tasks = 100
vix.mapred.child.java.opts         = -server -Xmx512m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.mapred.reduce.tasks            = 0
vix.mapred.job.map.memory.mb       = 2048
vix.mapred.job.reduce.memory.mb    = 512
vix.mapred.job.queue.name          = default
vix.mapreduce.job.jvm.numtasks     = 100
vix.mapreduce.map.java.opts        = -server -Xmx512m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.mapreduce.reduce.java.opts     = -server -Xmx512m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.mapreduce.job.reduces          = 0
vix.mapreduce.map.memory.mb        = 2048
vix.mapreduce.reduce.memory.mb     = 512
vix.mapreduce.job.queuename        = default
vix.splunk.search.column.filter    = 1
vix.splunk.search.mixedmode        = 1
vix.splunk.search.debug            = 0
vix.splunk.search.mr.maxsplits     = 10000
vix.splunk.search.mr.minsplits     = 100
vix.splunk.search.mr.splits.multiplier = 10
vix.splunk.search.mr.poll          = 2000
vix.splunk.search.recordreader     = SplunkJournalRecordReader,ValueAvroRecordReader,SimpleCSVRecordReader,SequenceFileRecordReader
vix.splunk.search.recordreader.avro.regex     = \.avro$
vix.splunk.search.recordreader.csv.regex      = \.([tc]sv)(?:\.(?:gz|bz2|snappy))?$
vix.splunk.search.recordreader.sequence.regex = \.seq$
vix.splunk.home.datanode           = /tmp/splunk/$SPLUNK_SERVER_NAME/
vix.splunk.heartbeat               = 1
vix.splunk.heartbeat.threshold     = 60
vix.splunk.heartbeat.interval      = 1000
vix.splunk.setup.onsearch          = 1
vix.splunk.setup.package           = current

[provider:hadoop-lake2]
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.env.HADOOP_HOME = /opt/hadoop/current
vix.env.HUNK_THIRDPARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/openv/java/jre
vix.family = hadoop
vix.fs.default.name = hdfs://ns/projects/csdc_splunk
vix.mapreduce.framework.name = yarn
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /projects/csdc_splunk/workdir
vix.splunk.impersonation = 0
vix.yarn.resourcemanager.address = lake2-rm.hdp.deere.com:8050
vix.yarn.resourcemanager.scheduler.address = lake2-rm.hdp.deere.com:8030
vix.splunk.home.datanode = /tmp/splunk/$SPLUNK_SERVER_NAME/
vix.splunk.setup.package.setup.timelimit = 10000

We're running Hadoop client 2.6.0. We were using 2.7, but ran into issues with Splunk sending 2+ GB packet sizes and were advised by Splunk to downgrade to 2.6 or lower until HDP corrects the issue.

1 Solution

shaskell_splunk
Splunk Employee

Turned out to be a typo in indexes.conf and the provider and vix were not deployed to the search heads in the search head cluster. Once the typo was fixed and indexes.conf was pushed via the deployer, we were able to search the archives.

The troubleshooting step that helped was to go to the CLI on one of the indexers and run the following search:

$SPLUNK_HOME/bin/splunk search "index=proxy_archive"
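For reference, a sketch of the deployer push mentioned above (assuming the provider and virtual index stanzas live in an app under $SPLUNK_HOME/etc/shcluster/apps/ on the deployer; the target URI and credentials are placeholders, not from this thread):

# Run on the search head cluster deployer after editing the app's indexes.conf
$SPLUNK_HOME/bin/splunk apply shcluster-bundle -target https://<any_shc_member>:8089 -auth admin:<password>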


rdagan_splunk
Splunk Employee

Yes, Java and a Hadoop client are required on the search head.
Archiving requires Hadoop and Java on the search head and the indexers.
Searching requires Hadoop and Java only on the search head.

Here is the link regarding the Search Head install: http://docs.splunk.com/Documentation/Splunk/latest/HadoopAnalytics/Configureasearchhead

Here is the link for setting up Name Node HA: http://docs.splunk.com/Documentation/Splunk/6.5.3/HadoopAnalytics/ProviderConfigurationVariables#Hig...

And Yarn Resource Manager HA: http://docs.splunk.com/Documentation/Splunk/6.5.3/HadoopAnalytics/RequiredConfigurationVariablesforY...
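As a quick sanity check on the search head, you can confirm both are present and match the provider's vix.env settings (the paths below are the ones from the provider stanza in this thread; adjust for your environment):

# On each search head, verify the Java and Hadoop client the provider points to
/usr/openv/java/jre/bin/java -version
/opt/hadoop/current/bin/hadoop version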


rdagan_splunk
Splunk Employee

Are you searching index=proxy, or index=proxy_archive (over All Time)?

In the provider, this looks wrong:
vix.fs.default.name = hdfs://ns/projects/csdc_splunk
Normally we see something like:
vix.fs.default.name = hdfs://<machine name>:8020

In the virtual index, we normally do not see this:
vix.output.buckets.path = hdfs://ns/projects/csdc_splunk/cold/proxy_archive
but rather something more like this:
vix.output.buckets.path = /projects/csdc_splunk/cold/proxy_archive

As for the relationship between older.than and unified.search: you copy files to Hadoop after 90 days (7,776,000 seconds):
vix.output.buckets.older.than = 7776000

So the unified search cutoff should be older than 90 days, for example:
vix.unified.search.cutoff_sec = 7948900 (roughly 92 days)

(vix.smart.search.cutoff_sec = 7775000 is not a valid setting in Splunk 6.5.)
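Putting those suggestions together, a revised virtual index stanza would look something like this (a sketch only, using the values from this thread):

[proxy_archive]
vix.provider = hadoop-lake2
vix.output.buckets.from.indexes = proxy
# copy buckets to Hadoop once they are older than 90 days (in seconds)
vix.output.buckets.older.than = 7776000
# path relative to vix.fs.default.name rather than a full hdfs:// URI
vix.output.buckets.path = /projects/csdc_splunk/cold/proxy_archive
# must be greater than older.than; replaces vix.smart.search.cutoff_sec
vix.unified.search.cutoff_sec = 7948900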


Kieffer87
Communicator

I'm trying index=proxy_archive (All Time).

In core-site.xml we define hdfs://ns for HA. From the indexer CLI I'm able to list the directory contents of hdfs://ns/projects/csdc_splunk, and buckets are being copied without any issues.
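For example, the check from the indexer CLI was something like this (path taken from the config above):

# List the archive directory contents in HDFS
hadoop fs -ls hdfs://ns/projects/csdc_splunk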

vix.output.buckets.path - I can update the path, though buckets are archiving correctly as-is.

vix.unified.search.cutoff_sec - I updated indexes.conf to use this flag.

Is it necessary to have Java and the Hadoop client installed on the search heads in addition to the indexers? I don't recall having to do that during my proof of concept, but that was some time ago.
