Solved: Search Archived Buckets in Hadoop

Kieffer87 · ‎04-13-2017

I'm rolling out Hadoop Data Roll for archived data on our index cluster. Buckets are rolling as expected and I can see them in Hadoop, but I can't search the archived indexes from Splunk. Nothing shows up in the Splunk logs; the search just returns no events. Am I missing something on my search head cluster? I've done this before on my test box, but I can't for the life of me remember what I changed beyond indexes.conf on the indexer. Running a simple index=proxy_archive over All Time returns nothing.

The same indexes.conf is deployed on all of the indexers:

[proxy]
homePath = $SPLUNK_DB/$_index_name/db
coldPath = $SPLUNK_DB/$_index_name/colddb
thawedPath = $SPLUNK_DB/$_index_name/thaweddb
maxDataSize = auto_high_volume
maxTotalDataSizeMB = 3250000
repFactor = auto
disabled = false

[proxy_archive]
vix.output.buckets.from.indexes = proxy
vix.output.buckets.older.than = 7776000
vix.output.buckets.path = hdfs://ns/projects/csdc_splunk/cold/proxy_archive
vix.smart.search.cutoff_sec = 7775000
vix.provider = hadoop-lake2

[provider-family:hadoop]
vix.mode          = report
vix.command       = $SPLUNK_HOME/bin/jars/sudobash
vix.command.arg.1 = $HADOOP_HOME/bin/hadoop
vix.command.arg.2 = jar
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-h1.jar
vix.command.arg.4 = com.splunk.mr.SplunkMR
vix.env.MAPREDUCE_USER             =
vix.env.HADOOP_HEAPSIZE            = 512
vix.env.HADOOP_CLIENT_OPTS         = -Xmx4096m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.env.HUNK_THIRDPARTY_JARS       = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive/hive-exec-0.12.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive/hive-metastore-0.12.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive/hive-serde-0.12.0.jar
vix.mapred.job.reuse.jvm.num.tasks = 100
vix.mapred.child.java.opts         = -server -Xmx512m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.mapred.reduce.tasks            = 0
vix.mapred.job.map.memory.mb       = 2048
vix.mapred.job.reduce.memory.mb    = 512
vix.mapred.job.queue.name          = default
vix.mapreduce.job.jvm.numtasks     = 100
vix.mapreduce.map.java.opts        = -server -Xmx512m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.mapreduce.reduce.java.opts     = -server -Xmx512m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.mapreduce.job.reduces          = 0
vix.mapreduce.map.memory.mb        = 2048
vix.mapreduce.reduce.memory.mb     = 512
vix.mapreduce.job.queuename        = default
vix.splunk.search.column.filter    = 1
vix.splunk.search.mixedmode        = 1
vix.splunk.search.debug            = 0
vix.splunk.search.mr.maxsplits     = 10000
vix.splunk.search.mr.minsplits     = 100
vix.splunk.search.mr.splits.multiplier = 10
vix.splunk.search.mr.poll          = 2000
vix.splunk.search.recordreader     = SplunkJournalRecordReader,ValueAvroRecordReader,SimpleCSVRecordReader,SequenceFileRecordReader
vix.splunk.search.recordreader.avro.regex     = \.avro$
vix.splunk.search.recordreader.csv.regex      = \.([tc]sv)(?:\.(?:gz|bz2|snappy))?$
vix.splunk.search.recordreader.sequence.regex = \.seq$
vix.splunk.home.datanode           = /tmp/splunk/$SPLUNK_SERVER_NAME/
vix.splunk.heartbeat               = 1
vix.splunk.heartbeat.threshold     = 60
vix.splunk.heartbeat.interval      = 1000
vix.splunk.setup.onsearch          = 1
vix.splunk.setup.package           = current

[provider:hadoop-lake2]
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.env.HADOOP_HOME = /opt/hadoop/current
vix.env.HUNK_THIRDPARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/openv/java/jre
vix.family = hadoop
vix.fs.default.name = hdfs://ns/projects/csdc_splunk
vix.mapreduce.framework.name = yarn
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /projects/csdc_splunk/workdir
vix.splunk.impersonation = 0
vix.yarn.resourcemanager.address = lake2-rm.hdp.deere.com:8050
vix.yarn.resourcemanager.scheduler.address = lake2-rm.hdp.deere.com:8030
vix.splunk.home.datanode = /tmp/splunk/$SPLUNK_SERVER_NAME/
vix.splunk.setup.package.setup.timelimit = 10000

The Hadoop client version is 2.6.0. We were on 2.7 but ran into issues with Splunk sending packets larger than 2 GB, and Splunk advised us to downgrade to 2.6 or lower until HDP corrected the issue.

shaskell_splunk · ‎04-20-2017

It turned out to be two problems: a typo in indexes.conf, and the provider and virtual index (vix) stanzas had not been deployed to the search heads in the search head cluster. Once the typo was fixed and indexes.conf was pushed via the deployer, we were able to search the archives.

The troubleshooting step that helped was to go to the CLI on one of the indexers and run the following search:

$SPLUNK_HOME/bin/splunk search "index=proxy_archive"
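
Since the root cause was stanzas missing from the search heads, a quick file-level check can help before running the search. The following is only a sketch, not part of the original fix: the helper names has_stanza and check_archive_stanzas are hypothetical, and the stanza names are the ones from the indexes.conf posted in this thread.

```shell
# Hypothetical helper: succeed if the conf file contains the exact stanza header.
# $1 = path to an indexes.conf, $2 = stanza name without brackets
has_stanza() {
    # -F treats the brackets literally, -x requires a whole-line match
    grep -qxF "[$2]" "$1"
}

# Hypothetical helper: report whether the provider and virtual index stanzas
# from this thread are present in the given file.
check_archive_stanzas() {
    conf="$1"
    for stanza in "provider:hadoop-lake2" "proxy_archive"; do
        if has_stanza "$conf" "$stanza"; then
            echo "found:   [$stanza]"
        else
            echo "missing: [$stanza]"
        fi
    done
}
```

Run check_archive_stanzas against the indexes.conf that the deployer pushed to a search head (the exact path depends on the app it lives in). A stanza header with trailing whitespace would be reported as missing by this sketch. On a live instance, `$SPLUNK_HOME/bin/splunk btool indexes list proxy_archive --debug` should show the merged settings and the file each one came from.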

rdagan_splunk · ‎04-17-2017

Yes, Java and Hadoop are required on the search head.
Archiving requires Hadoop and Java on both the search head and the indexers.
Searching requires Hadoop and Java only on the search head.

Here is the link regarding the Search Head install: http://docs.splunk.com/Documentation/Splunk/latest/HadoopAnalytics/Configureasearchhead

Here is the link to set up NameNode HA: http://docs.splunk.com/Documentation/Splunk/6.5.3/HadoopAnalytics/ProviderConfigurationVariables#Hig...

And Yarn Resource Manager HA: http://docs.splunk.com/Documentation/Splunk/6.5.3/HadoopAnalytics/RequiredConfigurationVariablesforY...
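
To confirm that a search head actually has the Java and Hadoop clients the provider points at, something like the following could be used. This is a sketch under assumptions: check_hadoop_client is a hypothetical helper, and it only tests that the binaries exist and are executable, not that they can reach the cluster.

```shell
# Hypothetical helper: succeed only if both client binaries exist and are executable.
# $1 = directory used as JAVA_HOME, $2 = directory used as HADOOP_HOME
check_hadoop_client() {
    java_home="$1"
    hadoop_home="$2"
    status=0
    if [ ! -x "$java_home/bin/java" ]; then
        echo "missing: $java_home/bin/java"
        status=1
    fi
    if [ ! -x "$hadoop_home/bin/hadoop" ]; then
        echo "missing: $hadoop_home/bin/hadoop"
        status=1
    fi
    return "$status"
}
```

With the values from the provider stanza in the question, the call would be check_hadoop_client /usr/openv/java/jre /opt/hadoop/current, run on each search head.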

rdagan_splunk · ‎04-13-2017

Are you searching index=proxy or index=proxy_archive (All time)?

In the provider, this looks wrong:
vix.fs.default.name = hdfs://ns/projects/csdc_splunk
Normally we see something like:
vix.fs.default.name = hdfs://<some machine name>:8020

In the virtual index, we normally do not see this:
vix.output.buckets.path = hdfs://ns/projects/csdc_splunk/cold/proxy_archive
but rather something like this:
vix.output.buckets.path = /projects/csdc_splunk/cold/proxy_archive

As for the relationship between older.than and unified.search:
You copy buckets to Hadoop after 90 days:
vix.output.buckets.older.than = 7776000

So the unified search cutoff should be older than 90 days. For example:
vix.unified.search.cutoff_sec = 7948900 (about 92 days)
(vix.smart.search.cutoff_sec = 7775000 is not a valid flag in Splunk 6.5.)
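
Both settings are expressed in seconds, so the values are easy to sanity-check with shell arithmetic. This is a small sketch; days_to_seconds is a hypothetical helper name. Note that exactly 92 days is 7948800 seconds, so the 7948900 above is 92 days plus 100 seconds, which still satisfies "older than 90 days".

```shell
# Hypothetical helper: convert a whole number of days to seconds (86400 s per day).
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

days_to_seconds 90   # prints 7776000, the vix.output.buckets.older.than value
days_to_seconds 92   # prints 7948800, a cutoff two days past the archive age
```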

Kieffer87 · ‎04-17-2017

I'm trying index=proxy_archive (All Time).

In core-site.xml we define hdfs://ns for HA. From the indexer CLI I'm able to list the directory contents of hdfs://ns/projects/csdc_splunk, and buckets are being copied without any issues.

vix.output.buckets.path - I can update the path, though buckets are archiving correctly as it is.

vix.unified.search.cutoff_sec - updated indexes.conf to use this flag.

Is it necessary to have Java and the Hadoop client installed on the search heads in addition to the indexers? I don't recall having to do that during my proof of concept, but that was some time ago.
