
Splunk Analytics for Hadoop not filtering correctly and providing inconsistent results

SplunkTrust

I have a Splunk virtual index configured and can query it and retrieve data properly, but I'm running into issues when filtering that data. When I search over a time range and get the entire dataset back, I can see the fields properly extracted in Splunk along with their counts. When I then filter on one of those fields, the result count should match that field value's count in the full dataset, and unwanted field values should be eliminated. For example, with the search "index=firewall action=blocked" I should only see action=blocked events, never action=allowed. In practice that filter sometimes seems to be ignored entirely. Other times the filter is applied but the counts are wrong: the full dataset shows something like client_ip=127.0.0.1 (count=3500), yet running "index=firewall client_ip=127.0.0.1" over the same time range returns a seemingly random 758 results instead of 3500.
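One way to reproduce the mismatch (the time range here is just illustrative) is to compare the per-value count from the full dataset against the filtered result count:

index=firewall earliest=-24h@h latest=now | stats count by client_ip
index=firewall client_ip=127.0.0.1 earliest=-24h@h latest=now | stats count

The count reported for client_ip=127.0.0.1 in the first search should equal the single count from the second, but on the virtual index they differ.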

Is there something I'm missing in my configs, or a reason for this strange behaviour? I understand that when I run a search against a virtual index, Splunk deploys a full instance of itself to each Hadoop node and they work together to do a variation of MapReduce. I'm not sure what Hadoop is doing on its end to muck up the results so badly; at this point I'd rather have Hadoop stream the entire dataset back to the search head and let the search head complete the processing, since that would be more reliable, though horribly inefficient. Any thoughts or ideas? Below are my virtual provider/index settings:

[provider:POC]
vix.command = $SPLUNK_HOME/bin/jars/sudobash
vix.command.arg.1 = $HADOOP_HOME/bin/hadoop
vix.command.arg.2 = jar
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.command.arg.4 = com.splunk.mr.SplunkMR
vix.description = POC
vix.env.HADOOP_CLIENT_OPTS = -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.env.HADOOP_HEAPSIZE = 512
vix.env.HADOOP_HOME = /usr/bin/hadoop-2.7.3
vix.env.HUNK_THIRDPARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/lib/jvm/java
vix.env.MAPREDUCE_USER =
vix.family = hadoop
vix.fs.default.name = hdfs://IP:Port
vix.hadoop.security.authorization = 0
vix.mapred.child.java.opts = -server -Xmx1024m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.mapred.job.map.memory.mb = 2048
vix.mapred.job.queue.name = default
vix.mapred.job.reduce.memory.mb = 1024
vix.mapred.job.reuse.jvm.num.tasks = 100
vix.mapred.reduce.tasks = 0
vix.mapreduce.application.classpath = $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*, /usr/lib/hadoop-lzo/lib/*, /usr/share/aws/emr/emrfs/conf, /usr/share/aws/emr/emrfs/lib/*, /usr/share/aws/emr/emrfs/auxlib/*, /usr/share/aws/emr/lib/*, /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar, /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar, /usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar, /usr/share/aws/emr/cloudwatch-sink/lib/*, /usr/share/aws/aws-java-sdk/*
vix.mapreduce.framework.name = yarn
vix.mapreduce.job.jvm.numtasks = 100
vix.mapreduce.job.queuename = default
vix.mapreduce.job.reduces = 0
vix.mapreduce.map.java.opts = -server -Xmx512m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.mapreduce.map.memory.mb = 2048
vix.mapreduce.reduce.java.opts = -server -Xmx512m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.mapreduce.reduce.memory.mb = 512
vix.mode = report
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.heartbeat = 1
vix.splunk.heartbeat.interval = 1000
vix.splunk.heartbeat.threshold = 60
vix.splunk.home.datanode = /tmp/splunk/$SPLUNK_SERVER_NAME/
vix.splunk.home.hdfs = /user/splunk/
vix.splunk.impersonation = 0
vix.splunk.search.column.filter = 1
vix.splunk.search.debug = 1
vix.splunk.search.mixedmode = 1
vix.splunk.search.mr.maxsplits = 10000
vix.splunk.search.mr.minsplits = 100
vix.splunk.search.mr.poll = 2000
vix.splunk.search.mr.splits.multiplier = 10
vix.splunk.search.recordreader = SplunkJournalRecordReader,ValueAvroRecordReader,SimpleCSVRecordReader,SequenceFileRecordReader
vix.splunk.search.recordreader.avro.regex = \.avro$
vix.splunk.search.recordreader.csv.regex = \.([tc]sv)(?:\.(?:gz|bz2|snappy))?$
vix.splunk.search.recordreader.sequence.regex = \.seq$
vix.splunk.setup.onsearch = 1
vix.splunk.setup.package = current
vix.yarn.application.classpath = $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*, $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*, $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*, $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*, /usr/lib/hadoop-lzo/lib/*, /usr/share/aws/emr/emrfs/conf, /usr/share/aws/emr/emrfs/lib/*, /usr/share/aws/emr/emrfs/auxlib/*, /usr/share/aws/emr/lib/*, /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar, /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar, /usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar, /usr/share/aws/emr/cloudwatch-sink/lib/*, /usr/share/aws/aws-java-sdk/*
vix.yarn.resourcemanager.address = IP:Port
vix.yarn.resourcemanager.scheduler.address = IP:Port

[hdfs_fw]
vix.input.1.accept = test.*\.log$
vix.input.1.et.format = s
vix.input.1.et.regex = tmp\/splunk\/(\d+)\/
vix.input.1.et.timezone = GMT
vix.input.1.lt.format = s
vix.input.1.lt.offset = 3600
vix.input.1.lt.regex = tmp\/splunk\/(\d+)\/
vix.input.1.lt.timezone = GMT
vix.input.1.path = /tmp/splunk/*/...
vix.provider = POC
vix.provider.description = POC

New Member

Dear mdsnmss:
Would you send me a copy of the Splunk Analytics for Hadoop add-on? I want to run some tests on it, but there is no download button on the "Splunk Analytics for Hadoop" detail page. I would greatly appreciate your help! My mail address: wangcq@allinfinance.com
Great thanks!!!


Splunk Employee

Can you verify that this flag vix.input.1.path = /tmp/splunk/*/... is pointing to HDFS (for example, /user/splunk/data/mylogs/...)?

It looks as if you are using the same path as your local vix.splunk.home.datanode = /tmp/splunk/$SPLUNK_SERVER_NAME/
That would explain the mixed results.


SplunkTrust

It is pointing to HDFS. I did move the data and update to vix.input.1.path = /var/log/splunk/*/... so the wildcard doesn't catch $SPLUNK_SERVER_NAME, even though the whitelist should already narrow it down to the specific files I'm looking to search. Even after updating that, and updating props.conf to apply the new sourcetype, I'm running into the same issue: I can retrieve the entire dataset but get inconsistent results when filtering on extracted fields. I'm testing with Splunk_TA_paloalto on this data. Some fields filter properly, some ignore the filter altogether, and some appear to filter but return improper results, i.e. the count in the overall dataset doesn't match the result count that should be present when filtered.


SplunkTrust

So I think I might have found the issue. When I filter on fields that are created via an EVAL in props.conf, the filter is ignored. Filtering works when the field is created via a REPORT in props + transforms, and also when it's created via a FIELDALIAS. It can be deceiving, because those EVAL fields are still extracted by the search head but ignored by Hadoop, so they look available to filter on. It also means I'll have to create a new sourcetype, separate from the Palo Alto add-on's CIM setup, so I can change how those fields are created.
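To illustrate the three extraction styles side by side, here is a hypothetical props.conf sketch (the stanza, transform, and field names are made up, not taken from the Palo Alto TA):

[my_pan_firewall]
# REPORT (props + transforms): filter is honored by the virtual index
REPORT-fw_fields = fw_field_extractions
# FIELDALIAS: filter is also honored
FIELDALIAS-client = src_ip AS client_ip
# EVAL: computed at search time on the search head, so a filter like
# action=blocked against this field appears to be ignored on the Hadoop side
EVAL-action = if(raw_action="deny", "blocked", "allowed")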


SplunkTrust

I've been continuing to test filtering based on the method in which a field is created. Here is what I have so far:
REPORT for string = YES
REPORT for number/ip = Kind of (appears to filter on value but ignores field)
FIELDALIAS = YES
LOOKUP = NO
EVAL = NO

The IP/number part is the painful piece. I'm not sure if there's some setting I'm missing, but being able to use those fields for stats is kind of important. Also painful: I'd have to clone sourcetypes, remove the fields that don't work for filtering (since they still appear in the extracted fields), and maintain that outside the vendor-managed TA. You might be able to manipulate the search a bit to bypass the search optimizer so filtering is done on the search head in some cases, but the typical user wouldn't know to do that. LOOKUP and EVAL not working on the Hadoop side makes sense to me, but the REPORT IP/integer case is throwing me for a loop.
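As a possible workaround (an untested sketch), moving the filter out of the base search into an explicit pipeline command can keep the filtering on the search head rather than pushing it down to Hadoop, at the cost of streaming more data back:

index=firewall | where action="blocked"

Note that the search optimizer may still rewrite this and push the predicate back into the base search in some versions, which is exactly the behavior a typical user wouldn't know to work around.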

Engager

I've been running into exactly the same issues - did you ever get this resolved?


Splunk Employee

Let me know and we can schedule a WebEx to debug these issues. My Splunk email is rdagan
