I have a Splunk virtual index configured and can query it and retrieve data properly, but I am running into issues when filtering the data. When I specify a time range and get the entire dataset back, I can see the fields properly extracted in Splunk along with their counts. When I select a field value to filter on, I should see the same result count as that value's count in the entire dataset. Additionally, I should be able to eliminate unwanted field values. For example, with the search "index=firewall action=blocked", I should see only action=blocked and not action=allowed. This is not the case; the filter seems to be ignored. Other times the filter is applied but the counts do not match: the entire dataset will show something like client_ip=127.0.0.1 (count=3500), yet when I run the search "index=firewall client_ip=127.0.0.1" for the same time range, instead of 3500 results I get a seemingly random number like 758.
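The expectation described here can be written down as a small consistency check. The Python sketch below uses made-up in-memory events (not real Splunk output) and hypothetical helper names; it only illustrates the invariant that the per-value count in the field summary should equal the number of results returned when filtering on that value.

```python
from collections import Counter

# Made-up events standing in for the results of "index=firewall".
events = [
    {"action": "blocked", "client_ip": "127.0.0.1"},
    {"action": "allowed", "client_ip": "127.0.0.1"},
    {"action": "blocked", "client_ip": "10.0.0.5"},
    {"action": "allowed", "client_ip": "127.0.0.1"},
]

def field_summary(events, field):
    """Count events per value of a field, similar to the field sidebar."""
    return Counter(e[field] for e in events if field in e)

def filter_events(events, field, value):
    """Keep only events where field equals value, like 'field=value' in a search."""
    return [e for e in events if e.get(field) == value]

summary = field_summary(events, "client_ip")
filtered = filter_events(events, "client_ip", "127.0.0.1")

# The two numbers should agree; a mismatch is the symptom described in the question.
print(summary["127.0.0.1"], len(filtered))
```

In the virtual index case the two numbers disagree, which suggests the filter is being evaluated differently on the Hadoop side than on the search head.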
Is there something I am missing in my configs, or a reason for this strange behaviour? I understand that when I run a virtual index search against a Hadoop node, it deploys a full instance of Splunk to the node and the two work together to do some variation of MapReduce. I'm not sure what Hadoop is doing on its end to muck up the results so badly. At this point I'd rather have Hadoop stream the entire dataset back to the search head and let the search head complete the processing, since that would be more reliable, though horribly inefficient. Any thoughts or ideas? Below are my virtual provider/index settings:
[provider:POC]
vix.command = $SPLUNK_HOME/bin/jars/sudobash
vix.command.arg.1 = $HADOOP_HOME/bin/hadoop
vix.command.arg.2 = jar
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.command.arg.4 = com.splunk.mr.SplunkMR
vix.description = POC
vix.env.HADOOP_CLIENT_OPTS = -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.env.HADOOP_HEAPSIZE = 512
vix.env.HADOOP_HOME = /usr/bin/hadoop-2.7.3
vix.env.HUNK_THIRDPARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-18.104.22.168.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/lib/jvm/java
vix.env.MAPREDUCE_USER =
vix.family = hadoop
vix.fs.default.name = hdfs://IP:Port
vix.hadoop.security.authorization = 0
vix.mapred.child.java.opts = -server -Xmx1024m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.mapred.job.map.memory.mb = 2048
vix.mapred.job.queue.name = default
vix.mapred.job.reduce.memory.mb = 1024
vix.mapred.job.reuse.jvm.num.tasks = 100
vix.mapred.reduce.tasks = 0
vix.mapreduce.application.classpath = $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*, /usr/lib/hadoop-lzo/lib/*, /usr/share/aws/emr/emrfs/conf, /usr/share/aws/emr/emrfs/lib/*, /usr/share/aws/emr/emrfs/auxlib/*, /usr/share/aws/emr/lib/*, /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar, /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar, /usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar, /usr/share/aws/emr/cloudwatch-sink/lib/*, /usr/share/aws/aws-java-sdk/*
vix.mapreduce.framework.name = yarn
vix.mapreduce.job.jvm.numtasks = 100
vix.mapreduce.job.queuename = default
vix.mapreduce.job.reduces = 0
vix.mapreduce.map.java.opts = -server -Xmx512m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.mapreduce.map.memory.mb = 2048
vix.mapreduce.reduce.java.opts = -server -Xmx512m -XX:ParallelGCThreads=4 -XX:+UseParallelGC -XX:+DisplayVMOutputToStderr
vix.mapreduce.reduce.memory.mb = 512
vix.mode = report
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.heartbeat = 1
vix.splunk.heartbeat.interval = 1000
vix.splunk.heartbeat.threshold = 60
vix.splunk.home.datanode = /tmp/splunk/$SPLUNK_SERVER_NAME/
vix.splunk.home.hdfs = /user/splunk/
vix.splunk.impersonation = 0
vix.splunk.search.column.filter = 1
vix.splunk.search.debug = 1
vix.splunk.search.mixedmode = 1
vix.splunk.search.mr.maxsplits = 10000
vix.splunk.search.mr.minsplits = 100
vix.splunk.search.mr.poll = 2000
vix.splunk.search.mr.splits.multiplier = 10
vix.splunk.search.recordreader = SplunkJournalRecordReader,ValueAvroRecordReader,SimpleCSVRecordReader,SequenceFileRecordReader
vix.splunk.search.recordreader.avro.regex = \.avro$
vix.splunk.search.recordreader.csv.regex = \.([tc]sv)(?:\.(?:gz|bz2|snappy))?$
vix.splunk.search.recordreader.sequence.regex = \.seq$
vix.splunk.setup.onsearch = 1
vix.splunk.setup.package = current
vix.yarn.application.classpath = $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*, $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*, $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*, $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*, /usr/lib/hadoop-lzo/lib/*, /usr/share/aws/emr/emrfs/conf, /usr/share/aws/emr/emrfs/lib/*, /usr/share/aws/emr/emrfs/auxlib/*, /usr/share/aws/emr/lib/*, /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar, /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar, /usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar, /usr/share/aws/emr/cloudwatch-sink/lib/*, /usr/share/aws/aws-java-sdk/*
vix.yarn.resourcemanager.address = IP:Port
vix.yarn.resourcemanager.scheduler.address = IP:Port

[hdfs_fw]
vix.input.1.accept = test.*\.log$
vix.input.1.et.format = s
vix.input.1.et.regex = tmp\/splunk\/(\d+)\/
vix.input.1.et.timezone = GMT
vix.input.1.lt.format = s
vix.input.1.lt.offset = 3600
vix.input.1.lt.regex = tmp\/splunk\/(\d+)\/
vix.input.1.lt.timezone = GMT
vix.input.1.path = /tmp/splunk/*/...
vix.provider = POC
vix.provider.description = POC
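To make the time-range settings in the [hdfs_fw] stanza easier to reason about, here is a rough Python sketch of how an earliest/latest time could be derived from a file path. It assumes that the "s" format means epoch seconds and that the regexes are applied as unanchored searches; the function name and the sample paths are hypothetical, and what Splunk actually does when a regex fails to match is not modelled.

```python
import re

# Values copied from the [hdfs_fw] stanza.
ACCEPT = r"test.*\.log$"
ET_REGEX = r"tmp\/splunk\/(\d+)\/"
LT_REGEX = r"tmp\/splunk\/(\d+)\/"
LT_OFFSET = 3600  # seconds added to the extracted latest time

def path_time_range(path):
    """Return (earliest, latest) in epoch seconds for a path.

    Returns None if the path does not pass the accept regex, and
    (None, None) if no time can be extracted from the path.
    """
    if not re.search(ACCEPT, path):
        return None
    et = re.search(ET_REGEX, path)
    lt = re.search(LT_REGEX, path)
    if not et or not lt:
        return (None, None)
    return (int(et.group(1)), int(lt.group(1)) + LT_OFFSET)

# Hypothetical paths for illustration only.
print(path_time_range("/tmp/splunk/1500000000/test_fw.log"))
print(path_time_range("/var/log/splunk/1500000000/test_fw.log"))
print(path_time_range("/tmp/splunk/1500000000/other.txt"))
```

Note that with these regexes a path outside tmp/splunk/ yields no extracted time range, so the et.regex and lt.regex values need to track any change to vix.input.1.path.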
Would you be able to send me a copy of the Splunk Analytics for Hadoop add-on? I want to run some tests with it, but there is no download button on the "Splunk Analytics for Hadoop" detail page. I would greatly appreciate your help. My email address: email@example.com
Can you verify that this flag vix.input.1.path = /tmp/splunk/*/... is pointing to HDFS (for example, /user/splunk/data/mylogs/...)?
It looks as if you are using the same local directory path as your vix.splunk.home.datanode = /tmp/splunk/$SPLUNK_SERVER_NAME/
That would explain the mixed results.
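A quick way to spot this kind of overlap is to compare the fixed leading part of each path. The Python sketch below is only a string-level check with hypothetical helper names; one path lives on HDFS and the other on the data node's local disk, so a match is a hint to look closer, not proof of a conflict.

```python
import re

def static_prefix(path):
    """Return the part of a path before the first wildcard or variable."""
    cut = re.search(r"\*|\.\.\.|\$", path)
    prefix = path[:cut.start()] if cut else path
    # Normalise so that "/a/b" and "/a/b/" compare the same way.
    return prefix.rstrip("/") + "/"

def paths_overlap(input_path, working_dir):
    """True if one fixed prefix is contained in the other."""
    a = static_prefix(input_path)
    b = static_prefix(working_dir)
    return a.startswith(b) or b.startswith(a)

datanode_home = "/tmp/splunk/$SPLUNK_SERVER_NAME/"

print(paths_overlap("/tmp/splunk/*/...", datanode_home))
print(paths_overlap("/user/splunk/data/mylogs/...", datanode_home))
```

With the settings posted in the question, the first call reports an overlap and the second (the example HDFS path) does not.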
It is pointing to HDFS. I did move the data and update the setting to vix.input.1.path = /var/log/splunk/*/... just so the wildcard doesn't catch the $SPLUNK_SERVER_NAME directory, even though the whitelist should already narrow it down to the specific files I want to search. Even after making that change and updating props to match the new sourcetyping, I am running into the same issue. I can retrieve the entire dataset but get inconsistent results when filtering on extracted fields. I am using Splunk_TA_paloalto on this data for testing. Some fields filter properly, some ignore the filter altogether, and some seem to filter but return improper results: the count shown in the overall dataset doesn't match the result count when the filter is applied.
So I think I might have found the issue. It looks like a filter is ignored when it uses a field that is created via an EVAL in props.conf. The filter works when the field is created via REPORT in props+transforms, as well as when the field is created via FIELDALIAS. It can be deceiving that those fields are extracted by the search head but ignored by Hadoop, since they look like they are available to filter on. It also means I'll have to create a new sourcetype, separate from the one the Palo Alto add-on creates for CIM, so that I can change the way the fields are created.
I've been continuing to test filtering based on the method in which a field is created. Here is what I have so far:
REPORT for string = YES
REPORT for number/ip = Kind of (appears to filter on value but ignores field)
FIELDALIAS = YES
LOOKUP = NO
EVAL = NO
The IP/number part is the painful piece. I'm not sure if there is some setting I'm missing, but it's kind of important to be able to use those fields for stats. Also painful is that I'd have to clone sourcetypes for this, remove the fields that don't work for filtering (since they still appear in the extracted fields), and maintain all of it outside the vendor-managed TA. In some cases you might be able to manipulate the search to get around the search optimizer so that the filtering is done on the search head, but the typical user wouldn't know to do that. The LOOKUP and the EVAL not working on the Hadoop side makes sense to me, but the REPORT IP/integer behaviour is throwing me for a loop.
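For anyone auditing a TA against the findings listed in this thread, a small script can flag which search-time field settings in a props.conf stanza fall into each category. This is a rough sketch: the outcomes are just the ones observed in this thread, not documented behaviour, and the sample stanza, transform name, and lookup name are made up.

```python
# Filtering outcomes observed in this thread, keyed by props.conf setting prefix.
OBSERVED = {
    "REPORT": "yes for strings, partial for numbers/IPs",
    "FIELDALIAS": "yes",
    "LOOKUP": "no",
    "EVAL": "no",
}

def audit_props(text):
    """Return (stanza, setting, kind, observed) for each field-creating setting."""
    results = []
    stanza = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            stanza = line[1:-1]
            continue
        key = line.split("=", 1)[0].strip()
        kind = key.split("-", 1)[0]  # "EVAL-action" -> "EVAL"
        if kind in OBSERVED:
            results.append((stanza, key, kind, OBSERVED[kind]))
    return results

# Hypothetical stanza for illustration; not taken from any real add-on.
SAMPLE = """
[example:firewall]
REPORT-fields = example_fw_fields
FIELDALIAS-src = src_ip AS client_ip
EVAL-action = if(result == "deny", "blocked", "allowed")
LOOKUP-vendor = example_lookup vendor_id OUTPUT vendor
"""

for stanza, key, kind, observed in audit_props(SAMPLE):
    print(f"{stanza}: {key} -> filter works: {observed}")
```

Settings reported as "no" are the ones that would need to be reworked or removed in a cloned sourcetype.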