Splunk Search

Hunk search-time field extraction not working

suarezry
Builder

Hunk v6.2.2 connected to Hortonworks Hadoop v2.2.4.2. My search-time field extraction for client_host is not consistent: it returns too few results or none at all. For example, if I search "index=hadoop client_host=10.0.0.10" over the last 4 hours (at 4pm Eastern time), I get no results. Can someone help me troubleshoot?

Raw logs in /myprovider/syslogs/2015/2015-06-10_datacollector2.txt contain:

2015-06-10T20:13:33Z syslog.tcp {"message":"<14>Jun 10 16:07:03 WIN-VQCJADNQOGL MSWinEventLog\t1\tMicrosoft-Windows-LanguagePackSetup/Operational\t71\tWed Jun 10 16:07:03 2015\t4001\tMicrosoft-Windows-LanguagePackSetup\tSYSTEM\tUser\tInformation\tWIN-VQCJADNQOGL\tLanguage Pack cleanup functionality\t\tLPRemove terminating.\t16\r","client_host":"10.0.0.10"}
2015-06-10T20:13:33Z syslog.tcp {"message":"<14>Jun 10 16:07:03 WIN-VQCJADNQOGL MSWinEventLog\t1\tMicrosoft-Windows-MUI/Operational\t72\tWed Jun 10 16:07:03 2015\t3003\tMicrosoft-Windows-MUI\tSYSTEM\tUser\tInformation\tWIN-VQCJADNQOGL\tMUI resource cache builder\t\tMUI resource cache builder has been called with the following parameters: (null).\t29\r","client_host":"10.0.0.10"}
2015-06-10T20:13:45Z syslog.tcp {"message":"<14>Jun 10 16:07:13 WIN-VQCJADNQOGL MSWinEventLog\t1\tMicrosoft-Windows-MUI/Operational\t73\tWed Jun 10 16:07:13 2015\t3007\tMicrosoft-Windows-MUI\tSYSTEM\tUser\tInformation\tWIN-VQCJADNQOGL\tMUI resource cache builder\t\tNew resource cache built and installed on system. New cache index is 5, live cache index is 5 and config is set to 3.\t30\r","client_host":"10.0.0.10"}

My Hunk config:

indexes.conf

[provider:myprovider]
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-s6.0-hy2.0.jar
vix.env.HADOOP_HOME = /usr/hdp/2.2.4.2-2/hadoop
vix.env.JAVA_HOME = /usr/lib/jvm/java-7-openjdk-amd64
vix.family = hadoop
vix.fs.default.name = hdfs://hadoop-namenode1.internal:8020
vix.mapreduce.framework.name = yarn
vix.mapred.child.java.opts = -server -Xmx1024m
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /user/splunk/myprovider
vix.yarn.resourcemanager.address = hadoop-namenode2.internal:8050
vix.yarn.resourcemanager.scheduler.address = hadoop-namenode2.internal:8030
vix.yarn.application.classpath = /etc/hadoop/conf,/usr/hdp/2.2.4.2-2/hadoop/client/*,/usr/hdp/2.2.4.2-2/hadoop/lib/*,/usr/hdp/2.2.4.2-2/hadoop-hdfs/*,/usr/hdp/2.2.4.2-2/hadoop-hdfs/lib/*,/usr/hdp/2.2.4.2-2/hadoop-yarn/*,/usr/hdp/2.2.4.2-2/hadoop-yarn/lib/*
vix.splunk.home.datanode = /user/splunk/splunk-search1/
vix.splunk.setup.package = /opt/hunk/hunk-6.2.2-257696-linux-2.6-x86_64.tgz

[hadoop]
vix.input.1.path = /myprovider/syslogs/...
vix.provider = myprovider
vix.input.1.accept = \.txt$
vix.input.1.et.format = yyyyMMdd
vix.input.1.et.offset = 3600
vix.input.1.et.regex = /myprovider/syslogs/(\d+)/\d+-(\d+)-(\d+)_\w+\.txt
vix.input.1.lt.format = yyyyMMdd
vix.input.1.lt.offset = 86400
vix.input.1.lt.regex = /myprovider/syslogs/(\d+)/\d+-(\d+)-(\d+)_\w+\.txt

props.conf

[source::/myprovider/syslogs/*/*]
EXTRACT-clienthost = client_host\"\:\"(?<client_host>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\"

sourcetype = hadoop
priority = 100
ANNOTATE_PUNCT = false
SHOULD_LINEMERGE = false
MAX_TIMESTAMP_LOOKAHEAD = 30
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%dT%H:%M:%SZ
TZ=UTC
1 Solution

Ledion_Bitincka
Splunk Employee

Can you try a) replacing the stanza name and, more importantly, b) removing the unnecessary backslashes before the " characters in the extraction regex? If that works, then, given that the data looks partially like JSON, I'd also recommend adding optional spaces between the : and " in the regex.

[source::/myprovider/syslogs/...]
EXTRACT-clienthost = client_host":"(?<client_host>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
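
If you want to sanity-check the corrected pattern (with the optional whitespace around the colon) before editing props.conf, something like this inline rex should show whether it matches your events — client_host_test is just a throwaway field name for the test:

index=hadoop | rex "client_host\"\s*:\s*\"(?<client_host_test>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})" | stats count by client_host_test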

Ledion_Bitincka
Splunk Employee

The regex and stanza are shown correctly (i.e., no formatting mess-up).

suarezry
Builder

Thanks! Looks like my regex was off.

rphillips_splk
Splunk Employee

Is your data being sourcetyped correctly, i.e., does the sourcetype field return a value of hadoop for these events? If so, I would add a field extraction definition to the hadoop sourcetype stanza in props.conf on your search head:

$SPLUNK_HOME/etc/system/local/props.conf
(or props.conf in the app of your choice) $SPLUNK_HOME/etc/apps/appofyourchoice/local/props.conf

[hadoop]

EXTRACT-client_host = (?m)client_host":"(?<client_host>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"}

Then restart Splunk on your SH:
cd $SPLUNK_HOME/bin
./splunk restart

To validate, you can run a search like this:
index=* sourcetype=hadoop | stats count by client_host
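
If that comes back empty, a quick way to confirm how these events are actually being sourcetyped (assuming the virtual index is named hadoop, as in the indexes.conf above) is something like:

index=hadoop | stats count by sourcetype, source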

If not, and you want to do the extraction on the source field instead, this should work:

on your search head:
$SPLUNK_HOME/etc/system/local/props.conf
(or props.conf in the app of your choice) $SPLUNK_HOME/etc/apps/appofyourchoice/local/props.conf

[source::/myprovider/syslogs/...]

EXTRACT-client_host = (?m)client_host":"(?<client_host>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"}

suarezry
Builder

Looks like this page mangled the escape characters. Here's my original question: http://pastebin.ca/3023980
