Hunk search-time field extraction not working


I'm running Hunk v6.2.2 against Hortonworks Hadoop v2.2.4.2. My search-time field extraction for client_host is inconsistent: it returns too few results or none at all. For example, searching "index=hadoop client_host=" over the last 4 hours (at 4 PM Eastern time) returns no results. Can someone help me troubleshoot?

Raw logs in /myprovider/syslogs/2015/2015-06-10_datacollector2.txt contain:

2015-06-10T20:13:33Z syslog.tcp {"message":"<14>Jun 10 16:07:03 WIN-VQCJADNQOGL MSWinEventLog\t1\tMicrosoft-Windows-LanguagePackSetup/Operational\t71\tWed Jun 10 16:07:03 2015\t4001\tMicrosoft-Windows-LanguagePackSetup\tSYSTEM\tUser\tInformation\tWIN-VQCJADNQOGL\tLanguage Pack cleanup functionality\t\tLPRemove terminating.\t16\r","client_host":""}
2015-06-10T20:13:33Z syslog.tcp {"message":"<14>Jun 10 16:07:03 WIN-VQCJADNQOGL MSWinEventLog\t1\tMicrosoft-Windows-MUI/Operational\t72\tWed Jun 10 16:07:03 2015\t3003\tMicrosoft-Windows-MUI\tSYSTEM\tUser\tInformation\tWIN-VQCJADNQOGL\tMUI resource cache builder\t\tMUI resource cache builder has been called with the following parameters: (null).\t29\r","client_host":""}
2015-06-10T20:13:45Z syslog.tcp {"message":"<14>Jun 10 16:07:13 WIN-VQCJADNQOGL MSWinEventLog\t1\tMicrosoft-Windows-MUI/Operational\t73\tWed Jun 10 16:07:13 2015\t3007\tMicrosoft-Windows-MUI\tSYSTEM\tUser\tInformation\tWIN-VQCJADNQOGL\tMUI resource cache builder\t\tNew resource cache built and installed on system. New cache index is 5, live cache index is 5 and config is set to 3.\t30\r","client_host":""}

My Hunk config:


vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-s6.0-hy2.0.jar
vix.env.HADOOP_HOME = /usr/hdp/
vix.env.JAVA_HOME = /usr/lib/jvm/java-7-openjdk-amd64
vix.family = hadoop
vix.fs.default.name = hdfs://hadoop-namenode1.internal:8020
vix.mapreduce.framework.name = yarn
vix.mapred.child.java.opts = -server -Xmx1024m
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /user/splunk/myprovider
vix.yarn.resourcemanager.address = hadoop-namenode2.internal:8050
vix.yarn.resourcemanager.scheduler.address = hadoop-namenode2.internal:8030
vix.yarn.application.classpath = /etc/hadoop/conf,/usr/hdp/*,/usr/hdp/*,/usr/hdp/*,/usr/hdp/*,/usr/hdp/*,/usr/hdp/*
vix.splunk.home.datanode = /user/splunk/splunk-search1/
vix.splunk.setup.package = /opt/hunk/hunk-6.2.2-257696-linux-2.6-x86_64.tgz

vix.input.1.path = /myprovider/syslogs/...
vix.provider = myprovider
vix.input.1.accept = \.txt$
vix.input.1.et.format = yyyyMMdd
vix.input.1.et.offset = 3600
vix.input.1.et.regex = /myprovider/syslogs/(\d+)/\d+-(\d+)-(\d+)_\w+\.txt
vix.input.1.lt.format = yyyyMMdd
vix.input.1.lt.offset = 86400
vix.input.1.lt.regex = /myprovider/syslogs/(\d+)/\d+-(\d+)-(\d+)_\w+\.txt


EXTRACT-clienthost = client_host\"\:\"(?<client_host>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\"

sourcetype = hadoop
priority = 100
TIME_FORMAT = %Y-%m-%dT%H:%M:%SZ
1 Solution

Splunk Employee

Can you try (a) replacing the stanza name and, more importantly, (b) removing the unnecessary backslashes before the " characters in the extraction regex? If that works, since the data looks partly like JSON, I'd recommend also allowing optional whitespace between the : and the " in the regex.

EXTRACT-clienthost = client_host":"(?<client_host>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
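To try the pattern outside Splunk, here is a minimal Python sketch of the whitespace-tolerant variant suggested above. It makes a few assumptions: Python spells named groups `(?P<name>...)` where Splunk's PCRE accepts `(?<name>...)`, the `\s*` is placed on both sides of the colon, and the helper name, the shortened events, and the IP address are hypothetical rather than taken from the original logs.

```python
import re

# Python spells named groups (?P<name>...); Splunk's PCRE accepts (?<name>...).
# \s* tolerates optional whitespace around the colon (an assumption about
# where spaces might appear in the JSON-like payload).
CLIENT_HOST = re.compile(
    r'client_host"\s*:\s*"(?P<client_host>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"'
)


def extract_client_host(raw_event):
    """Return the captured IP string, or None when the pattern does not match."""
    match = CLIENT_HOST.search(raw_event)
    return match.group("client_host") if match else None


# Hypothetical, shortened events; the address is illustrative only.
compact = '2015-06-10T20:13:33Z syslog.tcp {"message":"...","client_host":"10.0.0.5"}'
spaced = '2015-06-10T20:13:33Z syslog.tcp {"message": "...", "client_host": "10.0.0.5"}'
empty = '2015-06-10T20:13:33Z syslog.tcp {"message":"...","client_host":""}'

for event in (compact, spaced, empty):
    print(extract_client_host(event))
```

In this sketch, an event whose client_host value is an empty string produces no match, so it would carry no client_host field at all. That may be worth checking if a search returns fewer events than expected.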

Splunk Employee

The regex and stanza are shown correctly (i.e. no formatting mix-up).


Thanks! Looks like my regex was off.

Splunk Employee

Is your data being sourcetyped correctly, i.e. does the sourcetype field return a value of hadoop for these events? If so, I would add a field extraction definition to the hadoop sourcetype stanza in props.conf on your search head:

(or props.conf in the app of your choice) $SPLUNK_HOME/etc/apps/appofyourchoice/local/props.conf


[hadoop]
EXTRACT-client_host = (?m)client_host":"(?<client_host>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"}

Then restart Splunk on your search head:
./splunk restart

To validate, you can run a search like this:
index=* sourcetype=hadoop | stats count by client_host

If the sourcetype is not hadoop and you would rather key the extraction on the source field, this should work:

On your search head:
(or props.conf in the app of your choice) $SPLUNK_HOME/etc/apps/appofyourchoice/local/props.conf


EXTRACT-client_host = (?m)client_host":"(?<client_host>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"}
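One thing to keep in mind with this variant is the trailing `}`: the pattern only matches when client_host is the last key in the JSON object. The Python sketch below illustrates that behaviour under some assumptions: the named group is written `(?P<name>...)` for Python, and the helper name and both shortened events are hypothetical.

```python
import re

# Same pattern as above, with Python's (?P<name>...) spelling for the group.
# The literal "} at the end ties the match to the close of the JSON object.
ANCHORED = re.compile(
    r'(?m)client_host":"(?P<client_host>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"}'
)


def extract_anchored(raw_event):
    """Return the captured IP string, or None when the pattern does not match."""
    match = ANCHORED.search(raw_event)
    return match.group("client_host") if match else None


# Hypothetical events; only the key order differs.
last_key = '{"message":"...","client_host":"10.0.0.5"}'
not_last = '{"client_host":"10.0.0.5","message":"..."}'

print(extract_anchored(last_key))
print(extract_anchored(not_last))
```

If the key order in the data can vary, dropping the final `}` from the pattern would remove that dependency.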


Looks like this page formatted the escape characters. Here's my original question: http://pastebin.ca/3023980
