I have over 100 Apache webservers which forward their logs to a syslog-ng server, which then forwards the data to a TCP data input on Splunk, as well as to other non-Splunk log-analysis servers.
In Splunk Search, the data looks like this:
Dec 16 10:29:59 192.168.99.100 httpd[10583]: site1.example.org 10.4.5.6 - - [16/Dec/2014:10:29:59 -0800] "GET /rest/somepath/12345" HTTP/1.1" 200 105066 "-" "-"
Dec 16 10:29:59 192.168.99.101 httpd[22404]: site2.example.org 4.4.12.15 - someuser [16/Dec/2014:10:29:59 -0800] "GET /wiki/javascript/foo.js" HTTP/1.1" 304 - "https://site2.example.org/wiki/somepage.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36"
Dec 16 10:29:59 192.168.6.100 httpd[6380]: site3.example.org 172.16.43.41 - - [16/Dec/2014:10:29:59 -0800] "GET /project/projectA/somescript.cgi?username=spiderman" 200 9048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
However, Splunk recognizes only a few default fields in this data. It recognizes host, process, source, sourcetype, date_hour, etc. It does not recognize Apache-specific fields like clientip, status, method, etc., which are mentioned in the Splunk tutorial. It doesn't even recognize a string like 4.4.12.15 as an IP address.
As a result, I need to create a whole bunch of custom field extractions in order to do many useful tasks in Splunk.
Why does Splunk not recognize fields in my Apache data? How can I transform the data so that Splunk will recognize the data correctly?
Second question: Would it help if I used a Splunk Forwarder on our syslog server instead of using TCP for data input?
Use the KV_MODE attribute in props.conf to specify the field/value extraction mode for your data.
auto: Extracts field/value pairs that are separated by equal signs. This is the default field extraction behavior if you do not include this attribute in your field extraction stanza.
KV_MODE = auto
Hope this works.
The original question has nothing to do with key/value equal-sign extractions.
I'm not sure what KV_MODE has to do with my problem. Can you explain?
It automatically extracts the fields. In your case, clientip and status can be extracted by Splunk automatically, and they would then show up under Interesting Fields. On indexers/search heads, KV_MODE is mostly set to none to reduce load. Please check this option.
However, my data does not normally use key=value pairs, nor is it XML or JSON based, and KV_MODE=auto is already the default. My log data is standard, Unix-type syslog data.
The first part of each line in these events looks like syslog data, so this data is likely being recognized as the syslog sourcetype. The client IP field is where it actually starts looking like combined Apache access data. Events that are a mishmash of two different sourcetypes will not work with the built-in extractions, so you need to either remove the parts of the events that do not belong to the pretrained Apache access log sourcetype before input, or implement a transform that trims all the syslog content before the clientip. Another way would be to customize either of the two extraction transforms to borrow bits from the other, at which point you will have created your own syslog-httpd-access sourcetype. I wanted to share a little background on why it is not working, but instead of doing all the work yourself, you might want to look at this:
Thanks for the help. I made progress, but I'm still not there yet.
I used transforms.conf and props.conf as described on that page to transform data from the old format:
Dec 16 10:29:59 192.168.99.100 httpd[10583]: site1.example.org 10.4.5.6 - - [16/Dec/2014:10:29:59 -0800] "GET /rest/somepath/12345" HTTP/1.1" 200 105066 "-" "-"
To the new format:
10.4.5.6 - - [16/Dec/2014:10:29:59 -0800] "GET /rest/somepath/12345" HTTP/1.1" 200 105066 "-" "-"
Splunk still doesn't recognize any of the Apache-specific fields such as clientip or status. Any ideas?
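For reference, an index-time transform that produces the trimmed format shown above might look roughly like the sketch below. The stanza name strip-syslog-header and the class name strip_header are made up for illustration, and the regex is an assumption based only on the sample events in this thread; the page referred to above may do it differently.

```ini
# transforms.conf
# Hypothetical stanza name. The regex is an assumption derived from the
# sample events: syslog timestamp, sending host, httpd[pid]:, virtual host,
# then the Apache access line, which is captured as $1.
[strip-syslog-header]
REGEX = ^\w{3}\s+\d+\s+\d+:\d+:\d+\s+\S+\s+httpd\[\d+\]:\s+\S+\s+(.*)$
FORMAT = $1
DEST_KEY = _raw

# props.conf
# Attach the transform to events that arrive with the syslog sourcetype.
[syslog]
TRANSFORMS-strip_header = strip-syslog-header
```

Because the regex requires httpd[pid]:, other syslog events should pass through unchanged. Rewriting _raw this way changes only the event text; the sourcetype assigned to the event stays whatever it was.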
What sourcetype is the data getting indexed as? The sourcetype on this input might be set to something other than access_common. IIRC, Splunk determines pretrained sourcetypes based on some of the first data in the input, so you may need to set the sourcetype of the input to access_common.
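Setting the sourcetype on a TCP input is done in inputs.conf. A minimal sketch, assuming a hypothetical port 5514 and an input that carries nothing but Apache access events (the port number is a placeholder, not something stated in this thread):

```ini
# inputs.conf
# Hypothetical port; replace with the port the TCP data input listens on.
[tcp://5514]
# access_common is the name suggested above. access_combined is the
# pretrained name for the combined format (with referer and user agent),
# which is what the sample events resemble.
sourcetype = access_common
```

An explicit sourcetype applies to every event that arrives on that input, so this only fits a dedicated Apache feed.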
The sourcetype is still set to syslog. I'm not sure if or how I can change that.
Try adding the reference to the correct extraction to the syslog sourcetype. If you have other types of data coming in as syslog, they might be impacted. The correct way to address this requires either breaking out different sourcetypes from your syslog data or doing something more advanced using an event-based override as described here:
http://docs.splunk.com/Documentation/Splunk/6.2.1/Data/Advancedsourcetypeoverrides
If you have only Apache data here, you may be able to add this to the syslog sourcetype stanza in props.conf and have it work, but it may break other events or not transform them properly:
REPORT-access = access-extractions
This is what actually tells it what extraction definition to use.
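The event-based override mentioned above could be sketched as follows. This is only an illustration: the stanza and class names are hypothetical, and the regex assumes that every Apache access event carries an httpd[pid]: tag, as the samples in this thread do.

```ini
# transforms.conf
# Hypothetical stanza name. Matches events tagged httpd[pid]: and
# rewrites their sourcetype at index time.
[set-sourcetype-apache-access]
REGEX = \shttpd\[\d+\]:\s
FORMAT = sourcetype::access_combined
DEST_KEY = MetaData:Sourcetype

# props.conf
# Applied to everything that arrives as syslog; non-matching events
# keep the syslog sourcetype.
[syslog]
TRANSFORMS-apache_sourcetype = set-sourcetype-apache-access
```

If a transform that strips the syslog header is also in use, the override would need to be listed before it in the same TRANSFORMS setting, since transforms in one list are applied in the order given and the httpd[pid]: tag is gone once the header is removed.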
@chanfoli, do you think this would be better if I put a Splunk Forwarder on my syslog server instead? I imagine that this way, the data won't automatically get tagged with the syslog sourcetype and the fields might get extracted correctly. I would probably need to strip the syslog header on the Splunk Forwarder, but I am not sure if that is possible.
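A forwarder-side input could be sketched like this, assuming syslog-ng on the syslog server already writes the Apache lines to a file of their own. The path below is hypothetical and not taken from this thread.

```ini
# inputs.conf on the forwarder
# Hypothetical path; point it at the file syslog-ng writes Apache events to.
[monitor:///var/log/remote/apache_access.log]
sourcetype = access_combined
```

As far as I know, a universal forwarder does not apply index-time transforms such as header stripping; those would still run on the indexer or on a heavy forwarder.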
Thanks. These syslogs contain data from thousands of systems and contain more than just Apache log data. I'll take a look at your suggestion.