Hi,
I have a bunch of log files from a web server that look like this:
195.14.65.67 - - [20/Apr/2011:23:59:52 +0200] "GET /xml/listen_xml.jsp?codigousu=x&clausu=y&afiliacio=RS&secacc=42106&xml=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF-8%22%3F%3E%3Cpeticion%3E%3Ctipo%3E105%3C%2Ftipo%3E%3Cparametros%3E%3Ccodishotel%3E610442%23684572%23650777%23614213%23614210%23636068%23692186%23614211%23692507%23606403%23609104%23636372%23618638%23637005%23609145%23610500%23614212%23646694%23667369%23665818%23611550%23617439%23669231%23694714%23688454%23688453%23662191%23%3C%2Fcodishotel%3E%3Cregimen%3EOB%3C%2Fregimen%3E%3Cnumhab1%3E2%3C%2Fnumhab1%3E%3Cpaxes1%3E2-0%3C%2Fpaxes1%3E%3Cusuario%3E707064%3C%2Fusuario%3E%3Cafiliacion%3ERS%3C%2Fafiliacion%3E%3Cfechaentrada%3E09%2F12%2F2011%3C%2Ffechaentrada%3E%3Cfechasalida%3E09%2F16%2F2011%3C%2Ffechasalida%3E%3Cidioma%3E2%3C%2Fidioma%3E%3Cduplicidad%3E1%3C%2Fduplicidad%3E%3Crestricciones%3E1%3C%2Frestricciones%3E%3C%2Fparametros%3E%3C%2Fpeticion%3E
HTTP/1.1" 200 19036 "-" "Jakarta Commons-HttpClient/3.1"
It is a regular Apache log. Splunk is smart enough to extract the basic variables, like codigousu, clausu, afiliacio, xml, etc. The problem I am facing is that I also want to extract the values inside the "xml" variable. The value of this field is URL-encoded; in this case, once decoded, it should look something along these lines:
<?xml version="1.0" encoding="UTF-8"?>
<peticion>
<tipo>105</tipo>
<parametros>
<codishotel>610442#684572#650777#614213#614210#636068#692186#614211#692507#606403#609104#636372#618638#637005#609145#610500#614212#646694#667369#665818#611550#617439#669231#694714#688454#688453#662191#</codishotel>
<regimen>OB</regimen>
<numhab1>2</numhab1>
<paxes1>2-0</paxes1>
<usuario>707064</usuario>
<afiliacion>RS</afiliacion>
<fechaentrada>09/12/2011</fechaentrada>
<fechasalida>09/16/2011</fechasalida>
<idioma>2</idioma>
<duplicidad>1</duplicidad>
<restricciones>1</restricciones>
</parametros>
</peticion>
Several questions:
Many thanks!
Have a look at eval's urldecode function.
<yourbasesearch> | eval urldecodedtext=urldecode(_raw)
This will take the _raw field's contents, URL-decode them, and put the result in the field urldecodedtext. More info on the urldecode function here:
http://www.splunk.com/base/Documentation/latest/SearchReference/CommonEvalFunctions
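If you want to check outside Splunk what the decoded value should look like, here is a minimal Python sketch. It is only an illustration, not what Splunk runs internally: the `decode_xml_param` helper is a made-up name, the input is a shortened fragment of the xml parameter from the sample event, and `unquote_plus` also turns `+` into spaces, which may or may not match how urldecode treats `+`.

```python
from urllib.parse import unquote_plus

# Shortened fragment of the "xml" query parameter from the sample log line.
encoded = (
    "%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF-8%22%3F%3E"
    "%3Cpeticion%3E%3Ctipo%3E105%3C%2Ftipo%3E%3C%2Fpeticion%3E"
)


def decode_xml_param(value: str) -> str:
    """Percent-decode a form-encoded value, turning '+' into spaces.

    Hypothetical helper for illustration; not part of Splunk.
    """
    return unquote_plus(value)


decoded = decode_xml_param(encoded)
print(decoded)
```

Comparing this output with the urldecodedtext field for the same event is a quick way to spot differences such as `+` characters left undecoded.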
As for your question 1, there is a search command called xmlkv that automates extraction of key/value pairs in XML for you. Unfortunately it can only take the _raw field as input. You could always overwrite _raw with its URL-decoded value, using urldecode as above. It is a bit ugly, but it works.
<yourbasesearch> | eval _raw=urldecode(_raw) | xmlkv
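To get a feel for which key/value pairs to expect from the decoded XML, here is a rough Python sketch. It is not how xmlkv is implemented; it is just a simple regex over leaf elements, with a hypothetical `leaf_pairs` helper and a shortened copy of the decoded sample.

```python
import re

# Shortened version of the decoded XML from the sample event.
decoded = (
    "<peticion><tipo>105</tipo><parametros>"
    "<regimen>OB</regimen><numhab1>2</numhab1><paxes1>2-0</paxes1>"
    "</parametros></peticion>"
)

# Match leaf elements only: <name>text</name> with no nested tags inside.
LEAF = re.compile(r"<(\w+)>([^<]*)</\1>")


def leaf_pairs(xml_text: str) -> dict:
    """Return {tag: text} for every leaf element (last one wins on repeats).

    Hypothetical helper for illustration; not Splunk's xmlkv.
    """
    return {name: value for name, value in LEAF.findall(xml_text)}


fields = leaf_pairs(decoded)
print(fields)
```

Container elements such as peticion and parametros produce no pair here because they hold only nested tags, so only the leaf values (tipo, regimen, and so on) come out as fields.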
Regarding question 2, there's no easy one-liner way to do it that I can think of.
For question 3, you could achieve this using rex applied to the field codishotel. You'll have to specify the maximum number of matches rex should retrieve.
<yourbasesearch>
| eval _raw=urldecode(_raw)
| xmlkv
| rex max_match=50 field=codishotel "(?<codishotel_value>\d+)"
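The effect of that rex can be mimicked in Python to see what the multivalue result should contain. This is a sketch under assumptions: `hotel_codes` is a made-up helper, the input is the first few codes from the sample event, and the slice only imitates the cap that max_match imposes.

```python
import re

# First few hotel codes from the sample event's codishotel element.
codishotel = "610442#684572#650777#614213#"


def hotel_codes(value: str, max_match: int = 50) -> list:
    """Return up to max_match runs of digits, in order of appearance.

    Hypothetical helper for illustration; mirrors the \\d+ pattern used in rex.
    """
    return re.findall(r"\d+", value)[:max_match]


codes = hotel_codes(codishotel)
print(codes)
```

The sample event carries 27 codes, so a cap of 50 leaves headroom. If requests can list more hotels than that, raise the limit accordingly.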
It all works great, but performance is not the best you can get. As we are dealing with 5 million log entries per day, performance is kind of key.
I was wondering whether, by playing with external script lookups in props.conf/transforms.conf, we could index the data already extracted.
Something in line with this:
http://splunk-base.splunk.com/answers/22630/putting-calculations-in-conf-files
In fact, I am also interested in knowing whether an external script launches a new process every time it is executed.
Thanks!