Why am I unable to extract 2 fields from source at index-time with my current configuration and regex?

Communicator

All my log files are in folders named:

 c:\blah\something\myapp_test\logs\somelogfile.log

 => app=myapp 
 => env=test

I want to extract two fields from source, to make it easy to just search for "app=myapp env=test"

Since the fields are always there and should be a part of most queries, it seems like a good idea to add them at index time(?)

In etc/system/local I have added:

transforms.conf

[add_app_env]
SOURCE_KEY=source
REGEX=^.*\\\\([a-zA-Z0-9-]+)_([A-Z]+)\\\\.*
FORMAT=app::$1 env::$2
WRITE_META=true

props.conf

[add_app_field]
TRANSFORMS-app = add_app_env

[add_env_field]
TRANSFORMS-env = add_app_env

fields.conf

[add_app_env]
INDEXED=true

But I do not get my app and env fields and I have no idea how to debug this other than trial and error.

I tested my regular expression with a rex extraction - so I think that part works.
I also tried simplifying and just extracting a single field.

0 Karma
1 Solution

Legend

I don't think that creating these fields at index time will improve performance. Instead, I think it makes your configuration more brittle, complex and hard to manage.

You could easily do the same field extraction at search time:

props.conf

[source::*somelogfile.log]
EXTRACT-xyz=^[cC]\:\\\w+\\\w+\\(?<app>[a-zA-Z0-9\-]+)_(?<env>[a-zA-Z0-9\-]+)\\\w+\\somelogfile\.log in source
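
One way to sanity-check this pattern outside Splunk is to run it against the sample path from the question. The sketch below uses Python's `re` module as a stand-in for Splunk's regex engine, so the named groups are written `(?P<name>...)` instead of `(?<name>...)`. The function name `extract_app_env` is made up for illustration and is not part of any Splunk API.

```python
import re

# The EXTRACT pattern from the answer above, with (?<name>...) rewritten as
# (?P<name>...) because Python's re module requires that spelling.
SEARCH_TIME_PATTERN = re.compile(
    r"^[cC]\:\\\w+\\\w+\\(?P<app>[a-zA-Z0-9\-]+)_(?P<env>[a-zA-Z0-9\-]+)"
    r"\\\w+\\somelogfile\.log"
)

def extract_app_env(source):
    """Return {'app': ..., 'env': ...} for a matching source path, else None."""
    match = SEARCH_TIME_PATTERN.search(source)
    return match.groupdict() if match else None

if __name__ == "__main__":
    # Prints {'app': 'myapp', 'env': 'test'}
    print(extract_app_env(r"c:\blah\something\myapp_test\logs\somelogfile.log"))
```

Note that the pattern expects exactly two folders between the drive letter and the `app_env` folder, so a path with a different depth would return `None` here.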

Communicator

I got index time field extraction to work by:

etc/system/local/transforms.conf
[add_app_env]
SOURCE_KEY = MetaData:Source
REGEX = ^.*\\(?<app>[a-zA-Z0-9\-]+)_(?<env>[a-zA-Z]+)\\.+
FORMAT = app::$1 env::$2
WRITE_META = true

etc/system/local/props.conf
[source::...]
TRANSFORMS-appenv = add_app_env

etc/system/local/fields.conf
[app]
INDEXED=true

[env]
INDEXED=true
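
To see what this transform would write for the sample path, here is a rough Python simulation. It is not how Splunk implements index-time extraction: `apply_format` is a hypothetical helper that only mimics substituting `$1` and `$2` into the FORMAT string, and the named groups use Python's `(?P<name>...)` spelling.

```python
import re

# REGEX from transforms.conf above, in Python's named-group syntax.
REGEX = re.compile(r"^.*\\(?P<app>[a-zA-Z0-9\-]+)_(?P<env>[a-zA-Z]+)\\.+")
FORMAT = "app::$1 env::$2"

def apply_format(source, regex=REGEX, fmt=FORMAT):
    """Mimic FORMAT substitution: replace $N with capture group N."""
    match = regex.search(source)
    if match is None:
        return None  # no match, so no fields would be written
    return re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), fmt)

if __name__ == "__main__":
    # Prints app::myapp env::test
    print(apply_format(r"c:\blah\something\myapp_test\logs\somelogfile.log"))
```

The greedy `^.*\\` backtracks until it finds a backslash that is followed by a `name_name\` folder, which is why this pattern does not depend on how deep the folder sits.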

I am still in doubt if @lguinn is right that a search time field is better.

Even though the field is extracted at index time, I still don't get results from the query "app=myapp". I have to run "index=* app=myapp", which is the same problem I have with the search-time field extraction...

Community Manager

Hi @lassel

This topic of search-time versus index-time extractions has been covered in the documentation and in Splunk Answers over the years. Here's a page from the documentation explaining this. I also ran a search, and these are just a few of the previous Answers posts that elaborate on the point brought up by @lguinn.

http://docs.splunk.com/Documentation/Splunk/6.2.2/Indexer/Indextimeversussearchtime
http://answers.splunk.com/answers/151939/how-do-index-and-search-time-field-extractions-differ-and-w...
http://answers.splunk.com/answers/57247/index-time-field-extraction.html
http://answers.splunk.com/answers/842/do-search-time-fields-have-performance-considerations.html#ans...
http://answers.splunk.com/answers/5817/search-time-versus-index-time-field-extractions.html

0 Karma

Communicator

Hello, I read through the docs and several answers yesterday.
Sure, I can see that in general Splunk recommends search-time extraction.

But in my case I concluded that index-time might be the correct answer.

http://answers.splunk.com/answers/842/do-search-time-fields-have-performance-considerations.html#ans...

If the field that you are trying to extract is a subset of a term. In other words, say you have the term ABC123456 in your event. And you want a field with the value 123456. In this scenario, this lookup can be very slow because you can't use indexed terms for the lookup (In order to actually search on this field, you have to set INDEXED_VALUE=false in the fields.conf file.) So if you use this field frequently for searching, an indexed field is your best option.

The app+env fields follow the pattern above. Also, app+env will almost never be a part of the actual event, so they won't be in the search terms without an index. That made me conclude that in my situation an index-time field might be the best choice.

As I wrote, I am open to arguments on why I am wrong in that conclusion. But please don't just give me a RTFM, because I already did.

0 Karma

Motivator

OK, here's the best argument I can come up with:

source is already an indexed field, so by adding these fields to the index, you're effectively making your indexing process bigger and slower without adding any new data to it, and you have made your setup less flexible, because any changes to those fields will require a big, painful re-index to occur.

You have an intuition that performing a regex on an indexed field at search time is slower than pulling the data out of the index directly. What everyone is telling you is that your intuition is misguided; the speed gain you think you're getting by doing the regex ahead of time is most likely counteracted by the overall additional slowness of the index with the new fields added to it. In the long run, it is most likely not worth it.

(As an aside: You almost always need to tell Splunk which index your fields occur in, because field extraction happens in the context of some index. It would be very inefficient for Splunk to search every index you have for fields that only occur in one of them, so it effectively doesn't let you, unless you tell it to do so using index=*. But you should always narrow your searches to the appropriate index if you know which one your data is in.)

Communicator

Thanks. I've gone with the search-time extraction, and that seems to work fine.

Regarding your side note. I have one local installation, where I only get data from main if I run
* | stats count by index

To get all indexes I must run
index=* | stats count by index

But on our corp. Splunk system (that I didn't configure), both queries give me the same result.
Have you got any idea what setting might be different?

0 Karma

Motivator

Hard to tell without knowing what you mean by "data from main". Also, stats count by index doesn't do any field extraction at all, so it's not quite comparing the same things.

0 Karma

Communicator

It turned out that the default indexes searched were different... So on one system I had to specify the index, and on the other all indexes were searched by default, so I didn't have to provide it in the query.

0 Karma

Communicator

I was able to get the search time extraction working, by adding:

props.conf:
[source::...]
EXTRACT-app,env = ^.*\\(?<app>[a-zA-Z0-9\-]+)_(?<env>[a-zA-Z]+)\\.+ in source

Now the fields are there, but they cannot be searched.
The query:
app=myapp env=test
yields nothing.

I need to add:
index=* app=myapp env=test
to get any results.

0 Karma

Communicator

Hmm... for some reason I get results from the simple query now. Closing this issue.

0 Karma

Communicator

Are search-time fields indexed the same way index-time fields are?
If my app and env fields are a part of most queries, it is important that they are indexed once, not discovered at every search.

It is pretty hard to tell from the documentation how the two types differ.
Once extracted what is the difference between search and index-time fields?

0 Karma

Legend

Search-time fields are extracted at search time. They are more efficient than index-time fields.

It is not a matter of "indexed once" - Splunk works differently than you think. There are only rare cases where an index-time field will be faster - in many years working with Splunk, I have yet to see one of these rare cases.

0 Karma

Splunk Employee

It could be just as simple as the REGEX not being formed properly. The regular expression in your example does not work... that is, unless the markup formatting messed something up.

I believe this may work better.

^[cC]\:\\\w+\\\w+\\([a-zA-Z0-9\-]+)_([a-zA-Z0-9\-]+)\\logs\\\w+\.log$
0 Karma

Communicator

My regex works inside Splunk with a rex field extraction. Perhaps the double-escaped backslashes are not required in transforms.conf?
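
A likely explanation, offered as an assumption rather than confirmed behavior: in a rex search the pattern sits inside a quoted search string, which consumes one level of backslash escaping, whereas REGEX in transforms.conf reaches the regex engine as written. If so, `\\\\` in transforms.conf demands two consecutive literal backslashes, which the path never contains. Separately, `[A-Z]+` would not match a lowercase env such as "test". The Python sketch below only illustrates those two regex effects against the sample path; the pattern labels are made up.

```python
import re

SOURCE = r"c:\blah\something\myapp_test\logs\somelogfile.log"

PATTERNS = {
    # As posted in the question: \\\\ requires two consecutive literal backslashes.
    "four_backslashes": r"^.*\\\\([a-zA-Z0-9-]+)_([A-Z]+)\\\\.*",
    # One escaping level removed, but [A-Z]+ still rejects lowercase "test".
    "uppercase_env": r"^.*\\([a-zA-Z0-9-]+)_([A-Z]+)\\.*",
    # One escaping level removed and env allowed to be lowercase.
    "fixed": r"^.*\\([a-zA-Z0-9-]+)_([a-zA-Z]+)\\.*",
}

def try_patterns(source=SOURCE):
    """Return the captured groups for each pattern, or None if it does not match."""
    results = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, source)
        results[name] = match.groups() if match else None
    return results

if __name__ == "__main__":
    for name, groups in try_patterns().items():
        print(name, groups)
```

Only the last variant captures `myapp` and `test`, which is consistent with the configuration that eventually worked earlier in this thread.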

0 Karma