Getting Data In

Is there a maximum number of transforms that can be applied?

davidatpinger
Path Finder

...and is there a performance issue with a large number?

For various reasons, I have a bunch of data whose field names depend on some of the values in the data, so I have something like 280 different transforms that I want to use. If I apply them to a single sourcetype with something like this in props.conf:

[foo]
REPORT-stuff = t1, t2, t3, t4, t5, t6, <...>, t280

will I end up having a bunch of trouble? I can probably set up an experiment to find out, but it'll take some work, so I'm hoping someone has already done something wacky like this and can tell me if I'm crazy.

Thanks!

1 Solution

davidatpinger
Path Finder

Just to close the loop on this, the final result was to have a sourcetype/index sorter such that each log line type gets its own sourcetype (and single regex for expansions), but all of them end up in the correct index.

This is the transform I use to set up the correct sourcetype (the third field is the key that describes the type of log line):

[api1-sourcetype]
DEST_KEY = MetaData:Sourcetype
REGEX = ^6,[^,]*,([^,]*),*
FORMAT = sourcetype::$1-api1

I then have a simple transform for getting to the correct index:

[api1-index]
SOURCE_KEY = MetaData:Sourcetype
DEST_KEY = _MetaData:Index
REGEX = .*-api1
FORMAT = api1
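These index-time transforms only run if props.conf points at them for the incoming sourcetype; order matters, since api1-index keys off the sourcetype that api1-sourcetype just rewrote. Something like this (the raw sourcetype name here is illustrative, not from my actual config):

[api1-raw]
TRANSFORMS-route = api1-sourcetype, api1-index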

Each of the *-api1 sourcetypes then has a single entry in props.conf that points at a matching regex in transforms.conf (these are all generated, as there are a couple hundred of each).
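As a concrete illustration of one generated pair, for a hypothetical event type named "login" with three values after the fixed fields (the stanza names, field names, and regex here are invented for illustration; the real stanzas are generated from the per-event key lists):

props.conf:

[login-api1]
REPORT-login = login-api1-fields

transforms.conf:

[login-api1-fields]
REGEX = ^6,[^,]*,[^,]*,([^,]*),([^,]*),([^,]*)
FORMAT = user::$1 action::$2 result::$3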

Thus far, it works great! Performance is very good.


jrodman
Splunk Employee

I can't find any limit in the codebase that would prevent this from working.

It doesn't seem very manageable, and sounds like it could have pretty significant performance impact. Usually, a need for this many extractions arises when an event stream has a large number of similar fields. If so, have you considered the repeat-match functions where a single regex can sometimes extract many fields?
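The repeat-match behavior referred to here comes from giving FORMAT the $1::$2 form in transforms.conf: when the regex captures both the field name and the value, Splunk applies it repeatedly across the event, so a single stanza can pull out arbitrarily many key/value pairs. A minimal sketch (stanza name invented), applicable only if the raw data actually carries key=value pairs:

[extract-all-kv]
REGEX = (\w+)=([^,]+)
FORMAT = $1::$2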

It's also a little awkward to have all of these in one REPORT line. You can do it, but are they all for one single purpose? The performance of multiple REPORT lines will be identical to one giant line, and it's hard to imagine a 280-step dependency graph of extractions.

davidatpinger
Path Finder

This answer was extremely helpful, but I'm going to unaccept it and substitute my write-up below, in the interest of clarity for any unfortunate soul in the same boat who's looking for help.

Thanks much for the discussion!


davidatpinger
Path Finder

More thinking out of the box: we currently have sourcetype and index closely coupled. What if, instead of a gazillion transforms (which is, in fact, not terribly performant), I use a TRANSFORMS at index time to set each of these slightly different lines to its own sourcetype (but still in a common index)? That would let me apply a single search-time transform per sourcetype. A search like "index=foo *" would still see all of the same lines, but each one would be subject to just its matching transform.

Outside of the obvious maintenance nightmare (which I can automate), I don't see a real downside to this idea. I'm assuming that for a given search, all of the lines are only subjected to a transform relevant to that sourcetype for that specific line.

Anything I'm missing here?


davidatpinger
Path Finder

A related question: given that this is a lot of regular expressions to process per line, is this a place where index-time extraction might be a good idea?


jrodman
Splunk Employee
Splunk Employee

It's a possibility. I'm assuming we're talking about "indexed fields" here, because the INDEXED_EXTRACTIONS feature doesn't have the necessary flexibility to handle this dataset. It moves the "problem" to parsing time, which could be better or worse depending upon many factors. If this data is a large portion of the incoming datastream, the rate of indexing could slow. It's harder to troubleshoot index-time transforms, and even harder to correct errors, because the data has already been written.

It will make the events in the journal significantly larger, which will slow search somewhat in its own way -- larger events means more I/O and more decompression for the same events.

The larger-events problem can sometimes be offset by retrieving fewer events. If there's a significant collision in the values present in these fields (i.e., many fields can have values like 0 and 1), then making them indexed will allow Splunk to retrieve a much smaller event set, so the performance could be significantly better.


davidatpinger
Path Finder

Yeah, I wasn't sold on the idea, for most of the reasons you describe. Worth thinking about, but....


davidatpinger
Path Finder

It's manageable because I have a script to generate all of the necessary transforms (and I can generate the REPORT line(s) easily, too). I may make it multiple REPORT lines just for ease of reading, but it sounds like it doesn't matter internally. There's not a lot of repetition in the patterns, sadly.

Here's the deal (in highly simplified terms). Our internal log format is a comma-separated list with the first ten fields fixed (and I have a single regex that decodes them). One of those fields is an event type, and each event then has (in one of the fixed fields) a semicolon-separated list of keys. The values for those keys are then listed out in the remainder of the line. We do some pre-processing before sending these lines to Splunk to remove some less useful data and to suppress the long list of keys (since they are well known per event). So, Splunk just sees a list of values.

I want to provide field names for everything, so I have a transform per event type. Let's simplify the lines down to one fixed field (the event type) and the list of values. Then lines might look something like this (where the letters in the values really represent unique field names):

event,value1,value2,...
1,A,B
2,A,C,D,E
3,F,C,G,H,A,J
4,W

This is reasonably realistic in the sense that there aren't many repeated patterns after the initial set of fixed fields. I'm actually already doing this at a smaller scale and want to scale it up to all of the possible log lines. Sounds like it may have a perf impact. Hmm. Well, we'll give it a try and see!
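Since the per-event-type field lists are well known, the transform generation described here is mechanical. A minimal sketch of such a generator (the schema, stanza names, and naming convention are assumptions for illustration; the actual script isn't shown in the thread):

```python
# Hypothetical schema: event type id -> ordered list of field names,
# mirroring the simplified sample lines above.
SCHEMA = {
    "1": ["A", "B"],
    "2": ["A", "C", "D", "E"],
}

def make_transform(event_type, fields):
    """Emit a transforms.conf stanza for one event type.

    The regex anchors on the event-type id in the first column, then
    captures one comma-delimited value per known field name.
    """
    captures = ",".join("([^,]*)" for _ in fields)
    fmt = " ".join(f"{name}::${i}" for i, name in enumerate(fields, start=1))
    return (
        f"[type{event_type}-fields]\n"
        f"REGEX = ^{event_type},{captures}\n"
        f"FORMAT = {fmt}\n"
    )

def make_report_line(types):
    """Emit the props.conf REPORT line referencing every generated stanza."""
    return "REPORT-stuff = " + ", ".join(f"type{t}-fields" for t in sorted(types))

if __name__ == "__main__":
    for t, fields in SCHEMA.items():
        print(make_transform(t, fields))
    print(make_report_line(SCHEMA))
```

The output of the generator is then dropped into transforms.conf, with the single REPORT line going into the props.conf stanza for the sourcetype.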


jrodman
Splunk Employee
Splunk Employee

Got it; many types of events with different extractions, all of which are syntactically similar. Splunk's support for handling this much implicit information is less than stellar. If I had my way, we'd have shipped some programmatic extensions to the parsing pipeline that you could use to make indexed extractions fully handle this dataset. I'm not getting my way, though.

Splunk does a good job of telling you if any particular extraction is very slow, but it won't do a good job of telling you exactly what the costs are for a large set of extractions, none of which are that slow. Despite this, you can get an idea with a large dataset and the job inspector.

If you're willing to pay some cost per event for this data, that's maybe fine. It gets a little messy if you're triggering those costs when you don't want to, such as when the data cohabitates in an index with other commonly used data and users aren't categorically excluding these events from their searches. The cost in 'fast mode', where none of the fields for this data are wanted, should not be significant, but for verbose mode it could become unfortunate. Again, possibly all obvious.

Usually for this type of complex scenario, I'd challenge whether it might be worth changing the data or preprocessing it into a self-describing format, but you've probably already rejected that, based on what you've said so far.


davidatpinger
Path Finder

Correct. The full format is a set of fixed fields, the last of which is a description which includes a dynamically sized list of keys, followed by the values for those keys as additional fields. It's...difficult for Splunk to handle natively. I initially was pre-processing the data to unwrap the key-value pairs, with a single regex to handle the initial fixed fields. The problem is that some of these event types are very common, and the keys/field names can be of reasonable size. One of our most frequent loglines, for example, has 500 bytes of keys. At 400/second, that's a lot of volume (i.e., license cost). So I have transforms for the most common of our loglines already in use, but we'd like to expand this to all loglines for economy of license usage and for consistency in implementation - hence this question. (Gotta pay a cost in one way or another!)

The vast majority of our data is in this scheme, unfortunately. Our total scale isn't that large, though, so maybe it'll be acceptable. We're currently IO limited, with little real CPU impact. We're in the process of moving to new indexer hardware, so I'm hopeful this won't completely hose our performance. Oh, and each of these sourcetypes that has this issue has all of its data in dedicated indexes, so maybe the small amount of other data won't be impacted.


lguinn2
Legend

I haven't done this, so I can't say if it is crazy or not. BUT - every transform is a regular expression that must be evaluated against every event (that is selected by the props.conf stanza, usually a sourcetype).
That might create congestion on your indexer if it causes sufficient parsing overhead.
I would do some testing and use the Distributed Management Console to see if there is any noticeable slow down.


somesoni2
Revered Legend

+1
If the sourcetype you're configuring this for has heavy volume, you'll be putting load on the Indexer/Heavy Forwarder. I would try to reduce the number. Having a separate transform for each field is easy to write but could be inefficient. Merging multiple extractions may be a little more complex, but it would be easier on the Indexer/HF.
