I have the following types of events in FIX format. This is what they look like in vi or emacs:
For the sake of simplicity, I have discarded the rest of the FIX message for this example. Note the ^A used as the delimiter between "fields".
After indexing the data in Splunk, the ^A becomes hex \x1 within Splunk Web and Splunk CLI.
My props.conf looks like this:
[FIX]
SHOULD_LINEMERGE = false
KV_MODE = none
REPORT-all = get_all_fields
My transforms.conf looks like this:
[get_all_fields]
DELIMS="\\x1"
FIELDS = "a", "b", "c", "d"
I have tried \\x1, \x1, and \\x01. None of them extract the 4 "fields" in the example. What should the hex value be for DELIMS to properly break the fields? Is there a limitation where DELIMS can only take one character? I also tried using "\\", but that did not create any field extraction.
Splunk 6, FIX 4.2
Another approach is to use the key value pair extractions defined in transforms.conf.
The short of it is that it uses a negative lookahead so that a value never matches across \x01.
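To register this extraction, add the REPORT line below to your sourcetype's stanza in props.conf, and the REGEX and FORMAT lines to a [fixkv] stanza in transforms.conf: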
REPORT-fields = fixkv
REGEX = (\d+)=((?:(?!\x01).)+)
FORMAT = $1::$2
Hope this helps someone. If anyone has suggestions on how to make this one more efficient, please feel free to add.
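For anyone who wants to check the pattern outside Splunk, here is a minimal Python sketch. The sample message is made up for illustration (only 8=FIX.4.4 and 50=FXSpot appear in this thread), and Python's re module is only an approximation of the regex engine Splunk uses. It also tries one possible efficiency tweak: a negated character class, [^\x01]+, which should capture the same values without running a lookahead at every character. The one difference I'm aware of is that the class also matches newlines, while "." does not by default.

```python
import re

SOH = "\x01"  # the ^A byte that separates FIX fields

# Hypothetical sample; tag values other than 8 and 50 are invented.
sample = SOH.join(["8=FIX.4.4", "9=112", "35=D", "50=FXSpot"]) + SOH

# Pattern from the post: the value is every character up to the next SOH.
lookahead = re.compile(r"(\d+)=((?:(?!\x01).)+)")
# Candidate simplification: a negated character class instead of a lookahead.
negated = re.compile(r"(\d+)=([^\x01]+)")


def extract(pattern, text):
    """Return {tag: value}, roughly what FORMAT = $1::$2 produces."""
    return {m.group(1): m.group(2) for m in pattern.finditer(text)}


fields = extract(lookahead, sample)
fields_negated = extract(negated, sample)
print(fields)
print(fields == fields_negated)
```

If both patterns give the same dictionary on your own events, the negated class is probably a safe swap in the REGEX line.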
If you're just trying to substitute the SOH character, I was FINALLY able to do it after spending a ton of time, and it's a very simple solution. I may be reiterating what Lowell said, but hopefully this example saves a ton of time for someone else. Additionally, the solution handles it at index time rather than at search time, which makes the events easier to read for users who don't realize there's a SOH delimiter to deal with:
Edit $SPLUNK_HOME/etc/system/local/props.conf (on the indexer box if your search head and indexer are 2 different boxes) and add the following:
[myfixsourcetype]
SEDCMD-stripsoh = s/\x01/ /g
Then restart Splunk. Now any NEW FIX data will have the SOH character replaced with a space character. This will NOT affect existing, indexed FIX data in Splunk already.
Note: Of course, "myfixsourcetype" needs to be replaced with the actual sourcetype name that your FIX data is coming in as; otherwise Splunk has no way of identifying the data to apply the sed command to. See the props.conf spec for other data identifiers you can use (e.g. host or source).
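To preview what that substitution does to an event before touching the indexer, a small Python sketch like the following may help. The sample event is invented for illustration, and re.sub here only imitates the sed expression; it is not how Splunk applies SEDCMD internally.

```python
import re

SOH = "\x01"  # the ^A byte

# Hypothetical raw event; only 8=FIX.4.4 and 50=FXSpot come from this thread.
raw = SOH.join(["8=FIX.4.4", "9=112", "35=D", "50=FXSpot"]) + SOH

# Same idea as s/\x01/ /g: every SOH byte becomes a single space.
cleaned = re.sub(r"\x01", " ", raw)
print(repr(cleaned))
```

Note that the trailing SOH also becomes a trailing space, so the cleaned event ends with a blank.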
FYI - I'm running Splunk on a RedHat Linux box.
FIX protocol field delimiter
Yes, Splunk will replace unprintable characters with their C-style hex notation before indexing. That can be quite annoying, but then again, so is trying to search for unprintable characters. If you're curious, you can see a table of these conversions on the Wikipedia ASCII page; search down the page for the "Start of Heading" character.
It seems like you have a fields-inside-a-field thing going on here, right?
You have fields delimited by a pipe (|), and then the 8th field (at least in your given example) has an additional set of delimited fields. I'm not sure how Splunk handles that exactly. If you simply set your delimiter to (\x1), then your first field would contain M|219620|0|i|I|20100506-16:15:53.443|463|8=FIX.4.4, when you probably only want it to contain 8=FIX.4.4. So simply getting your delimiter set properly isn't going to fully work.
I'm guessing it would make the most sense to extract the outer set of fields first using DELIMS="|", and then set up a secondary field extraction to pull out your embedded fields.
So, perhaps you would end up with something like this:
[FIX]
SHOULD_LINEMERGE = false
KV_MODE = none
REPORT-outer_fields = get_outer_fields, get_inner_fields

[get_outer_fields]
DELIMS="|"
FIELDS = "f1", "f2", "f3", "f4", "f5", "_f6", "f7", "inner_fields"

[get_inner_fields]
REGEX = (?:^|\\x1) (?<a>.+)\\x1(?<b>.+)\\x1(?<c>.+)\\x1(?<d>.+)$
SOURCE_KEY = inner_fields
I think this should work. This does seem like a complicated scenario.
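The two-stage idea can be sketched in Python to see why one delimiter is not enough. This is only an illustration of the logic, not of what Splunk does internally: the outer prefix is taken from the example above, while the inner fields after 8=FIX.4.4 are invented, and the sketch assumes the event contains the real SOH byte.

```python
SOH = "\x01"  # the ^A byte

# Outer prefix is from the thread; the inner fields after 8=FIX.4.4 are invented.
event = "M|219620|0|i|I|20100506-16:15:53.443|463|" + SOH.join(
    ["8=FIX.4.4", "9=112", "35=D", "50=FXSpot"]
)

# Splitting on SOH alone leaves the pipe-delimited prefix glued to the first field.
naive = event.split(SOH)

# Stage 1: split the outer record on "|" into 8 fields; the last holds the FIX payload.
outer = event.split("|", 7)
# Stage 2: split only the eighth field on SOH.
inner = outer[7].split(SOH)

print(naive[0])
print(inner)
```

Here naive[0] still carries the whole "M|219620|..." prefix, while inner starts cleanly at 8=FIX.4.4.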
If the number of subfields is not constant (4), then you could use a multi-value field extraction like this. (That regex should work; it took me a few tries, but it seems to be the best solution I could come up with.)
[get_inner_fields]
REGEX= (?=^|\\x1)(?:\\x1)?(?<my_fields>.+?)(?:\\x1)?(?=$|\\x1)
SOURCE_KEY = inner_fields
MV_ADD = True
Another possible option (and I don't know the FIX format at all, so this may not work): if the 8 in 8=FIX.4.4 means something like 'fix_version_number', you could just write a bunch of extracts that use the leading number to map to different field names. So, for example, for "8" you could add something like this to your props file:
EXTRACT-fix_field_8 = (?:\||\\x1|^)8=(?<fix_version_number>.*?)(?:\||\\x1|$)
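A rough Python check of that per-tag pattern is sketched below. It assumes the event holds the real SOH byte (so the pattern uses \x01 rather than the escaped \\x1), uses Python's (?P<name>...) spelling for the named group, and relies on an invented sample event apart from the prefix and 8=FIX.4.4.

```python
import re

SOH = "\x01"  # the ^A byte

# Python spells named groups (?P<name>...); the pattern is otherwise the same shape.
pattern = re.compile(r"(?:\||\x01|^)8=(?P<fix_version_number>.*?)(?:\||\x01|$)")

# Prefix and 8=FIX.4.4 are from the thread; the remaining fields are invented.
event = "M|219620|0|i|I|20100506-16:15:53.443|463|" + SOH.join(
    ["8=FIX.4.4", "9=112", "35=D", "108=30"]
)

match = pattern.search(event)
version = match.group("fix_version_number") if match else None
print(version)

# The leading delimiter alternation keeps tag 108 from being mistaken for tag 8.
no_match = pattern.search("35=D" + SOH + "108=30")
print(no_match)
```

The second search returning nothing is the reason for anchoring the tag number on a delimiter or the start of the event.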
Another thought (which may make all of the above options simpler) would be to add a SEDCMD to your sourcetype to change all of the ^A characters into something more useful at index time. Maybe something like a comma? (You would probably want to find a character or sequence of characters not already being used in your events.)
Using a punctuation character like a comma also has the advantage of improving the way terms are segmented in your index, which will let you search on more of these embedded fields more efficiently. For example, in your example event, you can search for "8=FIX.4.4", but you can't search for "50=FXSpot", because it would be stored in the index as "150=FXSpot"; you would have to search with "*50=FXSpot" instead. Using a better punctuation character works around this problem.
One more option. Email Glenn and take a look at a custom search command he is using to handle FIX log processing. See his post here:
Thanks Glenn. It's certainly possible to get a custom search script to add fields (it's pretty easy from a pure programmatic perspective), but you're right in saying that kv could be used after translatefix. Thanks again for jumping in. 😉
I haven't yet managed to upload my "translatefix" custom command as an add-on to Splunkbase, but I have sent the useful contents directly to ndoshi. Hopefully it will do the trick: it should replace all \x01 with a space and also translate a large number of FIX encoded fields and values into plain English. What it won't do is actually extract any fields in Splunk land. If that is required, I guess you'll need to pipe your results to translatefix, and then pipe this to rex (or similar).
Unfortunately, I can't change the log entry at index time to substitute the delimiter with something more manageable. Bob Fox provided me an answer to place in props.conf. Use:
EXTRACT-myfields = (?.)\x01(?.)\x01(?
This means the ^A character will be represented by \x01. It seems as if the rule is to use \xnn for any HEX character where nn represents the HEX code.
No, you were clear on that point. What I'm trying to tell you is that your issue where Splunk will not accept the literal "\x1" as a delimiter doesn't really matter, because it will not work the way you want anyway (based on the second group of sample events you provided). Using a delimiter-based field extraction only works if your entire event is delimited by the same character, which is not the case for your events. Try adding SEDCMD-<class> = s/\x01/;/g to your props, feed some events in, then set DELIMS=";" and see what I mean. Your first field (a) will contain leading junk in its value.
I should have been more clear. The original example log is not what you can use for your regex as it contains unprintable characters (^A) that Splunk turns into \x1 in the index.
What I really wanted to do was figure out how to use \x1 as a delimiter. It may be that DELIMS can only have 1 character so that would not work. I tried the following:
EXTRACT-myfields = (?.)\x1(?.)\x1(?
This works fine with online regex testers (also used +), but it does not work here. This is an unprintable character that needs to be in the regex. I do not know what that is for ^A.