Archive

Can you help me with my custom dynamic field extraction?

New Member

Hi, looking for some help on this one. I have multi-line events that I'm trying to create dynamically named fields from.

Here is the "format" of these events...

"01","180905180210113",""
"02","ABCXYZ","2018-09-05","18:18:56"
"03","0","0","0","0","0","1","1"
"04","Batch_Type",""
"05","2018-09-05","18:18:56","2018-09-05","18:18:57","ModuleA","64","HOST","0","","0","0","0","0","0","0"
"05","2018-09-05","18:18:58","2018-09-05","18:18:59","ModuleB","64","HOST","0","","0","0","0","0","0","0"
"05","2018-09-05","18:19:31","2018-09-05","18:19:33","ModuleC","64","HOST","0","","0","0","0","0","0","0"

I've already created fields for the first 4 lines without issue. For the purpose of this question, I'm only interested in lines that start with "05" and to break the data down here. It's the 05 followed by start date and time, finish date and time, modulename, status, host, and unimportant garbage at the end...

Field-ized, ideally, it would be like this:

 "05","ModuleAStartDate","ModuleAStartTime","ModuleAFinishDate","ModuleAFinishTime","NameOfModule","ModuleAStatusCode","ModuleAHost",...
        "05","ModuleBStartDate","ModuleBStartTime","ModuleBFinishDate","ModuleBFinishTime","NameOfModule","ModuleBStatusCode","ModuleBHost",...
        "05","ModuleCStartDate","ModuleCStartTime","ModuleCFinishDate","ModuleCFinishTime","NameOfModule","ModuleCStatusCode","ModuleCHost",...

There are a couple problems here...

There is NOT always a set number of Modules for each event. There could be 1, there could be 30, that number is a moving target. The module name does have at least have a limited array of possibilities. (No more than 10)

They are not always in the "same" order. TYPICALLY events do start at one module, but the "path" it takes after that can be almost anything. (or even nothing after that)

The next problem is that the SAME module can exist in the same event, it would read nearly identical except for the times. I suppose we could just go with the last occurrence of it, or even more accurately the one with a specific status number. "64" represents "Success" in this case, but really all events are somewhat important.

The last problem is that this is in Splunk Cloud, so I don't think we have direct access to .conf files, which is what I see a lot of users doing to accomplish something similar...? I'm just an ordinary user, so I wouldn't know anyway, but if that's true, then we would have to do this with just rex, yeah?

So yeah, that's the sum of what I am hoping to accomplish, any assistance would be greatly appreciated here!

If all of that is simply not possible or far too complex, then maybe someone could at least tell me how to pull a "previous" comma separated value from the same line containing a search match, for example, I can search for \"ModuleA\",\"64\",\"HOST\" and already find "successful" processes of ModuleA, but I'm more interested in the preceding date/time data to do some calculations off of or other reporting, how can I grab those fields?

0 Karma

New Member

I've used the following assumptions:
- "01","180905180210113","" --> the event ID
- If there are "05" rows with the same module name, keep them in a multivalue field for further processing (ie: if mvcount>1 then .....)

|makeresults |eval _raw =
"01,180905180210113,0
02,ABCXYZ,2018-09-05,18:18:56
03,0,0,0,0,0,1,1
04,Batch_Type,0
05,2018-09-05,18:18:56,2018-09-05,18:18:57,ModuleA,99,HOST
05,2018-09-05,18:18:58,2018-09-05,18:18:59,ModuleB,64,HOST
05,2018-09-05,18:19:31,2018-09-05,18:19:33,ModuleC,64,HOST
05,2018-09-05,18:19:51,2018-09-05,18:19:53,ModuleD,64,HOST
05,2018-09-05,18:19:55,2018-09-05,18:19:58,ModuleA,64,HOST"
`comment("REMOVE the above, it's just to create a dummy event, and fit your base search")`
| rex field=_raw "(?<Module05>05\,\d{4}-\d{2}-\d{2}.*)" max_match=10
| rex field=_raw "01,(?<ID>\d{15}),0"
| fields - _raw _time
| mvexpand Module05
| rex field=Module05 "Module(?<ModuleIdentifier>.)\," max_match=10
| rex field=Module05 "05,(?<StartDate>\d{4}-\d{2}-\d{2}),(?<StartTime>\d{2}:\d{2}:\d{2}),(?<FinishDate>\d{4}-\d{2}-\d{2}),(?<FinishTime>\d{2}:\d{2}:\d{2}),(?<NameOfModule>Module.),(?<StatusCode>\d{2}),(?<Host>.*)"
| fields - Module05
| eval Module{ModuleIdentifier}StartDate = StartDate 
| eval Module{ModuleIdentifier}StartTime = StartTime
| eval Module{ModuleIdentifier}FinishDate = FinishDate
| eval Module{ModuleIdentifier}FinishTime = FinishTime
| eval Module{ModuleIdentifier}StatusCode = StatusCode
| eval Module{ModuleIdentifier}Host = Host
| fields Module* ID
| fields - ModuleIdentifier
| stats values(*) as * by ID

Does it answer you questions?

0 Karma

New Member

I'm sorry for not explaining that field. This data is rather complex. It is an identifier, in that it is generally unique. However, it is not, always a number, it can be any length, it can contain numbers, letters, characters, etc.

Let me expand on what these events represent. This is a log of a document flowing through an document processing application. The first four lines contain the unique information about the document... the name, description, type of document, page count, originating host. Things like that. Those are all easily identifiable and always in the same spot, and I have extracted fields for each.

The "05" lines represent the actual flow of the document as it is processed by the different modules of the application, the date/timestamp of start and finish of the module, the module name, the result of that process, and an error message, if any occurred. This is a flow though, time generally the time modules have different names of course, but the format of these "05" lines is always the same.

But the thing is that these document processes sometimes require manual intervention in the application. This causes repeats of the same modules in the log as the user has the ability to fix certain attribute and re-submit the document for normal automatic processing.

Example would be a document that goes to the "Export" module and then failed that process for whatever reason, this would result in subsequent modules like "Batch Manager" or "Validation" and then again seeing "Export" (hopefully with a success code) down the line.

The goal here is to be able to report on metrics like, how long certain modules take, how often they "fail" and taking it further there, how often they fail but are resubmitted successfully, and ultimately how often they fail and are never able to process successfully. This gets even more complicated as sometimes instead of showing the event with a successful submission, it simply ends and is then a NEW event entry, but with the same time stamp, just with a Document name with an attached "~0" on the end, but this one shows a successful process...

So far, what I've done is create field extractions based on these unique modules, but my problem is that it only captures the first instance of that module type.

I didn't know about multivalue fields. I'm fairly new to Splunk, so I'll do some digging on that.

You've helped a bunch, albeit not exactly what I'm looking for, but honestly I don't expect anyone to get this exactly right with all the moving targets here.

Thanks!

0 Karma

New Member

Hi,
I'll try to help but you need to be more specific on the output; have you copied/pasted the qry into Splunk? Also:

  • don't focus on my "ID" assumption, it really doesn't matter nor change the results. You will, however, need an ID to identify which row belongs to which "master" event. Also, grouping by the document ID will solve the other problem (first event fails without notice and a new event, successfull, is created later).

  • the query I sent you above, doesn't care if there are 2 or 2000 ModuleX rows, nor if there are multiple rows with the same ModuleName). If you look at the dummy event in the query, I have added two ModuleA rows with different times and StatusCode. Once you have all fields separated and identified, you can perform as many calculations as you want; in the below qry, as example, I have added the duration field for each ModuleX row.

    | makeresults 
    | eval _raw =
        "01,180905180210113,0
     02,ABCXYZ,2018-09-05,18:18:56
     03,0,0,0,0,0,1,1
     04,Batch_Type,0
     05,2018-09-05,18:18:56,2018-09-05,18:18:57,ModuleA,99,HOST
     05,2018-09-05,18:18:58,2018-09-05,18:18:59,ModuleB,64,HOST
     05,2018-09-05,18:19:31,2018-09-05,18:19:33,ModuleC,64,HOST
     05,2018-09-05,18:19:51,2018-09-05,18:19:53,ModuleD,64,HOST
     05,2018-09-05,18:19:55,2018-09-05,18:19:58,ModuleA,64,HOST"
        `comment("REMOVE the above, it's just to create a dummy event, and fit your base search")` 
    | rex field=_raw "(?<Module05>05\,\d{4}-\d{2}-\d{2}.*)" max_match=10 
    | rex field=_raw "01,(?<ID>\d{15}),0" 
    | fields - _raw _time 
    | mvexpand Module05 
    | rex field=Module05 "Module(?<ModuleIdentifier>.)\," max_match=10 
    | rex field=Module05 "05,(?<StartDate>\d{4}-\d{2}-\d{2}),(?<StartTime>\d{2}:\d{2}:\d{2}),(?<FinishDate>\d{4}-\d{2}-\d{2}),(?<FinishTime>\d{2}:\d{2}:\d{2}),(?<NameOfModule>Module.),(?<StatusCode>\d{2}),(?<Host>.*)" 
    | fields - Module05 
    | eval Module{ModuleIdentifier}StartDate = StartDate 
    | eval Module{ModuleIdentifier}StartTime = StartTime
    | eval Module{ModuleIdentifier}FinishDate = FinishDate 
    | eval Module{ModuleIdentifier}FinishTime = FinishTime
    | eval Module{ModuleIdentifier}Duration = strptime(FinishTime, "%H:%M:%S") - strptime(StartTime, "%H:%M:%S") 
    | eval Module{ModuleIdentifier}StatusCode = StatusCode 
    | eval Module{ModuleIdentifier}Host = Host 
    | fields Module* ID 
    | fields - ModuleIdentifier 
    | stats values(*) as * by ID
    
  • By using multivalued fields, you are able to group all events related to the same ModuleX and then process them individually (ie: row by row) or as group (sum, ie).

Note: the query can be optimized; I'm keeping it simple (aka more steps than needed) to allow an easier reading for you

PaoloR

0 Karma

New Member

The query doesn't give me anything and that's likely because it's searching for the word "Module", that's not actually part of the events. It also doesn't seem to account for the double quotes... I only used the ModuleA/B/C names just to show that they are different. The Module names are completely different from each other. Let me give you a better example.

This is a direct copy/paste of an event. (I did change the email address and server names for security reasons.)

"01","[emailaddress@email.com]-[9/20/2018 9:26:19 AM]",""
"02","SERVER05APWPD","2018-09-20","09:26:19"
"03","0","0","0","0","0","1","0"
"04","Email Import",""
"05","2018-09-20","09:26:19","2018-09-20","09:26:21","Scan","64","SERVER05APWPD","0","","0","0","0","0","0","0"
"05","2018-09-20","09:26:22","2018-09-20","09:26:30","KTM Server","64","SERVER14APWPD","0","","0","0","0","0","0","0"
"05","2018-09-20","09:26:31","2018-09-20","09:26:40","KTM Server 2","64","SERVER13APWPD","0","","0","0","0","0","0","0"
"05","2018-09-20","10:19:12","2018-09-20","10:19:31","KTM Validation","8","CITRIXSERVER:Sess 3","0","","0","0","0","0","0","0"
"05","2018-09-20","10:19:33","2018-09-20","10:19:40","KTM Validation","64","CITRIXSERVER:Sess 3","0","","0","0","0","0","0","0"
"05","2018-09-20","10:19:41","2018-09-20","10:19:41","Export","32","SERVER03APWPD","0","Export Error: [Could not identify document.]","0","0","0","0","0","0"
"05","2018-09-20","10:20:56","2018-09-20","10:20:58","Batch Manager","2","SERVER18APWPD:Sess 9","0","","0","0","0","0","0","0"
"05","2018-09-20","10:21:07","2018-09-20","10:21:17","Scan","64","SERVER18APWPD:Sess 9","0","","0","0","0","0","0","0"
"05","2018-09-20","10:21:18","2018-09-20","10:21:29","KTM Server","64","SERVER15APWPD","0","","0","0","0","0","0","0"
"05","2018-09-20","10:21:29","2018-09-20","10:21:37","KTM Server 2","64","SERVER13APWPD","0","","0","0","0","0","0","0"
"05","2018-09-20","10:21:46","2018-09-20","10:24:45","KTM Validation","64","CITRIXSERVER:Sess 6","0","","0","0","0","0","0","0"

I think what you gave me would likely work if adapted to find the events correctly, I would have to change several pieces of the query here to adapt it. such as setup each search with escape characters to pick up the double quotes as well.

However, the fact that there are spaces in the module names will likely throw that off, probably needing to be replaced by underscores or just eliminated altogether.

What I've done so far is to do field extraction based on just the first instance. (In most cases, there is only one, so this works fairly well, but ends up being inaccurate as there are some times that have more than one entry...)

Example extraction here:

"05","(?P<ExportStartDate>[^"]+)","(?P<ExportStartTime>[^"]+)","(?P<ExportFinishDate>[^"]+)","(?P<ExportFinishTime>[^"]+)","Export","(?P<ExportResult>[^"]+)","(?P<ExportHost>[^"]+)","\d+","(?P<ExportError>[^"]*)"

I created the same as above for each Module type. (Replacing "Export" with the corresponding Module name)

This seems to work fairly well, but these, again, only capture the first instance of these results. With knowing that, I'm working through this and determining whether or not that will be "enough" to get what I'm being asked for out of this data.

I took it a little further and extracted even more fields by using the status codes, since I know that the Result Code generally dictates whether or not I'll see duplicated runs through other modules, I can extract fields and name them similarly just with the word "Error" on them based on seeing that "32" Result Code. The use case where it would be a problem is if I have to report on time spent on a Module that ended up being repeated in a single event. Example is like the above event, you see it went through "KTM Server" multiple times, it didn't get an error on them (as they have a result of 64 each time), but this is where using those fields I extracted starts giving inaccurate data, as if I were to do a start to finish query on that, it wouldn't be capturing everything properly.

I'm still getting the requirements panned out as to what I'll be asked to provide, right now I'm just proof of concepting.

I'm playing around with it still, but these queries, I'm seeing quickly, can get very complex.

0 Karma

New Member

With a real event, it's easier!

Let's proceed by steps:

index="main" sourcetype="event"
| rex max_match=100 field=_raw "\"05\"\,\"(?<Module_StartDate>\d{4}-\d{2}-\d{2})\",\"(?<Module_StartTime>\d{2}:\d{2}:\d{2})\",\"(?<Module_EndDate>\d{4}-\d{2}-\d{2})\",\"(?<Module_EndTime>\d{2}:\d{2}:\d{2})\",\"(?<Module_Name>.*?)\",\"(?<Module_ExitCode>.\d{1,3})\",\"(?<Module_Host>.*?)\","
| stats values(*) as * by Module_Name

The above rex parse each 05 row, regardless of spaces, quotes etc and break into fields. The stats command is just an example to output a result (it's wrong! just to check field extraction). Starting from here, what output will work for you?

PaoloR

0 Karma

New Member

I think because of the sheer volume and my already extracted fields, but it's blowing up my browser, but I did a much more time limited view and got what this is doing. However, when I tag on a | search Module_Name=Export AND Module_ExitCode=32 I'm not getting expected results. What I am hoping to get out of that would be only events that had an ExitCode of 32 on the same 05 row that had Export as the module. But it also returns events that some other modules that have exit Code 32. Example result:

"01","[emailaddress@email.com]-[9/19/2018 11:15:56 AM]",""
"02","SERVER05APWPD","2018-09-19","11:15:56"
"03","0","0","0","0","0","1","2"
"04","Email Import",""
"05","2018-09-19","11:15:56","2018-09-19","11:15:57","Scan","64","SERVER05APWPD","0","","0","0","0","0","0","0"
"05","2018-09-19","11:15:59","2018-09-19","11:16:00","KTM Server","32","SERVER14APWPD","200","[snip Long Error Message]","0","0","0","0","0","0"
"05","2018-09-19","11:29:48","2018-09-19","11:29:50","Batch Manager","2","SERVER18APWPD:Sess 9","0","","0","0","0","0","0","0"
"05","2018-09-19","11:29:54","2018-09-19","11:30:07","Scan","64","SERVER18APWPD:Sess 9","0","","0","0","0","0","0","0"
"05","2018-09-19","11:30:08","2018-09-19","11:30:19","KTM Server","64","SERVER14APWPD","0","","0","0","0","0","0","0"
"05","2018-09-19","11:30:19","2018-09-19","11:30:30","KTM Server 2","64","SERVER15APWPD","0","","0","0","0","0","0","0"
"05","2018-09-19","11:30:41","2018-09-19","11:30:43","Export","64","SERVER03APWPD","0","","0","0","0","0","0","0"

There's no Export line with an exit code of 32 in that event, but it matches because it has those two "somewhere" in the entirety of the event.

Seems like the same difference as these two:
index=main \"Export\" AND \"32\"
and
index=main \"Export\"\,\"32\"

0 Karma

New Member

Splunk is a wonderful tool but...it needs some study! Your task isn't very complex but I strongly suggest you to study some documentation or:
- case 1: you will cycle inside this query forever
- case 2: you will not achieve the result
- case 3: both 🙂

Back to the qry:

  • Be sure you're indexing your event correctly --> it starts with "01" and ends with n "05"; if, for example, you've indexed with a newline as linebreak, you can't process it in the right way. In your event, the linebreak is the "01"

  • If you don't need all the steps (01, 02, and so on) and you have a large dataset, a good approach is to create a summary index with one line event (ie: Module started at, ended at, duration and exit code); you can (read should!) put the rex extraction inside a props and perform it at index time or search time

  • you will need an unique ID per event; it's useful for later calculations. If your event doesn't have an ID, you can create it yourself using random function

  • Searching for Export AND 32 doesn't work because, with qry I sent you, you end with a multivalue field (one row for each event, all event steps inside the same row). Use this

    index="main" sourcetype="event"
    | rex max_match=100 field=_raw "\"05\"\,\"(?\d{4}-\d{2}-\d{2})\",\"(?\d{2}:\d{2}:\d{2})\",\"(?\d{4}-\d{2}-\d{2})\",\"(?\d{2}:\d{2}:\d{2})\",\"(?.?)\",\"(?.\d{1,3})\",\"(?.?)\","
    | fields Module*
    | table Module*

and look how the event steps are grouped togheter; starting from there, you have multiple options (have a look at the docs, especially for handling multivalued fields); some random examples

Calculate the overall StartDate and EndDate + the last exit code

index="main" sourcetype="event" 
| rex max_match=100 field=_raw "\"05\"\,\"(?<Module_StartDate>\d{4}-\d{2}-\d{2})\",\"(?<Module_StartTime>\d{2}:\d{2}:\d{2})\",\"(?<Module_EndDate>\d{4}-\d{2}-\d{2})\",\"(?<Module_EndTime>\d{2}:\d{2}:\d{2})\",\"(?<Module_Name>.*?)\",\"(?<Module_ExitCode>.\d{1,3})\",\"(?<Module_Host>.*?)\"," 
| fields Module*
| eval Task_Start = mvindex(Module_StartDate,0) . " " . mvindex(Module_StartTime,0)
| eval Task_End = mvindex(Module_EndDate,-1) . " " . mvindex(Module_EndTime,-1)
| eval Task_ExitCode = mvindex(Module_ExitCode,-1) 
| table Task_Start, Task_End Task_ExitCode

Same as above but with an uniqueID and a distict of all ExitCodes

index="main" sourcetype="event" 
| rex max_match=100 field=_raw "\"05\"\,\"(?<Module_StartDate>\d{4}-\d{2}-\d{2})\",\"(?<Module_StartTime>\d{2}:\d{2}:\d{2})\",\"(?<Module_EndDate>\d{4}-\d{2}-\d{2})\",\"(?<Module_EndTime>\d{2}:\d{2}:\d{2})\",\"(?<Module_Name>.*?)\",\"(?<Module_ExitCode>.\d{1,3})\",\"(?<Module_Host>.*?)\"," 
| fields Module*
| eval uniqueID = random()
| eval Task_Start = mvindex(Module_StartDate,0) . " " . mvindex(Module_StartTime,0)
| eval Task_End = mvindex(Module_EndDate,-1) . " " . mvindex(Module_EndTime,-1)
| eval Task_ExitCode = mvindex(Module_ExitCode,-1) 
| stats values(Task_Start), values(Task_End) values(Module_ExitCode) by uniqueID
0 Karma