I am a Splunk newbie, so I may need further clarification of things that seem obvious.
What I have:
An advanced medical imaging system of systems that produces a global output log in a specific format (example given later).
I apply a repetitive task to this system. Example: start up until all statuses are reported, issue a shutdown, and repeat. This goes on for days without operator intervention. (There are many other tests I run, but this is the one I am using to try out Splunk.)
What I am trying to do (big picture):
Index/chop up log files based on testing time period. [testing time period = time operator turns on the script to perform tests until the script is turned off]
Index/chop up log files based on cycle. [system startup to shutdown would be one cycle]
Index all output messages. [I will get about 5 cycles per hour with 200-400 timestamped reported events per cycle, unless something unexpected occurs]
Goal: find out which events are not supposed to happen and investigate to fix
Types of outputs: count how often each event_identifier occurs in each cycle to create a baseline/statistical prediction based on event_identifier and event_identifier content. Find errors that reflect a need to fix something.
I am not expecting someone to do my job for me, but rather to be led in the right direction. I am still learning the Splunk data mining lingo.
What I am currently doing:
I am using the source log file for the cycle period
[this is what I can not figure out] For "cycle" I want the cycle to start every time the log outputs an event with a specific message until it sees that same message again.
Each event is divided based on example message below, event being from start message to end message
my (users\admin\search\local\props.conf) is as follows:
[Power Cycle]
EXTRACT-event_source = (?im)^\t(?P<event_source>[^\t]+)
EXTRACT-event_identifier = (?im)^(?:[^\t\n]*\t){4}(?P<event_identifier>[^\t]+)
EXTRACT-event_location = (?im)^\t\w+\t(?P<event_location>.+)
EXTRACT-event_start_ID = (?im)^(?P<event_start_ID>.+)
[Power Cycle]
BREAK_ONLY_BEFORE = SR \d\d\d
MUST_BREAK_AFTER = EN \d\d\d
NO_BINARY_CHECK = 1
SHOULD_LINEMERGE = true
pulldown_type = 1
I am testing this on my own time and hope to eventually present it to my supervisor to try and implement it as a common tool within our engineering department, especially when trying to prove system reliability.
Example Message:
SR 145
1371027603 1 1 Wed Jun 12 09:00:03 2013 200002348 4
bay90ct cupMonitor
ssProcStop.c 1509
The System Software has terminated.
EN 145
SR ### (event_start_ID) (--start message
1371027603(unique ID for specific time) 1(ignore) 1(ignore) Wed Jun 12 09:00:03 2013(tstamp) 200002348(event_identifier)
bay90ct(event_source) cupMonitor(Process)
ssProcStop.c(event_location) 1509(line in source)
The System Software has terminated.(message, can be multi-lined)
EN ### (---end message
Each cycle is delimited by the event message below: a new cycle begins at each occurrence of it. This is the first message logged when the system is turned on.
SR 415
1372052120 0 1 Mon Jun 24 05:35:20 2013 0 7
bay92ct Svc_Notepad
Notepad.c 44
This message was added by the OPERATOR to report on a problem:
PRODUCT CONFIGURATION|-- insert unique product information here--
EN 415
Example cycle test period
SR 261
1370995620 0 1 Wed Jun 12 00:07:00 2013 0 7
bay90ct Svc_Notepad
Notepad.c 44
This message was added by the OPERATOR to report on a problem:
RstHast Enabled - start command: startrsthast -shutdown . Type stoprsthast in unix shell to disable
EN 261
/////PLACE A BUNCH OF Cycles with messages HERE
SR 179
1371027942 0 1 Wed Jun 12 09:05:42 2013 0 7
bay90ct Svc_Notepad
Notepad.c 44
This message was added by the OPERATOR to report on a problem:
Rsthast Disabled
EN 179
If you have done this much: "I have been able to index the logs based on events and have been able to identify fields at search time", then you have made a good start. In this discussion, I will assume that "SR xxx" begins an event and "EN xxx" ends it - the full multi-line text is one event. This represents a single cycle.
In Splunk, you should be able to work from the events (cycles) to build the larger items - you should not index the same data three different ways.
So the real questions are: how do you define the "cycle period" in the first list item? Why do you need to do that? I assume that by "message" (third list item) you mean the message text that is embedded in the cycle. You should be able to define a field named message that contains that text.
You could actually see a list of the messages and the number of times each occurred with this simple search:
yoursearchhere | stats count by message
Where yoursearchhere represents something like sourcetype=mydatatype or source=logfilename, etc.
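If the message field does not exist yet, it can be pulled out at search time with rex. The following is only a sketch. It assumes the sourcetype is named "Power Cycle" (taken from the props.conf stanza in the question) and that every event has exactly four header lines (the SR line, the timestamp line, the source line, and the location line) before the message text. The regex has not been run against the real tab-delimited log and may need adjusting.

```spl
sourcetype="Power Cycle"
| rex "(?s)^(?:[^\n]*\n){4}(?<message>.*?)\s*EN \d+\s*$"
| stats count by message
| sort - count
```

The (?s) flag lets the message span several lines, so multi-line messages end up in one field value. Once the pattern works, it could be moved into props.conf as another EXTRACT- entry.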
What exactly do you want to see in your output?
SR xxx to EN xxx is one multi-line event, a cycle is about 200-400 events, some of them being expected messages, some of them unwanted messages.
A PET/CT medical imaging scanner runs a script that has it start up, report its status at a specific time, and then shut down. This cycle generates about 200-400 events. The startup/shutdown sequence runs in a continuous loop for a specified period of time (like a weekend).
Type of output: count how often each specific message occurs in each cycle to create a baseline/statistical prediction based on message ID and message ID content. Search for errors.
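One way to get cycles without re-indexing anything is to number them at search time: put the events in chronological order and bump a counter each time the cycle-start marker appears. This is a sketch, not a tested solution. It assumes the sourcetype is "Power Cycle", that the text "PRODUCT CONFIGURATION" from the operator message in the question appears only in the cycle-start event, and that event_identifier is extracted as in the props.conf above. The field names cycle, events, avg_per_cycle, and stdev_per_cycle are made up for illustration, and the two-standard-deviation threshold is an arbitrary starting point.

```spl
sourcetype="Power Cycle"
| reverse
| streamstats count(eval(searchmatch("PRODUCT CONFIGURATION"))) AS cycle
| stats count AS events BY cycle, event_identifier
| eventstats avg(events) AS avg_per_cycle stdev(events) AS stdev_per_cycle BY event_identifier
| where events > avg_per_cycle + 2 * stdev_per_cycle
```

Dropping the last two lines leaves the raw per-cycle count for each event_identifier, which is the baseline table itself. Note that an identifier that never occurs in a cycle simply has no row for that cycle; it is not counted as zero. The same streamstats approach should work for numbering test periods, using the "RstHast Enabled" operator message as the marker instead.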