Solved: How do I create a calculated field based on data f...

ahmed · ‎11-04-2020

Hello,

I am working with historical log data from a train system and I have two different types of log files:

log1: each row is an event that was logged every time a train arrived at a station.
log2: each row is an event that was logged every time a train station sign displayed a message. The messages predicted how many minutes it will take for the next train to arrive.

My goal:

Add a new column to the log2 events which is a calculated field. The calculation is the difference between the predicted arrival time (time of log2 event + predicted minutes until arrival) and the actual arrival time (time of the log1 event which corresponds to the log2 event).

My proposed solution:

For each log1 row, generate a table of all the corresponding log2 rows. Corresponding means:

The log2 timestamp is within a 30 minute time range BEFORE the log1 timestamp.
The log2 "next_serial" field is equal to the log1 "serial" field.
The log2 "platform" field is equal to the log1 "platform" field.

In other words, for every train that arrived, I will list all the logged train station sign messages that predicted the arrival time of that train.

After creating the table of log2 events specified above, I will calculate the new field, which I will call "prediction_deviation", for all the log2 rows in the table. I am able to do this because I now know which log1 event all the rows in this table correspond with.

Question:

Can someone please help me implement my proposed solution, or suggest a better method to create this calculated field? Which SPL commands can I use? I have been looking at the documentation for the stats, eval, and transaction commands, but I have not found a way to use these to solve my problem.

I would appreciate any help or guidance. Thank you.

ahmed · ‎11-11-2020

Thank you for your detailed solution. Unfortunately it did not work because as you pointed out there are overlapping messages for the same serial and platform. This is why one of the conditions for correlation that I listed in the original question is that the log2 events occurred within 30 minutes before the log1 event.

I was actually able to solve the correlation problem by creating a field alias for the "next_serial" field and then using the transaction command with maxspan=30m.

Thank you for your time and effort.

View solution in original post

isoutamo · ‎11-05-2020

Can you post sample of logs so we could understand this better?

ahmed · ‎11-07-2020

Thank you for responding to my question.

Here is what a log1 file looks like:

date	time	serial	platform
2020/10/17	12:00:01	104	24A
2020/10/17	12:08:38	216	36B
2020/10/17	12:14:52	115	9A
2020/10/17	12:25:34	104	25A
2020/10/17	12:33:13	115	10A
2020/10/17	12:47:40	216	35B
2020/10/17	12:54:05	104	26A

Each row in a log1 file is logged when a train arrives at a station.

Here is what a log2 file looks like:

date	time	next_serial	platform	next_min
2020/10/17	11:32:25	104	24A	26
2020/10/17	11:46:07	115	9A	28
2020/10/17	11:54:07	216	36B	13
2020/10/17	11:43:21	104	24A	15
2020/10/17	11:56:44	115	9A	20
2020/10/17	11:59:02	216	36B	5
2020/10/17	11:57:08	104	24A	3
2020/10/17	12:09:29	115	9A	4
2020/10/17	12:05:58	216	36B	1

Each row in a log2 file is logged when a message is displayed on a platform sign at a train station. The "next_min" field is the predicted number of minutes until that specific train will arrive at the platform.

The "serial" field in log1 is the serial number of the train which arrived. The "next_serial" field in log2 is the serial number of the train which the message is predicting the arrival time for. This serial number is unique to each train and is the same number logged in both files.

To correlate a log2 platform sign message with the log1 arrival of the train, the three conditions I described in my original question need to be true. This is because each train will make stops at different platforms and will also stop at each platform multiple times in a day. Platform signs begin displaying predictions for a trains arrival up to 30 minutes before its predicted arrival.

I hope that clears up any confusion. Thanks again for any insight you can share regarding my question.

ITWhisperer · ‎11-05-2020

What are serial and next_serial? Are they unique identifiers for the train? If not, do your events in both logs have a unique train identifier field?

ahmed · ‎11-07-2020

Thank you for your response.

The "serial" field in log1 and the "next_serial" field in log2 are the serial number of the train which is a unique identifier for each train. However, each train will complete several circuits around the track in a day, so the same train will be logged at the same platform multiple times a day. This is why I chose to limit the search for lines in log2 which correspond to a line in log1 to messages logged 30 minutes before the scheduled train arrives at the platform.

Moreover, each train makes stops at different train platforms, which is why corresponding log1 and log2 lines need to have the same platform and serial number logged.

I hope that clears it up.

ITWhisperer · ‎11-08-2020

| makeresults | eval _raw="date	time	serial	platform
2020/10/17	12:00:01	104	24A
2020/10/17	12:08:38	216	36B
2020/10/17	12:14:52	115	9A
2020/10/17	12:25:34	104	25A
2020/10/17	12:33:13	115	10A
2020/10/17	12:47:40	216	36B
2020/10/17	12:54:05	104	24A"
| multikv
| fields - _time _raw linecount
| append [ | makeresults | eval _raw="date	time	next_serial	platform	next_min
2020/10/17	11:32:25	104	24A	26
2020/10/17	11:46:07	115	9A	28
2020/10/17	11:54:07	216	36B	13
2020/10/17	11:43:21	104	24A	15
2020/10/17	11:56:44	115	9A	20
2020/10/17	11:59:02	216	36B	5
2020/10/17	11:57:08	104	24A	3
2020/10/17	12:09:29	115	9A	4
2020/10/17	12:05:58	216	36B	1
2020/10/17	11:58:25	104	25A	26
2020/10/17	12:05:07	115	10A	28
2020/10/17	12:34:07	216	36B	13
2020/10/17	12:09:21	104	25A	15
2020/10/17	12:13:44	115	10A	20
2020/10/17	12:42:02	216	36B	5
2020/10/17	12:21:08	104	25A	3
2020/10/17	12:29:29	115	10A	4
2020/10/17	12:45:58	216	36B	1
2020/10/17	12:28:25	104	24A	26
2020/10/17	12:39:21	104	24A	15
2020/10/17	12:51:08	104	24A	3"
| multikv
| fields - _time _raw linecount 
]
| eval arrival_time=if(isnull(next_min),strptime(date." ".time,"%Y/%m/%d %H:%M:%S"),null)
| eval message_time=if(isnull(next_min),null,strptime(date." ".time,"%Y/%m/%d %H:%M:%S"))
| eval next_time=if(isnull(next_min),null,message_time+(next_min * 60))
| eval _time=coalesce(arrival_time,message_time)
| eval serial=coalesce(serial,next_serial)
| sort - _time
| streamstats last(arrival_time) as arrival_time by platform serial
| where isnotnull(message_time)
| eval prediction_deviation=arrival_time-next_time
| fieldformat message_time=strftime(message_time,"%Y/%m/%d %H:%M:%S")
| fieldformat next_time=strftime(next_time,"%Y/%m/%d %H:%M:%S")
| fieldformat arrival_time=strftime(arrival_time,"%Y/%m/%d %H:%M:%S")

I added some extra message data for the other trains and put third 104 on the same platform as the first and put both 216's on the same platform. Negative prediction_deviations (in seconds) means the train actually arrived before the time predicted by the message. I ignored the 30 minute requirement. Filtering out next_time > 30 might suffice for this requirement otherwise you may need to do some extra calculations. Problems arise if there are overlapping messages for the same serial and platform e.g. if all the 104's were on platform 24A. Hopefully, this is not a real occurrence for you, otherwise you need another way of distinguishing between trains.

ahmed · ‎11-11-2020

Thank you for your detailed solution. Unfortunately it did not work because as you pointed out there are overlapping messages for the same serial and platform. This is why one of the conditions for correlation that I listed in the original question is that the log2 events occurred within 30 minutes before the log1 event.

I was actually able to solve the correlation problem by creating a field alias for the "next_serial" field and then using the transaction command with maxspan=30m.

Thank you for your time and effort.

How do I create a calculated field based on data from a different type of log file?

eval

fields

lookup

stats

table

transaction

Stay Connected: Your Guide to November Tech Talks, Office Hours, and Webinars!

Transform your security operations with Splunk Enterprise Security

Splunk Admins and App Developers | Earn a $35 gift card!