Hello,
I am working with historical log data from a train system and I have two different types of log files:
My goal:
Add a new column to the log2 events which is a calculated field. The calculation is the difference between the predicted arrival time (time of log2 event + predicted minutes until arrival) and the actual arrival time (time of the log1 event which corresponds to the log2 event).
My proposed solution:
For each log1 row, generate a table of all the corresponding log2 rows. Corresponding means:
In other words, for every train that arrived, I will list all the logged train station sign messages that predicted the arrival time of that train.
After creating the table of log2 events specified above, I will calculate the new field, which I will call "prediction_deviation", for all the log2 rows in the table. I am able to do this because I now know which log1 event all the rows in this table correspond with.
Question:
Can someone please help me implement my proposed solution, or suggest a better method to create this calculated field? Which SPL commands can I use? I have been looking at the documentation for the stats, eval, and transaction commands, but I have not found a way to use these to solve my problem.
I would appreciate any help or guidance. Thank you.
Thank you for your detailed solution. Unfortunately it did not work because as you pointed out there are overlapping messages for the same serial and platform. This is why one of the conditions for correlation that I listed in the original question is that the log2 events occurred within 30 minutes before the log1 event.
I was actually able to solve the correlation problem by creating a field alias for the "next_serial" field and then using the transaction command with maxspan=30m.
Thank you for your time and effort.
Thank you for responding to my question.
Here is what a log1 file looks like:
date | time | serial | platform |
2020/10/17 | 12:00:01 | 104 | 24A |
2020/10/17 | 12:08:38 | 216 | 36B |
2020/10/17 | 12:14:52 | 115 | 9A |
2020/10/17 | 12:25:34 | 104 | 25A |
2020/10/17 | 12:33:13 | 115 | 10A |
2020/10/17 | 12:47:40 | 216 | 35B |
2020/10/17 | 12:54:05 | 104 | 26A |
Each row in a log1 file is logged when a train arrives at a station.
Here is what a log2 file looks like:
date | time | next_serial | platform | next_min |
2020/10/17 | 11:32:25 | 104 | 24A | 26 |
2020/10/17 | 11:46:07 | 115 | 9A | 28 |
2020/10/17 | 11:54:07 | 216 | 36B | 13 |
2020/10/17 | 11:43:21 | 104 | 24A | 15 |
2020/10/17 | 11:56:44 | 115 | 9A | 20 |
2020/10/17 | 11:59:02 | 216 | 36B | 5 |
2020/10/17 | 11:57:08 | 104 | 24A | 3 |
2020/10/17 | 12:09:29 | 115 | 9A | 4 |
2020/10/17 | 12:05:58 | 216 | 36B | 1 |
Each row in a log2 file is logged when a message is displayed on a platform sign at a train station. The "next_min" field is the predicted number of minutes until that specific train will arrive at the platform.
The "serial" field in log1 is the serial number of the train which arrived. The "next_serial" field in log2 is the serial number of the train which the message is predicting the arrival time for. This serial number is unique to each train and is the same number logged in both files.
To correlate a log2 platform sign message with the log1 arrival of the train, the three conditions I described in my original question need to be true. This is because each train will make stops at different platforms and will also stop at each platform multiple times in a day. Platform signs begin displaying predictions for a trains arrival up to 30 minutes before its predicted arrival.
I hope that clears up any confusion. Thanks again for any insight you can share regarding my question.
What are serial and next_serial? Are they unique identifiers for the train? If not, do your events in both logs have a unique train identifier field?
Thank you for your response.
The "serial" field in log1 and the "next_serial" field in log2 are the serial number of the train which is a unique identifier for each train. However, each train will complete several circuits around the track in a day, so the same train will be logged at the same platform multiple times a day. This is why I chose to limit the search for lines in log2 which correspond to a line in log1 to messages logged 30 minutes before the scheduled train arrives at the platform.
Moreover, each train makes stops at different train platforms, which is why corresponding log1 and log2 lines need to have the same platform and serial number logged.
I hope that clears it up.
| makeresults | eval _raw="date time serial platform
2020/10/17 12:00:01 104 24A
2020/10/17 12:08:38 216 36B
2020/10/17 12:14:52 115 9A
2020/10/17 12:25:34 104 25A
2020/10/17 12:33:13 115 10A
2020/10/17 12:47:40 216 36B
2020/10/17 12:54:05 104 24A"
| multikv
| fields - _time _raw linecount
| append [ | makeresults | eval _raw="date time next_serial platform next_min
2020/10/17 11:32:25 104 24A 26
2020/10/17 11:46:07 115 9A 28
2020/10/17 11:54:07 216 36B 13
2020/10/17 11:43:21 104 24A 15
2020/10/17 11:56:44 115 9A 20
2020/10/17 11:59:02 216 36B 5
2020/10/17 11:57:08 104 24A 3
2020/10/17 12:09:29 115 9A 4
2020/10/17 12:05:58 216 36B 1
2020/10/17 11:58:25 104 25A 26
2020/10/17 12:05:07 115 10A 28
2020/10/17 12:34:07 216 36B 13
2020/10/17 12:09:21 104 25A 15
2020/10/17 12:13:44 115 10A 20
2020/10/17 12:42:02 216 36B 5
2020/10/17 12:21:08 104 25A 3
2020/10/17 12:29:29 115 10A 4
2020/10/17 12:45:58 216 36B 1
2020/10/17 12:28:25 104 24A 26
2020/10/17 12:39:21 104 24A 15
2020/10/17 12:51:08 104 24A 3"
| multikv
| fields - _time _raw linecount
]
| eval arrival_time=if(isnull(next_min),strptime(date." ".time,"%Y/%m/%d %H:%M:%S"),null)
| eval message_time=if(isnull(next_min),null,strptime(date." ".time,"%Y/%m/%d %H:%M:%S"))
| eval next_time=if(isnull(next_min),null,message_time+(next_min * 60))
| eval _time=coalesce(arrival_time,message_time)
| eval serial=coalesce(serial,next_serial)
| sort - _time
| streamstats last(arrival_time) as arrival_time by platform serial
| where isnotnull(message_time)
| eval prediction_deviation=arrival_time-next_time
| fieldformat message_time=strftime(message_time,"%Y/%m/%d %H:%M:%S")
| fieldformat next_time=strftime(next_time,"%Y/%m/%d %H:%M:%S")
| fieldformat arrival_time=strftime(arrival_time,"%Y/%m/%d %H:%M:%S")
I added some extra message data for the other trains and put third 104 on the same platform as the first and put both 216's on the same platform. Negative prediction_deviations (in seconds) means the train actually arrived before the time predicted by the message. I ignored the 30 minute requirement. Filtering out next_time > 30 might suffice for this requirement otherwise you may need to do some extra calculations. Problems arise if there are overlapping messages for the same serial and platform e.g. if all the 104's were on platform 24A. Hopefully, this is not a real occurrence for you, otherwise you need another way of distinguishing between trains.
Thank you for your detailed solution. Unfortunately it did not work because as you pointed out there are overlapping messages for the same serial and platform. This is why one of the conditions for correlation that I listed in the original question is that the log2 events occurred within 30 minutes before the log1 event.
I was actually able to solve the correlation problem by creating a field alias for the "next_serial" field and then using the transaction command with maxspan=30m.
Thank you for your time and effort.