Splunk Search

How to calculate session times from large data set?

eddieddieddie
Path Finder

I'm analysing VPN connection logs to produce a report of the count of staff working from home for longer than 6 hours a day.

Unfortunately the VPN session isn't started and ended by the staff member - the VPN just writes a log entry whenever data is sent. This means there is nothing in the data that can be used as a start-of-session or end-of-session flag.

I have tried using the transaction command with Username as the unique element, setting the 'maxpause' between sessions to 65 minutes. Example query:

 

index=VPN sourcetype=VPNlog
| transaction Username maxevents=-1 maxpause=65m
| stats sum(duration) as Duration BY Username

 

This worked for small sample sets of data. I could then extract the count of staff whose total duration of sessions was over 6 hours a day.

However, when I attempted to run this same query over the complete set of data, it produced an incomplete set of results along with the message:

"Some transactions have been discarded. To include them, add keepevicted=true to your transaction command."

Enabling keepevicted produces more results, but the figures are incorrect - I assume there are still too many events for the transaction command to analyse?

After reading about the limitations of the transaction command I tried using stats in its place - this works far quicker and across all the data, except there doesn't seem to be an equivalent of 'maxpause' to end a session. Instead, a staff member's duration always ends up being the time from the start of their first connection to the end of their last - which leads to people appearing to work 12-hour days because they log on remotely for a brief while in the morning, then briefly again in the evening.

Is there another way to use the transaction command that will allow it to handle more data? The results don't have to be returned quickly, as the search will be run overnight to produce the report.


eddieddieddie
Path Finder

Thank you both @VatsalJagani and @bowesmana - it's great that you both came up with working solutions using different approaches. I've learnt something new from each of them.

However, they both have a catch:
@bowesmana - Using your streamstats technique I get results up to a point, then the message "The maximum window size (10000) was reached." When this happens I start getting blanks in the data - the results start containing durations of zero. This only happens when applying it across the last 24 hours; shorter periods or fewer staff are fine, so I assume it's down to the number of raw events being analysed.

@VatsalJagani - Your solution almost works perfectly, except that remote working staff can appear to have more than one session ID simultaneously. I'm not really sure how this can happen, but the VPN seems to record the same staff member, on the same computer, with more than one session ID at the same time. This makes the total very inaccurate, as the same timespan ends up being counted more than once. I experimented with combinations of private and public IP as the secondary BY clause, but for a remote worker these barely change (if at all), so I end up counting from the very start of the first session to the end of the last.


bowesmana
SplunkTrust

@eddieddieddie 

Ah... I've also learnt something too - streamstats has a max_stream_window limit that defaults to 10,000 events.

I'm guessing the 65m time window is significant - if you reduced it to 30 minutes over the same 24h window, I would expect the issue to go away, as streamstats would be holding on to fewer events.

@VatsalJagani's approach of finding some other correlating data field that can be used to delineate a 'session' is naturally the best approach. If you have a session ID, you could handle the 'two concurrent sessions' case by looking at the start/end times of each Username's sessions and filtering out the overlapping sections.

eddieddieddie
Path Finder

Can you offer any advice on filtering out (or merging) the two concurrent sessions?

Based on this answer: https://community.splunk.com/t5/Splunk-Search/Can-you-help-me-identify-when-end-time-overlaps-start-... and @VatsalJagani's answer I came up with:

index=VPN sourcetype=VPNlog
| stats earliest(_time) as session_start, latest(_time) as session_end by Username, SessionID
| sort 0 session_start
| streamstats current=false last(session_start) as next_start_time by Username
| eval overlap=if(next_start_time<session_end,"yes","no")
| convert ctime(session_start) ctime(session_end) ctime(next_start_time)

Which, when sorted by username, now produces a column showing overlapping sessions. But I'm stuck merging these overlaps into one block of time: my thinking is that if a session starts at 2:15pm and lasts 30 minutes, and another starts at 2:25pm and lasts 35 minutes, they should be merged into one event starting at 2:15pm and lasting 45 minutes. I think the streamstats command, using range() and 'reset_before', would do it, but because of the earlier stats command I've lost the _time from the data? Am I heading in the right direction here, do you think?


bowesmana
SplunkTrust

@eddieddieddie I think the streamstats logic in your example is wrong.

I believe you should bring forward the session_end from the previous row and then check if the start time is before the previous end.

The search below then uses a second streamstats to find the earliest session start, resetting whenever a row does not overlap the previous one.

| streamstats current=false window=1 last(session_end) as last_end_time by Username
| eval overlap=if(session_start<last_end_time,"yes","no")
| streamstats min(session_start) as first_start max(session_end) as last_end by Username reset_before="overlap=\"no\""
| stats max(session_end) as session_end by Username first_start
| eval duration=session_end-first_start
| convert ctime(session_end) ctime(first_start)

The example below has a single user with 5 events forming 2 merged sessions: the first two events overlap, and the last three overlap. Hopefully this demonstrates something that will work for you.

| makeresults
| eval session_start=strptime("2022-12-11 14:25", "%F %H:%M")
| eval session_end  =strptime("2022-12-11 14:55", "%F %H:%M")
| append [
  | makeresults
  | eval session_start=strptime("2022-12-11 14:35", "%F %H:%M")
  | eval session_end  =strptime("2022-12-11 15:10", "%F %H:%M")
]
| append [
  | makeresults
  | eval session_start=strptime("2022-12-11 15:35", "%F %H:%M")
  | eval session_end  =strptime("2022-12-11 16:10", "%F %H:%M")
]
| append [
  | makeresults
  | eval session_start=strptime("2022-12-11 15:45", "%F %H:%M")
  | eval session_end  =strptime("2022-12-11 16:30", "%F %H:%M")
]
| append [
  | makeresults
  | eval session_start=strptime("2022-12-11 15:45", "%F %H:%M")
  | eval session_end  =strptime("2022-12-11 17:10", "%F %H:%M")
]

| eval Username="eddieddieddie"
| eval SessionID="12345"
| fields - _time
| sort 0 session_start
| streamstats current=false window=1 last(session_end) as last_end_time by Username
| eval overlap=if(session_start<last_end_time,"yes","no")
| streamstats min(session_start) as first_start max(session_end) as last_end by Username reset_before="overlap=\"no\""
| stats max(session_end) as session_end by Username first_start
| eval duration=session_end-first_start
| convert ctime(session_end) ctime(first_start)
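
To then feed this into the original report (the total per user against the 6-hour threshold), a final sum over the merged blocks could be appended - a sketch only, reusing the threshold from the original question:

| stats sum(duration) as total_duration by Username
| where total_duration > 6*3600

Each row that survives the where is a user whose merged, non-double-counted session time exceeds 6 hours.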

 


VatsalJagani
SplunkTrust

@eddieddieddie - I would say use stats for better performance.

Here is how I would approach it:

  • I would look for any field other than "Username" that determines a unique VPN session.
  • That could be a unique SessionID generated by the VPN system. If you don't find such a field, you could use a combination of Username and Src_IP or Private_IP (the private internal IP assigned by the VPN system).
  • The search query would be something like:
index=VPN sourcetype=VPNlog
| stats earliest(_time) as session_start, latest(_time) as session_end by Username, <other group by fields>
| eval duration = session_end - session_start
| where duration > 6*3600
  • Now for <other group by fields>, you could try the combinations below and see which gives the most accurate result:
    • SessionID
    • src, private_ip
    • src
    • private_ip

 

  • Now the above search considers each VPN session separately. If you want to include and sum the durations of all of a user's sessions, you could add one more stats command.
index=VPN sourcetype=VPNlog
| stats earliest(_time) as session_start, latest(_time) as session_end by Username, <other group by fields>
| eval duration = session_end - session_start
| stats count as no_of_sessions, sum(duration) as total_duration by Username
| where total_duration > 6*3600
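
Since the original report needs totals per day, the sessions could also be grouped by calendar day - a sketch only, assuming SessionID turns out to be the right grouping field, using bin to bucket each event into a day:

index=VPN sourcetype=VPNlog
| bin span=1d _time as day
| stats earliest(_time) as session_start, latest(_time) as session_end by Username, SessionID, day
| eval duration = session_end - session_start
| stats sum(duration) as total_duration by Username, day
| where total_duration > 6*3600
| stats dc(Username) as staff_over_6_hours by day

Note that a session spanning midnight would be split across two day buckets with this approach.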

 

I hope this helps!!!

bowesmana
SplunkTrust

stats/streamstats is the way to go. transaction is not really designed to work with large data sets and long durations.

streamstats, although a relatively slow command, does not have the limitations that transaction has.

For example, you can do 

| streamstats time_window=65m range(_time) as range by Username

That will give you a range value of 0 when there has been a 65-minute gap between events, which indicates the effective end of a session.

So then it's a job of working out the duration of each 'session'. This search works out the length of each session, with the last row before duration resets to 0 holding the true duration of that session's log events.

| streamstats reset_before="range=0" range(_time) as duration by Username

Then it's a matter of extracting that duration from the stream, which could be done with

| eval session=if(range=0, _time, null())
| filldown session
| stats max(duration) as duration by Username session

then the final stats to sum up the totals

| stats sum(duration) as duration by Username

I think that may do the job - it might be a little off in the details, but give it a go.
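
Putting those steps together, the whole search might look like this - a rough sketch only. Note the sort by Username then time is an assumption on my part: streamstats time_window expects time-ordered events, and filldown has no by clause, so each Username's rows need to be contiguous and in time order for it to behave.

index=VPN sourcetype=VPNlog
| sort 0 Username _time
| streamstats time_window=65m range(_time) as range by Username
| streamstats reset_before="range=0" range(_time) as duration by Username
| eval session=if(range=0, _time, null())
| filldown session
| stats max(duration) as duration by Username session
| stats sum(duration) as duration by Username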
