Splunk Search

## List common substrings of at least 5 stations in a subway travel history of users

Contributor

List common substrings of at least 5 stations. Also list the users who followed each substring. Is this a Splunk problem, or algorithm-class homework? A similar problem would be listing the top click streams of at least 5 pages in a web analytics Sankey diagram.

https://en.wikipedia.org/wiki/Longest_common_substring_problem

subwayHistory.csv:

```
_time,user,station
...
```

EDIT1: changed "subsequence" to "substring", since users cannot jump stops.

EDIT2: added the web analytics click-stream problem.

1 Solution
SplunkTrust

I had been asking about a use case because I don't really want to be doing people's homework for them, and this sounds amazingly like homework. Regardless of that, I have an answer, at least as far as I see it.

So, let's start with my CSV. I generated it in Excel with three columns. Column A starts as "=now()", with each subsequent cell equal to the previous cell plus (1/86000). I know that's not quite adding one second per event, but I like the occasional missed second - it makes me feel alive. Column B (user) is "=CEILING.MATH(RAND()*12)" (I should have used more users, but it works). Column C (station) is "=CHAR((RAND()*26)+97)".
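For anyone without Excel handy, here's a rough Python stand-in for that recipe (my own sketch, not the author's file; the filename subwayHistory.csv comes from the question):

```python
import csv
import random
from datetime import datetime, timedelta

# Rough stand-in for the Excel recipe: timestamps about one second apart,
# 12 users, stations a-z.
def make_rows(n, start=datetime(2015, 11, 14, 20, 10, 42)):
    rows, t = [], start
    for _ in range(n):
        rows.append((t.strftime("%m/%d/%y %H:%M:%S"),
                     random.randint(1, 12),                     # user 1-12
                     chr(random.randint(ord("a"), ord("z")))))  # station a-z
        t += timedelta(seconds=1)
    return rows

with open("subwayHistory.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["_time", "user", "station"])
    w.writerows(make_rows(2200))
```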

Here's the first few lines of that.

```
time    user        station
11/14/15 20:10:42   3   i
11/14/15 20:10:43   9   h
11/14/15 20:10:44   2   k
11/14/15 20:10:45   10  p
```

I repeat that, oh, a couple thousand times. I think 2200 in this example.

So here's a terrible search, full of repeated things that I'm sure someone smarter can fix for me. I break it down somewhat later, but here it is whole.

```
index="temp_answers" sourcetype="csv" | transaction user maxpause=30s
| search eventcount>=5 | fields user, station | mvcombine station
| rex max_match=0 field=station "(?<path>(\w\s){5})" | rex field=station mode=sed "s/^\w\s(.*)/\1/"
| rex max_match=0 field=station "(?<path>(\w\s){5})" | rex field=station mode=sed "s/^\w\s(.*)/\1/"
| rex max_match=0 field=station "(?<path>(\w\s){5})" | rex field=station mode=sed "s/^\w\s(.*)/\1/"
| rex max_match=0 field=station "(?<path>(\w\s){5})" | rex field=station mode=sed "s/^\w\s(.*)/\1/"
| rex max_match=0 field=station "(?<path>(\w\s){5})" | rex field=station mode=sed "s/^\w\s(.*)/\1/"
| fields - station | stats count by path | sort - count
```

The top 5 results look like this (sorted so the most common paths are at the top):

```
path    count
p q r s t   20
n o p q r   17
m n o p q   16
o p q r s   15
q r s t u   14
```

That seems fine to me (there are a lot more results; I just didn't paste them all in here).

So, taking the search apart...

```
... | transaction user maxpause=30s
```

The transaction command combines all the events with the same user, as long as consecutive events aren't separated by more than 30 seconds; if they are, it starts a new transaction/group. The resulting transactions run from 1 to about 18 stations long, so that seems to work OK.
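The same grouping can be sketched in Python (an illustration of what transaction does here, not Splunk's actual implementation):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Split each user's stream of events into sessions wherever the gap between
# consecutive events exceeds maxpause -- roughly `transaction user maxpause=30s`.
def sessionize(events, maxpause=timedelta(seconds=30)):
    # events: (time, user, station) tuples, assumed sorted by time
    sessions = defaultdict(list)   # user -> list of station lists
    last = {}                      # user -> time of that user's previous event
    for t, user, station in events:
        if user not in last or t - last[user] > maxpause:
            sessions[user].append([])          # start a new transaction
        sessions[user][-1].append(station)
        last[user] = t
    return sessions
```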

```
| search eventcount>=5 | fields user, station | mvcombine station
```

We then keep only transactions that have at least 5 items (because our minimum path is 5 stations). We strip out all the other fields, leaving just "user" and "station". Then mvcombine does some magic: it takes the set of events in each transaction and combines the individual stations into one string of stations, giving me things like "a b d e f g h j m n p r s u v w y ".
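In Python terms, that filter-and-combine step looks something like this (the sample sessions are made up):

```python
# Keep only sessions of at least 5 stations, then flatten each into a single
# space-separated string -- roughly what `search eventcount>=5 | mvcombine
# station` leaves us with.
sessions = [list("abdefghjmnprsuvwy"), list("xyz")]
combined = [" ".join(s) + " " for s in sessions if len(s) >= 5]
# combined[0] is "a b d e f g h j m n p r s u v w y "
```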

```
| rex max_match=0 field=station "(?<path>(\w\s){5})" | rex field=station mode=sed "s/^\w\s(.*)/\1/"
```

The first part uses rex to grab as many runs of 5 characters (each followed by a space) as it can. That turns "a b d e f g h j m n p r s u v w y " into "a b d e f ", "g h j m n ", and so on - but each subsequent extraction starts after the previous run of 5, not one station over. Since I couldn't find a direct way to start a match at every position, I rex again in sed mode, shifting the string by one station (removing the first character and its space). Now we have several of the paths, and the station string shifted to look like "b d e f g h j m n p r s u v w y ".

This pair is repeated 5 times, each pass extracting from the newly shifted string: in the example I've been using, "b d e f g ", etc., then "d e f g h ", and so on.
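The net effect of those repeated rex/sed passes is an overlapping sliding window; in Python it collapses to one line (an illustration only, without the trailing spaces rex leaves behind):

```python
# Every run of 5 consecutive stations, sliding one station at a time --
# the result the repeated rex-and-shift pipeline builds up in `path`.
def paths_of_five(stations):
    return [" ".join(stations[i:i + 5]) for i in range(len(stations) - 4)]

paths_of_five(list("abdefg"))
# ['a b d e f', 'b d e f g']
```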

```
| fields - station | stats count by path | sort - count
```

Then I remove the station field, since it's no longer needed, and do a stats count by path. The final sort just puts the bigger counts on top - not important for this, I just wanted to.
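The tallying step can likewise be sketched with a Counter (made-up sessions; in the SPL it's `stats count by path | sort - count`):

```python
from collections import Counter

# Count every 5-station window across all sessions, most common first.
sessions = [list("bcdefg"), list("abcdef"), list("bcdef")]
counts = Counter(" ".join(s[i:i + 5])
                 for s in sessions
                 for i in range(len(s) - 4))
# counts.most_common(1) -> [('b c d e f', 3)]
```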

SplunkTrust

It occurred to me that if you only needed any paths 5 stops or longer (instead of all precisely 5-stop paths and sub-paths), this search is quite a bit shorter.

```
index="temp_answers" sourcetype="csv" | transaction user maxpause=30s
| search eventcount>=5 | fields user, station | mvcombine station
```

Obviously, fix up index and base search as required and adjust your maxpause (if necessary) on the transaction.

Then, the rest is just as it was above. In my data this doesn't show any duplicated paths when run against all the paths of 5 stops or more, but that's to be expected, because I chose my data randomly, whereas real subway trips probably show definite patterns.

Contributor

Sankey doesn't show the subway map, so I switched to a choropleth map with arrow-shaped islands overlaid on the geographical map. The islands are named station1_arrow_station2, and the result set contains from, to, count. I used paths of exactly 2 steps. It's an unorthodox application of the choropleth map visualization.

SplunkTrust

So, what is the problem you are trying to solve?

Contributor

Given a CSV file with the above schema, do you have a Splunk solution?
