Hello, imagine you have two fields: IP, ACCOUNT
An IP can access any number of ACCOUNTs, and an ACCOUNT can be accessed by any number of IPs.
For each IP, I want the number of ACCOUNTs it accesses.
For each ACCOUNT, I want the number of IPs accessing it.
Potentially easy.
Show the number of ACCOUNTs accessed by each IP, where those ACCOUNTs are accessed by more than one IP, and where the other IPs accessing them differ from ACCOUNT to ACCOUNT.
Confused? I'd like to find IPs accessing a lot of accounts where those accounts are also being accessed by more than one IP and the other IPs accessing those accounts are not all the same.
To start simple -
For each IP, the number of ACCOUNT it accesses.
<search terms> | stats dc(ACCOUNT) by IP
likewise,
<search terms> | stats dc(IP) by ACCOUNT
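For anyone who wants to sanity-check those two searches on a small export, here is a rough Python sketch of the same distinct counts. It is not SPL, and the sample events and the function name distinct_counts are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample events as (IP, ACCOUNT) pairs; repeats are expected.
events = [
    ("10.0.0.1", "alice"),
    ("10.0.0.1", "alice"),
    ("10.0.0.1", "bob"),
    ("10.0.0.2", "bob"),
    ("10.0.0.3", "carol"),
]

def distinct_counts(pairs):
    """Return (distinct accounts per IP, distinct IPs per account)."""
    accounts_by_ip = defaultdict(set)
    ips_by_account = defaultdict(set)
    for ip, account in pairs:
        accounts_by_ip[ip].add(account)   # like: stats dc(ACCOUNT) by IP
        ips_by_account[account].add(ip)   # like: stats dc(IP) by ACCOUNT
    return (
        {ip: len(accts) for ip, accts in accounts_by_ip.items()},
        {acct: len(ips) for acct, ips in ips_by_account.items()},
    )

accounts_per_ip, ips_per_account = distinct_counts(events)
```

The sets drop repeated events, so each (IP, ACCOUNT) combination is counted once, which is what dc() does.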
Those are obviously much simpler than what you're asking for.
Here's the best approach I can think of. Breaking the following search down in English: we take the unique combinations of ACCOUNT and IP (using stats). We then pipe these rows through eventstats so that each row gets a 'distinctIPs' field. The distinctIPs value is the number of distinct IP values that accessed that row's ACCOUNT. We then treat this as a rough weighting and add up the values for each IP. It's kind of a ridiculous field name, but for clarity I've called it "totalDistinctIPsAccessedByAccountsTheyAccessed".
<search terms> | stats count by ACCOUNT IP | eventstats dc(IP) as distinctIPs by ACCOUNT | stats count sum(distinctIPs) as totalDistinctIPsAccessedByAccountsTheyAccessed by IP | sort - totalDistinctIPsAccessedByAccountsTheyAccessed
In the end you get a list of the top IP addresses that accessed LOTS of accounts, weighted heavily towards those where the accessed accounts were themselves accessed by a LOT of IPs.
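To make the weighting concrete, here is a small Python sketch that mirrors the stats, eventstats, stats steps on made-up data. It is only an illustration of the logic, not SPL; the sample pairs and the name weighted_ip_scores are assumptions, and ties are broken by IP just to keep the output stable.

```python
from collections import defaultdict

def weighted_ip_scores(pairs):
    """Return (ip, account_count, total_distinct_ips) rows, highest total first."""
    # Step 1 (stats count by ACCOUNT IP): keep unique combinations only.
    unique_pairs = set(pairs)

    # Step 2 (eventstats dc(IP) as distinctIPs by ACCOUNT):
    # collect the distinct IPs seen for each account.
    ips_by_account = defaultdict(set)
    for ip, account in unique_pairs:
        ips_by_account[account].add(ip)

    # Step 3 (stats count sum(distinctIPs) by IP):
    # per IP, count its accounts and sum each account's distinct-IP count.
    account_count = defaultdict(int)
    total = defaultdict(int)
    for ip, account in unique_pairs:
        account_count[ip] += 1
        total[ip] += len(ips_by_account[account])

    # Step 4 (sort - total): descending by total, ties broken by IP.
    rows = [(ip, account_count[ip], total[ip]) for ip in total]
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows

# Hypothetical sample data: "bob" is shared by three IPs.
sample = [
    ("10.0.0.1", "alice"),
    ("10.0.0.1", "bob"),
    ("10.0.0.1", "bob"),
    ("10.0.0.2", "bob"),
    ("10.0.0.3", "bob"),
    ("10.0.0.3", "carol"),
]
scores = weighted_ip_scores(sample)
```

Note that each account's distinct-IP count includes the IP in question, so an IP that is the only one touching n accounts still scores n. Comparing the total against the plain account count shows how much of the score comes from shared accounts.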
phew. Hopefully I'm close. 😃
(Note: it's best to click 'comment on this answer' under my answer, rather than posting a new answer as a comment. Things get very confusing when the order of the answers changes later.)
Thanks Nick, I'll take a stab using your suggestions. I really wish I could do this in something like Perl or Python, but the data set is too large.