I have a very large dataset of events (millions of events per hour of various event types) which are all part of the same dataset.
There are two event types that I am interested in (DNS Events and Process information) that i would to link/join/combine in someway for output/reporting purposes. I have scoured numerous other posts but have not found an appropriate solution that seems to work for my use case.
An example of what I want to achieve is to be able to search for ALL DNS Requests (event_simpleName=DNSRequest) in a certain timeframe and ascertain the corresponding Process Name (event_simpleName=ProcessRollup2). The only thing that links these two event types is a corresponding ProcessId that is common amongst both event types although has a different name.
Example DNS Query:
event_simpleName=DNSRequest | table event_simpleName DomainName ContextProcessId_decimal
event_simpleName DomainName ContextProcessId_decimal
DnsRequest blah.com 2345465654232
DnsRequest example.com 1679455209688
...
Example Process Information:
event_simpleName=ProcessRollup2 | table event_simpleName ProcessName TargetProcessId_decimal
event_simpleName ProcessName TargetProcessId_decimal
ProcessRollup2 chrome.exe 1679455209688
ProcessRollup2 firefox.exe 2345465654232
...
Desired Output
ProcessName DomainName NormalizedProcessId_decimal
chrome.exe example.com 1679455209688
firefox.exe blah.com 2345465654232
Note the difference between process id's is the DNS Requests process information is called "ContextProcessId_decimal" where as the process id is called TargetProcessId_decimal. However I understand I can rename one field as part of the query or use coalesce to normalize these fields.
I first tried using join which I can get to successfully work if i narrow down the search criteria for example to specific event process id's or domain name successfully. However the problem is i do no know this information in advance and want the lookups to be dynamic in nature. I believe the limitation I am encountering with this approach is that there is a join row limit and with my dataset being so large it is not feasible as there is far more Process Events than the join limit.
The next approach I had was to use subsearch so firstly I search for the DNSRequests then search the Process information.
event_simpleName=ProcessRollup2 [search event_simpleName=DnsRequest
| fields + ContextProcessId_decimal
| rename ContextProcessId_decimal AS TargetProcessId_decimal]
This approach identifies all of the processes I care about however I lose the underlying Domain Name associated to the process.
Any assistance/guidance would be very much appreciated.
Thanks
Avoid join. This should be sufficient:
(event_simpleName=DNSRequest) OR (event_simpleName=ProcessRollup2) | eval NormalizedProcessId_decimal=coalesce(ContextProcessId_decimal, TargetProcessId_decimal) | stats values(ProcessName) as ProcessName, values(DomainName) as DomainName by NormalizedProcessId_decimal
Hth,
Kai.
Avoid join. This should be sufficient:
(event_simpleName=DNSRequest) OR (event_simpleName=ProcessRollup2) | eval NormalizedProcessId_decimal=coalesce(ContextProcessId_decimal, TargetProcessId_decimal) | stats values(ProcessName) as ProcessName, values(DomainName) as DomainName by NormalizedProcessId_decimal
Hth,
Kai.
Thankyou very much that worked and is a viable solution to my problem. I have marked your answer accordingly. If i understand correctly, this effectively runs two separate queries in parallel and then merges the results based on the stats based on the normalized field being common. Given the DNS Requests I am looking for are likely going to be in the numbers of thousands of events where as the process information as a whole is going to be in the hundreds of millions depending upon the timescale and the majority being of no value. The query as you can imagine operates quite slowly on large datasets as it is gathering all process information.
Is it possible to expand upon this foundation to optimize it to make it even more efficient for larger datasets?
For example perform the DNS query and ascertain the ID's we care about then search the process events just for those IDs.
I have tested this concept and it works albeit manual and a two stage process which ideally I would like to streamline as much as possible.
Initially identify the process event id's for the dns entries I care about. (This is very quick). Export the list and comma delimit in an editor then use these event id's as a filter for stage 2.
event_simpleName=DNSRequest | table ContextProcessId_decimal
Perform the below query specifying the process id's identified above for both the DNS Requests and Process Information using the query you provided.
(event_simpleName=DNSRequest ContextProcessId_decimal IN (10348240759135,1939925450819)) OR (event_simpleName=ProcessRollup2 TargetProcessId_decimal IN (10348240759135,1939925450819))
| eval NormalizedProcessId_decimal=coalesce(ContextProcessId_decimal, TargetProcessId_decimal)
| stats values(ProcessName) as ProcessName, values(DomainName) as DomainName by NormalizedProcessId_decimal
Notes:
Whilst the above executes much faster it is a manual two stage process. Ideally I would like to automate this concept for speed of execution benefits if possible.
Although the above queries to prove the concept I have only specified two IDs, the reality is this will be in the thousands. I suspect there will be a finite number.
Is it possible to output stage 1 into a list then use that list in stage 2 seamlessly?
Any assistance would be appreciated.
Thanks