I am needing to find earlier version number of linux patches. I have to compare many patches, so I was wanting to use a join for two queries (assuming patching happens once a month, but not all packages have an update every month). The first query would get the latest packages patched (with in the last 30 days) - depending on what day of the month the patching occurred - I would like to pass the earliest datetime stamp found minus X seconds (as MaxTime) to the second query. So, the second query could use the same index, source, sourcetype but where latest=MaxTime.
Don't try this at home,
putting latest=MaxTime-10 in the second query caused Splunk to laugh at me and return "Invalid value 'MaxTime-10' for time term 'latest'"...no hard feelings, Splunk laughs at me often.
Thanks for any assistance in advance.
JLund
It does sounds from your data description of package, current_version, current_date, previous_version, previous_date, that a lookup may be a practical way to maintain the current state of events.
Are you planning to run this search once a month or daily, but it would seem that an approach might be to have a search that looks for data in the last 30 days. At this point it will have version and date info for package X
It can then do a lookup of that package to the lookup data and get 'current/previous' info for the package.
It's not difficult to manage that lookup to update it, but I'm not clear on what you want from this list.
As you don't know how far back a 'previous' patch was installed you have know way of knowing how far back to search data, so the lookup will give you all that immediately.
I am guessing you may also want to be looking at host+package, not just package, so depending on how many hosts/packages you have, the lookup could be reasonably big, but much depends on how often you want to read/use this data.
Ok, you probably come to Splunk from a different background so it's only natural that you don't have Splunk habits (yet) and try to solve problems the way you know. Don't worry it will come with time.
One thing worth remembering is that join is very rarely the way to go with Splunk.
Tell us what you have in your data (sample events, obfuscated if needed), explaining if there are any relationships between different events, what you want to get from that data (not how you're trying to get it) and we'll see what can be done.
An example would be the linux kernel, for instance if this month's patching process a new kernel patch was applied, we would like to find the last kernel version that was installed, the month before, two months before or even earlier if that was the case. Kernel version change fairly regularly, but some of the other linux packages might change/update a little less frequently. After we patch we capture the list of installed packages and ingest the data into Splunk. So every month we have the data of the current installed packages, for compliance reasons, we need to verify what packages were updated during our patching process. So we are trying to compare the latest installed packages list, with the installed package lists from previous months.
Our output would be something like this:
package current version install date previous version previous install date
kernel
ssh
python
glibc
etc...
Ok. Now it's a bit better described but.
1. You still haven't shown us a sample of the actual events.
2. Not everything in Splunk can be done (reasonably and effectively) with just a single search. Maybe you could bend over backwards and compose some monster using subsearches and map but it will definitely not be a good solution. Performance would be bad and you still might hit limits for subsearches and get wrong results. It sounds like something that should be done by means of repeated scheduled search storing intermediate state in a lookup.
You might try to to search through "all-time" and build a huge list of everything that happened in your index only to choose two most recent changes but it would consume a lot of memory and is not really a good solution.
Also, our SMEs don't all patch their servers on the same day, but they usually patch similar servers on the same day. So Monday, they might patch IDS servers, and Wednesday they might scan the vulnerability scanners and the following week they might patch all the application servers, etc.
First, some quick comment. Judging from your attempted SPL, and your tendency to think "join", Splunk is not laughing. It is just a stranger to you. And strangers can be intimidating. I often keep a browser tab open with https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/ so I can easily lookup what's available, and what syntax any command or function requires. For example, Time modifiers describe in much detail what you can give as earliest and latest. MaxTime is just not a thing in SPL. If MaxTime is a field you calculated previously, you cannot easily pass it into a subsearch (which is what join command must call). For this latter point to strike home, you will need some more familiarity about how Splunk work. Splunk's Search Manual can be a good start point to learn those.
Back to your actual use case, technically you can make Splunk do exactly what you wanted. But as I hinted above, Splunk - and Splunk practitioners like me, intimidate, nay bully those who dare to join. Unless absolutely necessary, just use a Splunk-friendly query to achieve the same goal. It will benefit you in the short term as well as long.
You mention time periods you tried to connect the two searches, but give no indication as to what is the link between the two searches. It seems obvious that you are not trying to "join" the two searches by _time. So, there must be some other logic other than just wanting to set time interval differently. Can you describe the actual logic in your use case? What is the output you are trying to get? What are some data characteristics that help you arrive at your output? Illustrate in concrete terms or mockup data.
The link between the two searches would be our monthly list of installed packages, after patching we gather the current list of installed packages and ingest the data into Splunk. We would like to compare the list from the current month to the lists from previous months, because not all packages have an update/patch each month.
So for kernel as an example, there are frequent updates/patches and usually changes every month. But, for less frequently update/patched packages we might need to compare back two or more months. So I would want to compare the current installed packages with the last two or even as far back as six months or a year.
I thought if I "joined" the list of previous installed packages that had been deduped or stats latest(version) AS previous_version, latest(_time) AS previous_installed_date by package, I could capture the last version and installed date of each package.
search 1 would have the list of the current packages - package, installed date, version
search 2 would have the list of the last installed date and last version of all previously installed packages with different field names for installed date and version
the join would join the two lists by package name
output would be package, version, installed date, last version, last installed date
I get the impression that your requirement can be reinterpreted as listing the last two installed version and their installed times. Is this accurate?
As @bowesmana suggested, this problem would be best solved by maintaining a lookup table, then work from there. Any search that does not use a static dataset like lookup is bound to be inefficient because your lookback period cannot be predetermined.
As a proof of concept, here is a literal implementation of my interpretation of your requirement. The premise is that you make a search with sufficient coverage for last two versions of packages of interest. Assume that the search returns something like the following
_time | host | package | version |
2024-01-21 | host1 | somesoft1 | 1.2.1 |
2024-01-21 | host2 | somesoft2 | 2.2.3 |
2024-03-02 | host1 | somesoft1 | 1.2.5 |
2024-03-03 | host2 | somesoft2 | 2.3.0 |
2024-04-10 | host1 | somesoft1 | 1.2.10 |
You then apply the following:
<some search with sufficient history>
| stats max(_time) as _time by package version
| eval version = json_object("version", version, "install_time", _time)
| stats list(version) as version_installed by package
| eval version = json_extract(mvindex(version_installed, -1), "version"), "installed date" = json_extract(mvindex(version_installed, -1), "install_time")
| eval last_version = json_extract(mvindex(version_installed, -2), "version"), "last installed date" = json_extract(mvindex(version_installed, -2), "install_time")
| fieldformat "installed date" = strftime('installed date', "%F")
| fieldformat "last installed date" = strftime('last installed date', "%F")
| fields - version_installed
This should give a table like this
package | installed_date | last installed date | last_version | version |
somesoft1 | 2024-03-02 | 2024-04-10 | 1.2.10 | 1.2.5 |
somesoft2 | 2024-03-03 | 2024-01-21 | 2.2.3 | 2.3.0 |
What the code really illustrates is the general approach of a semantic "join" without using join command. stats is a lot more efficient in SPL. lookup, using binary search, is another very efficient method.
Here is an emulation that produces the mock search output above. Play with it and compare with real data.
| makeresults format=csv data="_time,host,package,version
2024-01-21,host1,somesoft1,1.2.1
2024-01-21,host2,somesoft2,2.2.3
2024-03-02,host1,somesoft1,1.2.5
2024-03-03,host2,somesoft2,2.3.0
2024-04-10,host1,somesoft1,1.2.10"
| eval _time = strptime(_time, "%F")
``` data emulation above ```
Also, our SMEs don't all patch their servers on the same day, but they usually patch similar servers on the same day. So Monday, they might patch IDS servers, and Wednesday they might scan the vulnerability scanners and the following week they might patch all the application servers, etc. So the installed dates are frequently different and sometime versions are different between all our linux hosts but are consistent between the servers of the same type (IDS, Scanner, Application, etc.)
Is splunk laughing even harder, does the second query - the one inside the join run before the query outside the join?