I have the following query that gives me a list of pods that are missing, based on a comparison against what should be deployed as defined in the pod_list.csv inputlookup.
index=abc sourcetype=kubectl importance=non-critical
| dedup pod_name
| eval Observed=1
| append
[| inputlookup pod_list.csv
| eval Observed=0
| eval importance=if(isnull(importance), "critical", importance)
| search importance=non-critical]
| lookup pod_list pod_name_lookup as pod_name OUTPUT pod_name_lookup
| eval importance=if(isnull(importance), "critical", importance)
| stats max(Observed) as Observed by pod_name_lookup, importance
| where Observed=0 and importance="non-critical"
The data in the pod_list.csv looks like so:
| namespace | pod_name_lookup | importance |
|---|---|---|
| ns1 | kafka-* | critical |
| ns1 | apache-* | critical |
| ns2 | grafana-backup-* | non-critical |
This works as expected.
I am now having difficulties creating a timechart with this data so that I can see when a pod wasn't deployed, not just what is currently missing.
Any help is greatly appreciated.
Removing the dedup from your original suggestion seems to have cleared up the weird issue I was seeing.
I should have noticed that dedup counters your goal. (I copied it from your original illustration without considering the implications for time intervals.) You are correct; this is one more reason you don't want to throw dedup around.
Is there an easy way, instead of having an individual line for each "missing" pod, to either have a single line with the total count of "non-critical" pods, and possibly also have two lines, one for "critical" and one for "non-critical"?
First, let's clarify that your goal is to count the number of missing pod groups by importance. Something like this should do:
index=abc sourcetype=kubectl
| lookup pod_list pod_name_lookup as pod_name OUTPUT pod_name_lookup
| where sourcetype == "kubectl"
| bin span=1h@h _time
| stats values(pod_name_lookup) as pod_name_lookup values(pod_name_all) as pod_name_all by importance _time
| append
[ inputlookup pod_list
| rename pod_name_lookup as pod_name_all]
| eventstats values(pod_name_all) as pod_name_all importance
| eval missing = if(isnull(pod_name_all), pod_name_all, mvappend(missing, mvmap(pod_name_all, if(pod_name_all IN (pod_name_lookup), null(), pod_name_all))))
| where isnotnull(missing)
| timechart span=1m@m dc(missing) by importance
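The core of the eventstats/mvmap step above is a per-interval set difference between the expected pod groups and the observed ones. Here is a minimal Python sketch of that logic, with made-up sample data (the interval numbers and pod group names are illustrative only):

```python
from collections import defaultdict

# Hypothetical observations: (interval, pod_group, importance) tuples seen in
# the index, plus the full expected list as it would come from pod_list.csv.
observed = [
    (0, "apache-*", "critical"),
    (0, "grafana-backup-*", "non-critical"),
    (1, "kafka-*", "critical"),
]
expected = {
    "critical": {"kafka-*", "apache-*"},
    "non-critical": {"grafana-backup-*", "someapp-*"},
}

# Group observed pod groups per (interval, importance), mirroring
# `stats values(pod_name_lookup) ... by importance _time`.
seen = defaultdict(set)
for interval, group, importance in observed:
    seen[(interval, importance)].add(group)

# The mvmap step amounts to: expected groups not seen in the interval.
def missing_groups(interval, importance):
    return sorted(expected[importance] - seen[(interval, importance)])

print(missing_groups(0, "critical"))      # → ['kafka-*']
print(missing_groups(1, "non-critical"))  # → ['grafana-backup-*', 'someapp-*']
```

Note that an interval with no observed pods of a given importance (interval 1, non-critical above) reports every expected group as missing, which is exactly the behavior discussed later in this thread.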
Here is an emulation.
| makeresults format=csv data="_time, pod_name, importance
10,apache-12, critical
22,apache-2, critical
34,kakfa-8, critical
80,superapp-13, critical
88,someapp-6
160,grafana-backup-11
166,apache-4, critical
168,kafka-6, critical
566,apache-4, critical
568,kafka-6, critical
174,someapp-2
250,grafana-backup-6
374,anotherapp-10"
| fillnull importance value=non-critical
| eval _time = now() - _time
| eval sourcetype = "kubectl"
| eval pod_name_lookup = replace(pod_name, "\d+", "*")
``` the above emulates
index=abc sourcetype=kubectl
| lookup pod_list pod_name_lookup as pod_name OUTPUT pod_name_lookup
| dedup pod_name
```
| where sourcetype == "kubectl"
| bin span=1m@m _time
| stats values(pod_name_lookup) as pod_name_lookup values(pod_name_all) as pod_name_all by importance _time
| append
[makeresults format=csv data="namespace, pod_name_lookup, importance
ns1, kafka-*, critical
ns1, apache-*, critical
ns2, grafana-backup-*, non-critical
ns2, someapp-*, non-critical"
``` subsearch thus far emulates
| inputlookup pod_list
```
| rename pod_name_lookup as pod_name_all]
| eventstats values(pod_name_all) as pod_name_all by importance
| eval missing = if(isnull(pod_name_all), pod_name_all, mvappend(missing, mvmap(pod_name_all, if(pod_name_all IN (pod_name_lookup), null(), pod_name_all))))
| where isnotnull(missing)
| timechart span=1m@m dc(missing) by importance
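The emulation's `replace(pod_name, "\d+", "*")` is what maps each pod instance name onto its wildcard group. A quick Python equivalent shows the shape it produces (the helper name is mine, not part of the query):

```python
import re

# Equivalent of the SPL `replace(pod_name, "\d+", "*")`: collapse every run of
# digits into "*", yielding the same form as pod_name_lookup in the lookup.
def to_group(pod_name: str) -> str:
    return re.sub(r"\d+", "*", pod_name)

print(to_group("apache-12"))         # → apache-*
print(to_group("grafana-backup-6"))  # → grafana-backup-*
```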
Thank you for this. I feel it is close, but I am getting some inconsistent/incomplete findings in the search. I have a pod that I know is "missing": it is in the pod lookup table but is not deployed.
The line chart shows that it is missing in the current hour, but not in the previous hours when I know it was missing.
This is a data analytics forum, so you cannot just say "I know it is missing" without data to substantiate it. My mock data actually includes conditions where a group of pods is missing in both the current interval and previous intervals. They are shown as missing in all such intervals in the chart screenshot. If you need concrete help, always post sample data that demonstrates all the necessary features. (Anonymize as needed.)
Speaking of pod groups, you still haven't confirmed whether it is the pod groups you are trying to mark. As I said, there is no logic that can detect a missing individual instance of a pod using a lookup table with wildcards.
Some context: I have a scripted input that sends the output of kubectl get po -A --show-labels into the abc index and kubectl sourcetype. Each pod has the importance label with a value of either critical or non-critical. Some pods haven't had that label applied yet, which is why I had the fillnull importance lines in my original query. Agreed that they can be ignored for this task. An event in the index would look something like this:
_time | namespace | pod_name | importance | etc.... |
The other fields are irrelevant to this search. This data gets ingested into Splunk every minute.
Each pod's name is unique (one container per pod), so I believe the lookup table's wildcard pod groups will map to individual pods. For example, there would only be one apache pod, not both apache-2 and apache-12. I don't believe there are any concerns with the grouping.
When I say that I know the data is missing, I mean that there is a pod (kafka) that hasn't been deployed for over a week. It is not being reported by the scripted input I mentioned earlier. That pod is listed in the pod lookup as kafka-*. Unfortunately, I am unable to provide screenshots.
The trouble I am having with the query you generously shared is that when it runs and is visualized as a line chart with the 24h time range picked, the kafka pod shows a count of 1 (missing) 24 hours ago, then drops to 0 for the rest of the time up until the current hour, where it returns to 1. Nothing changed with the actual deployment status of that pod during that time.
Do you mean to say that some periods have no data about pods? (Or rather, no data about pods with the importance value "non-critical".) My initial suggestion was based on the assumption that during any given interval, there are some pods. Now that I think about it, it is possible that that assumption still holds, but some intervals may contain only critical pods, with all non-critical ones missing.
Try this
index=abc sourcetype=kubectl importance=non-critical
| lookup pod_list pod_name_lookup as pod_name OUTPUT pod_name_lookup
| dedup pod_name
| where sourcetype == "kubectl"
| timechart span=1m@m values(pod_name_lookup) as pod_name_lookup values(pod_name_all) as pod_name_all
| append
[makeresults format=csv data="namespace, pod_name_lookup, importance
ns1, kafka-*, critical
ns1, apache-*, critical
ns2, grafana-backup-*, non-critical
ns2, someapp-*, non-critical"
| where importance = "non-critical"
``` subsearch thus far emulates
| inputlookup pod_list where importance = non-critical
```
| rename pod_name_lookup as pod_name_all]
| eventstats values(pod_name_all) as pod_name_all
| eval missing = if(isnull(pod_name_all), pod_name_all, mvappend(missing, mvmap(pod_name_all, if(pod_name_all IN (pod_name_lookup), null(), pod_name_all))))
| where isnotnull(missing)
| timechart span=1m@m count by missing
Exactly the same idea; it just fills intervals that have no non-critical pod groups. Those intervals will show all pod groups marked as missing.
In my environment, unless something catastrophic happens, there will always be data from critical and non-critical pods being ingested into this index/sourcetype. I don't have an environment to test the addition you suggested to account for a period of time with no "non-critical" pods reporting, but as soon as I do, I will test your updated query.
Removing the dedup from your original suggestion seems to have cleared up the weird issue I was seeing, where I was getting anomalous results in the timechart. This seems to be working now, which is great. Huge thanks!!
index=abc sourcetype=kubectl importance=non-critical
| lookup pod_list pod_name_lookup as pod_name OUTPUT pod_name_lookup
| append
[inputlookup pod_list where importance = non-critical
| rename pod_name_lookup as pod_name_all]
| eventstats values(pod_name_all) as pod_name_all
| where sourcetype == "kubectl"
| timechart span=1h@h values(pod_name_lookup) as pod_name_lookup values(pod_name_all) as pod_name_all
| eval missing = mvappend(missing, mvmap(pod_name_all, if(pod_name_all IN (pod_name_lookup), null(), pod_name_all)))
| where isnotnull(missing)
| timechart span=1h@h count by missing
Is there an easy way, instead of having an individual line for each "missing" pod, to either have a single line with the total count of "non-critical" pods, and possibly also have two lines, one for "critical" and one for "non-critical"?
I guess I'm also looking for a timechart summary of the total count of missing non-critical and critical pods. Hope that makes sense.
This is perfect. Thank you! Only had to add the missing "by" in
| eventstats values(pod_name_all) as pod_name_all importance
index=abc sourcetype=kubectl
| lookup pod_list pod_name_lookup as pod_name OUTPUT pod_name_lookup
| where sourcetype == "kubectl"
| bin span=1h@h _time
| stats values(pod_name_lookup) as pod_name_lookup values(pod_name_all) as pod_name_all by importance _time
| append
[ inputlookup pod_list
| rename pod_name_lookup as pod_name_all]
| eventstats values(pod_name_all) as pod_name_all by importance
| eval missing = if(isnull(pod_name_all), pod_name_all, mvappend(missing, mvmap(pod_name_all, if(pod_name_all IN (pod_name_lookup), null(), pod_name_all))))
| where isnotnull(missing)
| timechart span=1m@m dc(missing) by importance
Was able to get this to give me a single line for the total count of missing non-critical pods over time.
index=abc sourcetype=kubectl importance=non-critical
| lookup pod_list pod_name_lookup as pod_name OUTPUT pod_name_lookup
| append
[inputlookup pod_list where importance = non-critical
| rename pod_name_lookup as pod_name_all]
| eventstats values(pod_name_all) as pod_name_all
| where sourcetype == "kubectl"
| timechart span=1h@h values(pod_name_lookup) as pod_name_lookup values(pod_name_all) as pod_name_all
| eval missing = mvappend(missing, mvmap(pod_name_all, if(pod_name_all IN (pod_name_lookup), null(), pod_name_all)))
| timechart span=1h@h count(missing) as "non-critical-pods-missing"
Working towards the goal of being able to get another line for critical.
Let me first point out that you can only determine whether a group of pods, as denoted in pod_name_lookup, is completely absent (missing), not any individual pod_name. As such, your "timechart" can only have values 1 and 0 for each missing pod_name_lookup. Second, I want to note that the calculations to fill null importance values are irrelevant to the problem at hand, so I will ignore them.
The way to think through a solution is as follows: You want to populate a field that contains all non-critical pod_name_lookup values in every event so you can compare with running ones in each time interval. (Hint: eventstats.) In other words, if you have these pods
| _time | pod_name | sourcetype |
|---|---|---|
| 2024-05-08 01:42:10 | apache-12 | kubectl |
| 2024-05-08 01:41:58 | apache-2 | kubectl |
| 2024-05-08 01:41:46 | kakfa-8 | kubectl |
| 2024-05-08 01:41:00 | apache-13 | kubectl |
| 2024-05-08 01:40:52 | someapp-6 | kubectl |
| 2024-05-08 01:39:40 | grafana-backup-11 | kubectl |
| 2024-05-08 01:39:34 | apache-4 | kubectl |
| 2024-05-08 01:39:32 | kafka-6 | kubectl |
| 2024-05-08 01:39:26 | someapp-2 | kubectl |
| 2024-05-08 01:38:16 | apache-12 | kubectl |
| 2024-05-08 01:38:10 | grafana-backup-6 | kubectl |
and pod_list lookup contains the following
| importance | namespace | pod_name_lookup |
|---|---|---|
| critical | ns1 | kafka-* |
| critical | ns1 | apache-* |
| non-critical | ns2 | grafana-backup-* |
| non-critical | ns2 | someapp-* |
(As you can see, I added "someapp-*" because in your illustration only one app is "non-critical"; this makes the data nontrivial.) You will want to produce an intermediate table like this (please ignore the time interval differences and just focus on the material fields):
| _time | pod_name_lookup | pod_name_all |
|---|---|---|
| 2024-05-08 01:35:00 | | |
| 2024-05-08 01:36:00 | apache-* grafana-backup-* | grafana-backup-* someapp-* |
| 2024-05-08 01:37:00 | kafka-* someapp-* | grafana-backup-* someapp-* |
| 2024-05-08 01:38:00 | apache-* grafana-backup-* | grafana-backup-* someapp-* |
| 2024-05-08 01:39:00 | apache-* someapp-* | grafana-backup-* someapp-* |
| 2024-05-08 01:40:00 | apache-* kakfa-* | grafana-backup-* someapp-* |
(This illustration assumes that you are looking for missing pods in each calendar minute; I know this is ridiculous, but it is easier to emulate.) From this table, you can calculate which value(s) in pod_name_all is/are missing from pod_name_lookup. (Hint: mvmap can be an easy method.)
In SPL, this thought process can be implemented as
index=abc sourcetype=kubectl importance=non-critical
| lookup pod_list pod_name_lookup as pod_name OUTPUT pod_name_lookup
| dedup pod_name
| append
[inputlookup pod_list where importance = non-critical
| rename pod_name_lookup as pod_name_all]
| eventstats values(pod_name_all) as pod_name_all
| where sourcetype == "kubectl"
| timechart span=1h@h values(pod_name_lookup) as pod_name_lookup values(pod_name_all) as pod_name_all
| eval missing = mvappend(missing, mvmap(pod_name_all, if(pod_name_all IN (pod_name_lookup), null(), pod_name_all)))
| where isnotnull(missing)
| timechart span=1h@h count by missing
In the above, I changed time bucket to 1h@h (as opposed to 1m@m used in illustrations). You need to change that to whatever suits your needs.
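For reference, `bin span=1h@h` (and the `span=` option on timechart) floors each event timestamp to the start of its interval. A small Python sketch of that bucketing, ignoring timezone-snapping subtleties (the helper name is mine):

```python
# Floor an epoch timestamp to the start of its interval, the way
# `bin span=1h@h _time` snaps events to hour boundaries.
def bucket(ts: int, span: int = 3600) -> int:
    return ts - (ts % span)

print(bucket(7205))       # → 7200 (start of the hour containing 7205)
print(bucket(7205, 60))   # → 7200 (1m@m-style bucketing)
```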
Here is an emulation used to produce the above tables and this chart:
| makeresults format=csv data="_time, pod_name
10,apache-12
22,apache-2
34,kakfa-8
80,apache-13
88,someapp-6
160,grafana-backup-11
166,apache-4
168,kafka-6
174,someapp-2
244,apache-12
250,grafana-backup-6"
| eval _time = now() - _time
| eval sourcetype = "kubectl", importance = "non-critical"
| eval pod_name_lookup = replace(pod_name, "\d+", "*")
``` the above emulates
index=abc sourcetype=kubectl importance=non-critical
| lookup pod_list pod_name_lookup as pod_name OUTPUT pod_name_lookup
| dedup pod_name
```
| append
[makeresults format=csv data="namespace, pod_name_lookup, importance
ns1, kafka-*, critical
ns1, apache-*, critical
ns2, grafana-backup-*, non-critical
ns2, someapp-*, non-critical"
| where importance = "non-critical"
``` subsearch thus far emulates
| inputlookup pod_list where importance = non-critical
```
| rename pod_name_lookup as pod_name_all]
| eventstats values(pod_name_all) as pod_name_all
| where sourcetype == "kubectl"
| timechart span=1m@m values(pod_name_lookup) as pod_name_lookup values(pod_name_all) as pod_name_all
| eval missing = mvappend(missing, mvmap(pod_name_all, if(pod_name_all IN (pod_name_lookup), null(), pod_name_all)))
| where isnotnull(missing)
| timechart span=1m@m count by missing
The timechart command needs a _time field for its time bucketing. Your stats command does not include "by _time", which eliminates the _time field from the results. That means the field is no longer available to any command after the stats line.
You would at least have to add the _time field to the by clause of the stats command. That said, I think your "append" with the inputlookup would create results with no "_time" field at the end of the result set, so it will be interesting to see what stats does with those rows.
My recommendation is to either do the lookup without the append, or eval a _time field with a value that makes sense for the appended inputlookup rows.
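The point about stats dropping _time can be illustrated outside Splunk: an aggregation only keeps the fields in its grouping key. A Python sketch with made-up events (timestamps and group names are illustrative):

```python
from collections import defaultdict

# Hypothetical events: (epoch_seconds, pod_group). Aggregating without the time
# bucket in the key collapses everything into one row per group, losing _time --
# the analogue of `stats ... by importance` without `by _time`.
events = [(7205, "kafka-*"), (7210, "apache-*"), (10830, "kafka-*")]

by_group_only = defaultdict(set)
by_time_and_group = defaultdict(set)
for ts, group in events:
    hour = ts - ts % 3600                 # like `bin span=1h@h _time`
    by_group_only[group].add(ts)          # time survives only as a value, not a key
    by_time_and_group[(hour, group)].add(ts)

print(len(by_group_only))        # → 2 rows: no per-interval breakdown left
print(len(by_time_and_group))    # → 3 rows: one per (hour, pod group)
```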