Splunk Search

What is wrong with my sub-pipeline in appendpipe?

yshen
Communicator
I have a lookup table bsm_string_new_overheat_records.csv:

 

 

_time                overheat_location  start_CCU_AMBI_TEMP  start_time_secs  end_CCU_AMBI_TEMP  end_time_secs  overheat_duration_minutes
2023-06-10 11:41:44  A10                29                   1686421604       23                 1686422504     15
2023-06-10 11:42:44  A10                29                   1686413444       23                 1686422564     152
2023-06-10 11:43:44  A10                29                   1686412844       23                 1686422624     163
2023-06-10 11:44:44  A10                29                   1686413984       23                 1686422684     145
2023-06-10 11:45:44  A10                29                   1686420584       23                 1686422744     36

 

I also have an index search `index="battery_data" sourcetype="battery_field_data"` that contains the battery data. From this, I can compute the average temperature for a location ("A10") between start_time_secs (1686421604) and end_time_secs (1686422504). Here is an example:

index="battery_data" sourcetype="battery_field_data"
    | rex field=Tag "^(?P<CCU_location>\w+)_BQMS\.\1\.((BMS_\1_(?P<tag_suffix>.*))|(MU(?P<MU_number>\d+)\.BMS_\1_MU\6_UNIT(?P<UNIT_number>\d+)_(?P<tag_type>.*)))"
    | where CCU_location=="A10"
    | eval start_time_secs=1686421604, end_time_secs=1686422504
    | where (tag_suffix == "CCU_AMBI_TEMP") AND (start_time_secs <= _time) AND (_time <= end_time_secs)
    | eval Value=round(Value*0.1, 2)
    | stats avg(Value) AS average_temperature_in_overheat latest(_time) AS _time latest(*) AS * BY CCU_location start_time_secs end_time_secs
    | table _time CCU_location start_time_secs end_time_secs average_temperature_in_overheat

yielding:

 

_time                    CCU_location  start_time_secs  end_time_secs  average_temperature_in_overheat
2023-06-10 11:39:54.971  A10           1686421604       1686422504     17.45

 

I would like to compute the average temperature for each record in bsm_string_new_overheat_records.csv with the following query:

 

| inputlookup bsm_string_new_overheat_records.csv
| rename CCU_location AS overheat_location
| appendpipe [ search
index="battery_data" sourcetype="battery_field_data"
| rex field=Tag "^(?P<CCU_location>\w+)_BQMS\.\1\.((BMS_\1_(?P<tag_suffix>.*))|(MU(?P<MU_number>\d+)\.BMS_\1_MU\6_UNIT(?P<UNIT_number>\d+)_(?P<tag_type>.*)))"
| where CCU_location==overheat_location
| where (tag_suffix == "CCU_AMBI_TEMP") AND (start_time_secs <= _time) AND (_time <= end_time_secs)
| eval Value=round(Value*0.1, 2)
| stats avg(Value) AS average_temperature_in_overheat latest(_time) AS _time latest(*) AS * BY CCU_location start_time_secs end_time_secs
]
| table _time overheat_location CCU_location start_time_secs end_time_secs average_temperature_in_overheat start_CCU_AMBI_TEMP end_CCU_AMBI_TEMP overheat_duration_minutes

 

I’d expect one additional record of average temperature for each overheat record (total 5+5=10 records).

However, here is what I got:

 

_time                overheat_location  CCU_location  start_time_secs  end_time_secs  average_temperature_in_overheat  start_CCU_AMBI_TEMP  end_CCU_AMBI_TEMP  overheat_duration_minutes
2023-06-10 11:41:44  A10                              1686421604       1686422504                                      29                   23                 15
2023-06-10 11:42:44  A10                              1686413444       1686422564                                      29                   23                 152
2023-06-10 11:43:44  A10                              1686412844       1686422624                                      29                   23                 163
2023-06-10 11:44:44  A10                              1686413984       1686422684                                      29                   23                 145
2023-06-10 11:45:44  A10                              1686420584       1686422744                                      29                   23                 36

 

None of the average temperature results is there.

What’s wrong with my above query?

yuanliu
SplunkTrust

I agree with @PickleRick that there ought to be more efficient ways to do this, especially because you already have a lookup table, which is very efficient.  One important point is to present the requirement in relation to the data and explain the logic that links your data with the desired results, as opposed to asking volunteers to speculate about what is wrong with a complex SPL snippet. (There are other inconsistencies among the posted SPL snippets that I can make reasonable speculations about, but not about the requirements and logic.)

From your original post, these are my hypotheses:

  1. The lookup table bsm_string_new_overheat_records.csv is exactly as presented; in particular, there is a field named overheat_location rather than CCU_location.
  2. The index search, i.e., index="battery_data" sourcetype="battery_field_data", contains a field named Value that represents temperature in 1/10 of the scale used in that lookup table when tag_suffix == "CCU_AMBI_TEMP".
  3. The logic to determine average temperature is to match values of overheat_location in the lookup with the location from the index, then calculate the average of Value between start_time_secs and end_time_secs.

Are the above correct?  So, the first search shows how to calculate average temperature IF location, start_time_secs, and end_time_secs are already known; consider this a recipe.  Using the lookup command anchored on overheat_location, Splunk can easily determine all these parameters for each _time value entered in the lookup table.  All you need to do is apply the recipe after the lookup.  There are some calculations to perform, but it is all doable.  This is not about syntax, but about expressing the logic that connects data with desired output.

index="battery_data" sourcetype="battery_field_data"
| rex field=Tag "^(?P<CCU_location>\w+)_BQMS\.\1\.((BMS_\1_(?P<tag_suffix>.*))|(MU(?P<MU_number>\d+)\.BMS_\1_MU\6_UNIT(?P<UNIT_number>\d+)_(?P<tag_type>.*)))"
| lookup bsm_string_new_overheat_records.csv overheat_location as CCU_location
``` after this, matching start_time_secs and end_time_secs are present but each may have multiple values ```
| eval overheat_interval = mvzip(mvzip(start_time_secs, end_time_secs, ":"), overheat_location, ":")
| mvexpand overheat_interval
| eval overheat_interval = split(overheat_interval, ":")
| eval start_time_secs = mvindex(overheat_interval, 0)
| eval end_time_secs = mvindex(overheat_interval, 1)
| eval overheat_location = mvindex(overheat_interval, 2)
``` at this point, each event has only one set of overheat lookup ```
| where (tag_suffix == "CCU_AMBI_TEMP") AND (start_time_secs <= _time) AND (_time <= end_time_secs) AND (CCU_location == overheat_location)
| eval Value=round(Value*0.1, 2)
| stats avg(Value) AS average_temperature_in_overheat latest(_time) AS _time latest(*) AS * BY CCU_location start_time_secs end_time_secs
| table _time CCU_location start_time_secs end_time_secs average_temperature_in_overheat

 

yshen
Communicator

@yuanliu Thank you for very detailed review and explanation!

Your understanding of my requirements is almost perfect, except for some minor naming differences.

I had an almost identical solution to yours. I was worried that the first part of the query runs for all locations and for all times (theoretically speaking, an overheat could last from the beginning of the ingested data to the moment of computing its average, but most likely an overheat should be rare, of short duration, and not too long before the time of calculation). It seems the blind query for all temperatures would be rather wasteful; in particular, performance might get worse as the collected temperature observations grow over time.

Besides, I find dealing with the lists of start_time_secs and end_time_secs after the lookup operation rather complicated, albeit doable.

Here is my solution. To reduce the impact of excessive delays in querying all temperature data, I'm limiting the query to only the last 24 hours. Thus the implementation is correct only for overheats that happened no earlier than 24 hours ago. (I was just wondering if there were a better solution.)

 

`bsm_battery_data_parsed_by_time_window(-24h, now)`
| lookup bsm_string_new_overheat_records.csv CCU_location OUTPUT start_time_secs end_time_secs overheat_duration_minutes
| where isnotnull(start_time_secs)
| eval interval=mvzip(start_time_secs, end_time_secs)
| mvexpand interval
| rex field=interval "(?<start_secs>\S+),(?<end_secs>\S+)"
| where (start_secs <= _time) AND (_time <= end_secs)
| stats avg(CCU_AMBI_TEMP) as average_temperature_in_interval latest(*) AS * BY CCU_location start_secs end_secs
| eval average_temperature_in_interval=round(average_temperature_in_interval, 2)
| convert ctime(start_secs) AS start_time ctime(end_secs) AS end_time
| outputlookup bsm_battery_string_overheat_records.csv append=true create_empty=true

 

The definition of the macro is 

 

index="battery_data" sourcetype="battery_field_data" earliest=$earliest$ latest=$latest$
| rex field=Tag "^(?P<CCU_location>\w+)_BQMS\.\1\.((BMS_\1_(?P<tag_suffix>.*))|(MU(?P<MU_number>\d+)\.BMS_\1_MU\6_UNIT(?P<UNIT_number>\d+)_(?P<tag_type>.*)))"
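
For reference, a two-argument macro like this lives in macros.conf. A hypothetical stanza (the name and arguments are assumptions matching the usage above) might look like:

[bsm_battery_data_parsed_by_time_window(2)]
args = earliest, latest
definition = index="battery_data" sourcetype="battery_field_data" earliest=$earliest$ latest=$latest$ \
| rex field=Tag "^(?P<CCU_location>\w+)_BQMS\.\1\.((BMS_\1_(?P<tag_suffix>.*))|(MU(?P<MU_number>\d+)\.BMS_\1_MU\6_UNIT(?P<UNIT_number>\d+)_(?P<tag_type>.*)))"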

 

If this is the best Splunk can do, then my quest for more perfection can be closed.


yuanliu
SplunkTrust

Funny you should mention your previous solution now, because I was reading your Slack thread posted before the start of this question and realized that you were using exactly mvzip and mvexpand (except that overheat_location was not zipped in).

Now that your concern is execution time, I took a second look at @VatsalJagani's suggestion about map.  In addition to the strange-looking "%m/%d/%Y:%H:%M:%S" format, the search command's earliest and latest also accept numeric epoch, which happens to be the unit used by start_time_secs and end_time_secs. (You can always calculate epoch even if your field is not already in that unit.)  So, instead of using one arbitrary 24-hour search interval, you map into multiple searches with the precise intervals prescribed in the lookup.  Although multiple small searches may not be better than one large search, at least that's an option.
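
As an illustration of calculating epoch first: if the lookup stored human-readable timestamps, a strptime eval could derive the numeric fields. This is a hypothetical sketch; the format string is an assumption about the timestamp layout:

| eval start_time_secs=strptime(start_time, "%Y-%m-%d %H:%M:%S")
| eval end_time_secs=strptime(end_time, "%Y-%m-%d %H:%M:%S")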

| inputlookup bsm_string_new_overheat_records.csv
| rename CCU_location AS overheat_location
| map maxcount=0 search="search index=\"battery_data\" sourcetype=\"battery_field_data\" earliest=$start_time_secs$ latest=$end_time_secs$
| rex field=Tag \"^(?P<CCU_location>\w+)_BQMS\.\1\.((BMS_\1_(?P<tag_suffix>.*))|(MU(?P<MU_number>\d+)\.BMS_\1_MU\6_UNIT(?P<UNIT_number>\d+)_(?P<tag_type>.*)))\"
| where CCU_location==$overheat_location|s$ AND (tag_suffix == \"CCU_AMBI_TEMP\")
| eval Value=round(Value*0.1, 2)
| stats avg(Value) AS average_temperature_in_overheat latest(_time) AS _time latest(*) AS * BY CCU_location, start_time_secs, end_time_secs
| eval start_CCU_AMBI_TEMP=$start_CCU_AMBI_TEMP$, end_CCU_AMBI_TEMP=$end_CCU_AMBI_TEMP$"
| table _time overheat_location CCU_location start_time_secs end_time_secs average_temperature_in_overheat start_CCU_AMBI_TEMP end_CCU_AMBI_TEMP overheat_duration_minutes

Also, if Tag is an ordinary field that is part of the raw event, it is possible to make this somewhat more efficient by working overheat_location and tag_suffix into the index search.

| inputlookup bsm_string_new_overheat_records.csv
| rename CCU_location AS overheat_location
| map maxcount=0 search="search index=\"battery_data\" sourcetype=\"battery_field_data\" earliest=$start_time_secs$ latest=$end_time_secs$ TERM($overheat_location|s$_BQMS.*_CCU_AMBI_TEMP*)
| rex field=Tag \"^(?P<CCU_location>\w+)_BQMS\.\1\.((BMS_\1_(?P<tag_suffix>.*))|(MU(?P<MU_number>\d+)\.BMS_\1_MU\6_UNIT(?P<UNIT_number>\d+)_(?P<tag_type>.*)))\"
| where CCU_location==$overheat_location|s$ AND (tag_suffix == \"CCU_AMBI_TEMP\")
| eval Value=round(Value*0.1, 2)
| stats avg(Value) AS average_temperature_in_overheat latest(_time) AS _time latest(*) AS * BY CCU_location, start_time_secs, end_time_secs
| eval start_CCU_AMBI_TEMP=$start_CCU_AMBI_TEMP$, end_CCU_AMBI_TEMP=$end_CCU_AMBI_TEMP$"
| table _time overheat_location CCU_location start_time_secs end_time_secs average_temperature_in_overheat start_CCU_AMBI_TEMP end_CCU_AMBI_TEMP overheat_duration_minutes

In fact, if Tag is an ordinary field, there is even a chance to further improve performance by using tstats in the subsearches.
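
As a hedged sketch of that idea (untested, and only meaningful if Tag is an index-time field): tstats could at least count matching events per Tag without fetching raw events, e.g.

| tstats count WHERE index=battery_data sourcetype=battery_field_data Tag=A10_BQMS* earliest=1686421604 latest=1686422504 BY Tag

Moving avg(Value) into tstats would additionally require Value to be an indexed field.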

Meanwhile, looking for the perfect solution is not always a productive pursuit.  You need to decide what is good enough for the problem at hand.

yshen
Communicator

@yuanliu Thanks for concurring with @VatsalJagani's approach, and especially for optimizing the multiple parallel sub-searches with earliest and latest constraints!

I think your suggestion may be the best solution I have learned so far.

I'm using this more convoluted (challenging) problem to help me learn the nuances of Splunk query execution. Unfortunately, I find it extremely hard to find in-depth discussion of Splunk queries' execution behavior. Splunk's own documentation is too sketchy on these nuances; it only makes toy problems easy.

What is your recommendation for learning more about such nuanced query behavior and performance?

Besides, I'd like to leave behind a query expression that is more intuitive and more readable for whoever maintains it when I'm no longer around. This map expression is a lot more intuitive than the convoluted list gymnastics needed to deal with the results of a lookup by location with multiple overheat time windows for the same location.

We want code that is more than doable: performant and readable!


PickleRick
SplunkTrust

Well, it's all complicated and performance tuning searches might be a huge headache.

As a rule of thumb it's usually better to avoid subsearches and it's best to leave as much work to the indexers as you can (so you can benefit from parallelization). So that's why I was pointing out that there should be a better way to do what you're trying to do than use map.

But, having said that:

1) Sometimes a map-based solution can be good enough (I have such a solution at home - I wrote a map-based search way back when I was learning SPL, and since the subsearch is only run twice, it's not worth the time needed to rework it).

2) Sometimes map can be the only reasonable solution (if the outer search results narrow down the subsearch parameters so much that a stats- or anything-else-based approach would mean too much searching of raw events and digging through too much data).

So just get to know your job inspector and use it 🙂

One could argue about the readability argument, though, since what is "readable" for non-Splunk-experienced people is often simply something written with a completely different paradigm in mind, and simply not the way it's done in SPL. Like the typical overuse of the "join" command by people coming to Splunk from relational database environments.


yuanliu
SplunkTrust

Obviously the question of performance is nuanced, and I do not have a ready recipe for learning such behaviors.  From what I have heard, some factors even depend on your deployment architecture and how data is ingested.  But I admire your desire to use challenging problems to explore and observe.

Narrowing down to this example, I simplify Splunk cost into two buckets: raw event search (indexer search) and compute.  Raw event search is harder for the user to optimize.  The principle is really just to fetch as narrowly as your subsequent processing will require. (Optimizing Searches)  The reason why @VatsalJagani and @PickleRick both say map is usually not the best option is that you usually do not want to hit the indexer again and again.  If you have all data fetched in one indexer search, all that is left is compute.  When a lookup is available, I take lookup (over stats with append, which is my usual route) because binary search is fast.

Your problem is interesting because no one here (except maybe yourself) was aware that the one raw event search was itself too costly. (If you didn't know at the time, the job inspector could have indicated that indexer search was dominating cost.)  And you are correct, the string-list gymnastics add to maintenance cost.

Back to avoiding repeated raw event searches.  This is mostly about avoiding repeatedly fetching the same data.  Whereas it is still more efficient to fetch different datasets in one search than in multiple subsearches (one per dataset), in your case the narrow intervals of overheat probably only make up a fraction of the day.  When each narrow interval can be programmed into the raw event search, you end up with much smaller total data fetches.  Even with the added cost of multiple visits to the indexer, this MAY end up with better overall performance than a single one-day search.  The cost comparison now really depends on the intervals in the file.

In all, that search manual section contains lots of useful information. (To think, what really helped in this example was all in a short subsection titled Narrow the time window.)  But applying it in practice requires observation of real data.  These days I deal with relatively small amounts of data. (In other words, my days of being screamed at by admins are over, not because I became better at performance.)  So, I pay more attention to compute.  Some others on this board have better intuition and discipline about the relationship between indexer and search head.  Browsing some previous answers often gives me inspiration.


VatsalJagani
SplunkTrust

@yshen - appendpipe will not generate results for each record: it runs its subpipeline only once, against the current result set as a whole, so per-record field values such as overheat_location are not substituted into it. I think the command you are looking for here is "map".
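
For a minimal, self-contained illustration of that behavior (runnable as-is; makeresults merely fabricates three rows):

| makeresults count=3
| streamstats count AS n
| appendpipe
    [ stats sum(n) AS n ]
``` returns four rows: n=1, 2, 3, plus ONE appended summary row with n=6 - not one extra row per record ```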

But just to be sure, the map command will run one additional search for every record in your lookup, so if your lookup has many records it could be time-consuming as well as resource hungry.

| inputlookup bsm_string_new_overheat_records.csv
| rename CCU_location AS overheat_location
| map maxcount=0 search="search index=\"battery_data\" sourcetype=\"battery_field_data\"
| rex field=Tag \"^(?P<CCU_location>\w+)_BQMS\.\1\.((BMS_\1_(?P<tag_suffix>.*))|(MU(?P<MU_number>\d+)\.BMS_\1_MU\6_UNIT(?P<UNIT_number>\d+)_(?P<tag_type>.*)))\"
| where CCU_location==$overheat_location|s$
| eval start_time_secs=$start_time_secs$, end_time_secs=$end_time_secs$
| where (tag_suffix == \"CCU_AMBI_TEMP\") AND (start_time_secs <= _time) AND (_time <= end_time_secs)
| eval Value=round(Value*0.1, 2)
| stats avg(Value) AS average_temperature_in_overheat latest(_time) AS _time latest(*) AS * BY CCU_location, start_time_secs, end_time_secs
| eval start_CCU_AMBI_TEMP=$start_CCU_AMBI_TEMP$, end_CCU_AMBI_TEMP=$end_CCU_AMBI_TEMP$"
| table _time overheat_location CCU_location start_time_secs end_time_secs average_temperature_in_overheat start_CCU_AMBI_TEMP end_CCU_AMBI_TEMP overheat_duration_minutes

 

I know there is definitely scope for improvement here, as map is usually not the best option because it creates a separate search for each result in the lookup.

But this is a starting point for what you want to achieve.

 

I hope this helps!! If it does kindly upvote!!!

yshen
Communicator

I think that this is the closest semantics to what I want. I'll give it a try.

Note: maxcount should be maxsearches, and maxsearches=0 does not mean unlimited searches.

 

maxsearches
Syntax: maxsearches=<int>
Description: The maximum number of searches to run. A message is generated if there are more search results than the maximum number that you specify.
Zero ( 0 ) does not equate to unlimited searches.
Default: 10

 

https://docs.splunk.com/Documentation/Splunk/9.0.5/SearchReference/Map
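
So, assuming the lookup can grow beyond the default of 10 rows, the limit must be raised explicitly. A trimmed-down sketch of the corrected invocation (the inner processing is abbreviated, and the ceiling of 1000 is an arbitrary value sized above the expected row count):

| inputlookup bsm_string_new_overheat_records.csv
| rename CCU_location AS overheat_location
| map maxsearches=1000 search="search index=\"battery_data\" sourcetype=\"battery_field_data\" earliest=$start_time_secs$ latest=$end_time_secs$ | stats avg(Value) AS average_temperature_in_overheat"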


PickleRick
SplunkTrust

OK. map is very, very rarely the desired way of getting the right results. Usually you can do the same thing with some clever statsing. I admit I didn't dig too deeply into your search, but usually you might do something like

<first_search>
| rename * as first_*
| rename first_my_chosen_field as my_chosen_field
| append
    [  <second_search>
       | rename * as second_*
       | rename second_my_chosen_field as my_chosen_field ]
| stats my_stats_functions(first_fields) values(second_*) as * by my_chosen_field

That's of course just a rough sketch of the method, adjust it to your needs. Just make those searches so that they have separate field sets with just one (or more) overlapping fields on which you want to summarize your results.
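
For instance, a rough, untested adaptation of that sketch to this thread's field names (the time-window matching is omitted for brevity, and the first_/second_ prefixes are purely illustrative):

| inputlookup bsm_string_new_overheat_records.csv
| rename start_CCU_AMBI_TEMP AS first_start_temp, end_CCU_AMBI_TEMP AS first_end_temp
| rename overheat_location AS CCU_location
| append
    [ search index="battery_data" sourcetype="battery_field_data"
      | rex field=Tag "^(?P<CCU_location>\w+)_BQMS"
      | eval second_Value=round(Value*0.1, 2)
      | fields CCU_location second_Value ]
| stats values(first_*) AS first_* avg(second_Value) AS average_temperature BY CCU_location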

Going for map is a wrong (very non-splunky) way of thinking.
    


PickleRick
SplunkTrust

If the subsearch does not produce expected results when used as a subsearch but works perfectly well as a standalone search, the most probable cause is that the subsearch yields too many results, resulting in truncation at the limit (50k rows by default), or hits the time limit for subsearches (30s if I remember correctly), resulting in premature finalization.

There is no warning or any other indication of the finalization/truncation of the subsearch, so you have to be careful when writing your searches.

In your case, since the inputlookup is definitely way, way faster than the subsearch, I'd try to swap the logic - do the inputlookup in the subsearch (and add the "reverse" command at the end if you want the lookup contents at the end).
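
A rough, untested sketch of that swap (here the lookup rows simply ride along as appended results):

index="battery_data" sourcetype="battery_field_data"
| rex field=Tag "^(?P<CCU_location>\w+)_BQMS"
| append
    [| inputlookup bsm_string_new_overheat_records.csv ]
| reverse
``` reverse flips the row order; drop it if the appended lookup rows are already where you want them ```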
