Q: Given a "timechart span=1m sep='-" last(foo) as foo last( bar) as bar by hostname", how would I get a unique value of the bar-* fields?
This has to be a standard problem, but I cannot find any writeup of solving it...
Background:
I'm processing Apache Impala logs for data specific to a query, server, and pool (i.e., cluster). The data arrives on multiple lines that are easily combined with a transaction and rex-ed out to get the values. Ignoring the per-query values, I end up with:
| fields _time hostname reserved max_mem
The next step is to summarize the reserved and max_mem by minute, taking the last value by hostname and summing the reserved values, extracting a single max_mem value.
I can get the data by host using:
| timechart span=1m sep="-" last( reserved ) as reserved last( max_mem ) as max_mem by hostname
which gives me a set of reserved-* and max_mem-* fields. The reserved values can be summed with:
| addtotals fieldname=reserved reserved-*
Issue:
The problem I'm having is getting the single unique value of max_mem back out of it. The syntax "| stats values( max_mem-* ) as max_mem" does not work, but gives the idea of what I'm trying to accomplish.
I've tried variations on bin to group the values with stats to post-process them, but gotten nowhere.
I get the funny feeling that there may be a way to "| addcols [ values( max_mem-* ) as max_mem " but that doesn't get me anywhere either.
A slightly different approach would be leaving the individual reserved values as-is, finding some way to get the single max_mem value out of the timechart, and plotting it as an area chart using max_mem as a layover (i.e., the addtotals can be skipped).
In either case, I'm still stuck getting the unique value from max_mem-* as a single field for propagation with the reserved values.
Aside: The input to this report is taken from the transaction list which includes memory estimates and SQL statements per query. I need that much for other purposes. The summary here of last reserved & max_mem per time unit is taken from the per-query events because the are the one place that the numbers are available.
Impala logs contain data at the level of query, server, and pool (global, shared) level.
The data is mixed together into the log entries for each query and the only way to extract it is combining rows from a single query into a transaction and extracting them from there.
Using timechart for count( query_id ), last( reserved ), last( max_mem ) provides a nice, accurate time-sample of the state of that one server. The query_id counts can and reserved values can be combined to produce useful values for the pool about the number of running queries and total reserved memory allocated in the pool. The max_mem value is per-pool and will have a single value (e.g, 123456) across all hosts in the pool. It may change over time but all of the pool members will show the same value all the time. At that point the last( max_mem ) will be have identical values for each value in the pool -- the value is common to all pool members and will invariably have the same value for log transactions acquired from the various hosts.
If there were some way to simply take values( max_mem-* ) from the timechart then I'd have a valid sample of the pool-level max_mem value.
There may be a way to do this with an addcols, there may be a keyword I don't know about similar to "addcols fieldname=x x-*" that will summarize multiple fields into a single value; if so then that command would work.
I cannot be the first person who is trying to summarize values from multiple levels of a hierarchy of data in a timeslice, I think?
You still haven't said if my suggested solution is valid in your case. Did you try it, did it give you what you want or not?
That will give you a single value for each of the categories of data you have in your timechart.
The other question is what do you want to do with this data, is this to be displayed in another dashboard panel or are you trying to show this somehow in the same timechart without splitting by host for these two fields or something else?
Here's a specific example: Say I have a row that looks like:
fields _time reserved max_mem-foo max_mem-bar max_mem-bim max_mem-bam
I know in advance that all of the max_mem-* values must be identical but have no way of knowing the suffixes in advance (e.g., I cannot just hardcode "max_mem-foo" as a workaround).
Q: What is the simplest way to get a single max_mem value in a field from the collection of max_mem-* fields?
I would ask, if all the values are the same for all hosts, then what are you producing a timechart for?
Did you try my suggestion?
Unfortunately security constraints prevent me from displaying the actual code or any of the error messages.
The max_mem value will be identical for all hosts, that's why I need to extract a single value for it. The sum of per-host values will be compared to the single pool value in a graph.
If I understand the "*-*" notation will process all of the fields.
I can get an appropriate total for the per-host value via addtotals on reserved-*" to reduce the dozen "reserved-<hostname>" to a single reserved total value.
My problem so far has been that the syntax "| stats values( max_mem-* ) as max_mem " fails with
Error in Stats command: The number of wildcards between field specifier "max_mem" and the rename specifier "max_mem" do not match.
Net result is that I cannot extract the known-single value of max_mem from the list of max_mem-<hostname> fields.
Thanks
Not sure I understand the "single" bit of last(reserved) by host, as I assume there will be lots of different values for different hosts, however, you can do this
...
| stats values(*-*) as *-*
| foreach *-* [ eval <<MATCHSEG1>>=mvsort(mvdedup(mvappend(<<MATCHSEG1>>, '<<FIELD>>'))) | fields - <<FIELD>> ]