
How to get dc and then sum of field?

Sathiya123
Explorer
<search>
| eval vm_unit=case(vmSize="Standard_F16s_v2",2,vmSize="Standard_F8s_v2",1,vmSize="Standard_F4s",0.5,vmSize="Standard_F2s_v2",0.25)
| timechart span=1h dc(vm_name) sum(vm_unit) as USED_VMS

Looking for the sum of vm_unit for distinct VMs by the hour, but it considers all the VMs instead of only the distinct ones.


ITWhisperer
SplunkTrust

Is this what you were after?

<search>
| eval vm_unit=case(vmSize="Standard_F16s_v2",2,vmSize="Standard_F8s_v2",1,vmSize="Standard_F4s",0.5,vmSize="Standard_F2s_v2",0.25)
| bin _time span=1h
| stats values(vm_unit) as vm_unit by _time vm_name
| stats dc(vm_name) as vm_count sum(vm_unit) as USED_VMS by _time

gcusello
SplunkTrust

Hi @Sathiya123,

If you want the sum of vm_unit for each VM, the solution from @woodcock is the correct one.

If instead (as it seems from your example) you want both the sum of vm_unit and the count of distinct VMs for each time unit, you could use stats instead of timechart, because timechart displays only one value for each time unit, something like this:

<search>
| eval vm_unit=case(vmSize="Standard_F16s_v2",2,vmSize="Standard_F8s_v2",1,vmSize="Standard_F4s",0.5,vmSize="Standard_F2s_v2",0.25)
| bin span=1h _time
| stats dc(vm_name) AS vm_name_count sum(vm_unit) AS USED_VMS BY _time

Ciao.

Giuseppe


Sathiya123
Explorer

Thanks, @gcusello for replying.

Your query provides the same output as mine.

The issue here is that events got duplicated in our Splunk index for some reason.

In a given hour, there should not be two events for the same vm_name.

To work around the duplicates I am using dc(vm_name), thinking that sum(vm_unit) would also skip the duplicate entries. But in my case sum(vm_unit) includes the duplicates.

For example, suppose every one of my VM entries is duplicated (each appears twice). Then I get:

_time count(vm_name) sum(vm_unit) ==> _time 120 200

My expectation is:

_time dc(vm_name) sum(vm_unit) ==> _time 60 100

Any help is highly appreciated.

 


PickleRick
SplunkTrust

If you do a single stats command with two different aggregation functions, they are performed separately. So dc() counts distinct values and sum() sums _all_ values. That's how it works.
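
A minimal run-anywhere sketch of this behaviour on fabricated events (the vm_name:vm_unit pairs below are made up and generated with makeresults, not read from an index):

| makeresults
| eval data="MONGO_VM:1 NODE_VM:2 MONGO_VM:1"
| makemv data
| mvexpand data
| rex field=data "(?<vm_name>[^:]+):(?<vm_unit>.+)"
| stats dc(vm_name) as distinct_vms sum(vm_unit) as total_units

Here dc(vm_name) returns 2, while sum(vm_unit) returns 4 (1+2+1), because the duplicated MONGO_VM event is summed again.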


Sathiya123
Explorer

@PickleRick Thanks for your reply.

I agree with your statement.

But how do I form the Splunk query to get my expected results?


PickleRick
SplunkTrust

To be fully honest, I'm not sure what you want to achieve and what your data looks like. I'll assume that your events have vm_name and vm_unit fields, that vm_unit is always the same for a given vm_name, and that you want to count the number of distinct vm_names and sum the corresponding vm_units. Is that right?

<your_base_search>
| bin _time span=1h
| stats values(vm_unit) as vm_unit by _time vm_name
| stats dc(vm_name) sum(vm_unit) by _time

You could probably also use dedup, but I don't like that command and don't use it much, so I don't remember the syntax 😄
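
To sanity-check this two-pass approach without real data, here is a sketch on the same kind of fabricated events as in the earlier sketch (makeresults builds them; the names and units are made up):

| makeresults
| eval data="MONGO_VM:1 NODE_VM:2 MONGO_VM:1"
| makemv data
| mvexpand data
| rex field=data "(?<vm_name>[^:]+):(?<vm_unit>.+)"
| bin _time span=1h
| stats values(vm_unit) as vm_unit by _time vm_name
| stats dc(vm_name) as vm_count sum(vm_unit) as USED_VMS by _time

The first stats collapses the duplicated MONGO_VM event into one row per vm_name per hour, so the second stats returns vm_count=2 and USED_VMS=3 rather than 4.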

ITWhisperer
SplunkTrust

This made me smile - the accepted solution looks remarkably similar to my solution posted 3 hours prior to this 😁

PickleRick
SplunkTrust

True. I must say I didn't read into your answer because I didn't understand the question at that moment XD


Sathiya123
Explorer

@ITWhisperer Apologies, somehow I missed your reply; otherwise I would have replied on that thread.

Thanks again. I've given a thumbs up to your reply.


Sathiya123
Explorer

Thanks, @PickleRick for your reply.

I think your query works. Let me confirm.

Basically, I want to sum(vm_unit) on an hourly basis, ignoring duplicate entries within each hour.

 


PickleRick
SplunkTrust

That's where your wording is a bit confusing. Because - as far as I understand - you can have several vm_names with the same vm_unit. Otherwise it doesn't quite make sense to sum unrelated values only when they're unique.

 


Sathiya123
Explorer

For example, the events look like this:

TIME             | VM_NAME  | VM_UNIT
2023-03-08 01:05 | MONGO_VM | 1
2023-03-08 01:10 | NODE_VM  | 2
2023-03-08 01:05 | MONGO_VM | 1

The Splunk query should return SUM(VM_UNIT):

TIME             | SUM(VM_UNIT)
2023-03-08 01:00 | 3

The 3rd event has to be ignored since it's a duplicate of the 1st event. Hope it's clear.
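
As a sketch that reproduces exactly this example with fabricated timestamps (makeresults plus eval/case; nothing here comes from a real index), using the two-pass stats approach suggested earlier in the thread:

| makeresults count=3
| streamstats count as n
| eval VM_NAME=case(n=1,"MONGO_VM", n=2,"NODE_VM", n=3,"MONGO_VM")
| eval VM_UNIT=case(n=1,1, n=2,2, n=3,1)
| eval _time=strptime(case(n=1,"2023-03-08 01:05", n=2,"2023-03-08 01:10", n=3,"2023-03-08 01:05"), "%Y-%m-%d %H:%M")
| bin _time span=1h
| stats values(VM_UNIT) as VM_UNIT by _time VM_NAME
| stats sum(VM_UNIT) as total_vm_unit by _time

This returns a single row for 2023-03-08 01:00 with total_vm_unit=3; the duplicate MONGO_VM event is folded away by the first stats.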


PickleRick
SplunkTrust

That's what I thought. More or less. In your example both events have the same time. I suppose in real data the timestamp can differ.


Sathiya123
Explorer

@PickleRick Thanks a ton! It worked!

I got my query working; putting it here so that it will be helpful to others.

<search>
| eval vm_unit=case(vmSize="Standard_F16s_v2",2,vmSize="Standard_F8s_v2",1,vmSize="Standard_F4s",0.5,vmSize="Standard_F2s_v2",0.25)
| bin _time span=1h
| stats values(vm_unit) as vm_unit values(location) as location by _time id
| timechart span=1h usenull=true sum(vm_unit) as vm_count by location
| fillnull value=0
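
For anyone who wants to try this final pattern without the original index, here is a sketch on fabricated events; the id and location values (vm1, vm2, eastus, westus) are placeholders, and from the bin command onward the pipeline matches the query above:

| makeresults
| eval data="vm1:1:eastus vm2:2:westus vm1:1:eastus"
| makemv data
| mvexpand data
| rex field=data "(?<id>[^:]+):(?<vm_unit>[^:]+):(?<location>.+)"
| bin _time span=1h
| stats values(vm_unit) as vm_unit values(location) as location by _time id
| timechart span=1h usenull=true sum(vm_unit) as vm_count by location
| fillnull value=0

The duplicated vm1 event collapses in the stats step, so the timechart shows 1 for eastus and 2 for westus in that hour.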


woodcock
Esteemed Legend

Like this:

... | timechart span=1h sum(vm_unit) AS USED_VMS BY vm_name


Sathiya123
Explorer

Thanks, @woodcock  for your reply!

What I am expecting:

(image: Sathiya123_0-1680847554023.png)

But what I am getting:

(image: Sathiya123_1-1680847650186.png)

 

 


woodcock
Esteemed Legend

Don't use pictures; use text.  Your images are not showing.


PickleRick
SplunkTrust

There must be something wrong with your browser. The pictures are OK.
