Solved: Tstats datamodel combine three sources by common f...

JDukeSplunk · ‎05-22-2020

In an attempt to speed up long running searches I Created a data model (my first) from a single index where the sources are sales_item (invoice line level detail) sales_hdr (summary detail, type of sale) and sales_tracking (carrier and tracking).

Skipping a lot of detail and back story.. I got this search working with goal being to bring in POType from the sales_hdr, tracking info from sales_tracking and plug it in to each item line from sales_item based on Order which is common to all three nodes in the DM.

|tstats summariesonly=t
dc(sales_item.Material) as MaterialCount
dc(sales_item.OrderLine) as OrderLineCount
latest(sales_item.OrderLineStatus) as OrderLineStatus
latest(sales_item.DateCreated) as DateCreated
latest(sales_item.OrderLine) as OrderLine
from datamodel=SCM.sales_item where sales_item.DateCreated=* AND sales_item.Order=0137737819 by sales_item.Order sales_item.Material 
| rename sales_item.Order AS Order
| eventstats sum(OrderLineCount) as OrderLineSum sum(MaterialCount) as MaterialSum by Order
| appendcols 
    [
| tstats summariesonly=t latest(sales_hdr.POType) AS POType from datamodel=SCM.sales_hdr where sales_hdr.CreationDateHdr=* AND sales_hdr.Order=0137737819 by sales_hdr.Order 
    | rename sales_hdr.Order AS Order]
    | eventstats last(POType) as POType by Order
|appendcols 
    [| tstats summariesonly=t latest(sales_tracking.CarrierName) AS CarrierName latest(sales_tracking.TrackingNumber) AS TrackingNumber from datamodel=SCM.sales_tracking where sales_tracking.Order=0137737819 by sales_tracking.Order sales_tracking.OrderLine
    | rename sales_tracking.Order AS Order
    | eventstats latest(CarrierName) as CarrierName latest(TrackingNumber) as TrackingNumber by sales_tracking.OrderLine]
    | streamstats count as Row by Order

This works great for a single order. But, as soon as I change the nodename.Order=0137737* in all three tstats lines this search returns empty columns, see output in the picture below. I know some of the "sum" columns in here are a little redundant but I am using them to validate the results of the search.

So my questions are.

Do I need use appendcols? If not, how would you bring in these other fields into SCM.sales_item from SCM.sales_hdr and SCM.sales_tracking?
If I do need the appendcols, where am I screwing this up? I suspect it has something to do with the "by" clauses being slightly different in each tstats. I have to "by" it out by Order and Material in sales_item to get the individual rows of the invoice, and same thing in tracking to get each individual tracking number.

Output of a * order number search. Where in this case I know that 0137737819 POType is Email. And If I search only that Order in all three tstats lines I get the expected result.

Thanks for reading.. I'll be here scratching my head.

One last note. I have tried using an |eval Order=coalesce to combine Order but this fails worse than the rename I am doing here.

DalJeanis · ‎05-26-2020

appendcols is almost never the right answer. It will fail in that query whenever you have more than one "material" per order, because it literally does not "align" based on any fields, it just appends the columns of the output to whichever record happens to come up in the order.

If you change the appendcols to join, then you might get a slightly more robust query.

Most likely, though you should revisit your assumptions and use append, then stats or eventstats to massage the results together.

I'm not positive how your data is supposed to work - grouping by OrderLine and/or Material seems like there might be a mismatch. Here's what I think is a decent first cut, assuming that the data from the tracking record maps to the orderlines as if they were keys. If not, then adjust the final eventstats keys to line the data up correctly.

 | tstats summariesonly=t
     dc(sales_item.Material) as MaterialCount
     dc(sales_item.OrderLine) as OrderLineCount
     latest(sales_item.OrderLineStatus) as OrderLineStatus
     latest(sales_item.DateCreated) as DateCreated
     latest(sales_item.OrderLine) as OrderLine
     from datamodel=SCM.sales_item 
     where sales_item.DateCreated=* AND sales_item.Order=0137737819 
     by sales_item.Order sales_item.Material   
 | rename sales_item.Order AS Order
 | eval rectype="detail"
 | eventstats sum(OrderLineCount) as OrderLineSum sum(MaterialCount) as MaterialSum by Order

 | rename COMMENT as "Add header data then throw away the header records"  
 | append [
     | tstats summariesonly=t 
        latest(sales_hdr.POType) AS POType 
        from datamodel=SCM.sales_hdr 
        where sales_hdr.CreationDateHdr=* AND sales_hdr.Order=0137737819 
        by sales_hdr.Order 
     | rename sales_hdr.Order AS Order
     | eval rectype="header"]
 | eventstats last(POType) as POType by Order
 | where rectype!="header"   


 | rename COMMENT as "Add tracking data then throw away the tracking records"  
 | append [ 
     | tstats summariesonly=t 
        latest(sales_tracking.CarrierName) AS CarrierName 
        latest(sales_tracking.TrackingNumber) AS TrackingNumber 
        from datamodel=SCM.sales_tracking 
        where sales_tracking.Order=0137737819 
        by sales_tracking.Order sales_tracking.OrderLine
     | rename sales_tracking.Order AS Order, sales_tracking.OrderLine AS OrderLine
     | eval rectype="tracking"
        ]
 | eventstats last(CarrierName ) as CarrierName  last(TrackingNumber) as TrackingNumber by Order  OrderLine
 | where rectype!="tracking"   

 | streamstats count as Row by Order

It REALLY seems like this query ought to be based on individual Orderline lines, not grouped by Material first, to get a valid readout of what is shipping, unless there is some way that you really can be sure that all lines of a particular material in an order always ship at the same time. Please feel free to explain your system in more detail if you need more help, tehn we can get you on the road fastest.

View solution in original post

DalJeanis · ‎05-26-2020

appendcols is almost never the right answer. It will fail in that query whenever you have more than one "material" per order, because it literally does not "align" based on any fields, it just appends the columns of the output to whichever record happens to come up in the order.

If you change the appendcols to join, then you might get a slightly more robust query.

Most likely, though you should revisit your assumptions and use append, then stats or eventstats to massage the results together.

I'm not positive how your data is supposed to work - grouping by OrderLine and/or Material seems like there might be a mismatch. Here's what I think is a decent first cut, assuming that the data from the tracking record maps to the orderlines as if they were keys. If not, then adjust the final eventstats keys to line the data up correctly.

 | tstats summariesonly=t
     dc(sales_item.Material) as MaterialCount
     dc(sales_item.OrderLine) as OrderLineCount
     latest(sales_item.OrderLineStatus) as OrderLineStatus
     latest(sales_item.DateCreated) as DateCreated
     latest(sales_item.OrderLine) as OrderLine
     from datamodel=SCM.sales_item 
     where sales_item.DateCreated=* AND sales_item.Order=0137737819 
     by sales_item.Order sales_item.Material   
 | rename sales_item.Order AS Order
 | eval rectype="detail"
 | eventstats sum(OrderLineCount) as OrderLineSum sum(MaterialCount) as MaterialSum by Order

 | rename COMMENT as "Add header data then throw away the header records"  
 | append [
     | tstats summariesonly=t 
        latest(sales_hdr.POType) AS POType 
        from datamodel=SCM.sales_hdr 
        where sales_hdr.CreationDateHdr=* AND sales_hdr.Order=0137737819 
        by sales_hdr.Order 
     | rename sales_hdr.Order AS Order
     | eval rectype="header"]
 | eventstats last(POType) as POType by Order
 | where rectype!="header"   


 | rename COMMENT as "Add tracking data then throw away the tracking records"  
 | append [ 
     | tstats summariesonly=t 
        latest(sales_tracking.CarrierName) AS CarrierName 
        latest(sales_tracking.TrackingNumber) AS TrackingNumber 
        from datamodel=SCM.sales_tracking 
        where sales_tracking.Order=0137737819 
        by sales_tracking.Order sales_tracking.OrderLine
     | rename sales_tracking.Order AS Order, sales_tracking.OrderLine AS OrderLine
     | eval rectype="tracking"
        ]
 | eventstats last(CarrierName ) as CarrierName  last(TrackingNumber) as TrackingNumber by Order  OrderLine
 | where rectype!="tracking"   

 | streamstats count as Row by Order

It REALLY seems like this query ought to be based on individual Orderline lines, not grouped by Material first, to get a valid readout of what is shipping, unless there is some way that you really can be sure that all lines of a particular material in an order always ship at the same time. Please feel free to explain your system in more detail if you need more help, tehn we can get you on the road fastest.

JDukeSplunk · ‎05-29-2020

Your search worked great. However I ended up going with a join left which returned the same results, only slightly faster.

Yeah you we're right about grouping by material and I knew that going in. What I also saw is that if an order had two lines with the same material that only counted as 1 "total" for the order. This is why I was getting a dc of line and materiel to see which one when grouped by the other matched up with individual order searches and lines.

The real rub of this data is that it is event based. So any time someone ticks a box in the order screen, or the carrier updates the shipping information a whole new data set for all that row is generated. Meaning, we have to search from the date created forward for all rows with that Order number, and only count/use the latest one for it. A single order of 10 lines can have 10k raw events in Splunk. Whats worse, is that tracking and header sources do not have a "datecreated" field that matches up with sales_item. This means that I can't limit the number rows searched when joining those others. While the first search runs quickly because it is usually only searching for Order=* DateCreated=20200520, the joins have to search the whole timespan for Order=*. On larger timespans this results in empty columns because Splunk reached the 50k row limit. Thankfully nobody has asked for that level of detail, these are mostly being used for Count-by pie charts and "top".

richgalloway · ‎05-22-2020

appendcols should be fine as long as all appended searches return the same number of results in the same order as the main search. To be clear, the first result returned by the appended search should be directly related to the first result of the main search, and so on for each result.

---
If this reply helps you, Karma would be appreciated.

to4kawa · ‎05-22-2020

| tstats summariesonly=t
    dc(sales_item.Material) as MaterialCount
    dc(sales_item.OrderLine) as OrderLineCount
    latest(sales_item.OrderLineStatus) as OrderLineStatus
    latest(sales_item.DateCreated) as DateCreated
    latest(sales_item.OrderLine) as OrderLine
    latest(sales_hdr.POType) AS POType
    latest(sales_tracking.CarrierName) AS CarrierName
    latest(sales_tracking.TrackingNumber) AS TrackingNumber
    from datamodel=SCM.sales_item where sales_item.Order=0137737* OR sales_hdr.Order=0137737* OR sales_tracking.Order=0137737* by sales_item.Order sales_item.Material sales_hdr.Order sales_tracking.Order sales_tracking.OrderLine

Why don't you try searching the whole thing once and then try to fix it?

JDukeSplunk · ‎05-26-2020

@to4kawa I would love to do all of this in one line. The issue there is the "from datamodel=" portion. If I leave this as the root of the datamodel "from datamodel=SCM" I only get results from sales_hdr, I guess because sales_hdr is the first branch in the tree. So to get results from sales_item I have to specify "from datamodel=SCM.sales_item" which excludes the results from hdr and tracking.

Tstats datamodel combine three sources by common field.

table

tstats

[Puzzles] Solve, Learn, Repeat: Dynamic formatting from XML events

Enter the Agentic Era with Splunk AI Assistant for SPL 1.4

Stronger Security with Federated Search for S3, GCP SQL & Australian Threat ...

Join the Conversation