Splunk Search

How to join two tables where the key is named differently and must be cleaned up first?

dw10j
Engager

I'm new, have had no training. I have two distinct logs from same index and sourcetype. In the first log I want to filter by field A and transform field B with upper(replace(B,":","")). I then want to join to field X in the second log where X=upper(replace(B,":","")) so that I can retrieve the value of field Y in the second log. I have tried many things and scoured this forum for clues but am now stuck. Here is one of my many failed attempts:

index=its sourcetype=syslog dhip="100.200.300.406" | eval dhmacUP=upper(replace(dhmac,":","")) | append [search index=its sourcetype=syslog rmac=* ] | fields dhip, ruid, rmac, dhmacUP | transaction rmac dhmacUP
1 Solution

DalJeanis
Legend

@dw10j - The architecture you generally want to use for these has been described really well by @martin_mueller, so he deserves the "accepted answer".

I just wanted to add some orientation for you, because I came from the relational world last year, and let me tell you, translating from SQL to SPL will screw you up nearly every time. There are many things that are efficient in SQL that will turn your SPL to a swamp.

Here's an analogy that will hopefully help you transition over...

Relational databases are like an office building attached to a bunch of warehouses. Splunk is like a city library system.

Repeat that to yourself a couple of times, then read on.

Relational databases are like an office building attached to a bunch of warehouses. Splunk is like a city library system.

THE RELATIONAL OFFICE BUILDING AND WAREHOUSES

So, relational databases are based on the idea of everything having a place, all the stuff of a particular type stored in one warehouse or one floor of the office building, in nice little organized rows, and if you want anything out, then you ask for the stuff that's in any particular warehouse, THEN tell the system how to connect it to the stuff that's stored somewhere else, and that's efficient for the forklifts.

You with me here?

That will be inefficient in splunk, every time.

THE SPLUNK MAIN LIBRARIAN AND BRANCH LIBRARIES

Now, if you remember how libraries work, there's a bunch of stuff lying around on tables, and there's shelves, some of which are pretty well indexed and some aren't, and in the old days there were card files of 3x5 index cards that let you find books exactly one of three ways - author, subject, title. And they had stacks in the back of stuff in odd sizes that didn't fit.

Splunk.

Let me go a little further here - if you want to know what's in ALL the libraries, how does that work? Assuming you had the authority to demand that knowledge, you send a request to the head librarian, she writes instructions for all the other librarians, sends it out to the branch libraries, and they send back the results, which are then collated by the head librarian again.

And, one more little thing, ONLY the head librarian's senior staff is allowed to compile and analyze stuff, compare books to other books, and so on. Union rules. Well, that and the head librarian is sure that her staff can read and count.

So, if you send a request that requires everything of a particular type to be TOGETHER before analyzing it and connecting it to something else, then that means that EVERY BOOK needs to be sent to the head librarian. Twice.

Okay, maybe not twice, but at least once more than it needs to be. Because, unions, right? Anyway, she has to have all the work done by her own staff, instead of having most of it done by all the good (cheap) junior people at the branch libraries.

When possible, write your splunk queries so that the branch librarians are only sending back the LIST of information you really need, one line per shelf item, not the shelf items themselves. And write them so that they are sending back ALL the materials you need at the same time, rather than having to have the head librarian compile things, then ask again. ( verbs like map and some kinds of join go here.)

THE SEARCH PSEUDOCODE

The above discussion explains the first line of Martin's search.

The NEXT line should be a | fields command, limiting the values returned to the fields you absolutely need. The Splunk optimizer will do this for you most of the time, but it's better at your stage of programming SPL to do it explicitly, as early as possible, so that you learn to think about it.

You are telling the branch librarians, i.e. the indexers, that you need nothing else besides fields x, y and z. Let's keep those books on the shelves.

Other than that minor omission, and not laying the code out pretty for you, Martin's advice is golden.

 ((filters identifying events of type A) OR (filters identifying events of type B)) 
| fields ... the list of every field that you need from either type A or B... 
| eval joinfield = case(expression to detect type A, functions(to(transform(events, of, type, A))), 
                        expression to detect type B, functions(to(transform(events, of, type, A)))) 
| stats values(field1) as field1 values(... as fieldN by joinfield

So, what's our mantra for today?

Relational databases are like an office building attached to a bunch of warehouses. Splunk is like a city library system.

Okay, one more thing. The key to knowing what a branch librarian can do is the words "streaming" and "distributable"

STREAMING vs NON-STREAMING, DISTRIBUTABLE vs everything else

In our analogy, "streaming" means the librarians can do it to one book at a time. "Distributable" means it can be done at a branch library. (Yes, there ARE things the head librarian can do one at a time, if she has to.)

Whether something is streaming and/or distributable is normally pretty intuitive, but there are exceptions if you don't know all the functions of the verb. Table was a gotcha for me. I thought it just reformatted the output, so it would of course be streaming and distributable, but in reality among other things it also potentially limits the number of records returned... which cannot be done at the branch libraries, because they are not aware of what the other branches are doing. Therefore, table is non streaming, use fields instead.

(I STILL haven't yet gone back through six months of potentially erroneous posts to fix that...)

When you are trying to speed up a query, as a general case, you want to limit _time first, records second, fields third, then do streaming commands, then nonstreaming. Subject to that, do reformats AFTER all aggregations if the formatted field is one of the aggregation keys (and especially if the reformat is increasing the field size). Do reformats and renames that are solely for display purposes last, after everything else has been done.

When you have multiple options and a bit of time, or a slow search, then write it up multiple ways and TEST THEM. Performance in splunk, like in everything else, is highly data-dependent and architecture-dependent, and the same "efficient" query on sparse data might be inefficient on dense data, and so on.

Second mantra for the day: Where theory and real performance differ, believe the real performance.

There's some guidelines and a few new mental maps laid down. Happy splunking!

View solution in original post

DalJeanis
Legend

@dw10j - if your issue has been resolved, then please accept the answer that solved your issue. If not, then please let us know how we can help you more.

0 Karma

DalJeanis
Legend

@dw10j - The architecture you generally want to use for these has been described really well by @martin_mueller, so he deserves the "accepted answer".

I just wanted to add some orientation for you, because I came from the relational world last year, and let me tell you, translating from SQL to SPL will screw you up nearly every time. There are many things that are efficient in SQL that will turn your SPL to a swamp.

Here's an analogy that will hopefully help you transition over...

Relational databases are like an office building attached to a bunch of warehouses. Splunk is like a city library system.

Repeat that to yourself a couple of times, then read on.

Relational databases are like an office building attached to a bunch of warehouses. Splunk is like a city library system.

THE RELATIONAL OFFICE BUILDING AND WAREHOUSES

So, relational databases are based on the idea of everything having a place, all the stuff of a particular type stored in one warehouse or one floor of the office building, in nice little organized rows, and if you want anything out, then you ask for the stuff that's in any particular warehouse, THEN tell the system how to connect it to the stuff that's stored somewhere else, and that's efficient for the forklifts.

You with me here?

That will be inefficient in splunk, every time.

THE SPLUNK MAIN LIBRARIAN AND BRANCH LIBRARIES

Now, if you remember how libraries work, there's a bunch of stuff lying around on tables, and there's shelves, some of which are pretty well indexed and some aren't, and in the old days there were card files of 3x5 index cards that let you find books exactly one of three ways - author, subject, title. And they had stacks in the back of stuff in odd sizes that didn't fit.

Splunk.

Let me go a little further here - if you want to know what's in ALL the libraries, how does that work? Assuming you had the authority to demand that knowledge, you send a request to the head librarian, she writes instructions for all the other librarians, sends it out to the branch libraries, and they send back the results, which are then collated by the head librarian again.

And, one more little thing, ONLY the head librarian's senior staff is allowed to compile and analyze stuff, compare books to other books, and so on. Union rules. Well, that and the head librarian is sure that her staff can read and count.

So, if you send a request that requires everything of a particular type to be TOGETHER before analyzing it and connecting it to something else, then that means that EVERY BOOK needs to be sent to the head librarian. Twice.

Okay, maybe not twice, but at least once more than it needs to be. Because, unions, right? Anyway, she has to have all the work done by her own staff, instead of having most of it done by all the good (cheap) junior people at the branch libraries.

When possible, write your splunk queries so that the branch librarians are only sending back the LIST of information you really need, one line per shelf item, not the shelf items themselves. And write them so that they are sending back ALL the materials you need at the same time, rather than having to have the head librarian compile things, then ask again. ( verbs like map and some kinds of join go here.)

THE SEARCH PSEUDOCODE

The above discussion explains the first line of Martin's search.

The NEXT line should be a | fields command, limiting the values returned to the fields you absolutely need. The Splunk optimizer will do this for you most of the time, but it's better at your stage of programming SPL to do it explicitly, as early as possible, so that you learn to think about it.

You are telling the branch librarians, i.e. the indexers, that you need nothing else besides fields x, y and z. Let's keep those books on the shelves.

Other than that minor omission, and not laying the code out pretty for you, Martin's advice is golden.

 ((filters identifying events of type A) OR (filters identifying events of type B)) 
| fields ... the list of every field that you need from either type A or B... 
| eval joinfield = case(expression to detect type A, functions(to(transform(events, of, type, A))), 
                        expression to detect type B, functions(to(transform(events, of, type, A)))) 
| stats values(field1) as field1 values(... as fieldN by joinfield

So, what's our mantra for today?

Relational databases are like an office building attached to a bunch of warehouses. Splunk is like a city library system.

Okay, one more thing. The key to knowing what a branch librarian can do is the words "streaming" and "distributable"

STREAMING vs NON-STREAMING, DISTRIBUTABLE vs everything else

In our analogy, "streaming" means the librarians can do it to one book at a time. "Distributable" means it can be done at a branch library. (Yes, there ARE things the head librarian can do one at a time, if she has to.)

Whether something is streaming and/or distributable is normally pretty intuitive, but there are exceptions if you don't know all the functions of the verb. Table was a gotcha for me. I thought it just reformatted the output, so it would of course be streaming and distributable, but in reality among other things it also potentially limits the number of records returned... which cannot be done at the branch libraries, because they are not aware of what the other branches are doing. Therefore, table is non streaming, use fields instead.

(I STILL haven't yet gone back through six months of potentially erroneous posts to fix that...)

When you are trying to speed up a query, as a general case, you want to limit _time first, records second, fields third, then do streaming commands, then nonstreaming. Subject to that, do reformats AFTER all aggregations if the formatted field is one of the aggregation keys (and especially if the reformat is increasing the field size). Do reformats and renames that are solely for display purposes last, after everything else has been done.

When you have multiple options and a bit of time, or a slow search, then write it up multiple ways and TEST THEM. Performance in splunk, like in everything else, is highly data-dependent and architecture-dependent, and the same "efficient" query on sparse data might be inefficient on dense data, and so on.

Second mantra for the day: Where theory and real performance differ, believe the real performance.

There's some guidelines and a few new mental maps laid down. Happy splunking!

niketn
Legend

Just wanted to add. This is a good reference for SQL users to switch to SPL

https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/SQLtoSplunk

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"
0 Karma

DalJeanis
Legend

No, it isn't. That SQL and that SPL will not get the same results. Working on an improved version.

0 Karma

martin_mueller
SplunkTrust
SplunkTrust

Amazeballs.

Side note, stats will propagate its required fields upstream, see remoteSearch in the job inspector.

DalJeanis
Legend

@martin_mueller - Regarding the side note...Yep. I was keeping the details on "optimizer" artfully vague so that the usefulness of the post hopefully won't change much over time. The other fact, that splunk could potentially allow some collecting and pre-work to be done at the indexers under x, y and z circumstances, would just confuse the main points that needed to be made to help a SQL guy to come over to the splunky side.

And, I have some ideas for enhancement requests... that will have to wait until I can have a beer with some splunkers and hash them out, either here in Dallas or -- wife willing and the creek don't rise -- in D.C.

martin_mueller
SplunkTrust
SplunkTrust

As a good starting point, check out Nick's March 2016 talk on aggregating data: https://wiki.splunk.com/Virtual_.conf
Also this: https://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-jo...

In general, you'll want something like this:

((filters identifying events of type A) OR (filters itentifying events of type B)) | eval joinfield = case(expression to detect type A, functions(to(transform(events, of, type, A))), expression to detect type B, functions(to(transform(events, of, type, A)))) | stats values(field1) as field1 values(... as fieldN by joinfield
Get Updates on the Splunk Community!

Now Available: Cisco Talos Threat Intelligence Integrations for Splunk Security Cloud ...

At .conf24, we shared that we were in the process of integrating Cisco Talos threat intelligence into Splunk ...

Preparing your Splunk Environment for OpenSSL3

The Splunk platform will transition to OpenSSL version 3 in a future release. Actions are required to prepare ...

Easily Improve Agent Saturation with the Splunk Add-on for OpenTelemetry Collector

Agent Saturation What and Whys In application performance monitoring, saturation is defined as the total load ...