Splunk Search

How to join data from index and dbxquery without using JOIN, APPEND or stats command?

LearningGuy
Builder

Hello,
How to join data from index and dbxquery without using JOIN, APPEND or stats command?
Issue with JOIN:  limit of subsearch 50,000 rows or fewer.  Missing data.
Issue with APPEND: requires "stats values" command to correlate the data, gives "merged data" in one row that needs to be split (using MVexpand or other methodology). MVexpand has memory issue and slow.

At this time, my solution is moving DBXquery data into CSV file and use "lookup" command to join the data from index, but the CSV file is static and needs to be manually updated.   
Please suggest. Thank you

| index=vulnerability_index
| table ip_address, vulnerability, score
Table 1:

ip_addressvulnerabilityscore
192.168.1.1SQL Injection9
192.168.1.1OpenSSL7
192.168.1.2Cross Site-Scripting8
192.168.1.2DNS5

 

| dbxquery query="select * from tableCompany"
Table 2:

ip_addresscompanylocation
192.168.1.1Comp-ALoc-A
192.168.1.2Comp-BLoc-B
192.168.1.5Comp-ELoc-E


Expected result after join data from index(Table 1) and index (Table 2)
Table 3: 

ip_addresscompanylocationvulnerabilityScore
192.168.1.1Comp-ALoc-ASQL Injection9
192.168.1.1Comp-ALoc-AOpenSSL7
192.168.1.2Comp-BLoc-BCross Site-Scripting8
192.168.1.2Comp-BLoc-BDNS5
Labels (3)
0 Karma
1 Solution

bowesmana
SplunkTrust
SplunkTrust

If the 'merge' doesn't work on real data on the ip_address field, then there must be some difference in the data, e.g. possible trailing spaces? If the data is exactly the same it will work, so it must be different somehow.

If you have a subsearch you will have limits of 50,000 with append - so if you have more than 50k ip addresses, you cannot do the two searches, unless you switch around the searches if the index search is less than 50k.

Another issue with using the append is that all data will be sent to the search head for processing, thereby losing the benefit of running in parallel on the indexers (if you are clustered)

stats is a fast command, eventstats and anything to do with subsearches can be slow. If stats is in the outer search as these examples, it will be fast, but the overall time for the two searches will be slow.

Moving the dbxquery and save to CSV seems to be a sensible approach - I imagine that your vulnerabilities list is hopefully a small subset of the hosts, so it seems overkill to have to fetch every host for each result.

Creating a lookup (option 4) will also mean it will be distributed to indexers, so you will get additional performance benefit of the lookup occurring on the indexers, assuming you have not done something to cause the data to go to the search head. Yes it will scale - there are some constraints on how a lookup works when it gets to a certain size (25MB I think), but it still works well.

Hope this helps.

View solution in original post

isoutamo
SplunkTrust
SplunkTrust

Hi

If you are currently using lookup, is the dblookup an optio for you? 

https://docs.splunk.com/Documentation/DBX/3.14.0/DeployDBX/Createandmanagedatabaselookups

On docs there is also way to update that normal lookup file with dbxquery when you are using it.

r. Ismo

LearningGuy
Builder

Hello,

It looks like DBlookup requires additional setup, it didn't work when I tried it. 
I was able to use outputlookup and will perform some test.
Is there a way to create a job in Splunk to run this query so the CSV file gets updated frequently?
| dbxquery query="select * from tableCompany"
| outputlookup company.csv

Thank you for your help

Tags (1)
0 Karma

bowesmana
SplunkTrust
SplunkTrust

You can do it a number of ways - below are examples with your data

First way - do the index search first

| makeresults
``` This simulates your index search ```
| eval _raw="ip_address	vulnerability	score
192.168.1.1	SQL Injection	9
192.168.1.1	OpenSSL	7
192.168.1.2	Cross Site-Scripting	8
192.168.1.2	DNS	5"
| multikv forceheader=1
| fields ip_address	vulnerability	score
| append [
| makeresults
``` This simulates your dbxquery ```
| eval _raw="ip_address	company	location
192.168.1.1	Comp-A	Loc-A
192.168.1.2	Comp-B	Loc-B
192.168.1.5	Comp-E	Loc-E"
| multikv forceheader=1
| fields company ip_address location
]
| fields - _raw _time
``` This collects the company data from the dbxquery into the vulnerability records ```
| eventstats values(company) as company values(location) as location by ip_address
``` and this joins the two ```
| stats values(*) as * by ip_address vulnerability
| table ip_address company location vulnerability score

 and the second using mvexpand on the score/vulnerability.

Your statement that you have to use stats values to make a single row is not right - there's a single row per ip address - but you may have meant that. However, whether it is fast or slow will depend on your data and volume. In the first search, eventstats can also be slow, so you should test and look at the job inspector.

| makeresults
| eval _raw="ip_address	vulnerability	score
192.168.1.1	SQL Injection	9
192.168.1.1	OpenSSL	7
192.168.1.2	Cross Site-Scripting	8
192.168.1.2	DNS	5"
| multikv forceheader=1
| fields ip_address	vulnerability	score
| append [
| makeresults
| eval _raw="ip_address	company	location
192.168.1.1	Comp-A	Loc-A
192.168.1.2	Comp-B	Loc-B
192.168.1.5	Comp-E	Loc-E"
| multikv forceheader=1
| fields company ip_address location
]
| fields - _raw _time
| eval sv=score.":".vulnerability
| fields - vulnerability score
| stats values(*) as * by ip_address 
| mvexpand sv
| rex field=sv "(?<score>\d+):(?<vulnerability>.*)"
| table ip_address company location vulnerability score
| where isnotnull(vulnerability)

 Using a previously created lookup where you create it with dbxquery on a regular basis would certainly be a good way to go if that fits your use case.

In any case, you will have a subsearch limit of 50,000 results with append also, but if you are doing that with the dbxquery search then unless you have more than 50,000 IP addresses it won't be an issue.

LearningGuy
Builder

Hello,
1. Your query worked on the sample data, but it didn't work in real data - I am not sure why it didn't merge the ip address
Yes, I meant the stats value gives "merged data / multiple values" in one row (see Table 1), this issue merged data were fixed by using "eventstats" you recommended.
2. In real data, dbxquery has more than 50,000 rows, so APPEND and JOIN cannot support this?  
3. The performance is very slow using APPEND and eventstats and stats
4. Is moving data from DBXquery regularly and and output it to CSV regularly is scalable for large data?
Thank you for your help

Table 1: Sample Merged Data

ip_address vulnerability score company location

192.168.1.1
OpenSSL
SQL Injection
7
9
Comp-A
Comp-Z
Loc-A
Loc-Z
192.168.1.2
Cross Site-Scripting
DNS
5
8
Comp-B
Comp-Y
Loc-B
Loc-Y
192.168.1.5  Comp-ELoc-E
Tags (2)
0 Karma

bowesmana
SplunkTrust
SplunkTrust

If the 'merge' doesn't work on real data on the ip_address field, then there must be some difference in the data, e.g. possible trailing spaces? If the data is exactly the same it will work, so it must be different somehow.

If you have a subsearch you will have limits of 50,000 with append - so if you have more than 50k ip addresses, you cannot do the two searches, unless you switch around the searches if the index search is less than 50k.

Another issue with using the append is that all data will be sent to the search head for processing, thereby losing the benefit of running in parallel on the indexers (if you are clustered)

stats is a fast command, eventstats and anything to do with subsearches can be slow. If stats is in the outer search as these examples, it will be fast, but the overall time for the two searches will be slow.

Moving the dbxquery and save to CSV seems to be a sensible approach - I imagine that your vulnerabilities list is hopefully a small subset of the hosts, so it seems overkill to have to fetch every host for each result.

Creating a lookup (option 4) will also mean it will be distributed to indexers, so you will get additional performance benefit of the lookup occurring on the indexers, assuming you have not done something to cause the data to go to the search head. Yes it will scale - there are some constraints on how a lookup works when it gets to a certain size (25MB I think), but it still works well.

Hope this helps.

LearningGuy
Builder

What is the maximum CSV file I can upload to the Splunk lookup?

Thanks

0 Karma

yuanliu
SplunkTrust
SplunkTrust

Do you mean maximum file size?  I do not think there is a limit to file number. (There might be a practical limit but it probably will not have practical impact.

File size can be limited by available RAM and other factors. (I had to learn this the hard way.)  See [lookup] section in limits.conf.

0 Karma

yuanliu
SplunkTrust
SplunkTrust

Two alternatives.

  1. Instead of using dbxquery to create a lookup table, you can simply Create an external lookup.
  2. Use map command.  Like

 

| index=vulnerability_index
| table ip_address, vulnerability, score
| map search="| dbxquery query="select * from tableCompany where ip_address=$ip_address$"

 

yuanliu
SplunkTrust
SplunkTrust
Two alternatives.

Also want to correct myself: There is the Splunk DB Connect app, a third option.  If your database is supported, you may be in luck.

LearningGuy
Builder

Hello,
1. It seems like external lookup is a viable solution, I am still trying to find a good example in order to test this.
     I read the instruction you sent and still wondering on how I use external lookup based on my DBXquery or CSV file from DBXquery, then correlate it with the index.
I appreciate if you can give an example.  

2. It gave me an error when I ran your command using map. The search result count (727) exceeds maximum (10), using max. To override it, set maxsearches appropriately.

Thank you for your help

Tags (1)
0 Karma

yuanliu
SplunkTrust
SplunkTrust

2. It gave me an error when I ran your command using map. The search result count (727) exceeds maximum (10), using max. To override it, set maxsearches appropriately.

As the message (and the document) suggests, default maxsearches for map command is 10.  You can explicitly specify how many you would allow. In your example, you can say something like

 

| map maxsearches=1000 search="| dbxquery query="select * from tableCompany where ip_address=$ip_address$"

 

Like @bowesmana notes, map can be expensive.

Meanwhile, you cannot directly use dbxquery in an external lookup.  You need to write a script that accept the specified arguments and that outputs data in the format suitable for external lookup.  That document describes the basic structure in Python, although you can adapt it to other languages.  Every Splunk installation also comes with a script $SPLUNK_HOME/system/bin/external_lookup.py and an external lookup dnslookup.  You can study that simple script for a basic feeling. (That script uses socket.gethostbyname_ex(host) to execute IP lookup and socket.gethostbyaddr(ip) for hostname lookup.  You will need to use corresponding DB interface method to execute the external query.)

bowesmana
SplunkTrust
SplunkTrust

Don't forget, map runs serially and will run once per vulnerability found

Get Updates on the Splunk Community!

Developer Spotlight with Paul Stout

Welcome to our very first developer spotlight release series where we'll feature some awesome Splunk ...

State of Splunk Careers 2024: Maximizing Career Outcomes and the Continued Value of ...

For the past four years, Splunk has partnered with Enterprise Strategy Group to conduct a survey that gauges ...

Data-Driven Success: Splunk & Financial Services

Splunk streamlines the process of extracting insights from large volumes of data. In this fast-paced world, ...