Splunk Search

How to join data from an index and dbxquery without using the JOIN, APPEND, or stats commands?

LearningGuy
Builder

Hello,
How can I join data from an index and dbxquery without using the JOIN, APPEND, or stats commands?
Issue with JOIN: the subsearch is limited to 50,000 rows, so anything beyond that limit is dropped (missing data).
Issue with APPEND: it requires a "stats values" command to correlate the data, which merges multiple values into one row that then has to be split apart again (using mvexpand or some other method). mvexpand has memory issues and is slow.

At this time, my workaround is to export the dbxquery data to a CSV file and use the "lookup" command to join it with the data from the index, but the CSV file is static and has to be updated manually.
Please suggest. Thank you

index=vulnerability_index
| table ip_address, vulnerability, score
Table 1:

ip_address    vulnerability         score
192.168.1.1   SQL Injection         9
192.168.1.1   OpenSSL               7
192.168.1.2   Cross Site-Scripting  8
192.168.1.2   DNS                   5

| dbxquery query="select * from tableCompany"
Table 2:

ip_address    company  location
192.168.1.1   Comp-A   Loc-A
192.168.1.2   Comp-B   Loc-B
192.168.1.5   Comp-E   Loc-E


Expected result after joining the data from the index (Table 1) and dbxquery (Table 2):
Table 3: 

ip_address    company  location  vulnerability         score
192.168.1.1   Comp-A   Loc-A     SQL Injection         9
192.168.1.1   Comp-A   Loc-A     OpenSSL               7
192.168.1.2   Comp-B   Loc-B     Cross Site-Scripting  8
192.168.1.2   Comp-B   Loc-B     DNS                   5
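For reference, the lookup-based workaround described above looks roughly like this (a sketch; company.csv and the field names follow the examples in this post):

Refresh step, currently run manually:

| dbxquery query="select * from tableCompany"
| outputlookup company.csv

Join step:

index=vulnerability_index
| lookup company.csv ip_address OUTPUT company location
| table ip_address company location vulnerability score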

isoutamo
SplunkTrust

Hi

If you are currently using a lookup, is dblookup an option for you?

https://docs.splunk.com/Documentation/DBX/3.14.0/DeployDBX/Createandmanagedatabaselookups

The docs also describe a way to update that normal lookup file with dbxquery if you are using one.
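For illustration, once a DB Connect database lookup has been defined per the documentation above, it can be used like any other lookup (company_db_lookup is a placeholder name):

index=vulnerability_index
| lookup company_db_lookup ip_address OUTPUT company location
| table ip_address company location vulnerability score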

r. Ismo

LearningGuy
Builder

Hello,

It looks like DB lookup requires additional setup; it didn't work when I tried it.
I was able to use outputlookup and will perform some tests.
Is there a way to create a job in Splunk to run this query so the CSV file gets updated regularly?
| dbxquery query="select * from tableCompany"
| outputlookup company.csv

Thank you for your help
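One way to keep that CSV current is to save the search above as a scheduled report. A minimal savedsearches.conf sketch (the stanza name and hourly schedule are placeholders):

[refresh_company_lookup]
search = | dbxquery query="select * from tableCompany" | outputlookup company.csv
enableSched = 1
cron_schedule = 0 * * * *
dispatch.earliest_time = -1h
dispatch.latest_time = now

The same can be done in the UI by saving the search as a report and enabling a schedule.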


bowesmana
SplunkTrust

You can do it a number of ways - below are examples with your data

First way - do the index search first

| makeresults
``` This simulates your index search ```
| eval _raw="ip_address	vulnerability	score
192.168.1.1	SQL Injection	9
192.168.1.1	OpenSSL	7
192.168.1.2	Cross Site-Scripting	8
192.168.1.2	DNS	5"
| multikv forceheader=1
| fields ip_address	vulnerability	score
| append [
| makeresults
``` This simulates your dbxquery ```
| eval _raw="ip_address	company	location
192.168.1.1	Comp-A	Loc-A
192.168.1.2	Comp-B	Loc-B
192.168.1.5	Comp-E	Loc-E"
| multikv forceheader=1
| fields company ip_address location
]
| fields - _raw _time
``` This collects the company data from the dbxquery into the vulnerability records ```
| eventstats values(company) as company values(location) as location by ip_address
``` and this joins the two ```
| stats values(*) as * by ip_address vulnerability
| table ip_address company location vulnerability score

And the second way, using mvexpand on the score/vulnerability pair:

Your statement that you have to use stats values to make a single row is not quite right - it gives a single row per ip_address - but that may be what you meant. However, whether it is fast or slow will depend on your data and volume. In the first search, eventstats can also be slow, so you should test and look at the job inspector.

| makeresults
| eval _raw="ip_address	vulnerability	score
192.168.1.1	SQL Injection	9
192.168.1.1	OpenSSL	7
192.168.1.2	Cross Site-Scripting	8
192.168.1.2	DNS	5"
| multikv forceheader=1
| fields ip_address	vulnerability	score
| append [
| makeresults
| eval _raw="ip_address	company	location
192.168.1.1	Comp-A	Loc-A
192.168.1.2	Comp-B	Loc-B
192.168.1.5	Comp-E	Loc-E"
| multikv forceheader=1
| fields company ip_address location
]
| fields - _raw _time
| eval sv=score.":".vulnerability
| fields - vulnerability score
| stats values(*) as * by ip_address 
| mvexpand sv
| rex field=sv "(?<score>\d+):(?<vulnerability>.*)"
| table ip_address company location vulnerability score
| where isnotnull(vulnerability)

Using a previously created lookup that you refresh with dbxquery on a regular basis would certainly be a good way to go if that fits your use case.

In any case, append is also subject to the 50,000-result subsearch limit, but if the dbxquery is the appended search, it won't be an issue unless you have more than 50,000 IP addresses.
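With the real data, the first approach above would look roughly like this (a sketch using the index and table names from your question):

index=vulnerability_index
| table ip_address vulnerability score
| append [
    | dbxquery query="select * from tableCompany"
    | fields ip_address company location
    ]
| eventstats values(company) as company values(location) as location by ip_address
| stats values(*) as * by ip_address vulnerability
| table ip_address company location vulnerability score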

LearningGuy
Builder

Hello,
1. Your query worked on the sample data, but it didn't work on the real data - I am not sure why it didn't merge on the ip_address.
Yes, I meant that "stats values" gives merged data (multiple values) in one row (see Table 1); that merged-data issue was fixed by the "eventstats" you recommended.
2. In the real data, dbxquery returns more than 50,000 rows, so APPEND and JOIN cannot support this?
3. Performance is very slow using APPEND with eventstats and stats.
4. Is regularly exporting the dbxquery data to a CSV file scalable for large data sets?
Thank you for your help

Table 1: Sample Merged Data

ip_address    vulnerability               score  company          location
192.168.1.1   OpenSSL, SQL Injection      7, 9   Comp-A, Comp-Z   Loc-A, Loc-Z
192.168.1.2   Cross Site-Scripting, DNS   5, 8   Comp-B, Comp-Y   Loc-B, Loc-Y
192.168.1.5                                      Comp-E           Loc-E

bowesmana
SplunkTrust

If the 'merge' doesn't work on the ip_address field in the real data, then there must be some difference in the data, e.g. possible trailing spaces? If the values are exactly the same it will work, so they must differ somehow.

Append uses a subsearch, so it is subject to the 50,000-result limit - if you have more than 50k IP addresses, you cannot run the two searches that way, unless you swap the searches around and the index search returns fewer than 50k results (see the sketch below).
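For illustration, the swapped version would look something like this (only viable when the index search stays under the subsearch limit):

| dbxquery query="select * from tableCompany"
| append [
    search index=vulnerability_index
    | table ip_address vulnerability score
    ]
| eventstats values(company) as company values(location) as location by ip_address
| stats values(*) as * by ip_address vulnerability
| table ip_address company location vulnerability score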

Another issue with using append is that all the data will be sent to the search head for processing, so you lose the benefit of running in parallel on the indexers (if you are clustered).

stats is a fast command; eventstats and anything involving subsearches can be slow. If stats is in the outer search, as in these examples, it will be fast, but the overall time for the two searches may still be slow.

Exporting the dbxquery results to a CSV seems a sensible approach - I imagine your vulnerability list covers only a small subset of the hosts, so it seems overkill to fetch every host for each result.

Creating a lookup (option 4) also means it will be distributed to the indexers, so you get the additional performance benefit of the lookup running on the indexers, assuming you have not done something that forces the data to the search head. Yes, it will scale - there are some constraints on how a lookup works once it exceeds a certain size (25MB, I think), but it still works well.

Hope this helps.

LearningGuy
Builder

What is the maximum CSV file size I can upload as a Splunk lookup?

Thanks


yuanliu
SplunkTrust

Do you mean maximum file size?  I do not think there is a limit on the number of files. (There might be a practical limit, but it probably will not have a practical impact.)

File size can be limited by available RAM and other factors. (I had to learn this the hard way.)  See the [lookup] section in limits.conf.
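For reference, the setting that usually matters is max_memtable_bytes; lookup files larger than this are indexed on disk instead of being held in memory, so they still work but more slowly. A sketch showing the default as I understand it (about 25 MB in recent versions):

# limits.conf
[lookup]
# lookup files larger than this are indexed on disk rather than kept in memory
max_memtable_bytes = 26214400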


yuanliu
SplunkTrust

Two alternatives.

  1. Instead of using dbxquery to create a lookup table, you can simply Create an external lookup.
  2. Use the map command, like this (note the escaped quotes and the quoting of $ip_address$ in the SQL):

index=vulnerability_index
| table ip_address, vulnerability, score
| map search="| dbxquery query=\"select * from tableCompany where ip_address='$ip_address$'\""

yuanliu
SplunkTrust
Two alternatives.

I also want to correct myself: there is the Splunk DB Connect app, a third option.  If your database is supported, you may be in luck.

LearningGuy
Builder

Hello,
1. It seems like an external lookup is a viable solution; I am still trying to find a good example so that I can test it.
     I read the instructions you sent and am still wondering how to build an external lookup from my dbxquery (or the CSV file it produces) and then correlate it with the index.
I would appreciate it if you could give an example.

2. It gave me an error when I ran your map command: "The search result count (727) exceeds maximum (10), using max. To override it, set maxsearches appropriately."

Thank you for your help


yuanliu
SplunkTrust

2. It gave me an error when I ran your command using map. The search result count (727) exceeds maximum (10), using max. To override it, set maxsearches appropriately.

As the message (and the documentation) suggests, the default maxsearches for the map command is 10.  You can explicitly specify how many searches to allow. In your example, you could say something like

| map maxsearches=1000 search="| dbxquery query=\"select * from tableCompany where ip_address='$ip_address$'\""

Like @bowesmana notes, map can be expensive.

Meanwhile, you cannot directly use dbxquery in an external lookup.  You need to write a script that accepts the specified arguments and outputs data in a format suitable for an external lookup.  The document describes the basic structure in Python, although you can adapt it to other languages.  Every Splunk installation also ships with a script, $SPLUNK_HOME/system/bin/external_lookup.py, and an external lookup called dnslookup.  You can study that simple script to get a basic feel for it. (That script uses socket.gethostbyname_ex(host) for IP lookups and socket.gethostbyaddr(ip) for hostname lookups.  You would use the corresponding DB interface methods to execute the external query.)
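For orientation, an external lookup is wired up in transforms.conf roughly like this (a sketch; the stanza name and script name are placeholders, and the script that queries the database still has to be written):

# transforms.conf
[company_db_lookup]
external_cmd = company_lookup.py ip_address company location
fields_list = ip_address, company, location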

bowesmana
SplunkTrust

Don't forget, map runs serially and will run once per input result - in this case, once per vulnerability found.
