
Avoid Duplicate Events Using Splunk DB Connect

swathis
Explorer

Hi,
I am using Splunk DB Connect to get data from MySQL into my Splunk server, taking a full dump each time. Say I have 20 rows in MySQL: when I first add the data to Splunk, I get 20 events. My database gets updated three times a day, so I set the cron interval to 6 hours, but when Splunk fetches the data after 6 hours it takes all 20 rows again and adds them to the server. Now I have 40 events in total, where I should have just the latest 20.
My second issue is similar: one of the tables in MySQL is deleted and its rows re-inserted every 6 hours. Say it has 10 rows. Splunk gets all the rows the first time, but after the next run I have something like 12 or 13 events where there should be just the latest 10.
I want Splunk to hold exactly the same data as MySQL. Please let me know how I can solve this. I checked my inputs.conf and there are no duplicate entries.
I added the connection using Manager > External Databases.

Manvir
New Member

I think you should simply clean the corresponding index. Cleaning the index removes all existing events, but Splunk will automatically re-index the current data from the data source if the input is configured to pick up data continuously.
I had this problem with a CSV file as the data source, and this approach worked for me.

To clean the index, see http://www.manvir.net/how-to-remove-the-events-from-splunk/


jpass
Contributor

I have another way to do this without writing custom scripts. I haven't tried it myself, and personally I'd use a different method, but since it seems you haven't found a suitable answer yet, let me add this idea; I don't see why it wouldn't work.

Step 1: Create your database input using Splunk DB Connect and schedule it to run three times per day, importing the entire data set each time.

Step 2: Create a saved search that deletes the data in the index the Splunk DB Connect input writes to. The saved search would be set to run over all time and would be: index=myindex | delete

This saved search needs to be scheduled to run before the DB input runs.
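For reference, here is a sketch of what the savedsearches.conf stanza for that scheduled delete might look like. The stanza name, index, and cron schedule below are placeholders, not from the original post; time the schedule so it fires shortly before the DB input runs:

# placeholder stanza name, index, and schedule
[delete_db_index_events]
search = index=myindex | delete
enableSched = 1
cron_schedule = 50 7,15,23 * * *
dispatch.earliest_time = 0
dispatch.latest_time = now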

Step 3: I'd create a special user specifically for this purpose; obviously, don't share the search with other users, and don't grant the 'delete' capability to normal users.

Step 4: Optionally, you might want to periodically 'clean' the index because the delete command doesn't reclaim disk space.

Read about the delete command here:
http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Delete

To clean the index you will have to shut down Splunk and run the following at the command line (I'm on Linux, but I imagine the process is similar on Windows):

First, issue this command at the command line to stop Splunk:
sudo -H -u splunkuser /$splunk_home$/bin/splunk stop

Next, use the clean command to remove all events from an index:
sudo -H -u splunkuser /$splunk_home$/bin/splunk clean eventdata -index yourindexname

Last, restart Splunk:
sudo -H -u splunkuser /$splunk_home$/bin/splunk start

$splunk_home$ refers to your own Splunk installation path.

IMPORTANT: Notice the 'sudo -H -u splunkuser' part.
Depending on your configuration, you might have to run Splunk as a specific user. I do, and 'splunkuser' is the system username I've dedicated to Splunk. It may depend on your own circumstances, but if I start Splunk as root or another user, Splunk file ownership and permissions change and Splunk Web runs into many problems. Unfortunately, I learned this by experience: I started Splunk as root, and to fix the problem I had to shut Splunk down and change the owner of every file in the Splunk directory back to 'splunkuser'.

spandal
New Member

Correct me if I am wrong:
is the Python script above used to index data? And do we have to write a script in Python to connect to the DB?
I am new to Python too 😞


xchang1226
Path Finder

No indexing at all.

You either connect to the DB directly in Python or, as I showed there, use subprocess to create a child process that connects to the DB. That child process can be in any language of your choice; I use Java.


xchang1226
Path Finder

With this script, Splunk will NOT connect to MySQL; the script will. Splunk will NOT index any data and will NOT keep any data locally. If the data is not indexed, you will not have a duplicate-events issue.


spandal
New Member

I have exactly the same issue with Splunk, and I have one doubt about the last post:
"Assuming you already have a script or program that can get data from DB, the following code should be able to get you going."
Splunk will connect to MySQL and keep the data locally, so why do we need to write another script that retrieves data from the DB?


xchang1226
Path Finder

Assuming you already have a script or program that can get data from DB, the following code should be able to get you going.


import splunk.Intersplunk
import subprocess
import os

results = []

# Spawn the external script/program (any language) that actually queries
# the database; it should print one comma-separated row per line to stdout.
proc = subprocess.Popen(['script_dir' + os.path.sep + 'script_name',
                         'any', 'number', 'of', 'arguments'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# Read rows until the first empty line (or EOF).
while True:
    output = proc.stdout.readline().strip()
    if output != '':
        # Turn each CSV line into a result row with named fields.
        arr = output.split(',')
        row = {}
        row['field_1'] = arr[0]
        row['field_2'] = arr[1]
        results.append(row)
    else:
        break

# Return the rows to Splunk as search results; nothing is indexed.
splunk.Intersplunk.outputResults(results)
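To call the script from the search bar, it also needs to be registered as a custom search command in commands.conf. A minimal sketch, assuming (as placeholders of mine, not from the original post) the script is saved as dbfetch.py in your app's bin directory:

# hypothetical command name and filename
[dbfetch]
filename = dbfetch.py
generating = true

You would then run it in a search as: | dbfetch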


jpass
Contributor

I had a similar issue and solved it differently. Maybe this would help someone:

- Instead of indexing the data, I created a lookup table.
- The problem is that I had to use the dbquery command but didn't want to give users access to that command.
- So I created a saved search using the dbquery command that writes a CSV file via the 'outputlookup' command (see the sketch after this list).
- The saved search runs every 4 hours and rewrites the CSV file, which is just a copy of the small database table.
- My users access the data through a dashboard, so they aren't required to learn any syntax relating to lookup tables.
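A minimal sketch of that saved search, assuming the DB Connect 1.x 'dbquery' syntax; the database, table, and lookup names here are placeholders:

| dbquery mydatabase "SELECT * FROM mytable" | outputlookup mytable_copy.csv

Scheduled every 4 hours, this rewrites mytable_copy.csv with a fresh copy of the table, so dashboards reading the lookup always see the current database contents.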

xchang1226
Path Finder

You will need to create a custom search command that queries your database. I just followed the official doc here.

http://docs.splunk.com/Documentation/Splunk/latest/AdvancedDev/SearchScripts

If, like me, you are not familiar with Python, you can use Python's subprocess module to call another script/program to query the database.


swathis
Explorer

Thank you so much. It would be great if you could give me sample code. My requirement is just to get all the data from MySQL to the Splunk server, and at any point in time it should hold the same data as MySQL.


xchang1226
Path Finder

I would try one of these two things:
1. For such a small DB table, I would write a custom search command instead of using the DB Connect app. A custom search command means you don't index any data, so you always get the latest data from the DB, and there are no duplicate events because there are no events at all, just results.
2. If you have to index the data: I had a similar issue recently with a scripted input, where the same data could be returned by the script again and again. This is what I did:
your_search | eventstats max(_time) as maxtime | eval diff=(maxtime - _time) | where diff < 300
Here, eventstats adds a new column "maxtime" containing the timestamp of the latest collection, and any row more than 5 minutes (300 seconds) older than that latest collection time is excluded. This way, you pretty much always see only the latest data.


jpass
Contributor

FYI, Splunk DB Connect has a search command called 'dbquery'.


swathis
Explorer

Thanks for your response. Here I was just giving an example; in reality we have tables with 84,000 rows and some tables with just 10 rows. I am looking for a solution where Splunk has the same data as MySQL. Can you please explain how a dump works in Splunk? I want to avoid duplicates at the data level, not at the query level, because I've now set the cron to 15 minutes, which is how often some of the tables in MySQL get updated. The Splunk data is growing n times over with every run, which is not good.
