Splunk eats machine data for breakfast, but many of us are using data in Splunk that doesn't come from a machine and isn't easily event-ized.
What are some best practices for dealing with high-volume data from snowflake schemas? This data may change frequently, isn't broken into events, and sometimes requires complex SQL to distill it into events.
Best practices for using DB Connect are most welcome.
Here are my best practices for DB Connect.
Do not use v1.
Try to use v3 but expect many problems, some of them insurmountable.
Trust v2, but beware of a hardcoded limit on the number of rows dbxquery returns, which you need to fix (https://answers.splunk.com/answers/233222/splunk-db-connect-2-dbxquery-only-returns-1001-row.html).
Use checkpoints, but try not to use timestamps for this.
Do as much work as possible in SQL (on the DB side).
Don't ingest more than you need; make sure you limit the fields returned.
If things are overly complex, consider creating a custom view inside of your DB and query against that instead of the raw table.
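To illustrate the checkpoint and field-limiting points above, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for the real database. It is not DB Connect's own implementation. The table orders_log, its columns, and the helper fetch_new_rows are hypothetical names made up for this example. It checkpoints on an auto-incrementing row ID rather than a timestamp, on the assumption that the concern with timestamps is that they are not guaranteed to be unique or strictly increasing.

```python
import sqlite3


def fetch_new_rows(conn, checkpoint, batch_size=1000):
    """Return rows whose row_id is above the checkpoint, plus the new checkpoint.

    Only the fields needed downstream are selected (the notes column is skipped).
    """
    rows = conn.execute(
        "SELECT row_id, entity, status FROM orders_log "
        "WHERE row_id > ? ORDER BY row_id LIMIT ?",
        (checkpoint, batch_size),
    ).fetchall()
    # Advance the checkpoint to the highest row_id seen; keep it if nothing new.
    new_checkpoint = rows[-1][0] if rows else checkpoint
    return rows, new_checkpoint


# Hypothetical table standing in for the real source database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders_log ("
    "row_id INTEGER PRIMARY KEY AUTOINCREMENT, entity TEXT, status TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO orders_log (entity, status, notes) VALUES (?, ?, ?)",
    [("X", "new", "n/a"), ("Y", "new", "n/a"), ("X", "shipped", "n/a")],
)

first_batch, checkpoint = fetch_new_rows(conn, 0, batch_size=2)
second_batch, checkpoint = fetch_new_rows(conn, checkpoint, batch_size=2)
```

Each poll picks up strictly after the last row it saw, so no row is read twice and none is skipped, even if several rows share the same timestamp.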
@SloshBurch, we need a validated_best_practice in this area.
You rang? lol
I guess I want to know more about the situation here. I'm not familiar enough with database data that has changing schema. I need to appreciate that to get my head around the challenge.
It's not the schema that is changing, it's the data. Updates and deletes are not Splunk-friendly. If I've already indexed an event related to entity X and then something about X changes, I need to index a new event for entity X. The old one isn't relevant anymore for most purposes. That means that either users have to search for the latest version of that event, or I need to find a way to delete the old version that is out of date.
Sometimes I use lookup tables instead of indexes. I've also looked at using scheduled searches to do the heavy lifting of finding the latest version of each entity and then having dashboards use loadjob. But end users trying to use the traditional "index=foo" in the search box can easily come up with incorrect conclusions.
Do you retain a timestamp as a field on each row that is inserted or deleted? If you do, DB Connect could follow that field as a cursor, using a query with an ORDER BY on that timestamp field. The data is then loaded into Splunk as a new event, and reporting on it uses latest() in a transforming statistics command.
I'm not sure at this time how to do it without that. I think your approach of using a lookup file to cache it is sound as well, but that obviously depends on the volume of data.
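As a rough illustration of the "latest version of each entity" idea discussed above, here is a small Python sketch of the reduction that something like "stats latest(status) by entity" performs. The field names entity, status, and updated_at and the helper latest_by_entity are hypothetical. It assumes every event carries a timestamp in one consistent ISO 8601 format, so plain string comparison orders them correctly.

```python
def latest_by_entity(events):
    """Keep only the most recent event per entity, judged by updated_at."""
    latest = {}
    for event in events:
        current = latest.get(event["entity"])
        # On a tie, the event that appears later in the input wins.
        if current is None or event["updated_at"] >= current["updated_at"]:
            latest[event["entity"]] = event
    return latest


# Hypothetical events: entity X was indexed once, then changed.
events = [
    {"entity": "X", "status": "new", "updated_at": "2024-01-01T10:00:00"},
    {"entity": "Y", "status": "new", "updated_at": "2024-01-01T10:05:00"},
    {"entity": "X", "status": "shipped", "updated_at": "2024-01-02T09:00:00"},
]

current_state = latest_by_entity(events)
```

The out-of-date event for X is still present in the input, just as it would be in the index. It is only filtered out at report time, which is why an ad hoc search that skips this step can reach the wrong conclusion.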
-- Do as much work as possible in SQL (on the DB side).
This is huge and applies to other software integrations with DBs.
For example, if you need a certain type of data-set, create a view that represents it and ingest that view, instead of ingesting the raw data and performing the joins within Splunk. In Hunk, with huge data-sets, these scenarios were nightmares until we created the proper views.
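A minimal sketch of that view-first approach, again using Python's standard-library sqlite3 module as a stand-in for the real database. The tables customers and orders and the view order_events are hypothetical names invented for the example. The join lives in the view, so the ingesting query stays trivial and returns one flat, event-like row per order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical normalized tables (a tiny piece of a snowflake schema).
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme', 'EU'), (2, 'Globex', 'US');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 1, 5.5);

    -- The join is done once, on the database side.
    CREATE VIEW order_events AS
    SELECT o.order_id, c.name AS customer, c.region, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id;
    """
)

# The ingesting query only names the view and the fields it actually needs.
events = conn.execute(
    "SELECT order_id, customer, amount FROM order_events ORDER BY order_id"
).fetchall()
```

If the underlying schema changes, only the view definition needs updating. The query on the ingesting side can stay the same.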