Best way to get the latest data from csv file that gets re-indexed with no fixed schedule

termcap · ‎01-23-2021

I am monitoring a CSV file and building a dashboard from it. The file may be modified many times a day, or not at all for many days.

Rows are not only added to the file but also removed, and existing contents are edited, which causes Splunk to re-index the whole file.

I can't use the latest command, because latest will also return lines that the user has since deleted, and I can't set a time range, because I do not know when the file was last changed. Because of these issues my dashboard will always be inconsistent.

The only ways I can think of to solve this problem are:

1. Somehow get all old events for the file removed from Splunk, then index the file again and run queries with "All Time".

2. Write a script that reads the CSV, appends a date-time field, and re-creates a new CSV, which is the one being monitored. Do this every 15 minutes, so that the Splunk monitor re-indexes the whole file every 15 minutes. (But this causes another issue: if the user refreshes the dashboard, they can get partial or extra data depending on when they open the dashboard relative to the 15-minute cycle.)
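For what it's worth, option 2 could look roughly like the sketch below. It is only an illustration: the function name `stamp_csv`, the column name `snapshot_time`, and the file paths are made up for the example, and it assumes the first row of the CSV is a header. Writing to a temporary file and swapping it in with `os.replace` means the monitored path never holds a half-written file, though it does not by itself control when Splunk picks the new file up.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone


def stamp_csv(src_path, dst_path, field="snapshot_time", now=None):
    """Copy src_path to dst_path, adding the same snapshot timestamp to every row.

    All names here are hypothetical; dst_path stands for the file Splunk monitors.
    """
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%S%z")
    dst_dir = os.path.dirname(os.path.abspath(dst_path))
    # The temporary file lives next to the target so os.replace stays on one
    # filesystem. Assumption: the monitor input points at the single file,
    # not at the whole directory, so the temporary file is not picked up.
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as dst, open(src_path, newline="") as src:
            reader = csv.reader(src)
            writer = csv.writer(dst)
            header = next(reader, None)
            if header is not None:
                writer.writerow(header + [field])
                for row in reader:
                    writer.writerow(row + [stamp])
        # Swap the finished file into place in one step.
        os.replace(tmp_path, dst_path)
    except BaseException:
        # Clean up the temporary file if anything failed before the swap.
        os.unlink(tmp_path)
        raise
    return stamp
```

Because every row written by one run carries the same `snapshot_time`, a dashboard search could keep only the rows whose `snapshot_time` equals the maximum value instead of relying on a fixed time range.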

Any ideas on how this problem can be solved in a more elegant manner?

gcusello · ‎01-23-2021

Hi @termcap,

I encountered this problem in one of my projects and solved it with a search that I put in a macro, to avoid repeating the whole search in every panel.

In my situation I have six data flows, all in the same index and each with a different sourcetype. They arrive at different intervals (two every day, one weekly, one every 14 days, two monthly).

So this is the macro that I call in my dashboard panel, passing the sourcetype to it as a parameter:

index=my_index sourcetype="$sourcetype$" [ | metasearch index=my_index sourcetype="$sourcetype$" earliest=-31d@d latest=now | head 1 | eval earliest=relative_time(_time,"-1h"), latest=relative_time(_time,"+1h") | fields earliest latest ]

For your needs, you can shorten the earliest time in the subsearch, e.g. to 24 hours or less.

In a few words: the subsearch tells the main search the timestamp of the most recent arrival.

Ciao.

Giuseppe

termcap · ‎01-24-2021

Hi @gcusello,

This is a good solution, but my problem is compounded by the fact that Splunk will only index the lines that were added to the CSV (unless it's edited in such a way that Splunk is forced to send the full file).

In this case I will not get the contents of the file that were sent before the earliest time set by the subsearch.

spammenot66 · ‎01-24-2021

@termcap the original question asks "how to get the latest data", right? @gcusello provided the right solution. If you're trying to do something more, it should be stated in the question.

Having said that, what exactly are you trying to solve for in this specific scenario? It's obviously not just checking for the latest values from the CSV file.

spammenot66 · ‎01-24-2021

@gcusello just curious, why are you setting the earliest and latest values in this line? Is metadata really attributed to time?

| metasearch index=my_index sourcetype="$sourcetype$" earliest=-31d@d latest=now

@termcap, FYI, I would recommend the same solution as well.

termcap · ‎01-24-2021

@spammenot66 the solution provided by @gcusello solves my problem, but only partially, because Splunk does not re-index all the contents of the file when the file is changed at the end. Splunk will only index the last line.

In this case I will get the last changed line of the file if it falls within the time frame set in the subsearch, but I won't get the older lines, which are still "latest" from the point of view of the CSV. I have no way to know how far back I should go to get those lines, and even then I don't know whether I have all the lines of the file as it exists now or whether some are still missing.

Plus there is always the chance of deleted lines being returned: those deleted from the CSV but still existing in the index.

The solution I've arrived at is to re-index the whole file whenever its modification timestamp changes, with DATETIME_CONFIG set to CURRENT, as suggested by @manjunathmeti.

gcusello · ‎01-24-2021

Hi @termcap,

In this case, you could take a different approach:

Schedule a search (e.g. every hour, or every ten minutes) that takes all the values you need, dedups them, and saves the results in a summary index or in a lookup.

This way you're sure to have the updated values in the lookup or summary index, and you can run your searches against it (using my query if it is a summary index). Obviously, the data is only as fresh as the frequency of your scheduled search.

Ciao.

Giuseppe

gcusello · ‎01-24-2021

Hi @spammenot66,

I needed this subsearch to define the time boundaries for the main search, because my sources are uploaded at different frequencies and, in addition, I wasn't sure that those frequencies were respected!

So I created that subsearch (which is very quick) to define the time boundaries for the main search: once I have the timestamp of the last indexed event, I take all the events indexed within plus or minus 1h of that timestamp.

Ciao.

Giuseppe

manjunathmeti · ‎01-23-2021

Hi @termcap,

You can set DATETIME_CONFIG = CURRENT for the monitor input in props.conf, which sets the _time of each event to the system time at which it is indexed, rather than to a timestamp parsed from the event. Then you can filter the events on the latest _time.

<base_search> | eventstats latest(_time) as timestamp | where _time=timestamp

If this reply helps you, an upvote/like would be appreciated.

termcap · ‎01-23-2021

@manjunathmeti this sounds like a good solution, but I have some reservations.

If I have understood the solution correctly, it should work if the whole file is indexed every time. But if the user just adds a new line to the file, then only that line will be indexed, and when I run the query based on latest I will end up with just the newly added line?

As I understand it, deleting the last line of the file causes no change to the index, so running the latest query will still return a line that no longer exists in the actual CSV file held by the user.
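The two reservations above can be illustrated with a small toy model. This is only a sketch of the behaviour as described in this post, not of Splunk's actual file tracking: the `Event` class and the functions `index_changes` and `latest_only` are invented for the example, and integer times stand in for `_time`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: int  # stand-in for the _time assigned when the line was indexed
    raw: str   # one CSV line


def index_changes(index, old_lines, new_lines, now):
    """Toy monitor: decide which lines of the changed file get indexed."""
    if new_lines[:len(old_lines)] == old_lines:
        fresh = new_lines[len(old_lines):]  # pure append: only the new tail
    elif old_lines[:len(new_lines)] == new_lines:
        fresh = []                          # lines deleted from the end: nothing new
    else:
        fresh = new_lines                   # edited in place: whole file re-read
    index.extend(Event(now, line) for line in fresh)


def latest_only(index):
    """Keep only the events that share the newest time (the 'latest' filter)."""
    if not index:
        return []
    newest = max(e.time for e in index)
    return [e.raw for e in index if e.time == newest]


index = []
v1 = ["a,1", "b,2"]
index_changes(index, [], v1, now=100)

v2 = v1 + ["c,3"]            # the user appends one line
index_changes(index, v1, v2, now=200)
print(latest_only(index))    # ['c,3']  (the older lines are filtered out)

v3 = ["a,1", "b,2"]          # the user deletes the last line again
index_changes(index, v2, v3, now=300)
print(latest_only(index))    # ['c,3']  (a line that no longer exists in the file)
```

In this model the filter only returns the current file contents when every change leads to a full re-read, which is the case the last branch of `index_changes` represents.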

manjunathmeti · ‎01-23-2021

You can re-index the file every time it is updated. Set CHECK_METHOD to either entire_md5 or modtime for the source file path in props.conf on the forwarder. Note that CHECK_METHOD should be configured for the source only.

[source::/path/to/file]
CHECK_METHOD = modtime
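To make the difference between the two options concrete, here is a rough sketch of the two kinds of change check. It is not how Splunk implements CHECK_METHOD; the function names `file_signature` and `has_changed` are invented, and the code only shows the general idea of comparing a modification time versus a hash of the whole file.

```python
import hashlib
import os


def file_signature(path, method="modtime"):
    """Return a value that differs whenever the file counts as changed."""
    if method == "modtime":
        # Changes whenever the file is written or touched.
        return os.stat(path).st_mtime_ns
    if method == "entire_md5":
        # Changes only when the bytes of the file change.
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()
    raise ValueError(f"unknown method: {method}")


def has_changed(path, previous, method="modtime"):
    """Compare the current signature with a stored one; return (changed, current)."""
    current = file_signature(path, method)
    return current != previous, current
```

Under a modification-time check, merely saving the file without changes counts as an update. Under a whole-file hash only a real content change does, at the cost of reading the entire file each time.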
