Splunk Search

Update CSV source without duplicates

arusoft
Communicator

I have CSV data (a source .csv file with sourcetype=csv) which I need to update every week. The problem is that I might get duplicate data, and I don't want that. What is the best way to avoid this?

I am thinking I might have to create a staging index every time, use it to compare against the actual index, insert only the new records from there, and then delete the staging index. But even for that, I am not sure how I can copy partial data from one index to another.

Thank You.

0 Karma

yuanliu
SplunkTrust

I have a very similar situation with a CSV source that I wish could be solved by a staging index. (To answer richgalloway's question: yes, my intention is to include historic records in analytics.) My problem is not in-file duplicates, though. My source is so laden with overlaps from the past that I am using MS Excel to remove any entries from prior weeks before ingestion.

But if you can use a staging index, @richgalloway's method will work for you, like this:

  1. Ingest the original CSV into a staging index.
  2. Produce deduped.csv:

 

index=staging <search for indexed events>
| dedup <some field(s)>
| outputcsv deduped.csv

(See outputlookup and outputcsv.)

 

  3. Ingest deduped.csv into your main index.
  4. Delete the staging index (see the note below).
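
A note on step 4: deleting the staging data usually means either cleaning the index from the CLI or, if your role has the can_delete capability, piping a search to the delete command. Keep in mind that delete only hides events from searches; it does not reclaim disk space. A minimal sketch against the staging index used above:

index=staging
| delete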

 

0 Karma

richgalloway
SplunkTrust

The typical process for updating a CSV does not require an index. Read the existing CSV, then read the new data. Use the dedup command to remove duplicates. Write the results back to the CSV.

---
If this reply helps you, Karma would be appreciated.
0 Karma

arusoft
Communicator

Thanks @richgalloway, but sorry, I didn't understand your logic. I am doing it via an index, not a lookup. Are you suggesting not using an index with the CSV and using a lookup instead? Can you please give the steps for your suggestion?

My end goal is to have an index that stores values from a CSV file that will keep getting updated every week. My initial CSV file will start with a million records, and then every week it will add thousands more. There is a possibility that it will contain duplicate records (ones that were already loaded earlier), which I want to avoid because of the unnecessary waste of space. I know I can easily accept duplicates and use dedup.

0 Karma

richgalloway
SplunkTrust

Here are the steps in SPL format:

index=foo <search for indexed events>
| inputlookup append=t mycsv.csv
| dedup <some field(s)>
| outputlookup mycsv.csv

Why index the CSV file?  About the only good reason for that is to have a historical record of changes to the CSV over time.  Otherwise, keep the data in the CSV and save index space.

---
If this reply helps you, Karma would be appreciated.

PickleRick
SplunkTrust

Sometimes it's easier to update a lookup via a temporary index. You don't have to worry about users' permissions, for example, and you don't have to teach people to upload the CSV and check that everything is OK. You also have the possibility of doing some transformations on the data.

I do that in one of my environments. The users prepare a CSV and put it on a server. From that server I ingest it with a UF into a temporary index. I have a scheduled search which rebuilds a lookup from that data if there are events newer than a timestamp stored within the lookup.

Works like a charm.
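
As a rough sketch of that kind of scheduled search (all index, lookup, and field names below are placeholders), a simplified variant merges the newly ingested rows into the lookup and writes it back:

index=temp_csv_index earliest=-7d@d ``` only look at rows ingested since the last weekly drop ```
| eval last_seen=_time
| table field1 field2 field3 last_seen
| inputlookup append=t user_supplied.csv ``` add the rows already in the lookup ```
| dedup field1 field2 field3 ``` rows coming from the index win over older lookup rows ```
| outputlookup user_supplied.csv

Because dedup keeps the first occurrence, rows read from the temporary index take precedence over the rows appended from the existing lookup.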

0 Karma

arusoft
Communicator

@richgalloway @PickleRick, I am afraid to use a lookup file. I thought there are limitations on how much data a lookup can hold, and I am looking to capture millions of records. Also, like I mentioned, this will keep getting updated every week. This is the reason I would like to put it in an index.

Sorry, but it looks like my question is not clear, so let me try asking a bit differently. How can I update my index by manually uploading a CSV file that may contain records that are already in the index?

This is what my query looks like:

index="XXX" host="xxx.yyy.com" sourcetype="csv" source="myFile.csv"

All I want to do is update this index every week from the myFile.csv file. The only problem is that there can be some records that are already in the index, and I am trying to find a way to avoid duplicates in the first place. I know I can dedup, but I am trying to see if there is any way I can avoid the duplicates entirely.

0 Karma

PickleRick
SplunkTrust

Indeed, a lookup is not as efficient to search as an index.

I was only responding to @richgalloway to show that there are some use cases where using an index makes sense 🙂

About your original question: Splunk is not a "database" which holds the current state of something. Splunk ingests data points (typically logs, but you can of course index any data) and lets you analyze them.

Therefore your intermediate index approach would have two caveats:

1) Increased license consumption (the increase depends on the ratio of changed data vs total set)

2) Your data in the "destination" index would eventually expire. What then?

In general, Splunk is a great tool for holding events, not the state of something (there is a subtle difference). Hence the idea of holding the data in a lookup.

There is no way to detect duplicates during the indexing process. There are some mechanisms on various inputs that avoid processing the same data more than once, but they work by simply remembering which data on the input side they have already processed. If they mistakenly decide that, for example, a file has not been read yet, it gets re-read and re-indexed. Nothing on the indexer side can prevent that.

So if you wanted to make sure you only ingest "new" records from the CSV file, you'd have to prepare some kind of script which deduplicates the data before ingesting, and only upload the resulting subset of your original file to Splunk.

arusoft
Communicator

Thanks @PickleRick for explaining that nicely. I was not looking to detect duplicates during the indexing process, and I am fine with the points you mentioned. Having said that, if I create a staging index, I am wondering what the Splunk query would look like that inserts the non-duplicate data into the main index.

Basically I am looking at copying partial data from Index B into Index A based on some query condition (one that would only bring over data from Index B that is not already in Index A).

0 Karma

johnhuang
Motivator

1. What does the data look like? Could you provide some samples?

2. How large are the weekly updates?

3. Are the updates mostly incremental data with incidental dupes, or are they cumulative, i.e. are you loading the entire dataset each time?

4. Is it necessary to search the entire index for dupes, or can it be limited to data indexed within the last x days?

0 Karma

PickleRick
SplunkTrust

That's tricky, because the most obvious solution (doing a subsearch and adding a NOT clause) is simply wrong: not only is it highly inefficient, but with your dataset size it will surely hit the subsearch limits.

You'd have to approach it differently.

If you have a single identifier field which could distinguish your events, you could do something like

(index=main_index earliest=0) OR (index=temporary_index earliest=-1d)
| stats values(_raw) as _raw values(index) as index count by id
| where count=1 AND index="temporary_index"
| fields _raw
| collect index=main_index sourcetype=my_sourcetype

Otherwise you'd have to do a similar thing but split on _raw instead of an id field, and you'll probably have to manipulate _raw if it contains a timestamp, so that two occurrences of the same data point do not "differ" simply because they carry different timestamps.

Not a very pretty solution though, I admit. It might also hit some limits with such a search, but those should not be silent, unlike with a subsearch, where hitting a limit means the subsearch returns only the results it has gathered so far without any notice to the outer search.
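
For illustration, a rough sketch of that _raw-based variant (the leading-timestamp pattern and the sourcetype are assumptions you would adjust to your data):

(index=main_index earliest=0) OR (index=temporary_index earliest=-1d)
| eval raw_key=replace(_raw, "^[^,]+,", "") ``` comparison key: _raw with an assumed leading timestamp field stripped ```
| stats first(_raw) as _raw dc(index) as idx_count values(index) as index by raw_key
| where idx_count=1 AND index="temporary_index" ``` keep keys that exist only in the temporary index ```
| fields _raw
| collect index=main_index sourcetype=my_sourcetype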

arusoft
Communicator

Thanks @PickleRick. Unfortunately I don't have a single identifier field. I agree this is not the best way to handle this. Having said that, what are my options? Just accept duplicates and then dedup?

0 Karma

PickleRick
SplunkTrust

Unfortunately, your use case is simply tough. Dedup might not do either. It's inadvisable to run dedup or uniq on a dataset with long fields and high cardinality (like several million _raw events). Typically, with big sets of data, you'd want to reduce the load by calculating some form of hash from the raw event, storing it in indexed form (so you don't have to recalculate it each time), and relying mostly on hash comparison, which is very cheap compared to string comparison. In fact, you could implement something like that with Splunk, but it would not be very pretty either (and you'd have to store the hash with the event in order to speed things up).

As I said, you can sort and count, and you can dedup/uniq, but it simply won't be very efficient.

0 Karma

johnhuang
Motivator

Just throwing out some ideas (some have already been mentioned):

1. Consider using a KV Store which can scale up to multiple GB in size and provides the ability to insert and update records.

2. Staging via SQL DB -> Splunk

3. Staging via a temporary Splunk index -> Summary Index

One way to filter out dupes:

 

 

(index=staging_index source=x host=x) OR (index=summary_index source=x host=x)
| eval sort_order=case(index="staging_index", 1, index="summary_index", 0)
| sort 0 sort_order -_time
| dedup _time field1 field2 field3 etc
| where sort_order=1
| collect index=summary_index

 

 


Or have a scheduled job that calculates the MD5 hash of every record in the summary index and saves it to a CSV or KV store lookup, which can then be used to filter out dupes during the staging process (see the sketch at the end of this reply).

 

 

index=summary_index
| eval record_md5=md5(_raw)
| append [| inputlookup completed_hash_record_history.csv] 
| dedup record_md5
| outputlookup completed_hash_record_history.csv

 

 

4. Create a scheduled job that finds dupes and pipes them to delete.
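
Building on the hash-lookup idea above, the staging-time filter might look roughly like this (a sketch reusing the placeholder index, source, and lookup names from the earlier examples, and assuming completed_hash_record_history.csv is available as a lookup table file):

index=staging_index source=x host=x
| eval record_md5=md5(_raw)
| lookup completed_hash_record_history.csv record_md5 OUTPUT record_md5 as already_seen
| where isnull(already_seen) ``` keep only records whose hash is not yet recorded ```
| collect index=summary_index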

PickleRick
SplunkTrust

Just remember that with hashes you can't rely only on the hash value (because of the possibility of collision). So in case of two equal hashes one should always compare the actual values.
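
For example (hypothetical index and field names), a quick check for candidate collisions is to look for hashes that map to more than one distinct raw value:

index=summary_index
| eval record_md5=md5(_raw)
| stats dc(_raw) as distinct_raw count by record_md5
| where distinct_raw > 1 ``` same hash but different raw events: these need an actual value comparison ```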

BTW, I'm wondering what staging through an RDBMS would accomplish here. It would also need to work hard to deduplicate. And, in the end, if the RDBMS did deduplicate the data, why load it into Splunk afterwards?

Maybe that's one of those few cases where Splunk is not the ideal solution. Yes, it can be done using Splunk, but not very efficiently and not very "pretty".

johnhuang
Motivator

Even as the weakest hash algorithm, MD5 is still relevant and has many practical use cases in solving real-world problems around data archiving and integrity.

Just to give you an idea of how difficult it is for a collision to occur: let's say your goal is to produce a collision with a mid-to-high-end GPU (e.g. an Nvidia 3070) churning MD5 hashes at a rate of 3.5 GH/s (3.5 billion hashes per second), or 2^35/s.

Your GPU will take 2^65/2^35 = 1,073,741,824 seconds, or 12,427 days, or about 34 years, to compute 50% of the MD5 space, meaning that after those years you'll have a 50% probability of having generated a collision.

In a real world scenario where you're using the hash to catalog data, "To have a 50% chance of any hash colliding with any other hash you need 2^64 hashes. This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years."

Now, if that still concerns you, you can use the much stronger SHA1, SHA256, or SHA512 hashing algorithms, which are supported by Splunk.

MD5 has the lowest overhead (CPU and storage).

As for the RDBMS, I work in an MSSQL shop, and since we've already built a number of processes using MSSQL -> Splunk, it is trivial to set up a new data source.

Of the various Splunk DB Connect feeds we have, one notable example I'd share is a feed we call "sp_get_splunk_general_feed".

Born out of laziness, this is a giant stored procedure which Splunk DB Connect executes every hour, ingesting whatever data it returns. In the stored procedure, I currently have around 15-20 SQL queries, each running on its own schedule and executing around the clock.

When I need to add another report, some enrichment data, etc., I can simply add a new SQL query to the stored procedure and walk away.

 

 

 

PickleRick
SplunkTrust

I know that it's very difficult to generate a collision for a given hash value. That's not the point.

The point is that, due to the nature of a hashing function, you never know whether you have hit a colliding value or not, and even though something seems improbable, it's not impossible. And the more the cardinality of your set grows, the higher the chance of hitting a collision.

So yes, you can make that assumption and work with it. True.

But proper practice when working with such functions is to verify: that way you indeed offload some work and speed up the process, but you still keep the assurance that the values are or are not unique.

Your reasoning is based on a flawed calculation. Since a hash is a one-way function, you don't know beforehand what part of the result space you'll land in. You can't "calculate half of the hashes". That's the point of a hashing function.

What you need is a Bernoulli scheme, which produces a binomial distribution for the number of "successes" in n trials with probability p. Yes, it still gives you a very low expected value under "practical" conditions, but it's a completely different thing.

johnhuang
Motivator

@PickleRick 

I think we're starting to go off topic. The OP's dataset is relatively small, and this is a practical solution. For the use cases in which I've implemented hashing, it was the most optimal approach and has worked well in my experience.

If MD5 is not good enough, Splunk also supports stronger hashing functions (SHA1, SHA256, and SHA512), with some trade-off in performance and storage.

That said, if your use case requires absolute assurance and guarantee of data integrity, you should be using an ACID database and not Splunk for this.

@PickleRick wrote: "Your reasoning is based on a flawed calculation. Since a hash is a one-way function, you don't know beforehand what part of the result space you'll land in. You can't 'calculate half of the hashes'. That's the point of a hashing function."

That's indeed true for a sample size of one. If you have a large enough sample size, it will average to 50%.

 

0 Karma

PickleRick
SplunkTrust

Yep, we're drifting way away from the main topic 🙂

Let me point out that there's no such thing as "most optimal". Either something is optimal or it isn't.

The ACID-ity, for the most part, has nothing to do with the OP's problem.

MD5 is not an injective function.

Even in the general case of any hashing algorithm, it is proper practice to check the actual value. Although the probability of a collision is indeed very low, you need to cover much less than half of the space of possible hash values to "likely" hit a collision: by the birthday bound it's about the square root of 2^n, i.e. roughly 2^64 hashes for a 128-bit hash like MD5.

So let's leave this topic alone for now 🙂

0 Karma

arusoft
Communicator

Thanks @PickleRick @johnhuang @yuanliu @richgalloway. It looks like this discussion is a good one where someone from Splunk should step in and give their input too.

Or you guys tell me which solution I should accept 🙂 and I will accept that and close this. Honestly, I am not seeing a nice way to do this.

0 Karma

richgalloway
SplunkTrust

If you don't have a working solution then I think it's fair to leave the question unresolved.

---
If this reply helps you, Karma would be appreciated.