I need to compare two CSV lookup files - need to see which records that are in the first CSV are NOT already in the second, so I'm pretty sure join is my only option. While the first file typically has less than 100 records in it, the second file has 120,000 records in it (it's drawn from a database - and no, I can't just hook up Splunk to the database, I wish I could). I've raised the subsearch "maxout" and join "subsearch_maxout" to 150,000, restarted Splunk, and no joy.
I've been out of the Splunk game for a while, and this is my first foray into Splunk 7 (last version I worked with was 6.2), so I might just be missing something, either forgotten or new that I don't know about...
Sample search (yes, I know, it's kinda gross, especially the second join, but I gotta work with what I'm given...):
| inputlookup File1.csv | join type=outer ISBN [| inputlookup File2.csv | table TITLE, ISBN, AUTHOR] | join type=outer Title, Subtitle max=0 [| inputlookup File2.csv | dedup TITLE | rex field=TITLE "^(?<Title>(.+?(?=(, by|: | \[| \/|\.$|;))|^.+$))" | rex field=TITLE "(: |;)(?<SUBTITLE>(.+?(?=(, by| \[| \/|\.$)))|.+$)" | rex field=TITLE " \[(?<MEDIUM>.*)\] " | rex field=TITLE "((?i) by | \/|\.$|\[by\])(?<RESPSTATE>.*)$" | eval Subtitle=SUBTITLE]
I have to do two joins because File2.csv may not actually have the ISBN (yeah, a database with incomplete information, AWESOME), so then I have to compare against the title/subtitle to catch those that are in File2.csv without the ISBN.
I'd be happy to bypass join entirely, but I can't think of a good way to do that given what I have to work with...
If you are looking for entries that are present in File1.csv but not present in File2.csv, (and I assume that ISBNs here are globally unique) then this general structure should get you there:
| inputlookup File1.csv | eval first_file=1 | append [ | inputlookup File2.csv | eval second_file=1 ] | stats max(first_file) AS first_file max(second_file) AS second_file BY TITLE, ISBN, AUTHOR | where isnull(second_file) | rex field=TITLE "^(?<Title>(.+?(?=(, by|: | \[| \/|\.$|;))|^.+$))" | rex field=TITLE "(: |;)(?<SUBTITLE>(.+?(?=(, by| \[| \/|\.$)))|.+$)" | rex field=TITLE " \[(?<MEDIUM>.*)\] " | rex field=TITLE "((?i) by | \/|\.$|\[by\])(?<RESPSTATE>.*)$" | eval Subtitle=SUBTITLE
Unless I'm missing something, you don't need to do the
rex extractions in a separate subsearch and join them in. Also, you technically don't need the
eval first_file=1 from my example, but I put it in to demonstrate the framework. You can replace
| where isnull(second_file) with
| where isnull(first_file) if your requirements change and you need to find entries that are present in the second file but not in the first.
By the way, I tested this with some lookup files that contain >4m entries. So if it doesn't work for you, the issue may not be the output being truncated by
This definitely won't work as written - that's why there are rex's on File2.csv, because it is not in the same format as File1.csv (File2.csv has title, subtitle, and author conglomerated in the TITLE field, so there's no subtitle field in it to begin with; File1.csv has those three fields as separate fields). Couldn't do the stats command where it's written here because of that. But I might be able to tweak it...I'll let you know...
Yes, I had to do two joins because join matches on ALL fields that you tell it to match on, not ANY fields - that is, if I did
join Title, Subtitle, ISBN it would only match results where all three of those fields match in both files, and because file 2 has the title/subtitle/author conglomerated in an inconsistent fashion across records, my rex's only catch about 80% of the title/subtitle matches, and only about 50% of the records in file 2 actually have an ISBN at all (this is mostly due to people who manually entered the records into the database not using consistent formatting, but some are due to non-UTF-8 characters as well, which are treated differently by the two processes that generate these files). So we look for ISBN first because it's guaranteed to be unique and formatted consistently, then check any that don't match on ISBN against title and subtitle. And after that, I still have to check the remaining ones manually against the database that file 2 comes from because of that 80% match issue. But then I'm only checking 20 titles manually instead of 120, so it's definitely faster...at least when file 2 is less than 50k! LOL
Got it! Ok, so let's swap the order of the
inputlookups and do the rex extractions before adding in File1.csv. I'm swapping them just in case the amount of time taken to do the extractions would cause the subsearch to timeout.
| inputlookup File2.csv | eval second_file=1 | rex field=TITLE "^(?<Title>(.+?(?=(, by|: | \[| \/|\.$|;))|^.+$))" | rex field=TITLE "(: |;)(?<SUBTITLE>(.+?(?=(, by| \[| \/|\.$)))|.+$)" | rex field=TITLE " \[(?<MEDIUM>.*)\] " | rex field=TITLE "((?i) by | \/|\.$|\[by\])(?<RESPSTATE>.*)$" | eval Subtitle=SUBTITLE | append [ | inputlookup File1.csv | eval first_file=1 ] | stats max(first_file) AS first_file max(second_file) AS second_file max(Title) AS Title max(SUBTITLE) AS SUBTITLE max(MEDIUM) AS MEDIUM max(REPSTATE) AS REPSTATE max(Subtitle) AS Subtitle BY TITLE, ISBN, AUTHOR | where isnull(second_file)
I'm not 100% confident about this, because I don't know the purpose of having both
Subtitle, but your original code certainly made a point of creating the mixed-case version, so I tried to advance/preserve that.
Having both the all-caps version and the camelcase versions of title and subtitle helps in the downstream process where I have to do the final manual check on the records that didn't match on either ISBN or title/subtitle. I dump it into Excel and use conditional formatting to highlight records - we sometimes have records that match on title/subtitle, but the authors are different because one is, say, 2nd edition and the other is 5th edition. Or title matches, but subtitle doesn't, but it's the same subject matter.
(We're picking out books to buy at a library. The first file is a list of new books published in the last 5 years, the second file is from our library catalog. So if we have fifteen books that are the same subject matter as one of the newly-published books, we might skip buying that book in favor of a new book that we only have three similar books. Or we might skip getting an updated edition of a book we already have in favor of getting another book we don't have.)
Well, good news and bad news. LOL
I was still running into the 50k max on the append command. Apparently, append is overridden by the
[searchresults] default setting in limits.conf
maxresultrows = 50000
So I cranked that setting up and the append works (code below - in order to treat the ISBN and title/subtitle as separate entities, as required by the comparison process, I had to do two appends, but it still works). So for kicks, I tried my original join again...and lo and behold, it also works.
In conclusion, it appears there are two solutions: either switching to the append method, or cranking up
maxresultsrows in addition to cranking up the join and subsearch
subsearch_maxout values in limits.conf. I'm kinda cranky at myself for not checking for any other "max"-somethings set to 50k in the default limits.conf now. LOL
| inputlookup File1.csv | eval first_file=1 | fillnull value="" Subtitle | append maxout=150000 [| inputlookup File2.csv | table TITLE, ISBN | eval second_file=1] | stats max(first_file) as first_file, max(second_file) as second_file, values(Subtitle) as Subtitle, values(Title) as Title, values(Author) as Author, values(TITLE) as TITLE by ISBN | where isnotnull(first_file) | append maxout=150000 [| inputlookup File2.csv | dedup TITLE | rex field=TITLE "^(?<Title>(.+?(?=(, by|: | \[| \/|\.$|;))|^.+$))" | rex field=TITLE "(: |;)(?<SUBTITLE>(.+?(?=(, by| \[| \/|\.$)))|.+$)" | eval Subtitle=SUBTITLE | eval Title=trim(Title) | eval TITLE=Title | eval third_file=1] | stats max(first_file) as first_file, max(second_file) as second_file, max(third_file) as third_file, values(Author) as Author, values(ISBN) as ISBN, values(TITLE) as TITLE, values(SUBTITLE) as SUBTITLE by Title, Subtitle | where isnotnull(first_file)