Thanks! Your solution was super helpful, systemjack!
However, I ran into a few challenges that broke your solution:
- Data containing characters that create word boundaries inside an item. For example, if the data is "C-53124 C-53124 C-67943", the hyphens form word boundaries.
- Data that is a subset of another item. For example: "C-53124 C-53124 C-67943 C-53124567" (C-53124567 contains C-53124).
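To make the first challenge concrete, here is a small Python sketch of how a word-boundary pattern sees a hyphenated ID. It does not reproduce systemjack's regex; the pattern `\b\w+\b` and the helper name `word_tokens` are assumptions chosen only to show where `\b` falls.

```python
import re

def word_tokens(text):
    # Return every run of word characters that is delimited by word boundaries.
    return re.findall(r"\b\w+\b", text)

# The hyphen is not a word character, so each ID is split into two tokens.
print(word_tokens("C-53124 C-67943"))  # → ['C', '53124', 'C', '67943']
```

Any approach that treats each of these tokens as an item would see "C" as a repeat even though the two IDs are different.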
In my case, I was also working with text that was already a single string, separated by semicolons, with duplicates. Therefore, I could skip the nomv step. I also could use delim rather than tokenizer, since I had a simple delimiter. Here is what I used, building off your work:
rex field=field_name mode=sed "s/((^|;)[^;]+);(?=.*\1(;|$))/;/g" | makemv delim=";" field_name
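For anyone who wants to try the expression outside Splunk, here is a minimal Python sketch of the same substitution applied to the two problem inputs, rewritten with semicolons. Python's `re` module stands in for Splunk's sed mode here, and dropping empty pieces after the split is an assumption meant to mirror what `makemv delim=";"` does with the leading empty value; the function name `dedup` is made up for the example.

```python
import re

# Pattern and replacement taken from the sed expression s/<pattern>/;/g above.
PATTERN = re.compile(r"((^|;)[^;]+);(?=.*\1(;|$))")

def dedup(text):
    # Apply the substitution once, then split on ';' and drop empty pieces.
    collapsed = PATTERN.sub(";", text)
    return [item for item in collapsed.split(";") if item]

print(dedup("C-53124;C-53124;C-67943"))
# → ['C-53124', 'C-67943']
print(dedup("C-53124;C-53124;C-67943;C-53124567"))
# → ['C-53124', 'C-67943', 'C-53124567']
```

Both challenge cases come out as intended: the hyphens do not split the items, and C-53124567 is not mistaken for a copy of C-53124.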
I spent a long time working through your regex, so here is an explanation for folks who aren't regex junkies. At a high level, the deduplication approach is to search for a string, followed by anything, followed by that string again, and trash the first instance of the string. We do that through all possible matches, so only the final instance remains.
In more detail:
Search for a string that doesn't contain our delimiter:
[^;]+
Search for that string either at the beginning of the line or with the delimiter in front of it, and with the delimiter after it:
((^|;)[^;]+);
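Because the parentheses are nested, the outer pair saves the item together with its leading delimiter (if there is one) as \1, and the inner pair saves just that leading delimiter (or nothing, at the beginning of the line) as \2.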
We then look ahead for anything, followed by our string (\1), followed by either a delimiter or the end of the line. (?= performs a "positive lookahead"):
(?=.*\1(;|$))
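In the replace portion, the matched text (the item plus the delimiters around it) is replaced with a single ;. The text inside the positive lookahead is only checked, not consumed, so the later copy stays where it is: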
/;/
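And we do all this globally, anywhere the pattern will match, so several different duplicated items are handled in one pass. One caveat: each match consumes the delimiter that follows it, so the item right after a removed one cannot begin a new match, and a duplicate that sits directly beside its copy in the middle of the string (for example "x;a;a") is left in place.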
g
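The walkthrough above can be checked against a few edge cases with a short Python sketch. `ORIGINAL` is the pattern from this post. `VARIANT` is a hypothetical alternative that is not part of the original solution and has not been tested in Splunk: it moves the trailing delimiter into the lookahead so that it is not consumed, and it requires the later copy to be a whole item. All function names are invented for the example.

```python
import re

# Pattern from the post; its matches are replaced with ';'.
ORIGINAL = re.compile(r"((^|;)[^;]+);(?=.*\1(;|$))")
# Hypothetical variant (an assumption, not from the post); its matches are
# replaced with the empty string. The delimiter after the item is only looked at.
VARIANT = re.compile(r"(^|;)([^;]+)(?=;(?:.*;)?\2(?:;|$))")

def split_items(text):
    # Split on ';' and drop empty pieces such as a leading empty value.
    return [item for item in text.split(";") if item]

def dedup_original(text):
    return split_items(ORIGINAL.sub(";", text))

def dedup_variant(text):
    return split_items(VARIANT.sub("", text))

for sample in ["a;b;a;c;a", "x;a;a", "a;b;a;b", "124;C-53124"]:
    print(sample, dedup_original(sample), dedup_variant(sample))
```

In this sketch the original pattern handles "a;b;a;c;a" but keeps both copies in "x;a;a", keeps the repeated "b" in "a;b;a;b", and drops "124" from "124;C-53124" because the first item is matched without a leading delimiter. The variant returns the unique items in each of these cases. If it behaves the same way in Splunk, the sed form would be `s/(^|;)([^;]+)(?=;(?:.*;)?\2(?:;|$))//g`, but treat that as an untested assumption.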
I suppose one could create two macros out of this, one for strings and one for multivalue fields, and handle any delimiter (I have not tested this):
[string_dedup(2)]
args = field_name, delimiter
definition = rex field=$field_name$ mode=sed "s/((^|$delimiter$)[^$delimiter$]+)$delimiter$(?=.*\1($delimiter$|$))/$delimiter$/g"
[mv_dedup(2)]
args = field_name, delimiter
definition = nomv $field_name$ | `string_dedup($field_name$, $delimiter$)` | makemv delim="$delimiter$" $field_name$
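Since the macros are untested, here is a rough Python sketch of the same parameterization, useful mainly for seeing what happens when the delimiter is a regex metacharacter. It assumes a single-character delimiter (the `[^...]` character class only makes sense for one character) and escapes it with `re.escape`; the Python functions borrow the macro names but are only stand-ins, not Splunk code.

```python
import re

def string_dedup(text, delimiter):
    # Same pattern as the string_dedup macro, with the delimiter substituted in.
    if len(delimiter) != 1:
        raise ValueError("this sketch assumes a single-character delimiter")
    d = re.escape(delimiter)  # '|' or '.' would otherwise change the pattern's meaning
    pattern = rf"((^|{d})[^{d}]+){d}(?=.*\1({d}|$))"
    # A function replacement inserts the delimiter literally, with no escape processing.
    return re.sub(pattern, lambda match: delimiter, text)

def mv_dedup(values, delimiter):
    # Join the values, dedup the joined string, then split again and drop empties.
    joined = delimiter.join(values)
    return [v for v in string_dedup(joined, delimiter).split(delimiter) if v]

print(string_dedup("a|b|a", "|"))        # → |b|a
print(mv_dedup(["a", "b", "a"], "|"))    # → ['b', 'a']
```

If the macro arguments are substituted into the regex as plain text, a caller passing a delimiter such as | or . would presumably need to escape it by hand, for example as \|.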