Data Normalisation

ozair10 · ‎08-26-2020

Hi everyone,

How can data be normalized to a specified format at search time? Sometimes the same name occurs in different notations.

It should be normalized so that there is:

A single whitespace before and after ‘:’
A single whitespace before and after ‘/’
A single whitespace after ‘;’

I went through this documentation:

https://docs.splunk.com/Documentation/CIM/4.16.0/User/UsetheCIMtonormalizedataatsearchtime

but couldn't find a solution. Can someone help, please?

richgalloway · ‎08-26-2020

There probably are a few ways to do those normalizations. I like to use rex.

... | rex field=<field to normalize> mode=sed "s/\s*:\s*/ : /g"
| rex field=<field to normalize> mode=sed "s/\s*\/\s*/ \/ /g"
| rex field=<field to normalize> mode=sed "s/;\s*/; /g"
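
For anyone who wants to sanity-check the spacing rules outside Splunk, here is a minimal Python sketch of the same three substitutions. It writes literal spaces in the replacement text. The `normalize` helper and the sample strings are made up for illustration, and Python's `re` module is not the engine Splunk uses, though patterns this simple should behave the same way in both.

```python
import re


def normalize(value: str) -> str:
    """Apply the three spacing rules from the question, in order."""
    value = re.sub(r"\s*:\s*", " : ", value)  # one space before and after ':'
    value = re.sub(r"\s*/\s*", " / ", value)  # one space before and after '/'
    value = re.sub(r";\s*", "; ", value)      # one space after ';'
    return value


# Hypothetical sample values: two notations of the same name.
print(normalize("host:web01/prod;eu"))          # host : web01 / prod; eu
print(normalize("host  :web01 /  prod;   eu"))  # host : web01 / prod; eu
```

Both notations end up as the same string, which is the point of the exercise. Note that a value ending in `;` would pick up a trailing space under the third rule.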

---
If this reply helps you, Karma would be appreciated.

Richfez · ‎08-26-2020

First off, you really are going to be better off if you can fix this data on the source side.

If it is unfixable there (I'll use the example of file paths), then it may also be a good idea to extract that into "drive", "path", "filename", and "extension" portions, using any of the field extraction options.

Also, it *really* would have helped get a better answer more quickly if you had provided a few examples.

The best choice, given what little we have to go on, is probably

https://docs.splunk.com/Documentation/SplunkCloud/8.0.2007/SearchReference/Rex

I won't belabor the syntax, because it's in the link above, but I will provide a few examples that do precisely what you are wanting to do. Welcome to the world of regex. : )

The following regexes work.

| makeresults 
| eval testString = "this    .  that/theother ;foo.bar /baz; buz"
| rex field=testString mode=sed "s/(\S)\.(\S)/\1 . \2/g"
| rex field=testString mode=sed "s/(\s)+\.(\S)/ . \2/g"
| rex field=testString mode=sed "s/(\S)\.(\s)+/\1 . /g"
| rex field=testString mode=sed "s/(\s)+\.(\s)+/ . /g"
| rex field=testString mode=sed "s/(\S)\;(\S)/\1 ; \2/g"
| rex field=testString mode=sed "s/(\s)+\;(\S)/ ; \2/g"
| rex field=testString mode=sed "s/(\S)\;(\s)+/\1 ; /g"
| rex field=testString mode=sed "s/(\s)+\;(\s)+/ ; /g"
| rex field=testString mode=sed "s/(\S)\\/(\S)/\1 \/ \2/g"
| rex field=testString mode=sed "s/(\s)+\\/(\S)/ \/ \2/g"
| rex field=testString mode=sed "s/(\S)\\/(\s)+/\1 \/ /g"
| rex field=testString mode=sed "s/(\s)+\\/(\s)+/ \/ /g"

Let's tear that apart just a bit.

The first two lines just set up my run-anywhere, creating a blank event and then sticking a testString into it to work with.

Then we have the meat. You'll notice it's 12 lines, in 3 groups of 4 lines each, because each of those groups does the same thing for a different character.

The first of each group is, if you strip out the regex parts, just a substitution.

s/stringToFind/stringToReplaceItWith/g

So the first string to find is

(\S)\.(\S)

Which says to match a non-space character (\S), a literal period \. and then another non-space character (\S). The parentheses around the \S's are so we "record" that value in a variable for later. You'll see it in a second.

So when it finds a non-space, period, non-space, what do we do? That's the next part of the regex.

\1 . \2

That says take the matching part, like "D.G" that we found above, and replace it with the first recorded variable \1 which was whatever was in the first \S, a space, a period, a space, and the second recorded part. So "D.G" gets changed to "D . G".
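
If it helps to see the capture groups in action, here is a small Python sketch of that one substitution. Python's `re.sub` accepts the same `\1` and `\2` back-references in the replacement; the variable names are just for illustration, and this is a way to poke at the regex outside Splunk rather than anything Splunk itself runs.

```python
import re

# (\S)\.(\S): a non-space, a literal period, another non-space.
# The parentheses capture the two non-space characters as groups 1 and 2.
pattern = re.compile(r"(\S)\.(\S)")

match = pattern.search("D.G")
print(match.group(1), match.group(2))  # D G

# \1 and \2 in the replacement replay whatever the groups captured.
result = pattern.sub(r"\1 . \2", "D.G")
print(result)  # D . G
```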

Then we repeat with variations of spaces/non-spaces. For instance,

(\s)+\.(\S)/ . \2/g

looks for one or more spaces (up to however many are in a row) - that's the `\s` backslash little-s for space, and the + says "all of them if there's one or more". Followed by a period. Followed by a non-space character.

\s = spaces (strictly, any whitespace character, tabs included), \S = not-spaces.

For consistency here, I use a capture group () around the \s, but it's not really necessary because you see in the replacement I don't use \1 anywhere, so the first capture group isn't used. But that's OK, and it makes all the examples "look more the same" so they should be easier to follow.

So anyway, that takes something like " .BAR", collapses the leading spaces down to a single one, and, just like the first substitution, puts a space after the period, leaving you with " . BAR".
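
Here is a rough Python sketch of the four period substitutions applied in the same order, for trying the cases outside Splunk. The function names and sample strings are hypothetical. The second function is an assumption of mine rather than part of the search above: it uses the `\s*` style from the earlier reply to cover all four cases with one pattern. The two agree on the samples shown, but they are not guaranteed to agree on every input (a period at the very start of a string, for example).

```python
import re


def space_periods_four_cases(text: str) -> str:
    """The four period cases from the search above, in the same order."""
    text = re.sub(r"(\S)\.(\S)", r"\1 . \2", text)  # no space on either side
    text = re.sub(r"(\s)+\.(\S)", r" . \2", text)   # spaces before only
    text = re.sub(r"(\S)\.(\s)+", r"\1 . ", text)   # spaces after only
    text = re.sub(r"(\s)+\.(\s)+", r" . ", text)    # spaces on both sides
    return text


def space_periods_single(text: str) -> str:
    """Assumed alternative: swallow optional whitespace on both sides at once."""
    return re.sub(r"\s*\.\s*", " . ", text)


for sample in [" .BAR", "foo.bar", "this    .  that"]:
    print(repr(space_periods_four_cases(sample)), repr(space_periods_single(sample)))
```

For `" .BAR"` only the second case fires and the result is `" . BAR"`, matching the walkthrough above.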

So, if I were you, I'd get that working on your data, then put them in a macro so you can call them repeatedly.
https://docs.splunk.com/Documentation/Splunk/8.0.5/Knowledge/Definesearchmacros
https://docs.splunk.com/Documentation/Splunk/8.0.5/Knowledge/Usesearchmacros
