Splunk Search

Can someone help me with regex to remove HTML tags from fields?

ndsouza25
New Member

Hello,

Could someone please help me remove the HTML tags from fields?

The data is a few sentences, such as remediation of a Microsoft patch, but contains links within.

This data comes in through a lookup that I apparently can't modify. I'd like to get rid of the HTML tags so I can display just the text in a clear format.

Thank you!

1 Solution

horsefez
SplunkTrust

@ndsouza25

I worked on your second request, which was a bit more difficult, but I managed to get it working.

https://regex101.com/r/q5fPca/3

The SPL command would look something like this:
yourbasesearch | rex mode=sed field=_raw "s/((?=<[^>]>)[^;]+;[^;]+;|<[^>]>|<\/[^>]+>|<[^\'\"]+[\'\"]|[\'\"][^<]+<[^>]+>)//g"

You can change field=_raw to another field name if you have already extracted this text into its own field.
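
For instance, if the text already lives in its own field, the same kind of command can target that field directly. This is only a minimal sketch: the field name remediation is hypothetical (it does not come from this thread), and the pattern is a deliberately simpler one that just strips anything between < and >.

```spl
yourbasesearch
| rex mode=sed field=remediation "s/<[^>]+>//g"
```

Because mode=sed rewrites the field in place, it may be worth copying the field with eval first if the original markup is still needed elsewhere in the search.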

horsefez
SplunkTrust

Unfortunately I had to delete the "KB43..." value as well, because it would otherwise stick to the URL and make the URL invalid.

If you really need that "KB43..." value then hit me up again.
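
If the KB value should stay, one option is to capture the link text and the URL and write both back, instead of deleting around them. This is a sketch of an alternative, not the regex from the regex101 link. It assumes every link looks like the <A HREF='...' TARGET='_blank'>text</A> form in the sample data (uppercase tag, HREF as the first attribute), and the "text - URL" layout is just one possibility.

```spl
yourbasesearch
| rex mode=sed field=_raw "s/<A HREF=['\"]([^'\"]+)['\"][^>]*>([^<]*)<\/A>/\2 - \1/g"
| rex mode=sed field=_raw "s/<[^>]+>|&lt;[^&]*&gt;//g"
```

The first rex rewrites each link as its text followed by its URL, using the \1 and \2 backreferences. The second removes whatever tags are left, such as <P>, as well as entity-encoded ones like &lt;br/&gt;.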

ndsouza25
New Member

Thank you for spending the time! I don't need the KB values, but when I put the SPL command in, I get this error: Mismatched ']'. I see that the regex works, but can't figure out why Splunk complains about it.

horsefez
SplunkTrust

@ndsouza25 you are right, I fixed it 🙂

| rex mode=sed field=_raw "s/((?=<[^>]>)[^;]+;[^;]+;|<[^>]>|<\/[^>]+>|<[^\'\"]+[\'\"]|[\'\"][^<]+<[^>]+>)//g"

The problem was that the " characters inside the pattern needed to be escaped, because the whole rex expression is itself wrapped in double quotes and an unescaped " ends it early 🙂

Works now; I tested it in Splunk.
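
To reproduce the test without touching the lookup, the command can be run against a throwaway event. This is a minimal sketch that assumes a shortened copy of the sample data is good enough: makeresults creates a single event and eval fills _raw with the text.

```spl
| makeresults
| eval _raw="Customers are advised to follow <A HREF='https://support.microsoft.com/en-ph/help/4343902/security-update-for-adobe-flash-player' TARGET='_blank'>KB4343902</A> for instructions pertaining to the remediation of these vulnerabilities."
| rex mode=sed field=_raw "s/((?=<[^>]>)[^;]+;[^;]+;|<[^>]>|<\/[^>]+>|<[^\'\"]+[\'\"]|[\'\"][^<]+<[^>]+>)//g"
| table _raw
```

If the pattern behaves as it does on regex101, _raw should come back with the tags and the KB value gone and the URL left in place.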

ndsouza25
New Member

It works perfectly, thank you very much pyro_wood!

horsefez
SplunkTrust

I wrote a regex that at least gets you the raw text in the format you wanted (without the hyperlinks actually working).

yourbasesearch | rex mode=sed field=_raw "s/(<[^>]+>|(?<=P>)(?:[^;]+;)+)//g"

The result should look like this afterwards:
Customers are advised to follow KB4343902 for instructions pertaining to the remediation of these vulnerabilities. Following are links for downloading patches to fix the vulnerabilities: ADV180020

https://regex101.com/r/q5fPca/1

ndsouza25
New Member

Thank you very much! This works great! Is it possible to still display the URL? I don't need it to work as a hyperlink; it just needs to show up so someone can copy and paste it into a browser. I really appreciate the help and quick response!

marycordova
SplunkTrust

Please submit a sample of the data.

ndsouza25
New Member
Customers are advised to follow <A HREF='https://support.microsoft.com/en-ph/help/4343902/security-update-for-adobe-flash-player' TARGET='_blank'>KB4343902</A> for instructions pertaining to the remediation of these vulnerabilities.<P> <P>Patch:&lt;br/&gt; Following are links for downloading patches to fix the vulnerabilities: <P> <A HREF='https://portal.msrc.microsoft.com/en-us/security-guidance/advisory/ADV180020' TARGET='_blank'>ADV180020</A>

ndsouza25
New Member

Above is a sample of the data I get from our vulnerability system. I would like for it to read as such, but actually show the link URL instead of converting to a hyperlink:

Customers are advised to follow KB4343902 for instructions pertaining to the remediation of these vulnerabilities.

Patch: Following are links for downloading patches to fix the vulnerabilities:

ADV180020

horsefez
SplunkTrust

When I read "removing the html tags from fields" I immediately thought about regular expressions.
Unfortunately you don't seem to want to remove them. You want to create a hyperlink.

I'm not sure if I can help you with that. Sorry. 😞

P.S.: I'm not even sure if that is possible at all.

horsefez
SplunkTrust

Hi @ndsouza25, as @marycordovacaa said, please share some sample data. Otherwise we won't be able to help you.
