Splunk Search

How can I use an HTML file as a data source?

mt25
Explorer

I am receiving some HTML files (not available over the server) that I need to process in Splunk, and I have not been able to figure out how to achieve this.
Problem Statement
I have a file that contains plenty of events, but I am interested only in those events that have the "Error" keyword.
I am trying to find a way to get the event IDs (available inside the DIV), the error detail (available inside the DIV),
and the timestamp of the event (available in the parent DIV).

I appreciate any help.
Thanks

The input is something like this:

<DIV id="A">[Jan 22 20h39:02.924] - <a href="javascript://" onClick="toggle(this)">Disconnected from server 'SERVER102'</a>
<DIV id="B"><UL>
Disconnected from server SERVER102. Reason: Initiated by the Server application<P>
[Error event 8000]
</P>
</UL><HR></DIV>
</DIV>

Required fields as output:
EventType | EventID | Description | Timestamp
Error | 8000 | Disconnected from server SERVER102. Reason: Initiated by the Server application | Jan 22 20h39:02.924

Detailed sample html file:

<html>
<head>
<style type="text/css">

A { font-family:Verdana, Arial; font-size:9.0pt; }
A:visited { color:#0000FF }

U { cursor:hand }
P { font-family:Verdana, Arial; font-size:9.0pt; }

</style>
<script>

function handleClick()
{
    el=event.srcElement;
    if (el.id!="clickable") 
        return; 
    if (!changeSetting(el,"content1",true) && !changeSetting(el,"content3",true)) 
        changeSetting(el,"content2",true);
    event.cancelBubble=true
}

/*----------------------------------------------------------------------------*/


</script>
<body>
<DIV id="content1">[Jan 22 20h39:02.756] - <a href="javascript://" onClick="toggle(this)">Disconnected from server 'Server1'</a>
<DIV id="id="content2""><UL>
Disconnected from server &#039;Server1&#039;. Reason: Initiated by the server application<P>
[Error event 5001]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:02.924] - <a href="javascript://" onClick="toggle(this)">Disconnected from server 'hulk'</a>
<DIV id="id="content2""><UL>
Disconnected from server &#039;hulk&#039;. Reason: Initiated by the server application<P>
[Error event 5001]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:12.772] - <a href="javascript://" onClick="toggle(this)">Connected to server 'tarzon'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;tarzon&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.123] - <a href="javascript://" onClick="toggle(this)">Connected to server 'iron'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;iron&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.126] - <a href="javascript://" onClick="toggle(this)">Connected to server 'titanium'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;titanium&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.133] - <a href="javascript://" onClick="toggle(this)">Connected to server 'iron'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;iron&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.192] - <a href="javascript://" onClick="toggle(this)">Connected to server 'titanium'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;titanium&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.362] - <a href="javascript://" onClick="toggle(this)">Connected to server 'titaniumPI898'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;titaniumPI898&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.412] - <a href="javascript://" onClick="toggle(this)">Connected to server 'titaniumPI498'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;titaniumPI498&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.618] - <a href="javascript://" onClick="toggle(this)">Connected to server 'titaniumPI998'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;titaniumPI998&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.745] - <a href="javascript://" onClick="toggle(this)">Connected to server 'titaniumPI098'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;titaniumPI098&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.784] - <a href="javascript://" onClick="toggle(this)">Connected to server 'Server1'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;Server1&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.939] - <a href="javascript://" onClick="toggle(this)">Connected to server 'hulk'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;hulk&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 23 06h00:00.472] - <a href="javascript://" onClick="toggle(this)">he usage</a>
<DIV id="id="content2""><UL>
Hard disk usage warning.<P>
[Warning event 621]
</P>
</UL><HR></DIV>
</DIV>
</body>
</html>

mt25
Explorer

Hello Frank,
Thanks for your prompt reply.

I do not have control over the source system that generates those files, and I can't connect to it directly.

Can we pre-process it with Python or JavaScript?

I tried a regular expression but struggled to get the desired output. If it is at all possible for you to provide a regex, it would be a great help.
Thanks


niketn
Legend

@mt25, while I have not tried it, can you check out the Splunk app with the HTML to Text command? See if it fits your needs.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

mt25
Explorer

Thanks @niketnilay.
I have installed the HTML2Text app and will give it a try now. Thanks!



FrankVl
Ultra Champion
[^\[]+\[(?<timestamp>[^\]]+)\].*\n.*\n(?<description>.*)\n\s+\[(?<eventtype>\w+)\D+(?<eventid>\d+)\]

https://regex101.com/r/7CuseA/1
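
For anyone who wants to try the pattern outside Splunk, here is a minimal Python sketch of the same regex. Two things differ from what is posted above and are assumptions on my part: Python spells named groups `(?P<name>...)`, and `\s+` is relaxed to `\s*` because the sample as pasted in the question has no indentation before `[Error event ...]`. The `SAMPLE` text is the short input from the question.

```python
import re

# Same pattern as above, using Python's (?P<name>...) named-group syntax.
# Assumption: \s+ is relaxed to \s* since the pasted sample has no leading
# whitespace in front of "[Error event 8000]".
EVENT_RE = re.compile(
    r"[^\[]+\[(?P<timestamp>[^\]]+)\].*\n.*\n(?P<description>.*)\n\s*"
    r"\[(?P<eventtype>\w+)\D+(?P<eventid>\d+)\]"
)

SAMPLE = """<DIV id="A">[Jan 22 20h39:02.924] - <a href="javascript://" onClick="toggle(this)">Disconnected from server 'SERVER102'</a>
<DIV id="B"><UL>
Disconnected from server SERVER102. Reason: Initiated by the Server application<P>
[Error event 8000]
</P>
</UL><HR></DIV>
</DIV>
"""

match = EVENT_RE.search(SAMPLE)
# Empty dict when nothing matches, so callers can test it safely.
fields = match.groupdict() if match else {}
print(fields)
```

Note that the `description` group captures the whole third line, so it still ends with the trailing `<P>` tag; that would need trimming in a later step.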

mt25
Explorer

Thanks Frank.
It works externally, but when I use it in a Splunk search it does not return any values for the same sample input. Maybe I am missing something here.

[Screenshot uploaded with the original post]


FrankVl
Ultra Champion

Try removing the / at the start and the /g at the end of what you entered in Splunk. Those are not part of the regex; they are just how regex101 displays the regex bar.

mt25
Explorer

Silly mistake 🙂 Thanks Frank, it is working. I'll test it with the exact source file and will tweak the regex if required.

Thanks again!!


FrankVl
Ultra Champion

Do you have to get this data from such HTML files? I expect these files have not been typed up by someone, but are generated from a certain datasource (file/DB) by some php/asp/whatever script? Can't you get the data directly from that same source in a more Splunk friendly format?

If you must use these HTML files, I think I would consider somehow pre-processing them by some script that extracts the relevant data and puts it in a CSV file, or at least some simple file with 1 event per line, that Splunk can easily ingest. You could run that script still in Splunk as a scripted input.
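
A pre-processing script along those lines could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names `extract_events` and `to_csv` are made up, the regex assumes every event follows the layout of the sample in the question (timestamp in the outer DIV, description and `[<type> event <id>]` in the inner one), and the column names are taken from the required output listed in the question.

```python
import csv
import html
import io
import re

# One event: timestamp in the outer DIV, then the description and the
# "[<type> event <id>]" marker in the inner DIV.
# Assumption: every event follows the layout of the sample file.
EVENT_RE = re.compile(
    r"<DIV[^>]*>\[(?P<timestamp>[^\]]+)\].*?<UL>\s*"
    r"(?P<description>.*?)\s*<P>\s*"
    r"\[(?P<eventtype>\w+) event (?P<eventid>\d+)\]",
    re.DOTALL | re.IGNORECASE,
)

FIELDNAMES = ["EventType", "EventID", "Description", "Timestamp"]


def extract_events(page, wanted_type="Error"):
    """Return one dict per event of the wanted type found in the HTML text."""
    rows = []
    for m in EVENT_RE.finditer(page):
        if m.group("eventtype").lower() != wanted_type.lower():
            continue  # skip Informational, Warning, ...
        rows.append({
            "EventType": m.group("eventtype"),
            "EventID": m.group("eventid"),
            # Decode entities such as &#039; back to plain characters.
            "Description": html.unescape(m.group("description")),
            "Timestamp": m.group("timestamp"),
        })
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text: a header line, then one event per line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Reading an input file, calling `to_csv(extract_events(text))`, and writing the result to a location Splunk monitors would give one event per line. How the script is scheduled (for example as a scripted input) is left out here.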

Alternatively, you could use the EOM markers as the event breaker and then write a regex to pull out the contents, combined with a SED command to get rid of the header lines.
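
To illustrate the event-breaking idea outside Splunk, here is a small Python sketch that splits the page on the `<!--EOM-->` markers and drops the header in front of the first event. The helper name `split_events` is hypothetical, and the sketch assumes each event starts at a `<DIV` tag as in the sample. In Splunk itself this would presumably be done with the `LINE_BREAKER` and `SEDCMD` settings in props.conf; that configuration is not shown here.

```python
EOM = "<!--EOM-->"


def split_events(page):
    """Split the page on EOM markers, keeping only the event markup."""
    events = []
    for chunk in page.split(EOM):
        # Everything before the first <DIV is header material
        # (<html>, <style>, <script>, <body>) and is discarded.
        start = chunk.find("<DIV")
        if start == -1:
            continue  # chunk holds no event at all
        # The last chunk also carries the closing </body></html> lines.
        end = chunk.find("</body>")
        body = chunk[start:end] if end != -1 else chunk[start:]
        events.append(body.strip())
    return events
```

Each returned string holds exactly one event, so a per-event regex no longer has to cope with the style and script blocks at the top of the file.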
