Splunk Search

How can I use an HTML file as a data source?

mt25
Explorer

I am receiving some HTML files (not available on a server) that I need to process in Splunk, and I can't figure out how to achieve this.
Problem Statement
I have a file that contains many events, but I am interested only in the events that contain the "Error" keyword.
I am trying to find a way to extract the event ID (available inside the inner DIV), the error detail (also inside the inner DIV),
and the timestamp of the event (available in the parent DIV).

Appreciate any help.
Thanks

The input looks something like this:

<DIV id="A">[Jan 22 20h39:02.924] - <a href="javascript://" onClick="toggle(this)">Disconnected from server 'SERVER102'</a>
<DIV id="B"><UL>
Disconnected from server SERVER102. Reason: Initiated by the Server application<P>
[Error event 8000]
</P>
</UL><HR></DIV>
</DIV>

Required Fields as output:
| EventType | EventID | Description | Timestamp |
|---|---|---|---|
| Error | 8000 | Disconnected from server SERVER102. Reason: Initiated by the Server application | Jan 22 20h39:02.924 |

Detailed sample html file:

<html>
<head>
<style type="text/css">

A { font-family:Verdana, Arial; font-size:9.0pt; }
A:visited { color:#0000FF }

U { cursor:hand }
P { font-family:Verdana, Arial; font-size:9.0pt; }

</style>
<script>

function handleClick()
{
    el=event.srcElement;
    if (el.id!="clickable") 
        return; 
    if (!changeSetting(el,"content1",true) && !changeSetting(el,"content3",true)) 
        changeSetting(el,"content2",true);
    event.cancelBubble=true
}

/*----------------------------------------------------------------------------*/


</script>
<body>
<DIV id="content1">[Jan 22 20h39:02.756] - <a href="javascript://" onClick="toggle(this)">Disconnected from server 'Server1'</a>
<DIV id="id="content2""><UL>
Disconnected from server &#039;Server1&#039;. Reason: Initiated by the server application<P>
[Error event 5001]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:02.924] - <a href="javascript://" onClick="toggle(this)">Disconnected from server 'hulk'</a>
<DIV id="id="content2""><UL>
Disconnected from server &#039;hulk&#039;. Reason: Initiated by the server application<P>
[Error event 5001]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:12.772] - <a href="javascript://" onClick="toggle(this)">Connected to server 'tarzon'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;tarzon&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.123] - <a href="javascript://" onClick="toggle(this)">Connected to server 'iron'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;iron&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.126] - <a href="javascript://" onClick="toggle(this)">Connected to server 'titanium'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;titanium&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.133] - <a href="javascript://" onClick="toggle(this)">Connected to server 'iron'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;iron&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.192] - <a href="javascript://" onClick="toggle(this)">Connected to server 'titanium'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;titanium&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.362] - <a href="javascript://" onClick="toggle(this)">Connected to server 'titaniumPI898'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;titaniumPI898&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.412] - <a href="javascript://" onClick="toggle(this)">Connected to server 'titaniumPI498'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;titaniumPI498&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.618] - <a href="javascript://" onClick="toggle(this)">Connected to server 'titaniumPI998'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;titaniumPI998&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.745] - <a href="javascript://" onClick="toggle(this)">Connected to server 'titaniumPI098'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;titaniumPI098&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.784] - <a href="javascript://" onClick="toggle(this)">Connected to server 'Server1'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;Server1&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:13.939] - <a href="javascript://" onClick="toggle(this)">Connected to server 'hulk'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;hulk&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 23 06h00:00.472] - <a href="javascript://" onClick="toggle(this)">he usage</a>
<DIV id="id="content2""><UL>
Hard disk usage warning.<P>
[Warning event 621]
</P>
</UL><HR></DIV>
</DIV>
</body>
</html>

niketn
Legend

@mt25, while I have not tried it myself, could you check out the Splunk app that provides an HTML to Text command? See if it fits your needs.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

mt25
Explorer

Thanks @niketnilay.
I have installed the HTML2Text app and will give it a try now. Thanks!


mt25
Explorer

Hello Frank,
Thanks for your prompt reply.

I do not have control over the source system which is generating those files and can't connect to it directly.

Can we pre-process it with Python or JavaScript?

I tried a regular expression but struggled to get the desired output. If it is at all possible for you to provide a regex, it would be a great help.
Thanks


FrankVl
Ultra Champion
[^\[]+\[(?<timestamp>[^\]]+)\].*\n.*\n(?<description>.*)\n\s+\[(?<eventtype>\w+)\D+(?<eventid>\d+)\]

https://regex101.com/r/7CuseA/1
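
For anyone who wants to try this outside Splunk, here is a minimal Python sketch of the same pattern. Python spells named groups `(?P<name>...)` rather than `(?<name>...)`. Two details are my own assumptions and not part of the regex above: `\s*` replaces `\s+`, because the sample as pasted in this thread shows no indentation before `[Error event ...]`, and an optional trailing `<P>` is kept out of the description. `EVENT_RE`, `SAMPLE` and `extract_events` are illustrative names only.

```python
import re

# The pattern from this post with Python-style named groups.
# Assumed tweaks: \s* instead of \s+, and a trailing <P> is left out
# of the description group.
EVENT_RE = re.compile(
    r"[^\[]+\[(?P<timestamp>[^\]]+)\].*\n"           # outer DIV line with the timestamp
    r".*\n"                                          # inner DIV/UL line, ignored
    r"(?P<description>.*?)(?:<P>)?\n"                # description line
    r"\s*\[(?P<eventtype>\w+)\D+(?P<eventid>\d+)\]"  # e.g. [Error event 8000]
)

# The short sample block from the question.
SAMPLE = """<DIV id="A">[Jan 22 20h39:02.924] - <a href="javascript://" onClick="toggle(this)">Disconnected from server 'SERVER102'</a>
<DIV id="B"><UL>
Disconnected from server SERVER102. Reason: Initiated by the Server application<P>
[Error event 8000]
</P>
</UL><HR></DIV>
</DIV>
"""


def extract_events(text):
    """Return one dict of named groups per matched event block."""
    return [m.groupdict() for m in EVENT_RE.finditer(text)]


events = extract_events(SAMPLE)
# Keep only the events the question cares about.
errors = [e for e in events if e["eventtype"] == "Error"]
for e in errors:
    print(e["eventtype"], e["eventid"], e["description"], e["timestamp"])
```

If the real files do indent the event line, the original `\s+` works just as well.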

mt25
Explorer

Thanks Frank.
It works externally, but when I use it in a Splunk search it does not return any values for the same sample input. Maybe I am missing something here.

uploaded screenshot here


FrankVl
Ultra Champion

Try removing the / at the start and the /g at the end of what you entered in Splunk. They are not part of the regex; they are just how regex101 displays the regex bar.

mt25
Explorer

Silly mistake 🙂 Thanks Frank, it is working. I'll test it with the exact source file and tweak the regex if required.

Thanks again!!


FrankVl
Ultra Champion

Do you have to get this data from such HTML files? I expect these files have not been typed up by someone, but are generated from a certain datasource (file/DB) by some php/asp/whatever script? Can't you get the data directly from that same source in a more Splunk friendly format?

If you must use these HTML files, I would consider pre-processing them with a script that extracts the relevant data and writes it to a CSV file, or at least to a simple file with one event per line, which Splunk can easily ingest. You could still run that script in Splunk as a scripted input.
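
A pre-processing script along those lines could look roughly like the Python sketch below. It is only an illustration under assumptions taken from the sample in the question: every event has its description between `<UL>` and `<P>`, followed by a `[<Type> event <id>]` line. The names `BLOCK_RE`, `html_to_rows` and `rows_to_csv` are made up for this example, and the column order follows the "required fields" in the question.

```python
import csv
import html
import io
import re

# One event block: [timestamp] ... <UL> description <P> [Type event id]
# (layout assumed from the sample file in the question).
BLOCK_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]"
    r".*?<UL>\s*(?P<description>.*?)\s*<P>"
    r"\s*\[(?P<eventtype>\w+) event (?P<eventid>\d+)\]",
    re.DOTALL,
)

SAMPLE_HTML = """<body>
<DIV id="content1">[Jan 22 20h39:02.756] - <a href="javascript://" onClick="toggle(this)">Disconnected from server 'Server1'</a>
<DIV id="id="content2""><UL>
Disconnected from server &#039;Server1&#039;. Reason: Initiated by the server application<P>
[Error event 5001]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:12.772] - <a href="javascript://" onClick="toggle(this)">Connected to server 'tarzon'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;tarzon&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
</body>
"""


def html_to_rows(text, wanted_type="Error"):
    """Return [EventType, EventID, Description, Timestamp] rows for one event type."""
    rows = []
    for m in BLOCK_RE.finditer(text):
        if m["eventtype"] != wanted_type:
            continue
        # html.unescape turns entities such as &#039; back into characters.
        rows.append([m["eventtype"], m["eventid"],
                     html.unescape(m["description"]), m["timestamp"]])
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["EventType", "EventID", "Description", "Timestamp"])
    writer.writerows(rows)
    return buf.getvalue()


rows = html_to_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

In practice the script would read the HTML file from disk and write the CSV to wherever your Splunk input picks files up; that part is left out because it depends on your setup.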

Alternatively, you could use the EOM markers as the event breaker and then write a regex to pull out the contents, combined with a SED command to get rid of the header lines.
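
To make the EOM idea concrete, here is a small Python sketch that mimics it outside Splunk: drop everything before `<body>`, split on `<!--EOM-->`, and strip the tags so each event becomes one line. It is not Splunk configuration, just an illustration of the splitting logic; `split_events`, `TAG_RE` and the sample text are assumptions made for the example.

```python
import html
import re

EOM = "<!--EOM-->"
TAG_RE = re.compile(r"<[^>]+>")  # crude tag stripper, good enough for this sample

SAMPLE_PAGE = """<html>
<head>
<style type="text/css">
A { font-family:Verdana, Arial; font-size:9.0pt; }
</style>
<body>
<DIV id="content1">[Jan 22 20h39:02.756] - <a href="javascript://" onClick="toggle(this)">Disconnected from server 'Server1'</a>
<DIV id="id="content2""><UL>
Disconnected from server &#039;Server1&#039;. Reason: Initiated by the server application<P>
[Error event 5001]
</P>
</UL><HR></DIV>
</DIV>
<!--EOM-->
<DIV id="content1">[Jan 22 20h39:12.772] - <a href="javascript://" onClick="toggle(this)">Connected to server 'tarzon'</a>
<DIV id="id="content2""><UL>
Connected to server &#039;tarzon&#039;.<P>
[Informational event 5000]
</P>
</UL><HR></DIV>
</DIV>
</body>
</html>
"""


def split_events(text):
    """Split a page into one plain-text line per event, using EOM as the breaker."""
    head, sep, body = text.partition("<body>")
    if not sep:
        body = text  # no <body> tag found, so keep the whole text
    events = []
    for chunk in body.split(EOM):
        # Strip tags first, then decode entities, then collapse whitespace.
        plain = html.unescape(TAG_RE.sub(" ", chunk))
        plain = " ".join(plain.split())
        if plain:
            events.append(plain)
    return events


events = split_events(SAMPLE_PAGE)
error_events = [e for e in events if "[Error event" in e]
for line in error_events:
    print(line)
```

Once each event is a single line like this, a much simpler regex is enough to pull out the timestamp, event type and event ID.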
