Solved: How Can I Cut With Regex An Indexing Part Of A Txt...

vtsguerrero · ‎01-22-2015

Hey everybody! Can anyone help me create an effective regex for this?

I have a .txt file of which I only need the part inside the " ***** " block to be indexed, treated as one single event per .txt file.

At index time, I need to index only from the " *>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM <<<<<<<<<<<<<<<<<< " part onward; the rest of the file should be ignored.
I'm going to use substrings later on...
Thanks in advance!

Example of data:

~~CTRL AS~~:FG8WT09UX86UBB929376293762376M92738263TROKOM S28628ITT86327UPK           293862397263755

 *>>>>>>>>>>>>>> LOGS UTDNAME: HUTHUTHYGS <<<<<<<<<<<<<<<<<<*

06.52.22 UTF8556 ---- THURSDAY,  04 DEC 2014 ----
06.52.22 UTF8556 HASP HHIAO WLM IFOP
06.52.47 UTF8556 0PLLOAOKWMO
06.92.22 UTF8556 HASP HHIAO WLM IFOP
06.52.22 UTF8556 0PLLOAOKWMO
06.72.24 UTF8556 HASP HHIAO WLM IFOP
06.52.27 UTF8556 0PLLOAOKWMO
06.53.20 UTF8556 HASP HHIAO WLM IFOP
06.52.23 UTF8556 0PLLOAOKWMO


*>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO PDPO: KOLROLMSG <<<<<<<<<<<<<<<<<<

06.52.22 UTF8556 ---- THURSDAY,  04 DEC 2014 ----
06.52.22 UTF8556 HASP HHIAO WLM IFOP
06.52.47 UTF8556 0PLLOAOKWMO
06.92.22 UTF8556 HASP HHIAO WLM IFOP
06.52.22 UTF8556 0PLLOAOKWMO
06.72.24 UTF8556 HASP HHIAO WLM IFOP
06.52.27 UTF8556 0PLLOAOKWMO
06.53.20 UTF8556 HASP HHIAO WLM IFOP
06.52.23 UTF8556 0PLLOAOKWMO
06.52.22 UTF8556 ---- THURSDAY,  04 DEC 2014 ----
06.52.22 UTF8556 HASP HHIAO WLM IFOP
06.52.47 UTF8556 0PLLOAOKWMO
06.92.22 UTF8556 HASP HHIAO WLM IFOP
06.52.22 UTF8556 0PLLOAOKWMO
06.72.24 UTF8556 HASP HHIAO WLM IFOP
06.52.27 UTF8556 0PLLOAOKWMO
06.53.20 UTF8556 HASP HHIAO WLM IFOP
06.52.23 UTF8556 0PLLOAOKWMO


*>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM   <<<<<<<<<<<<<<<<<<
***************************************************************
**          FD8ZEX3.876I - INICIO                            **
***************************************************************
***************************************************************
**          FD8ZEX3.876I - TERMINO NORMAL                    **
**          PARM         - 00160/U/X                         **
***************************************************************
**                                                           **
**     INICIO : 04/12/2014   * HORA: 07:49:40                **
**     TERMINO: 04/12/2014   * HORA: 07:49:48                **
**                                                           **
***************************************************************
**           #FD8ZEX3.876I - T  O  T  A  I  S                **
***************************************************************
** TOTAL DE REGISTROS LIDOS    (TTY010):                 325 **
** TOTAL DE REGISTROS LIDOS    (TTY011):                   0 **
** TOTAL DE REGISTROS LIDOS    (TTY012):               4.360 **
** TOTAL DE REGISTROS GRAVADOS (TTY013):                   0 **
** TOTAL DE REGISTROS GRAVADOS (TTY014):                   0 **
** TOTAL DE REGISTROS GRAVADOS (TTY015):                   0 **
** TOTAL DE REGISTROS GRAVADOS (TTY016):                   0 **
** TOTAL DE REGISTROS GRAVADOS (TTY017):                 835 **
** TOTAL DE REGISTROS GRAVADOS (TTY018):                   0 **
** TOTAL DE REGISTROS GRAVADOS (TTY019):                  67 **
** TOTAL DE REGISTROS COM ERRO DE EMAIL:                   0 **
***************************************************************

DavidHourani · ‎01-22-2015

Hello,

Once you define that this entire extraction should be treated as a single event, you simply have to add a SEDCMD to your props.conf to filter out the part you don't want.

In your case you should add something like this:

SEDCMD-<class> = s/^.*(?=\*>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM <<<<<<<<<<<<<<<<<<)//g

This will filter out everything that comes before *>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM <<<<<<<<<<<<<<<<<< and leave you with the part that comes after it.
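One way to sanity-check a pattern like this outside Splunk is a short Python sketch. It is only an approximation: Python's `re` module is not the regex engine Splunk uses, and `MARKER`, `sample`, and `strip_before_marker` are names made up for this illustration. Two assumptions are baked in: the event spans several lines, so `(?s)` is added to let `.` cross newlines, and the spacing before `<<<` is matched with `\s+` because the sample data shows more than one space there.

```python
import re

# Hypothetical marker pattern: '>' and '<' runs are matched with '+',
# and the gap before '<<<' with \s+ (the sample data has several spaces).
MARKER = r"\*>+ ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM\s+<+"

# A shortened stand-in for the file shown in the question.
sample = (
    "06.52.22 UTF8556 HASP HHIAO WLM IFOP\n"
    "06.52.47 UTF8556 0PLLOAOKWMO\n"
    "\n"
    "*>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM   <<<<<<<<<<<<<<<<<<\n"
    "***************************************************************\n"
    "**          FD8ZEX3.876I - INICIO                            **\n"
)


def strip_before_marker(text):
    """Remove everything that precedes the marker line."""
    # (?s) lets '.' match newlines; the lookahead keeps the marker itself.
    return re.sub(r"(?s)\A.*?(?=" + MARKER + ")", "", text, count=1)


trimmed = strip_before_marker(sample)
print(trimmed)
```

In Python, at least, the text before the marker is only removed once `.` is allowed to cross newlines and the spacing matches the data. Whether the same adjustments are needed in props.conf depends on how the event is broken, so treat this as a test harness rather than a confirmed fix.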

Regards,
David

vtsguerrero · ‎01-22-2015

Hello David!
I think we're on the right track, but I still haven't got it working, actually...
I'm posting this as an answer because of the text size.
After the part I need, this data actually repeats, for example:

 06.52.22 UTF8556 ---- THURSDAY,  04 DEC 2014 ----
 06.52.22 UTF8556 HASP HHIAO WLM IFOP
 06.52.47 UTF8556 0PLLOAOKWMO
 06.92.22 UTF8556 HASP HHIAO WLM IFOP
 06.52.22 UTF8556 0PLLOAOKWMO
 06.72.24 UTF8556 HASP HHIAO WLM IFOP
 06.52.27 UTF8556 0PLLOAOKWMO
 06.53.20 UTF8556 HASP HHIAO WLM IFOP
 06.52.23 UTF8556 0PLLOAOKWMO


 *>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM   <<<<<<<<<<<<<<<<<<
 ***************************************************************
 **          FD8ZEX3.876I - INICIO                            **
 ***************************************************************
 ***************************************************************
 **          FD8ZEX3.876I - TERMINO NORMAL                    **
 **          PARM         - 00160/U/X                         **
 ***************************************************************
 **                                                           **
 **     INICIO : 04/12/2014   * HORA: 07:49:40                **
 **     TERMINO: 04/12/2014   * HORA: 07:49:48                **
 **                                                           **
 ***************************************************************
 **           #FD8ZEX3.876I - T  O  T  A  I  S                **
 ***************************************************************
 ** TOTAL DE REGISTROS LIDOS    (TTY010):                 325 **
 ** TOTAL DE REGISTROS LIDOS    (TTY011):                   0 **
 ** TOTAL DE REGISTROS LIDOS    (TTY012):               4.360 **
 ** TOTAL DE REGISTROS GRAVADOS (TTY013):                   0 **
 ** TOTAL DE REGISTROS GRAVADOS (TTY014):                   0 **
 ** TOTAL DE REGISTROS GRAVADOS (TTY015):                   0 **
 ** TOTAL DE REGISTROS GRAVADOS (TTY016):                   0 **
 ** TOTAL DE REGISTROS GRAVADOS (TTY017):                 835 **
 ** TOTAL DE REGISTROS GRAVADOS (TTY018):                   0 **
 ** TOTAL DE REGISTROS GRAVADOS (TTY019):                  67 **
 ** TOTAL DE REGISTROS COM ERRO DE EMAIL:                   0 **
 ***************************************************************


 06.52.22 UTF8556 HASP HHIAO WLM IFOP
 06.52.47 UTF8556 0PLLOAOKWMO
 06.92.22 UTF8556 HASP HHIAO WLM IFOP
 06.52.22 UTF8556 0PLLOAOKWMO
 06.72.24 UTF8556 HASP HHIAO WLM IFOP
 06.52.27 UTF8556 0PLLOAOKWMO
 06.53.20 UTF8556 HASP HHIAO WLM IFOP
 06.52.23 UTF8556 0PLLOAOKWMO
 06.52.22 UTF8556 HASP HHIAO WLM IFOP
 06.52.47 UTF8556 0PLLOAOKWMO
 06.92.22 UTF8556 HASP HHIAO WLM IFOP
 06.52.22 UTF8556 0PLLOAOKWMO
 06.72.24 UTF8556 HASP HHIAO WLM IFOP
 06.52.27 UTF8556 0PLLOAOKWMO
 06.53.20 UTF8556 HASP HHIAO WLM IFOP
 06.52.23 UTF8556 0PLLOAOKWMO

So I need to "cut" out only the middle block framed by *** characters...
I used the ** SEDCMD- = s/^.(?=*>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM <<<<<<<<<<<<<<<<<<)//g* in my props.conf, but it didn't cut anything at all...
Should the regex for this be applied before or after indexing the data?
I'm going to use substr later to get the dates and amounts inside the *** block.

Thanks a lot @DavidHourani !

vtsguerrero · ‎01-22-2015

That data keeps repeating for a few more lines, but the *>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM <<<<<<<<<<<<<<<<<< block is unique in the data.

DavidHourani · ‎01-23-2015

Okay, in that case you can simply split your regex into 2 pieces.

First piece clears everything before the square:

BREAK_ONLY_BEFORE= \*>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM <<<<<<<<<<<<<<<<<<

And the second one clears everything after it:

SEDCMD-<class> = s/(?<=\*\n)\d{2}.\d{2}.\d{2}.*//g

It's better to put this into your "props.conf". That way everything gets filtered out before it is indexed, which saves a lot on your license cost.
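The "clear everything after the block" idea can be tried out in Python before touching props.conf. This is a hedged sketch, not the SEDCMD itself: Python's `re` stands in for Splunk's engine, and `block_then_noise`, `TRAILING`, and `strip_after_block` are illustrative names. It assumes the trailing lines start with an `hh.mm.ss` timestamp, that blank lines may sit between the block and those lines (hence `\s*`), and that `(?s)` is needed so `.*` runs to the end of the event.

```python
import re

# A shortened stand-in: the end of the block followed by repeated log lines.
block_then_noise = (
    "*>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM   <<<<<<<<<<<<<<<<<<\n"
    "***************************************************************\n"
    "** TOTAL DE REGISTROS COM ERRO DE EMAIL:                   0 **\n"
    "***************************************************************\n"
    "\n"
    "\n"
    "06.52.22 UTF8556 HASP HHIAO WLM IFOP\n"
    "06.52.47 UTF8556 0PLLOAOKWMO\n"
)

# (?<=\*\n)            the previous line ended with '*'
# \s*                  skip any blank lines after the block
# \d{2}\.\d{2}\.\d{2}  an hh.mm.ss timestamp with literal dots
# .* under (?s)        everything up to the end of the text
TRAILING = re.compile(r"(?s)(?<=\*\n)\s*\d{2}\.\d{2}\.\d{2}\s.*")


def strip_after_block(text):
    """Drop the timestamped lines that follow the starred block."""
    return TRAILING.sub("", text, count=1)


cleaned = strip_after_block(block_then_noise)
print(cleaned)
```

This only handles trailing lines that begin with a timestamp; lines that begin with letters would need a different pattern.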

Let me know how that works out for you.

Regards,
David

vtsguerrero · ‎01-23-2015

Hello once again @DavidHourani !

Even with the two SEDCMD commands inside my props.conf, I don't get any result at all...
I forgot to mention that this would be my third regex attempt; let me explain why...
Each .txt file should be treated as one record, so to do that and ignore the timestamps inside the file, I had to use " BREAK_ONLY_BEFORE=~~CTRL AS~~: " in my props stanza.
And after using SEDCMD- = s/^.(?=>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM <<<<<<<<<<<<<<<<<<)//g and SEDCMD- = s/(?<=n)d{2}.d{2}.d{2}.//g in props.conf, I still didn't get the highlighted ** block ** I needed. I'm not sure whether I should extract this block as a field after indexing, but it doesn't appear in "Extract Field Mode" either.

Would it be better to extract this before indexing or at search time?
Considering that I kind of have a regex inside a regex inside a regex :S
I'm still a little confused about how to treat this kind of data.
Thanks in advance for the help @DavidHourani !

DavidHourani · ‎01-23-2015

First of all, sorry: the regex got broken in the comment, I should've put it in code...

Anyway, I see what you want. Is your document always in this format? I think a better solution would be to filter out all the lines that don't start with * and make sure the event breaks each time the square begins.

I tested the following with props on my machine and it seemed to work:

BREAK_ONLY_BEFORE=\*>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM <<<<<<<<<<<<<<<<<<

This will break the event at the beginning of the square.

SEDCMD-cleanup=s/(?m)^[^\*].*//g

This will go line by line (?m) and delete every line that doesn't start with *.
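The line-by-line behaviour of `(?m)` can be checked with a small Python sketch (again an approximation, since Python's `re` is not Splunk's engine; `event`, `NON_STAR_LINE`, and `blank_non_star_lines` are illustrative names). One deliberate difference from the pattern above: the character class here is `[^*\n]` rather than `[^\*]`. In Python a negated class also matches a newline, so without the `\n` an empty line can match and take the following line with it, even if that line starts with `*`.

```python
import re

# A shortened stand-in for one event that starts at the marker line.
event = (
    "*>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM   <<<<<<<<<<<<<<<<<<\n"
    "***************************************************************\n"
    "**          FD8ZEX3.876I - INICIO                            **\n"
    "***************************************************************\n"
    "\n"
    "06.52.22 UTF8556 HASP HHIAO WLM IFOP\n"
    "06.52.47 UTF8556 0PLLOAOKWMO\n"
)

# (?m) makes ^ match at the start of every line.
# [^*\n] requires a first character that is neither '*' nor a newline,
# so an empty line cannot swallow the line that follows it.
NON_STAR_LINE = re.compile(r"(?m)^[^*\n].*")


def blank_non_star_lines(text):
    """Empty out every line whose first character is not '*'."""
    return NON_STAR_LINE.sub("", text)


result = blank_non_star_lines(event)
# The newlines themselves are left in place, so drop the empty lines to inspect.
remaining = [line for line in result.splitlines() if line]
print("\n".join(remaining))
```

Note that the substitution empties the unwanted lines but leaves their line breaks behind, so the event still ends with blank lines.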

I hope that works for you 🙂

Regards,
David

vtsguerrero · ‎01-23-2015

Whoooooa! Very Nice solution @DavidHourani !
It's almost 100% working now; it just didn't clean up some lines after the block.
Some lines after this square-shaped text also start with letters. For example, this is what I've got here:
Lines starting with letters or blank spaces are still being indexed...

      ***************************************************************


      TRO98I UTF8556 HASP HHIAO WLM IFOP
      TRO98I UTF8556 HASP HHIAO WLM IFOP
      TRO98I UTF8556 HASP HHIAO WLM IFOP
      TRO98I UTF8556 HASP HHIAO WLM IFOP
                      TRO98I UTF8556 HASP
                      TRO98I UTF8556 HASP
      TRO98I UTF8556 HASP HHIAO WLM IFOP
      TRO98I UTF8556 HASP HHIAO WLM IFOP
                      TRO98I UTF8556 HASP
                      TRO98I UTF8556 HASP
      TRO98I UTF8556 HASP HHIAO WLM IFOP
      TRO98I UTF8556 HASP HHIAO WLM IFOP
      99891119-19927
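For leftovers like the ones above, where lines may be indented, one option is to test the first non-blank character instead of the first character. The Python sketch below illustrates that idea only; it does not explain why preview and search differ in Splunk, and `BORDER_LINE`, `indented_event`, `NOT_A_BLOCK_LINE`, and `keep_block_lines` are hypothetical names. It assumes every line of the wanted block has `*` as its first non-blank character.

```python
import re

# An indented border line like the one in the sample above (63 asterisks).
BORDER_LINE = "      " + "*" * 63 + "\n"

# A shortened stand-in for the leftover text, indentation included.
indented_event = (
    BORDER_LINE
    + "\n"
    + "\n"
    + "      TRO98I UTF8556 HASP HHIAO WLM IFOP\n"
    + "                      TRO98I UTF8556 HASP\n"
    + "      99891119-19927\n"
)

# (?m)^          at the start of each line
# (?![ \t]*\*)   only if the first non-blank character is not '*'
# [^\n]*\n?      consume the whole line and its line break
NOT_A_BLOCK_LINE = re.compile(r"(?m)^(?![ \t]*\*)[^\n]*\n?")


def keep_block_lines(text):
    """Remove blank lines and lines whose first non-blank character is not '*'."""
    return NOT_A_BLOCK_LINE.sub("", text)


only_block = keep_block_lines(indented_event)
print(only_block)
```

Because whole lines are removed and the kept lines are untouched, the column positions inside the block stay the same, which matters if fixed offsets are used later.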

vtsguerrero · ‎01-23-2015

It cut everything I didn't need before the block; now I just have to find a way to cut what comes after it...
I'm going to use a substr function, so the block's layout can never shift position: I need exact offsets to find each new eval field.

But thanks a lot pal! Helped a lot!
Best regards,
- Vinicius Guerrero.

DavidHourani · ‎01-23-2015

Usually, with the SEDCMD I gave you above, everything that doesn't start with * should get filtered out ^^

If my answer helped and you don't need any further help, could you please accept it? ^^

Regards,
David

vtsguerrero · ‎01-23-2015

Yeah, it filtered correctly on the indexing page, but when I go to my search, it only filters what comes before that ** structure...
Anyway, I think using substr may not be an issue as long as this text always comes after the block, even though it isn't filtered out at index time.

Thanks @DavidHourani !

DavidHourani · ‎01-23-2015

That's weird... I only have the square left after the filter. Here's my props.conf:

[test]
BREAK_ONLY_BEFORE = \*>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM <<<<<<<<<<<<<<<<<<
NO_BINARY_CHECK = true
SEDCMD-cleanup = s/(?m)^[^\*].*//g
category = Custom
pulldown_type = true

vtsguerrero · ‎01-23-2015

Yeah, I'm trying to understand what could be going wrong, but I have no idea what's happening with the second part of the cut. My stanza looks like this:

# your settings
BREAK_ONLY_BEFORE=\*>>>>>>>>>>>>>>>>>> ABAIXO REGISTROS GRAVADOS NO DDNAME: OUTERSYSTEM <<<<<<<<<<<<<<<<<<
MAX_TIMESTAMP_LOOKAHEAD=150
NO_BINARY_CHECK=1
SEDCMD-cleanup=s/(?m)^[^\*].*//g
SHOULD_LINEMERGE=true

# set by detected source type
MAX_EVENTS=500
TRUNCATE=0
pulldown_type=1

I'm going to try changing the regex a little, but what's strange is that in preview mode it shows up perfectly, while in search the second part of the text has not been cut.
