Splunk Search

How do I deal with performance issues using the transaction command with a large amount of data?

New Member

I have a new application where I need to extract one field, taskname, from the main task table (5+ million records) and merge it into the subtask table (fewer than 1 million records). When I ran the search with the 'transaction' command, I got incomplete results.

Here are my file structures. One main task has multiple subtasks, and each subtask may run serially or in parallel.

Main  taskA (mID, jID, mCPU, taskname, main task beginning/ending time)
  subtask1: mID, jID, sCPU, subtask beginning/ending time  
  subtask2: mID, jID, sCPU, subtask beginning/ending time 
  . . . . .

I need to search the detailed subtask attributes by task name across different time frames. Another search will summarize by time distribution. I know the latter can be achieved with | timechart span=1h sum(sCPU) by taskname. But I ran into the performance issue with the first search. I searched around and found suggestions to use summary indexing and post-processing, but they suggest aggregating on the mID and jID fields, and that way I will lose the detailed subtask information. Any suggestions?
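For reference, the first (detail) search would ideally be something like the sketch below, using the source and field names from above (the taskname value is hypothetical). It cannot work as written, because taskname exists only on the maintask events, not on the subtask events, which is why the two sources have to be merged first:

index=performance source=subtask taskname="someTaskName"
| table mID jID sCPU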


Re: How do I deal with performance issues using the transaction command with a large amount of data?

Communicator

Can you post the query with the transaction command that you are using?
Are your main task table and subtask table different files?


Re: How do I deal with performance issues using the transaction command with a large amount of data?

New Member

The base search:

index=performance source=subtask OR source=maintask
| transaction mID jID startswith=(source="subtask") endswith=(source="maintask")

....

Use the base search for query 1:

| timechart span=1h sum(sCPU) by taskname

....

Use the base search for query 2:

| table mID jID sCPU taskname sATT1 sATT2 sATT3
| sort 0 -sCPU | head 100

where the sATTx fields are minor test attributes from source=subtask, and taskname comes from source=maintask.


Re: How do I deal with performance issues using the transaction command with a large amount of data?

Influencer

I find it helps to visualise Splunk as a single giant database table with unlimited columns. So in fact your events are already "joined" and all you need to do is group statistics together in a meaningful way.

Assuming each mID,jID pair is unique, you can group all your fields together like so:

index=performance (source=subtask OR source=maintask ) | stats values(*) AS * by mID,jID

This should give you a table of all the fields that belong to each mID jID pair. The fields will be multivalued, so you may need to split them with mvexpand
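A variant of the same grouping idea that avoids multivalued fields, and also keeps _time for the timechart case, is to copy taskname onto each subtask event with eventstats and then drop the maintask events. This is only a sketch: field and source names are taken from this thread, and it assumes each mID,jID pair has exactly one maintask event carrying the taskname. For the detail table:

index=performance (source=subtask OR source=maintask)
| eventstats values(taskname) AS taskname by mID, jID
| search source=subtask
| table mID jID sCPU taskname sATT1 sATT2 sATT3
| sort 0 -sCPU | head 100

and likewise for the hourly distribution:

index=performance (source=subtask OR source=maintask)
| eventstats values(taskname) AS taskname by mID, jID
| search source=subtask
| timechart span=1h sum(sCPU) by taskname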

There is a great flowchart on when to use each aggregation command here: http://docs.splunk.com/Documentation/Splunk/6.3.2/Search/Abouteventcorrelation

See Stats: http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Stats
See Mvexpand: http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Mvexpand
