All Apps and Add-ons

Splunk GUI Rest API Twitter Integration

Splunk Employee

Hello,

We are looking to use the Splunk REST modular input GUI to connect to Twitter and monitor feeds based on several URL parameters we want to search for.

We have the endpoint defined as https://api.twitter.com/1.1/search/tweets.json, our authentication credentials entered, and a sample URL argument of q=UPMC to search Twitter for anything mentioning UPMC, returning in XML format. No data is returned, though, yet when I use the Twitter dev console it works fine: https://dev.twitter.com/console. Is anyone else having issues using the GUI REST integration, or is there a better way to pull in Twitter data based on keywords? Should we be defining a Response Handler and other options in the config?

Thanks!


Ultra Champion

Well, you will be getting multiple events in the response document, but they are being indexed in Splunk as one single event. That is why the REST API Modular Input has custom response handlers that you can plug in to parse the specific response you are getting back, i.e. split out the individual Twitter events from the JSON response.
You add your custom response handler to bin/responsehandlers.py and declare it on the setup page for your REST input definition.

Here is an example of what a custom handler might look like for the Twitter JSON response :

class TwitterEventHandler:

    def __init__(self, **args):
        pass

    def __call__(self, response_object, raw_response_output, response_type, req_args, endpoint):

        if response_type == "json":
            output = json.loads(raw_response_output)
            last_tweet_indexed_id = 0
            for twitter_event in output["statuses"]:
                print_xml_stream(json.dumps(twitter_event))
                if "id_str" in twitter_event:
                    # id_str is a string, so cast before comparing numerically
                    tweet_id = int(twitter_event["id_str"])
                    if tweet_id > last_tweet_indexed_id:
                        last_tweet_indexed_id = tweet_id

            if "params" not in req_args:
                req_args["params"] = {}

            req_args["params"]["since_id"] = last_tweet_indexed_id

        else:
            print_xml_stream(raw_response_output)

I see that the raw response back from Twitter also has a created_at field for each event, which you can then use as your Splunk index time value.
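To illustrate that last point: Twitter's created_at values follow one fixed format, so a handler could convert them to epoch seconds before emitting the event. A minimal sketch, assuming Python 3 (strptime's %z directive; the modular input itself may run under Splunk's bundled Python 2, so treat this as illustration only):

```python
from datetime import datetime

def tweet_epoch(created_at):
    # Twitter timestamps look like "Wed Oct 23 11:30:00 +0000 2013"
    return datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y").timestamp()

print(tweet_epoch("Wed Oct 23 11:30:00 +0000 2013"))  # → 1382527800.0
```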


Splunk Employee

OK guys, I think I solved the issue of duplicate tweets and revised the responsehandlers.py code!

The issue was initializing last_tweet_indexed_id = 0 on every call. To overcome this, I write last_tweet_indexed_id to a file, and whenever the responsehandlers.py code is called it initializes last_tweet_indexed_id to the last value in that file instead of 0. Here is my revised code, and it is working; all you need to do is modify it to point wherever you want the log that holds the id to live. Also, I had issues with the created_at field, so I created my own timestamp field to make it easier to extract:

class TwitterEventHandler:

    def __init__(self, **args):
        pass

    def __call__(self, response_object, raw_response_output, response_type, req_args, endpoint):
        if response_type == "json":
            output = json.loads(raw_response_output)
            # last_tweet_id.log holds only one value, overwritten with the most current id
            tweet_ids = open('/splunk/etc/apps/rest_ta/bin/last_tweet_id.log', 'r')
            # splunk.log is for debugging and is optional, to show where you left off
            splunk_log = open('/splunk/etc/apps/rest_ta/bin/splunk.log', 'a')
            last_tweet_indexed_id = int(tweet_ids.readline())
            tweet_ids.close()
            for twitter_event in output["statuses"]:
                # creating a new __time field to make it easier to extract in props.conf
                if 'created_at' in twitter_event:
                    twitter_event['__time'] = twitter_event['created_at']
                print_xml_stream(json.dumps(twitter_event))
                if "id_str" in twitter_event:
                    tweet_id = int(twitter_event["id_str"])
                    if tweet_id > last_tweet_indexed_id:
                        last_tweet_indexed_id = tweet_id
            if "params" not in req_args:
                req_args["params"] = {}

            req_args["params"]["since_id"] = last_tweet_indexed_id
            # writing the checkpoint back out
            tweet_ids = open('/splunk/etc/apps/rest_ta/bin/last_tweet_id.log', 'w')
            tweet_ids.write(str(last_tweet_indexed_id))
            splunk_log.write(str(last_tweet_indexed_id) + '\n')
            tweet_ids.close()
            splunk_log.close()

        else:
            print_xml_stream(raw_response_output)
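One caveat with the file-based approach above: it assumes last_tweet_id.log already exists and contains an integer, so the very first poll (or an empty file) will raise an exception. A small guard, sketched here with a hypothetical helper name, would keep the first run working:

```python
import os

def read_checkpoint(path):
    # Return the last indexed tweet id, defaulting to 0 when the
    # checkpoint file is missing or empty (e.g. on the very first poll)
    if not os.path.isfile(path):
        return 0
    with open(path) as f:
        line = f.readline().strip()
    return int(line) if line else 0
```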

Splunk Employee

Thanks for the response, Damien. Here is part of my inputs.conf in case you want to test with a more complex query:

[rest://Twitter]
auth_type = oauth1
endpoint = https://api.twitter.com/1.1/search/tweets.json
http_method = GET
index = twitter
index_error_response_codes = 1
polling_interval = 30
response_handler = TwitterEventHandler
response_type = json
sourcetype = twitter
streaming_request = 0
url_args = q=UPMC%20OR%20HEALTHCARE%20OR%20%40makeitourupmc,lang=en,since_id=0,include_entities=false
disabled = 0
backoff_time = 10
request_timeout = 60

Also, when checking the logs, I seem to be consistently getting a broken-pipe error related to Twitter:

HttpListener - Socket error from 127.0.0.1 while accessing /servicesNS/nobody/search/data/inputs/rest/Twitter/: Broken pipe

I checked ulimit -n and it is set to 16384, and I even set maxThreads and maxSockets = -1 to impose no limits.

Here is a screenshot of a just-cleaned index showing the unique events along with the total events:

[screenshot]

Here is another search displaying the id_str, text, the created_at (expected timestamp), and the actual time Splunk extracted with the props.conf stanza. As you can see, some of these dates aren't even within the same year and are pulling the time the user created their account:

[screenshot]

Ultra Champion

I really want to be able to reproduce the duplicate tweets scenario, but I can't. In my environments I am not seeing any duplicate tweets.

Starting with a clean index, I ran the Twitter polling for many hours. Here is an example of the search and time range I used to verify that no duplicate tweets were indexed.

[screenshot]

I also started/stopped/enabled/disabled randomly and for different time periods to try and simulate adverse operational conditions. Still, no duplicate tweets.


Ultra Champion

There really is nothing more for me to add at this point. As you can see from the above screenshot, I cannot replicate this on any of my environments (multiple OSes and versions of Splunk). I will continue polling for 24 hrs and re-observe.


Path Finder

Below is my inputs.conf:

[rest://twitter]
auth_type = oauth1
endpoint = https://api.twitter.com/1.1/search/tweets.json
http_method = GET
index = twitter
index_error_response_codes = 1
oauth1_access_token = 35662991-XXXXXXXXXBF0hL6yffzzZrkZvxFISQGR
oauth1_access_token_secret = 2dXXXXXXXXXOR5ppHJVQroREXRPUjBRpNK
oauth1_client_key = zLxCXXXXXXXXPljW00A
oauth1_client_secret = fXKvtn0XXXXXXXX0od2zcMD7EtJsnY
polling_interval = 30
response_handler = TwitterEventHandler
response_type = json
sourcetype = rest_twitter
streaming_request = 0
url_args = since_id=395371736725598208,q=XXXX
disabled = 0


Ultra Champion

Note: I just uploaded a new version of the REST Modular Input that now automatically persists any dynamically calculated URL arguments back into your inputs.conf stanzas (using the Splunk Python SDK). So if you restart the REST Modular Input stanza, it starts polling from where it last left off. I tested this all with the "TwitterEventHandler" (now included in version 1.3) and it worked perfectly for me.

Below is the initial stanza that I started with.

On each subsequent polling iteration, the url_args field gets dynamically updated with the latest tweet id as the since_id value.

[rest://twitter]
auth_type = oauth1
endpoint = https://api.twitter.com/1.1/search/tweets.json
http_method = GET
index = main
index_error_response_codes = 1
oauth1_access_token = 217362964-dtJVxxxxxxxxxUOY4Q0w
oauth1_access_token_secret = BWQ2LcQhxxxxxxxxxxxlf4o1B84mWrlE
oauth1_client_key = xYj5UxxxxxxxxxxxxOP97Q
oauth1_client_secret = LDdy4VoxxxxxxxxxxxxxxAI1HtlKU
polling_interval = 30
response_handler = TwitterEventHandler
response_type = json
sourcetype = rest_twitter
streaming_request = 0
url_args = q=music,since_id=0
disabled = 0

After the first polling, the since_id has now incremented:

[rest://twitter]
auth_type = oauth1
endpoint = https://api.twitter.com/1.1/search/tweets.json
http_method = GET
index = main
index_error_response_codes = 1
oauth1_access_token = 217362964-dtJVxxxxxxxxxUOY4Q0w
oauth1_access_token_secret = BWQ2LcQhxxxxxxxxxxxlf4o1B84mWrlE
oauth1_client_key = xYj5UxxxxxxxxxxxxOP97Q
oauth1_client_secret = LDdy4VoxxxxxxxxxxxxxxAI1HtlKU
polling_interval = 30
response_handler = TwitterEventHandler
response_type = json
sourcetype = rest_twitter
streaming_request = 0
url_args = q=music,since_id=393287846443753472
disabled = 0

Path Finder

Hi Damien,

Thanks for your reply. I am using REST API v1.3.1, and the stanza looks the same as you suggested in this updated answer, including the TwitterEventHandler, along with this props.conf entry:

[rest_twitter]
TIME_PREFIX = "created_at": "
MAX_TIMESTAMP_LOOKAHEAD = 40

Ultra Champion

Hi Saad, can you please add more substance to your question? What are you doing or not doing? Are you using the REST API Modular Input? If so, what version, and what does your stanza look like? Which particular suggestions in this thread have you implemented or not implemented? If you can elaborate as thoroughly as possible, it makes it easier to answer the question effectively.


Path Finder

I am also facing the same issue of getting duplicate tweets. Does anyone have a finding on that?


Ultra Champion

You don't need to add any more fields to the indexed event. The "created_at" time is already present in the indexed event. You just have to perform your props.conf timestamp extraction properly. This works for me:

My sourcetype is "rest_twitter". I think you may have missed a space character in your TIME_PREFIX regex after the ":" character. You don't need all the other properties.

[rest_twitter]
TIME_PREFIX = "created_at": "
MAX_TIMESTAMP_LOOKAHEAD = 40


Splunk Employee

This is what I modified from the Twitter app, but I don't know if it is what needs to be done/added to responsehandlers.py in the TwitterEventHandler:

    if 'created_at' in twitter_event:
        twitter_event['_time'] = twitter_event['created_at']

Then the props.conf from the twitter app:

[twitter]
CHARSET = UTF-8
NO_BINARY_CHECK = 1
TIME_FORMAT = %a %b %d %H:%M:%S %z %Y
TIME_PREFIX = "_time":"
MAX_TIMESTAMP_LOOKAHEAD = 150
SHOULD_LINEMERGE = false
TZ = UTC
KV_MODE = json

Or do you think it's possible to extract the created_at from props.conf even though there can be 4+ instances of it?


Splunk Employee

I think another issue we are having is the timestamp extraction of created_at. There can be multiple created_at fields in the JSON response, as it can indicate when the user's account was created, when the tweet was sent, the timestamp of the original tweet if it was retweeted, etc. When I look at the raw event, the created_at we care about is at the end of the response, but I've tried every combination of TIME_PREFIX and MAX_TIMESTAMP_LOOKAHEAD in props.conf to try to extract it. I think something needs to be added in responsehandlers.py like in the Twitter app.


Ultra Champion

Perhaps you have some older tweets in your results? For example, if I clean out my index, run the input for 30 minutes, and then compare a count of all events against a search of all events deduping on "id_str", the count is the same.
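For anyone wanting to sanity-check a saved JSON response outside Splunk, a rough Python equivalent of the `| dedup id_str` comparison described here might look like this (sample data is made up for illustration):

```python
def dedup_by_id(statuses):
    # Keep the first occurrence of each id_str, mirroring SPL's `dedup id_str`
    seen = set()
    unique = []
    for event in statuses:
        tid = event.get("id_str")
        if tid not in seen:
            seen.add(tid)
            unique.append(event)
    return unique

sample = [{"id_str": "1", "text": "a"},
          {"id_str": "1", "text": "a"},
          {"id_str": "2", "text": "b"}]
print(len(sample), len(dedup_by_id(sample)))  # → 3 2
```

If the two counts differ on real data, duplicates really were indexed.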


Splunk Employee

If I just do "index=twitter | stats count by id_str | sort -count" I can see multiple entries for the same tweet id. Also, just looking at the raw events, I can see duplicate events with the same timestamp and id_str.


Ultra Champion

I don't see this. What search are you using to determine this?


Splunk Employee

This is awesome! Thank you for the update. I notice that the since_id parameter is changing, but we still seem to be ingesting duplicate tweets. Any idea what it could be?


Path Finder

I have a similar setting but am getting the below error in my splunkd.log. Has anyone encountered this?

10-23-2013 11:30:00.231 +1300 ERROR ExecProcessor - message from "python /app/splunk/etc/apps/rest_ta/bin/rest.py" Exception performing request: [Errno 8] _ssl.c:521: EOF occurred in violation of protocol


Ultra Champion

It is automatically passed into the URL parameter list.
Trace back through the code in rest.py (the while loop at line 422) to see how this happens.
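For intuition only (this is not the actual rest.py code), a polling loop that folds the handler-updated params dict back into the next request URL might be sketched like this; the helper name is hypothetical:

```python
import urllib.parse

def build_request_url(endpoint, req_args):
    # Merge the params dict (which the response handler mutates,
    # e.g. setting since_id) into the next polling request's URL
    params = req_args.get("params", {})
    if not params:
        return endpoint
    return endpoint + "?" + urllib.parse.urlencode(params)

url = build_request_url("https://api.twitter.com/1.1/search/tweets.json",
                        {"params": {"q": "music", "since_id": 393287846443753472}})
print(url)
```

Because the handler mutates req_args["params"] in place, the next iteration of the polling loop sees the updated since_id without any extra configuration.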


Splunk Employee

So I thought this was working, but it isn't. It is close, though! When we set req_args["params"]["since_id"] = last_tweet_indexed_id, do we need to set this anywhere else, or will it automatically be passed into the URL parameter list? Or does since_id need to be added to the response handler arguments in the REST GUI?
