Why is my line breaking for Twitter data not working reliably?

Champion

I am working on Twitter data in JSON format, basically the way you get it from the 1% sample stream (UPDATE: not quite, there are differences; see the example tweet at the end). Everything works fine, except that in rare cases two tweets are not correctly separated (i.e. the line breaking is not applied). See this sample, which contains the (much shortened) tweets in question.

There are four tweets: one, a newline, two without a newline between them, a newline, and a fourth. Don't ask me why the newline between the second and third is missing; I just get the data that way and there's not much I can do about it. When they are added in Splunk with standard settings, they are indexed as three events, with the second event consisting of two tweets (because the carriage return/newline is missing):

[screenshot: standard]

I've tried my best to find a regex that sorts this mess out, and I've come up with this (probably not optimal, but explicit) regex:

"}([\r\n]*){"filter

It should look for carriage returns or newlines between the closing brace of one tweet and the opening brace of the next, so in the standard situation it should work just like usual. Since the capturing group can be empty, it should also work in the special case with nothing between the two curly braces. On regex101.com, this finds my line breaks just fine, but when I set it as the regex in the "Add Data" wizard in Splunk, it splits the second and third tweets (as it should) and then concatenates the third and fourth:

[screenshot: four regex]

This also applies to a fifth one: when I add another tweet with a newline before it to the end of the file, the blob of tweets consists of three tweets:

[screenshot: five]

What am I missing here?
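
The regex101.com result can be reproduced with a minimal Python sketch. The four miniature tweets below are made up for illustration (the real sample is not reproduced in this post), and Python's re module is not the PCRE engine Splunk uses, so this only confirms that the pattern as written finds all three boundaries, including the empty one.

```python
import re

# Hypothetical, heavily shortened stand-ins for the four tweets:
# a newline after the first, nothing between the second and third,
# and a newline before the fourth.
sample = (
    '{"filter_level":"low","id":1,"@version":"1"}\n'
    '{"filter_level":"low","id":2,"@version":"1"}'
    '{"filter_level":"low","id":3,"@version":"1"}\n'
    '{"filter_level":"low","id":4,"@version":"1"}'
)

# The pattern exactly as written in the question. Python treats these
# unescaped braces as literals because they do not form a quantifier.
pattern = re.compile(r'"}([\r\n]*){"filter')

# Each match is one boundary between two tweets; group 1 holds the
# (possibly empty) run of line-break characters between them.
boundaries = [m.group(1) for m in pattern.finditer(sample)]
print(boundaries)
```

Three matches for four tweets is what one would expect if every boundary is recognized, so whatever goes wrong in the wizard is presumably not a problem with the logic of the pattern itself.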


{
    "filter_level": "low",
    "retweeted": false,
    "in_reply_to_screen_name": null,
    "possibly_sensitive": false,
    "truncated": false,
    "lang": "in",
    "in_reply_to_status_id_str": null,
    "id": 611075954270539776,
    "in_reply_to_user_id_str": null,
    "timestamp_ms": "1434526835397",
    "in_reply_to_status_id": null,
    "created_at": "Wed Jun 17 07:40:35 +0000 2015",
    "favorite_count": 0,
    "place": null,
    "coordinates": null,
    "text": "Yen mlakune bokonge megal megol kiwo nengen persis koyo ngaduk sego goreng, genah iku wanito!",
    "contributors": null,
    "geo": null,
    "entities": {
        "trends": [],
        "symbols": [],
        "urls": [],
        "hashtags": [],
        "user_mentions": []
    },
    "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
    "favorited": false,
    "in_reply_to_user_id": null,
    "retweet_count": 0,
    "id_str": "611075954270539776",
    "user": {
        "location": "#Mbelinger , Yogyakarta",
        "default_profile": false,
        "profile_background_tile": true,
        "statuses_count": 4859,
        "lang": "en",
        "profile_link_color": "090A0A",
        "profile_banner_url": "https://pbs.twimg.com/profile_banners/1642741580/1433708559",
        "id": 1642741580,
        "following": null,
        "protected": false,
        "favourites_count": 61,
        "profile_text_color": "333333",
        "verified": false,
        "description": "Senajan mbeling nanging iseh eling, Mung wong ndeso sing kenal Medsos, KaosJawaMbeling CP: 085228313131",
        "contributors_enabled": false,
        "profile_sidebar_border_color": "EEEEEE",
        "name": "IG :Jawa Mbeling",
        "profile_background_color": "131516",
        "created_at": "Sat Aug 03 12:27:50 +0000 2013",
        "default_profile_image": false,
        "followers_count": 17479,
        "profile_image_url_https": "https://pbs.twimg.com/profile_images/593032155669745664/DCGT-Xr6_normal.jpg",
        "geo_enabled": true,
        "profile_background_image_url": "http://abs.twimg.com/images/themes/theme14/bg.gif",
        "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme14/bg.gif",
        "follow_request_sent": null,
        "url": "http://www.jawambeling.com",
        "utc_offset": null,
        "time_zone": null,
        "notifications": null,
        "profile_use_background_image": true,
        "friends_count": 31,
        "profile_sidebar_fill_color": "EFEFEF",
        "screen_name": "jawambeling",
        "id_str": "1642741580",
        "profile_image_url": "http://pbs.twimg.com/profile_images/593032155669745664/DCGT-Xr6_normal.jpg",
        "listed_count": 17,
        "is_translator": false
    },
    "@version": "1",
    "@timestamp": "2015-06-17T07:40:30.740Z"
}
1 Solution

Champion

It was too easy: the regex flavor Splunk uses, PCRE, is supposed to work with unescaped } and {. That's why regex101.com showed that indeed my regex from above,

"}([\r\n]*){"filter

worked there. In Splunk, however, those curly brackets need to be escaped (duh). So what worked for me is the following regex:

\"\}([\n\r]*)\{\"filter

I also changed the capture group to avoid any trouble from other special characters between events, so that the entire props.conf is now

[twitter]
LINE_BREAKER = \"\}([^{]*)\{\"filter
SHOULD_LINEMERGE = false
TIME_PREFIX = timestamp_ms":"
MAX_TIMESTAMP_LOOKAHEAD = 30
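
As a rough check of the LINE_BREAKER above, here is a small Python sketch of the splitting rule as the props.conf documentation describes it: the previous event ends where the first capture group starts, the next event starts where it ends, and the group's content is discarded. The helper break_events and the miniature tweets are hypothetical, and Python's re stands in for Splunk's PCRE, so treat this as an approximation rather than Splunk's actual behaviour.

```python
import re

def break_events(raw: str, line_breaker: str) -> list[str]:
    """Split raw text following the documented LINE_BREAKER rule:
    the previous event ends where capture group 1 starts, the next
    event starts where group 1 ends, and group 1 itself is dropped."""
    events, start = [], 0
    for m in re.finditer(line_breaker, raw):
        events.append(raw[start:m.start(1)])
        start = m.end(1)
    events.append(raw[start:])  # whatever follows the last boundary
    return events

# Made-up miniature tweets: one boundary has "\r\n", one has nothing,
# and one has stray spaces around the newline.
raw = (
    '{"filter_level":"low","id":1,"@version":"1"}\r\n'
    '{"filter_level":"low","id":2,"@version":"1"}'
    '{"filter_level":"low","id":3,"@version":"1"} \n '
    '{"filter_level":"low","id":4,"@version":"1"}'
)

events = break_events(raw, r'\"\}([^{]*)\{\"filter')
for event in events:
    print(event)
```

One caveat worth checking against real data: because [^{]* matches anything except an opening brace, a "} that closes a nested object late in a tweet could start the match early, and everything from there up to the next tweet would then be discarded as part of group 1. A narrower group such as [\r\n\s]* would avoid that, at the cost of not tolerating other stray characters between events.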

Thank you everyone who chimed in on this!

Splunk Employee

I escape everything that I would consider a "special character", i.e. anything that is not a letter or number. That way I don't have to remember which ones will screw me up, and it doesn't hurt anything. People rag on me for that, but how many days did it take you to see that? And how many other pairs of eyes? 🙂

With Splunk... the answer is always "YES!". It just might require more regex than you're prepared for!

Champion

How did you know it took me another pair of eyes? 😄

I will be doing that pretty strictly from now on as well.


Champion

It looks like you have multiple JSON events on one line. Try this with a break before {"filter_level


Splunk Employee

You are entering your pattern with the wrong directive for the structure you've got...
If you click the "advanced" bar you'll see that entering a "regex pattern" for breaking lines doesn't produce the LINE_BREAKER directive but BREAK_ONLY_BEFORE.

I'm not sure what's happening to your data, but if you look at the raw tweet you should see that "filter" is not the first field of the event. You're missing most of the tweet info (which is also why you don't see the proper timestamps).
I searched for "filter" in case it only shows up in some tweets (I'm no expert on this data), and the resulting event looked like this:

{"timestamp_ms":"1434552867663","retweet_count":0,"id_str":"611185141507985408","__time":"Wed Jun 17 14:54:27 +0000 2015","coordinates":null,"lang":"ja","favorite_count":0,"filter_level":"low","in_reply_to_status_id":null,"possibly_sensitive":false,"in_reply_to_screen_name":null,"in_reply_to_user_id_str":null,"id":611185141507985408,"contributors":null,"text":"【スマブラ】(かくいう俺も嫌いだということは黙っておこう・・・) http://t.co/s9hCLPXaDX","retweeted":false,"in_reply_to_user_id":null,"geo":null,"source":"<a href=\"http://smabro-matome.com/\" rel=\"nofollow\">スマブラまとめつぶやき用</a>","created_at":"Wed Jun 17 14:54:27 +0000 2015","favorited":false,"entities":{"symbols":[],"urls":[{"indices":[33,55],"expanded_url":"http://smabro-matome.com/archives/176362","url":"http://t.co/s9hCLPXaDX","display_url":"smabro-matome.com/archives/176362"}],"hashtags":[],"user_mentions":[],"trends":[]},"truncated":false,"in_reply_to_status_id_str":null,"place":null,"user":{"follow_request_sent":null,"following":null,"id_str":"2833839404","notifications":null,"friends_count":2282,"description":null,"followers_count":2413,"statuses_count":93138,"location":"","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_text_color":"333333","profile_link_color":"0084B4","protected":false,"profile_image_url":"http://pbs.twimg.com/profile_images/515796075987218432/wG9qIr8J_normal.jpeg","profile_use_background_image":true,"time_zone":null,"profile_sidebar_fill_color":"DDEEF6","id":2833839404,"geo_enabled":false,"is_translator":false,"utc_offset":null,"profile_image_url_https":"https://pbs.twimg.com/profile_images/515796075987218432/wG9qIr8J_normal.jpeg","default_profile_image":false,"profile_background_tile":false,"profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","url":"http://smabro-matome.com/","profile_sidebar_border_color":"C0DEED","default_profile":true,"created_at":"Sat Sep 27 09:31:09 +0000 
2014","favourites_count":0,"name":"スマッシュブラザーズまとめ","profile_background_color":"C0DEED","listed_count":2,"lang":"ja","contributors_enabled":false,"screen_name":"smabro_matome","verified":false}}

filter_level is on the second line (wrapping).


Champion

Unfortunately, we don't get the data the way it comes from the 1% sample stream (sorry for initially stating that in the question). I assumed that when I started writing the question, noticed partway through that the data is different, and forgot to change the sentence at the beginning.

I've appended an example of how a tweet looks for me to the question above (I get it after it has been processed by another system), but in short:
a) I don't have a __time field which is why I was using created_at. In all my tweets, this was exactly the same time as in timestamp_ms, I chose created_at because it is easier to read and thus check for inconsistencies.
b) My first field is filter_level, see above. Also, this field is not deprecated, so it should appear in all Twitter data; see here for the docs on that.
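
The claim in a) can be checked for the example tweet with a short Python sketch. The two values are copied from that tweet; the format string is the same layout the Twitter app's TIME_FORMAT uses for this kind of field. This only covers one tweet, so it is a spot check, not proof for the whole data set.

```python
from datetime import datetime, timezone

# Values copied from the example tweet in the question.
created_at = "Wed Jun 17 07:40:35 +0000 2015"
timestamp_ms = "1434526835397"

# Parse created_at with the strptime equivalent of the TIME_FORMAT layout.
from_created_at = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")

# timestamp_ms is epoch time in milliseconds; split off the milliseconds
# so the conversion stays in exact integer arithmetic.
seconds, millis = divmod(int(timestamp_ms), 1000)
from_timestamp_ms = datetime.fromtimestamp(seconds, tz=timezone.utc)

print(from_created_at.isoformat())
print(from_timestamp_ms.isoformat(), millis)
print(from_created_at == from_timestamp_ms)
```

Both fields name the same second; timestamp_ms additionally carries the millisecond part, which created_at cannot express.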

Regarding LINE_BREAKER and BREAK_ONLY_BEFORE, I've always understood the docs like this: if possible, you should use LINE_BREAKER and set SHOULD_LINEMERGE to false, because that requires the least processing power (see the docs on props.conf and search for LINE_BREAKER and SHOULD_LINEMERGE). If you go the other way and set SHOULD_LINEMERGE to true with BREAK_ONLY_BEFORE (or one of the other settings that are available with line merging), Splunk will possibly break more often and merge some of the pieces again later. Maybe this makes no difference in my case, but I still prefer the direct method without line merging as long as I don't have to deal with any multiline events. That's why I don't use what the Add Data wizard proposes.
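
To illustrate the difference between the two approaches, here is a deliberately simplified Python model of the line-merging path: the input is first cut into lines on runs of CR/LF, and lines are then merged into events, with a new event starting only at a line that matches BREAK_ONLY_BEFORE. The function merge_lines and the miniature tweets are hypothetical, and date-based breaking and the other merge settings are left out, so this is a sketch of the idea rather than of Splunk's real pipeline.

```python
import re

def merge_lines(raw: str, break_only_before: str) -> list[str]:
    """Very simplified model of SHOULD_LINEMERGE = true:
    1. split the input into lines on runs of CR/LF,
    2. merge lines into events, starting a new event only when a line
       matches break_only_before.
    Date-based breaking and the other merge settings are ignored."""
    events: list[str] = []
    for line in re.split(r'[\r\n]+', raw):
        if not line:
            continue
        if events and not re.search(break_only_before, line):
            events[-1] += "\n" + line   # merge into the current event
        else:
            events.append(line)         # start a new event
    return events

# Made-up miniature tweets; the second and third share one line.
raw = (
    '{"filter_level":"low","id":1}\n'
    '{"filter_level":"low","id":2}{"filter_level":"low","id":3}\n'
    '{"filter_level":"low","id":4}\n'
)

events = merge_lines(raw, r'^\{"filter')
print(len(events))
```

In this model, a pattern applied at the merge stage can only decide whether whole lines are joined; it cannot split two tweets that share a line. That kind of split has to happen in the first step, which is what LINE_BREAKER controls.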

I think I've found the answer to my question in the meantime though, going to post that in a second. Nonetheless, thank you very much for your effort!


Splunk Employee

You shouldn't need to use a LINE_BREAKER, as Splunk recognizes JSON input. You can use INDEXED_EXTRACTIONS, but that does indeed index the fields, and that takes up some extra space.

Here is what the Twitter app uses:

[twitter]
CHARSET = UTF-8
NO_BINARY_CHECK = 1
TIME_FORMAT = %a %b %d %H:%M:%S %z %Y
TIME_PREFIX = "__time":"
MAX_TIMESTAMP_LOOKAHEAD = 150
SHOULD_LINEMERGE = false
TZ = UTC
KV_MODE = json

In this case Splunk recognizes the JSON on its own, and KV_MODE, while a bit old school, is probably not really doing anything here.

There is a default setting, as follows, that makes KV_MODE obsolete in this case:

AUTO_KV_JSON = [true|false]
* Used for search-time field extractions only.
* Specifies whether to try json extraction automatically.
* Defaults to true.
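
For readers unfamiliar with what search-time JSON extraction produces, the following Python sketch flattens a nested tweet into field names. The naming scheme (dots for nested objects, {} for array elements) is assumed to mirror what Splunk shows for JSON events; the function flatten is hypothetical and is not how Splunk implements it. The hashtag entry is invented, the other values come from the example tweet in the question.

```python
import json

def flatten(value, prefix=""):
    """Yield (field_name, value) pairs; nested objects are joined with
    dots and array elements are marked with {} (naming assumed to mirror
    what Splunk's JSON extraction shows)."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for child in value:
            yield from flatten(child, prefix + "{}")
    else:
        yield prefix, value

# A few fields taken from the example tweet in the question; the hashtag
# entry is invented for illustration.
tweet = json.loads(
    '{"filter_level": "low", "lang": "in",'
    ' "entities": {"hashtags": [{"text": "example"}], "urls": []},'
    ' "user": {"screen_name": "jawambeling", "followers_count": 17479}}'
)

# dict() keeps only the last value per name, which is enough here
# because no array in this sample has more than one element.
fields = dict(flatten(tweet))
for name, val in fields.items():
    print(name, "=", val)
```

Empty arrays such as urls produce no field at all in this sketch.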

What happens if you just identify the timestamp?


Champion

I tried the original twitter settings; out of the box it doesn't work because in my data, there is no __time field (I don't know why). I have to rely on the created_at field as shown in my screenshots above. But even with settings adjusted to the new TIME_PREFIX, the tweets in question stay as they are, with number two and three in one event.

When I only identify the timestamp, the result is exactly the same.


Splunk Employee

I just realized what you're doing and I'll put it in an individual answer... but as far as the timestamp is concerned you'll want to look at that again:

There are several fields that deal with time, and it is the timestamp_ms or the __time field that gives you the accurate time for the particular tweet. The other time fields refer to the retweet, mention, etc. The timestamp should be about the time you downloaded the data.
Take one tweet and put it in a text editor where you can replace each { with {\n, and you'll be able to see the structure better.
In the first 150 characters or so you should see something like this:

"favorite_count":0,"__time":"Wed Jun 17 14:31:47 +0000 2015","in_reply_to_status_id"

Either way, that's your timestamp.


SplunkTrust

What does your props.conf file look like? I was having this issue with some of my SendGrid logs. Example of some of the logs:

[{"response":"550 5.1.1 User Unknown ","sg_event_id":"1234567890","sg_message_id":"7410258963","event":"deferred","email":"user1@abc.local","attempt":"23","timestamp":1432305797,"smtp-id":"<random@server.local>"}]
[{"email":"user2@abc.local","timestamp":1432305792,"smtp-id":"<random@server.local>","sg_event_id":"2345678901","sg_message_id":"4108529637","event":"processed"}]
[{"email":"user3@abc.local","timestamp":1432305793,"smtp-id":"<random@server.local>","sg_event_id":"3456789012","sg_message_id":"1085296374","event":"processed"},
{"email":"user4@abc.local","timestamp":1432305793,"smtp-id":"<random@server.local>","sg_event_id":"4567890123","sg_message_id":"0852963741","event":"processed"},
{"email":"user5@abc.local","timestamp":1432305793,"smtp-id":"<random@server.local>","sg_event_id":"5678901234","sg_message_id":"852963710","event":"processed"},
{"email":"user6@abc.local","timestamp":1432305793,"smtp-id":"<random@server.local>","sg_event_id":"6789012345","sg_message_id":"5296374108","event":"processed"},
{"email":"user7@abc.local","timestamp":1432305795,"smtp-id":"<random@server.local>","response":"250 Message Queued (No RCPTS) ","sg_event_id":"7890123456","sg_message_id":"2963741085","event":"delivered"},
{"email":"user8@abc.local","smtp-id":"<random@server.local>","timestamp":1432305796,"response":"250 Backend Replied [7531598520.abcd.server.local]:  2.0.0 Ok: queued as A1B2C3D4 (Mode: n ","sg_event_id":"8901234567","sg_message_id":"9637410852","event":"delivered"},
{"email":"user9@abc.local","timestamp":1432305796,"smtp-id":"<random@server.local>","response":"250 Message Queued (No RCPTS) ","sg_event_id":"9012345678","sg_message_id":"637410852963","event":"delivered"},
{"email":"user0@abc.local","timestamp":1432305796,"smtp-id":"<random@server.local>","response":"250 Message Queued (No RCPTS) ","sg_event_id":"0123456789","sg_message_id":"3741085296","event":"delivered"}]

My props.conf for this sourcetype:

[sendgrid_json]
INDEXED_EXTRACTIONS = json
KV_MODE = none
NO_BINARY_CHECK = true
TIMESTAMP_FIELDS = timestamp
category = Structured
disabled = false
pulldown_type = true

Now it shows up in the nice JSON format.
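
The reason line-based splitting struggles with logs like the excerpt above is that a single JSON array can span several lines. As an illustration outside of Splunk, this Python sketch walks through back-to-back JSON documents with json.JSONDecoder.raw_decode, ignoring where the line breaks fall. The function iter_json_documents and the shortened records are hypothetical, and this is not a description of how INDEXED_EXTRACTIONS works internally.

```python
import json

def iter_json_documents(text: str):
    """Yield each top-level JSON value found back to back in text,
    regardless of how line breaks fall between or inside them."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace (including newlines) between documents.
        if text[pos].isspace():
            pos += 1
            continue
        value, pos = decoder.raw_decode(text, pos)
        yield value

# Shortened, made-up records in the shape of the log excerpt above:
# a one-line array followed by an array spread over two lines.
log = (
    '[{"email":"user2@abc.local","timestamp":1432305792,"event":"processed"}]\n'
    '[{"email":"user3@abc.local","timestamp":1432305793,"event":"processed"},\n'
    '{"email":"user4@abc.local","timestamp":1432305793,"event":"processed"}]\n'
)

# Flatten the arrays so that every record becomes one event.
events = [rec for batch in iter_json_documents(log) for rec in batch]
print([e["email"] for e in events])
```

The same loop also copes with objects that follow each other without any separator, since raw_decode stops at the end of each complete value.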


Champion

My props.conf is based around the original one from the twitter app:

[twitter]
CHARSET = utf-8
NO_BINARY_CHECK = 1
TIME_FORMAT = %a %b %d %H:%M:%S %z %Y
TIME_PREFIX = "created_at":"
MAX_TIMESTAMP_LOOKAHEAD = 35
TZ = UTC
KV_MODE = json

I tried your settings, adjusting the TIMESTAMP_FIELDS to TIME_PREFIX, but that now yields "No results found. Please change Sourcetype, adjust Sourcetype settings, or check your source file." in the add data wizard.


Splunk Employee

Are you using the Twitter app, or have you done this yourself?


Champion

No, this is not the twitter app.
