Splunk Search

How to index Azure Table storage data without a valid DateTime column?

thilleso
Path Finder

Hi,

Does anyone have experience with the Splunk Add-on for Azure app, specifically retrieving Azure Table storage data?

The problem is getting the table data without having a valid DateTime field in the Azure Table. Instead, the table has a column containing DateTime.Ticks values (a 64-bit integer, e.g. 635999855693046079). The provider of the table is a third party, so I'm not able to change this.
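For context, a DateTime.Ticks value counts 100-nanosecond intervals since 0001-01-01 UTC. Here is a rough Python sketch (not part of the add-on) of what the example value above represents:

# Rough sketch: interpret a .NET DateTime.Ticks value (not the add-on's code)
import datetime

TICKS_AT_EPOCH = 621355968000000000   # ticks between 0001-01-01 and 1970-01-01 (Unix epoch)
TICKS_PER_SECOND = 10 ** 7            # one tick = 100 nanoseconds

def ticks_to_datetime(ticks):
    return datetime.datetime.utcfromtimestamp((ticks - TICKS_AT_EPOCH) / float(TICKS_PER_SECOND))

print(ticks_to_datetime(635999855693046079))  # approximately 2016-05-27 22:39:29 UTC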

I've tried using the ticks field as the Date/Time Column in the input setup; no error message is shown when saving the input, but nothing gets indexed. As a test I made a new table with a standard DateTime column, and this was indexed as expected. So far the only solution I see is to modify the AzureStorageTable.py script to parse the field correctly there, but that's something I'd rather not do.

1 Solution

jconger
Splunk Employee

The generic Azure Storage Table input does not support this. However, the Azure Diagnostics input does use ticks on some of the queries. The PartitionKey for some of the tables is ticks. Using the PartitionKey is preferable when querying Azure data since this field is indexed by Azure (which means faster queries). Check out the code in AzureDiagnostics.py.
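To illustrate the idea, here is a rough sketch of building a PartitionKey filter from a start time in ticks. This is not the add-on's actual code, and it assumes the WAD* diagnostics convention of a PartitionKey that is the tick count zero-padded to 19 digits; verify the exact format against AzureDiagnostics.py and your own tables:

# Rough sketch (not the add-on's code): build a PartitionKey filter from ticks.
# Assumes a WAD*-style PartitionKey that is the tick count zero-padded to 19 digits.
import datetime

TICKS_AT_EPOCH = 621355968000000000   # .NET ticks at the Unix epoch (1970-01-01)
TICKS_PER_SECOND = 10 ** 7

def datetime_to_ticks(dt):
    seconds_since_epoch = (dt - datetime.datetime.utcfromtimestamp(0)).total_seconds()
    return TICKS_AT_EPOCH + int(seconds_since_epoch * TICKS_PER_SECOND)

start = datetime.datetime.utcnow() - datetime.timedelta(days=2)
partition_key = str(datetime_to_ticks(start)).zfill(19)

# PartitionKey is a string column, so the OData literal is quoted
filter_string = "PartitionKey gt '%s'" % partition_key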



thilleso
Path Finder

Ended up editing AzureStorageTable.py to make it work. I edited/added the following:

import md5, hashlib 
.
.

# Build a filesystem-safe checkpoint file name from the table/file name
def get_encoded_csv_file_path(checkpoint_dir, file_name):
    name = ""
    for i in range(len(file_name)):
        if file_name[i].isalnum():
            name += file_name[i]
        else:
            name += "_"

    name = name[:100]

    m = hashlib.md5() # Updated to hashlib since md5.new() is deprecated and slower
    m.update(file_name)
    name += "_" + m.hexdigest() + ".csv.gz"

    return os.path.join(checkpoint_dir, name)

# This is used to convert a datetime object to .NET ticks (100-nanosecond intervals since 0001-01-01)
def dateTimetoTicks(dt):
    ticksBeforeEpoch = 621355968000000000  # ticks between 0001-01-01 and the Unix epoch
    ticksSinceEpoch = long((dt - datetime.datetime.utcfromtimestamp(0)).total_seconds() * 10000000)
    totalTicks = ticksBeforeEpoch + ticksSinceEpoch

    return totalTicks
.
.

# Default the start date to 2 days ago
dateTimeStart = (datetime.datetime.today() - datetime.timedelta(days=2))
dateTimeStart = dateTimetoTicks(dateTimeStart)

if date_time_start not in ['', None]:
    # if a start time was specified in the config, use that value instead of the default
    TempdateTimeStart = dateutil.parser.parse(date_time_start)
    dateTimeStart = dateTimetoTicks(TempdateTimeStart)
.
.

if marker is not None:
    dateTimeStart = marker

# NEW filter_string with ticks as filter - appending L for specifying Int64
filter_string = "%s gt %sL" % (date_time_column, dateTimeStart)
.
.

# Updated to check whether the entity value is already a long (ticks)
if not isinstance(entity[date_time_column], long):
    # if the entity column is not already in ticks, try to convert its datetime value
    try:
        entity[date_time_column] = dateTimetoTicks(entity[date_time_column])
.
.

if isinstance(entity[date_time_column], long) and (entity[date_time_column] > last_dateTime):
    # compare this entity's value (in ticks) to the last_dateTime variable
    last_dateTime = entity[date_time_column]

Querying on PartitionKey (ticks) is much faster than querying on datetime columns, as explained here:
http://www.codeproject.com/Tips/671361/Some-tips-and-tricks-for-azure-table-storage

Cheers!



woodcock
Esteemed Legend

The best thing is to convert the Azure ticks to time_t before you import the data. If that cannot be done, then the next best thing is to use DATETIME_CONFIG = CURRENT to force Splunk to use the current time as each event's timestamp, then use All time for every search, convert the Ticks field to a time_t value, and reassign it to _time before doing any work. The performance of this will be terrible, but it will work. Here is how to convert:

http://stackoverflow.com/questions/1613853/convert-ticks-to-time-t-or-filetime
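For reference, here is a rough Python sketch of that conversion (the function name is just a placeholder):

# Rough sketch of the ticks -> time_t conversion described above
TICKS_AT_EPOCH = 621355968000000000   # .NET ticks at the Unix epoch (1970-01-01)
TICKS_PER_SECOND = 10 ** 7

def ticks_to_time_t(ticks):
    # time_t is whole seconds since the Unix epoch; drop the sub-second part
    return (int(ticks) - TICKS_AT_EPOCH) // TICKS_PER_SECOND

print(ticks_to_time_t(635999855693046079))  # 1464388769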
