Getting Data In

How to use K-anonymity with Splunk?

MarcHelou
New Member

Hello,
Let's say i have a csv file that contains sensitive data, I want on index to group multiple lines as one event in a way that it doesnt compromise my data. So let's say:
User - Age
U1 - 12
U2 - 13
U3 - 17
U4 - 15
U5 - 20
How can I group for example each 2 users as one event as so(of course before indexing and not on search time):
U1,U2 - 12,13
U3,U4 - 17-15
...
Thanks in advance

0 Karma

DalJeanis
Legend

I don't understand the reason for your business case, but here is what I would do to achieve your stated objective. Instead of running it through a standard indexing, I would bring it in, aggregate it, and then collect the aggregated data into a summary index.

   | inputcsv mystuff.csv 

   | rename COMMENT as "Assign every pair of records to a group, then stats the group together " 
   | streamstats count as recno
   | eval groupno = floor( ( 1+ recno ) / 2 )
   | stats list(User) as User list(Age) as Age by groupno

   | rename COMMENT as "Set time, Get rid of unneeded fields, then copy them to the new index."
   | eval _time = now() 
   | table _time User Age 
   | collect .... send to desired index... 

If you want to break the link of order between each User and his Age, then do this to sort the fields after the stats command. This will break the relationship between any individual Age and its User.

   | stats list(User) as User list(Age) as Age by groupno
   | eval User=mvsort(User)
   | eval Age=mvsort(Age) 

If you want to change the number of records in each group to some number K, change line 5 to use your new K-1 and K as follows:

   | eval groupno = floor( ( K-1 + recno ) / K )

Updated to remove suggestion to use values, since that would delete duplicates.


There is also an issue with this anonymization method if using K=2 or k=3 and all of the Users have the same Age. Sigh.

   | inputcsv mystuff.csv 

   | rename COMMENT as "make sure that no two of the same Age are sequential." 
   | streamstats count as ageno by Age
   | eventstats count as totalcount 
   | eventstats max(ageno) as agecount by Age
   | eval myorder=round((ageno-0.5)/agecount,2)
   | sort 0 myorder User

   | rename COMMENT as "Assign every pair of records to a group, then stats the group together " 
   | streamstats count as recno
   | eval groupno = floor( ( 1+ recno ) / 2 )
   | stats list(User) as User list(Age) as Age by groupno

   | rename COMMENT as "Set time, Get rid of unneeded fields, then copy them to the new index."
   | eval _time = now() 
   | table _time User Age 
   | collect .... send to desired index... 

To work, the above depends on no one Age predominating in the data set.

0 Karma
Got questions? Get answers!

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Meet up IRL or virtually!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Get Updates on the Splunk Community!

May 2026 Splunk Expert Sessions: Security & Observability

Level Up Your Operations: May 2026 Splunk Expert Sessions Whether you are refining your security posture or ...

Network to App: Observability Unlocked [May & June Series]

In today’s digital landscape, your environment is no longer confined to the data center. It spans complex ...

SPL2 Deep Dives, AppDynamics Integrations, SAML Made Simple and Much More on Splunk ...

Splunk Lantern is Splunk’s customer success center that provides practical guidance from Splunk experts on key ...