1

I have a database that contains some confidential personal information. For data analysis purposes, I'd like to send that data out to some people. However, the anonymity of the data needs to be maintained.

Is there a way for me to generate fake tables to do JOINS on i.e. instead of a JOIN to the user table that has real names, I generate a table fakeuser dynamically and JOIN to that table instead.

I know it can be done manually but the the query doesn't run on all the users in the whole database and it would be nice to have it done dynamically i.e. when the query calls for fakeuser ID 23 and the ID does not exist, it just picks a random name from somewhere and populate the table and query.

I'm open to doing this via other types of scripting if MySQL is not flexible enough for this.

Saad Farooq
  • 13,172
  • 10
  • 68
  • 94
  • see if this so question helps http://stackoverflow.com/questions/260307/anonymizing-customer-data-for-development-or-testing – koriander Apr 23 '13 at 17:01
  • Thanks. The answers for that question seem to discourage using partial real data. I, on the other hand, need to provide partial real data for analysis. The anonymizing is more for privacy protection. The question referenced in that question (a dream within a dream?) does mention some product but they seem to be commercial. – Saad Farooq Apr 23 '13 at 17:12
  • I guess scripting is the way to go but any pointers on who to go about it would be nice. – Saad Farooq Apr 23 '13 at 17:13
  • Nota bene, as stated in the Royal Society's final report on [Science as an open enterprise](http://royalsociety.org/policy/projects/science-public-enterprise/report/): "*a substantial body of work in computer science has now demonstrated that the security of personal records in databases cannot be guaranteed through anonymisation procedures where identities are actively sought.*" If your data is sensitive, I would recommend aggregating it as far as possible before handing over to other parties: i.e. **run their reports for them, don't give them raw data**. – eggyal Apr 23 '13 at 17:27
  • I realize that. The security here is not against malicious intent but unintentional leaks i.e. we trust the researchers but they are human and it has happened that a name snuck into a final report. – Saad Farooq Apr 23 '13 at 23:06

1 Answers1

0

This seems a quite simple task but it involve several steps and I'm not sure where you are stuck.

You need a list of names which you can obtain by generating randomly combinations of letters (giving unreadable names) or you can find out a public source of names (maybe http://www.listofbabynames.org/).

Then you just need a loop to insert records in your fakeuser table by picking names randomly from the list. You just need to make sure you don't assign the same name to different records in fakeuser. This can be done by removing the name from the list when you choose one. If the list is not big enough, you can generate combinations.

I also would carefully consider the advice/article given by @eggyal to see if it applies to your situation. I think there is a lot o work on anonymization data if you need a better solution.

koriander
  • 3,110
  • 2
  • 15
  • 23