Unique in the shopping mall: On the reidentifiability of credit card metadata


From the abstract of "Unique in the shopping mall: On the reidentifiability of credit card metadata":

Large-scale data sets of human behavior have the potential to fundamentally transform the way we fight diseases, design cities, or perform research. Metadata, however, contain sensitive information. Understanding the privacy of these data sets is key to their broad use and, ultimately, their impact. We study 3 months of credit card records for 1.1 million people and show that four spatiotemporal points are enough to uniquely reidentify 90% of individuals. We show that knowing the price of a transaction increases the risk of reidentification by 22%, on average. Finally, we show that even data sets that provide coarse information at any or all of the dimensions provide little anonymity and that women are more reidentifiable than men in credit card metadata.

I heard the NPR segment on this, and they made it sound so much more ominous and worrisome than the abstract does. The abstract shows, by the nature of the data itself, how structured and specific it is. Not saying this isn’t dismaying, but as a would-be data science type, I can think of so many more interesting datasets to struggle with in which anonymized data would be useful, and presumably less easily reverse-engineered.

Spot-Checking Record Type Agreement against Expectations

When we last left ZoHo’s cloud-based Business Intelligence tool, it was with the promise (or was it a threat?) of coming back to discuss matters honestly. To jar your memory, here is the screenshot of the counts of record types:

Annotated ZoHo Reporting
red needs further investigation
green is exactly or almost exactly as one would expect
yellow is pretty good
oh, provided you’re using a non-free account, it’s basically one-click publishing of your data. Take that, SharePoint 2013!


the power of summary functions


So, going row by row:

  • University Alum Club has slightly more contacts than organizations, and that’s fine. From the perspective of the Columbia Business School (henceforth “CBS”) Club, anyone in the file who’s an alum of another club is likely one of two things. Either they are a programming partner (in which case they should more likely be affiliated with that club’s Board, though that’s not universally true) or, if not a subject matter expert helping us with programming development, they attended one of our events and we got some money from them. Aggregating those (relatively small) sums at the level of their club is as defensible a way as any, and might help us track whom to be extra solicitous towards.
  • Alumni Club Head Hierarchies are a placeholder mechanism meant to unify relationships among clubs as is seen here.  These records don’t represent anything real in the world — they’re just a convenient way to encode a hierarchical relationship in the database.
  • The other MBA clubs seem sound enough (though, truth be told, there are a surprisingly large number of clubs, so I’ll have to double-check that with some troubleshooting). But logically those MBA clubs’ members and public would be the most likely of any alumni to attend my programming, so it stands to reason that each of those organizations would accumulate many more contacts with attendance and/or revenue associated with them.
  • The CBS Alumni case is actually the weakest – and that phenomenon has two causes. But I’ll return to that explanation, which is lengthier, after I finish up the other lines.
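The row-by-row pass above can also be sketched as a few lines of Python. This is only an illustration: the rows, the record type labels, and the helper name `contacts_per_org` are made up for the example and are not the actual ZoHo report or my real data.

```python
from collections import Counter

# Hypothetical rows of (record_type, object_kind); values are illustrative only.
rows = [
    ("University Alum Club", "Organization"),
    ("University Alum Club", "Contact"),
    ("University Alum Club", "Contact"),
    ("MBA Club", "Organization"),
    ("MBA Club", "Contact"),
    ("MBA Club", "Contact"),
    ("MBA Club", "Contact"),
    ("Alumni Club Head Hierarchy", "Organization"),
]


def contacts_per_org(rows):
    """Return {record_type: (org_count, contact_count, contacts_per_org)}."""
    counts = Counter(rows)
    result = {}
    for record_type in sorted({rt for rt, _ in rows}):
        orgs = counts[(record_type, "Organization")]
        contacts = counts[(record_type, "Contact")]
        # No ratio when a record type has no organizations at all.
        ratio = contacts / orgs if orgs else None
        result[record_type] = (orgs, contacts, ratio)
    return result


summary = contacts_per_org(rows)
for record_type, (orgs, contacts, ratio) in summary.items():
    print(record_type, orgs, contacts, ratio)
```

A placeholder type like the hierarchy records should show zero contacts, while the club types should show a ratio at or above one; anything else is a candidate for the "red" pile.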

Now there’s an exhortation I can live with

polysemic, as it were.  and the logo alone is so soothing to look at

dataloader logo

so anyway, the wonderful people (i just hate that word “folks”) over at +mulesoft have released this piece of goodness into the world.

the interface is lovely to look at.

Dataloader Vertical Response Export ss

nicest of all, the field select tool enables the user to select fields from first-level relationships among the Salesforce.com tables. so you can get, for example, the RecordTypeName instead of making do with the 15 characters of the RecordTypeID. thus, the dataloader (or extractor: in any case, the person doing it, not the tool they’re using) is assured of an export that is human readable for a spot check before committing.
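For the curious, a first-level relationship field in a Salesforce query is written with dot notation, e.g. `RecordType.Name` on Contact. The sketch below only builds such a query string in Python; the helper `build_soql` is hypothetical, and I am not claiming this is the exact query dataloader.io generates behind its field picker.

```python
def build_soql(sobject, fields, parent_fields=None):
    """Build a SOQL SELECT string.

    parent_fields maps a relationship name (e.g. "RecordType") to the
    fields to pull from that parent object via dot notation.
    """
    columns = list(fields)
    for relationship, names in (parent_fields or {}).items():
        columns.extend(f"{relationship}.{name}" for name in names)
    return f"SELECT {', '.join(columns)} FROM {sobject}"


# Ask for the readable record type name alongside the usual contact fields.
query = build_soql("Contact", ["Id", "Name", "Email"], {"RecordType": ["Name"]})
print(query)
```

The point of the exercise is the same as in the tool: the export carries a name a human can eyeball, not just an opaque ID.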

dataloaderio relationship field lookup


i’m getting ready to populate my new mailchimp account list so as to test the Salesforce.com-mailchimp connector. and Dataloader lets me set up a connection to Dropbox and even schedule tasks. so ApexDataloader, so long: no love lost.


hello, mailchimp. so long vertical response.

i’ll be doing two exports. the first is of the Contact Email and Name fields (hard to populate MailChimp without those!). the second will be the Vertical Response logs, for developing an engagement segmentation histogram so that i can load separate lists into MailChimp: those who historically have tended to respond and click through, and those whom i spent money with Vertical Response to reach but who never opened. the economics with mailchimp are better as well: free for lists under 2000
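A minimal sketch of the segmentation idea, assuming the log export has one row per email sent with `email`, `opened`, and `clicked` columns. Those column names, the sample rows, and the `segment` helper are assumptions for illustration; the real Vertical Response export may look quite different.

```python
import csv
import io
from collections import Counter

# Hypothetical log export: one row per email sent (column names are assumed).
LOG = """email,campaign,opened,clicked
a@example.com,jan,1,1
a@example.com,feb,1,0
b@example.com,jan,0,0
b@example.com,feb,0,0
c@example.com,jan,1,0
"""


def segment(log_text):
    """Split recipients into engaged vs. never-opened and count opens."""
    opens = Counter()
    for row in csv.DictReader(io.StringIO(log_text)):
        email = row["email"].strip().lower()
        # Adding 0 still registers the address, so never-openers are kept.
        opens[email] += int(row["opened"])
    engaged = sorted(e for e, n in opens.items() if n > 0)
    never_opened = sorted(e for e, n in opens.items() if n == 0)
    # Histogram: number of opens -> how many recipients had that many.
    histogram = Counter(opens.values())
    return engaged, never_opened, histogram


engaged, never_opened, histogram = segment(LOG)
print("engaged:", engaged)
print("never opened:", never_opened)
print("opens histogram:", sorted(histogram.items()))
```

The two lists map onto the two MailChimp lists described above; the histogram is what would tell me where to draw a finer line than "opened at least once".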

More documentation coming, and soon