From the abstract of
Unique in the shopping mall: On the reidentifiability of credit card metadata
Large-scale data sets of human behavior have the potential to fundamentally transform the way we fight diseases, design cities, or perform research. Metadata, however, contain sensitive information. Understanding the privacy of these data sets is key to their broad use and, ultimately, their impact. We study 3 months of credit card records for 1.1 million people and show that four spatiotemporal points are enough to uniquely reidentify 90% of individuals. We show that knowing the price of a transaction increases the risk of reidentification by 22%, on average. Finally, we show that even data sets that provide coarse information at any or all of the dimensions provide little anonymity and that women are more reidentifiable than men in credit card metadata.
I heard the NPR segment on this, and they made it sound so much more ominous and worrisome than the abstract does. The abstract shows, by the nature of the data itself, how structured and specific it is. Not saying this isn’t dismaying, but as a would-be data science type, I can think of so many more interesting datasets to struggle with in which anonymized data would be useful, and presumably less easily reverse-engineered.
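To get a feel for what "four points are enough" means, here is a minimal sketch of a unicity estimate on toy data. It is my own illustration, not the paper's code: the `unicity` function, the toy traces, and the (place, day) representation are all assumptions. The idea is to pick a person, reveal `p` of their points at random, and check whether anyone else's trace also contains all of those points.

```python
import random


def unicity(traces, p, trials=100, seed=0):
    """Estimate the fraction of people uniquely pinned down by p known points.

    traces: dict mapping a (hypothetical) user id to a set of (place, day) tuples.
    """
    rng = random.Random(seed)
    # Only people with at least p points can have p points revealed.
    eligible = [u for u, pts in traces.items() if len(pts) >= p]
    if not eligible:
        return 0.0
    unique = 0
    for _ in range(trials):
        user = rng.choice(eligible)
        # Sort first so sampling is reproducible for a given seed.
        known = set(rng.sample(sorted(traces[user]), p))
        # Everyone whose trace contains all the known points is a candidate.
        candidates = [u for u, pts in traces.items() if known <= pts]
        if len(candidates) == 1:
            unique += 1
    return unique / trials


# Made-up traces, purely for illustration.
toy = {
    "a": {("bakery", 1), ("cafe", 2), ("gym", 3)},
    "b": {("bakery", 1), ("cafe", 2), ("cinema", 4)},
    "c": {("bakery", 1), ("bar", 5), ("gym", 3)},
}

for p in (1, 2, 3):
    print(p, unicity(toy, p))
```

Even in this tiny example the pattern is visible: one known point rarely singles anyone out, while knowing a whole three-point trace always does, because no two toy traces are identical.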
OK, so I have some work to do to bring the LLC/English-language side of the house up to speed.
But it’ll be fun.
It Feels Good. And you can really see the dramatic difference in productivity now that my client’s team has some focus.
When we last left Zoho’s cloud-based Business Intelligence tool
it was with the promise (or was it a threat?) of coming back to discuss matters honestly. To jog your memory, here is the screenshot of the counts of record types:
So, going row by row:
- University Alum Club has slightly more contacts than organizations, and that’s fine. From Columbia Business School (henceforth “CBS”) Club’s perspective, anyone in the file who’s an alum of another club is likely one of two things. Either s/he is a programming partner (in which case they should more likely be affiliated with that club’s Board, but that’s not universally true), or, if not a subject matter expert helping us with programming development, s/he attended one of our events and we got some money from them. Aggregating those (relatively small) sums at their club level is as defensible a way as any, and might help us track whom to be extra solicitous towards.
- Alumni Club Head Hierarchies are a placeholder mechanism meant to unify relationships among clubs as is seen here. These records don’t represent anything real in the world — they’re just a convenient way to encode a hierarchical relationship in the database.
- The other MBA clubs seem sound enough (though truth be told there is a surprisingly large number of clubs, so I’ll have to go double-check that with some troubleshooting). But logically those MBA clubs’ members/public would be the most likely of any alumni to attend my programming, so it only stands to reason that these clubs would accumulate many more contacts with attendance and/or revenue associated with them for each organization I track.
- The CBS Alumni case is actually the weakest – and that phenomenon has two causes. But I’ll return to that explanation, which is lengthier, after I finish up the other lines.
I discovered today in the Chrome Web Store a nifty little business intelligence offering from Zoho.
So I uploaded some exports from Salesforce. First we’ll take a look at the login activity data, which begins to point towards how one audits things in the multidimensional space that is a database in the cloud, towards which lots of web services are making calls.
The summary function reporting of Zoho’s BI Tool is just like a SQL/MS Access GROUP BY query. It enables us to take this table of 1,691 rows and look at the clustering of values. To make this interesting, I chose to group on (and thereby collapse around) the LoginType field, and to count the records to produce the following distribution histogram:
Absolute Automation is the name of an app by iHance. It’s an email matching app that takes all email sent to my address and tries to find a Salesforce record to attach it to. It makes for a very thorough approach to CRM, which is rather exactly what we’d expect from Salesforce.com.
Cirrus Insight is an app that syncs Google Apps contact data with Salesforce, and enables creating new accounts, contacts, and leads from within the Gmail interface. Those 175 entries via the browser: that’s me as the admin, a living, breathing mortal who is a mere piker in comparison to the hard-working apps. Such is the beauty and power of Cloud computing.
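For anyone who wants to reproduce the group-and-count step outside of Zoho, here is a rough sketch using Python's built-in `sqlite3` module. The table name `login_history`, the function name, and the sample values are placeholders I made up for illustration; only the `LoginType` field name comes from the export.

```python
import sqlite3


def count_by_login_type(rows):
    """Group rows on LoginType and count them, largest group first."""
    con = sqlite3.connect(":memory:")
    # Hypothetical one-column table standing in for the login export.
    con.execute("CREATE TABLE login_history (LoginType TEXT)")
    con.executemany("INSERT INTO login_history VALUES (?)", rows)
    result = con.execute(
        "SELECT LoginType, COUNT(*) AS n "
        "FROM login_history "
        "GROUP BY LoginType "
        "ORDER BY n DESC, LoginType"
    ).fetchall()
    con.close()
    return result


# Placeholder values, not the real LoginType values from the export.
sample_rows = [
    ("Browser",), ("App A",), ("App A",),
    ("App B",), ("App A",), ("Browser",),
]

for login_type, n in count_by_login_type(sample_rows):
    # A crude text histogram: one '#' per record.
    print(f"{login_type:<10}{n:>4}  {'#' * n}")
```

The `GROUP BY` collapses the table around each distinct LoginType value, and `COUNT(*)` supplies the height of each bar.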
Record Type Agreement between Salesforce.ACCOUNT object and Salesforce.CONTACT.
Record type agreement: after all my bellyaching about the importance of a record type schema that can handle the complexity of the milieu in which an Ivy League Alum Club operates, record type agreement is one way to track whether one’s practice lives up to one’s theory. Furthermore, this little exercise is providing an awfully convenient excuse to dig deeper into Zoho Reports. It’s pretty nifty the way it’s just a few short clicks until you can make some interesting discoveries.
The image below shows a portion of the 1700-plus rows in the table. The grey shaded portions are SF.CONTACT object fields; the light blue are SF.ACCOUNT object fields. And the dark blue are redactions on my part to safeguard my alumni data.
When I was enthusing earlier about dataloader.io, this is why: if you don’t pull over the related records’ actual fields, a Salesforce export is often not an easy read for a human. The record IDs are long strings of letters and digits in which upper vs. lowercase actually counts!
One nice diagnostic test to run is to compare counts of Salesforce CONTACT records, by record type, against the number of organizational ACCOUNT records, by record type. The logic of the relationships that are to be expected helps one ascertain how well the coding schema is working. So, again grouping on ACCOUNT.Organization Record type: what inferences can we make about the SF.CONTACT records by type?
Take a look at the entries in the report and, as Linda Richman would say: “discuss amongst yourselves.”
I’ll use non-Linda-Richman diction by noting that I’ll return to this anon.
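In the meantime, the diagnostic itself (accounts per record type next to the contacts hanging off those accounts) can be sketched in a few lines of Python. This is a toy version under stated assumptions: the rows are invented, the `RecordType` key is a simplified stand-in for however the export labels the account record type, and `contacts_per_account_type` is a hypothetical helper, not anything Zoho or Salesforce provides.

```python
from collections import Counter


def contacts_per_account_type(accounts, contacts):
    """Return {record type: (number of accounts, number of contacts)}."""
    # Look up each account's record type by its Id.
    type_of = {a["Id"]: a["RecordType"] for a in accounts}
    n_accounts = Counter(type_of.values())
    # Each contact inherits the record type of its parent account;
    # contacts pointing at an unknown account are skipped.
    n_contacts = Counter(
        type_of[c["AccountId"]] for c in contacts if c["AccountId"] in type_of
    )
    return {t: (n_accounts[t], n_contacts[t]) for t in n_accounts}


# Invented rows for illustration only.
accounts = [
    {"Id": "A1", "RecordType": "MBA Club"},
    {"Id": "A2", "RecordType": "MBA Club"},
    {"Id": "A3", "RecordType": "University Alum Club"},
    {"Id": "A4", "RecordType": "Hierarchy Placeholder"},
]
contacts = [
    {"Id": "C1", "AccountId": "A1"},
    {"Id": "C2", "AccountId": "A1"},
    {"Id": "C3", "AccountId": "A2"},
    {"Id": "C4", "AccountId": "A3"},
]

for record_type, (n_acc, n_con) in contacts_per_account_type(accounts, contacts).items():
    print(f"{record_type:<24}accounts={n_acc}  contacts={n_con}")
```

A placeholder type showing zero contacts is what the schema predicts; a type with far more accounts than contacts, or the reverse, is where the troubleshooting should start.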