A client of mine in Education Services is uploading a large amount of data published by the NYS Department of Education about some 7,000 schools and related hierarchical entities (Districts, Charter Schools, Charter Operators) in NY State.
The ETL professional in me likes large numbers. But the SF Admin in me is wondering about a performance hit. I’m also hoping to upload this to an Enterprise org of mine that has Einstein Analytics licenses, for the soothing transformations when selecting subsets of the data. Stay tuned.
We still have a few more child objects off the account to go, and I’m wondering if this is a good case for using Big Objects or an external data source. Suggestions welcome!
Unique in the shopping mall: On the reidentifiability of credit card metadata
Large-scale data sets of human behavior have the potential to fundamentally transform the way we fight diseases, design cities, or perform research. Metadata, however, contain sensitive information. Understanding the privacy of these data sets is key to their broad use and, ultimately, their impact. We study 3 months of credit card records for 1.1 million people and show that four spatiotemporal points are enough to uniquely reidentify 90% of individuals. We show that knowing the price of a transaction increases the risk of reidentification by 22%, on average. Finally, we show that even data sets that provide coarse information at any or all of the dimensions provide little anonymity and that women are more reidentifiable than men in credit card metadata.
I heard the NPR segment on this, and they made it sound so much more ominous and worrisome than the abstract does; the abstract shows, by the nature of the data itself, how structured and specific it is. I’m not saying this isn’t dismaying, but as a would-be data science type, I can think of so many more interesting datasets to struggle with in which anonymized data would be useful, and presumably less easily reverse-engineered.
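For a feel of what "four points are enough" means, here is a minimal sketch of how one might estimate that kind of uniqueness on a toy dataset. This is my own illustration, not the paper's code or method: the function name `unicity`, the `traces` dictionary and its (day, shop) points are all made up for the example.

```python
import random


def unicity(traces, p, trials=200, seed=0):
    """Estimate the fraction of sampled users who are the only match
    for p points drawn at random from their own trace.

    traces: dict mapping a user id to a set of (time, place) points.
    This is an illustrative estimator, not the paper's procedure.
    """
    rng = random.Random(seed)
    # Only users with at least p points can be sampled from.
    users = [u for u, pts in traces.items() if len(pts) >= p]
    if not users:
        return 0.0
    unique = 0
    for _ in range(trials):
        user = rng.choice(users)
        # Pretend an outsider knows p of this user's points.
        known = set(rng.sample(sorted(traces[user]), p))
        # Count every trace that contains all of the known points.
        matches = sum(1 for pts in traces.values() if known <= pts)
        if matches == 1:
            unique += 1
    return unique / trials


# Hypothetical toy data: three shoppers, three purchases each.
traces = {
    "a": {("mon", "bakery"), ("tue", "cafe"), ("wed", "gym")},
    "b": {("mon", "bakery"), ("tue", "cafe"), ("thu", "bookshop")},
    "c": {("mon", "bakery"), ("fri", "cinema"), ("sat", "market")},
}

if __name__ == "__main__":
    for p in (1, 2, 3):
        print(p, unicity(traces, p))
```

Even in this tiny example, one known purchase often matches several shoppers, while all three purchases always single out one person. The more structured and specific the points, the fewer you need.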
Today was one of those “whoa” moments: this stuff is impressive, powerful and the tiniest bit scary. (Not that I haven’t, to paraphrase Kubrick, learned to stop worrying and simply love the cloud.)
Signed up for Xendo via Yammer, and then proceeded to add all the usual suspects of my cloud services: Salesforce, Twitter, Gmail, LinkedIn, Office 365, Evernote, Trello — hey, to learn things sometimes you have to have redundancies (even if you have to pay for the privilege).
Let it whir and churn for a while as it builds its primary indexes for the first time. Before long, you’ve got an “internal” search engine of quite some (staggering) power. I searched for my former colleague/treasurer of Columbia Business School Alumni of MetroDC, Dana Scherer, a Virginia resident who works for the Federal Government.
The result set was impressive: her contact information in my multiple synchronized systems; her attendance at events as recorded in Salesforce, along with her ticketed purchases from the club; versions of contact clean-up exports stored in Google Drive that I, mental pack-rat that I am, simply save; a reference to her in a backup of a WordPress database (evidently unencrypted). It goes on and on, just as this image does here. In fact, what you’re seeing and about to scroll through is truncated, because no browser-based screenshot tool that I know of will capture a sample as tall as this is.
(To give a slight preview, here is the search results dashboard detailing source, record type and chronological allocation of the search results. Nifty, eh?) But that’s where that slight tinge of fear comes from. If, courtesy of Xendo, I can aggregate information like that, just imagine what people who truly know their way around systems can do. The Snowden halo is hard to ignore. Still, for lil’ ol’ me, it’s simply fun. Oh, and useful.
So: get ready to tire your thumbs as you scroll through the (extremely) truncated search results.