State of the Art in Identifying Sensitive Data

Protecting personal information in your databases is a bigger deal than ever, with the European General Data Protection Regulation (GDPR) going into effect in May and California passing its new Consumer Privacy Act. Knowing what personal information you have in your systems, and where it resides, is a precondition to managing it effectively. My friend and colleague Luke Probasco, product manager at Townsend Security, has posted a nice listing of security standards along with the sensitive data elements each of them identifies; see What Data Needs to Be Encrypted in MongoDB?

If you are interested in sources that discuss the importance of this discovery process, take a look at

IEEE Cyber Security Blog post: Identify Sensitive Data and How It Should Be Handled

Carnegie Mellon University Guidelines for Data Classification

An Expert Guide to Securing Sensitive Data: 34 Experts Reveal the Biggest Mistakes Companies Make with Data Security

GlobalSign’s 5 Ways to Enhance Data Security

If you are interested in categorization strategies, take a look at

Database Trends and Applications Article: Why Data Classification Should Drive Your Security Strategy

7 Steps to Effective Data Classification on Forsythe Focus

FIPS-199: Standards for Security Categorization of Federal Information and Information Systems

NIST Special Publication 800-60 Volume 1: Guide for Mapping Types of Information and Information Systems to Security Categories

As the universe of ERP and CRM systems, marketing systems, databases, data lakes, and documents expands and proliferates across the enterprise and through the clouds, finding where sensitive data is stored becomes an increasingly daunting task. When most such data lived in relational databases, the job was much easier: you could usually scan the system catalog or internal metadata repository for columns whose names indicated sensitive content (like “SSN”), or for values in a recognizable format (like “145-45-2323”). That worked, but only on a limited basis, since a great deal of data was also being loaded into word processing documents and spreadsheets.
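As a sketch of that catalog-scanning approach, the hypothetical Python snippet below flags columns either by a name hint or by an SSN-like value format. The keyword list, the regex, and the sample catalog rows are all illustrative assumptions, not any particular database's metadata API or any vendor's detection logic.

```python
import re

# Illustrative (not exhaustive) name hints and one value pattern.
NAME_HINTS = ("ssn", "social_security", "credit_card", "dob", "email")
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def flag_sensitive_columns(catalog):
    """catalog: iterable of (table, column, sample_values) tuples,
    e.g. assembled from a system-catalog query plus a sampling query.
    Returns (table, column, reason) for each column that looks sensitive."""
    flagged = []
    for table, column, samples in catalog:
        by_name = any(hint in column.lower() for hint in NAME_HINTS)
        by_value = any(SSN_PATTERN.match(str(v)) for v in samples)
        if by_name or by_value:
            flagged.append((table, column, "name" if by_name else "value"))
    return flagged

# Hypothetical catalog rows standing in for real metadata.
catalog = [
    ("employees", "SSN", ["145-45-2323", "987-65-4321"]),
    ("employees", "hire_date", ["2017-03-01"]),
    ("customers", "tax_id", ["145-45-2323"]),
]
print(flag_sensitive_columns(catalog))
# → [('employees', 'SSN', 'name'), ('customers', 'tax_id', 'value')]
```

Note that the `tax_id` column is caught only by its values, which is exactly why name-based scanning alone was never sufficient.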

The emergence of datastores like Hadoop and of NoSQL databases like MongoDB and Couchbase, which proffer a “schema on read” strategy in which query tools infer the schema as they retrieve the data, has made that earlier strategy a partial solution at best. Doing this discovery effectively is not yet a commodity process; proposed strategies are still appearing in scholarly journals and conference proceedings, such as

2015 IEEE 7th International Symposium on Cyberspace Safety and Security: Identifying Sensitive Data Items within Hadoop

And this ScienceDirect article on Content sensitivity based access control framework for Hadoop
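With no catalog to consult in a schema-on-read store, one workaround is to sample documents, flatten their nested fields into dotted paths, and apply the same value-pattern tests used for relational columns. The Python sketch below is a hypothetical illustration under that assumption, not the approach of any of the papers or products mentioned here; the sample documents and the SSN regex are made up for the example.

```python
import re

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def flatten(doc, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf in a nested document."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value

def scan_documents(docs):
    """Return the set of field paths whose sampled values look like SSNs."""
    hits = set()
    for doc in docs:
        for path, value in flatten(doc):
            if isinstance(value, str) and SSN_PATTERN.match(value):
                hits.add(path)
    return hits

# Hypothetical documents sampled from a document store.
sample = [
    {"name": "Ada", "hr": {"ssn": "145-45-2323"}},
    {"name": "Lin", "hr": {"ssn": "321-54-9876"}, "notes": "call back"},
]
print(scan_documents(sample))  # → {'hr.ssn'}
```

Because documents in the same collection can have different shapes, the scan has to tolerate fields that appear in some documents and not others, which is what makes this inherently a sampling exercise rather than a one-time catalog lookup.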

Fortunately, industry participants recognize the need and have developed strategies to address it. Unfortunately, the ability of most organizations to roll their own solutions is limited. Therefore, they are going to need tool support. Of course, if you collected so much data that you need Hadoop clusters to run analytic workloads, you probably have the economic wherewithal to pay for the tooling required to secure them.

In fact, there is a product category defined by Gartner called “Data-Centric Audit and Protection” that these products fall into. One of the vendors in this space is Digital Guardian, which has posted a nice blog post explaining the product category. Their solution is Digital Guardian for Data Discovery. Another product in this space is Dataguise, with their DgSecure Detect product.

For an assessment of your sensitive data exposure and assistance in identifying appropriate tools for your scenarios, contact the EC Wise guys.

