Tool to find and identify PII data in database

Cobalt2112

[H]ard|Gawd
Joined
Jun 25, 2002
Messages
1,177
Hi Guys,

Is anyone aware of any tools which can help identify PII (Personal Identifiable Info) data in a given database. Find table names and use some fuzzy logic to identify possible columns.

Maybe something that can be scripted or where multiple databases can be provided.

We're looking to establish compliance.

Using MySQL and SQL Server.
 
Hi Guys,

Is anyone aware of any tools which can help identify PII (Personal Identifiable Info) data in a given database. Find table names and use some fuzzy logic to identify possible columns.

Maybe something that can be scripted or where multiple databases can be provided.

We're looking to establish compliance.

Using MySQL and SQL Server.
You can build a query to dynamically determine and iterate through the tables/columns in a database, and poll a small resultset of rows against some REGEX statements to flag potential matches. But I'm uncertain of what value that gives when you're querying a known data schema (either yours or a vendor's). And even then, you may not fully understand the design or the data could be separated to the point where a REGEX or other evaluation is futile.

So what kind of compliance are you seeking?
 
Hi sorry for the late reply.

For all of our sites, we are looking to make sure that agencies are complying with our standards and that data fields are encrypted.

We're looking out for versions of SSN, DOB, TEL #, Addresses ... basically, any PII data.

Either for product sample submissions, contests or opt-ins
 
Do the auditing as I mentioned earlier for existing data.

Implement the rules you need in your business layer validations (ie: before the DB is hit), which is going to include lots of string parsing and REGEX statements.
 
I see. The issue is that we are sometimes not involved in the dev cycle. The agency develops the website and we're brought in a few weeks before deployment, but too late in the game to have them make changes--dont want to take a chance at slipping date since site launches tie-in with media buys.

We are usually reactive and would like to change that. However, since we deal with dozens of agencies around the world, we do not guideline (yet) what table names should be called, we deal with multiple languages, open/close source software. We best-practice encryption but we don't have the manpower to do data model and code reviews.
 
I've never had a need for this, but it looks like MySQL supports querying using Regular Expressions
Code:
SELECT column_list
FROM table_name
WHERE column REGEXP pattern

So once you identify the regular expression pattern for each of your PII's it looks like this should work for you. E.g. to match a SSN you'd use something like:

Code:
SELECT column_list
FROM table_name
WHERE column REGEXP '^\d{3}-\d{2}-\d{4}$'
 
I've never had a need for this, but it looks like MySQL supports querying using Regular Expressions
Code:
SELECT column_list
FROM table_name
WHERE column REGEXP pattern

So once you identify the regular expression pattern for each of your PII's it looks like this should work for you. E.g. to match a SSN you'd use something like:

Code:
SELECT column_list
FROM table_name
WHERE column REGEXP '^\d{3}-\d{2}-\d{4}$'

Keep in mind that string parsing (and regex matching) costs CPU. And there aren't many server resources more expensive than those on an RDBMS.

On the other hand, if they are bringing you in weeks before launch - with fixed launch dates, maybe this is a viable option.

For SQL Server, you'll need to enable CLR integration and write a CLR function to perform the regex matching. (link 1) (link 2)
 
Thanks for the tips, I think they'll want our group (all in the same company) to come in maybe every quarter to audit the new sites.
 
Ohh and by the way, these databases are for global sites where phone numbers and addresses can and will be different; even as simple as column names will vary by language. So it will be very difficult to identity any pattern.
 
Ohh and by the way, these databases are for global sites where phone numbers and addresses can and will be different; even as simple as column names will vary by language. So it will be very difficult to identity any pattern.
Localization and culture variances will make your job extremely difficult as is, and that's without considering disparate DB schemas.

I visualize this situation is an unending game of "whack-a-mole". It sounds like you've already had enough information to take to upper management and relay all of the challenges. Everything we've stated thus far would only marginally help your cause, and, in fact, serves more to confirm your initial belief about trying to identify any consistency.
 
Back
Top