A frequent request from customers is a way to identify personally identifiable information (PII) within their databases. I thought I’d show a brief example of how this can be done.

User Story: “I want to identify unencrypted credit card numbers and social security numbers within documents so that I can ensure developers aren’t storing things they shouldn’t in the database.”

Background: As of version 4.6, N1QL has a “tokenizer” function, TOKENS. Combined with N1QL’s regex functions and purpose-built secondary indexes, it gives us a powerful toolset for identifying patterns within the database.

Example Solution: I created a query to identify unencrypted social security numbers stored within a bucket (the “default” bucket in this case). I’m looking for any pattern of digits that matches xxx-xx-xxxx or xxxxxxxxx. The TOKENS function allows me to treat a document as an array of strings. I used the “specials” flag to tell N1QL to keep these strings intact. Without it, N1QL strips spaces and dashes and ignores the items that follow those characters. I then look for any element of the token array that matches the regular expression.
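A minimal sketch of such a query, wrapped in a small Node.js script so the pattern can be checked without a cluster. The alias `d`, the variable `t`, the exact regular expression, and the helper `anyTokenMatches` are illustrative assumptions, not necessarily the query used for this post.

```javascript
// Sketch only: the bucket name "default" comes from the text above;
// the alias, variable names, and pattern are assumptions.

// Anchored pattern for xxx-xx-xxxx or exactly nine digits.
// [0-9] is used instead of \d to sidestep string-escaping questions.
const SSN_PATTERN = '^[0-9]{3}-[0-9]{2}-[0-9]{4}$|^[0-9]{9}$';

// N1QL statement: tokenize each document, keeping special characters,
// and return the IDs of documents where at least one token matches.
const ssnQuery = `
SELECT META(d).id
FROM \`default\` AS d
WHERE ANY t IN TOKENS(d, {"specials": true})
      SATISFIES REGEXP_LIKE(TOSTRING(t), "${SSN_PATTERN}") END`;

// Hypothetical local stand-in for the ANY ... SATISFIES clause:
// true if any token, converted to a string, matches the pattern.
function anyTokenMatches(tokens, pattern) {
  const re = new RegExp(pattern);
  return tokens.some((t) => re.test(String(t)));
}

console.log(ssnQuery);
console.log(anyTokenMatches(['jane', '123-45-6789'], SSN_PATTERN)); // true
console.log(anyTokenMatches(['555-1234', '12345'], SSN_PATTERN));   // false
```

Any nine-digit value matches the second alternative, so results from a pattern like this are candidates to review rather than confirmed social security numbers.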

Identifying unencrypted credit card numbers stored within a bucket uses the same approach, with a card number pattern in place of the social security number pattern.
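As a rough illustration, the same shape of query can be written for card numbers. This sketch assumes 16-digit numbers, either unbroken or in dash-separated groups of four; real card formats vary, and the pattern, alias, and the helper `findCardTokens` are hypothetical.

```javascript
// Sketch only: assumes 16-digit card numbers, written either as one run
// of digits or as four dash-separated groups of four.
const CARD_PATTERN = '^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$|^[0-9]{16}$';

// N1QL statement with the same ANY ... SATISFIES structure as the
// social security number case, using the card pattern instead.
const cardQuery = `
SELECT META(d).id
FROM \`default\` AS d
WHERE ANY t IN TOKENS(d, {"specials": true})
      SATISFIES REGEXP_LIKE(TOSTRING(t), "${CARD_PATTERN}") END`;

// Hypothetical helper: returns the tokens that look like card numbers,
// which is handy when reviewing why a document was flagged.
function findCardTokens(tokens) {
  const re = new RegExp(CARD_PATTERN);
  return tokens.map(String).filter((t) => re.test(t));
}

console.log(cardQuery);
console.log(findCardTokens(['visa', '4111-1111-1111-1111', '123-45-6789']));
```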

To speed up processing, I use memory-optimized secondary indexes (MOI) for these queries. Every mutation in Couchbase is asynchronously sent to the index projector. MOI has the added benefit of updating the information contained within the index every 20 ms. The indexes also make use of tokenization.
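One plausible shape for such an index is an array index over the document’s tokens, filtered by the same pattern the query uses. The sketch below only builds the statements. The index names, the use of `self` to refer to the whole document, the `WHEN` filter, and the helper `tokenIndexStatement` are all assumptions. Memory-optimized storage is a cluster-level index setting, so it does not appear in the statement itself.

```javascript
// Sketch only: patterns repeated here so the script is self-contained.
const SSN_PATTERN = '^[0-9]{3}-[0-9]{2}-[0-9]{4}$|^[0-9]{9}$';
const CARD_PATTERN = '^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$|^[0-9]{16}$';

// Hypothetical helper: builds a CREATE INDEX statement for an array index
// over the tokens of each document, keeping only tokens that match the
// given pattern. Identifiers are wrapped in backticks for N1QL.
function tokenIndexStatement(indexName, bucket, pattern) {
  return (
    `CREATE INDEX \`${indexName}\` ON \`${bucket}\`` +
    `(DISTINCT ARRAY t FOR t IN TOKENS(self, {"specials": true}) ` +
    `WHEN REGEXP_LIKE(TOSTRING(t), "${pattern}") END) USING GSI`
  );
}

// Assumed index names, one per pattern.
const ssnIndex = tokenIndexStatement('idx_ssn_tokens', 'default', SSN_PATTERN);
const cardIndex = tokenIndexStatement('idx_card_tokens', 'default', CARD_PATTERN);

console.log(ssnIndex);
console.log(cardIndex);
```

Whether the query planner actually selects an index like this depends on the server version and on how closely the query’s ANY clause matches the index definition, so it is worth confirming with EXPLAIN before relying on it.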

A second index, built the same way, supports the query for unencrypted credit cards.

Try It Out: Docker is my favorite way to spin up a development environment. An easy-to-use repo for the above examples is on GitHub: n1ql-query-nodejs. It uses docker-compose to build two services:

  1. A single node Couchbase cluster service.
  2. A nodejs service to provision the Couchbase cluster with 250,000 user profiles and indexes for several examples, including finding unencrypted PII.

Author

Posted by Todd Greenstein

Todd Greenstein is a Solution Architect at Couchbase. Todd specializes in API design, architecture, data modeling, and nodejs and golang development.
