Architect’s Guide to Data Privacy Compliance Using OSS

The lack of clear data privacy regulation creates a nightmare scenario for CIOs, who are left guessing how, when and with what level of care to treat consumer data. Forward-looking organizations should prepare for inevitable regulation by ensuring that enterprise data systems are designed to reduce data exposure, identify encryption gaps and track and manage sensitive data in a centralized manner.

There are two capabilities that must be implemented for an enterprise data system to deliver appropriate data privacy compliance:

  • Detect sensitive information in both structured and unstructured data sources so that this data can be tagged and its lineage tracked.
  • Store the identified sensitive information in a system that provides appropriate access controls, auditing and encryption (in-flight and at rest).

This blog will guide the reader through an example of identifying sensitive information using an open source project called DataProfiler, and then using the MinIO object store to support data privacy requirements.

General State of Privacy:

Data privacy laws protecting consumer information continue to evolve around the world. While European regulators enforce strict privacy rules under GDPR (General Data Protection Regulation), there is no federal data protection regulation in the US. Four US states have enforceable regulations in place today, and a federal regulation is being reviewed in Congress in the fall of 2022. A modern enterprise data architecture should take data protection regulations into account and be designed so that the enterprise knows where data resides at all times.

There are two types of sensitive data that enterprises need to be aware of: Personally Identifiable Information (PII) and Protected Health Information (PHI). A few examples of PII are bank account information, social security numbers, passport or driver's license numbers, phone numbers and home addresses. A few examples of PHI include an individual's medical conditions, disabilities and charts. Some organizations collect PHI (such as hospitals, labs and drug companies); however, almost all enterprises hold PII on their clients. This information is extremely valuable for customizing offerings to consumers, but each company has a legal obligation to use it within legal bounds and to provide consumers with the right to revoke or modify the data that the organization holds. For companies to comply with data privacy regulations, they must know what data resides in what location, a difficult task unless the organization has a well-thought-out enterprise data architecture.

Detecting Sensitive Information:

Step one in creating an enterprise data strategy that adheres to data privacy regulations is to detect the presence of sensitive information within the data set. Identification of sensitive information can occur either via a rule-based model or a machine learning system. The key difference is that a machine learning system continuously evolves through training on available data, whereas a rule-based model is programmed against a fixed set of rules and actions (think if, then, else statements in a programming language). Rule-based models have limited scalability since every scenario must be pre-programmed, whereas machine learning based models can evolve with incoming data streams and new learnings.
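
To make the contrast concrete, the sketch below shows what a purely rule-based detector might look like. This is a hypothetical illustration (not taken from DataProfiler or any other product): a single hand-written pattern catches one social security number format and silently misses every variation that was never programmed.

import re

# A hypothetical rule-based detector: one hand-written pattern for US Social
# Security numbers in the NNN-NN-NNNN format.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def contains_ssn(text: str) -> bool:
    return bool(SSN_PATTERN.search(text))

print(contains_ssn("Customer SSN on file: 123-45-6789"))  # True
print(contains_ssn("SSN recorded as 123 45 6789"))        # False: this format was never anticipated

Every new data type or format variation requires yet another rule, which is exactly why a trained model that generalizes from examples tends to scale better.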

An engineering organization familiar with Natural Language Processing (NLP) systems can implement a customized solution for data identification. However, there are many proprietary and open source projects that can assist in implementing a data identification system. Most large cloud providers, such as Amazon, Microsoft and Google, offer cloud-based capabilities to identify sensitive information, such as Amazon Glue, the Azure Information Protection service and Google's Cloud Data Loss Prevention solution.

For the purpose of this blog we will look at an open source tool called DataProfiler, which was originally developed at Capital One. The source code for DataProfiler is available under the Apache 2.0 license in the following GitHub repository: https://github.com/capitalone/DataProfiler. DataProfiler has native support for CSV, AVRO, Parquet, JSON, text and URL data sources. DataProfiler provides access to a pre-trained deep learning model and the ability to extend that model with new pipelines for entity recognition. DataProfiler lets users run predictions on structured data and return the results as a Pandas DataFrame (think of labeled data structures with columns of different types). It can also operate on unstructured data and return Named Entity Recognition (NER) output, which identifies and categorizes key information in text.
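
As a minimal sketch of how this looks in practice (assuming DataProfiler has been installed with pip install DataProfiler; the calls below follow the project's documented API, but verify them against the version you install), profiling a structured file and labeling a piece of unstructured text might look like this:

import dataprofiler as dp

# Profile a structured data source; dp.Data auto-detects CSV, JSON, Parquet, AVRO or text.
data = dp.Data("customer_records.csv")      # hypothetical input file
profile = dp.Profiler(data)
report = profile.report(report_options={"output_format": "pretty"})
print(report["data_stats"])                 # per-column statistics, including predicted data labels

# Label unstructured text with the pre-trained entity recognition model.
labeler = dp.DataLabeler(labeler_type="unstructured")
predictions = labeler.predict(["Jane Doe, SSN 123-45-6789, jane@example.com"])
print(predictions)                          # character-level entity predictions (NER-style output)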

Storing Sensitive Information:

All data privacy regulations require that an organization:

  • Knows what PII/PHI information is held within the enterprise data lake and where this information resides
  • Knows how long this information must be retained and has appropriate access controls in place (who can access PII data)
  • Has implemented appropriate breach protection within the enterprise storage systems to ensure consumer data is safe (both in-flight and at rest)
  • Can honor data subject access requests (such as requests to purge or modify data) within specific time frames

There are many ways to set up an enterprise data lake architecture that supports appropriate PII/PHI retention and access rules. The MinIO team is available to review and comment on our enterprise clients' data architectures, but to keep this blog useful and simple let's consider the following scenario:

Imagine a financial institution that retains two types of PII (I did mention this would be a simple example). Type one is highly sensitive PII, such as social security numbers and bank account information; this data should only be available to the finance department. Type two is less sensitive PII, such as clients' email addresses and phone numbers; this data needs to be available to the marketing/sales teams.

The rest of this blog implements this simple scenario by creating two PII buckets (highly sensitive and less sensitive). We will guide the reader through creating the buckets, defining appropriate access controls for the two buckets (for the finance and marketing/sales groups), making sure the MinIO deployment is secure and putting appropriate data retention policies in place.

Creating the Buckets:

The MinIO Console can be used for general administrative tasks like bucket creation, user creation and identity and access management. For the example we are implementing, the sensitive PII bucket needs to be created with object locking and a specific time-based retention period, which can be done as follows:

In this case we have created a new bucket called sensitive-pii with object locking enabled (so objects cannot be deleted by mistake) and retention set to 180 days. The retention period should reflect the organization's need to carry this sensitive data for a specified period of time, but no longer. By specifying this period at creation time, the organization can ensure that sensitive PII is not held any longer than necessary.

A second bucket for less-sensitive-pii data can be created in the same way as the sensitive PII bucket above, except that object locking and retention are not defined for the less sensitive data.
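
If you prefer to script these steps rather than click through the Console, the same buckets can be created with the MinIO Python SDK. The sketch below is a minimal example; the endpoint and credentials are placeholders, and the GOVERNANCE-mode, 180-day retention mirrors the settings described above:

from minio import Minio
from minio.commonconfig import GOVERNANCE
from minio.objectlockconfig import DAYS, ObjectLockConfig

# Placeholder endpoint and administrative credentials.
client = Minio("minio.example.net", access_key="admin", secret_key="admin-secret", secure=True)

# Highly sensitive bucket: object locking must be enabled at creation time,
# then a default 180-day retention rule is applied.
client.make_bucket("sensitive-pii", object_lock=True)
client.set_object_lock_config("sensitive-pii", ObjectLockConfig(GOVERNANCE, 180, DAYS))

# Less sensitive bucket: no object locking or retention.
client.make_bucket("less-sensitive-pii")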

The DataProfiler tool highlighted above can be used to identify social security and bank account information, which will be persisted to the sensitive-pii bucket, whereas any object containing email addresses or phone numbers (but no social security or bank account information) should be routed to the less-sensitive-pii bucket.
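
A hypothetical routing step tying the two tools together might look like the sketch below. The label names (SSN, BAN, EMAIL_ADDRESS, PHONE_NUMBER) follow DataProfiler's default entity set and should be verified against your installed version; collapsing DataProfiler's raw prediction output into a per-document set of labels is left out here because its exact format depends on the labeler and postprocessor configuration.

import io
from minio import Minio

# Placeholder endpoint and credentials for a service account allowed to write to both buckets.
client = Minio("minio.example.net", access_key="ingest", secret_key="ingest-secret", secure=True)

HIGH_SENSITIVITY = {"SSN", "BAN"}                     # social security / bank account numbers
LOW_SENSITIVITY = {"EMAIL_ADDRESS", "PHONE_NUMBER"}

def route_document(name: str, payload: bytes, detected_labels: set) -> str:
    """Upload a document to the bucket that matches the entities detected in it."""
    if detected_labels & HIGH_SENSITIVITY:
        bucket = "sensitive-pii"
    elif detected_labels & LOW_SENSITIVITY:
        bucket = "less-sensitive-pii"
    else:
        return "unclassified"                         # no PII detected; handle separately
    client.put_object(bucket, name, io.BytesIO(payload), length=len(payload))
    return bucket

# detected_labels would come from DataProfiler's labeler output for this document.
print(route_document("loan-app-001.txt", b"Applicant SSN: 123-45-6789", {"SSN"}))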

Controlling Access to Sensitive Data:

MinIO uses Policy-Based Access Control (PBAC) to define the authorized actions and resources to which an authenticated user has access. Each policy describes one or more actions and conditions that outline the permissions of a user or group of users.

Identity and access management can be accomplished either by leveraging the built-in MinIO identity provider or by connecting to an external OpenID Connect-compatible access management solution (such as Okta, Keycloak, Google or Facebook) or an Active Directory/LDAP-based access management solution.

Under the “Identity” tab in the MinIO Console a user can create groups, in our case Finance and MarketingSales, and assign policies to these groups as well as the members that belong to them. In the architecture diagram above we have created a “Finance User” that belongs to the Finance group and a “MarketingSales User” that belongs to the MarketingSales group.

One of the first things an administrator needs to do is create the policies that can be assigned to users for access. Imagine two different policies that apply to our two groups: Finance and MarketingSales.

The policy that allows users to access the sensitive-pii bucket (from above) may look as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::sensitive-pii",
                "arn:aws:s3:::sensitive-pii/*"
            ]
        }
    ]
}

The less sensitive PII access policy, called lesssensitivePII, allows access to the less-sensitive-pii bucket as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::less-sensitive-pii",
                "arn:aws:s3:::less-sensitive-pii/*"
            ]
        }
    ]
}

When a user is created within MinIO, policies and groups can be defined at creation time so that the user inherits the appropriate permissions. For example, you can create a MarketingSales group user (in the screenshot below we have called this user marketingsalesuser; this name could be anything, and in the architecture diagram we called this user “MarketingSales User”) and assign the lesssensitivePII policy to this user as follows:

Similarly, you can create a Finance user (called spiiuser; again, this name could be anything, such as FinanceUser) that has access to the sensitive-pii bucket via the sensitive-pii policy as well as to the less-sensitive-pii bucket. In this case the spiiuser has access not only to sensitive data, but also to less sensitive data.

The marketing and sales user will only have the lesssensitivePII policy, and therefore access only to the less-sensitive-pii bucket.
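
To sanity-check the policies, you can connect as each user and confirm that the marketing/sales user is denied access to the sensitive bucket. The sketch below uses the MinIO Python SDK; the endpoint, credentials and object names are placeholders for the users and data described above.

from minio import Minio
from minio.error import S3Error

# Connect with the credentials of the marketingsalesuser created above (placeholders).
marketing = Minio("minio.example.net",
                  access_key="marketingsalesuser",
                  secret_key="marketingsales-secret",
                  secure=True)

# Allowed: the lesssensitivePII policy grants s3:* on less-sensitive-pii.
marketing.fget_object("less-sensitive-pii", "contacts-001.json", "/tmp/contacts-001.json")

# Denied: no statement grants this user access to sensitive-pii.
try:
    marketing.fget_object("sensitive-pii", "accounts-001.json", "/tmp/accounts-001.json")
except S3Error as err:
    print("Expected failure:", err.code)    # AccessDenied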

You can find details on access management and how to customize access in our documentation:

https://min.io/docs/minio/linux/administration/identity-access-management/policy-based-access-control.html

Breach Protection / Securing MinIO:

One of the most basic requirements of every privacy regulation is to ensure that consumer data is safe from exposure to bad actors. MinIO provides the highest levels of data protection available in the industry. MinIO can be configured to secure data in-flight using TLS certificates. MinIO also provides the ability to secure data at rest using an industry-standard KMS (Key Management System) or the built-in MinIO solution known as KES. The critical consideration is knowing how to set up the TLS environment for in-flight data and how to connect to a KMS environment.
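
Once a KMS is connected through KES, a default server-side encryption rule can be attached to the sensitive bucket so that every object written to it is encrypted at rest. The sketch below uses the MinIO Python SDK; the endpoint, credentials and the key name my-kes-key are placeholders, and the call assumes the cluster already has TLS and KES configured.

from minio import Minio
from minio.sseconfig import Rule, SSEConfig

# secure=True means the client connection itself is protected with TLS (in-flight encryption).
client = Minio("minio.example.net", access_key="admin", secret_key="admin-secret", secure=True)

# Require SSE-KMS for all objects written to the sensitive bucket (at-rest encryption).
# "my-kes-key" is a placeholder for a key managed by the KMS behind KES.
client.set_bucket_encryption("sensitive-pii", SSEConfig(Rule.new_sse_kms_rule("my-kes-key")))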

My colleague Andreas Auernhammer has a detailed blog on how best to set up a secure environment that can act as a blueprint for organizations wanting to mitigate information security threat vectors; please refer to the following:

https://blog.min.io/secure-minio-1/

Conclusion:

An enterprise data architect has the responsibility to design the enterprise data lake so that data privacy regulatory compliance is part of the architecture, not an afterthought. The identification and tagging of data is the first step to achieving compliance. The ability to securely store this data, query the content on demand and access or modify this data per the data subject's (consumer's) request is critical. MinIO provides a fundamental building block for creating one of the most comprehensive data privacy environments in the marketplace. MinIO has senior architects available to review an enterprise's data architecture and provide feedback based on our experience helping hundreds of organizations across the globe meet their object storage requirements.