Data Classification: How to Categorize Your Data and Where to Store It

Darrell Jones

Previously, we discussed the requirements of a mature data classification program. In this post, we are going to review the administrative mechanics of such a program. Data classification, you'll recall, usually includes a three- or four-layer system akin to the one below:

Data classification typically includes a three- or four-layer system.

I recommend that organizations new to data classification begin with the three-level system, as these levels and their corresponding actions and controls can be challenging to define. The three-level system considers all internal data confidential, so the priority is to create the processes and procedures needed to support confidential data. You can identify the smaller amounts of Public and Highly Confidential data later through interviews and technical discovery. Then you can clearly communicate your goals across the business, including locations, processes, and applications.

Today, we are going to cover how and where data is normally stored in organizations. These structures will have a huge impact on your program's scope, operations, and technical decisions. Because every organization has different business processes and technologies, each data classification project is going to be different, too.

Structured vs. Unstructured Data

Categorizing structured and unstructured data is the easiest data classification component to explain yet the hardest to manage. Structured data is any data within an application, usually a database. Your organization's application owners, database administrators, or the application vendor can explain the different types of data stored in the application. Organizations are often amazed at how much data, and how many data types, are stored in their applications. HR systems, customer relationship management (CRM) systems, enterprise resource planning (ERP) platforms, accounting platforms, and M&A solutions are just a few applications that historically hold huge structured data stockpiles. Many of these systems are regulated (e.g., HR, ERP), and the data must therefore be retained for a specific amount of time and, in some cases, indefinitely.

The security and governance capabilities of individual software distributions don’t always meet all the requirements for granular access control and emerging governance requirements. – Doug Henschen

Unstructured data is data not stored in an application. Excel spreadsheets, PowerPoint presentations, and Word docs are classic examples, as are reports generated from structured data systems. Unstructured data is usually ten times larger by volume than structured data. The reason is simple: saving copies of important files in several places makes employees feel secure. Email historically accounts for the largest amount of unstructured data in an organization. Think about it: employees email an important or sensitive document to ensure everyone has a copy, then save the email in a PST file or in a folder on their laptop. There could be hundreds of copies of a single file containing highly sensitive data located in hundreds of places across the network.
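One practical way to surface those scattered copies is content hashing: two files with identical bytes produce the same digest, no matter where they live or what they are named. The sketch below is an illustrative assumption about how such a sweep might start, not a production discovery tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root: str) -> dict:
    """Group files under `root` by SHA-256 content hash.

    Any hash that maps to more than one path is a duplicate copy --
    potentially a sensitive document saved in several places.
    """
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

A real sweep would also need to look inside PST files and email attachments, which is exactly why email-borne copies are so hard to govern.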

Ominous Approach of Data Lakes and Cloud Solutions

A trend in business today is to seek value in all the structured data that organizations store. Industries from real estate to waste management have discovered hidden value in the data they are collecting. Some may think this trend started in the finance industry, but they would be wrong. The rise of data started with the focus on analytics that Google and Facebook pioneered. These and similar organizations realized they could increase profitability and customer stickiness if they targeted their advertising at specific users. IP addresses, login times, hover points, and other data provided unique insight into their users that they could sell to a larger pool of advertisers. That information was also useful to other organizations for a host of other reasons. Remember Cambridge Analytica? This new value was made possible by the first-of-their-kind data lakes, though they were not called that at the time.

The key is the new stuff doesn’t have the benefits that we expected from the old stuff. – Merv Adrian

Data lakes give organizations the unique ability to "dump" data from any number of sources and formats. They are usually unmanaged and open to any account with access to the lake. Regardless of the purpose of the lake (marketing, business insights, archiving, etc.), the data classification characteristics are the same. First, it accepts all data. Second, it is an open platform by design. Third, the majority of these solutions are migrating to or being built in the cloud.

Data Warehouse vs. Data Lake

Data warehouses are more secure than lakes because data is cleansed before it enters the warehouse. See below:

Data warehouses cleanse data prior to entering the cloud.

A data lake, on the other hand, takes in ALL data without the step of transformation and restructuring:

A data lake, unlike a data warehouse, takes in ALL data, no questions asked.
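The warehouse-versus-lake contrast above can be sketched in a few lines of code. This is a hypothetical illustration: the field names (`email`, `ssn`) and the validation rule are assumptions chosen for the example, not a real ETL pipeline.

```python
# Two sample records: one well-formed, one malformed.
RAW_RECORDS = [
    {"email": "a@example.com", "ssn": "123-45-6789"},
    {"email": "not-an-email", "ssn": ""},
]


def load_into_lake(records):
    """A data lake accepts everything as-is: no schema, no cleansing."""
    return list(records)


def load_into_warehouse(records):
    """A warehouse cleanses and restructures before loading;
    records that fail validation are rejected at the door."""
    cleansed = []
    for rec in records:
        if "@" not in rec.get("email", ""):
            continue  # reject malformed rows rather than store them
        cleansed.append({
            "email": rec["email"].lower(),
            "has_ssn": bool(rec.get("ssn")),  # flag, don't copy, the SSN
        })
    return cleansed
```

The security difference falls out directly: the warehouse loader can refuse or transform sensitive fields on the way in, while the lake loader, by design, has no such checkpoint.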

You cannot afford to overlook the data classification issues that arise when laking all this data, especially from a regulatory compliance standpoint. You need to be involved in the design of the lake(s) from the start, and regardless of how a lake is built, data classification needs to be a consideration in its design. For example, suppose a lake is being designed for archival purposes. Should Highly Confidential data be included? Should Highly Confidential data have its own lake, or should it be excluded completely? Applying classification either at the beginning of data ingestion, or at the end when data is exported from the process data stores, is your best strategy.
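Classification at ingestion can be as simple as tagging each record with a level before it lands in the lake. The sketch below assumes a three-level scheme, as recommended earlier, where everything internal defaults to Confidential; the patterns and field names are illustrative assumptions, not a production DLP ruleset.

```python
import re

# Hypothetical detection rules, ordered most sensitive first.
PATTERNS = {
    "Highly Confidential": [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")],  # SSN-like
    "Confidential": [re.compile(r"[\w.]+@[\w.]+")],                 # email-like
}


def classify(text: str) -> str:
    """Return the highest classification level whose pattern matches."""
    for level, patterns in PATTERNS.items():
        if any(p.search(text) for p in patterns):
            return level
    # Three-level default: all internal data is treated as Confidential.
    return "Confidential"


def ingest(record: dict) -> dict:
    """Tag the record before it lands in the lake, so downstream
    consumers can enforce level-appropriate controls."""
    return {**record, "classification": classify(str(record.get("body", "")))}
```

Tagging at this stage means the lake can still accept everything, but governing groups retain a handle on what each record contains.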

Knowing what systems are providing data to the lake is important. When data is put into a lake, there are fewer protections available for the governing groups (Cyber, Risk, Compliance, etc.) compared to enterprise databases or relational database systems.

With traditional database management systems, the information security team might handle all the network security and access control protections but do little with the data once it enters the database management system. Data lake structures, however, do not come with all of the governance capabilities and policies associated with a traditional database management system, from basic referential integrity to role-based access and separation of duties. One way to approach data lake security, according to Merv Adrian, is to think of it as a pipeline with upstream, midstream, and downstream components. The threat vectors associated with each stage are somewhat different and therefore need to be addressed differently.

Data lakes provide great value to the organization but require a different governance model to maintain classification controls.

Coming Up

In my next post, I will pull all the elements we’ve previously discussed together with the Data Management Controls Matrix.

