• Select Scheme: Decide whether the data model should be built following a relational, dimensional,
fact-based, or NoSQL scheme. Refer to the earlier discussion on scheme and when to choose each
scheme (see Section 1.3.4).
• Select Notation: Once the scheme is selected, choose the appropriate notation, such as information
engineering or object role modeling. Choosing a notation depends on standards within an organization
and the familiarity of users of a particular model with a particular notation.
• Complete Initial CDM: The initial CDM should capture the viewpoint of a user group. It should not
complicate the process by trying to figure out how their viewpoint fits with other departments or with
the organization as a whole.
o Collect the highest-level concepts (nouns) that exist for the organization. Common concepts
are Time, Geography, Customer/Member/Client, Product/Service, and Transaction.
o Then collect the activities (verbs) that connect these concepts. Relationships can go both
ways, or involve more than two concepts. Examples: Customers have multiple Geographic
Locations (home, work, etc.); Geographic Locations have many Customers; Transactions occur
at a Time, at a Facility, for a Customer, selling a Product. (A sketch of recording these
concepts and relationships appears after this list.)
• Incorporate Enterprise Terminology: Once the data modeler has captured the users’ view in the
boxes and lines, the data modeler next captures the enterprise perspective by ensuring consistency with
enterprise terminology and rules. For example, there would be some reconciliation work involved if
the audience conceptual data model had an entity called Client, and the enterprise perspective called
this same concept Customer.
• Obtain Sign-off: After the initial model is complete, make sure the model is reviewed for data
modeling best practices as well as its ability to meet the requirements. Usually email verification that
the model looks accurate will suffice.
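As a minimal sketch of the "Complete Initial CDM" step above, the collected concepts (nouns) and relationships (verbs) could be recorded in a simple structure like the following; the Python representation and the names used are illustrative only, not part of any standard.

```python
# Minimal sketch of recording an initial conceptual data model (CDM):
# high-level concepts (nouns) and the relationships (verbs) that connect them.
# Entity and relationship names are illustrative.

concepts = {"Time", "Geography", "Customer", "Product", "Transaction"}

# Each relationship: (subject, verb phrase, object); some concepts participate
# in more than one relationship, and relationships can run in both directions.
relationships = [
    ("Customer", "has", "Geography"),      # customers have multiple locations
    ("Geography", "has", "Customer"),      # and locations have many customers
    ("Transaction", "occurs at", "Time"),
    ("Transaction", "occurs for", "Customer"),
    ("Transaction", "sells", "Product"),
]

def undefined_concepts(rels, known):
    """Flag relationship endpoints that are not yet captured as concepts."""
    used = {c for subj, _, obj in rels for c in (subj, obj)}
    return used - known

if __name__ == "__main__":
    print(undefined_concepts(relationships, concepts))  # set() -> internally consistent
```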
A logical data model (LDM) captures the detailed data requirements within the scope of a CDM.
To identify information requirements, one must first identify business information needs, in the context of one
or more business processes. As their input, business processes require information products that are themselves
the output from other business processes. The names of these information products often identify an essential
business vocabulary that serves as the basis for data modeling. Regardless of whether processes or data are
modeled sequentially (in either order), or concurrently, effective analysis and design should ensure a relatively
balanced view of data (nouns) and processes (verbs), with equal emphasis on both process and data modeling.
Requirements analysis includes the elicitation, organization, documentation, review, refinement, approval, and
change control of business requirements. Some of these requirements identify business needs for data and
information. Express requirement specifications in both words and diagrams.
Logical data modeling is an important means of expressing business data requirements. For many people, as the
old saying goes, ‘a picture is worth a thousand words’. However, some people do not relate easily to pictures;
they relate better to reports and tables created by data modeling tools.
Many organizations manage requirements formally. Management may guide the drafting and refinement of formal
requirement statements, such as “The system shall…” Written data requirement specification documents may be
maintained using requirements management tools. The specifications gathered in any such documentation should
be carefully synchronized with the requirements captured in data models to facilitate impact analysis, so one
can answer questions like “Which parts of my data models represent or implement
Requirement X?” or “Why is this entity here?”
It can often be a great jump-start to use pre-existing data artifacts, including already built data models and
databases. Even if the data models are out-of-date, parts can be useful to start a new model. Make sure however,
that any work done based on existing artifacts is validated by the SMEs for accuracy and currency. Companies
often use packaged applications, such as Enterprise Resource Planning (ERP) systems, that have their own data
models. Creation of the LDM should take into account these data models and either use them, where applicable,
or map them to the new enterprise data model. In addition, there could be useful data modeling patterns, such as
a standard way of modeling the Party Role concept. Numerous industry data models capture how a generic
industry, such as retail or manufacturing, should be modeled. These patterns or industry data models can then be
customized to work for the particular project or initiative.
Associative entities are used to describe Many-to-Many (or Many-to-Many-to-Many, etc.) relationships. An
associative entity takes the identifying attributes from the entities involved in the relationship, and puts them
into a new entity that just describes the relationship between the entities. This allows the addition of attributes to
describe that relationship, such as effective and expiration dates. Associative entities may have more than two
parents. Associative entities may become nodes in graph databases. In dimensional modeling, associative
entities usually become fact tables.
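For illustration, here is a sketch of an associative entity resolving a many-to-many relationship between Student and Course; the table and column names are hypothetical, and SQLite is used via Python only as a convenient way to keep the example runnable.

```python
import sqlite3

# Sketch of an associative entity resolving a many-to-many relationship
# between Student and Course; all names are illustrative.
ddl = """
CREATE TABLE Student (Student_ID INTEGER PRIMARY KEY, Student_Name TEXT NOT NULL);
CREATE TABLE Course  (Course_ID  INTEGER PRIMARY KEY, Course_Name  TEXT NOT NULL);

-- The associative entity takes the identifying attributes of both parents
-- and adds attributes that describe the relationship itself.
CREATE TABLE Registration (
    Student_ID       INTEGER NOT NULL REFERENCES Student(Student_ID),
    Course_ID        INTEGER NOT NULL REFERENCES Course(Course_ID),
    Effective_Date   TEXT    NOT NULL,
    Expiration_Date  TEXT,
    PRIMARY KEY (Student_ID, Course_ID, Effective_Date)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
```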
Add attributes to the conceptual entities. An attribute in a logical data model should be atomic. It should contain
one and only one piece of data (fact) that cannot be divided into smaller pieces. For example, a conceptual
attribute called phone number divides into several logical attributes for phone type code (home, office, fax,
mobile, etc.), country code (1 for US and Canada), area code, prefix, base phone number, and extension.
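A small sketch of that phone number decomposition, with each atomic logical attribute held as a separate field; the class and field names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch: the conceptual attribute "phone number" decomposed into atomic
# logical attributes, one fact per field. Names are illustrative.
@dataclass
class PhoneNumber:
    phone_type_code: str          # home, office, fax, mobile, etc.
    country_code: str             # e.g., "1" for US and Canada
    area_code: str
    prefix: str
    base_phone_number: str
    extension: Optional[str] = None

home_phone = PhoneNumber("home", "1", "212", "555", "0123")
```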
Domains, which were discussed in Section 1.3.3.4, allow for consistency in format and value sets within and
across projects. Student Tuition Amount and Instructor Salary Amount can both be assigned the Amount
domain, for example, which will be a standard currency domain.
Attributes assigned to entities are either key or non-key attributes. A key attribute helps identify one unique
entity instance from all others, either fully (by itself) or partially (in combination with other key elements).
Non-key attributes describe the entity instance but do not help uniquely identify it. Identify primary and
alternate keys.
Logical data models require modifications and adaptations in order to have the resulting design perform well
within storage applications. For example, changes required to accommodate Microsoft Access would be
different from changes required to accommodate Teradata. Going forward, the term table will be used to refer
to tables, files, and schemas; the term column to refer to columns, fields, and elements; and the term row to refer
to rows, records, or instances.
Logical abstraction entities (supertypes and subtypes) become separate objects in the physical database design
using one of two methods.
• Subtype absorption: The subtype entity attributes are included as nullable columns into a table
representing the supertype entity.
• Supertype partition: The supertype entity’s attributes are included in separate tables created for each
subtype.
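The two methods above can be sketched in DDL as follows; the Party/Person/Organization example and all names are hypothetical, and SQLite is used only to keep the sketch runnable.

```python
import sqlite3

# Sketch of the two physical options for a Party supertype with
# Person and Organization subtypes; all names are illustrative.

subtype_absorption = """
-- Option 1: subtype absorption -- subtype attributes become nullable
-- columns on a single table representing the supertype.
CREATE TABLE Party (
    Party_ID          INTEGER PRIMARY KEY,
    Party_Type_Code   TEXT NOT NULL,       -- 'PERSON' or 'ORG'
    Person_Birth_Date TEXT,                -- NULL unless a person
    Org_Tax_ID        TEXT                 -- NULL unless an organization
);
"""

supertype_partition = """
-- Option 2: supertype partition -- the supertype's attributes are
-- repeated in a separate table for each subtype.
CREATE TABLE Person       (Party_ID INTEGER PRIMARY KEY, Party_Name TEXT, Birth_Date TEXT);
CREATE TABLE Organization (Party_ID INTEGER PRIMARY KEY, Party_Name TEXT, Tax_ID TEXT);
"""

for ddl in (subtype_absorption, supertype_partition):
    sqlite3.connect(":memory:").executescript(ddl)
```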
Add details to the physical model, such as the technical name of each table and column (relational databases), or
file and field (non-relational databases), or schema and element (XML databases).
Define the physical domain, physical data type, and length of each column or field. Add appropriate constraints
(e.g., nullability and default values) for columns or fields, especially for NOT NULL constraints.
Small Reference Data value sets in the logical data model can be implemented in a physical model in three
common ways:
• Create a matching separate code table: Depending on the model, these can be unmanageably
numerous.
• Create a master shared code table: For models with a large number of code tables, this can collapse
them into one table; however, this means that a change to one reference list will change the entire
table. Take care to avoid code value collisions as well.
• Embed rules or valid codes into the appropriate object’s definition: Create a constraint in the
object definition code that embeds the rule or list. For code lists that are only used as reference for one
other object, this can be a good solution.
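As a sketch of the first and third options above (a separate code table, and valid codes embedded as a constraint in the object's definition), with hypothetical table names and code values:

```python
import sqlite3

# Sketch of two of the options above for small reference value sets;
# table, column, and code values are illustrative.
ddl = """
-- Option: a matching separate code table
CREATE TABLE Order_Status_Code (
    Order_Status_Code  TEXT PRIMARY KEY,
    Description        TEXT NOT NULL
);

-- Option: embed the valid codes in the object's definition via a constraint
CREATE TABLE Customer_Order (
    Order_ID           INTEGER PRIMARY KEY,
    Order_Status_Code  TEXT NOT NULL
        CHECK (Order_Status_Code IN ('OPEN', 'SHIPPED', 'CANCELLED'))
);
"""
sqlite3.connect(":memory:").executescript(ddl)
```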
Assign unique key values that are not visible to the business and have no meaning or relationship with the data
with which they are matched. This is an optional step that depends primarily on whether the natural key is large,
composite, or includes attributes whose values could change over time.
If a surrogate key is assigned to be the primary key of a table, make sure there is an alternate key on the original
primary key. For example, if on the LDM the primary key for Student was Student First Name, Student Last
Name, and Student Birth Date (i.e., a composite primary key), on the PDM the primary key for Student may
be the surrogate key Student ID. In this case, there should be an alternate key defined on the original primary
key of Student First Name, Student Last Name, and Student Birth Date.
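The Student example above might be implemented as in the following sketch, with a surrogate primary key and an alternate (unique) key on the original natural key; the DDL details are illustrative.

```python
import sqlite3

# Sketch of the Student example: a surrogate primary key plus an alternate
# (unique) key on the original natural key. Names are illustrative.
ddl = """
CREATE TABLE Student (
    Student_ID          INTEGER PRIMARY KEY,   -- surrogate key, no business meaning
    Student_First_Name  TEXT NOT NULL,
    Student_Last_Name   TEXT NOT NULL,
    Student_Birth_Date  TEXT NOT NULL,
    UNIQUE (Student_First_Name, Student_Last_Name, Student_Birth_Date)  -- alternate key
);
"""
sqlite3.connect(":memory:").executescript(ddl)
```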
In some circumstances, denormalizing or adding redundancy can improve performance so much that it
outweighs the cost of the duplicate storage and synchronization processing. Dimensional structures are the main
means of denormalization.
An index is an alternate path for accessing data in the database to optimize query (data retrieval) performance.
Indexing can improve query performance in many cases. The database administrator or database developer must
select and define appropriate indexes for database tables. Major RDBMS products support many types of
indexes. Indexes can be unique or non-unique, clustered or non-clustered, partitioned or non-partitioned, single
column or multi-column, b-tree or bitmap or hashed. Without an appropriate index, the DBMS will revert to
reading every row in the table (table scan) to retrieve any data. On large tables, this is very costly. Try to build
indexes on large tables to support the most frequently run queries, using the most frequently referenced
columns, particularly keys (primary, alternate, and foreign).
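A brief sketch of index creation on a large table's most frequently referenced columns; the table, columns, and index names are hypothetical, and syntax for clustered, bitmap, or partitioned indexes varies by DBMS.

```python
import sqlite3

# Sketch: indexes supporting frequent queries against a large table;
# table, column, and index names are illustrative.
ddl = """
CREATE TABLE Sales_Transaction (
    Transaction_ID   INTEGER PRIMARY KEY,
    Customer_ID      INTEGER NOT NULL,
    Product_ID       INTEGER NOT NULL,
    Transaction_Date TEXT    NOT NULL
);

-- Non-unique index on a foreign key used in frequent joins
CREATE INDEX IX_Sales_Customer ON Sales_Transaction (Customer_ID);

-- Multi-column index supporting a common date-plus-product query
CREATE INDEX IX_Sales_Date_Product ON Sales_Transaction (Transaction_Date, Product_ID);
"""
sqlite3.connect(":memory:").executescript(ddl)
```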
Give careful consideration to the partitioning strategy of the overall data model (dimensional), especially when
facts contain many optional dimensional keys (sparse). Ideally, partition on a date key; when this is not possible,
a study based on profiled results and workload analysis is required to propose and refine the partitioning model.
Views can be used to control access to certain data elements, or to embed common join conditions or filters to
standardize common objects or queries. Views themselves should be requirements-driven. In many cases, they
will need to be developed via a process that mirrors the development of the LDM and PDM.
Reverse engineering is the process of documenting an existing database. The PDM is completed first to
understand the technical design of an existing system, followed by an LDM to document the business solution
that the existing system meets, followed by the CDM to document the scope and key terminology within the
existing system. Most data modeling tools support reverse engineering from a variety of databases; however,
creating a readable layout of the model elements still requires a modeler. There are several common layouts
(orthogonal, dimensional, and hierarchical) which can be selected to get the process started, but contextual
organization (grouping entities by subject area or function) is still largely a manual process.
Like other areas of IT, models require quality control, and continuous improvement practices should be
employed. Techniques such as time-to-value, support costs, and data model quality validators such as the Data
Model Scorecard® (Hoberman, 2009) can all be used to evaluate the model for correctness, completeness, and
consistency. Once the CDM, LDM, and PDM are complete, they become very useful tools for any roles that
need to understand the model, ranging from business analysts through developers.
Once the data models are built, they need to be kept current. Updates to the data model need to be made when
requirements change and frequently when business processes change. Within a specific project, often when one
model level needs to change, a corresponding higher level of model needs to change. For example, if a new
column is added to a physical data model, that column frequently needs to be added as an attribute to the
corresponding logical data model. A good practice at the end of each development iteration is to reverse
engineer the latest physical data model and make sure it is still consistent with its corresponding logical data
model. Many data modeling tools help automate this process of comparing physical with logical.
3. Tools
There are many types of tools that can assist data modelers in completing their work, including data modeling
tools, lineage tools, data profiling tools, and Metadata repositories.
Data modeling tools are software that automate many of the tasks the data modeler performs. Entry-level data
modeling tools provide basic drawing functionality, including a data modeling palette, so that the user can easily
create entities and relationships. These entry-level tools also support rubber banding, which is the automatic
redrawing of relationship lines when entities are moved. More sophisticated data modeling tools support
forward engineering from conceptual to logical to physical to database structures, allowing the generation of
database data definition language (DDL). Most will also support reverse engineering from database up to
conceptual data model. These more sophisticated tools often support functionality such as naming standards
validation, spellcheckers, a place to store Metadata (e.g., definitions and lineage), and sharing features (such as
publishing to the Web).
A lineage tool is software that allows the capture and maintenance of the source structures for each attribute on
the data model. These tools enable impact analysis; that is, one can use them to see if a change in one system or
part of a system has effects in another system. For example, the attribute Gross Sales Amount might be sourced
from several applications and require a calculation to populate – lineage tools would store this information.
Microsoft Excel® is a frequently-used lineage tool. Although easy to use and relatively inexpensive, Excel does
not enable real impact analysis and leads to manually managing Metadata. Lineage is also frequently captured
in a data modeling tool, Metadata repository, or data integration tool. (See Chapters 11 and 12.)
A data profiling tool can help explore the data content, validate it against existing Metadata, and identify Data
Quality gaps/deficiencies, as well as deficiencies in existing data artifacts, such as logical and physical models,
DDL, and model descriptions. For example, if the business expects that an Employee can have only one job
position at a time, but the system shows Employees have more than one job position in the same timeframe, this
will be logged as a data anomaly. (See Chapters 8 and 13.)
A Metadata repository is a software tool that stores descriptive information about the data model, including the
diagram and accompanying text such as definitions, along with Metadata imported from other tools and
processes (software development and BPM tools, system catalogs, etc.). The repository itself should enable
Metadata integration and exchange. Even more important than storing the Metadata is sharing the Metadata.
Metadata repositories must have an easily accessible way for people to view and navigate the contents of the
repository. Data modeling tools generally include a limited repository. (See Chapter 13.)
Data model patterns are reusable modeling structures that can be applied to a wide class of situations. There are
elementary, assembly, and integration data model patterns. Elementary patterns are the ‘nuts and bolts’ of data
modeling. They include ways to resolve many-to-many relationships, and to construct self-referencing
hierarchies. Assembly patterns represent the building blocks that span the business and data modeler worlds.
Business people can understand them – assets, documents, people and organizations, and the like. Equally
importantly, they are often the subject of published data model patterns that can give the modeler proven,
robust, extensible, and implementable designs. Integration patterns provide the framework for linking the
assembly patterns in common ways (Giles, 2011).
Industry data models are data models pre-built for an entire industry, such as healthcare, telecom, insurance,
banking, or manufacturing. These models are often both broad in scope and very detailed. Some industry data
models contain thousands of entities and attributes. Industry data models can be purchased through vendors or
obtained through industry groups such as ARTS (for retail), SID (for communications), or ACORD (for
insurance).
Any purchased data model will need to be customized to fit an organization, as it will have been developed
from multiple other organizations’ needs. The level of customization required will depend on how close the
model is to an organization’s needs, and how detailed the most important parts are. In some cases, it can be a
reference for an organization’s in-progress efforts to help the modelers make models that are more complete. In
others, it can merely save the data modeler some data entry effort for annotated common elements.
4. Best Practices
The ISO 11179 Metadata Registry, an international standard for representing Metadata in an organization,
contains several sections related to data standards, including naming attributes and writing definitions.
Data modeling and database design standards serve as the guiding principles to effectively meet business data
needs, conform to Enterprise and Data Architecture (see Chapter 4) and ensure the quality of data (see Chapter
14). Data architects, data analysts, and database administrators must jointly develop these standards. They must
complement and not conflict with related IT standards.
Publish data model and database naming standards for each type of modeling object and database object.
Naming standards are particularly important for entities, tables, attributes, keys, views, and indexes. Names
should be unique and as descriptive as possible.
Logical names should be meaningful to business users, using full words as much as possible and avoiding all
but the most familiar abbreviations. Physical names must conform to the maximum length allowed by the
DBMS, so use abbreviations where necessary. While logical names use blank spaces as separators between
words, physical names typically use underscores as word separators.
Naming standards should minimize name changes across environments. Names should not reflect their specific
environment, such as test, QA, or production. Class words, which are the last terms in attribute names such as
Quantity, Name, and Code, can be used to distinguish attributes from entities and column names from table
names. They can also show which attributes and columns are quantitative rather than qualitative, which can be
important when analyzing the contents of those columns.
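As a sketch of applying such naming standards, the following converts logical names (full words, spaces) into physical names (underscores, abbreviations, length limit); the abbreviation list and the 30-character limit are assumptions made for illustration only.

```python
# Sketch of converting logical names (full words, spaces) to physical names
# (underscores, abbreviations). The abbreviation list and length limit are
# assumed examples, not an organizational standard.

ABBREVIATIONS = {"Number": "Nbr", "Amount": "Amt", "Description": "Desc"}
MAX_PHYSICAL_LENGTH = 30  # assumed DBMS name length limit

def to_physical_name(logical_name: str) -> str:
    words = [ABBREVIATIONS.get(w, w) for w in logical_name.split()]
    physical = "_".join(words)
    return physical[:MAX_PHYSICAL_LENGTH]

print(to_physical_name("Student Tuition Amount"))   # Student_Tuition_Amt
print(to_physical_name("Customer Account Number"))  # Customer_Account_Nbr
```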
In designing and building the database, the DBA should keep the following design principles in mind
(remember the acronym PRISM):
• Performance and ease of use: Ensure quick and easy access to data by approved users in a usable and
business-relevant form, maximizing the business value of both applications and data.
• Reusability: The database structure should ensure that, where appropriate, multiple applications can
use the data and that the data can serve multiple purposes (e.g., business analysis, quality
improvement, strategic planning, customer relationship management, and process improvement).
Avoid coupling a database, data structure, or data object to a single application.
• Integrity: The data should always have a valid business meaning and value, regardless of context, and
should always reflect a valid state of the business. Enforce data integrity constraints as close to the data
as possible, and immediately detect and report violations of data integrity constraints.
• Security: True and accurate data should always be immediately available to authorized users, but only
to authorized users. The privacy concerns of all stakeholders, including customers, business partners,
and government regulators, must be met. Enforce data security, like data integrity, as close to the data
as possible, and immediately detect and report security violations.
• Maintainability: Perform all data work at a cost that yields value by ensuring that the cost of creating,
storing, maintaining, using, and disposing of data does not exceed its value to the organization. Ensure
the fastest possible response to changes in business processes and new business requirements.
Data analysts and designers act as intermediaries between information consumers (the people with business
requirements for data) and the data producers who capture the data in usable form. Data professionals must
balance the data requirements of the information consumers and the application requirements of data producers.
Data professionals must also balance the short-term versus long-term business interests. Information consumers
need data in a timely fashion to meet short-term business obligations and to take advantage of current business
opportunities. System-development project teams must meet time and budget constraints. However, they must
also meet the long-term interests of all stakeholders by ensuring that an organization’s data resides in data
structures that are secure, recoverable, sharable, and reusable, and that this data is as correct, timely, relevant,
and usable as possible. Therefore, data models and database designs should be a reasonable balance between the
short-term needs and the long-term needs of the enterprise.
As previously noted (in Section 4.1) data modeling and database design standards provide guiding principles to
meet business data requirements, conform to Enterprise and Data Architecture standards, and ensure the quality
of data. Data modeling and database design standards should include the following:
• A list and description of standard data modeling and database design deliverables
• A list of standard names, acceptable abbreviations, and abbreviation rules for uncommon words that
apply to all data model objects
• A list of standard naming formats for all data model objects, including attribute and column class
words
• A list and description of standard methods for creating and maintaining these deliverables
• A list and description of data modeling and database design roles and responsibilities
• A list and description of all Metadata properties captured in data modeling and database design,
including both business Metadata and technical Metadata. For example, guidelines may set the
expectation that the data model captures lineage for each attribute.
• Metadata quality expectations and requirements (see Chapter 13)
• Guidelines for how to use data modeling tools
• Guidelines for preparing for and leading design reviews
• Guidelines for versioning of data models
• Practices that are discouraged
Project teams should conduct requirements reviews and design reviews of the conceptual data model, logical
data model, and physical database design. The agenda for review meetings should include items for reviewing
the starting model (if any), the changes made to the model and any other options that were considered and
rejected, and how well the new model conforms to any modeling or architecture standards in place.
Conduct design reviews with a group of subject matter experts representing different backgrounds, skills,
expectations, and opinions. It may require executive mandate to get expert resources allocated to these reviews.
Participants must be able to discuss different viewpoints and reach group consensus without personal conflict,
as all participants share the common goal of promoting the most practical, best performing and most usable
design. Chair each design review with one leader who facilitates the meeting. The leader creates and follows an
agenda, ensures all required documentation is available and distributed, solicits input from all participants,
maintains order and keeps the meeting moving, and summarizes the group’s consensus findings. Many design
reviews also use a scribe to capture points of discussion.
In reviews where there is no approval, the modeler must rework the design to resolve the issues. If there are
issues that the modeler cannot resolve on their own, the final say should be given by the owner of the system
reflected by the model.
Data models and other design specifications require careful change control, just like requirements specifications
and other SDLC deliverables. Note each change to a data model to preserve the lineage of changes over time. If
a change affects the logical data model, such as a new or changed business data requirement, the data analyst or
architect must review and approve the change to the model.
Some data modeling tools include repositories that provide data model versioning and integration functionality.
Otherwise, preserve data models in DDL exports or XML files, checking them in and out of a standard source
code management system just like application code.
There are several ways of measuring a data model’s quality, and all require a standard for comparison. One
method that will be used to provide an example of data model validation is The Data Model Scorecard®, which
provides 11 data model quality metrics: one for each of ten categories that make up the Scorecard and an overall
score across all ten categories (Hoberman, 2015). Table 11 contains the Scorecard template.
The model score column contains the reviewer’s assessment of how well a particular model met the scoring
criteria, with a maximum score being the value that appears in the total score column. For example, a reviewer
might give a model a score of 10 on “How well does the model capture the requirements?” The % column
presents the Model Score for the category divided by the Total Score for the category. For example, receiving
10 out of 15 would lead to 66%. The comments column should document information that explains the score in
more detail or captures the action items required to fix the model. The last row contains the overall score
assigned to the model, a sum of each of the columns.
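The arithmetic described above can be sketched as follows; the category names come from the Scorecard, but the total and model scores shown are made-up examples, and the percentage is truncated to match the 10-out-of-15-equals-66% example in the text.

```python
# Sketch of the Scorecard arithmetic: % = model score / total score per
# category (truncated to a whole percent), and the last row sums each column.
# The scores below are made-up examples.
categories = [
    ("How well does the model capture the requirements?", 15, 10),
    ("How complete is the model?",                         15, 12),
    ("How good are the definitions?",                      10,  9),
]  # (category, total score, model score)

overall_total = sum(total for _, total, _ in categories)
overall_model = sum(model for _, _, model in categories)

for name, total, model in categories:
    print(f"{name}: {model}/{total} = {model * 100 // total}%")
print(f"Overall: {overall_model}/{overall_total} = {overall_model * 100 // overall_total}%")
```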
1. How well does the model capture the requirements? Here we ensure that the data model represents
the requirements. If there is a requirement to capture order information, in this category we check the
model to make sure it captures order information. If there is a requirement to view Student Count by
Semester and Major, in this category we make sure the data model supports this query.
2. How complete is the model? Here completeness means two things: completeness of requirements and
completeness of Metadata. Completeness of requirements means that each requirement that has been
requested appears on the model. It also means that the data model only contains what is being asked
for and nothing extra. It’s easy to add structures to the model anticipating that they will be used in the
near future; we note these sections of the model during the review. The project may become too hard
to deliver if the modeler includes something that was never asked for. We need to consider the likely
cost of including a future requirement in the case that it never eventuates. Completeness of Metadata
means that all of the descriptive information surrounding the model is present as well; for example, if
we are reviewing a physical data model, we would expect formatting and nullability to appear on the
data model.
3. How well does the model match its scheme? Here we ensure that the model level of detail
(conceptual, logical, or physical), and the scheme (e.g., relational, dimensional, NoSQL) of the model
being reviewed matches the definition for this type of model.
4. How structurally sound is the model? Here we validate the design practices employed to build the
model to ensure one can eventually build a database from the data model. This includes avoiding
design issues such as having two attributes with the same exact name in the same entity or having a
null attribute in a primary key.
5. How well does the model leverage generic structures? Here we confirm an appropriate use of
abstraction. Going from Customer Location to a more generic Location, for example, allows the
design to more easily handle other types of locations such as warehouses and distribution centers.
6. How well does the model follow naming standards? Here we ensure correct and consistent naming
standards have been applied to the data model. We focus on naming standard structure, term, and style.
Structure means that the proper building blocks are being used for entities, relationships, and attributes.
For example, a building block for an attribute would be the subject of the attribute such as ‘Customer’
or ‘Product’. Term means that the proper name is given to the attribute or entity. Term also includes
proper spelling and abbreviation. Style means that the appearance, such as upper case or camel case, is
consistent with standard practices.
7. How well has the model been arranged for readability? Here we ensure the data model is easy to
read. This question is not the most important of the ten categories. However, if your model is hard to
read, you may not accurately address the more important categories on the scorecard. Placing parent
entities above their child entities, displaying related entities together, and minimizing relationship line
length all improve model readability.
8. How good are the definitions? Here we ensure the definitions are clear, complete, and accurate.
9. How consistent is the model with the enterprise? Here we ensure the structures on the data model
are represented in a broad and consistent context, so that one set of terminology and rules can be
spoken in the organization. The structures that appear in a data model should be consistent in
terminology and usage with structures that appear in related data models, and ideally with the
enterprise data model (EDM), if one exists.
10. How well does the Metadata match the data? Here we confirm that the model and the actual data
that will be stored within the resulting structures are consistent. Does the column
Customer_Last_Name really contain the customer’s last name, for example? The Data category is
designed to reduce these surprises and help ensure the structures on the model match the data these
structures will be holding.
The scorecard provides an overall assessment of the quality of the model and identifies specific areas for
improvement.
Avison, David and Christine Cuthbertson. A Management Approach to Database Applications. McGraw-Hill Publishing
Co., 2002. Print. Information systems ser.
Blaha, Michael. UML Database Modeling Workbook. Technics Publications, LLC, 2013. Print.
Brackett, Michael H. Data Resource Design: Reality Beyond Illusion. Technics Publications, LLC, 2012. Print.
Brackett, Michael H. Data Resource Integration: Understanding and Resolving a Disparate Data Resource. Technics
Publications, LLC, 2012. Print.
Brackett, Michael H. Data Resource Simplexity: How Organizations Choose Data Resource Success or Failure. Technics
Publications, LLC, 2011. Print.
Bruce, Thomas A. Designing Quality Databases with IDEF1X Information Models. Dorset House, 1991. Print.
Burns, Larry. Building the Agile Database: How to Build a Successful Application Using Agile Without Sacrificing Data
Management. Technics Publications, LLC, 2011. Print.
Carlis, John and Joseph Maguire. Mastering Data Modeling - A User-Driven Approach. Addison-Wesley Professional,
2000. Print.
Codd, Edward F. “A Relational Model of Data for Large Shared Data Banks”. Communications of the ACM, 13, No. 6 (June
1970).
DAMA International. The DAMA Dictionary of Data Management. 2nd Edition: Over 2,000 Terms Defined for IT and
Business Professionals. 2nd ed. Technics Publications, LLC, 2011. Print.
Daoust, Norman. UML Requirements Modeling for Business Analysts: Steps to Modeling Success. Technics Publications,
LLC, 2012. Print.
Date, C. J. and Hugh Darwen. Databases, Types and the Relational Model. 3d ed. Addison Wesley, 2006. Print.
Date, Chris J. The Relational Database Dictionary: A Comprehensive Glossary of Relational Terms and Concepts, with
Illustrative Examples. O'Reilly Media, 2006. Print.
Dorsey, Paul. Enterprise Data Modeling Using UML. McGraw-Hill Osborne Media, 2009. Print.
Edvinsson, Håkan and Lottie Aderinne. Enterprise Architecture Made Simple: Using the Ready, Set, Go Approach to
Achieving Information Centricity. Technics Publications, LLC, 2013. Print.
Fleming, Candace C. and Barbara Von Halle. The Handbook of Relational Database Design. Addison Wesley, 1989. Print.
Giles, John. The Nimble Elephant: Agile Delivery of Data Models using a Pattern-based Approach. Technics Publications,
LLC, 2012. Print.
Golden, Charles. Data Modeling 152 Success Secrets - 152 Most Asked Questions On Data Modeling - What You Need to
Know. Emereo Publishing, 2015. Print. Success Secrets.
Halpin, Terry, Ken Evans, Pat Hallock, and Bill McLean. Database Modeling with Microsoft Visio for Enterprise
Architects. Morgan Kaufmann, 2003. Print. The Morgan Kaufmann Series in Data Management Systems.
Halpin, Terry. Information Modeling and Relational Databases. Morgan Kaufmann, 2001. Print. The Morgan Kaufmann
Series in Data Management Systems.
Halpin, Terry. Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design. Morgan
Kaufmann, 2001. Print. The Morgan Kaufmann Series in Data Management Systems.
Harrington, Jan L. Relational Database Design Clearly Explained. 2nd ed. Morgan Kaufmann, 2002. Print. The Morgan
Kaufmann Series in Data Management Systems.
Hay, David C. Data Model Patterns: A Metadata Map. Morgan Kaufmann, 2006. Print. The Morgan Kaufmann Series in
Data Management Systems.
Hay, David C. Enterprise Model Patterns: Describing the World (UML Version). Technics Publications, LLC, 2011. Print.
Hay, David C. Requirements Analysis from Business Views to Architecture. Prentice Hall, 2002. Print.
Hay, David C. UML and Data Modeling: A Reconciliation. Technics Publications, LLC, 2011. Print.
Hernandez, Michael J. Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design. 2nd ed.
Addison-Wesley Professional, 2003. Print.
Hoberman, Steve, Donna Burbank, Chris Bradley, et al. Data Modeling for the Business: A Handbook for Aligning the
Business with IT using High-Level Data Models. Technics Publications, LLC, 2009. Print. Take It with You Guides.
Hoberman, Steve. Data Model Scorecard. Technics Publications, LLC, 2015. Print.
Hoberman, Steve. Data Modeling Made Simple with ER/Studio Data Architect. Technics Publications, LLC, 2013. Print.
Hoberman, Steve. Data Modeling Made Simple: A Practical Guide for Business and IT Professionals. 2nd ed. Technics
Publications, LLC, 2009. Print.
Hoberman, Steve. Data Modeling Master Class Training Manual. 7th ed. Technics Publications, LLC, 2017. Print.
Hoberman, Steve. The Data Modeler's Workbench. Tools and Techniques for Analysis and Design. Wiley, 2001. Print.
Hoffer, Jeffrey A., Joey F. George, and Joseph S. Valacich. Modern Systems Analysis and Design. 7th ed. Prentice Hall,
2013. Print.
IIBA and Kevin Brennan, ed. A Guide to the Business Analysis Body of Knowledge (BABOK Guide). International Institute
of Business Analysis, 2009. Print.
Kent, William. Data and Reality: A Timeless Perspective on Perceiving and Managing Information in Our Imprecise World.
3d ed. Technics Publications, LLC, 2012. Print.
Krogstie, John, Terry Halpin, and Keng Siau, eds. Information Modeling Methods and Methodologies: Advanced Topics in
Database Research. Idea Group Publishing, 2005. Print. Advanced Topics in Database Research.
Linstedt, Dan. Super Charge Your Data Warehouse: Invaluable Data Modeling Rules to Implement Your Data Vault.
Amazon Digital Services. 2012. Data Warehouse Architecture Book 1.
Muller, Robert. J. Database Design for Smarties: Using UML for Data Modeling. Morgan Kaufmann, 1999. Print. The
Morgan Kaufmann Series in Data Management Systems.
Needham, Doug. Data Structure Graphs: The structure of your data has meaning. Doug Needham Amazon Digital
Services, 2015. Kindle.
Newton, Judith J. and Daniel Wahl, eds. Manual for Data Administration. NIST Special Publications, 1993. Print.
Pascal, Fabian. Practical Issues in Database Management: A Reference for The Thinking Practitioner. Addison-Wesley
Professional, 2000. Print.
Reingruber, Michael. C. and William W. Gregory. The Data Modeling Handbook: A Best-Practice Approach to Building
Quality Data Models. Wiley, 1994. Print.
Riordan, Rebecca M. Designing Effective Database Systems. Addison-Wesley Professional, 2005. Print.
Rob, Peter and Carlos Coronel. Database Systems: Design, Implementation, and Management. 7th ed. Cengage Learning,
2006. Print.
Schmidt, Bob. Data Modeling for Information Professionals. Prentice Hall, 1998. Print.
Silverston, Len and Paul Agnew. The Data Model Resource Book, Volume 3: Universal Patterns for Data Modeling. Wiley,
2008. Print.
Silverston, Len. The Data Model Resource Book, Volume 1: A Library of Universal Data Models for All Enterprises. Rev.
ed. Wiley, 2001. Print.
Silverston, Len. The Data Model Resource Book, Volume 2: A Library of Data Models for Specific Industries. Rev. ed.
Wiley, 2001. Print.
Simsion, Graeme C. and Graham C. Witt. Data Modeling Essentials. 3rd ed. Morgan Kaufmann, 2004. Print.
Simsion, Graeme. Data Modeling: Theory and Practice. Technics Publications, LLC, 2007. Print.
Teorey, Toby, et al. Database Modeling and Design: Logical Design, 4th ed. Morgan Kaufmann, 2010. Print. The Morgan
Kaufmann Series in Data Management Systems.
Thalheim, Bernhard. Entity-Relationship Modeling: Foundations of Database Technology. Springer, 2000. Print.
Watson, Richard T. Data Management: Databases and Organizations. 5th ed. Wiley, 2005. Print.
CHAPTER 6
Data Storage and Operations

1. Introduction

Data Storage and Operations includes the design, implementation, and support of stored data, to
maximize its value throughout its lifecycle, from creation/acquisition to disposal (see Chapter 1).
Data Storage and Operations includes two sub-activities:
• Database support focuses on activities related to the data lifecycle, from initial implementation of a
database environment, through obtaining, backing up, and purging data. It also includes ensuring the
database performs well. Monitoring and tuning are critical to database support.
• Database technology support includes defining technical requirements that will meet organizational
needs, defining technical architecture, installing and administering technology, and resolving issues
related to technology.
Database administrators (DBAs) play key roles in both aspects of data storage and operations. The role of DBA
is the most established and most widely adopted data professional role, and database administration practices
are perhaps the most mature of all data management practices. DBAs also play dominant roles in data
operations and data security. (See Chapter 7.)
Goals:
1. Manage availability of data throughout the data lifecycle.
2. Ensure the integrity of data assets.
3. Manage performance of data transactions.
Business Drivers
Companies rely on their information systems to run their operations. Data Storage and Operations activities are
crucial to organizations that rely on data. Business continuity is the primary driver of these activities. If a
system becomes unavailable, company operations may be impaired or stopped completely. A reliable data
storage infrastructure for IT operations minimizes the risk of disruption.
Data Storage and Operations represent a highly technical side of data management. DBAs and others involved
in this work can do their jobs better and help the overall work of data management when they follow these
guiding principles:
• Build with reuse in mind: Develop and promote the use of abstracted and reusable data objects that
prevent applications from being tightly coupled to database schemas (the so-called ‘object-relational
impedance mismatch’). A number of mechanisms exist to this end, including database views, triggers,
functions and stored procedures, application data objects and data-access layers, XML and XSLT,
ADO.NET typed data sets, and web services. The DBA should be able to assess the best approach to
virtualizing data. The end goal is to make using the database as quick, easy, and painless as possible.
• Understand and appropriately apply best practices: DBAs should promote database standards and
best practices as requirements, but be flexible enough to deviate from them if given acceptable reasons
for these deviations. Database standards should never be a threat to the success of a project.
• Connect database standards to support requirements: For example, the Service Level Agreement
(SLA) can reflect DBA-recommended and developer-accepted methods of ensuring data integrity and
data security. The SLA should reflect the transfer of responsibility from the DBAs to the development
team if the development team will be coding their own database update procedures or data access
layer. This prevents an ‘all or nothing’ approach to standards.
• Set expectations for the DBA role in project work: Ensuring project methodology includes
onboarding the DBA in the project definition phase can help throughout the SDLC. The DBA can
understand project needs and support requirements up-front. This will improve communication by
clarifying the project team’s expectations of the data group. Having a dedicated primary and
secondary DBA during analysis and design clarifies expectations about DBA tasks, standards, work
effort, and timelines for development work. Teams should also clarify expectations for support after
implementation.
Database terminology is specific and technical. In working as a DBA or with DBAs, it is important to
understand the specifics of this technical language:
• Database: Any collection of stored data, regardless of structure or content. Some large databases refer
to instances and schema.
• Instance: An execution of database software controlling access to a certain area of storage. An
organization will usually have multiple instances executing concurrently, using different areas of
storage. Each instance is independent of all other instances.
• Schema: A subset of the database objects contained within a database or an instance. Schemas are
used to organize objects into more manageable parts. Usually, a schema has an owner and an access
list particular to the schema’s contents. Common uses of schemas are to isolate objects containing
sensitive data from the general user base, or to isolate read-only views from the underlying tables in
relational databases. Schema can also be used to refer to a collection of database structures with
something in common.
• Node: An individual computer hosting either processing or data as part of a distributed database.
• Database abstraction means that a common application interface (API) is used to call database
functions, such that an application can connect to multiple different databases without the programmer
having to know all function calls for all possible databases. ODBC (Open Database Connectivity) is an
example of an API that enables database abstraction. Advantages include portability; disadvantages
include inability to use specific database functions that are not common across databases.
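As a sketch of database abstraction, the following uses Python's generic DB-API (comparable in spirit to ODBC) so that the calling code does not depend on one vendor's native interface; SQLite is used here only because its driver ships with Python, and any DB-API-compliant driver could be substituted.

```python
# Sketch of database abstraction: application code written against a common
# call-level interface (Python's DB-API) rather than one vendor's native API.
import sqlite3  # any DB-API-compliant driver could be swapped in here

def fetch_one(connection, sql, params=()):
    """Run a query through the generic interface; no vendor-specific calls."""
    cur = connection.cursor()
    cur.execute(sql, params)
    return cur.fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO t (name) VALUES (?)", ("example",))
print(fetch_one(conn, "SELECT name FROM t WHERE id = ?", (1,)))
```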
DBAs maintain and assure the accuracy and consistency of data over its entire lifecycle through the design,
implementation, and usage of any system that stores, processes, or retrieves data. The DBA is the custodian of
all database changes. While many parties may request changes, the DBA defines the precise changes to make to
the database, implements the changes, and controls the changes.
Data lifecycle management includes implementing policies and procedures for acquisition, migration, retention,
expiration, and disposition of data. It is prudent to prepare checklists to ensure all tasks are performed at a high
level of quality. DBAs should use a controlled, documented, and auditable process for moving application
database changes to the Quality Assurance or Certification (QA) and Production environments. A manager-
approved service request or change request usually initiates the process. The DBA should have a back out plan
to reverse changes in case of problems.
1.3.3 Administrators
The role of Database Administrator (DBA) is the most established and the most widely adopted data
professional role. DBAs play the dominant roles in Data Storage and Operations, and critical roles in Data
Security (see Chapter 7), the physical side of data modeling, and database design (see Chapter 5). DBAs
provide support for development, test, QA, and special use database environments.
DBAs do not exclusively perform all the activities of Data Storage and Operations. Data stewards, data
architects, network administrators, data analysts, and security analysts participate in planning for performance,
retention, and recovery. These teams may also participate in obtaining and processing data from external
sources.
Many DBAs specialize as Production, Application, Procedural and Development DBAs. Some organizations
also have Network Storage Administrators (NSA) who specialize in supporting the data storage system
separately from the data storage applications or structures.
In some organizations, each specialized role reports to a different organization within IT. Production DBAs may
be part of production infrastructure or application operations support groups. Application, Development, and
Procedural DBAs are sometimes integrated into application development organizations. NSAs usually are
connected to Infrastructure organizations.
Production DBAs take primary responsibility for data operations management, including:
• Ensuring the performance and reliability of the database, through performance tuning, monitoring,
error reporting, and other activities
• Implementing backup and recovery mechanisms to ensure data can be recovered if lost in any
circumstance
• Implementing mechanisms for clustering and failover of the database, if continual data availability is
a requirement
• Executing other database maintenance activities, such as implementing mechanisms for archiving data
As part of managing data operations, Production DBAs create the following deliverables:
• Mechanisms and processes for controlled implementation of changes to databases in the production
environment
• Mechanisms for ensuring the availability, integrity, and recoverability of data in response to all
circumstances that could result in loss or corruption of data
• Mechanisms for detecting and reporting any error that occurs in the database, the DBMS, or the data
server
• Database availability, recovery, and performance in accordance with service level agreements
• Mechanisms and processes for monitoring database performance as workloads and data volumes vary
An application DBA is responsible for one or more databases in all environments (development / test, QA, and
production), as opposed to database systems administration for any of these environments. Sometimes,
application DBAs report to the organizational units responsible for development and maintenance of the
applications supported by their databases. There are pros and cons to staffing application DBAs.
Application DBAs are viewed as integral members of an application support team. By focusing on a specific
database, they can provide better service to application developers. However, application DBAs can easily
become isolated and lose sight of the organization’s overall data needs and common DBA practices.
Application DBAs collaborate closely with data analysts, modelers, and architects.
Procedural DBAs lead the review and administration of procedural database objects. A procedural DBA
specializes in development and support of procedural logic controlled and executed by the DBMS: stored
procedures, triggers, and user-defined functions (UDFs). The procedural DBA ensures this procedural logic is
planned, implemented, tested, and shared (reused).
Development DBAs focus on data design activities including creating and managing special use databases, such
as ‘sandbox’ or exploration areas.
In many cases, the Procedural and Development DBA functions are combined under one position.
1.3.3.4 NSA
Network Storage Administrators are concerned with the hardware and software supporting data storage arrays.
Such storage array systems have needs and monitoring requirements that differ from those of simple database
systems.
A database can be classified as either centralized or distributed. A centralized system manages a single
database, while a distributed system manages multiple databases on multiple systems. A distributed system’s
components can be classified depending on the autonomy of the component systems into two types: federated
(autonomous) or non-federated (non-autonomous). Figure 55 illustrates the difference between centralized and
distributed.
Centralized databases have all the data in one system in one place. All users come to the one system to access
the data. For certain restricted data, centralization can be ideal, but for data that needs to be widely available,
centralized databases have risks. For example, if the centralized system is unavailable, there are no other
alternatives for accessing the data.
Distributed databases make possible quick access to data over a large number of nodes. Popular distributed
database technologies are based on using commodity hardware servers. They are designed to scale out from
single servers to thousands of machines, each offering local computation and storage. Rather than rely on
hardware to deliver high-availability, the database management software itself is designed to replicate data
amongst the servers, thereby delivering a highly available service on top of a cluster of computers. Database
management software is also designed to detect and handle failures. While any given computer may fail, the
system overall is unlikely to.
Some distributed databases implement a computational paradigm named MapReduce to further improve
performance. In MapReduce, the data request is divided into many small fragments of work, each of which may
be executed or re-executed on any node in the cluster. In addition, data is co-located on the compute nodes,
providing very high aggregate bandwidth across the cluster. Both the filesystem and the application are
designed to automatically handle node failures.
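A minimal, single-process sketch of the MapReduce pattern described above (word counting); a real implementation would distribute the map fragments across cluster nodes, co-locate data with computation, and handle node failures.

```python
# Single-process sketch of MapReduce: split work into small fragments (map),
# then combine the partial results (reduce). Example data is illustrative.
from collections import Counter
from functools import reduce

documents = ["to be or not to be", "data is the new data"]

def map_phase(doc):
    # one small fragment of work: count words in a single document
    return Counter(doc.split())

def reduce_phase(counts_a, counts_b):
    # merge partial results from any two fragments
    return counts_a + counts_b

partials = [map_phase(d) for d in documents]        # could run on any node
totals = reduce(reduce_phase, partials, Counter())  # combine partial results
print(totals.most_common(3))
```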
Federation provisions data without additional persistence or duplication of source data. A federated database
system maps multiple autonomous database systems into a single federated database. The constituent databases,
sometimes geographically separated, are interconnected via a computer network. They remain autonomous yet
participate in a federation to allow partial and controlled sharing of their data. Federation provides an alternative
to merging disparate databases. There is no actual data integration in the constituent databases because of data
federation; instead, data interoperability manages the view of the federated databases as one large object (see
Chapter 8). In contrast, a non-federated database system is an integration of component DBMS’s that are not
autonomous; they are controlled, managed and governed by a centralized DBMS.
Federated databases are best for heterogeneous and distributed integration projects such as enterprise
information integration, data virtualization, schema matching, and Master Data Management.
Federated architectures differ based on levels of integration with the component database systems and the extent
of services offered by the federation. An FDBMS (federated database management system) can be categorized as
either loosely or tightly coupled.
Loosely coupled systems require component databases to construct their own federated schema. A user will
typically access other component database systems by using a multi-database language, but this removes any
levels of location transparency, forcing the user to have direct knowledge of the federated schema. A user
imports the data required from other component databases, and integrates it with their own to form a federated
schema.
Tightly coupled systems consist of component systems that use independent processes to construct and publish
an integrated federated schema, as illustrated in Figure 57. The same schema can apply to all parts of the
federation, with no data replication.
Figure 57 Coupling
Blockchain databases are a type of federated database used to securely manage financial transactions. They can
also be used for contract management or exchange of health information. There are two types of structures:
individual records and blocks. Each transaction has a record. The database creates chains of time-bound groups
of transactions (blocks) that also contain information from the previous block in the chain. Hash algorithms are
used to create a summary of the transactions stored in a block while that block is at the end of the chain. Once a
new block is created, the hash of the older block should never change, which means that no transactions contained
within that block may change. Any change to transactions or blocks (tampering) will be apparent when the hash
values no longer match.
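A minimal sketch of the hash-chaining behavior described above, using SHA-256; the block structure and transactions are illustrative only and omit consensus, signatures, and distribution across participants.

```python
import hashlib
import json

# Sketch of hash chaining: each block stores a hash of its transactions plus
# the previous block's hash, so tampering with an earlier block breaks every
# later hash. Structure and transactions are illustrative.

def block_hash(transactions, previous_hash):
    payload = json.dumps({"tx": transactions, "prev": previous_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

chain = []
prev = "0" * 64  # placeholder hash for the first (genesis) block
for tx_group in (["A pays B 10"], ["B pays C 4", "C pays A 1"]):
    h = block_hash(tx_group, prev)
    chain.append({"tx": tx_group, "prev": prev, "hash": h})
    prev = h

# Verification: recompute each hash; any altered transaction no longer matches.
assert all(b["hash"] == block_hash(b["tx"], b["prev"]) for b in chain)
```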
Virtualization (also called ‘cloud computing’) provides computation, software, data access, and storage services
that do not require end-user knowledge of the physical location and configuration of the system that delivers the
service(s). Parallels are often drawn between the concept of cloud computing and the electricity grid: end users
consume power without needing to understand the component devices or infrastructure required to provide the
service. However, virtualization can be on-premises or off-premises.
Cloud computing is a natural evolution of the widespread adoption of virtualization, service oriented
architectures, and utility computing. Here are some methods for implementing databases on the cloud:
• Virtual machine image: Cloud platforms allow users to purchase virtual machine instances for a
limited time. It is possible to run a database on these virtual machines. Users can either upload their
own machine image with a database installed on it, or use ready-made machine images that already
include an optimized installation of a database.
• Database-as-a-service (DaaS): Some cloud platforms offer options for using a database-as-a-service,
without physically launching a virtual machine instance for the database. In this configuration,
application owners do not have to install and maintain the database on their own. Instead, the database
service provider is responsible for installing and maintaining the database, and application owners pay
according to their usage.
• Managed database hosting on the cloud: Here the database is not offered as a service; instead, the
cloud provider hosts the database and manages it on the application owner’s behalf.
DBAs, in coordination with network and system administrators, need to establish a systematic integrated project
approach to include standardization, consolidation, virtualization, and automation of data backup and recovery
functions, as well as security of these functions.
• Server virtualization: Virtualization technologies allow equipment, such as servers from multiple data
centers, to be replaced or consolidated. Virtualization lowers capital and operational expenses and
reduces energy consumption. Virtualization technologies are also used to create virtual desktops,
which can then be hosted in data centers and rented out on a subscription basis. Gartner views
virtualization as a catalyst for modernization (Bittman, 2009). Virtualization gives data storage operations much more flexibility in provisioning storage in local or cloud environments.
• Security: The security of data on virtual systems needs to be integrated with existing security of
physical infrastructures (see Chapter 7).
There are two basic types of database processing, ACID and BASE, which sit at opposite ends of a spectrum; the coincidental match between their names and the ends of the pH scale is a helpful mnemonic. The CAP Theorem is used to define how closely a distributed system may match either ACID or BASE.
1.3.5.1 ACID
The acronym ACID was coined in the early 1980s as the indispensable constraint for achieving reliability
within database transactions. For decades, it has provided transaction processing with a reliable foundation on
which to build. 34
• Atomicity: All operations are performed, or none of them is, so that if one part of the transaction fails, then the entire transaction fails.
• Consistency: The transaction must meet all rules defined by the system at all times and must void half-completed transactions.
• Isolation: Each transaction is independent unto itself; concurrent transactions must not see one another’s intermediate, uncommitted states.
• Durability: Once a transaction is complete, its changes persist and cannot be undone by a subsequent failure.
Relational ACID technologies are the dominant tools in relational database storage; most use SQL as the
interface.
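As a minimal sketch of atomicity and consistency (the account table and its balances are hypothetical), the sqlite3 example below wraps two updates in a single transaction; when one statement violates a rule, the whole transaction is rolled back rather than leaving half-completed work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

try:
    # Atomic transfer: both updates succeed, or neither is applied.
    with conn:  # the connection context manager commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance - 200 WHERE id = 1")  # violates the CHECK rule
        conn.execute("UPDATE account SET balance = balance + 200 WHERE id = 2")
except sqlite3.IntegrityError:
    print("Transaction rolled back; no partial update was written.")

print(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())  # [(1, 100), (2, 50)]
```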
1.3.5.2 BASE
The unprecedented increase in data volumes and variability, the need to document and store unstructured data,
the need for read-optimized data workloads, and subsequent need for greater flexibility in scaling, design,
processing, cost, and disaster recovery gave rise to the diametric opposite of ACID, appropriately termed
BASE:
• Basically Available: The system guarantees some level of availability to the data even when there are
node failures. The data may be stale, but the system will still give and accept responses.
• Soft State: The data is in a constant state of flux; while a response may be given, the data is not
guaranteed to be current.
• Eventual Consistency: The data will eventually be consistent through all nodes and in all databases,
but not every transaction will be consistent at every moment.
34 Jim Gray established the concept. Haerder and Reuter (1983) coined the term ACID.
BASE-type systems are common in Big Data environments. Large online organizations and social media
companies commonly use BASE implementations, as immediate accuracy of all data elements at all times is not
necessary. Table 12 summarizes the differences between ACID and BASE.
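The toy Python sketch below (purely illustrative; not a real NoSQL engine) mimics the BASE behaviors described above: a write is accepted immediately by one node, other replicas may serve stale reads for a while, and all replicas agree once the change propagates.

```python
class ToyCluster:
    """Three replicas; writes land on one node and are propagated later."""
    def __init__(self):
        self.nodes = [dict(), dict(), dict()]
        self.pending = []            # queued (key, value) updates awaiting propagation

    def write(self, key, value):
        self.nodes[0][key] = value   # basically available: accept the write right away
        self.pending.append((key, value))

    def read(self, node_index, key):
        return self.nodes[node_index].get(key)   # may return stale data (soft state)

    def propagate(self):
        for key, value in self.pending:          # eventual consistency
            for node in self.nodes[1:]:
                node[key] = value
        self.pending.clear()

cluster = ToyCluster()
cluster.write("user:42", "active")
print(cluster.read(2, "user:42"))   # None -- replica 2 has not seen the write yet
cluster.propagate()
print(cluster.read(2, "user:42"))   # 'active' -- all replicas eventually agree
```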
1.3.5.3 CAP
The CAP Theorem (or Brewer’s Theorem) was developed in response to a shift toward more distributed
systems (Brewer, 2000). The theorem asserts that a distributed system cannot comply with all parts of ACID at
all times. The larger the system, the lower the compliance. A distributed system must instead trade off among these properties.
• Consistency: The system must operate as designed and expected at all times.
• Availability: The system must be available when requested and must respond to each request.
• Partition Tolerance: The system must be able to continue operations during occasions of data loss or
partial system failure.
The CAP Theorem states that at most two of the three properties can exist in any shared-data system. This is
usually stated with a ‘pick two’ statement, illustrated in Figure 58.
Figure 58 CAP Theorem ('Pick Two')
An interesting use of this theorem drives the Lambda Architecture design discussed in Chapter 14. Lambda
Architecture uses two paths for data: a Speed path where availability and partition tolerance are most important,
and a Batch path where consistency and availability are most important.
Data can be stored on a variety of media, including disks, volatile memory, and flash drives. Some systems can
combine multiple storage types. The most commonly used are Disk and Storage Area Networks (SAN), In-Memory, Columnar Compression Solutions, Virtual Storage Area Networks (VSAN), Cloud-based storage solutions, Radio Frequency Identification (RFID), Digital wallets, Data centers, and Private, Public, and Hybrid Cloud Storage. (See Chapter 14.)
Disk storage is a very stable method of storing data persistently. Multiple types of disk can exist in the same
system. Data can be stored according to usage patterns, with less-used data stored on slower-access disks, which
are usually cheaper than high performance disk systems.
Disk arrays can be collected into Storage Area Networks (SAN). Data movement on a SAN may not require a
network, as data can be moved on the backplane.
1.3.6.2 In-Memory
In-Memory databases (IMDB) are loaded from permanent storage into volatile memory when the system is
turned on, and all processing occurs within the memory array, providing faster response time than disk-based
systems. Most in-memory databases also have features to set and configure durability in case of unexpected
shutdown.
If the data can reasonably be expected to fit mostly or entirely into memory, then in-memory database systems can deliver significant optimization. IMDBs provide more predictable access times to data than do disk storage mechanisms, but they require a much larger investment. IMDBs provide functionality for real-time processing of analytics and are generally reserved for this use because of the investment required.
Columnar-based databases are designed to handle data sets in which data values are repeated to a great extent.
For example, in a table with 256 columns, a lookup for a value that exists in a row will retrieve all the data in
the row (and be somewhat disk-bound). Columnar storage reduces this I/O bandwidth by storing column data using compression – where, for example, a state value is stored as a pointer into a table of states, compressing the master table significantly.
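A minimal Python sketch of the dictionary-style compression described above (the data values are invented): the repetitive column is stored once as a small lookup table plus integer pointers, rather than repeating each string for every row.

```python
# Dictionary-encode a highly repetitive column, as a column store might.
states = ["CA", "NY", "CA", "TX", "CA", "NY", "CA", "TX"] * 1000  # row values

dictionary = sorted(set(states))                 # one entry per distinct value
index = {value: i for i, value in enumerate(dictionary)}
encoded = [index[value] for value in states]     # small integer pointers, one per row

# Reading the column back is just a lookup through the dictionary.
decoded_first_rows = [dictionary[code] for code in encoded[:4]]
print(dictionary)            # ['CA', 'NY', 'TX']
print(decoded_first_rows)    # ['CA', 'NY', 'CA', 'TX']
```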
Recent advances in memory storage have made flash memory or solid state drives (SSDs) an attractive
alternative to disks. Flash memory combines the access speed of memory-based storage with the persistence of
disk-based storage.
Databases are used in a variety of environments during the systems development lifecycle. When testing
changes, DBAs should be involved in designing the data structures in the Development environment. The DBA
team should implement any changes to the QA environment, and must be the only team implementing changes
to the Production environment. Production changes must adhere strictly to standard processes and procedures.
While most data technology is software running on general purpose hardware, occasionally specialized
hardware is used to support unique data management requirements. Types of specialized hardware include data
appliances – servers built specifically for data transformation and distribution. These servers integrate with
existing infrastructure either directly as a plug-in, or peripherally as a network connection.
The production environment is the technical environment where all business processes occur. Production is
mission-critical – if this environment ceases to operate, business processes will stop, resulting in bottom-line
losses, as well as a negative impact on customers who are unable to access services. In an emergency, or for
public service systems, unexpected loss of function can be disastrous.
The production environment is the ‘real’ environment from a business perspective. However, in order to have a
reliable production environment, other non-production environments must exist and be used appropriately. For
example, production environments should not be used for development and testing as these activities put
production processes and data at risk.
Pre-production environments are used to develop and test changes before such changes are introduced to the
production environment. In pre-production environments, issues with changes can be detected and addressed
without affecting normal business processes. In order to detect potential issues, the configuration of pre-
production environments must closely resemble the production environment.
Due to space and cost, it is usually not possible to exactly replicate production in the pre-production
environments. The closer on the development path the non-production environment is to the production
environment, the more closely the non-production environment needs to match the production environment.
Any deviation from the production system equipment and configuration can itself create issues or errors that are
unrelated to the change, complicating issue research and resolution.
Common types of pre-production environments include development, test, support, and special use
environments.
1.3.7.2.1 Development
The development environment is usually a slimmer version of the production environment. It generally has less
disk space, fewer CPUs, less RAM, etc. Developers use this environment to create and test code for changes in
separate environments, which then are combined in the QA environment for full integration testing.
Development can have many copies of production data models, depending on how development projects are
managed. Larger organizations may give individual developers their own environments to manage with all
appropriate rights.
The development environment should be the first place any patches or updates are applied for testing. This
environment should be isolated from and on different physical hardware than the production environments. Due
to the isolation, data from production systems may need to be copied to the development environments.
However, in many industries, production data is protected through regulation. Do not move data from
production environments without first determining what restrictions there are on doing so. (See Chapter 7.)
1.3.7.2.2 Test
The test environment is used to execute quality assurance and user acceptance testing and, in some cases, stress
or performance tests. In order to prevent test results from being distorted due to environmental differences, the
test environment ideally also has the same software and hardware as the production environment. This is
especially important for performance testing. Test may or may not be connected via network to production
systems in order to read production data. Test environments should never write to production systems.
A sandbox is an alternate environment that allows read-only connections to production data and can be
managed by the users. Sandboxes are used to experiment with development options and test hypotheses about
data or merge production data with user-developed data or supplemental data obtained from external sources.
Sandboxes are valuable, for example, when performing a Proof-of-Concept.
A sandbox environment can either be a sub-set of the production system, walled off from production
processing, or a completely separate environment. Sandbox users often have CRUD rights over their own space
so that they can quickly validate ideas and options for changes to the system. The DBAs usually have little to do
with these environments other than setting them up, granting access, and monitoring usage. If the Sandbox areas
are situated in production database systems, they must be isolated in order to avoid adversely affecting
production operations. These environments should never write back to the production systems.
Sandbox environments could be handled by virtual machines (VMs), unless licensing costs for separate instances become prohibitive.
Data storage systems provide a way to encapsulate the instructions necessary to put data on disks and manage
processing, so developers can simply use instructions to manipulate data. Databases are organized in three
general ways: Hierarchical, Relational, and Non-Relational. These classes are not mutually exclusive (see
Figure 59). Some database systems can read and write data organized in relational and non-relational structures.
Hierarchical databases can be mapped to relational tables. Flat files with line delimiters can be read as tables
with rows, and one or more columns can be defined to describe the row contents.
Figure 59 illustrates this spectrum: Hierarchical (Tree Schema), Relational (Schema on Write), and Non-Relational (Schema on Read).
1.3.8.1 Hierarchical
Hierarchical database organization is the oldest database model, used in early mainframe DBMS, and is the
most rigid of structures. In hierarchical databases, data is organized into a tree-like structure with mandatory
parent/child relationships: each parent can have many children, but each child has only one parent (also known
as a 1-to-many relationship). Directory trees are an example of a hierarchy. XML also uses a hierarchical
model. It can be represented as a relational database, although the actual structure is that of a tree traversal path.
1.3.8.2 Relational
People sometimes think that relational databases are named for the relation between tables. This is not the case.
Relational databases are based on set theory and relational algebra, where data elements or attributes (columns)
are related into tuples (rows). (See Chapter 5.) Tables are sets of relations with identical structure. Set
operations (like union, intersect, and minus) are used to organize and retrieve data from relational databases, in
the form of Structured Query Language (SQL). In order to write data, the structure (schema) has to be known in
advance (schema on write). Relational databases are row-oriented.
The database management system (DBMS) of a relational database is called an RDBMS. A relational database is the predominant choice for storing data that constantly changes. Variations on relational databases include Multidimensional and Temporal.
1.3.8.2.1 Multidimensional
Multidimensional database technologies store data in a structure that allows searching using several data
element filters simultaneously. This type of structure is used most frequently in Data Warehousing and Business
Intelligence. Some of these database types are proprietary, although most large databases have cube technology
built in as objects. Access to the data uses a variant of SQL called MDX (Multidimensional Expressions).
1.3.8.2.2 Temporal
A temporal database is a relational database with built-in support for handling data involving time. The
temporal aspects usually include valid time and transaction time. These attributes can be combined to form bi-
temporal data.
• Valid time is the timeframe when a fact is true with respect to the entity it represents in the real world.
• Transaction time is the period during which a fact stored in the database is considered true.
It is possible to have timelines other than Valid Time and Transaction Time, such as Decision Time, in the
database. In that case, the database is called a multi-temporal database as opposed to a bi-temporal database.
Temporal databases enable application developers and DBAs to manage current, proposed, and historical
versions of data in the same database.
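Native temporal support varies by DBMS, so the sqlite3 sketch below (hypothetical table and column names) simply models valid time and transaction time as explicit column pairs and runs a bi-temporal 'as of' query against both timelines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_address (
        customer_id  INTEGER,
        address      TEXT,
        valid_from   TEXT,  -- when the fact is true in the real world
        valid_to     TEXT,
        tx_from      TEXT,  -- when the fact was recorded in the database
        tx_to        TEXT
    )""")
conn.executemany(
    "INSERT INTO customer_address VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "12 Oak St",  "2020-01-01", "2022-06-30", "2020-01-05", "9999-12-31"),
        (1, "98 Elm Ave", "2022-07-01", "9999-12-31", "2022-07-02", "9999-12-31"),
    ],
)

# Bi-temporal query: where did customer 1 live on 2021-03-15,
# according to what the database knew on 2021-04-01?
row = conn.execute("""
    SELECT address FROM customer_address
    WHERE customer_id = 1
      AND valid_from <= '2021-03-15' AND '2021-03-15' < valid_to
      AND tx_from    <= '2021-04-01' AND '2021-04-01' < tx_to
""").fetchone()
print(row)   # ('12 Oak St',)
```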
1.3.8.3 Non-relational
Non-relational databases can store data as simple strings or complete files. Data in these files can be read in
different ways, depending on the need (this characteristic is referred to as ‘schema on read’). Non-relational
databases may be row-oriented, but this is not required.
A non-relational database provides a mechanism for storage and retrieval of data that employs less constrained
consistency models than traditional relational databases. Motivations for this approach include simplicity of
design, horizontal scaling, and finer control over availability.
Non-relational databases are usually referred to as NoSQL (which stands for “Not Only SQL”). The primary
differentiating factor is the storage structure itself, where the data structure is no longer bound to a tabular
relational design. It could be a tree, a graph, a network, or a key-value pairing. The NoSQL tag emphasizes that
some editions may in fact support conventional SQL directives. These databases are often highly optimized data
stores intended for simple retrieval and appending operations. The goal is improved performance, especially
with respect to latency and throughput. NoSQL databases are used increasingly in Big Data and real-time web
applications. (See Chapter 5.)
1.3.8.3.1 Column-oriented
Column-oriented databases are used mostly in Business Intelligence applications because they can compress redundant data. For example, a state ID column needs to store only its unique values, rather than repeating a value for each of a million rows.
There are trade-offs between column-oriented (non-relational) and row-oriented (usually relational)
organization.
• Column-oriented organization is more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns, because reading that smaller subset of data can be faster than reading all data (see the sketch after this list).
• Column-oriented organization is more efficient when new values of a column are supplied for all rows
at once, because that column data can be written efficiently to replace old column data without
touching any other columns for the rows.
• Row-oriented organization is more efficient when many columns of a single row are required at the
same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk
seek.
• Row-oriented organization is more efficient when writing a new row if all of the row data is supplied
at the same time; the entire row can be written with a single disk seek.
• In practice, row-oriented storage layouts are well suited for Online Transaction Processing (OLTP)-
like workloads, which are more heavily loaded with interactive transactions. Column-oriented storage
layouts are well suited for Online Analytical Processing (OLAP)-like workloads (e.g., data
warehouses) which typically involve a smaller number of highly complex queries over all data
(possibly terabytes).
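Here is the sketch referenced in the first trade-off above, using plain Python structures as stand-ins for row and column layouts: computing an aggregate over a single column touches far less data in the columnar layout.

```python
# Row store: each record carries every column; the aggregate must scan whole rows.
rows = [{"id": i, "state": "CA", "amount": i % 100, "notes": "x" * 200} for i in range(10_000)]
total_row_store = sum(row["amount"] for row in rows)        # reads every full row

# Column store: each column is stored contiguously; the aggregate reads one column only.
columns = {
    "id": [row["id"] for row in rows],
    "state": [row["state"] for row in rows],
    "amount": [row["amount"] for row in rows],
    "notes": [row["notes"] for row in rows],
}
total_column_store = sum(columns["amount"])                 # reads only the 'amount' column

assert total_row_store == total_column_store
```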
1.3.8.3.2 Spatial
A spatial database is optimized to store and query data that represents objects defined in a geometric space.
Spatial databases support several primitive types (simple geometric shapes such as box, rectangle, cube,
cylinder, etc.) and geometries composed of collections of points, lines, and shapes.
Spatial database systems use indexes to quickly look up values; the way that most databases index data is not
optimal for spatial queries. Instead, spatial databases use a spatial index to speed up database operations.
Spatial databases can perform a wide variety of spatial operations. As per the Open Geospatial Consortium
standard, a spatial database may perform one or more of the following operations:
• Spatial Measurements: Computes line length, polygon area, the distance between geometries, etc.
• Spatial Functions: Modifies existing features to create new ones; for example, by providing a buffer
around them, intersecting features, etc.
• Spatial Predicates: Allows true/false queries about spatial relationships between geometries. Examples include “Do two polygons overlap?” or “Is there a residence located within a mile of the area of the proposed landfill?” (A minimal sketch of such a predicate follows this list.)
• Geometry Constructors: Creates new geometries, usually by specifying the vertices (points or nodes)
which define the shape.
• Observer Functions: Queries that return specific information about a feature such as the location of
the center of a circle.
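The sketch referenced above is a simple Python illustration of one spatial predicate (a within-distance test on planar coordinates); real spatial databases would answer this through a spatial index such as an R-tree, and the coordinates below are invented.

```python
import math

def within_distance(point_a, point_b, max_distance):
    """Spatial predicate: true if two planar points lie within max_distance of each other."""
    dx = point_a[0] - point_b[0]
    dy = point_a[1] - point_b[1]
    return math.hypot(dx, dy) <= max_distance

proposed_landfill = (120.0, 45.0)                  # hypothetical planar coordinates
residences = {"r1": (119.2, 44.8), "r2": (131.0, 52.5)}

nearby = [name for name, location in residences.items()
          if within_distance(location, proposed_landfill, max_distance=1.0)]
print(nearby)   # ['r1']
```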
A multimedia database includes a Hierarchical Storage Management system for the efficient management of a hierarchy of magnetic and optical storage media. It also includes a collection of object classes, which represents the foundation of the system.
A flat file database describes any of various means to encode a data set as a single file. A flat file can be a plain
text file or a binary file. Strictly, a flat file database consists of nothing but data, and contains records that may
vary in length and delimiters. More broadly, the term refers to any database that exists in a single file in the
form of rows and columns, with no relationships or links between records and fields except the structure. Plain
text files usually contain one record per line. A list of names, addresses, and phone numbers, written by hand on
a sheet of paper, is an example of a flat file database. Flat files are used not only as data storage tools in DBMS
systems, but also as data transfer tools. Hadoop databases use flat file storage.
Key-Value pair databases contain sets of two items: a key identifier and a value. There are a few specific uses
of these types of databases.
• Graph Databases: Graph databases store key-value pairs where the focus is on the relationship
between the nodes, rather than on the nodes themselves.
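A minimal Python sketch of the key-value idea (purely illustrative): values are retrieved only by key, and a graph-flavored variant stores each node's relationships as its value.

```python
# Plain key-value store: one key, one opaque value.
kv = {}
kv["customer:42"] = {"name": "Ada", "status": "active"}
print(kv["customer:42"]["name"])            # lookups are by key only

# Graph-flavored key-value store: the value for each node is its set of relationships.
edges = {
    "customer:42": [("PURCHASED", "product:7"), ("LIVES_IN", "city:oslo")],
    "product:7":   [("SUPPLIED_BY", "vendor:3")],
}
for relation, target in edges["customer:42"]:
    print("customer:42", relation, target)
```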
1.3.8.3.6 Triplestore
Triplestores can be broadly classified into three categories: Native triplestores, RDBMS-backed triplestores and
NoSQL triplestores.
• Native triplestores are those that are implemented from scratch and exploit the RDF data model to
efficiently store and access the RDF data.
• RDBMS-backed triplestores are built by adding an RDF specific layer to an existing RDBMS.
• NoSQL Triplestores are currently being investigated as possible storage managers for RDF.
Triplestore databases are best for taxonomy and thesaurus management, linked data integration, and knowledge
portals.
Some specialized situations require specialized types of databases that are managed differently from traditional
relational databases. Examples include:
• Computer Assisted Design and Manufacturing (CAD / CAM) applications require an Object
database, as will most embedded real-time applications.
• Geographical Information Systems (GIS) make use of specialized geospatial databases, which have
at least annual updates to their Reference Data. Some specialized GIS are used for utilities (electric
grid, gas lines, etc.), for telecom in network management, or for ocean navigation.
• Shopping-cart applications, found on most online retail websites, make use of XML databases to initially store the customer order data; this data may then be used in real time by social media databases for ad placement on other websites.
Some of this data is then copied into one or more traditional OLTP (Online Transaction Processing) databases
or data warehouses. In addition, many off-the-shelf vendor applications may use their own proprietary
databases. At the very least, their schemas will be proprietary and mostly concealed, even if they sit on top of
traditional relational DBMSs.
All databases, no matter the type, share the following processes in some way.
1.3.10.1 Archiving
Archiving is the process of moving data off immediately accessible storage media and onto media with lower
retrieval performance. Archives can be restored to the originating system for short-term use. Data that is not
actively needed to support application processes should be moved to an archive on less-expensive disk, tape, or
a CD / DVD jukebox. Restoring from an archive should be a matter of simply copying the data from the archive
back into the system.
Archival processes must be aligned with the partitioning strategy to ensure optimal availability and retention. It is wise to schedule regular tests of archive restoration to avoid surprises in an emergency.
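As a minimal sketch of archiving and restoring (the sales tables and cutoff date are hypothetical), the sqlite3 example below moves rows older than a cutoff into an archive table and deletes them from the primary table; restoring is simply the reverse copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)")
conn.execute("CREATE TABLE sales_archive AS SELECT * FROM sales WHERE 0")   # same columns, empty
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, "2015-03-01", 20.0), (2, "2023-11-12", 35.5), (3, "2014-07-19", 12.0),
])

cutoff = "2016-01-01"
with conn:  # archive and delete as one transaction
    conn.execute("INSERT INTO sales_archive SELECT * FROM sales WHERE sale_date < ?", (cutoff,))
    conn.execute("DELETE FROM sales WHERE sale_date < ?", (cutoff,))

print(conn.execute("SELECT COUNT(*) FROM sales").fetchone())          # (1,)
print(conn.execute("SELECT COUNT(*) FROM sales_archive").fetchone())  # (2,)

# Restoring from the archive is simply copying the rows back.
conn.execute("INSERT INTO sales SELECT * FROM sales_archive WHERE sale_date >= '2014-01-01'")
```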
When changes are made to the technology or structure of a production system, the archive also needs to be
evaluated to ensure that data moved from the archive into current storage will be readable. There are several
ways of handling out-of-synch archives:
• Determine if or how much of the archive is required to be preserved. What is not required can be considered for purging.
• For major changes in technology, restore the archives to the originating system before the technology
change, upgrade or migrate to the new technology, and re-archive the data using the new technology.
• For high-value archives where the source database structures change, restore the archive, make any
changes to the data structures, and re-archive the data with the new structure.
• For infrequent-access archives where the source technology or structure changes, keep a small version
of the old system running with limited access, and extract from the archives using the old system as
needed.
Archives that are not recoverable with current technology are useless, and keeping old machinery around to read archives that cannot otherwise be read is neither efficient nor cost-effective.
Think of a database as a box, the data as fruit, and overhead (indexes, etc.) as packing material. The box has
dividers, and fruit and packing material go in the cells:
• First, decide the size of the box that will hold all the fruit and any packing material needed – that is the
Capacity.
• How much fruit goes into the box, and how quickly?
• How much fruit comes out of the box, and how quickly?
Decide if the box will stay the same size over time, or must be expanded over time to hold more fruit. This
projection of how much and how quickly the box must expand to hold incoming fruit and packing material is
the growth projection. If the box cannot expand, the fruit must be taken out as fast as it is put in, and the growth
projection is zero.
How long should the fruit stay in the cells? If the fruit in one cell gets dehydrated over time, or for any reason
becomes not as useful, should that fruit be put in a separate box for longer term storage (i.e., archived)? Will
there ever be a need to bring that dehydrated fruit back into the main box? Moving the fruit to another box with
the ability to move it back into the first box is an important part of archiving. This allows the box to not have to
be expanded quite as often or as much.
If a fruit becomes too stagnant to use, throw that fruit away (i.e., purge the data).
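To put rough numbers on the analogy, the small Python sketch below (all figures are invented) projects how big the 'box' must be after a year, given inflow, outflow, and an overhead factor for indexes and other packing material.

```python
# Hypothetical capacity planning figures.
current_gb        = 500.0    # data already in the database
daily_inflow_gb   = 2.0      # new data loaded per day
daily_outflow_gb  = 0.5      # data archived or purged per day
overhead_factor   = 1.3      # indexes, logs, free space ("packing material")
days              = 365

net_growth_gb = (daily_inflow_gb - daily_outflow_gb) * days
projected_gb  = (current_gb + net_growth_gb) * overhead_factor

print(f"Projected capacity needed in one year: {projected_gb:,.0f} GB")
# (500 + 1.5 * 365) * 1.3 is roughly 1,362 GB
```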
Change data capture refers to the process of detecting that data has changed and ensuring that information
relevant to the change is stored appropriately. Often referred to as log-based replication, CDC is a non-invasive
way to replicate data changes to a target without affecting the source. In a simplified CDC context, one
computer system has data that may have changed from a previous point in time, and a second computer system
needs to reflect the same change. Rather than sending the entire database over the network to reflect just a few
minor changes, the idea is to just send what changed (deltas), so that the receiving system can make appropriate
updates.
There are two different methods to detect and collect changes: data versioning, which evaluates columns that identify rows that have changed (e.g., last-update-timestamp columns, version-number columns, status-indicator columns), and log reading, in which changes documented in transaction logs are replicated to secondary systems.
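A minimal sketch of the data-versioning approach (hypothetical table and column names), using sqlite3: the extract selects only rows whose last-update timestamp is newer than the previous checkpoint, replicates those deltas, and then advances the checkpoint.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, last_update TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "shipped",   "2024-05-01T10:00:00"),
    (2, "pending",   "2024-05-02T09:30:00"),
    (3, "cancelled", "2024-05-03T14:15:00"),
])

checkpoint = "2024-05-01T23:59:59"   # high-water mark from the previous CDC run

# Capture only the deltas: rows changed since the last checkpoint.
deltas = source.execute(
    "SELECT id, status, last_update FROM orders WHERE last_update > ? ORDER BY last_update",
    (checkpoint,),
).fetchall()

for row in deltas:
    print("replicate to target:", row)          # send only what changed

if deltas:
    checkpoint = deltas[-1][2]                  # advance the high-water mark
print("new checkpoint:", checkpoint)
```

The same pattern works with version-number or status-indicator columns; only the filter in the extract query changes.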
1.3.10.4 Purging
It is incorrect to assume that all data will reside forever in primary storage. Eventually, the data will fill the
available space, and performance will begin to degrade. At that point, data will need to be archived, purged, or
both. Just as importantly, some data will degrade in value and is not worth keeping. Purging is the process of
completely removing data from storage media such that it cannot be recovered. A principal goal of data
management is that the cost of maintaining data should not exceed its value to the organization. Purging data
reduces costs and risks. Data to be purged is generally deemed obsolete and unnecessary, even for regulatory
purposes. Some data may become a liability if kept longer than necessary. Purging it reduces the risks that it
may be misused.
1.3.10.5 Replication
Data replication means that the same data is stored on multiple storage devices. In some situations, having duplicate
databases is useful, such as in a high-availability environment where spreading the workload among identical
databases in different hardware or even data centers can preserve functionality during peak usage times or
disasters.
• Active replication is performed by recreating and storing the same data at every replica from every
other replica.
• Passive replication involves recreating and storing data on a single primary replica and then
transforming its resultant state to other secondary replicas.
Multi-master replication, where updates can be submitted to any database node and then ripple through to other
servers, is often desired, but increases complexity and cost.
Replication transparency occurs when data is replicated between database servers so that the information
remains consistent throughout the database system and users cannot tell or even know which database copy they
are using.
The two primary replication patterns are mirroring and log shipping (see Figure 60).
• In mirroring, updates to the primary database are replicated immediately (relatively speaking) to the
secondary database, as part of a two-phase commit process.
• In log shipping, a secondary server receives and applies copies of the primary database’s transaction
logs at regular intervals.
The choice of replication method depends on how critical the data is, and how important it is that failover to the
secondary server be immediate. Mirroring is usually a more expensive option than log shipping. For one
secondary server, mirroring is effective; log shipping may be used to update additional secondary servers.
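The toy Python sketch below illustrates the log-shipping idea only; real DBMSs ship binary transaction logs rather than SQL text. The primary records each change in a log, and the secondary periodically applies any log entries it has not yet seen.

```python
import sqlite3

primary = sqlite3.connect(":memory:")
secondary = sqlite3.connect(":memory:")
for db in (primary, secondary):
    db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")

shipped_log = []          # stand-in for transaction log files shipped to the secondary
applied_up_to = 0         # how much of the log the secondary has already applied

def primary_write(sql, params):
    primary.execute(sql, params)
    shipped_log.append((sql, params))          # record the change for shipping

def apply_shipped_logs():
    global applied_up_to
    for sql, params in shipped_log[applied_up_to:]:   # apply only new entries
        secondary.execute(sql, params)
    applied_up_to = len(shipped_log)

primary_write("INSERT INTO account VALUES (?, ?)", (1, 100))
primary_write("UPDATE account SET balance = ? WHERE id = ?", (90, 1))

apply_shipped_logs()      # run at regular intervals on the secondary
print(secondary.execute("SELECT * FROM account").fetchall())   # [(1, 90)]
```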
Resiliency in databases is the measurement of how tolerant a system is to error conditions. If a system can
tolerate a high level of processing errors and still function as expected, it is highly resilient. If an application
crashes upon the first unexpected condition, that system is not resilient. If the database can detect and either
abort or automatically recover from common processing errors (runaway query, for example), it is considered
resilient. There are always some conditions that no system can detect in advance, such as a power failure, and
those conditions are considered disasters.
Three recovery types provide guidelines for how quickly recovery takes place and what it focuses on:
• Immediate recovery from some issues can be achieved through design; for example, by predicting issues and resolving them automatically, such as by failing over to a backup system.
• Critical recovery refers to a plan to restore the system as quickly as possible in order to minimize
delays or shut downs of business processes.
• Non-critical recovery means that restoration of function can be delayed until systems that are more
critical have been restored.
Data processing errors include data load failures, query return failures, and obstacles to completing ETL or
other processes. Common ways of increasing resilience in data processing systems are to trap and re-route data
causing errors, detect and ignore data causing errors, and implement flags in processing for completed steps to
avoid reprocessing data or repeating completed steps when restarting a process.
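A minimal Python sketch of the resilience techniques just listed (step and data names are invented): rows that raise errors are re-routed for later review instead of halting the load, and a set of completed-step flags prevents repeating finished work on restart.

```python
completed_steps = set()        # persisted in practice, so a restart can skip finished work
error_rows = []                # re-routed rows for later review

def run_step(name, rows, transform):
    """Run one load step, trapping bad rows and skipping the step if already done."""
    if name in completed_steps:
        return []                           # restart-safe: do not repeat completed steps
    loaded = []
    for row in rows:
        try:
            loaded.append(transform(row))
        except (ValueError, KeyError) as exc:
            error_rows.append((name, row, str(exc)))   # trap and re-route, don't halt
    completed_steps.add(name)
    return loaded

raw = [{"amount": "10.5"}, {"amount": "oops"}, {"amount": "3"}]
clean = run_step("parse_amounts", raw, lambda r: float(r["amount"]))
print(clean)        # [10.5, 3.0]
print(error_rows)   # the bad row is preserved for review rather than stopping the job
```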
Each system should require a certain level of resiliency (high or low). Some applications may require that any
error halts all processing (low resiliency), while others may only require that the errors be trapped and re-routed
for review, if not outright ignored.
For extremely critical data, the DBA will need to implement a replication pattern in which data moves to
another copy of the database on a remote server. In the event of database failure, applications can then ‘fail
over’ to the remote database and continue processing.
1.3.10.7 Retention
Data Retention refers to how long data is kept available. Data retention planning should be part of the physical
database design. Retention requirements also affect capacity planning.
Data Security also affects data retention plans, as some data needs to be retained for specific timeframes for
legal reasons. Failure to retain data for the appropriate length of time can have legal consequences. Likewise,
there are also regulations related to purging data. Data can become a liability if kept longer than specified.
Organizations should formulate retention policies based on regulatory requirements and risk management
guidelines. These policies should drive specifications for purging and archiving of data.
1.3.10.8 Sharding
Sharding is a process where small chunks of the database are isolated and can be updated independently of other
shards, so replication is merely a file copy. Because the shards are small, refreshes/overwrites may be optimal.
2. Activities
The two main activities in Data Operations and Storage are Database Technology Support and Database
Operations Support. Database Technology Support is specific to selecting and maintaining the software that
stores and manages the data. Database Operations Support is specific to the data and processes that the software
manages.
Managing database technology should follow the same principles and standards for managing any technology.
The leading reference model for technology management is the Information Technology Infrastructure Library
(ITIL), a technology management process model developed in the United Kingdom. ITIL principles apply to
managing data technology. 35
It is important to understand how technology works, and how it can provide value in the context of a particular
business. The DBA, along with the rest of the data services teams, works closely with business users and
managers to understand the data and information needs of the business. DBAs and Database Architects combine
their knowledge of available tools with the business requirements in order to suggest the best possible
applications of technology to meet organizational needs.
Data professionals must first understand the characteristics of a candidate database technology before
determining which to recommend as a solution. For example, database technologies that do not have
transaction-based capabilities (e.g., commit and rollback) are not suitable for operational situations supporting
Point-of-Sale processes.
Do not assume that a single type of database architecture or DBMS works for every need. Most organizations
have multiple database tools installed, to perform a range of functions, from performance tuning to backups, to
managing the database itself. Only a few of these tool sets have mandated standards.
Selecting strategic DBMS software is particularly important. DBMS software has a major impact on data
integration, application performance, and business productivity. Some of the factors to consider when selecting
DBMS software include:
35 http://bit.ly/1gA4mpr.
Some factors are not directly related to the technology itself, but rather to the purchasing organization and to the
tool vendors. For example:
The expense of the product, including administration, licensing, and support, should not exceed the product’s
value to the business. Ideally, the technology should be as user friendly, self-monitoring, and self-administering
as possible. If it is not, then it may be necessary to bring in staff with experience using the tool.
It is a good idea to start with a small pilot project or a proof-of-concept (POC), to get a good idea of the true
costs and benefits before proceeding with a full-blown production implementation.
DBAs often serve as Level 2 technical support, working with help desks and technology vendor support to
understand, analyze, and resolve user problems. The key to effective understanding and use of any technology
is training. Organizations should make sure they have training plans and budgets in place for everyone involved
in implementing, supporting, and using data and database technology. Training plans should include appropriate
levels of cross-training to better support application development, especially Agile development. DBAs should
have working knowledge of application development skills, such as data modeling, use-case analysis, and
application data access.
The DBA will be responsible for ensuring databases have regular backups and for performing recovery tests.
However, if data from these databases needs to be merged with other existing data in one or more databases,
there may be a data integration challenge. DBAs should not simply merge data. Instead, they should work with
other stakeholders to ensure that data can be integrated correctly and effectively.
When a business requires new technology, the DBAs will work with business users and application developers
to ensure the most effective use of the technology, to explore new applications of the technology, and to address
any problems or issues that surface from its use. The DBAs then deploy new technology products in pre-
production and production environments. They will need to create and document processes and procedures for
administering the product with the least amount of effort and expense.
Database support, as provided by DBAs and Network Storage Administrators (NSAs), is at the heart of data
management. Databases reside on managed storage areas. Managed storage can be as small as a disk drive on a
personal computer (managed by the OS), or as large as RAID arrays on a storage area network or SAN. Backup
media is also managed storage.
DBAs manage various data storage applications by assigning storage structures, maintaining physical databases
(including physical data models and physical layouts of the data, such as assignments to specific files or disk
areas), and establishing DBMS environments on servers.
DBAs establish storage systems for DBMS applications and file storage systems to support NoSQL. NSAs and
DBAs together play a vital role in establishing file storage systems. Data enters the storage media during normal
business operations and, depending on the requirements, can stay permanently or temporarily. It is important to
plan for adding additional space well in advance of when that space is actually needed. Doing any sort of
maintenance in an emergency is a risk.
All projects should have an initial capacity estimate for the first year of operations, and a growth projection for
the following few years. Capacity and growth should be estimated not only for the space the data itself holds,
but also for indexes, logs, and any redundant images such as mirrors.
Data storage requirements must account for regulation related to data retention. For legal reasons, organizations
are required to retain some data for set periods (see Chapter 9). In some cases, they may also be required to
purge data after a defined period. It’s a good idea to discuss data retention needs with the data owners at design
time and reach agreement on how to treat data through its lifecycle.
The DBAs will work with application developers and other operations staff, including server and storage
administrators, to implement the approved data retention plan.
Databases have several basic usage patterns, including:
• Transaction-based
• Large data set write- or retrieval-based
• Time-based (heavier at month end, lighter on weekends, etc.)
• Location-based (more densely populated areas have more transactions, etc.)
• Priority-based (some departments or batch IDs have higher priority than others)
Some systems will have a combination of these basic patterns. DBAs need to be able to predict ebbs and flows
of usage patterns and have processes in place to handle peaks (such as query governors or priority management)
as well as to take advantage of valleys (delay processes that need large amounts of resources until a valley
pattern exists). This information can be used to maintain database performance.
Data access includes activities related to storing, retrieving, or acting on data housed in a database or other
repository. Data Access is simply the authorization to access different data files.
Various standard languages, methods, and formats exist for accessing data from databases and other repositories: SQL, ODBC, JDBC, XQJ, ADO.NET, XML, XQuery, XPath, and Web Services for ACID-type systems. BASE-type access method standards include C, C++, REST, XML, and Java 36. Some standards enable translation of data from unstructured (such as HTML or free-text files) to structured (such as XML or SQL).
Data architects and DBAs can assist organizations to select appropriate methods and tools required for data
access.
Organizations need to plan for business continuity in the event of a disaster or adverse event that impacts their
systems and their ability to use their data. DBAs must make sure a recovery plan exists for all databases and
database servers, covering scenarios that could result in loss or corruption of data, such as:
36 http://bit.ly/1rWAUxS (accessed 2/28/2016) has a list of all data access methods for BASE-type systems.
Each database should be evaluated for criticality so that its restoration can be prioritized. Some databases will
be essential to business operations and will need to be restored immediately. Less critical databases will not be
restored until primary systems are up and running. Still others may not need to be restored at all; for example, if
they are merely copies that are refreshed when loaded.
Management and the organization’s business continuity group, if one exists, should review and approve the data
recovery plan. The DBA group should regularly review the plans for accuracy and comprehensiveness. Keep a
copy of the plan, along with all the software needed to install and configure the DBMS, instructions, and
security codes (e.g., the administrator password) in a secure, off-site location in the event of a disaster.
No system can be recovered from a disaster if the backups are unavailable or unreadable. Regular backups are
essential to any recovery effort, but if they are unreadable, they are worse than useless; the processing time spent making the unreadable backups will have been wasted, along with the opportunity for fixing the issue that made the
backups unreadable. Keep all backups in a secure, off-site location.
Make backups of databases and, if appropriate, the database transaction logs. The system’s Service Level
Agreement (SLA) should specify backup frequency. Balance the importance of the data against the cost of
protecting it. For large databases, frequent backups can consume large amounts of disk storage and server
resources. In addition to incremental backups, periodically make a complete backup of each database.
Furthermore, databases should reside on a managed storage area, ideally a RAID array on a storage area
network or SAN, with daily back up to separate storage media. For OLTP databases, the frequency of
transaction log backups will depend on the frequency of updating, and the amount of data involved. For
frequently updated databases, more frequent log dumps will not only provide greater protection, but will also
reduce the impact of the backups on server resources and applications.
Backup files should be kept on a separate filesystem from the databases, and should be backed up to some
separate storage medium as specified in the SLA. Store copies of the daily backups in a secure off-site facility.
Most DBMSs support hot backups of the database – backups taken while applications are running. If updates occur while the backup is being taken, they will either roll forward to completion or roll back when the backup is reloaded. The alternative is a cold backup taken when the database is off-line. However, a cold backup may not be a viable
option if applications need to be continuously available.
Most backup software includes the option to read from the backup into the system. The DBA works with the
infrastructure team to re-mount the media containing the backup and to execute the restoration. The specific
utilities used to execute the restoration of the data depend on the type of database.
Data in file system databases may be easier to restore than those in relational database management systems,
which may have catalog information that needs to be updated during the data recovery, especially if the
recovery is from logs instead of a full backup.
It is critical to periodically test recovery of data. Doing so will reduce bad surprises during a disaster or
emergency. Practice runs can be executed on non-production system copies with identical infrastructure and
configuration, or if the system has a failover, on the secondary system.
DBAs are responsible for the creation of database instances. Related activities include:
• Installing and updating DBMS software: DBAs install new versions of the DBMS software and
apply maintenance patches supplied by the DBMS vendor in all environments (from development to
production) as indicated by the vendor and vetted and prioritized by DBA specialists, security
specialists, and management. This is a critical activity to ensure against vulnerability to attacks, as well
as to ensure ongoing data integrity in centralized and decentralized installations.
• Maintaining multiple environment installations, including different DBMS versions: DBAs may
install and maintain multiple instances of DBMS software in sandbox, development, testing, user
acceptance testing, system acceptance testing, quality assurance, pre-production, hot-fix, disaster
recovery environments, and production, and manage migration of the DBMS software versions
through environments relative to applications and systems versioning and changes.
• Installing and administering related data technology: DBAs may be involved in installing data
integration software and third party data administration tools.
Storage environment management needs to follow traditional Software Configuration Management (SCM)
processes or Information Technology Infrastructure Library (ITIL) methods to record modification to the
database configuration, structures, constraints, permissions, thresholds, etc. DBAs need to update the physical
data model to reflect the changes to the storage objects as part of a standard configuration management process.
With agile development and extreme programming methods, updates to the physical data model play important
roles in preventing design or development errors.
DBAs need to apply the SCM process to trace changes and to verify that the databases in the development, test,
and production environments have all of the enhancements included in each release – even if the changes are
cosmetic or only in a virtualized data layer.
The four procedures required to ensure a sound SCM process are configuration identification, configuration
change control, configuration status accounting, and configuration audits.
• During the configuration identification process, DBAs will work with data stewards, data architects,
and data modelers to identify the attributes that define every aspect of a configuration for end-user
purposes. These attributes are recorded in configuration documentation and baselined. Once an attribute is baselined, a formal configuration change control process is required to change the attribute.
• Configuration change control is a set of processes and approval stages required to change a
configuration item’s attributes and to re-baseline them.
• Configuration status accounting is the ability to record and report on the configuration baseline
associated with each configuration item at any point in time.
• Configuration audits occur both at delivery and when effecting a change. There are two types. A
physical configuration audit ensures that a configuration item is installed in accordance with the
requirements of its detailed design documentation, while a functional configuration audit ensures that
performance attributes of a configuration item are achieved.
To maintain data integrity and traceability throughout the data lifecycle, DBAs communicate the changes to
physical database attributes to modelers, developers, and Metadata managers.
DBAs must also maintain metrics on data volume, capacity projections, and query performance, as well as
statistics on physical objects, in order to identify data replication needs, data migration volumes, and data
recovery checkpoints. Larger databases will also have object partitioning, which must be monitored and
maintained over time to ensure that the object maintains the desired distribution of data.
DBAs are responsible for managing the controls that enable access to the data. DBAs oversee the following
functions to protect data assets and data integrity:
• Controlled environment: DBAs work with NSAs to manage a controlled environment for data assets;
this includes network roles and permissions management, 24x7 monitoring and network health
monitoring, firewall management, patch management, and Microsoft Baseline Security Analyzer
(MBSA) integration.
• Physical security: The physical security of data assets is managed by Simple Network Management
Protocol (SNMP)-based monitoring, data audit logging, disaster management, and database backup
planning. DBAs configure and monitor these protocols. Monitoring is especially important for security
protocols.
• Monitoring: Database systems are made available by continuous hardware and software monitoring of
critical servers.
• Controls: DBAs maintain information security by access controls, database auditing, intrusion
detection, and vulnerability assessment tools.
Concepts and activities involved in setting up data security are discussed in Chapter 7.
All data must be stored on a physical drive and organized for ease of load, search, and retrieval. Storage
containers themselves may contain storage objects, and each level must be maintained appropriate to the level
of the object. For example, relational databases have schemas that contain tables, and non-relational databases
have filesystems that contain files.
DBAs are typically responsible for creating and managing the complete physical data storage environment
based on the physical data model. The physical data model includes storage objects, indexing objects, and any
encapsulated code objects required to enforce data quality rules, connect database objects, and achieve database
performance.
Depending on the organization, data modelers may provide the data model and the DBAs implement the
physical layout of the data model in storage. In other organizations, DBAs may take a skeleton of a physical
model and add all the database-specific implementation details, including indexes, constraints, partitions or
clusters, capacity estimates, and storage allocation details.
For third-party database structures provided as part of an application, most data modeling tools allow reverse
engineering of Commercial Off the Shelf (COTS) or Enterprise Resource Planning (ERP) system databases, as
long as the modeling tool can read the storage tool catalog. These can be used to develop a Physical Model.
DBAs or data modelers will still need to review and potentially update the physical model for application-based
constraints or relationships; not all constraints and relationships are installed in database catalogs, especially for
older applications where database abstraction was desired.
Well-maintained physical models are necessary when DBAs are providing Data-as-a-Service.
When first built, databases are empty. DBAs fill them. If the data to be loaded has been exported using a
database utility, it may not be necessary to use a data integration tool to load it into the new database. Most
database systems have bulk load capabilities, requiring that the data be in a format that matches the target
database object, or having a simple mapping function to link data in the source to the target object.
Most organizations also obtain some data from external third-party sources, such as lists of potential customers
purchased from an information broker, postal and address information, or product data provided by a supplier.
The data can be licensed or provided as an open data service, free of charge; provided in a number of different
formats (CD, DVD, EDI, XML, RSS feeds, text files); or provided upon request or regularly updated via a
subscription service. Some acquisitions require legal agreements. DBAs need to be aware of these restrictions
before loading data.
DBAs may be asked to handle these types of loads, or to create the initial load map. Limit manual execution of
these loads to installations or other one-time situations, or ensure they are automated and scheduled.
A managed approach to data acquisition centralizes responsibility for data subscription services with data
analysts. The data analyst will need to document the external data source in the logical data model and data
dictionary. A developer may design and create scripts or programs to read the data and load it into a database.
The DBA will be responsible for implementing the necessary processes to load the data into the database and /
or make it available to the application.
DBAs can influence decisions about the data replication process by advising on:
For small systems or data objects, complete data refreshes may satisfy the requirements for concurrency. For
larger objects where most of the data does NOT change, merging changes into the data object is more efficient
than completely copying all data for every change. For large objects where most of the data is changed, it may
still be better to do a refresh than to incur the overhead of so many updates.
Database performance depends on two interdependent facets: availability and speed. Performance includes
ensuring availability of space, query optimization, and other factors that enable a database to return data in an
efficient way. Performance cannot be measured without availability. An unavailable database has a performance
measure of zero. DBAs and NSAs manage database performance by:
• Managing database connectivity. NSAs and DBAs provide technical guidance and support for IT and
business users requiring database connectivity based on policies enforced through standards and
protocols of the organization.
• Working with system programmers and network administrators to tune operating systems, networks,
and transaction processing middleware to work with the database.
• Dedicating appropriate storage and enabling the database to work with storage devices and storage
management software. Storage management software optimizes the use of different storage
technologies for cost-effective storage of older, less-frequently referenced data, by migrating that data
to less expensive storage devices. This results in more rapid retrieval time for core data. DBAs work
with storage administrators to set up and monitor effective storage management procedures.
• Providing volumetric growth studies to support storage acquisition and general data lifecycle
management activities of retention, tuning, archiving, backup, purging, and disaster recovery.
• Working with system administrators to provide operating workloads and benchmarks of deployed data
assets that support SLA management, charge-back calculations, server capacity, and lifecycle rotation
within the prescribed planning horizon.
System performance, data availability and recovery expectations, and expectations for teams to respond to
issues are usually governed through Service Level Agreements (SLAs) between IT data management services
organizations and data owners (Figure 61).
Typically, an SLA will identify the timeframes during which the database is expected to be available for use.
Often an SLA will identify a specified maximum allowable execution time for a few application transactions (a
mix of complex queries and updates). If the database is not available as agreed to, or if process execution times
violate the SLA, the data owners will ask the DBA to identify and remediate the causes of the problem.
Availability is the percentage of time that a system or database can be used for productive work. As
organizations increase their uses of data, availability requirements increase, as do the risks and costs of
unavailable data. To meet higher demand, maintenance windows are shrinking. Several related factors affect availability:
• Planned outages
o For maintenance
o For upgrades
• Unplanned outages
o Loss of the server hardware
o Disk hardware failure
o Operating system failure
o DBMS software failure
o Data center site loss
o Network failure
• Application problems
o Security and authorization problems
o Severe performance problems
o Recovery failures
• Data problems
o Corruption of data (due to bugs, poor design, or user error)
o Loss of database objects
o Loss of data
o Data replication failure
• Human error
DBAs are responsible for doing everything possible to ensure databases stay online and operational, including:
DBAs also establish and monitor database execution, use of data change logs, and synchronization of duplicated
environments. Log sizes and locations require space and in some cases can be treated like file-based databases
on their own. Other applications that consume logs must also be managed, to ensure use of the correct logs at
the required logging level. The more detail that is logged, the more space and processing required, which may
adversely affect performance.
DBAs optimize database performance both proactively and reactively, by monitoring performance and by
responding to problems quickly and competently. Most DBMSs provide the capability of monitoring
performance, allowing DBAs to generate analysis reports. Most server operating systems have similar
monitoring and reporting capabilities. DBAs should run activity and performance reports against both the
DBMS and the server on a regular basis, including during periods of heavy activity. They should compare these
reports to previous reports to identify any negative trends and save them to help analyze problems over time.
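A minimal sketch of this kind of trend comparison follows, assuming the DBMS and server metrics have already been exported into simple name and value pairs; the metric names and the 20% degradation threshold are illustrative assumptions.

    # Minimal sketch: flag negative trends by comparing a current report with a saved
    # baseline. Metric names and the 20% degradation threshold are assumptions.
    baseline = {"buffer_cache_hit_pct": 97.0, "avg_query_ms": 120.0, "lock_waits_per_hour": 15.0}
    current = {"buffer_cache_hit_pct": 93.5, "avg_query_ms": 180.0, "lock_waits_per_hour": 42.0}

    def worsened(name, old, new, threshold=0.20):
        """Return True when a metric has degraded by more than the threshold."""
        higher_is_better = name.endswith("_pct")   # crude convention for this sketch
        change = (old - new) / old if higher_is_better else (new - old) / old
        return change > threshold

    for name in baseline:
        if worsened(name, baseline[name], current[name]):
            print(f"Negative trend: {name} moved from {baseline[name]} to {current[name]}")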
Data movement may occur in real time through online transactions. However, many data movement and
transformation activities are performed through batch programs, which may move data between systems, or
merely perform operations on data within a system. These batch jobs must complete within specified windows
in the operating schedule. DBAs and data integration specialists monitor the performance of batch data jobs,
noting exceptional completion times and errors, determining the root cause of errors, and resolving these issues.
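The following minimal sketch illustrates that kind of batch monitoring, assuming job names, operating-window end times, and a simple completion log; all of these values are hypothetical.

    # Minimal sketch: flag batch jobs that finished outside their operating window or
    # ended in error. Job names, window end times, and log entries are assumptions.
    # (Windows that cross midnight are ignored in this simplified version.)
    from datetime import datetime

    window_end = {"load_sales": "02:00", "rebuild_indexes": "04:30"}
    job_log = [
        {"job": "load_sales", "finished": "01:47", "status": "OK"},
        {"job": "rebuild_indexes", "finished": "05:10", "status": "OK"},
        {"job": "load_sales", "finished": "01:52", "status": "ERROR"},
    ]

    def as_time(hhmm):
        return datetime.strptime(hhmm, "%H:%M").time()

    for entry in job_log:
        late = as_time(entry["finished"]) > as_time(window_end[entry["job"]])
        if late or entry["status"] != "OK":
            # Exceptional completion times and errors are investigated for root cause.
            print(f"Review {entry['job']}: finished {entry['finished']}, status {entry['status']}")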
When performance problems occur, the DBA, NSA, and Server Administration teams should use the
monitoring and administration tools of the DBMS to help identify the source of the problem. Common reasons
for poor database performance include:
• Locking and blocking: In some cases, a process running in the database may lock up database
resources, such as tables or data pages, and block another process that needs them. If the problem
persists, the DBA can kill the blocking process. In some cases, two processes may ‘deadlock’, with
each process locking resources needed by the other. Most DBMSs will automatically terminate one of
these processes after an interval of time. These types of problems are often the result of poor coding,
either in the database or in the application.
• Inaccurate database statistics: Most relational DBMSs have a built-in query optimizer, which relies
on stored statistics about the data and indexes to make decisions about how to execute a given query
most effectively. These statistics should be updated frequently, especially in active databases; failure to do so
results in poorly performing queries. (A brief sketch of refreshing statistics and adding a supporting index
appears after this list.)
• Poor coding: Perhaps the most common cause of poor database performance is poorly coded SQL.
Query coders need a basic understanding of how the SQL query optimizer works. They should code
SQL in a way that takes maximum advantage of the optimizer’s capabilities. Some systems allow
encapsulation of complex SQL in stored procedures, which can be pre-compiled and pre-optimized,
rather than embedded in application code or in script files.
• Inefficient complex table joins: Use views to pre-define complex table joins. In addition, avoid using
complex SQL (e.g., table joins) in database functions; unlike stored procedures, these are opaque to the
query optimizer.
• Insufficient indexing: Code complex queries and queries involving large tables to use indexes built on
the tables. Create the indexes necessary to support these queries. Be careful about creating too many
indexes on heavily updated tables, as this will slow down update processing.
• Application activity: Ideally, applications should be running on a server separate from the DBMS, so
that they are not competing for resources. Configure and tune database servers for maximum
performance. In addition, many newer DBMSs allow application objects, such as Java and .NET classes, to
be encapsulated in database objects and executed in the DBMS. Be careful about making use of this
capability. It can be very useful in certain cases, but executing application code on the database server
may affect the interoperability, application architecture, and performance of database processes.
• Overloaded servers: For DBMSs that support multiple databases and applications, there may be a
breaking point where the addition of more databases has an adverse effect on the performance of
existing databases. In this case, create a new database server. In addition, relocate databases that have
grown very large, or that are being used more heavily than before, to a different server. In some cases,
address problems with large databases by archiving less-used data to another location, or by deleting
expired or obsolete data.
• Database volatility: In some cases, large numbers of table inserts and deletes over a short while can
create inaccurate database distribution statistics. In these cases, turn off updating database statistics for
these tables, as the incorrect statistics will adversely affect the query optimizer.
• Runaway queries: Users may unintentionally submit queries that use a majority of the system’s
shared resources. Use rankings or query governors to kill or pause these queries until they can be
evaluated and improved.
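As noted in the list above, the following sketch shows optimizer statistics being refreshed and an index being added to support a frequent query. It uses SQLite only because it ships with Python; the table, column, and index names are assumptions, and the equivalent statements (UPDATE STATISTICS, dbms_stats, and so on) differ by DBMS.

    # Minimal sketch (SQLite for illustration): create an index to support a frequent
    # query, then refresh optimizer statistics. Table and column names are assumptions.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(i, i % 50, "2023-01-01") for i in range(1000)])

    conn.execute("CREATE INDEX ix_orders_customer ON orders (customer_id)")
    conn.execute("ANALYZE")   # refresh the statistics the query optimizer relies on

    plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 7").fetchall()
    print(plan)               # the plan should show the new index being used
    conn.close()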
After the cause of the problem is identified, the DBA will take whatever action is needed to resolve the
problem, including working with application developers to improve and optimize the database code, and
archiving or deleting data that is no longer actively needed by application processes. In exceptional cases for
OLTP-type databases, the DBA may consider working with the data modeler to restructure the affected portion of
the database. Do this only after other measures (e.g., the creation of views and indexes and the rewriting of SQL
code) have been tried, and only after careful consideration of the possible consequences, such as loss of data
integrity or the increase in complexity of SQL queries against denormalized tables.
For read-only reporting and analytical databases, denormalization for performance and ease of access is the rule
rather than the exception, and poses no threat or risk.
Databases do not appear once and remain unchanged. Business rules change, business processes change, and
technology changes. Development and test environments enable changes to be tested before they are brought
into a production environment. DBAs can make whole or subset copies of database structures and data onto
other environments to enable development and testing of system changes. There are several types of alternate
environments.
• Development environments are used to create and test changes that will be implemented in
production. Development must be maintained to closely resemble the production environment, though
with scaled down resources.
• Test environments serve several purposes: QA, integration testing, UAT, and performance testing.
The test environment ideally also has the same software and hardware as production. In particular,
environments used for performance testing should not be scaled down in resources.
• Sandboxes or experimental environments are used to test hypotheses and develop new uses of data.
The DBAs generally set up, grant access to, and monitor usage of these environments. They should
also ensure that sandboxes are isolated and do not adversely affect production operations.
• Alternate production environments are required to support offline backups, failover, and resiliency
support systems. These systems should be identical to the production systems, although the backup
(and recovery) system can be scaled down in compute capacity, since it is mostly dedicated to I/O
activities.
Software testing is labor-intensive and accounts for nearly half of the cost of system development. Efficient
testing requires high quality test data, and this data must be managed. Test data generation is a critical step in
software testing.
Test data is data that has been specifically identified to test a system. Testing can include verifying that a given
set of input produces expected output or challenging the ability of programming to respond to unusual, extreme,
exceptional, or unexpected input. Test data can be completely fabricated or generated using meaningless values
or it can be sample data. Sample data can be a subset of actual production data (by either content or structure),
or generated from production data. Production data can be filtered or aggregated to create multiple sample data
sets, depending on the need. In cases where production data contains protected or restricted data, sample data
must be masked.
Test data may be produced in a focused or systematic way (as is typically the case in functionality testing) using
statistics or filters, or by using other, less-focused approaches (as is typically the case in high-volume
randomized automated tests). Test data may be produced by the tester, by a program or function that aids the
tester, or by a copy of production data that has been selected and screened for the purpose. Test data may be
recorded for short-term re-use, created and managed to support regression tests, or used once and then removed
– although in most organizations, cleanup after projects does not include this step. DBAs should monitor project
test data and ensure that obsolete test data is purged regularly to preserve capacity.
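A minimal sketch of a program that aids the tester by fabricating a small, already-masked sample set appears below; the field names, value pools, and masking format are illustrative assumptions.

    # Minimal sketch: fabricate a small, already-masked sample data set for testing.
    # Field names and value pools are illustrative assumptions; no real data is used.
    import random

    FIRST_NAMES = ["Alex", "Sam", "Jordan", "Casey"]

    def fabricate_customer(i):
        """Generate one fabricated customer row containing no real personal data."""
        return {
            "customer_id": 100000 + i,
            "first_name": random.choice(FIRST_NAMES),
            "ssn": "***-**-" + f"{random.randint(0, 9999):04d}",   # masked placeholder
            "balance": round(random.uniform(0, 5000), 2),
        }

    test_rows = [fabricate_customer(i) for i in range(10)]
    print(test_rows[0])
    # Obsolete test sets like this should be purged once the project no longer needs them.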
It is not always possible to produce enough data for some tests, especially performance tests. The amount of test
data to be generated is determined or limited by considerations such as time, cost, and quality. It is also
impacted by regulation that limits the use of production data in a test environment. (See Chapter 7.)
Data migration is the process of transferring data between storage types, formats, or computer systems, with as
little change as possible. Changing data during migration is discussed in Chapter 8.
Data migration is a key consideration for any system implementation, upgrade, or consolidation. It is usually
performed programmatically, being automated based on rules. However, people need to ensure that the rules
and programs are executed correctly. Data migration occurs for a variety of reasons, including server or storage
equipment replacements or upgrades, website consolidation, server maintenance, or data center relocation. Most
implementations allow this to be done in a non-disruptive manner, such as concurrently while the host continues
to perform I/O to the logical disk (or LUN).
The mapping granularity dictates how quickly the Metadata can be updated, how much extra capacity is
required during the migration, and how quickly the previous location is marked as free. Smaller granularity
means faster update, less space required, and quicker freeing up of old storage.
Many day-to-day tasks a storage administrator has to perform can be completed simply and concurrently using
data migration techniques.
Automated and manual data remediation is commonly performed in migration to improve the quality of data,
eliminate redundant or obsolete information, and match the requirements of the new system. Data migration
phases (design, extraction, remediation, load, verification) for applications of moderate to high complexity are
commonly repeated several times before the new system is deployed.
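The following minimal sketch walks one table through extract, remediate, load, and verify, using in-memory SQLite purely for illustration; the table, the remediation rule, and the verification check are assumptions standing in for an organization's real migration rules.

    # Minimal sketch of the repeated phases for one table (extract, remediate, load,
    # verify). The table, remediation rule, and verification check are assumptions.
    import sqlite3

    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE customer (id INTEGER, email TEXT)")
    source.executemany("INSERT INTO customer VALUES (?, ?)",
                       [(1, "A@EXAMPLE.COM"), (2, None), (3, "c@example.com")])
    target.execute("CREATE TABLE customer (id INTEGER, email TEXT)")

    rows = source.execute("SELECT id, email FROM customer").fetchall()            # extract
    remediated = [(i, email.lower()) for i, email in rows if email]               # remediate
    target.executemany("INSERT INTO customer VALUES (?, ?)", remediated)          # load
    target.commit()

    loaded = target.execute("SELECT COUNT(*) FROM customer").fetchone()[0]        # verify
    print(f"extracted={len(rows)}, loaded={loaded}, dropped={len(rows) - loaded}")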
3. Tools
In addition to the database management systems themselves, DBAs use multiple other tools to manage
databases. These include data modeling and other application development tools, interfaces that allow users to
write and execute queries, data evaluation and modification tools for data quality improvement, and performance
and load monitoring tools.
Data modeling tools automate many of the tasks the data modeler performs. Some data modeling tools allow the
generation of database data definition language (DDL). Most support reverse engineering from database into a
data model. Tools that are more sophisticated validate naming standards, check spelling, store Metadata such as
definitions and lineage, and even enable publishing to the web. (See Chapter 5.)
Database monitoring tools automate monitoring of key metrics, such as capacity, availability, cache
performance, user statistics, etc., and alert DBAs and NSAs to database issues. Most such tools can
simultaneously monitor multiple database types.
Database systems have often included management tools. In addition, several third-party software packages
allow DBAs to manage multiple databases. These applications include functions for configuration, installation
of patches and upgrades, backup and restore, database cloning, test management, and data clean-up routines.
Developer Support tools contain a visual interface for connecting to and executing commands on a database.
Some are included with the database management software. Others include third-party applications.
4. Techniques
For upgrades and patches to operating systems, database software, database changes, and code changes, install
and test on the lowest level environment first – usually development. Once tested on the lowest level, install on
the next higher levels, and install on the production environment last. This ensures that the installers have
experience with the upgrade or patch, and can minimize disruption to the production environments.
Consistency in naming speeds understanding. Data architects, database developers, and DBAs can use naming
standards for defining Metadata or creating rules for exchanging documents between organizations.
ISO/IEC 11179 – Metadata registries (MDR), addresses the semantics of data, the representation of data, and
the registration of the descriptions of that data. It is through these descriptions that an accurate understanding of
the semantics and a useful depiction of the data are found.
The significant section for physical databases within that standard is Part 5 – Naming and Identification
Principles, which describes how to form conventions for naming data elements and their components.
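As a sketch only, the check below validates physical column names against a local naming convention; the lower_snake_case pattern and class-word suffix list are assumed house rules, not the ISO/IEC 11179 rules themselves.

    # Minimal sketch: validate physical column names against a local naming convention.
    # The lower_snake_case pattern and class-word suffixes are assumed house rules.
    import re

    APPROVED_SUFFIXES = ("_id", "_cd", "_dt", "_nm", "_amt", "_flg")
    PATTERN = re.compile(r"^[a-z][a-z0-9_]{0,29}$")

    def check_name(column_name):
        problems = []
        if not PATTERN.match(column_name):
            problems.append("not lower_snake_case or longer than 30 characters")
        if not column_name.endswith(APPROVED_SUFFIXES):
            problems.append("missing an approved class-word suffix")
        return problems

    for name in ["customer_id", "OrderDate", "invoice_amt", "status"]:
        print(name, "->", check_name(name) or "OK")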
It is extremely risky to change data directly in a database. However, there may be a need to do so, such as an
annual change to chart of accounts structures, a merger or acquisition, or an emergency, where direct changes are
indicated because of the ‘one-off’ nature of the request and/or the lack of appropriate tools for the circumstances.
It is helpful to place changes to be made into update script files and test them thoroughly in non-production
environments before applying to production.
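A minimal sketch of such an update script follows, wrapping the one-off change in a transaction with a sanity check before commit; the table, the change, and the expected row count are assumptions, and the same script would first be exercised in non-production environments.

    # Minimal sketch of a one-off data change applied as a script inside a transaction,
    # with a sanity check before commit. The table, the change, and the expected row
    # count are assumptions.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gl_account (account_cd TEXT, segment TEXT)")
    conn.executemany("INSERT INTO gl_account VALUES (?, ?)",
                     [("4100", "OLD-SEG"), ("4200", "OLD-SEG"), ("5100", "OTHER")])
    conn.commit()   # setup data committed; the change below runs in its own transaction

    EXPECTED_ROWS = 2   # how many rows the change should touch, agreed in advance

    try:
        cur = conn.execute("UPDATE gl_account SET segment = 'NEW-SEG' WHERE segment = 'OLD-SEG'")
        if cur.rowcount != EXPECTED_ROWS:
            raise ValueError(f"expected {EXPECTED_ROWS} rows, changed {cur.rowcount}")
        conn.commit()
        print("change applied")
    except Exception as exc:
        conn.rollback()   # leave the data untouched if anything unexpected happens
        print("rolled back:", exc)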
5. Implementation Guidelines
A risk and readiness assessment revolves around two central ideas: risk of data loss and risks related to
technology readiness.
• Data loss: Data can be lost through technical or procedural errors, or through malicious intent.
Organizations need to put in place strategies to mitigate these risks. Service Level Agreements often
specify the general requirements for protection. SLAs need to be supported by well-documented
procedures. Ongoing assessment is required to ensure robust technical responses are in place to prevent
data loss through malicious intent, as cyber threats are ever evolving. SLA audit and data audits are
recommended to assess and plan risk mitigations.
• Technology readiness: Newer technologies such as NoSQL, Big Data, triple stores, and FDMS
require skills and experience readiness in IT. Many organizations do not have the skill sets needed to
take advantage of these new technologies. DBAs, systems engineers, application developers, and business users
must be ready to realize the benefits of these technologies in BI and other applications.
DBAs often do not effectively promote the value of their work to the organization. They need to recognize the
legitimate concerns of data owners and data consumers, balance short-term and long-term data needs, educate
others in the organization about the importance of good data management practices, and optimize data
development practices to ensure maximum benefit to the organization and minimal impact on data consumers.
By regarding data work as an abstract set of principles and practices, and disregarding the human elements
involved, DBAs risk propagating an ‘us versus them’ mentality, and being regarded as dogmatic, impractical,
unhelpful, and obstructionist.
Many disconnects – mostly clashes in frames of reference – contribute to this problem. Organizations generally
regard information technology in terms of specific applications, not data, and usually see data from an
application-centric point of view. The long-term value to organizations of treating secure, reusable, high quality
data as a corporate resource is not as easily recognized or appreciated.
DBAs and other data management practitioners can help overcome these organizational and cultural obstacles.
They can promote a more helpful and collaborative approach to meeting the organization’s data and information
needs by following the guiding principles to identify and act on automation opportunities, building with reuse in
mind, applying best practices, connecting database standards to support requirements, and setting expectations
for DBAs in project work. In addition, they should:
• Proactively communicate: DBAs should be in close communication with project teams, both during
development and after implementation, to detect and resolve any issues as early as possible. They
should review data access code, stored procedures, views, and database functions written by
development teams and help surface any problems with database design.
• Communicate with people on their level and in their terms: It is better to talk with business people
in terms of business needs and ROI, and with developers in terms of object-orientation, loose coupling,
and ease of development.
• Stay business-focused: The objective of application development is to meet business requirements and
derive maximum value from the project.
• Be helpful: Always telling people ‘no’ encourages them to ignore standards and find another path.
Recognize that people need to get their work done; failing to help them succeed is mutually detrimental.
• Learn continually: Assess setbacks encountered during a project for lessons learned and apply these
to future projects. If problems arise from having done things wrong, point to them later as reasons for
doing things right.
To sum up, understand stakeholders and their needs. Develop clear, concise, practical, business-focused
standards for doing the best possible work in the best possible way. Moreover, teach and implement those
standards in a way that provides maximum value to stakeholders and earns their respect.
6.1 Metrics
DBAs need to discuss the need for metrics with data architects and Data Quality teams.
Part of data storage governance includes ensuring that an organization complies with all licensing agreements
and regulatory requirements. Carefully track and conduct yearly audits of software license and annual support
costs, as well as server lease agreements and other fixed costs. Being out of compliance with licensing
agreements poses serious financial and legal risks for an organization.
Audit data can help determine the total cost-of-ownership (TCO) for each type of technology and technology
product. Regularly evaluate technologies and products that are becoming obsolete, unsupported, less useful, or
too expensive.
A data audit is the evaluation of a data set based on defined criteria. Typically, an audit is performed to
investigate specific concerns about a data set and is designed to determine whether the data was stored in
compliance with contractual and methodological requirements. The data audit approach may include a project-
specific and comprehensive checklist, required deliverables, and quality control criteria.
Data validation is the process of evaluating stored data against established acceptance criteria to determine its
quality and usability. Data validation procedures depend on the criteria established by the Data Quality team (if
one is in place) or other data consumer requirements. DBAs support data audit and validation activities.
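As an illustration, the sketch below evaluates a small stored data set against simple acceptance criteria; the completeness and range checks and the field names are assumptions standing in for whatever criteria the Data Quality team or data consumers define.

    # Minimal sketch: evaluate a stored data set against simple acceptance criteria.
    # The criteria and field names are illustrative assumptions.
    records = [
        {"record_id": 1, "age": 34, "email": "a@example.com"},
        {"record_id": 2, "age": None, "email": ""},
        {"record_id": 3, "age": 212, "email": "c@example.com"},
    ]

    def validate(record):
        """Return a list of acceptance-criteria violations for one record."""
        issues = []
        if record["age"] is None:
            issues.append("age missing")
        elif not 0 <= record["age"] <= 120:
            issues.append("age out of range")
        if not record["email"]:
            issues.append("email missing")
        return issues

    failures = {}
    for record in records:
        issues = validate(record)
        if issues:
            failures[record["record_id"]] = issues

    pass_rate = 1 - len(failures) / len(records)
    print(f"pass rate {pass_rate:.0%}; failures: {failures}")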
CHAPTER 7
Data Security
1. Introduction
Data Security includes the planning, development, and execution of security policies and procedures to
provide proper authentication, authorization, access, and auditing of data and information assets. The
specifics of data security (which data needs to be protected, for example) differ between industries
and countries. Nevertheless, the goal of data security practices is the same: To protect information assets in
alignment with privacy and confidentiality regulations, contractual agreements, and business requirements.
These requirements come from:
• Stakeholders: Organizations must recognize the privacy and confidentiality needs of their
stakeholders, including clients, patients, students, citizens, suppliers, or business partners. Everyone in
an organization must be a responsible trustee of data about stakeholders.
• Government regulations: Government regulations are in place to protect the interests of some
stakeholders. Regulations have different goals. Some restrict access to information, while others ensure
openness, transparency, and accountability.
• Proprietary business concerns: Each organization has proprietary data to protect. An organization’s
data provides insight into its customers and, when leveraged effectively, can provide a competitive
advantage. If confidential data is stolen or breached, an organization can lose competitive advantage.
• Legitimate access needs: When securing data, organizations must also enable legitimate access.
Business processes require individuals in certain roles be able to access, use, and maintain data.
• Contractual obligations: Contractual and non-disclosure agreements also influence data security
requirements. For example, the PCI Standard, an agreement among credit card companies and
individual business enterprises, demands that certain types of data be protected in defined ways (e.g.,
mandatory encryption for customer passwords).
Effective data security policies and procedures ensure that the right people can use and update data in the right
way, and that all inappropriate access and update is restricted (Ray, 2012) (see Figure 62). Understanding and
complying with the privacy and confidentiality interests and needs of all stakeholders is in the best interest of
every organization. Client, supplier, and constituent relationships all trust in, and depend on, the responsible use
of data.
[Figure 62 Sources of Data Security Requirements: stakeholder concerns, government regulation, necessary
business access needs, and legitimate business concerns such as trade secrets, research and other IP, knowledge
of customer needs, and business partner relationships and impending deals. Data security must be appropriate
but not so onerous that it prevents users from doing their jobs (the ‘Goldilocks’ principle).]
[Context diagram: Data Security – definition, goals, business drivers, and technical drivers. Goals: (1) enable
appropriate, and prevent inappropriate, access to enterprise data assets; (2) understand and comply with all
relevant regulations and policies for privacy, protection, and confidentiality; (3) ensure that the privacy and
confidentiality needs of all stakeholders are enforced and audited.]
Risk reduction and business growth are the primary drivers of data security activities. Ensuring that an
organization’s data is secure reduces risk and adds competitive advantage. Security itself is a valuable asset.
Data security risks are associated with regulatory compliance, fiduciary responsibility for the enterprise and
stockholders, reputation, and a legal and moral responsibility to protect the private and sensitive information of
employees, business partners, and customers. Organizations can be fined for failure to comply with regulations
and contractual obligations. Data breaches can cause a loss of reputation and customer confidence. (See Chapter
2.)
Business growth includes attaining and sustaining operational business goals. Data security issues, breaches,
and unwarranted restrictions on employee access to data can directly impact operational success.
The goals of mitigating risks and growing the business can be complementary and mutually supportive if they
are integrated into a coherent strategy of information management and protection.
As data regulations increase — usually in response to data thefts and breaches — so do compliance
requirements. Security organizations are often tasked with managing not only IT compliance requirements, but
also policies, practices, data classifications, and access authorization rules across the organization.
As with other aspects of data management, it is best to address data security as an enterprise initiative. Without
a coordinated effort, business units will find different solutions to security needs, increasing overall cost while
potentially reducing security due to inconsistent protection. Ineffective security architecture or processes can
cost organizations through breaches and lost productivity. An operational security strategy that is properly
funded, systems-oriented, and consistent across the enterprise will reduce these risks.
Information security begins by classifying an organization’s data in order to identify which data requires
protection. The overall process includes the following steps:
• Identify and classify sensitive data assets: Depending on the industry and organization, there can be
few or many assets, and a range of sensitive data (including personal identification, medical, financial,
and more).
• Locate sensitive data throughout the enterprise: Security requirements may differ, depending on
where data is stored. A significant amount of sensitive data in a single location poses a high risk due to
the damage possible from a single breach.
• Determine how each asset needs to be protected: The measures necessary to ensure security can
vary between assets, depending on data content and the type of technology.
• Identify how this information interacts with business processes: Analysis of business processes is
required to determine what access is allowed and under what conditions.
In addition to classifying the data itself, it is necessary to assess external threats (such as those from hackers and
criminals) and internal risks (posed by employees and processes). Much data is lost or exposed through the
ignorance of employees who did not realize that the information was highly sensitive or who bypassed security
policies. 37 The customer sales data left on a web server that is hacked, the employee database downloaded onto
a contractor’s laptop that is subsequently stolen, and trade secrets left unencrypted in an executive’s computer
that goes missing, all result from missing or unenforced security controls.
The impact of security breaches on well-established brands in recent years has resulted in huge financial losses
and a drop in customer trust. Not only are the external threats from the criminal hacking community becoming
more sophisticated and targeted, the amount of damage done by external and internal threats, intentional or
unintentional, has also been steadily increasing over the years (Kark, 2009).
In a world of almost entirely electronic business infrastructure, trustworthy information systems have become a
business differentiator.
Globally, electronic technology is pervasive in the office, the marketplace, and the home. Desktop and laptop
computers, smart phones, tablets, and other devices are important elements of most business and government
operations. The explosive growth of e-commerce has changed how organizations offer goods and services. In
their personal lives, individuals have become accustomed to conducting business online with goods providers,
medical agencies, utilities, governmental offices, and financial institutions. Trusted e-commerce drives profit
and growth. Product and service quality relate to information security in a quite direct fashion: Robust
information security enables transactions and builds customer confidence.
One approach to managing sensitive data is via Metadata. Security classifications and regulatory sensitivity can
be captured at the data element and data set level. Technology exists to tag data so that Metadata travel with the
information as it flows across the enterprise. Developing a master repository of data characteristics means all
parts of the enterprise can know precisely what level of protection sensitive information requires.
37 One survey stated, “70 percent of IT professionals believe the use of unauthorized programs resulted in as many as half of
their companies’ data loss incidents. This belief was most common in the United States (74 percent), Brazil (75 percent),
and India (79 percent).” A report from the Ponemon Institute and Symantec found that “human errors and system
problems caused two-thirds of data breaches in 2012.” http://bit.ly/1dGChAz, http://symc.ly/1FzNo5l, http://bit.ly/2sQ68Ba,
http://bit.ly/2tNEkKY.
If a common standard is enforced, this approach enables multiple departments, business units, and vendors to
use the same Metadata. Standard security Metadata can optimize data protection and guide business usage and
technical support processes, leading to lower costs. This layer of information security can help prevent
unauthorized access to and misuse of data assets. When sensitive data is correctly identified as such,
organizations build trust with their customers and partners. Security-related Metadata itself becomes a strategic
asset, increasing the quality of transactions, reporting, and business analysis, while reducing the cost of
protection and associated risks that lost or stolen information cause.
1.2.1 Goals
• Enabling appropriate access and preventing inappropriate access to enterprise data assets
• Enabling compliance with regulations and policies for privacy, protection, and confidentiality
• Ensuring that stakeholder requirements for privacy and confidentiality are met
1.2.2 Principles
• Enterprise approach: Data Security standards and policies must be applied consistently across the
entire organization.
• Proactive management: Success in data security management depends on being proactive and
dynamic, engaging all stakeholders, managing change, and overcoming organizational or cultural
bottlenecks such as traditional separation of responsibilities between information security, information
technology, data administration, and business stakeholders.
• Clear accountability: Roles and responsibilities must be clearly defined, including the ‘chain of
custody’ for data across organizations and roles.
• Metadata-driven: Security classification for data elements is an essential part of data definitions.
1.3 Essential Concepts
Information security has a specific vocabulary. Knowledge of key terms enables clearer articulation of
governance requirements.
1.3.1 Vulnerability
A vulnerability is a weakness or defect in a system that allows it to be successfully attacked and compromised
– essentially a hole in an organization’s defenses. Some vulnerabilities are called exploits.
Examples include network computers with out-of-date security patches, web pages not protected with robust
passwords, users not trained to ignore email attachments from unknown senders, or corporate software
unprotected against technical commands that will give the attacker control of the system.
In many cases, non-production environments are more vulnerable to threats than production environments.
Thus, it is critical to keep production data out of non-production environments.
1.3.2 Threat
A threat is a potential offensive action that could be taken against an organization. Threats can be internal or
external. They are not always malicious. An uninformed insider can take offensive action against the organization
without even knowing it. Threats may relate to specific vulnerabilities, which then can be prioritized for
remediation. Each threat should match to a capability that either prevents the threat or limits the damage it
might cause. An occurrence of a threat is also called an attack.
Examples of threats include virus-infected email attachments being sent to the organization, processes that
overwhelm network servers and result in an inability to perform business transactions (also called denial-of-
service attacks), and exploitation of known vulnerabilities.
1.3.3 Risk
The term risk refers both to the possibility of loss and to the thing or condition that poses the potential loss. Risk
can be calculated for each possible threat using the following factors (a simple scoring sketch follows the list):
• Probability that the threat will occur and its likely frequency
• The type and amount of damage each occurrence might cause, including damage to reputation
• The effect damage will have on revenue or business operations
• The cost to fix the damage after an occurrence
• The cost to prevent the threat, including by remediation of vulnerabilities
• The goal or intent of the probable attacker
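The scoring sketch referenced above is shown here; the 1-to-5 scales, the sample threats, and the likelihood-times-damage weighting are illustrative assumptions, not a prescribed method.

    # Minimal sketch: combine the factors above into a simple prioritization score.
    # The threats, 1-5 scales, and likelihood x damage weighting are assumptions.
    threats = [
        {"name": "phishing of privileged users", "likelihood": 4, "damage": 4, "remediation_cost": 2},
        {"name": "loss of data center site", "likelihood": 1, "damage": 5, "remediation_cost": 5},
        {"name": "unpatched web server exploit", "likelihood": 5, "damage": 3, "remediation_cost": 1},
    ]

    # Easily exploited (high likelihood) and damaging threats rise to the top.
    for threat in threats:
        threat["score"] = threat["likelihood"] * threat["damage"]

    for threat in sorted(threats, key=lambda t: t["score"], reverse=True):
        print(f"{threat['score']:>2}  {threat['name']}  (remediation cost {threat['remediation_cost']})")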
Risks can be prioritized by potential severity of damage to the company, or by likelihood of occurrence, with
easily exploited vulnerabilities creating a higher likelihood of occurrence. Often a priority list combines both
metrics. Prioritization of risk must be a formal process among the stakeholders.
Risk classifications describe the sensitivity of the data and the likelihood that it might be sought after for
malicious purposes. Classifications are used to determine who (i.e., people in which roles) can access the data.
The highest security classification of any datum within a user entitlement determines the security classification
of the entire aggregation. Example classifications include:
• Critical Risk Data (CRD): Personal information aggressively sought for unauthorized use by both
internal and external parties due to its high direct financial value. Compromise of CRD would not only
harm individuals, but would result in financial harm to the company from significant penalties, costs to
retain customers and employees, as well as harm to brand and reputation.
• High Risk Data (HRD): HRD is actively sought for unauthorized use due to its potential direct
financial value. HRD provides the company with a competitive edge. If compromised, it could expose
the company to financial harm through loss of opportunity. Loss of HRD can cause mistrust leading to
the loss of business and may result in legal exposure, regulatory fines and penalties, as well as damage
to brand and reputation.
• Moderate Risk Data (MRD): Company information that has little tangible value to unauthorized
parties; however, the unauthorized use of this non-public information would likely have a negative
effect on the company.
Depending on the size of the enterprise, the overall Information Security function may be the primary
responsibility of a dedicated Information Security group, usually within the Information Technology (IT) area.
Larger enterprises often have a Chief Information Security Officer (CISO) who reports to either the CIO or the
CEO. In organizations without dedicated Information Security personnel, responsibility for data security will
fall on data managers. In all cases, data managers need to be involved in data security efforts.
In large enterprises, the information security personnel may let specific data governance and user authorization
functions be guided by the business managers. Examples include granting user authorizations and data
regulatory compliance. Dedicated Information Security personnel are often most concerned with the technical
aspects of information protection such as combating malicious software and system attacks. However, there is
ample room for collaboration during development or an installation project.
This opportunity for synergy is often missed when the two governance entities, IT and Data Management, lack
an organized process to share regulatory and security requirements. They need a standard procedure to inform
each other of data regulations, data loss threats, and data protection requirements, and to do so at the
commencement of every software development or installation project.
The first step in the NIST (National Institute of Standards and Technology) Risk Management Framework, for
example, is to categorize all enterprise information. 38 Creating an enterprise data model is essential to this goal.
Without clear visibility to the location of all sensitive information, it is impossible to create a comprehensive
and effective data protection program.
Data managers need to be actively engaged with information technology developers and cyber security
professionals so that regulated data may be identified, sensitive systems can be properly protected, and user
access controls can be designed to enforce confidentiality, integrity, and data regulatory compliance. The larger
the enterprise, the more important becomes the need for teamwork and reliance on a correct and updated
enterprise data model.
Data security requirements and procedures are categorized into four groups, known as the four A’s: Access,
Audit, Authentication, and Authorization. Recently an E, Entitlement, has been included, for effective data
regulatory compliance. Information classification, access rights, role groups, users, and passwords are the
means to implementing policy and satisfying the four A’s. Security Monitoring is also essential for proving the
success of the other processes. Both monitoring and audit can be done continuously or intermittently. Formal
audits must be done by a third party to be considered valid. The third party may be internal or external.
• Access: Enable individuals with authorization to access systems in a timely manner. Used as a verb,
access means to actively connect to an information system and be working with the data. Used as a
noun, access indicates that the person has a valid authorization to the data.
• Audit: Review security actions and user activity to ensure compliance with regulations and
conformance with company policy and standards. Information security professionals periodically
review logs and documents to validate compliance with security regulations, policies, and standards.
Results of these audits are published periodically.
• Authentication: Validate users’ access. When a user tries to log into a system, the system needs to
verify that the person is who he or she claims to be. Passwords are one way of doing this. More
stringent authentication methods include the person having a security token, answering questions, or
submitting a fingerprint. All transmissions during authentication are encrypted to prevent theft of the
authenticating information.
• Authorization: Grant individuals privileges to access specific views of data, appropriate to their role.
After the authorization decision, the Access Control System checks each time a user logs in to see if
they have a valid authorization token. Technically, this is an entry in a data field in the corporate
Active Directory indicating that the person has been authorized by somebody to access the data. It
further indicates that a responsible person made the decision to grant this authorization because the
user is entitled to it by virtue of their job or corporate status.
• Entitlement: An Entitlement is the sum total of all the data elements that are exposed to a user by a
single access authorization decision. A responsible manager must decide that a person is ‘entitled’ to
access this information before an authorization request is generated. An inventory of all the data
exposed by each entitlement is necessary in determining regulatory and confidentiality requirements
for Entitlement decisions.
1.3.6.2 Monitoring
Systems should include monitoring controls that detect unexpected events, including potential security
violations. Systems containing confidential information, such as salary or financial data, commonly implement
active, real-time monitoring that alerts the security administrator to suspicious activity or inappropriate access.
Some security systems will actively interrupt activities that do not follow specific access profiles. The account
or activity remains locked until security support personnel evaluate the details.
In contrast, passive monitoring tracks changes over time by taking snapshots of the system at regular intervals,
and comparing trends against a benchmark or other criteria. The system sends reports to the data stewards or
security administrator accountable for the data. While active monitoring is a detection mechanism, passive
monitoring is an assessment mechanism.
1.3.7 Data Integrity
In security, data integrity is the state of being whole – protected from improper alteration, deletion, or addition.
For example, in the U.S., Sarbanes-Oxley regulations are mostly concerned with protecting financial
information integrity by identifying rules for how financial information can be created and edited.
1.3.8 Encryption
Encryption is the process of translating plain text into complex codes to hide privileged information, verify
complete transmission, or verify the sender’s identity. Encrypted data cannot be read without the decryption key
or algorithm, which is usually stored separately and cannot be calculated based on other data elements in the
same data set. There are three main methods of encryption – hash, private-key (symmetric), and public-key – with
varying levels of complexity and key structure.
1.3.8.1 Hash
Hash encryption uses algorithms to convert data into a mathematical representation. The exact algorithms used
and order of application must be known in order to reverse the encryption process and reveal the original data.
Sometimes hashing is used as verification of transmission integrity or identity. Common hashing algorithms are
Message Digest 5 (MD5) and Secure Hashing Algorithm (SHA).
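A minimal sketch of using a hash digest to verify transmission integrity follows; it uses SHA-256 from Python's standard hashlib module, and the payload is illustrative.

    # Minimal sketch: use a hash digest to verify that a transmitted payload arrived
    # intact. hashlib is in the Python standard library; the payload is illustrative.
    import hashlib

    payload = b"monthly extract: 1,204 rows"
    digest_sent = hashlib.sha256(payload).hexdigest()        # computed by the sender

    received = payload                                       # what arrived at the destination
    digest_received = hashlib.sha256(received).hexdigest()

    # If even one byte changed in transit, the digests would no longer match.
    print("intact" if digest_received == digest_sent else "corrupted in transit")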
1.3.8.2 Private-key
Private-key encryption uses one key to encrypt the data. Both the sender and the recipient must have the key to
read the original data. Data can be encrypted one character at a time (as in a stream) or in blocks. Common
private-key algorithms include Data Encryption Standard (DES), Triple DES (3DES), Advanced Encryption
Standard (AES), and International Data Encryption Algorithm (IDEA). The ciphers Twofish and Serpent are also
considered secure. The use of simple DES is unwise, as it is susceptible to many easy attacks.
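As a sketch of private-key (symmetric) encryption, the example below assumes the third-party Python 'cryptography' package is installed; its Fernet construct uses AES internally, and both sender and recipient must hold the same key.

    # Minimal sketch of private-key (symmetric) encryption. Assumes the third-party
    # 'cryptography' package (pip install cryptography); Fernet uses AES internally.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()      # the shared secret, distributed out of band
    cipher = Fernet(key)

    token = cipher.encrypt(b"account 4411: balance 1,250.00")   # unreadable without the key
    print(cipher.decrypt(token))                                # only a key holder can read it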
1.3.8.3 Public-key
In public-key encryption, the sender and the receiver have different keys. The sender uses a public key that is
freely available, and the receiver uses a private key to reveal the original data. This type of encryption is useful
when many data sources must send protected information to just a few recipients, such as when submitting data
to clearinghouses. Public-key methods include Rivest-Shamir-Adleman (RSA) Key Exchange and Diffie-
Hellman Key Agreement. PGP (Pretty Good Privacy) is a freely available application of public-key encryption.
1.3.9 Obfuscation or Masking
Data can be made less available by obfuscation (making obscure or unclear) or masking, which removes,
shuffles, or otherwise changes the appearance of the data, without losing the meaning of the data or the
relationships the data has to other data sets, such as foreign key relationships to other objects or systems. The
values within the attributes may change, but the new values are still valid for those attributes. Obfuscation is
useful when displaying sensitive information on screens for reference, or creating test data sets from production
data that comply with expected application logic.
Data masking is a type of data-centric security. There are two types of data masking, persistent and dynamic.
Persistent masking can be executed in-flight or in-place.
Persistent data masking permanently and irreversibly alters the data. This type of masking is not typically used
in production environments, but rather between a production environment and development or test
environments. Persistent masking changes the data, but the data must still be viable for use in testing processes,
applications, reports, etc.
• In-flight persistent masking occurs when the data is masked or obfuscated while it is moving
between the source (typically production) and destination (typically non-production) environment. In-
flight masking is very secure when properly executed because it does not leave an intermediate file or
database with unmasked data. Another benefit is that it is re-runnable if issues are encountered part
way through the masking.
• In-place persistent masking is used when the source and destination are the same. The unmasked data
is read from the source, masked, and then used to overwrite the unmasked data. In-place masking
assumes the sensitive data is in a location where it should not exist and the risk needs to be mitigated,
or that there is an extra copy of the data in a secure location to mask before moving it to the non-secure
location. There are risks to this process. If the masking process fails mid-masking, it can be difficult to
restore the data to a useable format. This technique has a few niche uses, but in general, in-flight
masking will more securely meet project needs.
Dynamic data masking changes the appearance of the data to the end user or system without changing the
underlying data. This can be extremely useful when users need access to some sensitive production data, but not
all of it. For example, in a database the Social Security Number is stored as 123456789, but to the call center
associate who needs to verify who they are speaking to, the data shows up as ***-**-6789. Common masking
methods include the following (a brief sketch illustrating a few of them appears after this list):
• Substitution: Replace characters or whole values with those in a lookup or as a standard pattern. For
example, first names can be replaced with random values from a list.
• Shuffling: Swap data elements of the same type within a record, or swap data elements of one attribute
between rows. For example, mixing vendor names among supplier invoices such that the original
supplier is replaced with a different valid supplier on an invoice.
• Temporal variance: Move dates +/– a number of days – small enough to preserve trends, but
significant enough to render them non-identifiable.
• Value variance: Apply a random factor +/– a percent, again small enough to preserve trends, but
significant enough to be non-identifiable.
• Nulling or deleting: Remove data that should not be present in a test system.
• Randomization: Replace part or all of data elements with either random characters or a series of a
single character.
• Expression masking: Change all values to the result of an expression. For example, a simple
expression would just hard code all values in a large free form database field (that could potentially
contain confidential data) to be ‘This is a comment field’.
• Key masking: Designate that the result of the masking algorithm/process must be unique and
repeatable because it is being used to mask a database key field (or similar). This type of masking is
extremely important for testing, to maintain integrity across the organization.
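The sketch referenced before the list combines a few of these methods; the field names, substitution pool, variance range, and SSN format are illustrative assumptions.

    # Minimal sketch combining a few masking methods. Field names, the substitution
    # pool, the variance range, and the SSN format are illustrative assumptions.
    import random

    FIRST_NAME_POOL = ["Alex", "Sam", "Jordan", "Casey"]

    def mask_ssn_dynamic(ssn):
        """Dynamic-style masking: change only what is displayed, not the stored value."""
        return "***-**-" + ssn[-4:]

    def mask_record(record):
        """Persistent-style masking: substitution, value variance, and nulling."""
        return {
            "first_name": random.choice(FIRST_NAME_POOL),                  # substitution
            "salary": round(record["salary"] * random.uniform(0.9, 1.1)),  # value variance
            "notes": None,                                                 # nulling
            "ssn": mask_ssn_dynamic(record["ssn"]),
        }

    source = {"first_name": "Maria", "salary": 81250, "notes": "called 555-0100", "ssn": "123456789"}
    print(mask_record(source))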
1.3.10 Network Security Terms
Data security includes both data-at-rest and data-in-motion. Data-in-motion requires a network in order to move
between systems. It is no longer sufficient for an organization to wholly trust in the firewall to protect it from
malicious software, poisoned email, or social engineering attacks. Each machine on the network needs to have a
line of defense, and web servers need sophisticated protection as they are continually exposed to the entire
world on the Internet.
1.3.10.1 Backdoor
A backdoor refers to an overlooked or hidden entry into a computer system or application. It allows
unauthorized users to bypass the password requirement to gain access. Backdoors are often created by
developers for maintenance purposes; others are put in place by the creators of commercial software packages.
A default password left unchanged when installing any software system or web page package is also a backdoor
and will undoubtedly be known to hackers. Any backdoor is a security risk.
1.3.10.2 Bot or Zombie
A bot (short for robot) or Zombie is a workstation that has been taken over by a malicious hacker using a
Trojan, a Virus, a Phish, or a download of an infected file. Remotely controlled, bots are used to perform
malicious tasks, such as sending large amounts of spam, attacking legitimate businesses with network-clogging
Internet packets, performing illegal money transfers, and hosting fraudulent websites. A Bot-Net is a network of
robot computers (infected machines). 39
It was estimated in 2012 that globally 17% of all computers (approximately 187 million of 1.1 billion
computers) did not have anti-virus protection. 40 In the USA that year, 19.32% of users surfed unprotected. A
large percentage of them are Zombies. Estimates are that two billion computers are in operation as of 2016. 41
Considering that desktop and laptop computers are being eclipsed in number by smart phones, tablets,
wearables, and other devices, many of which are used for business transactions, the risks for data exposure will
only increase. 42
1.3.10.3 Cookie
A cookie is a small data file that a website installs on a computer’s hard drive, to identify returning visitors and
profile their preferences. Cookies are used for Internet commerce. However, they are also controversial, as they
raise questions of privacy because spyware sometimes uses them.
1.3.10.4 Firewall
A firewall is software and/or hardware that filters network traffic to protect an individual computer or an entire
network from unauthorized attempts to access or attack the system. A firewall may scan both incoming and
outgoing communications for restricted or regulated information and prevent it from passing without permission
(Data Loss Prevention). Some firewalls also restrict access to specific external websites.
1.3.10.5 Perimeter
A perimeter is the boundary between an organization’s environments and exterior systems. Typically, a firewall
will be in place between all internal and external environments.
39 http://bit.ly/1FrKWR8, http://bit.ly/2rQQuWJ.
42 Cisco Corporation estimated that “By 2018, there will be 8.2 billion handheld or personal mobile-ready
devices and 2 billion machine-to-machine connections (e.g., GPS systems in cars, asset tracking systems in
shipping and manufacturing sectors, or medical applications making patient records and health status more
readily available.)” http://bit.ly/Msevdw (future numbers of computers and devices).
1.3.10.6 DMZ
Short for de-militarized zone, a DMZ is an area on the edge or perimeter of an organization, with a firewall
between it and the organization. A DMZ environment will always have a firewall between it and the internet
(see Figure 64). DMZ environments are used to pass or temporarily store data moving between organizations.
[Figure 64: DMZ, with a firewall between it and the internet and another between it and the organization’s
internal systems.]
1.3.10.7 Super User Account
A Super User Account is an account that has administrator or root access to a system to be used only in an
emergency. Credentials for these accounts are highly secured, only released in an emergency with appropriate
documentation and approvals, and expire within a short time. For example, the staff assigned to production
control might require access authorizations to multiple large systems, but these authorizations should be tightly
controlled by time, user ID, location, or other requirement to prevent abuse.
1.3.10.8 Key Logger
Key Loggers are a type of attack software that records all the keystrokes that a person types into their keyboard,
then sends them elsewhere on the Internet. Thus, every password, memo, formula, document, and web address
is captured. Often an infected website or malicious software download will install a key logger. Some types of
document downloads will allow this to happen as well.
1.3.10.9 Penetration Testing
Setting up a secure network and website is incomplete without testing it to make certain that it truly is secure. In
Penetration Testing (sometimes called a ‘pen test’), an ethical hacker, either from the organization itself or hired
from an external security firm, attempts to break into the system from outside, as would a malicious hacker, in
order to identify system vulnerabilities. Vulnerabilities found through penetration tests can be addressed before
the application is released.
Some people are threatened by ethical hacking audits because they believe these audits will result only in finger
pointing. The reality is that in the fast-moving conflict between business security and criminal hacking, all
purchased and internally-developed software contains potential vulnerabilities that were not known at the time
of their creation. Thus, all software implementations must be challenged periodically. Finding vulnerabilities is
an ongoing procedure and no blame should be applied – only security patches.
As proof of the need for continual software vulnerability mitigation, observe a constant stream of security
patches arriving from software vendors. This continual security patch update process is a sign of due diligence
and professional customer support from these vendors. Many of these patches are the result of ethical hacking
performed on behalf of the vendors.
VPN connections use the unsecured internet to create a secure path or ‘tunnel’ into an organization’s
environment. The tunnel is highly encrypted. It allows communication between users and the internal network
by using multiple authentication elements to connect with a firewall on the perimeter of an organization’s
environment. Then it strongly encrypts all transmitted data.
Data security involves not just preventing inappropriate access, but also enabling appropriate access to data.
Access to sensitive data should be controlled by granting permissions (opt-in). Without permission, a user
should not be allowed to see data or take action within the system. ‘Least Privilege’ is an important security
principle. A user, process, or program should be allowed to access only the information allowed by its
legitimate purpose.
Facility security is the first line of defense against bad actors. Facilities should have, at a minimum, a locked
data center with access restricted to authorized employees. Social threats to security (See Section 1.3.15)
recognize humans as the weakest point in facility security. Ensure that employees have the tools and training to
protect data in facilities.
Mobile devices, including laptops, tablets, and smartphones, are inherently insecure, as they can be lost, stolen,
and physically and electronically attacked by criminal hackers. They often contain corporate emails,
spreadsheets, addresses, and documents that, if exposed, can be damaging to the organization, its employees, or
its customers.
With the explosion of portable devices and media, a plan to manage the security of these devices (both
company-owned and personal) must be part of any company’s overall strategic security architecture. This plan
should include both software and hardware tools.
Each user is assigned credentials to use when obtaining access to a system. Most credentials are a combination
of a User ID and a Password. There is a spectrum of how credentials are used across systems within an
environment, depending on the sensitivity of the system’s data, and the system’s capabilities to link to
credential repositories.
Traditionally, users have had different accounts and passwords for each individual resource, platform,
application system, or workstation. This approach requires users to manage several passwords and accounts.
Organizations with enterprise user directories may have a synchronization mechanism established between the
heterogeneous resources to ease user password management. In such cases, the user is required to enter the
password only once, usually when logging into the workstation, after which all authentication and authorization
executes through a reference to the enterprise user directory. An identity management system implementing this
capability is known as ‘single-sign-on’, and is optimal from a user perspective.
User IDs should be unique within the email domain. Most companies use some first name or initial, and full or
partial last name as the email or network ID, with a number to differentiate collisions. Names are generally
known and are more useful for business contact reasons.
Email or network IDs containing system employee ID numbers are discouraged, as that information is not
generally available outside the organization, and provides data that should be secure within the systems.
Passwords are the first line of defense in protecting access to data. Every user account should be required to
have a password set by the user (account owner) with a sufficient level of password complexity defined in the
security standards, commonly referred to as ‘strong’ passwords.
When creating a new user account, the generated temporary password should be set to expire immediately after
the first use and the user must choose a new password for subsequent access. Do not permit blank passwords.
Most security experts recommend requiring users to change their passwords every 45 to 180 days, depending on
the nature of the system, the type of data, and the sensitivity of the enterprise. However, changing passwords
too frequently introduces risk, since it often causes employees to write down their new passwords.
Some systems require additional identification procedures. These can include a return call to the user’s mobile
device that contains a code, the use of a hardware item that must be used for login, or a biometric factor such as
fingerprint, facial recognition, or retinal scan. Two-factor identification makes it much harder to break into an
account or to log into a user’s device. All users with authorization entitlement to highly sensitive information
should use two-factor identification to log into the network.
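The second factor is often a time-based one-time password (TOTP) generated on the user's device. The following Python sketch illustrates the mechanism, assuming the third-party pyotp package is available; secret provisioning, storage, and lockout handling are omitted.

import pyotp

secret = pyotp.random_base32()   # per-user secret, stored securely server-side
totp = pyotp.TOTP(secret)

code = totp.now()                # what the user's authenticator device would display
print("Second factor accepted:", totp.verify(code))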
Users must be trained to avoid sending their personal information or any restricted or confidential company
information over email or direct communication applications. These insecure methods of communication can be
read or intercepted by outside sources. Once a user sends an email, he or she no longer controls the information
in it. It can be forwarded to other people without the sender’s knowledge or consent.
Social media also applies here. Blogs, portals, wikis, forums, and other Internet or Intranet social media should
be considered insecure and should not contain confidential or restricted information.
Two concepts drive security restrictions: the level of confidentiality of data and regulation related to data.
• Confidentiality level: Confidential means secret or private. Organizations determine which types of
data should not be known outside the organization, or even within certain parts of the organization.
Confidential information is shared only on a ‘need-to-know’ basis. Levels of confidentiality depend on
who needs to know certain kinds of information.
• Regulation: Regulatory categories are assigned based on external rules, such as laws, treaties, customs
agreements, and industry regulations. Regulatory information is shared on an ‘allowed-to-know’ basis.
The ways in which data can be shared are governed by the details of the regulation.
The main difference between confidential and regulatory restrictions is where the restriction originates:
confidentiality restrictions originate internally, while regulatory restrictions are externally defined.
Another difference is that any data set, such as a document or a database view, can only have one
confidentiality level. This level is established based on the most sensitive (and highest classified) item in the
data set. Regulatory categorizations, however, are additive. A single data set may have data restricted based on
multiple regulatory categories. To assure regulatory compliance, enforce all actions required for each category,
along with the confidentiality requirements.
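The aggregation rules described above can be expressed in a few lines of code: a data set takes the single highest confidentiality level among its items, while its regulatory categories are the union of all categories that apply. The Python sketch below is illustrative only; the level names follow the classification examples used elsewhere in this chapter.

CONFIDENTIALITY_RANK = {
    "For general audiences": 0,
    "Internal use only": 1,
    "Confidential": 2,
    "Restricted confidential": 3,
    "Registered confidential": 4,
}

def classify_data_set(items):
    """items: list of (confidentiality_level, set_of_regulatory_categories)."""
    # Confidentiality: the single most sensitive level wins.
    level = max((lvl for lvl, _ in items), key=CONFIDENTIALITY_RANK.get)
    # Regulatory categories: additive, so take the union.
    categories = set().union(*(cats for _, cats in items))
    return level, categories

items = [
    ("Internal use only", {"PII"}),
    ("Confidential", {"PCI"}),
]
print(classify_data_set(items))  # ('Confidential', {'PII', 'PCI'})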
When applied to the user entitlement (the aggregation of the particular data elements to which a user
authorization provides access), all protection policies must be followed, regardless of whether they originated
internally or externally.
Confidentiality requirements range from high (very few people have access, for example, to data about
employee compensation) to low (everyone has access to product catalogs). A typical classification schema
might include two or more of the five confidentiality classification levels listed here:
• For general audiences: Information available to anyone, including the public.
• Internal use only: Information limited to employees or members, but with minimal risk if shared. For
internal use only; may be shown or discussed, but not copied, outside the organization.
• Confidential: Information that cannot be shared outside the organization without a properly executed
non-disclosure agreement or similar in place. Client confidential information may not be shared with
other clients.
• Restricted confidential: Information limited to individuals performing certain roles with the ‘need to
know.’ Restricted confidential may require individuals to qualify through clearance.
• Registered confidential: Information so confidential that anyone accessing the information must sign
a legal agreement to access the data and assume responsibility for its secrecy.
The confidentiality level does not imply any details about restrictions due to regulatory requirements. For
example, it does not inform the data manager that data may not be exposed outside its country of origin, or that
some employees are prohibited from seeing certain information based on regulations like HIPAA.
Certain types of information are regulated by external laws, industry standards, or contracts that influence how
data can be used, as well as who can access it and for what purposes. As there are many overlapping
regulations, it is easier to collect them by subject area into a few regulatory categories or families to better
inform data managers of regulatory requirements.
Each enterprise, of course, must develop regulatory categories that meet their own compliance needs. Further, it
is important that this process and the categories be as simple as possible to allow for an actionable protection
capability. When category protective actions are similar, they should be combined into a regulation ‘family’.
Each regulatory category should include auditable protective actions. This is not an organizational tool but an
enforcement method.
Since different industries are affected by different types of regulations, the organization needs to develop
regulatory groupings that meet their operational needs. For example, companies that do no business outside of
their native land may not need to incorporate regulations pertaining to exports.
However, since all nations have some mixture of personal data privacy laws, and customers are likely to be
from anywhere in the world, it may be wise and easier to gather all customer data privacy regulations into a
single regulatory family, and comply with the requirements for all the nations. Doing so ensures compliance
everywhere, and offers a single standard to enforce.
An example of the possible detail of regulatory compliance is a law that prohibits a single type of data
element in the database from traveling outside the physical borders of the originating nation. Several regulations, both
domestic and international, have this as a requirement.
An optimal number of regulatory action categories is nine or fewer. Sample regulatory categories follow.
Certain government regulations specify data elements by name, and demand that they be protected in specific
ways. Each element does not need a different category; instead, use a single family of actions to protect all
specifically targeted data fields. Some PCI data may be included in these categories even though it is a
contractual obligation and not a governmental regulation. PCI contractual obligations are mostly uniform
around the globe.
• Personal Identification Information (PII): Also known as Personally Private Information (PPI),
includes any information that can personally identify the individual (individually or as a set), such as
name, address, phone numbers, schedule, government ID number, account numbers, age, race,
religion, ethnicity, birthday, family members’ names or friends’ names, employment information (HR
data), and in many cases, remuneration. Highly similar protective actions will satisfy the EU Privacy
Directives, Canadian Privacy law (PIPEDA), the PIP Act 2003 in Japan, PCI standards, US FTC
requirements, GLBA, and most Security Breach of Information Acts.
• Financially Sensitive Data: All financial information, including what may be termed ‘shareholder’ or
‘insider’ data, such as current financial information that has not yet been reported publicly. It also
includes any future business plans not made public, planned mergers, acquisitions, or spin-offs, non-
public reports of significant company problems, unexpected changes in senior management,
comprehensive sales, orders, and billing data. All of these can be captured within this one category,
and protected by the same policies. In the US, this is covered under Insider Trading Laws, SOX
(Sarbanes-Oxley Act), or GLBA (Gramm-Leach-Bliley/Financial Services Modernization Act). Note:
the Sarbanes-Oxley Act restricts and manages who can change financial data, thus assuring data integrity,
while Insider Trading laws affect all those who can see financial data.
• Medically Sensitive Data/Personal Health Information (PHI): All information regarding a person’s
health or medical treatments. In the US, this is covered by HIPAA (Health Insurance Portability and
Accountability Act). Other nations also have restrictive laws regarding protection of personal and
medical information. As these are evolving, ensure Corporate Counsel is aware of the need to follow
legal requirements in a nation in which the organization does business or has customers.
• Educational Records: All information regarding a person’s education. In the US, this is covered by
FERPA (Family Educational Rights and Privacy Act).
Some industries have specific standards for how to record, retain, and encrypt information. Some also disallow
deletion, editing, or distributing to prohibited locations. For example, regulations for pharmaceuticals, other
dangerous substances, food, cosmetics, and advanced technology prevent the transmission or storage of certain
information outside the country of origin, or require data to be encrypted during transport.
• Payment Card Industry Data Security Standard (PCI-DSS): PCI-DSS is the most widely known
industry data security standard. It addresses any information that can identify an individual with an
account at a financial organization, such as name, credit card number (any number on the card), bank
account number, or account expiration date. Most of these data fields are regulated by laws and
policies. Any data with this classification in its Metadata definition should automatically be carefully
reviewed by data stewards when included in any database, application, report, dashboard, or user view.
• Competitive advantage or trade secrets: Companies that use proprietary methods, mixtures,
formulas, sources, designs, tools, recipes, or operational techniques to achieve a competitive advantage
may be protected by industry regulations and/or intellectual property laws.
• Contractual restrictions: In its contracts with vendors and partners, an organization may stipulate
how specific pieces of information may or may not be used, and which information can and cannot be
shared. For example, environmental records, hazardous materials reports, batch numbers, cooking
times, points of origin, customer passwords, account numbers, and certain national identity numbers of
non-US nationals. Specific technical companies may need to include certain restricted products or
ingredients in this category.
The first step in identifying risk is identifying where sensitive data is stored, and what protections are required
for that data. It is also necessary to identify risks inherent in systems. System security risks include elements
that can compromise a network or database. These threats allow legitimate employees to misuse information,
either intentionally or accidentally, and enable malicious hacker success.
In granting access to data, the principle of least privilege should be applied. A user, process, or program should
be allowed to access only the information allowed by its legitimate purpose. The risk is that users with
privileges that exceed the requirements of their job function may abuse these privileges for malicious purpose or
accidentally. Users may be granted more access than they should have (excessive privilege) simply because it is
challenging to manage user entitlements. The DBA may not have the time or Metadata to define and update
granular access privilege control mechanisms for each user entitlement. As a result, many users receive generic
default access privileges that far exceed specific job requirements. This lack of oversight to user entitlements is
one reason why many data regulations specify data management security.
The solution to excessive privileges is query-level access control, a mechanism that restricts database privileges
to minimum-required SQL operations and data. The granularity of data access control must extend beyond the
table to specific rows and columns within a table. Query-level access control is useful for detecting excessive
privilege abuse by malicious employees.
Most database software implementations integrate some level of query-level access control (triggers, row-level
security, table security, views), but the manual nature of these ‘built-in’ features makes them impractical for all
but the most limited deployments. The process of manually defining a query-level access control policy for all
users across database rows, columns, and operations is time consuming. To make matters worse, as user roles
change over time, query policies must be updated to reflect those new roles. Most database administrators
would have a hard time defining a useful query policy for a handful of users at a single point in time, much less
hundreds of users over time. As a result, automated tools are usually necessary to make query-level access
control functional in most organizations.
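As an illustration of what query-level access control means in practice, the following Python sketch limits a hypothetical ‘field_agent’ role to specific columns and a row-level filter, rejecting any request outside that entitlement. The policy structure, table, and column names are assumptions; production implementations generally rely on automated tools or built-in database features rather than hand-coded policies.

QUERY_POLICY = {
    "field_agent": {
        "table": "customer",
        "columns": {"customer_id", "name", "phone"},
        "row_filter": "assigned_agent_id = :agent_id",  # only the agent's own customers
    },
}

def build_query(role: str, requested_columns: set) -> str:
    policy = QUERY_POLICY[role]
    if not requested_columns <= policy["columns"]:
        raise PermissionError("Requested columns exceed the role's entitlement")
    cols = ", ".join(sorted(requested_columns))
    return f"SELECT {cols} FROM {policy['table']} WHERE {policy['row_filter']}"

print(build_query("field_agent", {"customer_id", "name"}))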
Users may abuse legitimate database privileges for unauthorized purposes. Consider a criminally inclined
healthcare worker with privileges to view individual patient records via a custom Web application.
The structure of corporate Web applications normally limits users to viewing an individual patient’s healthcare
history, where multiple records cannot be viewed simultaneously and electronic copies are not allowed.
However, the worker may circumvent these limitations by connecting to the database using an alternative
system such as MS-Excel. Using MS-Excel and his or her legitimate login credentials, the worker might retrieve and
save all patient records.
There are two risks to consider: intentional and unintentional abuse. Intentional abuse occurs when an employee
deliberately misuses organizational data. For example, an errant worker might trade patient records for money or
cause intentional damage, such as releasing (or threatening to release) sensitive information publicly.
Unintentional abuse is a more common risk: The diligent employee who retrieves and stores large amounts of
patient information to a work machine for what he or she considers legitimate work purposes. Once the data
exists on an endpoint machine, it becomes vulnerable to laptop theft and loss.
The partial solution to the abuse of legitimate privilege is database access control that not only applies to
specific queries, but also enforces policies for endpoint machines based on time of day, location, and amount of
information downloaded. Such controls limit any user's ability to access all records containing sensitive
information unless that access is specifically demanded by their job and approved by their supervisor. For
example, while it may be necessary for a field agent to access their customers’ personal records, they might not
be allowed to download the entire customer database to their laptop just to ‘save time’.
Attackers may take advantage of database platform software vulnerabilities to convert access privileges from
those of an ordinary user to those of an administrator. Vulnerabilities may occur in stored procedures, built-in
functions, protocol implementations, and even SQL statements. For example, a software developer at a financial
institution might take advantage of a vulnerable function to gain the database administrative privilege. With
administrative privilege, the offending developer may turn off audit mechanisms, create bogus accounts,
transfer funds, or close accounts.
Prevent privilege elevation exploits with a combination of traditional intrusion prevention systems (IPS) and
query-level access control intrusion prevention. These systems inspect database traffic to identify patterns that
correspond to known vulnerabilities. For example, if a given function is vulnerable to an attack, an IPS may
either block all access to the procedure, or block those procedures allowing embedded attacks.
Combine IPS with alternative attack indicators, such as query access control, to improve accuracy in identifying
attacks. IPS can detect whether a database request accesses a vulnerable function while query access control
detects whether the request matches normal user behavior. If a single request indicates both access to a
vulnerable function and unusual behavior, then an attack is almost certainly occurring.
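A toy example of combining the two signals: the sketch below flags a request only when it both touches a function on a (hypothetical) vulnerable-function list and deviates from the user's normal behavior, here simplified to the number of rows requested.

VULNERABLE_FUNCTIONS = {"xp_cmdshell", "legacy_backup_proc"}

def is_probable_attack(request_functions: set, rows_requested: int, typical_max_rows: int) -> bool:
    touches_vulnerable = bool(request_functions & VULNERABLE_FUNCTIONS)  # IPS signal
    unusual_behavior = rows_requested > typical_max_rows                 # query access control signal
    return touches_vulnerable and unusual_behavior

print(is_probable_attack({"legacy_backup_proc"}, rows_requested=50000, typical_max_rows=200))  # True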
Use of service accounts (batch IDs) and shared accounts (generic IDs) increases the risk of data security
breaches and complicates the ability to trace the breach to its source. Some organizations further increase their
risk when they configure monitoring systems to ignore any alerts related to these accounts. Information security
managers should consider adopting tools to manage service accounts securely.
Service accounts are convenient because they can tailor enhanced access for the processes that use them.
However, if they are used for other purposes, they are untraceable to a particular user or administrator. Unless
they have access to decryption keys, service accounts do not threaten encrypted data. This may be especially
important for data held on servers storing legal documents, medical information, trade secrets, or confidential
executive planning.
Restrict the use of service accounts to specific tasks or commands on specific systems, and require
documentation and approval for distributing the credentials. Consider assigning a new password every time
distribution occurs, using processes such as those in place for Super User accounts.
Shared accounts are created when an application cannot handle the number of user accounts needed or when
adding specific users requires a large effort or incurs additional licensing costs. For shared accounts, credentials
are given to multiple users, and the password is rarely changed due to the effort to notify all users. Because they
provide essentially ungoverned access, any use of shared accounts should be carefully evaluated. They should
never be used by default.
Protecting database assets requires a combination of regular software updates (patches) and the implementation
of a dedicated Intrusion Prevention System (IPS). An IPS is usually, but not always, implemented alongside an
Intrusion Detection System (IDS). The goal is to prevent
the vast majority of network intrusion attempts and to respond quickly to any intrusion that has succeeded in
working its way past a prevention system. The most primitive form of intrusion protection is a firewall, but with
mobile users, web access, and mobile computing equipment a part of most enterprise environments, a simple
firewall, while still necessary, is no longer sufficient.
Vendor-provided updates reduce vulnerabilities found in database platforms over time. Unfortunately, software
updates are often implemented by enterprises according to periodic maintenance cycles rather than as soon as
possible after the patches are made available. In between update cycles, databases are not protected. In addition,
compatibility problems sometimes prevent software updates altogether. To address these problems, implement
IPS.
In a SQL injection attack, a perpetrator inserts (or ‘injects’) unauthorized database statements into a vulnerable
SQL data channel, such as stored procedures and Web application input spaces. These injected SQL statements
are passed to the database, where they are often executed as legitimate commands. Using SQL injection,
attackers may gain unrestricted access to an entire database.
SQL injections are also used to attack the DBMS, by passing SQL commands as a parameter of a function or
stored procedure. For example, a component that provides backup functionality usually runs at a high privilege;
calling a SQL injection vulnerable function in that specific component could allow a regular user to escalate
their privileges, become a DBA and take over the database.
Mitigate this risk by validating and sanitizing all inputs before passing them to the server, and by using
parameterized queries rather than concatenating user input into SQL statements.
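A minimal illustration of this mitigation, using Python's built-in sqlite3 module: user input is bound as a parameter rather than concatenated into the SQL text, so injected fragments are treated as data instead of commands.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO patient (name) VALUES ('Alice')")

user_input = "Alice' OR '1'='1"   # a typical injection attempt

# Unsafe alternative (vulnerable): f"SELECT * FROM patient WHERE name = '{user_input}'"
rows = conn.execute("SELECT * FROM patient WHERE name = ?", (user_input,)).fetchall()
print(rows)   # [] -- the injected text matched nothing because it was treated as a literal value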
It is a long-standing practice in the software industry to create default accounts during the installation of
software packages. Some are used in the installation itself. Others provide users with a means to test the
software out of the box.
Default passwords are part of many demo packages. Installation of third party software creates others. For
example, a CRM package might create several accounts in the backend database for installation, testing, administration, and
for regular users. SAP creates a number of default database users at the time of installation. The DBMS industry
also engages in this practice.
Attackers are constantly looking for an easy way to steal sensitive data. Mitigate threats to sensitive data by
creating the required username and password combinations, and ensuring that no default passwords are left in
place in the DBMS. Eliminating default passwords is an important security step after every implementation.
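As a simple illustration, a post-installation review might check newly created accounts against a list of well-known default credentials. The account and password pairs below are generic examples, not a vendor-specific list.

KNOWN_DEFAULTS = {("admin", "admin"), ("sa", ""), ("test", "test")}

def find_default_accounts(accounts):
    """accounts: iterable of (username, password) pairs collected during a security review."""
    return [user for user, pwd in accounts if (user, pwd) in KNOWN_DEFAULTS]

print(find_default_accounts([("admin", "admin"), ("jsmith", "S7rong!Pass")]))  # ['admin']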
Backups are made to reduce the risks associated with data loss, but backups also represent a security risk. The
news offers many stories about lost backup media. Encrypt all database backups. Encryption prevents loss of a
backup either in tangible media or in electronic transit. Securely manage backup decryption keys. Keys must be
available off-site to be useful for disaster recovery.
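The sketch below encrypts a backup file before it leaves the data center, assuming the third-party cryptography package and a placeholder file name. Key management (off-site escrow, rotation, and release approval) is the harder problem and is not shown.

from cryptography.fernet import Fernet

key = Fernet.generate_key()          # store securely; must be available off-site for recovery
cipher = Fernet(key)

with open("nightly_backup.dump", "rb") as f:          # placeholder backup file
    encrypted = cipher.encrypt(f.read())
with open("nightly_backup.dump.enc", "wb") as f:
    f.write(encrypted)

# Restore: cipher.decrypt(encrypted) returns the original bytes.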
The term hacking came from an era when finding clever ways to perform some computer task was the goal. A
hacker is a person who finds unknown operations and pathways within complex computer systems. Hackers can
be good or bad.
An ethical or ‘White Hat’ hacker works to improve a system. (‘White Hat’ refers to American western movies
in which the hero always wore a white hat.) Without ethical hackers, system vulnerabilities that could be
corrected would be discovered only by accident. The systematic patching (updating) of computers to increase
security results from ethical hacking.
A malicious hacker is someone who intentionally breaches or ‘hacks’ into a computer system to steal
confidential information or to cause damage. Malicious Hackers usually look for financial or personal
information in order to steal money or identities. They try to guess simple passwords, and seek to find
undocumented weaknesses and backdoors in existing systems. They are sometimes called ‘Black Hat hackers’.
(In those same American westerns where the heroes wore white hats, the villains wore black hats.)
Social threats to security often involve direct communications (whether in person, by phone, or over the
internet) designed to trick people who have access to protected data into providing that information (or access to
the information) to people who will use it for criminal or malicious purposes.
Social engineering refers to how malicious hackers try to trick people into giving them either information or
access. Hackers use any information they obtain to convince other employees that they have legitimate requests.
Sometimes hackers will contact several people in sequence, collecting information at each step useful for
gaining the trust of the next higher employee.
Phishing refers to a phone call, instant message, or email meant to lure recipients into giving out valuable or
private information without realizing they are doing so. Often these calls or messages appear to be from a
legitimate source. For example, sometimes they are framed as sales pitches for discounts or lowered interest
rates. But they ask for personal information such as names, passwords, Social Security numbers, or credit card
information. To reduce suspicion, these messages often request the recipient to ‘update’ or ‘confirm’
information. Phishing instant messages and emails might also direct users to phony websites to trick them into
providing personal information. Of special danger are fake emails specifically targeted to senior executives by
name. This is called ‘Spear-phishing for whales’. In addition to phoning and spoofing, hackers have been
known to physically go to target sites and speak directly with employees, sometimes using disguises or posing
as vendors, in order to gain access to sensitive information. 43
1.3.16 Malware
Malware refers to any malicious software created to damage, change, or improperly access a computer or
network. Computer viruses, worms, spyware, key loggers, and adware are all examples of malware. Any
software installed without authorization can be considered malware, if for no other reason than that it takes up
disk space and possibly processing cycles that the system owner did not authorize. Malware can take many
forms, depending on its purpose (replication, destruction, information or processing theft, or behavior
monitoring).
43 The FBI report on Russian Hacking during the 2016 US Presidential Election outlines how these techniques were used in
that instance. http://bit.ly/2iKStXO.
1.3.16.1 Adware
Adware is a form of spyware that enters a computer from an Internet download. Adware monitors a computer’s
use, such as what websites are visited. Adware also may insert objects and tool bars in the user’s browser.
Adware is not illegal, but is used to develop complete profiles of the user’s browsing and buying habits to sell
to other marketing firms. It can also be easily leveraged by malicious software for identity theft.
1.3.16.2 Spyware
Spyware refers to any software program that slips into a computer without consent, in order to track online
activity. These programs tend to piggyback on other software programs. When a user downloads and installs
free software from a site on the Internet, spyware can also install, usually without the user’s knowledge.
Different forms of spyware track different types of activity. Some programs monitor what websites are visited,
while others record the user’s keystrokes to steal personal information, such as credit card numbers, bank
account information, and passwords.
Many legitimate websites, including search engines, install tracking spyware, which is a form of Adware.
1.3.16.3 Trojan Horse
The Trojan horse was a large wooden ‘gift statue’ of a horse that the Greeks gave to the people of Troy, who
quickly brought it inside the city walls. Unfortunately for them, it concealed Greek soldiers, who, once inside
Troy, slipped out and attacked the city.
In computer security terms, a Trojan horse refers to a malicious program that enters a computer system
disguised or embedded within legitimate software. Once installed, a Trojan horse will delete files, access
personal information, install malware, reconfigure the computer, install a key logger, or even allow hackers to
use the computer as a weapon (Bot or Zombie) against other computers on a network.
1.3.16.4 Virus
A virus is a program that attaches itself to an executable file or vulnerable application and delivers a payload
that ranges from annoying to extremely destructive. A file virus executes when an infected file opens. A virus
always needs to accompany another program. Opening downloaded and infected programs can release a virus.
1.3.16.5 Worm
A computer worm is a program built to reproduce and spread across a network by itself. A worm-infected
computer will send out a continuous stream of infected messages. A worm may perform several different
malicious activities, although the main function is to harm networks by consuming large amounts of bandwidth,
potentially shutting the network down.
IM allows users to relay messages to each other in real-time. IM is also becoming a new threat to network
security. Because many IM systems have been slow to add security features, malicious hackers have found IM a
useful means of spreading viruses, spyware, phishing scams, and a wide variety of worms. Typically, these
threats infiltrate systems through contaminated attachments and messages.
Social networking sites, such as Facebook, Twitter, Vimeo, Google+, LinkedIn, Xanga, Instagram, Pinterest, or
MySpace, where users build online profiles and share personal information, opinions, photographs, blog entries,
and other information, have become targets of online predators, spammers, and identity thieves.
In addition to representing a threat from malicious people, these sites pose risks from employees who may post
information sensitive to the enterprise or ‘insider’ knowledge that might affect the price of a public
organization’s stock. Inform users of the dangers and the reality that whatever they post will become permanent
on the Internet. Even if they later remove the data, many will have made copies. Some companies block these
sites at their firewall.
1.3.16.6.3 Spam
Spam refers to unsolicited, commercial email messages sent out in bulk, usually to tens of millions of users in
hopes that a few may reply. A return rate of 1% can net millions of dollars. Most email routing systems have
traps to filter out known spam message patterns to reduce internal traffic.
Responding to a spam message will confirm to the sender that they have reached a legitimate email address and
will increase future spam, because lists of valid email addresses can be sold to other spammers.
Spam messages may also be Internet hoaxes or include malware attachments, with attachment names and
extensions, message text, and images giving the appearance of a legitimate communication. One way to detect
a spam email is to hover the pointer over any hyperlinks; the actual link frequently has nothing in common with
the company named in the message text. Another indicator is the absence of a way to unsubscribe; in the US,
advertising emails are required to include an unsubscribe link to stop further emails.
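The ‘hover over the hyperlink’ check can be approximated in code: compare the domain displayed in the link text with the domain the link actually targets. The Python sketch below is a simplified illustration; real mail filters apply many additional heuristics.

import re
from urllib.parse import urlparse

def mismatched_links(html: str):
    suspicious = []
    for href, text in re.findall(r'<a\s+href="([^"]+)"[^>]*>([^<]+)</a>', html, re.I):
        actual = urlparse(href).netloc.lower()      # domain the link really points to
        if actual and actual not in text.lower():   # does it match the text shown to the reader?
            suspicious.append((text.strip(), actual))
    return suspicious

message = '<a href="http://phish.example.net/login">www.yourbank.com</a>'
print(mismatched_links(message))   # [('www.yourbank.com', 'phish.example.net')]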
2. Activities
There is no one prescribed way of implementing data security to meet all necessary privacy and confidentiality
requirements. Regulations focus on the ends of security, not the means for achieving it. Organizations should
design their own security controls, demonstrate that the controls meet or exceed the requirements of the laws or
regulations, document the implementation of those controls, and monitor and measure them over time. As in
other Knowledge Areas, the activities include identifying requirements, assessing the current environment for
gaps or risks, implementing security tools and processes, and auditing data security measures to ensure they are
effective.
It is important to distinguish between business requirements, external regulatory restrictions, and the rules
imposed by application software products. While application systems serve as vehicles to enforce business rules
and procedures, it is common for these systems to have their own data security requirements over and above
those required for business processes. These requirements are becoming more common with packaged and off-
the-shelf systems. It is necessary, however, to see that they support organizational data security standards as
well.
Implementing data security within an enterprise begins with a thorough understanding of business requirements.
The business needs of an enterprise, its mission, strategy and size, and the industry to which it belongs define
the degree of rigidity required for data security. For example, financial and securities enterprises in the United
States are highly regulated and required to maintain stringent data security standards. In contrast, a small-scale
retail enterprise may choose not to have the same kind of data security function that a large retailer has, even
though both of them have similar core business activities.
Analyze business rules and processes to identify security touch points. Every event in the business workflow
may have its own security requirements. Data-to-process and data-to-role relationship matrices are useful tools
to map these needs and guide definition of data security role-groups, parameters, and permissions. Plan to
address short-term and long-term goals to achieve a balanced and effective data security function.
Today’s fast changing and global environment requires organizations to comply with a growing set of laws and
regulations. The ethical and legal issues facing organizations in the Information Age are leading governments to
establish new laws and standards. These have all imposed strict security controls on information management.
(See Chapter 2.)
Create a central inventory of all relevant data regulations and the data subject area affected by each regulation.
Add links to the corresponding security policies developed for compliance to these regulations (see Table 13),
and the controls implemented. Regulations, policies, required actions, and data affected will change over time,
so this inventory should be in a format that is simple to manage and maintain.
• US
o Sarbanes-Oxley Act of 2002
o Health Information Technology for Economic and Clinical Health (HITECH) Act, enacted as
part of the American Recovery and Reinvestment Act of 2009
o Health Insurance Portability and Accountability Act of 1996 (HIPAA) Security Regulations
o Gramm-Leach-Bliley I and II
o SEC laws and Corporate Information Security Accountability Act
o Homeland Security Act and USA Patriot Act
o Federal Information Security Management Act (FISMA)
o California: SB 1386, California Security Breach Information Act; AB 1901, Theft of electronic
files or databases
• EU
o Data Protection Directive (EU DPD 95/46/EC)
• Canada
o Canadian Bill 198
• Australia
o The CLERP Act of Australia
• Payment Card Industry Data Security Standard (PCI DSS), in the form of a contractual agreement for
all companies working with credit cards
• EU: The Basel II Accord, which imposes information controls for all financial institutions doing
business in its related countries
• US: FTC Standards for Safeguarding Customer Info
Compliance with company policies or regulatory restrictions will often require adjustments to business
processes. For example, to accommodate HIPAA, an organization may need to authorize access to health
information (regulated data elements) for multiple unique groups of users.
Organizations should create data security policies based on business and regulatory requirements. A policy is a
statement of a selected course of action and high-level description of desired behavior to achieve a set of goals.
Data security policies describe behaviors that are determined to be in the best interests of an organization that
wishes to protect its data. For policies to have a measurable impact, they must be auditable and audited.
Corporate policies often have legal implications. A court may consider a policy instituted to support a legal
regulatory requirement to be an intrinsic part of the organization’s effort to comply with that legal requirement.
Failure to comply with a corporate policy might have negative legal ramifications after a data breach.
Defining security policy requires collaboration between IT security administrators, Security Architects, Data
Governance committees, Data Stewards, internal and external audit teams, and the legal department. Data
Stewards must also collaborate with all Privacy Officers (Sarbanes-Oxley supervisors, HIPAA Officers, etc.),
and business managers having data expertise, to develop regulatory category Metadata and apply proper
security classifications consistently. All data regulation compliance actions must be coordinated to reduce cost,
work instruction confusion, and needless turf battles.
Different levels of policy are required to govern behavior related to enterprise security. For example:
• Enterprise Security Policy: Global policies for employee access to facilities and other assets, email
standards and policies, security access levels based on position or title, and security breach reporting
policies
• IT Security Policy: Directory structures standards, password policies, and an identity management
framework
• Data Security Policy: Categories for individual application, database roles, user groups, and
information sensitivity
Commonly, the IT Security Policy and Data Security Policy are part of a combined security policy. The
preference, however, should be to separate them. Data security policies are more granular in nature, specific to
content, and require different controls and procedures. The Data Governance Council should review and
approve the Data Security Policy. The Data Management Executive owns and maintains the policy.
Employees need to understand and follow security policies. Develop security policies so that the required
processes and the reasons behind them are clearly defined and achievable. Compliance should be made easier
than non-compliance. Policies need to protect and secure data without stifling user access.
Security policies should be in a format easily accessible by the suppliers, consumers, and other stakeholders.
They should be available and maintained on the company intranet or a similar collaboration portal.
Data security policies, procedures, and activities should be periodically reevaluated to strike the best possible
balance between the data security requirements of all stakeholders.
Policies provide guidelines for behavior. They do not outline every possible contingency. Standards supplement
policies and provide additional detail on how to meet the intention of the policies. For example, a policy may
state that passwords must follow guidelines for strong passwords; the standards for strong passwords would be
detailed separately; and the policy would be enforced through technology that prevents passwords from being
created if they do not meet the standards for strong passwords.
Confidentiality classification is an important Metadata characteristic, guiding how users are granted access
privileges. Each organization should create or adopt a classification scheme that meets its business
requirements. Any classification method should be clear and easy to apply. It will contain a range of levels,
from the least to the most confidential (e.g., from “for general use” to “registered confidential”). (See Section
1.3.12.1.)
A growing number of highly publicized data breaches, in which sensitive personal information has been
compromised, have resulted in data-specific laws being introduced. Financially focused data incidents have
spurred governments across the globe to implement additional regulations.
This has created a new class of data, which might be called ‘Regulated Information’. Regulatory requirements
are an extension of information security. Additional measures are required to manage regulatory requirements
effectively. Consultation with corporate counsel is often helpful in determining what actions certain regulations
require from the enterprise. Often the regulations imply a goal, and it is up to the corporation to determine the
means for reaching that information protection goal. Actions that can be audited provide legal proof of
compliance.
A useful way to handle the data-specific regulations is by analyzing and grouping similar regulations into
categories, as was done by grouping various risks into a few security classifications.
With more than one-hundred different data-specific ordinances around the world, it would be useless to develop
a different category for each regulation. Most data regulations, imposed as they are by separate legal entities,
seek to do the same thing. For example, the contractual obligations for protecting confidential customer data are
remarkably similar to U.S., Japanese, and Canadian government regulations for protecting Personally
Identifiable Information, and to EU privacy requirements. This pattern is easy to see
when the auditable compliance actions for each regulation are listed and compared. Thus, they may all be
managed properly by using the same protective action category.
A key principle for both security classification and regulatory categorization is that most information can be
aggregated so that it has greater or lesser sensitivity. Developers need to know how aggregations affect the
overall security classification and regulatory categories. When a developer of a dashboard, report, or database
view knows that some of the data that is required may be personally private or insider or related to competitive
advantage, the system can then be designed to eliminate aspects of that from the entitlement, or, if the data must
remain in the user-entitlement, to enforce all the security and regulatory requirements at the time of user
authorization.
The results of this classification work will be a formally approved set of security classifications and regulatory
categories and a process for capturing this Metadata in a central repository, so that employees, both business and
technical, know the sensitivity of the information they are handling, transmitting, and authorizing.
Data access control can be organized at an individual or group level, depending on the need. That said, granting
access and update privileges to individual user accounts entails a great deal of redundant effort. Smaller
organizations may find it acceptable to manage data access at the individual level. However, larger
organizations will benefit greatly from role-based access control, granting permissions to role groups and
thereby to each group member.
Role groups enable security administrators to define privileges by role and to grant these privileges by enrolling
users in the appropriate role group. While it is technically possible to enroll a user in more than one group, this
practice may make it difficult to understand the privileges granted to a specific user. Whenever possible, try to
assign each user to only one role group. This may require the creation of different user views of certain data
entitlements to comply with regulations.
Data consistency in user and role management is a challenge. User information such as name, title, and
employee ID must be stored redundantly in several locations. These islands of data often conflict, representing
multiple versions of the ‘truth’. To avoid data integrity issues, manage user identity data and role-group
membership centrally. This is a requirement for the quality of data used for effective access control. Security
administrators create, modify, and delete user accounts and role groups. Changes made to the group taxonomy
and membership should receive appropriate approval. Changes should be tracked via a change management
system.
Applying data security measures inconsistently or improperly within an organization can lead to employee
dissatisfaction and significant risk to the organization. Role-based security depends on clearly defined,
consistently assigned roles.
There are two ways to define and organize roles: as a grid (starting from the data), or in a hierarchy (starting
from the user).
A grid can be useful for mapping out access roles for data, based on data confidentiality, regulations, and user
functions. The Public User role can have access to all data ranked for General Audiences and not subject to any
regulations. A Marketing role may have access to some PII information for use in developing campaigns, but
not to any restricted data, or Client Confidential data. Table 14 shows a very simplified example.
                        Confidentiality Level
Regulation              General Audience        Client Confidential        Restricted Confidential
Not Regulated           Public User Role        Client Manager Role        Restricted Access Role
PII                     Marketing Role          Client Marketing Role      HR Role
PCI                     Financial Role          Client Financial Role      Restricted Financial Role
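Expressed as a lookup structure, the grid above maps a regulatory category and a confidentiality level to the role group entitled to that data, for example:

ROLE_GRID = {
    ("Not Regulated", "General Audience"):        "Public User Role",
    ("Not Regulated", "Client Confidential"):     "Client Manager Role",
    ("Not Regulated", "Restricted Confidential"): "Restricted Access Role",
    ("PII",           "General Audience"):        "Marketing Role",
    ("PII",           "Client Confidential"):     "Client Marketing Role",
    ("PII",           "Restricted Confidential"): "HR Role",
    ("PCI",           "General Audience"):        "Financial Role",
    ("PCI",           "Client Confidential"):     "Client Financial Role",
    ("PCI",           "Restricted Confidential"): "Restricted Financial Role",
}

print(ROLE_GRID[("PII", "Client Confidential")])   # Client Marketing Role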
Construct group definitions at a workgroup or business unit level. Organize these roles in a hierarchy, so that
child roles further restrict the privileges of parent roles. The ongoing maintenance of these hierarchies is a
complex operation requiring reporting systems capable of granular drill down to individual user privileges. A
security role hierarchy example is shown in Figure 65.
Security risks include elements that can compromise a network and/or database. The first step in identifying risk
is identifying where sensitive data is stored, and what protections are required for that data. Evaluate each
system to determine the sensitivity of the data it stores or transmits, the protections that data requires, and the
protections currently in place.
Document the findings, as they create a baseline for future evaluations. This documentation may also be a
requirement for privacy compliance, such as in the European Union. Gaps must be remediated through
improved security processes supported by technology. The impact of improvements should be measured and
monitored to ensure risks are mitigated.
In larger organizations, white-hat hackers may be hired to assess vulnerabilities. A white hat exercise can be
used as proof of an organization’s impenetrability, which can be used in publicity for market reputation.
Implementation and administration of data security policy is primarily the responsibility of security
administrators, in coordination with data stewards and technical teams. For example, database security is often a
DBA responsibility.
Organizations must implement proper controls to meet the security policy requirements. Controls and
procedures should, at a minimum, cover the granting, tracking, review, and revocation of user authorizations, as
described below.
Document the requirements for allowing original user authorizations so that de-authorization may happen when
these conditions no longer apply.
For instance, a policy to ‘maintain appropriate user privileges’ could have a control objective of ‘Review DBA
and User rights and privileges on a monthly basis’. The organization’s procedure to satisfy this control might
be to implement and maintain processes to:
• Validate assigned permissions against a change management system used for tracking all user
permission requests
• Require a workflow approval process or signed paper form to record and document each change
request
• Include a procedure for eliminating authorizations for people whose job status or department no longer
qualifies them to have certain access rights
Some level of management must formally request, track, and approve all initial authorizations and subsequent
changes to user and group authorizations.
Data Stewards are responsible for evaluating and determining the appropriate confidentiality level for data
based on the organization’s classification scheme.
The classification for documents and reports should be based on the highest level of confidentiality for any
information found within the document. (See Chapter 9.) Label each page or screen with the classification in the
header or footer. Information products classified as least confidential (e.g., “For General Audiences”) do not
need labels. Assume any unlabeled products to be for General Audiences.
Document authors and information product designers are responsible for evaluating, correctly classifying, and
labeling the appropriate confidentiality level for each document, as well as each database, including relational
tables, columns, and user entitlement views.
In larger organizations, much of the security classification and protective effort will be the responsibility of a
dedicated information security organization. While Information Security will be happy to have the Data
Stewards work with these classifications, they usually take responsibility for enforcement and for physically
protecting the network.
Organizations should create or adopt a classification approach to ensure that they can meet the demands of
regulatory compliance. (See Section 3.3.) This classification scheme provides a foundation for responding to
internal and external audits. Once it is in place, information needs to be assessed and classified within the
schema. Security staff may not be familiar with this concept, as they do not work with individual data
regulations, but with infrastructure systems. They will need to have documented requirements for data
protection relating to these categories, defining actions they can implement.
Once all the requirements, policies, and procedures are in place, the main task is to ensure that security breaches
do not occur, and if they do, to detect them as soon as possible. Continual monitoring of systems and auditing of
the execution of security procedures are crucial to preserving data security.
Controlling data availability requires management of user entitlements and of the structures (data masking, view
creation, etc.) that technically control access based on entitlements. Some databases are better than others in
providing structures and processes to protect data in storage. (See Section 3.7.)
Security Compliance managers may have direct responsibility for designing user entitlement profiles that allow
the business to function smoothly, while following relevant restrictions.
Defining entitlements and granting authorizations requires an inventory of data, careful analysis of data needs,
and documentation of the data exposed in each user entitlement. Often highly sensitive information is mixed
with non-sensitive information. An enterprise data model is essential to identifying and locating sensitive data.
(See Section 1.1.1.)
Data masking can protect data even if it is inadvertently exposed. Certain data regulations require encryption, an
extreme version of in-place masking. Authorization to the decryption keys can be part of the user authorization
process. Users authorized to access the decryption keys can see the unencrypted data, while others only see
random characters.
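A minimal illustration of masking: users entitled to the clear value see the full field, while everyone else sees a masked form. The masking rule below (retain only the last four characters) is an assumption for the example.

def mask_account_number(value: str, authorized: bool) -> str:
    if authorized:
        return value                                  # entitled users see the clear value
    return "*" * (len(value) - 4) + value[-4:]        # everyone else sees a masked form

print(mask_account_number("4111111111111111", authorized=False))  # ************1111
print(mask_account_number("4111111111111111", authorized=True))   # 4111111111111111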
Relational database views can be used to enforce data security levels. Views can restrict access to certain rows
based on data values or restrict access to certain columns, limiting access to confidential or regulated fields.
Reporting on access is a basic requirement for compliance audits. Monitoring authentication and access
behavior provides information about who is connecting and accessing information assets. Monitoring also helps
detect unusual, unforeseen, or suspicious transactions that warrant investigation. In this way, it compensates for
gaps in data security planning, design, and implementation.
Deciding what needs monitoring, for how long, and what actions to take in the event of an alert, requires careful
analysis driven by business and regulatory requirements. Monitoring entails a wide range of activities. It can be
specific to certain data sets, users, or roles. It can be used to validate data integrity, configurations, or core
Metadata. It can be implemented within a system or across dependent heterogeneous systems. It can focus on
specific privileges, such as the ability to download large sets of data or to access data at off hours.
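As a simple illustration of such monitoring rules, the Python sketch below flags transactions that return unusually large result sets or occur outside normal working hours; the thresholds are placeholders to be set by business and regulatory requirements.

from datetime import datetime

def flag_suspicious(rows_returned: int, occurred_at: datetime,
                    max_rows: int = 10000, work_hours=(7, 19)) -> list:
    reasons = []
    if rows_returned > max_rows:
        reasons.append("large download")
    if not (work_hours[0] <= occurred_at.hour < work_hours[1]):
        reasons.append("off-hours access")
    return reasons

print(flag_suspicious(250000, datetime(2017, 3, 4, 2, 30)))  # ['large download', 'off-hours access']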
Monitoring can be automated, executed manually, or executed through a combination of automation and
oversight. Automated monitoring does impose overhead on the underlying systems and may affect system
performance. Periodic snapshots of activity can be useful in understanding trends and comparing against
standards criteria. Iterative configuration changes may be required to achieve the optimal parameters for proper
monitoring.
Automated recording of sensitive or unusual database transactions should be part of any database deployment.
Lack of automated monitoring represents serious risks:
• Regulatory risk: Organizations with weak database audit mechanisms will increasingly find that they
are at odds with government regulatory requirements. Sarbanes-Oxley (SOX) in the financial services
sector and the Health Insurance Portability and Accountability Act (HIPAA) in the healthcare
sector are just two examples of US government regulation with clear database audit requirements.
• Detection and recovery risk: Audit mechanisms represent the last line of defense. If an attacker
circumvents other defenses, audit data can identify the existence of a violation after the fact. Audit data
can also be used to link a violation to a particular user or as a guide to repair the system.
• Administrative and audit duties risk: Users with administrative access to the database server –
whether that access was obtained legitimately or maliciously – can turn off auditing to hide fraudulent
activity. Audit duties should ideally be separate from both database administrators and the database
server platform support staff.
• Risk of reliance on inadequate native audit tools: Database software platforms often include basic audit capabilities, but these frequently suffer from weaknesses that limit or preclude
deployment. When users access the database via Web applications (such as SAP, Oracle E-Business
Suite, or PeopleSoft), native audit mechanisms have no awareness of specific user identities and all
user activity is associated with the Web application account name. Therefore, when native audit logs
reveal fraudulent database transactions, there is no link to the responsible user.
To mitigate these risks, implement a network-based audit appliance, which can address most of the weaknesses associated with native audit tools but does not take the place of regular audits by trained auditors. This kind
of appliance has the following benefits:
• High performance: Network-based audit appliances can operate at line speed with little impact on
database performance.
• Granular transaction tracking supports advanced fraud detection, forensics, and recovery. Logs
include details such as source application name, complete query text, query response attributes, source
OS, time, and source name.
Managing security policy compliance includes ongoing activities to ensure policies are followed and controls
are effectively maintained. Management also includes providing recommendations to meet new requirements.
In many cases, Data Stewards will act in conjunction with Information Security and Corporate Counsel so that
operational policies and technical controls are aligned.
Compliance controls require audit trails. For example, if policy states that users must take training before
accessing certain data, then the organization must be able to prove that any given user took the training. Without
an audit trail, there is no evidence of compliance. Controls should be designed to ensure they are auditable.
Internal audits of activities to ensure data security and regulatory compliance policies are followed should be
conducted regularly and consistently. Compliance controls themselves must be revisited when new data
regulation is enacted, when existing regulation changes, and periodically to ensure usefulness. Internal or
external auditors may perform audits. In all cases, auditors must be independent of the data and / or process
involved in the audit to avoid any conflict of interest and to ensure the integrity of the auditing activity and
results.
Auditing is not a fault-finding mission. The goal of auditing is to provide management and the data governance
council with objective, unbiased assessments, and rational, practical recommendations.
Data security policy statements, standards documents, implementation guides, change requests, access
monitoring logs, report outputs, and other records (electronic or hard copy) form the input to an audit. In
addition to examining existing evidence, audits often include performing tests and checks, such as:
• Analyzing policy and standards to assure that compliance controls are defined clearly and fulfill
regulatory requirements
• Assessing whether authorization standards and procedures are adequate and in alignment with
technology requirements
• Evaluating escalation procedures and notification mechanisms to be executed when potential non-
compliance issues are discovered or in the event of a regulatory compliance breach
• Reviewing contracts, data sharing agreements, and regulatory compliance obligations of outsourced and external vendors, to ensure that business partners meet their obligations and that the organization meets its legal obligations for protecting regulated data
• Assessing the maturity of security practices within the organization and reporting to senior
management and other stakeholders on the ‘State of Regulatory Compliance’
Auditing data security is not a substitute for management of data security. It is a supporting process that
objectively assesses whether management is meeting goals.
3. Tools
The tools used for managing information security depend, in large part, on the size of the organization, the
network architecture, and the policies and standards used by a security organization.
Anti-virus software protects computers from viruses encountered on the Web. New viruses and other malware
appear every day, so it is important to update security software regularly.
3.2 HTTPS
If a Web address begins with https://, the connection between the browser and the website is encrypted. Typically, users must provide a password or other means of authentication to access the site. Online payments and access to classified information rely on this encryption protection. Train users to look for this in the URL when they are performing sensitive operations over the Internet, or even within the enterprise. Without encryption, people on the same network segment can read the plain text information.
Identity management technology stores assigned credentials and shares them with systems upon request, such as
when a user logs into a system. Some applications manage their own credential repository, although it is more
convenient for users to have most or all applications use a central credential repository. There are protocols for
managing credentials: Lightweight Directory Access Protocol (LDAP) is one.
Some companies choose and provide an enterprise approved ‘Password Safe’ product that creates an encrypted
password file on each user’s computer. Users only need to learn one long pass-phrase to open the program and
they can store all their passwords safely in the encrypted file. A single-sign-on system also can perform this
role.
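The sketch below illustrates the general idea behind such a product rather than any specific tool: one long pass-phrase is stretched into an encryption key that protects a local credential file. It assumes the third-party Python cryptography package; names such as key_from_passphrase are illustrative.

import base64, json, os
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def key_from_passphrase(passphrase: str, salt: bytes) -> bytes:
    """Derive an encryption key from one long pass-phrase."""
    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32, salt=salt, iterations=480_000)
    return base64.urlsafe_b64encode(kdf.derive(passphrase.encode()))

salt = os.urandom(16)                       # stored alongside the encrypted file
safe = Fernet(key_from_passphrase("one long memorable pass-phrase", salt))

# Encrypt the whole credential store as a single blob on the user's computer.
credentials = {"intranet": "s3cr3t!", "hr-portal": "another-p@ss"}
blob = safe.encrypt(json.dumps(credentials).encode())

# Later: the same pass-phrase (and salt) recovers the passwords.
print(json.loads(safe.decrypt(blob)))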
Tools that can detect incursions and dynamically deny access are necessary in case attackers penetrate firewalls or other security measures.
An Intrusion Detection System (IDS) will notify appropriate people when a security incident occurs. An IDS should ideally be connected with an Intrusion Prevention System (IPS) that automatically responds to
known attacks and illogical combinations of user commands. Detection is often accomplished by analysis of
patterns within the organization. Knowledge of expected patterns allows detection of out-of-the-ordinary events.
When these take place, the system can send alerts.
Secure and sophisticated firewalls, with the capacity to allow full-speed data transmission while still performing detailed packet analysis, should be placed at the enterprise gateway. For web servers exposed to the Internet, a more complex firewall structure is advised, as many attacks use legitimate-appearing traffic that is intentionally malformed to exploit database and web server vulnerabilities.
Tools that track Metadata can help an organization track the movement of sensitive data. These tools create some risk that outside agents could infer internal information from the Metadata associated with documents. However, identification of sensitive information using Metadata remains the best way to ensure that data is protected properly. Because most data loss incidents result from sensitive data going unprotected simply because no one recognized its sensitivity, the benefit of documenting that sensitivity far outweighs the hypothetical risk of the Metadata repository itself being exposed, especially since an experienced hacker can already locate unprotected sensitive data on the network with little effort. The people most likely to be unaware of the need to protect sensitive data are the organization's own employees and managers.
Tools that perform masking or encryption are useful for restricting movement of sensitive data. (See Section
1.3.9.)
4. Techniques
Techniques for managing information security depend on the size of the organization, the architecture of the
network, the type of data that must be secured, and the policies and standards used by a security organization.
Creating and using data-to-process and data-to-role relationship (CRUD – Create, Read, Update, Delete) matrices helps map data access needs and guide definition of data security role groups, parameters, and permissions. Some versions add an E for Execute, making CRUDE.
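A data-to-role CRUD(E) matrix can be captured directly as data and checked programmatically. The sketch below is a minimal illustration in Python; the data set names, roles, and permission sets are hypothetical.

# Data-to-role CRUD(E) matrix: which operations each role may perform on each data set.
CRUD_MATRIX = {
    ("Customer", "Call Center Agent"): {"C", "R", "U"},
    ("Customer", "Marketing Analyst"): {"R"},
    ("Payment",  "Billing Clerk"):     {"C", "R"},
    ("Payment",  "Batch Scheduler"):   {"E"},          # E = Execute (CRUDE)
}

def is_permitted(data_set: str, role: str, operation: str) -> bool:
    """Check whether a role holds a given permission on a data set."""
    return operation in CRUD_MATRIX.get((data_set, role), set())

print(is_permitted("Customer", "Marketing Analyst", "U"))  # False
print(is_permitted("Payment", "Billing Clerk", "R"))       # True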
A process for installing security patches as quickly as possible on all machines should be in place. A malicious hacker only needs root access to one machine to conduct an attack successfully on the network. Users
should not be able to delay this update.
A Metadata repository is essential to assure the integrity and consistent use of an Enterprise Data Model across
business processes. Metadata should include security and regulatory classifications for data. (See Section 1.1.3.)
Having security Metadata in place protects an organization from employees who may not recognize the data as
sensitive. When Data Stewards apply confidentiality and regulatory categories, category information should be
documented in the Metadata repository and, if technology allows, tagged to the data. (See Sections 3.3.1 and
3.3.2.) These classifications can be used to define and manage user entitlements and authorizations, as well as to
inform development teams about risks related to sensitive data.
4.4 Metrics
It is essential to measure information protection processes to ensure that they are functioning as required.
Metrics also enable improvement of these processes. Some metrics measure progress on processes: the number
of audits performed, security systems installed, incidents reported, and the amount of unexamined data in
systems. More sophisticated metrics will focus on findings from audits or the movement of the organization
along a maturity model.
In larger organizations with existing information security staff, a significant number of these metrics may
already exist. It is helpful to reuse existing metrics as a part of an overall threat management measurement
process, and to prevent duplication of effort. Create a baseline (initial reading) of each metric to show progress
over time.
While a great number of security activities and conditions can be measured and tracked, focus on actionable
metrics. A few key metrics in organized groups are easier to manage than pages of apparently unrelated
indicators. Improvement actions may include awareness training on data regulatory policies and compliance
actions.
Many organizations face similar data security challenges. The following lists may assist in selecting applicable
metrics.
• Percentage of enterprise computers having the most recent security patches installed
• Percentage of computers having up-to-date anti-malware software installed and running
• Percentage of new-hires who have had successful background checks
• Percentage of employees scoring more than 80% on annual security practices quiz
• Percentage of business units for which a formal risk assessment analysis has been completed
• Percentage of business processes successfully tested for disaster recovery in the event of fire,
earthquake, storm, flood, explosion or other disaster
• Percentage of audit findings that have been successfully resolved
Select and maintain a reasonable number of actionable metrics in appropriate categories over time to assure
compliance, spot issues before they become crises, and indicate to senior management a determination to
protect valuable corporate information.
• Risk assessment findings provide qualitative data that needs to be fed back to appropriate business
units to make them more aware of their accountability.
• Risk events and profiles identify unmanaged exposures that need correction. Determine the absence
or degree of measurable improvement in risk exposure or conformance to policy by conducting follow-
up testing of the awareness initiative to see how well the messages got across.
• Formal feedback surveys and interviews identify the level of security awareness. Also, measure the
number of employees who have successfully completed security awareness training within targeted
populations.
• Incident post mortems, lessons learned, and victim interviews provide a rich source of information
on gaps in security awareness. Measures may include how much vulnerability has been mitigated.
• Patching effectiveness audits involve specific machines that work with confidential and regulated
information to assess the effectiveness of security patching. (An automated patching system is advised
whenever possible.)
• Criticality ranking of specific data types and information systems that, if made inoperable, would
have profound impact on the enterprise.
• Annualized loss expectancy of mishaps, hacks, thefts, or disasters related to data loss, compromise, or
corruption.
• Risk of specific data losses related to certain categories of regulated information, and remediation
priority ranking.
• Risk mapping of data to specific business processes. Risks associated with Point of Sale devices
would be included in the risk profile of the financial payment system.
• Threat assessments performed based on the likelihood of an attack against certain valuable data
resources and the media through which they travel.
• Vulnerability assessments of specific parts of the business process where sensitive information could
be exposed, either accidentally or intentionally.
• Auditable list of locations where sensitive data is propagated throughout the organization
The number of copies of confidential data should be measured in order to reduce this proliferation. The more
places confidential data is stored, the higher the risk of a breach.
Every project that involves data must address system and data security. Identify detailed data and application
security requirements in the analysis phase. Identification up front guides the design and prevents having to
retrofit security processes. If implementation teams understand data protection requirements from the start, they
can build compliance into the basic architecture of the system. This information can also be used for selecting
appropriate vendor/purchased software packages.
Searching encrypted data obviously includes the need to decrypt the data. One way to reduce the amount of data
that needs decryption is to encrypt the search criteria (such as a string) using the same encryption method used
for the data, and then seek matches. The amount of data matching the encrypted search criteria will be much
less, and therefore less costly (and risky) to decrypt. Then search using clear text on the result set to get exact
matches.
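This approach assumes the searchable field is protected with a deterministic transformation, so equal plaintext values always produce equal stored values. The Python sketch below uses a keyed HMAC digest as a stand-in for deterministic encryption (a simplification; real systems use deterministic encryption or blind indexes managed with the encryption keys); the key and field names are hypothetical.

import hashlib, hmac

SECRET_KEY = b"key-managed-by-the-key-management-service"   # placeholder

def blind_index(value: str) -> str:
    """Deterministic keyed digest standing in for deterministic encryption."""
    return hmac.new(SECRET_KEY, value.lower().encode(), hashlib.sha256).hexdigest()

# Stored records keep only the digest of the searchable field plus the encrypted payload.
records = [
    {"name_idx": blind_index("Alice Smith"), "payload": "<ciphertext 1>"},
    {"name_idx": blind_index("Bob Jones"),   "payload": "<ciphertext 2>"},
]

# Search: transform the criteria the same way, match digests, decrypt only the hits.
search_term = blind_index("alice smith")
hits = [r for r in records if r["name_idx"] == search_term]
print(len(hits))   # 1 -- only this record needs to be decrypted and checked in clear text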
Document sanitization is the process of cleaning Metadata, such as tracked change history, from documents
before sharing. Sanitization mitigates the risk of sharing confidential information that might be embedded in
comments. In contracts especially, access to this information may negatively affect negotiations.
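As one hedged example, the sketch below blanks the core document properties of a .docx file using the third-party python-docx package; it does not remove tracked changes or review comments, which must be accepted or deleted in the authoring tool or a dedicated sanitization utility. File names are hypothetical.

from docx import Document   # third-party package: python-docx

def scrub_core_properties(path_in: str, path_out: str) -> None:
    """Blank out identifying core properties before a document is shared."""
    doc = Document(path_in)
    props = doc.core_properties
    props.author = ""
    props.last_modified_by = ""
    props.comments = ""        # the core 'comments' (description) property, not review comments
    props.keywords = ""
    doc.save(path_out)

scrub_core_properties("draft_contract.docx", "draft_contract_clean.docx")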
5. Implementation Guidelines
Implementation of data security practices depends on corporate culture, the nature of the risks, the sensitivity of the data the company manages, and the types of systems in place. Implementation should be guided by a strategic security plan and supporting architecture.
Keeping data secure is deeply connected to corporate culture. Organizations often end up reacting to crises,
rather than proactively managing accountability and ensuring auditability. While perfect data security is next to
impossible, the best way to avoid data security breaches is to build awareness and understanding of security
requirements, policies, and procedures. Organizations can increase compliance through:
• Training: Promotion of standards through training on security initiatives at all levels of the
organization. Follow training with evaluation mechanisms such as online tests focused on improving
employee awareness. Such training and testing should be mandatory and a prerequisite for employee
performance evaluation.
• Consistent policies: Definition of data security policies and regulatory compliance policies for
workgroups and departments that complement and align with enterprise policies. Adopting an ‘act
local’ mindset helps engage people more actively.
• Measure the benefits of security: Link data security benefits to organizational initiatives.
Organizations should include objective metrics for data security activities in their balanced scorecard
measurements and project evaluations.
• Set security requirements for vendors: Include data security requirements in service level agreements (SLAs) and outsourcing contractual obligations. SLAs must include all data protection actions.
• Build a sense of urgency: Emphasize legal, contractual, and regulatory requirements to build a sense
of urgency and an internal framework for data security management.
Organizations need to develop data policies that enable them to meet their goals while protecting sensitive and
regulated information from misuse or unauthorized exposure. They must account for the interests of all
stakeholders as they balance risks with ease of access. Often the technical architecture must accommodate the
Data Architecture, balancing these needs to create an effective and secure electronic environment. In most
organizations, the behavior of both management and employees will need to change if they are to successfully
protect their data.
In many larger companies, the existing information security group will have in place policies, safeguards,
security tools, access control systems, and information protection devices and systems. There should be a clear
understanding and appreciation of where these elements complement the work done by the Data Stewards and data
administrators. Data Stewards are generally responsible for data categorization. Information security teams
assist with compliance enforcement and establish operational procedures based on data protection policies, and
security and regulatory categorization.
Implementing data security measures without regard for the expectations of customers and employees can result
in employee dissatisfaction, customer dissatisfaction, and organizational risk. To promote compliance, data
security measures must account for the viewpoint of those who will be working with the data and systems.
Well-planned and comprehensive technical security measures should make secure access easier for
stakeholders.
Each user data entitlement, which is the sum total of all the data made available by a single authorization, must
be reviewed during system implementation to determine if it contains any regulated information. Knowing who
can see which data requires management of Metadata that describes the confidentiality and regulatory
classifications of the data, as well as management of the entitlements and authorizations themselves.
Classification of regulatory sensitivity should be a standard part of the data definition process.
Outsourcing IT operations introduces additional data security challenges and responsibilities. Outsourcing
increases the number of people who share accountability for data across organizational and geographic
boundaries. Previously informal roles and responsibilities must be explicitly defined as contractual obligations.
Outsourcing contracts must specify the responsibilities and expectations of each role.
Any form of outsourcing increases risk to the organization, including some loss of control over the technical
environment and the people working with the organization’s data. Data security measures and processes must
look at the risk from the outsource vendor as both an external and internal risk.
The maturity of IT outsourcing has enabled organizations to re-evaluate outsourced services. A broad consensus
has emerged that architecture and ownership of IT, which includes data security architecture, should be an in-
sourced function. In other words, the internal organization owns and manages the enterprise and security
architecture. The outsourced partner may take the responsibility for implementing the architecture.
Transferring control, but not accountability, requires tighter risk management and control mechanisms.
In an outsourced environment, it is critical to track the lineage, or flow, of data across systems and individuals
to maintain a ‘chain of custody’. Outsourcing organizations especially benefit from developing CRUD (Create,
Read, Update, and Delete) matrices that map data responsibilities across business processes, applications, roles,
and organizations, tracing the transformation, lineage, and chain of custody for data. Additionally, the ability to
execute business decisions or application functionality, such as approving checks or orders, must be included as
part of the matrix.
Responsible, Accountable, Consulted, and Informed (RACI) matrices also help clarify roles, the separation of
duties, and responsibilities of different roles, including their data security obligations.
The RACI matrix can become part of the contractual agreements and data security policies. Defining
responsibility matrices like RACI will establish clear accountability and ownership among the parties involved
in the outsourcing engagement, leading to support of the overall data security policies and their implementation.
In outsourcing information technology operations, the accountability for maintaining data still lies with the
organization. It is critical to have appropriate compliance mechanisms in place and have realistic expectations
from parties entering into the outsourcing agreements.
The rapid emergence of web computing and business-to-business and business-to-consumer interaction has
caused the boundaries of data to extend beyond the four walls of the organization. The recent advances in cloud
computing have extended the boundaries a step further. The ‘as-a-service’ nomenclature is now common across
all stacks of technology and business. ‘Data-as-a-Service’, ‘Software-as-a-Service’, ‘Platform-as-a-Service’ are
commonly used terms today. Cloud computing, or having resources distributed over the internet to process data
and information, is complementing the ‘X-as-a-Service’ provisioning.
Data security policies need to account for the distribution of data across the different service models. This
includes the need to leverage external data security standards.
Shared responsibility, defining chain of custody of data and defining ownership and custodianship rights, is
especially important in cloud computing. Infrastructure considerations (e.g., Who is responsible for the firewall
when the cloud provider delivers the software over the web? Who is accountable for access rights on the
servers?) have direct impacts on data security management and data policies.
Fine-tuning or even creating a new data security management policy geared towards cloud computing is
necessary for organizations of all sizes. Even if an organization has not directly implemented resources in the
cloud, business partners may. In a connected world of data, having a business partner use cloud computing
means putting the organization’s data in the cloud. The same data proliferation security principles apply to
sensitive/confidential production data.
Internal cloud data-center architecture, including virtual machines (even though these are potentially more secure), should follow the same security policy as the rest of the enterprise.
Enterprise Architecture defines the information assets and components of an enterprise, their interrelationships,
and business rules regarding transformation, principles, and guidelines. Data Security architecture is the
component of enterprise architecture that describes how data security is implemented within the enterprise to
satisfy business rules and external regulations. Architecture influences how data security controls are implemented.
For example, an architectural pattern of a service-oriented integration mechanism between internal and external
parties would call for a data security implementation different from traditional electronic data interchange (EDI)
integration architecture.
For a large enterprise, the formal liaison function between these disciplines is essential to protecting information
from misuse, theft, exposure, and loss. Each party must be aware of elements that concern the others, so they
can speak a common language and work toward shared goals.
CHAPTER 8
Data Integration and Interoperability
1. Introduction
Data Integration and Interoperability (DII) describes processes related to the movement and
consolidation of data within and between data stores, applications and organizations. Integration
consolidates data into consistent forms, either physical or virtual. Data Interoperability is the ability
for multiple systems to communicate. DII solutions enable the basic data management functions on which most organizations depend. DII also depends on other areas of data management:
• Data Governance: For governing the transformation rules and message structures
• Data Architecture: For designing solutions
• Data Security: For ensuring solutions appropriately protect the security of data, whether it is
persistent, virtual, or in motion between applications and organizations
• Metadata: For tracking the technical inventory of data (persistent, virtual, and in motion), the business
meaning of the data, the business rules for transforming the data, and the operational history and
lineage of the data
• Data Storage and Operations: For managing the physical instantiation of the solutions
• Data Modeling and Design: For designing the data structures including physical persistence in
databases, virtual data structures, and messages passing information between applications and
organizations
Data Integration and Interoperability is critical to Data Warehousing and Business Intelligence, as well as
Reference Data and Master Data Management, because all of these focus on transforming and integrating data
from source systems to consolidated data hubs and from hubs to the target systems where it can be delivered to
data consumers, both system and human.
Data Integration and Interoperability is central to the emerging area of Big Data management. Big Data seeks to
integrate various types of data, including data structured and stored in databases, unstructured text data in
documents or files, other types of unstructured data such as audio, video, and streaming data. This integrated
data can be mined, used to develop predictive models, and deployed in operational intelligence activities.
The need to manage data movement efficiently is a primary driver for DII. Since most organizations have
hundreds or thousands of databases and stores, managing the processes for moving data between the data stores
within the organization and to and from other organizations has become a central responsibility of every
information technology organization. If not managed properly, the process of moving data can overwhelm IT
resources and capabilities and dwarf the support requirements of traditional application and data management
areas.
The advent of organizations purchasing applications from software vendors, rather than developing custom
applications, has amplified the need for enterprise data integration and interoperability. Each purchased
application comes with its own set of Master Data stores, transaction data stores, and reporting data stores that
must integrate with the other data stores in the organization. Even Enterprise Resource Planning (ERP) systems
that run the common functions of the organization rarely, if ever, encompass all the data stores in the organization. They, too, must have their data integrated with other organizational data.
Goals:
1. Provide data securely, with regulatory compliance, in the format and timeframe needed.
2. Lower cost and complexity of managing solutions by developing shared models and interfaces.
3. Identify meaningful events and automatically trigger alerts and actions.
4. Support business intelligence, analytics, master data management, and operational efficiency efforts.
Business Drivers

Inputs:
• Business Goals & Strategies
• Data Needs & Standards
• Regulatory, Compliance, & Security Requirements
• Data, Process, Application, and Technical Architectures
• Data Semantics
• Source Data

Activities:
1. Plan & Analyze (P)
   1. Define data integration and lifecycle requirements
   2. Perform Data Discovery
   3. Document Data Lineage
   4. Profile Data
   5. Examine Business Rule Compliance
2. Design DII Solutions (P)
   1. Design Solution Components
   2. Map Sources to Targets
   3. Design Data Orchestration
3. Develop DII Solutions (D)
   1. Develop Data Services
   2. Develop Data Flow Orchestration
   3. Develop Data Migration Approach
   4. Develop Complex Event Processing
   5. Maintain DII Metadata
4. Implement and Monitor (O)

Deliverables:
• DII Architecture
• Data Exchange Specifications
• Data Access Agreements
• Data Services
• Complex Event Processing Thresholds and Alerts

Technical Drivers
The need to manage complexity and the costs associated with complexity are reasons to architect data
integration from an enterprise perspective. An enterprise design of data integration is demonstrably more
efficient and cost effective than distributed or point-to-point solutions. Developing point-to-point solutions
between applications can result in thousands to millions of interfaces and can quickly overwhelm the
capabilities of even the most effective and efficient IT support organization.
Data hubs such as data warehouses and Master Data solutions help to alleviate this problem by consolidating
the data needed by many applications and providing those applications with consistent views of the data.
Similarly, the complexity of managing operational and transactional data that needs to be shared across the
organization can be greatly simplified using enterprise data integration techniques such as hub-and-spoke
integration and canonical message models.
Another business driver is managing the cost of support. Moving data using multiple technologies, each
requiring specific development and maintenance skills, can drive support costs up. Standard tool
implementations can reduce support and staffing costs and improve the efficiency of troubleshooting efforts.
Reducing the complexity of interface management can lower the cost of interface maintenance, and allow
support resources to be more effectively deployed on other organizational priorities.
DII also supports an organization’s ability to comply with data handling standards and regulations. Enterprise-
level DII systems enable re-use of code to implement compliance rules and simplify verification of compliance.
The implementation of Data Integration and Interoperability practices and solutions aims to:
• Make data available in the format and timeframe needed by data consumers, both human and system
• Lower cost and complexity of managing solutions by developing shared models and interfaces
• Identify meaningful events (opportunities and threats) and automatically trigger alerts and actions
• Support Business Intelligence, analytics, Master Data Management, and operational efficiency efforts
• Take an enterprise perspective in design to ensure future extensibility, but implement through iterative
and incremental delivery
• Balance local data needs with enterprise data needs, including support and maintenance.
• Ensure business accountability for Data Integration and Interoperability design and activity. Business
experts should be involved in the design and modification of data transformation rules, both persistent
and virtual.
Central to all areas in Data Integration and Interoperability is the basic process of Extract, Transform, and Load
(ETL). Whether executed physically or virtually, in batch or real-time, these are the essential steps in moving
data around and between applications and organizations.
Depending on data integration requirements, ETL can be performed as a periodically scheduled event (batch) or
whenever new or updated data is available (real-time or event-driven). Operational data processing tends to be
real-time or near real-time, while data needed for analysis or reporting is often scheduled in batch jobs.
Data integration requirements also determine whether the extracted and transformed data is physically stored in
staging structures. Physical staging allows for an audit trail of steps that have occurred with the data and
potential process restarts from an intermediate point. However, staging structures take up disk space and take
time to write and read. Data integration needs that require very low latency will usually not include physical
staging of the intermediate data integration results.
1.3.1.1 Extract
The extract process includes selecting the required data and extracting it from its source. Extracted data is then
staged, in a physical data store on disk or in memory. If physically staged on disk, the staging data store may be
co-located with the source data store or with the target data store, or both.
Ideally, if this process executes on an operational system, it is designed to use as few resources as possible, in
order to avoid negatively affecting the operational processes. Batch processing during off-peak hours is an
option for extracts that include complex processing to perform the selection or identify changed data to extract.
1.3.1.2 Transform
The transform process makes the selected data compatible with the structure of the target data store.
Transformation includes cases where data is removed from the source when it moves to the target, where data is
copied to multiple targets, and where the data is used to trigger events but is not persisted. Common transformations include:
• Format changes: Conversion of the technical format of the data; for example, from EBCDIC to ASCII
format
• Structure changes: Changes to the structure of the data; for example, from denormalized to
normalized records
• Semantic conversion: Conversion of data values to maintain consistent semantic representation. For
example, the source gender codes might include 0, 1, 2, and 3, while the target gender codes might be
represented as UNKNOWN, FEMALE, MALE, or NOT PROVIDED.
• De-duping: Ensuring that, where rules require unique key values or records, the process includes a means of scanning the target and detecting and removing duplicate rows
• Re-ordering: Changing the order of the data elements or records to fit a defined pattern
Transformation may be performed in batch or real-time, either physically storing the result in a staging area, or
virtually storing the transformed data in memory until ready to move to the load step. Data resulting from the
transformation stage should be ready to integrate with data in the target structure.
1.3.1.3 Load
The load step of ETL is physically storing or presenting the result of the transformations in the target system.
Depending on the transformations performed, the target system’s purpose, and the intended use, the data may
need further processing to be integrated with other data, or it may be in a final form, ready to present to
consumers.
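A minimal end-to-end sketch of these three steps follows, in Python using only the standard library; the source columns, the gender-code mapping (reusing the semantic conversion example above), and the target table are hypothetical, and real ETL would typically run in a dedicated tool or framework.

import csv, io, sqlite3

# --- Extract: read the source records (an in-memory CSV stands in for a source extract).
source = io.StringIO("customer_id,gender_code,email\n101,1,a@x.com\n102,3,b@x.com\n101,1,a@x.com\n")
rows = list(csv.DictReader(source))

# --- Transform: semantic conversion of codes and de-duping on the key.
GENDER_MAP = {"0": "UNKNOWN", "1": "FEMALE", "2": "MALE", "3": "NOT PROVIDED"}
seen, staged = set(), []
for row in rows:
    if row["customer_id"] in seen:
        continue                                  # de-dupe on customer_id
    seen.add(row["customer_id"])
    row["gender"] = GENDER_MAP[row.pop("gender_code")]
    staged.append(row)

# --- Load: physically store the result in the target structure.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY, gender TEXT, email TEXT)")
target.executemany("INSERT INTO customer VALUES (:customer_id, :gender, :email)", staged)
print(target.execute("SELECT * FROM customer").fetchall())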
1.3.1.4 ELT
If the target system has more transformation capability than either the source or an intermediary application
system, the order of processes may be switched to ELT – Extract, Load, and Transform. ELT allows
transformations to occur after the load to the target system, often as part of the process. ELT allows source data
to be instantiated on the target system as raw data, which can be useful for other processes. This is common in
Big Data environments where ELT loads the data lake. (See Chapter 14.)
1.3.1.5 Mapping
A synonym for transformation, a mapping is both the process of developing the lookup matrix from source to
target structures and the result of that process. A mapping defines the sources to be extracted, the rules for
identifying data for extraction, targets to be loaded, rules for identifying target rows for update (if any), and any
transformation rules or calculations to be applied. Many data integration tools offer visualizations of mappings
that enable developers to use graphical interfaces to create transformation code.
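A mapping can itself be represented as data, which is essentially what mapping tools generate behind their graphical interfaces. The sketch below is an illustrative Python representation; the source, target, and rule names are hypothetical.

# A mapping captured as data: source, extraction rule, target, and transformation rules.
CUSTOMER_MAPPING = {
    "source": "crm.client",
    "extract_rule": "updated_at >= :last_run_timestamp",      # which rows to pull
    "target": "warehouse.dim_customer",
    "match_on": ["customer_id"],                               # rule for identifying target rows to update
    "rules": {
        "customer_id": "client_no",                            # simple rename
        "full_name":   lambda r: f"{r['first_name']} {r['last_name']}".strip(),
        "gender":      lambda r: {"F": "FEMALE", "M": "MALE"}.get(r["gender_cd"], "UNKNOWN"),
    },
}

def apply_mapping(mapping: dict, source_row: dict) -> dict:
    """Produce a target row from a source row using the mapping's rules."""
    out = {}
    for target_col, rule in mapping["rules"].items():
        out[target_col] = rule(source_row) if callable(rule) else source_row[rule]
    return out

print(apply_mapping(CUSTOMER_MAPPING, {
    "client_no": 7, "first_name": "Ana", "last_name": "Li", "gender_cd": "F"}))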
1.3.2 Latency
Latency is the time difference between when data is generated in the source system and when the data is
available for use in the target system. Different approaches to data processing result in different degrees of data
latency. Latency can be high (batch) or low (event-driven) to very low (real-time synchronous).
1.3.2.1 Batch
Most data moves between applications and organizations in clumps or files either on request by a human data
consumer or automatically on a periodic schedule. This type of interaction is called batch or ETL.
Data moving in batch mode will represent either the full set of data at a given point in time, such as account
balances at the end of a period, or data that has changed values since the last time the data was sent, such as
address changes that have been made in a day. The set of changed data is called the delta, and the data from a
point in time is called a snapshot.
With batch data integration solutions, there is often a significant delay between when data changes in the source
and when it is updated in the target, resulting in high latency. Batch processing is very useful for processing
very high volumes of data in a short time window. It tends to be used for data warehouse data integration
solutions, even when lower latency solutions are available.
To achieve fast processing and lower latency, some data integration solutions use micro-batch processing which
schedules batch processing to run on a much higher frequency than daily, such as every five minutes.
Batch data integration is used for data conversions, migrations, and archiving, as well as for extracting from and
loading data warehouses and data marts. There are risks associated with the timing of batch processing. To
minimize issues with application updates, schedule data movement between applications at the end of logical
processing for the business day, or after special processing of the data has occurred at night. To avoid
incomplete data sets, jobs moving data to a data warehouse should be scheduled based on the daily, weekly, or
monthly reporting schedule.
Change Data Capture is a method of reducing bandwidth by filtering to include only data that has been changed
within a defined timeframe. Change data capture monitors a data set for changes (inserts, changes, deletes) and
then passes those changes (the deltas) to other data sets, applications, and organizations that consume the data.
Data may also be tagged with identifiers such as flags or timestamps as part of the process. Change data capture
may be data-based or log-based. (See Chapter 6.) Data-based change data capture uses techniques such as:
• The source system populates specific data elements, such as timestamps within a range, or codes or
flags, which serve as change indicators. The extract process uses rules to identify rows to extract.
• The source system processes add to a simple list of objects and identifiers when changing data, which
is then used to control selection of data for extraction.
• The source system processes copy data that has changed into a separate object as part of the
transaction, which is then used for extract processing. This object does not need to be within the
database management system.
These types of extraction use capabilities built into the source application, which may be resource intensive and
require the ability to modify the source application.
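The sketch below illustrates the first technique, using a timestamp column as the change indicator and a stored high-water mark to select only the delta; the table and column names are hypothetical (Python with the standard sqlite3 module).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "SHIPPED",   "2024-01-01T08:00:00"),
    (2, "PENDING",   "2024-01-02T09:30:00"),
    (3, "CANCELLED", "2024-01-02T10:15:00"),
])

# The extract keeps a high-water mark (last successful run) and pulls only newer rows: the delta.
last_run = "2024-01-02T00:00:00"
delta = conn.execute(
    "SELECT order_id, status, updated_at FROM orders WHERE updated_at > ?", (last_run,)
).fetchall()
print(delta)                       # only the rows changed since the last extract

# Advance the high-water mark for the next run.
last_run = max(row[2] for row in delta) if delta else last_run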
In log-based change data captures, data activity logs created by the database management system are copied and
processed, looking for specific changes that are then translated and applied to a target database. Complex
translations may be difficult, but intermediary structures resembling the source object can be used as a way of
staging the changes for further processing.
Most data integration solutions that are not performed in batches use a near-real-time or event-driven solution.
Data is processed in smaller sets spread across the day in a defined schedule, or data is processed when an event
happens, such as a data update. Near-real-time processing has a lower latency than batch processing and often a
lower system load as the work is distributed over time, but it is usually slower than a synchronized data
integration solution. Near-real-time data integration solutions are usually implemented using an enterprise
service bus.
State information and process dependencies must be monitored by the target application load process. Data
coming into the target may not be available in the exact order that the target needs to build the correct target
data. For example, process Master Data or dimensional data prior to transactional data that uses that Master
Data.
1.3.2.4 Asynchronous
In an asynchronous data flow, the system providing data does not wait for the receiving system to acknowledge
update before continuing processing. Asynchronous implies that either the sending or receiving system could be
off-line for some period without the other system also being off-line.
Asynchronous data integration does not prevent the source application from continuing its processing, or cause
the source application to be unavailable if any of the target applications are unavailable. Since the data updates
made to applications in an asynchronous configuration are not immediate, the integration is called near-real-
time. The delay between updates made in the source and relayed to target data sets in a near-real-time
environment is usually measured in seconds or minutes.
There are situations where no time delay or other difference between source and target data is acceptable.
When data in one data set must be kept perfectly in synch with the data in another data set, then a real-time,
synchronous solution must be used.
In a synchronous integration solution, an executing process waits to receive confirmation from other
applications or processes prior to executing its next activity or transaction. This means that the solution can
process fewer transactions because it has to spend time waiting for confirmation of data synchronization. If any
of the applications that need the update are not available then the transaction cannot be completed in the
primary application. This situation keeps data synchronized but has the potential to make strategic applications
dependent on less critical applications.
Solutions using this type of architecture exist on a continuum based on how much difference between data sets
might be possible and how much such a solution is worth. Data sets may be kept in synch through database
capabilities such as two-phase commits, which ensure that either all updates in a business transaction succeed or none is made. For example, financial institutions use two-phase commit solutions to ensure that
financial transaction tables are absolutely synchronized with financial balance tables. Most programming does
not use two-phase commit. There is a very small possibility that if an application is interrupted unexpectedly
then one data set may be updated but not another.
Real-time, synchronous solutions require less state management than asynchronous solutions because the order
in which transactions are processed is clearly managed by the updating applications. However, they also may
lead to blocking and delay other transactions.
Tremendous advances have been made in developing extremely fast data integration solutions. These solutions
require a large investment in hardware and software. The extra costs of low latency solutions are justified if an
organization requires extremely fast data movement across large distances. ‘Streaming data’ flows from
computer systems on a real-time continuous basis immediately as events occur. Data streams capture events like
the purchase of goods or financial securities, social media comments, and readouts from sensors monitoring
location, temperature, usage, or other values.
Low latency data integration solutions are designed to minimize the response time to events. They may include
the use of hardware solutions like solid-state disk or software solutions like in-memory databases so that the
process does not have to slow down to read or write to traditional disk. Reading from and writing to traditional disk drives is thousands of times slower than processing data in-memory or on solid-state drives.
Asynchronous solutions are usually used in low latency solutions so that transactions do not need to wait for
confirmation from subsequent processes before processing the next piece of data.
Massive multi-processing, or simultaneous processing, is also a common configuration in low latency solutions
so that the processing of incoming data can be spread out over many processors simultaneously, and not
bottlenecked by a single or small number of processors.
1.3.3 Replication
To provide better response time for users located around the world, some applications maintain exact copies of
data sets in multiple physical locations. Replication solutions minimize the performance impact of analytics and
queries on the primary transactional operating environment.
Such a solution must synchronize the physically distributed data set copies. Most database management systems
have replication utilities to do this work. These utilities work best when the data sets are all maintained in the
same database management system technology. Replication solutions usually monitor the log of changes to the
data set, not the data set itself. They minimize the impact on any operational applications because they do not
compete with the applications for access to the data set. Only data from the change log passes between
replicated copies. Standard replication solutions are near-real-time; there is a small delay between a change in
one copy of the data set and another.
Because the benefits of replication solutions — minimal effect on the source data set and minimal amount of
data being passed — are very desirable, replication is used in many data integration solutions, even those that
do not include long distance physical distribution. The database management utilities do not require extensive
programming, so there tend to be few programming bugs.
Replication utilities work optimally when source and target data sets are exact copies of each other. Differences
between source and target introduce risks to synchronization. If the ultimate target is not an exact copy of the
source then it is necessary to maintain a staging area to house an exact copy of the sources. This requires extra
disk usage and possibly extra database technology.
Data replication solutions are not optimal if changes to the data may occur at multiple copy sites. If it is possible
that the same piece of data is changed at two different sites, then there is a risk that the data might get
unsynchronized, or one of the sites may have its changes overwritten without warning. (See Chapter 6.)
1.3.4 Archiving
Data that is used infrequently or not actively used may be moved to an alternate data structure or storage
solution that is less costly to the organization. ETL functions can be used to transport and possibly transform the
archive data to the data structures in the archive environment. Use archives to store data from applications that
are being retired, as well as data from production operational systems that have not been used for a long time, to
improve operational efficiency.
It is critical to monitor archive technology to ensure that the data is still accessible when technology changes.
Having an archive in an older structure or format unreadable by newer technology can be a risk, especially for
data that is still legally required. (See Chapter 9.)
A canonical data model is a common model used by an organization or data exchange group that standardizes
the format in which data will be shared. In a hub-and-spoke data interaction design pattern, all systems that
want to provide or receive data interact only with a central information hub. Data is transformed from or to a
sending or receiving system based on a common or enterprise message format for the organization (a canonical
model). (See Chapter 5.) Use of a canonical model limits the number of data transformations needed by any
system or organization exchanging data. Each system needs to transform data only to and from the central
canonical model, rather than to the format of the multitude of systems with which it may want to exchange data.
Although developing and agreeing on a shared message format is a major undertaking, having a canonical
model can significantly reduce the complexity of data interoperability in an enterprise, and thus greatly lower
the cost of support. The creation and management of the common canonical data model for all data interactions
is a complex item of overhead that is required in the implementation of an enterprise data integration solution
using a hub-and-spoke interaction model. It is justifiable in support of managing the data interactions between
more than three systems and critical for managing data interactions in environments of more than 100
application systems.
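A toy illustration of the idea: each system translates only to and from a canonical customer message rather than to every other system's format. The 'billing' and 'CRM' record layouts and field names below are hypothetical. With N systems, each needs at most one pair of translations to and from the canonical model, instead of a translation for every other system with which it exchanges data.

# Each system translates only to and from the canonical customer message,
# not to every other system's format.
def billing_to_canonical(rec: dict) -> dict:
    return {"customer_id": rec["cust_no"], "name": rec["cust_nm"], "country": rec["ctry_iso"]}

def canonical_to_crm(msg: dict) -> dict:
    return {"id": msg["customer_id"], "displayName": msg["name"], "countryCode": msg["country"]}

billing_record = {"cust_no": 42, "cust_nm": "Acme Ltd", "ctry_iso": "GB"}
canonical_msg = billing_to_canonical(billing_record)      # spoke -> hub
crm_record = canonical_to_crm(canonical_msg)              # hub -> spoke
print(crm_record)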
Interaction models describe ways to make connections between systems in order to transfer data.
1.3.6.1 Point-to-point
The vast majority of interactions between systems that share data do so ‘point-to-point’; they pass data directly
to each other. This model makes sense in the context of a small set of systems. However, it becomes quickly
inefficient and increases organizational risk when many systems require the same data from the same sources. Issues with point-to-point interaction include:
• Impacts to processing: If source systems are operational, then the workload from supplying data
could affect processing.
• Potential for inconsistency: Design issues arise when multiple systems require different versions or
formats of the data. The use of multiple interfaces to obtain data will lead to inconsistencies in the data
sent to downstream systems.
1.3.6.2 Hub-and-spoke
The hub-and-spoke model, an alternative to point-to-point, consolidates shared data (either physically or
virtually) in a central data hub that many applications can use. All systems that want to exchange data do so
through a central common data control system, rather than directly with one another (point-to-point). Data
Warehouses, Data Marts, Operational Data Stores, and Master Data Management hubs are the most well-known
examples of data hubs.
The hubs provide consistent views of the data with limited performance impact on the source systems. Data
hubs even minimize the number of systems and extracts that must access the data sources, thus minimizing the
impact on the source system resources. Adding new systems to the portfolio only requires building interfaces to
the data hub. Hub-and-spoke interaction is more efficient and can be cost-justified even if the number of
systems involved is relatively small, but it becomes critical to managing a portfolio of systems in the hundreds
or thousands.
Enterprise Service Buses (ESB) are the data integration solution for near real-time sharing of data between
many systems, where the hub is a virtual concept of the standard format or the canonical model for sharing data
in the organization.
Hub-and-spoke may not always be the best solution. For some situations, the latency of a hub-and-spoke model is unacceptable or its performance is insufficient. The hub itself creates overhead in a hub-and-spoke architecture. A point-to-point
solution would not require the hub. However, the benefits of the hub outweigh the drawbacks of the overhead as
soon as three or more systems are involved in sharing data. Use of the hub-and-spoke design pattern for the
interchange of data can drastically reduce the proliferation of data transformation and integration solutions and
thus dramatically simplify the necessary organizational support.
A publish and subscribe model involves systems pushing data out (publish), and other systems pulling data in
(subscribe). Systems providing data are listed in a catalog of data services, and systems looking to consume data
subscribe to those services. When data is published, the data is automatically sent to the subscribers.
When multiple data consumers want a certain set of data or data in a certain format, developing that data set
centrally and making it available to all who need it ensures that all constituents receive a consistent data set in a
timely manner.
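A minimal in-process sketch of the pattern follows; in practice the hub is a messaging platform or ESB rather than a Python object, and the topic name and subscribers are hypothetical.

from collections import defaultdict

class DataTopicBus:
    """Toy publish-and-subscribe hub: publishers push to a topic, subscribers receive the data."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, callback) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, payload: dict) -> None:
        for callback in self._subscribers[topic]:
            callback(payload)

bus = DataTopicBus()
bus.subscribe("customer-address-changes", lambda p: print("Warehouse received:", p))
bus.subscribe("customer-address-changes", lambda p: print("CRM received:", p))
bus.publish("customer-address-changes", {"customer_id": 42, "city": "Leeds"})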
Coupling describes the degree to which two systems are entwined. Two systems that are tightly coupled usually
have a synchronous interface, where one system waits for a response from the other. Tight coupling represents a
riskier operation: if one system is unavailable then they are both effectively unavailable, and the business
continuity plan for both has to be the same. (See Chapter 6.)
Where possible, loose coupling is a preferred interface design, where data is passed between systems without
waiting for a response and one system may be unavailable without causing the other to be unavailable. Loose
coupling can be implemented using various techniques with services, APIs, or message queues. Figure 69
illustrates a possible loose coupling design.
Figure 69 contrasts tight coupling and loose coupling between Process A and Process B.
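The sketch below shows the loose-coupling idea with an in-process message queue from Python's standard library: the sending process continues without waiting for the receiver. In practice this role is played by a message broker or ESB; the process names and payload are hypothetical.

import queue, threading

message_queue = queue.Queue()

def process_a_send(update):
    """Process A places the update on the queue and continues without waiting for Process B."""
    message_queue.put(update)

def process_b_worker():
    """Process B consumes updates whenever it is available; Process A is never blocked by it."""
    while True:
        update = message_queue.get()
        if update is None:          # shutdown signal for this demo
            break
        print("Process B applied:", update)

consumer = threading.Thread(target=process_b_worker, daemon=True)
consumer.start()
process_a_send({"order_id": 7, "status": "SHIPPED"})   # returns immediately
process_a_send(None)
consumer.join()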
Service Oriented Architecture using an Enterprise Service Bus is an example of a loosely coupled data
interaction design pattern.
Where the systems are loosely coupled, replacement of systems in the application inventory can theoretically be
performed without rewriting the systems with which they interact, because the interaction points are well-
defined.
Orchestration is the term used to describe how multiple processes are organized and executed in a system. All
systems handling messages or data packets must be able to manage the order of execution of those processes, in
order to preserve consistency and continuity.
Process Controls are the components that ensure that shipment, delivery, extraction, and loading of data are accurate and complete. Such controls are an often-overlooked aspect of basic data movement architecture.
In an enterprise application integration model (EAI), software modules interact with one another only through
well-defined interface calls (application programming interfaces – APIs). Data stores are updated only by their
own software modules; other software cannot reach into the data in an application, but can only access it through
the defined APIs. EAI is built on object-oriented concepts, which emphasize reuse and the ability to replace any
module without impact on any other.
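A minimal sketch of the EAI principle, using a hypothetical module, is shown below: the data store is private to the module, and other software can read or change it only through the defined API calls.

# Sketch: EAI-style encapsulation. Only the module's own API touches its data store.
class CustomerModule:
    def __init__(self):
        self._store = {}                           # private data store; never accessed directly

    def add_customer(self, customer_id, name):     # defined API call
        self._store[customer_id] = {"name": name}

    def get_customer(self, customer_id):           # defined API call
        return self._store.get(customer_id)

crm = CustomerModule()
crm.add_customer(42, "Acme Corp")
print(crm.get_customer(42))    # other modules call the API, never the store itself

Because the interface is the only contact point, the module behind it can be replaced without affecting its callers, which is the reuse and replaceability benefit described above.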
An Enterprise Service Bus is a system that acts as an intermediary between systems, passing messages between
them. Applications can send and receive messages or files using the ESB, and are encapsulated from other
processes existing on the ESB. An example of loose coupling, the ESB acts as the service between the
applications. (See Figure 70.)
Figure 70 Enterprise Service Bus: Application 1 and Application 2 exchanging messages through a Process Orchestration Manager
Most mature enterprise data integration strategies utilize the idea of service-oriented architecture (SOA), where
the functionality of providing data or updating data (or other data services) can be provided through well-
defined service calls between applications. With this approach, applications do not have to have direct
interaction with or knowledge of the inner workings of other applications. SOA enables application
independence and the ability for an organization to replace systems without needing to make significant
changes to the systems that interfaced with them.
The goal of service-oriented architecture is to have well-defined interaction between self-contained software
modules. Each module performs functions (a.k.a. provides services) to other software modules or to human
consumers. The key concept is that SOA provides independent services: the service has no foreknowledge of the calling application, and the implementation of the service is a black box to the calling
application. A service-oriented architecture may be implemented with various technologies including web
services, messaging, RESTful APIs, etc. Services are usually implemented as APIs (application programming
interfaces) that are available to be called by application systems (or human consumers). A well-defined API
registry describes what options are available, parameters that need to be provided, and resulting information that
is provided.
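The registry itself can be as simple as structured Metadata about each service. A sketch with hypothetical service names, parameters, and owners (not an actual registry product or API):

# Sketch: a minimal API registry entry describing a data service -
# what it is called, what parameters it needs, and what it returns.
api_registry = {
    "get_customer": {
        "description": "Retrieve a customer record by identifier",
        "parameters": {"customer_id": "integer, required"},
        "returns": {"customer_id": "integer", "name": "string", "status": "string"},
        "owner": "Customer Master Data service team",
    }
}

def describe(service_name):
    entry = api_registry[service_name]
    return f"{service_name}: {entry['description']} (parameters: {entry['parameters']})"

print(describe("get_customer"))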
Data services, which may include the addition, deletion, update, and retrieval of data, are specified in a catalog
of available services. To achieve the enterprise goals of scalability (supporting integrations between all
applications in the enterprise without using unreasonable amounts of resources to do so) and reuse (having
services that are leveraged by all requestors of data of a type), a strong governance model must be established
around the design and registration of services and APIs. Prior to developing new data services, it is necessary to
ensure that no service already exists that could provide the requested data. In addition, new services need to be
designed to meet broad requirements so that they will not be limited to the immediate need but can be reused.
Event processing is a method of tracking and analyzing (processing) streams of information (data) about things
that happen (events), and deriving a conclusion from them. Complex event processing (CEP) combines data
from multiple sources to identify meaningful events (such as opportunities or threats) to predict behavior or
activity and automatically trigger real-time response, such as suggesting a product for a consumer to purchase.
Rules are set to guide the event processing and routing.
Organizations can use complex event processing to predict behavior or activity and automatically trigger real-
time response. Events such as sales leads, web clicks, orders, or customer service calls may happen across the
various layers of an organization. Alternatively, they may include news items, text messages, social media
posts, stock market feeds, traffic reports, weather reports, or other kinds of data. An event may also be defined
as a change of state, when a measurement exceeds a predefined threshold of time, temperature, or other value.
CEP presents some data challenges. In many cases, the rate at which events occur makes it impractical to
retrieve the additional data necessary to interpret the event as it occurs. Efficient processing typically mandates
pre-positioning some data in the CEP engine’s memory.
Supporting complex event processing requires an environment that can integrate vast amounts of data of various
types. Because of the volume and variety of data usually involved in creating predictions, complex event
processing is often tied to Big Data. It often requires use of technologies that support ultra-low latency
requirements such as processing real-time streaming data and in-memory databases. (See Chapter 14.)
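A highly simplified sketch of rule-driven event processing follows, with hypothetical rules and reference data; the reference data is held in memory because, as noted above, there is rarely time to look it up as each event arrives.

# Sketch: rule-driven complex event processing over a stream of events.
frequently_bought_together = {"laptop": "laptop bag"}   # pre-positioned reference data

def rules(event):
    # Rule 1: suggest a related product in real time
    if event["type"] == "purchase" and event["item"] in frequently_bought_together:
        return f"suggest {frequently_bought_together[event['item']]}"
    # Rule 2: flag a possible fraudulent charge when a threshold is exceeded
    if event["type"] == "charge" and event["amount"] > 10_000:
        return "alert: possible fraud"
    return None

stream = [
    {"type": "purchase", "item": "laptop"},
    {"type": "charge", "amount": 15_000},
]
for event in stream:
    action = rules(event)
    if action:
        print(action)          # automatically triggered real-time response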
When data exists in disparate data stores, it can be brought together in ways other than physical integration.
Data Federation provides access to a combination of individual data stores, regardless of structure. Data
Virtualization enables distributed databases, as well as multiple heterogeneous data stores, to be accessed and
viewed as a single database. (See Chapter 6.)
Software-as-a-service (SaaS) is a delivery and licensing model. An application is licensed to provide services,
but the software and data are located at a data center controlled by the software vendor, rather than in the data
center of the licensing organization. There are similar concepts for providing various tiers of computing
infrastructure-as-a-service (IT-as-a-service, platform-as-a-service, database-as-a-service).
One definition of Data-as-a-Service (DaaS) is data licensed from a vendor and provided on demand, rather than
stored and maintained in the data center of the licensing organization. A common example includes information
on the securities sold through a stock exchange and associated prices (current and historical).
Although Data-as-a-Service certainly lends itself to vendors that sell data to stakeholders within an industry, the
‘service’ concept is also used within an organization to provide enterprise data or data services to various
functions and operational systems. Service organizations provide a catalog of services available, service levels,
and pricing schedules.
Prior to the emergence of cloud computing, integration could be categorized as either internal or business to
business (B2B). Internal integration requirements are serviced through an on-premises middleware platform,
and typically use a service bus (ESB) to manage exchange of data between systems. Business-to-business
integration is serviced through EDI (electronic data interchange) gateways or value-added networks (VAN) or
market places.
The advent of SaaS applications created a new kind of demand for integrating data located outside of an
organization’s data center, met through cloud-based integration. Since their emergence, many such services
have also developed the capability to integrate on-premises applications as well as function as EDI gateways.
Cloud-based integration solutions are usually run as SaaS applications in the data centers of the vendors and not
the organizations that own the data being integrated. Cloud-based integration involves interacting with the SaaS
application data to be integrated using SOA interaction services. (See Chapter 6.)
Data Exchange Standards are formal rules for the structure of data elements. ISO (the International Organization for Standardization) has developed data exchange standards, as have many industries. A data exchange specification is
a common model used by an organization or data exchange group that standardizes the format in which data
will be shared. An exchange pattern defines a structure for data transformations needed by any system or
organization exchanging data. Data needs to be mapped to the exchange specification.
Although developing and agreeing on a shared message format is a major undertaking, having an agreed upon
exchange format or data layout between systems can significantly simplify data interoperability in an enterprise,
lowering the cost of support and enabling better understanding of the data.
The National Information Exchange Model (NIEM) was developed to exchange documents and transactions
across government organizations in the United States. The intention is that the sender and receiver of
information share a common, unambiguous understanding of the meaning of that information. Conformance to
NIEM ensures that a basic set of information is well understood and carries the same consistent meaning across
various communities, thus allowing interoperability.
NIEM uses Extensible Markup Language (XML) for schema definitions and element representation, which
allows the structure and meaning of data to be defined through simple, but carefully defined XML syntax rules.
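Mapping internal data to an agreed exchange layout can be sketched as follows. The element names are illustrative only; they are not actual NIEM elements or any other published standard.

# Sketch: map an internal record to a shared XML exchange layout.
import xml.etree.ElementTree as ET

internal_record = {"cust_nm": "Acme Corp", "cust_no": "42"}

# The exchange specification: internal field -> agreed standard element name
exchange_spec = {"cust_nm": "PartyName", "cust_no": "PartyIdentifier"}

message = ET.Element("PartyExchange")
for internal_field, standard_element in exchange_spec.items():
    ET.SubElement(message, standard_element).text = internal_record[internal_field]

print(ET.tostring(message, encoding="unicode"))
# <PartyExchange><PartyName>Acme Corp</PartyName><PartyIdentifier>42</PartyIdentifier></PartyExchange>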
Defining data integration requirements involves understanding the organization’s business objectives, as well as
the data required and the technology initiatives proposed to meet those objectives. It is also necessary to gather
any relevant laws or regulations regarding the data to be used. Some activities may need to be restricted due to
the data contents, and knowing up front will prevent issues later. Requirements must also account for
organizational policy on data retention and other parts of the data lifecycle. Often requirements for data
retention will differ by data domain and type.
Data integration and lifecycle requirements are usually defined by business analysts, data stewards, and
architects in various functions, including IT, who have a desire to get data in a certain place, in a certain format,
and integrated with other data. The requirements will determine the type of DII interaction model, which then
determines the technology and services necessary to fulfill the requirements.
The process of defining requirements creates and uncovers valuable Metadata. This Metadata should be
managed throughout the data lifecycle, from discovery through operations. The more complete and accurate an
organization’s Metadata, the better its ability to manage the risks and costs of data integration.
Data discovery should be performed prior to design. The goal of data discovery is to identify potential sources
of data for the data integration effort. Discovery will identify where data might be acquired and where it might
be integrated. The process combines a technical search, using tools that scan the Metadata and/or actual
contents of an organization’s data sets, with subject matter expertise (i.e., interviewing people who work with
the data of interest).
Discovery also includes high-level assessment of data quality, to determine whether the data is fit for the
purposes of the integration initiative. This assessment requires not only reviewing existing documentation and interviewing subject matter experts, but also verifying information gathered against the actual data through data
profiling or other analysis. (See Section 2.1.4.) In almost all cases, there will be discrepancies between what is
believed about a data set and what is actually found to be true.
Data discovery produces or adds to an inventory of organizational data. This inventory should be maintained in
a Metadata repository. Ensure this inventory is maintained as a standard part of integration efforts: add or
remove data stores, document structure changes.
Most organizations have a need to integrate data from their internal systems. However, data integration
solutions may also involve the acquisition of data from outside the organization. There is a vast and ever
growing amount of valuable information available for free, or from data vendors. Data from external sources
can be extremely valuable when integrated with data from within an organization. However, acquiring and
integrating external data takes planning.
The process of data discovery will also uncover information about how data flows through an organization. This
information can be used to document high-level data lineage: how the data under analysis is acquired or created
by the organization, where it moves and is changed within the organization, and how the data is used by the
organization for analytics, decision-making, or event triggering. Detailed lineage can include the rules
according to which data is changed, and the frequency of changes.
Analysis of lineage may identify updates required to documentation of systems in use. Custom-coded ETL and
other legacy data manipulation objects should be documented to ensure that the organization can analyze the
impact of any changes in the data flow.
The analysis process may also identify opportunities for improvements in the existing data flow. For example, it may find code that can be upgraded to a simple call to a function in a tool, or that can be discarded as no longer relevant. Sometimes an old tool is performing a transformation that is undone later in the process. Finding and
removing these inefficiencies can greatly help with the project’s success and with an organization’s overall
ability to use its data.
Understanding data content and structure is essential to successful integration of data. Data profiling contributes
to this end. Actual data structure and contents always differ from what is assumed. Sometimes differences are
small; other times they are large enough to derail an integration effort. Profiling can help integration teams
discover these differences and use that knowledge to make better decisions about sourcing and design. If data
profiling is skipped, then information that should influence design will not be discovered until testing or
operations. Basic profiling includes analysis of:
• Data format as defined in the data structures and inferred from the actual data
• Data population, including the levels of null, blank, or defaulted data
• Data values and how closely they correspond to a defined set of valid values
• Patterns and relationships internal to the data set, such as related fields and cardinality rules
• Relationships to other data sets
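Checks like those above can be automated. A minimal profiling sketch over a small hypothetical data set, using only the Python standard library:

# Sketch: basic profiling of one column - population, valid values, and frequencies.
from collections import Counter

rows = [                                           # hypothetical source data
    {"country": "US"}, {"country": "CA"}, {"country": ""},
    {"country": "US"}, {"country": "XX"},
]
valid_values = {"US", "CA", "MX"}

values = [r["country"] for r in rows]
null_or_blank = sum(1 for v in values if not v)
invalid = sum(1 for v in values if v and v not in valid_values)

print("populated:", len(values) - null_or_blank, "of", len(values))
print("blank or null:", null_or_blank)
print("outside valid value set:", invalid)
print("value frequencies:", Counter(values))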
More extensive profiling of the potential source and target data sets is required to understand how well the data
meets the requirements of the particular data integration initiative. Profile both the sources and targets to
understand how to transform the data to match requirements.
One goal of profiling is to assess the quality of data. Assessing the fitness of the data for a particular use
requires documenting business rules and measuring how well the data meets those business rules. Assessing
accuracy requires comparing to a definitive set of data that has been determined to be correct. Such data sets are
not always available, so measuring accuracy may not be possible, especially as part of a profiling effort.
As with high-level data discovery, data profiling includes verifying assumptions about the data against the
actual data. Capture results of data profiling in a Metadata repository for use on later projects and use what is
learned from the process to improve the accuracy of existing Metadata (Olson, 2003). (See Chapter 13.)
The requirement to profile data must be balanced with an organization’s security and privacy regulations. (See
Chapter 7.)
Business rules are a critical subset of requirements. A business rule is a statement that defines or constrains an
aspect of business processing. Business rules are intended to assert business structure or to control or influence
the behavior of the business. Business rules fall into one of four categories: definitions of business terms, facts
relating terms to each other, constraints or action assertions, and derivations.
Business rules are used to support Data Integration and Interoperability at various points in the process.
For Master Data Management, business rules include match rules, merge rules, survivorship rules, and trust
rules. For data archiving, data warehousing, and other situations where a data store is in use, the business rules
also include data retention rules.
Gathering business rules is also called rules harvesting or business rule mining. The business analyst or data
steward can extract the rules from existing documentation (like use cases, specifications, or system code), or
they may also organize workshops and interviews with subject matter experts (SMEs), or both.
Data integration solutions should be specified at both the enterprise level and the individual solution level (see
Chapter 4). By establishing enterprise standards, the organization saves time in implementing individual
solutions, because assessments and negotiations have been performed in advance of need. An enterprise
approach saves money in the cost of licenses through group discounts and in the costs of operating a consistent
and less complex set of solutions. Operational resources that support and back up one another can be part of a
shared pool.
Design a solution to meet the requirements, reusing as many of the existing Data Integration and
Interoperability components as is feasible. A solution architecture indicates the techniques and technologies that
will be used. It will include an inventory of the involved data structures (both persistent and transient, existing
and required), an indication of the orchestration and frequency of data flow, regulatory and security concerns
and remediation, and operating concerns around backup and recovery, availability, and data archive and
retention.
Determine which interaction model or combination will fulfill the requirements – hub-and-spoke, point-to-
point, or publish-subscribe. If the requirements match an existing interaction pattern already implemented, re-
use the existing system as much as possible, to reduce development efforts.
Create or re-use existing integration flows to move the data. These data services should be companions to
existing similar data services, but be careful not to create multiple almost-identical services, as troubleshooting and support become increasingly difficult if services proliferate. If an existing data flow can be modified to
support multiple needs, it may be worthwhile to make that change instead of creating a new service.
Any data exchange specification design should start with industry standards, or other exchange patterns already
existing. When possible, make any changes to existing patterns generic enough to be useful to other systems;
having specific exchange patterns that only relate to one exchange has the same issues as point-to-point
connections.
Data structures needed in Data Integration and Interoperability include those in which data persists, such as
Master Data Management hubs, data warehouses and marts, and operational data stores, and those that are
transient and used only for moving or transforming data, such as interfaces, message layouts, and canonical
models. Both types should be modeled. (See Chapter 5.)
Almost all data integration solutions include transforming data from source to target structures. Mapping
sources to targets involves specifying the rules for transforming data from one location and format to another.
Transformation may be performed on a batch schedule, or triggered by the occurrence of a real-time event. It
may be accomplished through physical persistence of the target format or through virtual presentation of the
data in the target format.
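A source-to-target mapping can itself be captured as data: for each target field, the source field and the transformation rule. A sketch with hypothetical field names and rules:

# Sketch: a source-to-target mapping specification applied to one record.
source_row = {"cust_nm": "  acme corp ", "cntry": "us", "bal": "1250.50"}

mapping = {
    # target field: (source field, transformation rule)
    "customer_name": ("cust_nm", lambda v: v.strip().title()),
    "country_code":  ("cntry",   str.upper),
    "balance":       ("bal",     float),
}

target_row = {target: rule(source_row[source])
              for target, (source, rule) in mapping.items()}
print(target_row)   # {'customer_name': 'Acme Corp', 'country_code': 'US', 'balance': 1250.5}

Keeping the mapping as a specification, separate from the code that applies it, also makes the transformation rules reviewable by business subject matter experts.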
The flow of data in a data integration solution must be designed and documented. Data orchestration is the
pattern of data flows from start to finish, including intermediate steps, required to complete the transformation
and/or transaction.
Batch data integration orchestration will indicate the frequency of the data movement and transformation. Batch
data integration is usually coded into a scheduler that triggers the start at a certain time, periodicity, or when an
event occurs. The schedule may include multiple steps with dependencies.
Real-time data integration orchestration is usually triggered by an event, such as new or updated data. Real-time
data integration orchestration is usually more complex and implemented across multiple tools. It may not be
linear in nature.
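Dependencies between batch steps determine the order of execution. A minimal orchestration sketch, with hypothetical step names, resolves that order before running the steps; a real scheduler would also handle failures, restarts, and timing.

# Sketch: run batch integration steps in dependency order (a simple topological sort).
from graphlib import TopologicalSorter   # standard library, Python 3.9+

steps = {                      # step -> steps it depends on (hypothetical flow)
    "extract_orders": set(),
    "extract_customers": set(),
    "transform": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform"},
}

for step in TopologicalSorter(steps).static_order():
    print("running", step)     # dependencies always run before the steps that need them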
Develop services to access, transform, and deliver data as specified, matching the interaction model selected.
Tools or vendor suites are most frequently used to implement data integration solutions, such as data
transformation, Master Data Management, data warehousing, etc. Using consistent tools or standard vendor
suites across the organization for these various purposes can simplify operational support and lower operating
costs by enabling shared support solutions.
Integration or ETL data flows will usually be developed within tools specialized to manage those flows in a
proprietary way. Batch data flows will be developed in a scheduler (usually the enterprise standard scheduler)
that will manage the order, frequency, and dependency of executing the data integration pieces that have been
developed.
Interoperability requirements may include developing mappings or coordination points between data stores.
Some organizations use an ESB so that applications can publish changes to data and other applications can subscribe to data that is created or changed in the organization. The enterprise service bus polls the applications constantly to see if they have any data to publish, and delivers to them new or changed data to which they have subscribed.
Developing real-time data integration flows involves monitoring for events that should trigger the execution of
services to acquire, transform, or publish data. This is usually implemented within one or multiple proprietary
technologies and is best implemented with a solution that can manage the operation across technologies.
Data needs to be moved when new applications are implemented or when applications are retired or merged.
This process involves transformation of data to the format of the receiving application. Almost all application
development projects involve some data migration, even if all that is involved is the population of Reference
Data. Migration is not quite a one-time process, as it needs to be executed for testing phases as well as final
implementation.
Data migration projects are frequently under-estimated or under-designed, because programmers are told to
simply move the data; they do not engage in the analysis and design activities required for data integration.
When data is migrated without proper analysis, it often looks different from the data that came in through the
normal processing. Or the migrated data may not work with the application as anticipated. Profiling data of core
operational applications will usually highlight data that has been migrated from one or more generations of
previous operational systems and does not meet the standards of the data that enters the data set through the
current application code. (See Chapter 6.)
Systems where critical data is created or maintained need to make that data available to other systems in the
organization. New or changed data should be pushed by data producing applications to other systems
(especially data hubs and enterprise data buses) either at the time of data change (event-driven) or on a periodic
schedule.
Best practice is to define common message definitions (canonical model) for the various types of data in the
organization and let data consumers (either applications or individuals) who have appropriate access authority
subscribe to receive notification of any changes to data of interest.
Developing complex event processing flows requires:
• Preparation of the historical data about an individual, organization, product, or market and pre-
population of the predictive models
• Processing the real-time data stream to fully populate the predictive model and identify meaningful
events (opportunities or threats)
• Executing the triggered action in response to the prediction
Preparation and pre-processing of the historical data needed in the predictive model may be performed in
nightly batch processes or in near real-time. Usually some of the predictive model can be populated in advance
of the triggering event, such as identifying what products are usually bought together in preparation of
suggesting an additional item for purchase.
Some processing flows trigger a response to every event in the real-time stream, such as adding an item to a
shopping cart; other processing flows attempt to identify particularly meaningful events that trigger action, such
as a suspected fraudulent charge attempt on a credit card.
The response to the identification of a meaningful event may be as simple as a warning being sent out or as
complex as the automatic deployment of armed forces.
As previously noted (see Section 2.1), an organization will create and uncover valuable Metadata during the
process of developing DII solutions. This Metadata should be managed and maintained to ensure proper
understanding of the data in the system, and to prevent the need to rediscover it for future solutions. Reliable
Metadata improves an organization’s ability to manage risks, reduce costs, and obtain more value from its data.
Document the data structures of all systems involved in data integration as source, target, or staging. Include
business definitions and technical definitions (structure, format, size), as well as the transformation of data
between the persistent data stores. Whether data integration Metadata is stored in documents or a Metadata
repository, it should not be changed without a review and approval process from both business and technical
stakeholders.
Most ETL tool vendors package their Metadata repositories with additional functionality that enables
governance and stewardship oversight. If the Metadata repository is utilized as an operational tool, then it may
even include operational Metadata about when data was copied and transformed between systems.
Of particular importance for DII solutions is the SOA registry, which provides controlled access to an evolving
catalog of information about the available services for accessing and using the data and functionality in an
application.
Activate the data services that have been developed and tested. Real-time data processing requires real-time
monitoring for issues. Establish parameters that indicate potential problems with processing, as well as direct
notification of issues. Automated as well as human monitoring for issues should be established, especially as the
complexity and risk of the triggered responses rises. There are cases, for example, where issues with automated
financial securities trading algorithms have triggered actions that have affected entire markets or bankrupted
organizations.
Data interaction capabilities must be monitored and serviced at the same service level as the most demanding
target application or data consumer.
3. Tools
A data transformation engine (or ETL tool) is the primary tool in the data integration toolbox, central to every
enterprise data integration program. These tools usually support the operation as well as the design of the data
transformation activities.
Extremely sophisticated tools exist to develop and perform ETL, whether batch or real-time, physically or
virtually. For single use point-to-point solutions, data integration processing is frequently implemented through
custom coding. Enterprise level solutions usually require the use of tools to perform this processing in a
standard way across the organization.
Basic considerations in selecting a data transformation engine should include whether it is necessary to handle
batch as well as real-time functionality, and whether unstructured as well as structured data needs to be
accommodated, as the most mature tools exist for batch-oriented processing of structured data only.
Data transformation engines usually perform extract, transform, and load physically on data; however, data
virtualization servers perform data extract, transform, and integrate virtually. Data virtualization servers can
combine structured and unstructured data. A data warehouse is frequently an input to a data virtualization
server, but a data virtualization server does not replace the data warehouse in the enterprise information
architecture.
An enterprise service bus (ESB) refers to both a software architecture model and a type of message-oriented
middleware used to implement near real-time messaging between heterogeneous data stores, applications, and
servers that reside within the same organization. Most internal data integration solutions that need to execute more frequently than daily use this architecture and this technology. Most commonly, an ESB is used in
asynchronous format to enable the free flow of data. An ESB can also be used synchronously in certain
situations.
The enterprise service bus implements incoming and outgoing message queues on each of the systems
participating in message interchange with an adapter or agent installed in each environment. The central
processor for the ESB is usually implemented on a server separate from the other participating systems. The
processor keeps track of which systems have subscribed interest in what kinds of messages. The central
processor continuously polls each participating system for outgoing messages and deposits incoming messages
into the message queue for subscribed types of messages and messages that have been directly addressed to that
system.
This model is called ‘near real-time’ because the data can take up to a couple of minutes to get from sending
system to receiving system. This is a loosely coupled model and the system sending data will not wait for
confirmation of receipt and update from the receiving system before continuing processing.
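The polling-and-routing loop can be sketched as follows, with hypothetical participating systems: the central processor moves messages from each system's outgoing queue into the incoming queues of the systems subscribed to that message type.

# Sketch: an ESB-style central processor polling outgoing queues and routing
# messages to subscribers' incoming queues.
from collections import deque

outgoing = {"orders_app": deque([("OrderCreated", {"order_id": 7})])}
incoming = {"warehouse_app": deque(), "billing_app": deque()}
subscriptions = {"OrderCreated": ["warehouse_app", "billing_app"]}

def poll_once():
    for system, q in outgoing.items():
        while q:
            msg_type, payload = q.popleft()
            for subscriber in subscriptions.get(msg_type, []):
                incoming[subscriber].append((msg_type, payload))

poll_once()                       # in practice this loop runs continuously
print(incoming)                   # both subscribers now hold the OrderCreated message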
Many data integration solutions are dependent on business rules. An important form of Metadata, these rules
can be used in basic integration and in solutions that incorporate complex event processing to enable an
organization to respond to events in near real-time. A business rules engine that allows non-technical users to
manage business rules implemented by software is a very valuable tool that will enable evolution of the solution
at a lower cost, because a business rules engine can support changes to predictive models without technical code
changes. For example, models that predict what a customer might want to purchase may be defined as business
rules rather than code changes.
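The value of the approach is that the rules live as data that business users can maintain, while a small generic engine evaluates them. A sketch with hypothetical rule content:

# Sketch: rules held as data and evaluated by a generic engine, so behavior
# can change by editing the rules rather than the code.
rules = [
    {"if_field": "segment", "equals": "premium", "then": "offer_upgrade"},
    {"if_field": "basket",  "equals": "laptop",  "then": "suggest_laptop_bag"},
]

def evaluate(record):
    return [r["then"] for r in rules if record.get(r["if_field"]) == r["equals"]]

print(evaluate({"segment": "premium", "basket": "laptop"}))
# ['offer_upgrade', 'suggest_laptop_bag']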
Data modeling tools should be used to design not only the target but also the intermediate data structures needed
in data integration solutions. The structure of the messages or streams of data that pass between systems and
organizations, and are not usually persisted, should nevertheless be modeled. The flow of data between systems
and organizations should also be designed, as should complex event processes.
Data profiling involves statistical analysis of data set contents to understand format, completeness, consistency,
validity, and structure of the data. All data integration and interoperability development should include detailed
assessment of potential data sources and targets to determine whether the actual data meets the needs of the
proposed solution. Since most integration projects involve a significant amount of data, the most efficient
means of conducting this analysis is to use a data profiling tool. (See Section 2.1.4 and Chapter 13.)
A Metadata repository contains information about the data in an organization, including data structure, content,
and the business rules for managing the data. During data integration projects, one or more Metadata
repositories may be used to document the technical structure and business meaning of the data being sourced,
transformed, and targeted.
Usually the rules regarding data transformation, lineage, and processing used by the data integration tools are
also stored in a Metadata repository as are the instructions for scheduled processes such as triggers and
frequency.
Every tool usually has its own Metadata repository. Suites of tools from the same vendor may share a Metadata
repository. One Metadata repository may be designated as a central point for consolidating data from the
various operational tools. (See Chapter 12.)
4. Techniques
Several of the important techniques for designing data integration solutions are described in the Essential
Concepts in this chapter. The basic goals are to keep applications loosely coupled, to limit the number of interfaces that must be developed and managed by using a hub-and-spoke approach, and to create standard (or
canonical) interfaces.
5. Implementation Guidelines
All organizations have some form of DII already in place – so the readiness/risk assessment should be around
enterprise integration tool implementation, or enhancing capabilities to allow interoperability.
Implementing enterprise data integration solutions is usually cost-justified based on implementation between
many systems. Design an enterprise data integration solution to support the movement of data between many
applications and organizations, and not just the first one to be implemented.
Many organizations spend their time reworking existing solutions instead of bringing additional value. Focus on
implementing data integration solutions where none or limited integration currently exists, rather than replacing
working data integration solutions with a common enterprise solution across the organization.
Certain data projects can justify a data integration solution focused only on a particular application, such as a
data warehouse or Master Data Management hub. In those cases, any additional use of the data integration
solution adds value to the investment, because the first system use already achieved the justification.
Application support teams prefer to manage data integration solutions locally. They will perceive that the cost
of doing so is lower than leveraging an enterprise solution. The software vendors that support such teams will
also prefer that they leverage the data integration tools that they sell. Therefore, it is necessary to sponsor the
implementation of an enterprise data integration program from a level that has sufficient authority over solution
design and technology purchase, such as from IT enterprise architecture. In addition, it may be necessary to
encourage application systems to participate through positive incentives, such as funding the data integration
technology centrally, and through negative incentives, such as refusing to approve the implementation of new
alternative data integration technologies.
Development projects that implement new data integration technology frequently become focused on the
technology and lose focus on the business goals. It is necessary to make sure that data integration solution implementations retain focus on the business goals and requirements, including making sure that some
participants in every project are business- or application-oriented, and not just data integration tool experts.
Organizations must determine whether responsibility for managing data integration implementations is
centralized or whether it resides with decentralized application teams. Local teams understand the data in their
applications. Central teams can build deep knowledge of tools and technologies. Many organizations develop a
Center of Excellence specializing in the design and deployment of the enterprise data integration solutions.
Local and central teams collaborate to develop solutions connecting an application into an enterprise data
integration solution. The local team should take primary responsibility for managing the solution and resolving
any problems, escalating to the Center of Excellence, if necessary.
Data integration solutions are frequently perceived as purely technical; however, to successfully deliver value,
they must be developed based on deep business knowledge. The data analysis and modeling activities should be
performed by business-oriented resources. Development of a canonical message model, or consistent standard
for how data is shared in the organization, requires a large resource commitment that should involve business
modeling resources as well as technical resources. Review all data transformation mapping design and changes
with business subject matter experts in each involved system.
6. DII Governance
Decisions about the design of data messages, data models, and data transformation rules have a direct impact on
an organization’s ability to use its data. These decisions must be business-driven. While there are many
technical considerations in implementing business rules, a purely technical approach to DII can lead to errors in
the data mappings and transformations as data flows into, through and out of an organization.
Business stakeholders are responsible for defining rules for how data should be modeled and transformed.
Business stakeholders should approve changes to any of these business rules. Rules should be captured as
Metadata and consolidated for cross-enterprise analysis. Identifying and verifying the predictive models and
defining what actions should be automatically triggered by the predictions are also business functions.
Without trust that the integration or interoperable design will perform as promised, in a secure, reliable way,
there can be no effective business value. In DII, the landscape of governance controls to support trust can be
complex and detailed. One approach is to determine what events trigger governance reviews (exceptions or
critical events). Map each trigger to reviews that engage with governance bodies. Event triggers may be part of
the System Development Life Cycle (SDLC) at Stage Gates when moving from one phase to another or as part
of User Stories. For example, architecture design compliance checklists may include such questions as: If
possible, are you using the ESB and tools? Was there a search for reusable services?
Controls may come from governance-driven management routines, such as mandated reviews of models,
auditing of Metadata, gating of deliverables, and required approvals for changes to transformation rules.
In Service Level Agreements, and in Business Continuity/Disaster Recovery plans, real-time operational data
integration solutions must be included in the same backup and recovery tier as the most critical system to which
they provide data.
Policies need to be established to ensure that the organization benefits from an enterprise approach to DII. For
example, policies can be put in place to ensure that SOA principles are followed, that new services are created
only after a review of existing services, and that all data flowing between systems goes through the enterprise
service bus.
Prior to the development of interfaces or the provision of data electronically, develop a data sharing agreement
or memorandum of understanding (MOU) which stipulates the responsibilities and acceptable use of data to be
exchanged, approved by the business data stewards of the data in question. The data sharing agreements should
specify anticipated use and access to the data, restrictions on use, as well as expected service levels, including
required system up times and response times. These agreements are especially critical for regulated industries,
or when personal or secure information is involved.
Data lineage is useful to the development of DII solutions. It is also often required for data consumers to use
data, but it is becoming even more important as data is integrated between organizations. Governance is
required to ensure that knowledge of data origins and movement is documented. Data sharing agreements may
stipulate limitations to the uses of data, and in order to abide by these, it is necessary to know where data moves and persists. There are emerging compliance standards (for example, Solvency II regulation in Europe) that require that organizations be able to describe where their data originated and how it has been changed as it has
moved through various systems.
In addition, data lineage information is required when making changes to data flows. This information must be
managed as a critical part of solution Metadata. Forward and backward data lineage (i.e., where did data get
used and where did it come from) is critical as part of the impact analysis needed when making changes to data
structures, data flows, or data processing.
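Forward and backward lineage can be represented as a directed graph of data stores and flows; impact analysis is then a traversal of that graph. A sketch with hypothetical systems:

# Sketch: lineage as a directed graph. Impact analysis walks downstream from a
# changed source; provenance analysis would walk the same graph upstream.
flows = {                          # source -> targets it feeds
    "crm": ["staging"],
    "staging": ["warehouse"],
    "warehouse": ["finance_mart", "dashboard"],
}

def downstream(node, seen=None):
    if seen is None:
        seen = set()
    for target in flows.get(node, []):
        if target not in seen:
            seen.add(target)
            downstream(target, seen)
    return seen

print(downstream("crm"))   # everything affected by a change to the CRM extract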
To measure the scale and benefits from implementing Data Integration solutions, include metrics on
availability, volume, speed, cost, and usage:
• Data Availability
o Availability of data requested
• Data Volumes and Speed
o Volumes of data transported and transformed
o Volumes of data analyzed
o Speed of transmission
o Latency between data update and availability
o Latency between event and triggered action
o Time to availability of new data sources
• Solution Costs and Complexity
o Cost of developing and managing solutions
o Ease of acquiring new data
o Complexity of solutions and operations
o Number of systems using data integration solutions
CHAPTER 9
Document and Content Management
1. Introduction
Document and Content Management entails controlling the capture, storage, access, and use of data and information stored outside relational databases. 44 Its focus is on maintaining the integrity of and enabling access to documents and other unstructured or semi-structured information, which makes it roughly equivalent to data operations management for relational databases. However, it also has strategic drivers. In many organizations, unstructured data has a direct relationship to structured data. Management decisions about such content should be applied consistently. In addition, like other types of data, documents and unstructured content are expected to be secure and of high quality. Ensuring security and quality requires governance, reliable architecture, and well-managed Metadata.

44 The types of unstructured data have evolved since the early 2000s, as the capacity to capture and store digital information has grown. The concept of unstructured data continues to refer to data that is not pre-defined through a data model, whether relational or otherwise.
Goals:
1. To comply with legal obligations and customer expectations regarding Records management.
2. To ensure effective and efficient storage, retrieval, and use of Documents and Content.
3. To ensure integration capabilities between structured and unstructured Content.
The primary business drivers for document and content management include regulatory compliance, the ability
to respond to litigation and e-discovery requests, and business continuity requirements. Good records
management can also help organizations become more efficient. Well-organized, searchable websites that result
from effective management of ontologies and other structures that facilitate searching help improve customer
and employee satisfaction.
Laws and regulations require that organizations maintain records of certain kinds of activities. Most
organizations also have policies, standards, and best practices for record keeping. Records include both paper
documents and electronically stored information (ESI). Good records management is necessary for business
continuity. It also enables an organization to respond in the case of litigation.
E-discovery is the process of finding electronic records that might serve as evidence in a legal action. As the
technology for creating, storing, and using data has developed, the volume of ESI has increased exponentially.
Some of this data will undoubtedly end up in litigation or regulatory requests.
The ability of an organization to respond to an e-discovery request depends on how proactively it has managed
records such as email, chats, websites, and electronic documents, as well as raw application data and Metadata.
Big Data has become a driver for more efficient e-discovery, records retention, and strong information
governance.
Gaining efficiencies is a driver for improving document management. Technological advances in document
management are helping organizations streamline processes, manage workflow, eliminate repetitive manual
tasks, and enable collaboration. These technologies have the additional benefits of enabling people to locate,
access, and share documents more quickly. They can also prevent documents from being lost. This is very
important for e-discovery. Money is also saved by freeing up file cabinet space and reducing document
handling costs.
The goals of implementing best practices around Document and Content Management include:
• Ensuring effective and efficient retrieval and use of data and information in unstructured formats
• Everyone in an organization has a role to play in protecting the organization’s future. Everyone must
create, use, retrieve, and dispose of records in accordance with the established policies and procedures.
• Experts in the handling of records and content should be fully engaged in policy and planning.
Regulatory requirements and best practices can vary significantly based on industry sector and legal jurisdiction.
Even if records management professionals are not available to the organization, everyone can be trained to
understand the challenges, best practices, and issues. Once trained, business stewards and others can collaborate
on an effective approach to records management.
In 2009, ARMA International, a not-for-profit professional association for managing records and information,
published a set of Generally Accepted Recordkeeping Principles® (GARP) 45 that describes how business
records should be maintained. It also provides a recordkeeping and information governance framework with
associated metrics. The first sentence of each principle is stated below. Further explanation can be found on the
ARMA website.
• Principle of Integrity: An information governance program shall be constructed so the records and
information generated or managed by or for the organization have a reasonable and suitable guarantee
of authenticity and reliability.
• Principle of Availability: An organization shall maintain its information in a manner that ensures
timely, efficient, and accurate retrieval of its information.
• Principle of Retention: An organization shall retain its information for an appropriate time, taking
into account all operational, legal, regulatory and fiscal requirements, and those of all relevant binding
authorities.
• Principle of Transparency: An organization shall document its policies, processes and activities,
including its information governance program, in a manner that is available to and understood by staff
and appropriate interested parties.
1.3.1 Content
A document is to content what a bucket is to water: a container. Content refers to the data and information
inside the file, document, or website. Content is often managed based on the concepts represented by the
documents, as well as the type or status of the documents. Content also has a lifecycle. In its completed form,
some content becomes a matter of record for an organization. Official records are treated differently from other
content.
Content management includes the processes, techniques, and technologies for organizing, categorizing, and
structuring information resources so that they can be stored, published, and reused in multiple ways.
The lifecycle of content can be active, with daily changes through controlled processes for creation and
modification; or it can be more static with only minor, occasional changes. Content may be managed formally
(strictly stored, managed, audited, retained or disposed of) or informally through ad hoc updates.
Content management is particularly important in websites and portals, but the techniques of indexing based on
keywords and organizing based on taxonomies can be applied across technology platforms. When the scope of
content management includes the entire enterprise, it is referred to as Enterprise Content Management (ECM).
Metadata is essential to managing unstructured data, both what is traditionally thought of as content and
documents and what we now understand as ‘Big Data’. Without Metadata, it is not possible to inventory and
organize content. Metadata for unstructured data content is based on:
• Format: Often the format of the data dictates the method to access the data (such as electronic index
for electronic unstructured data).
• Search-ability: Whether search tools already exist for use with related unstructured data.
• Existing patterns: Whether existing methods and patterns can be adopted or adapted (as in library
catalogs).
• Requirements: Need for thoroughness and detail in retrieval (as in the pharmaceutical or nuclear
industry 46). Detailed Metadata at the content level might therefore be necessary, along with a tool capable of content tagging.

46 These industries are responsible for supplying evidence of how certain kinds of materials are handled. Pharmacy manufacturers, for example, must keep detailed records of how a compound came to be and was then tested and handled, before being allowed to be used by people.
Generally, the maintenance of Metadata for unstructured data becomes the maintenance of a cross-reference
between various local patterns and the official set of enterprise Metadata. Records managers and Metadata
professionals recognize that long-term embedded methods exist throughout the organization for documents, records,
and other content that must be retained for many years, but that these methods are often costly to re-organize. In
some organizations, a centralized team maintains cross-reference patterns between records management
indexes, taxonomies, and even variant thesauri.
Content modeling is the process of converting logical content concepts into content types, attributes, and data
types with relationships. An attribute describes something specific and distinguishable about the content to
which it relates. A data type restricts the type of data the attribute may hold, enabling validation and processing.
Metadata management and data modeling techniques are used in the development of a content model.
There are two levels of content modeling. The first is at the information product level, which creates an actual
deliverable like a website. The second is at the component level, which further details the elements that make up
the information product model. The level of detail in the model depends on the granularity desired for reuse and
structure.
Content models support the content strategy by guiding content creation and promoting reuse. They support
adaptive content, which is format-free and device-independent. The models become the specifications for the
content implemented in structures such as XML schema definitions (XSDs), forms, or stylesheets.
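A component-level content model can be sketched as a content type with typed attributes, which a schema (such as an XSD) would then formalize. The content type and attribute names below are illustrative only:

# Sketch: a component-level content model - a content type with typed attributes.
from dataclasses import dataclass
from datetime import date

@dataclass
class PressRelease:              # hypothetical content type
    title: str                   # attributes with data types that enable validation
    release_date: date
    body: str
    status: str = "draft"        # e.g., draft, approved, published

item = PressRelease(title="Product Launch", release_date=date(2017, 3, 1),
                    body="...", status="approved")
print(item.title, item.status)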
Content needs to be modular, structured, reusable, and device- and platform- independent. Delivery methods
include web pages, print, and mobile apps as well as eBooks with interactive video and audio. Converting
content into XML early in the workflow supports reuse across different media channels.
Content delivery methods fall into three categories:
• Push: In a push delivery system, users choose the type of content delivered to them on a pre-
determined schedule. Syndication involves one party creating the content published in many places.
Really Simple Syndication (RSS) is an example of a push content delivery mechanism. It distributes
content (i.e., a feed) to syndicate news and other web content upon request.
• Pull: In a pull delivery system, users pull the content through the Internet. An example of a pull system
is when shoppers visit online retail stores.
• Interactive: Interactive content delivery methods, such as third-party electronic point of sale (EPOS)
apps or customer facing websites (e.g., for enrollment), need to exchange high volumes of real-time
data between enterprise applications. Options for sharing data between applications include Enterprise
Application Integration (EAI), Changed Data Capture (CDC), Data Integration, and Enterprise Information Integration (EII). (See Chapter 8.)
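To make the push/syndication pattern in the list above concrete, the following sketch serializes a few content items into a minimal RSS 2.0 feed using only the Python standard library; the channel name, URLs, and items are hypothetical:

    import xml.etree.ElementTree as ET

    # Hypothetical content items to syndicate; titles and links are illustrative.
    items = [
        {"title": "New enrollment guide published", "link": "https://example.com/guide"},
        {"title": "Product catalog updated", "link": "https://example.com/catalog"},
    ]

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Example content feed"
    ET.SubElement(channel, "link").text = "https://example.com/feed"
    ET.SubElement(channel, "description").text = "Content pushed to subscribers on a schedule"

    for entry in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = entry["title"]
        ET.SubElement(node, "link").text = entry["link"]

    # Serialize the feed; a real syndication pipeline would publish this document.
    print(ET.tostring(rss, encoding="unicode"))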
A controlled vocabulary is a defined list of explicitly allowed terms used to index, categorize, tag, sort, and
retrieve content through browsing and searching. A controlled vocabulary is necessary to systematically
organize documents, records, and content. Vocabularies range in complexity from simple lists or pick lists, to synonym rings or authority lists, to taxonomies, and, most complex of all, to thesauri and ontologies. An
example of a controlled vocabulary is the Dublin Core, used to catalog publications.
Defined policies control who may add terms to the vocabulary (e.g., a taxonomist, indexer, or librarian). Librarians are particularly trained in the theory and development of controlled vocabularies. Users of the list may apply only terms from the list, within its scoped subject area. (See Chapter 10.)
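As a simple, non-authoritative sketch (the terms and synonym ring below are invented), even a small controlled vocabulary can normalize the tags applied to content so that browsing and searching return consistent results:

    # Hypothetical controlled vocabulary: allowed (preferred) terms plus a
    # synonym ring that maps variant terms back to the preferred term.
    PREFERRED_TERMS = {"Customer", "Invoice", "Facility"}
    SYNONYM_RING = {"client": "Customer", "patron": "Customer",
                    "bill": "Invoice", "site": "Facility"}

    def tag_content(raw_tags):
        """Keep only allowed terms, mapping synonyms to their preferred form."""
        tags = set()
        for raw in raw_tags:
            if raw in PREFERRED_TERMS:
                tags.add(raw)
            elif raw.lower() in SYNONYM_RING:
                tags.add(SYNONYM_RING[raw.lower()])
            # anything else is rejected: only listed terms may be applied
        return sorted(tags)

    print(tag_content(["Client", "Invoice", "random note"]))   # ['Customer', 'Invoice']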
Ideally, controlled vocabularies should be aligned with the entity names and definitions in an enterprise
conceptual data model. A bottom-up approach to collecting terms and concepts is to compile them in a
folksonomy, which is a collection of terms and concepts obtained through social tagging.
Controlled vocabularies constitute a type of Reference Data. Like other Reference Data, their values and
definitions need to be managed for completeness and currency. They can also be thought of as Metadata, as they
help explain and support the use of other data. They are described in this chapter because Document and
Content Management are primary use cases for controlled vocabularies.
Because vocabularies evolve over time, they require management. ANSI/NISO Z39.19-2005, the American national standard Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, describes vocabulary management as a way “to improve the effectiveness of information storage and retrieval systems, web navigation systems, and other environments that seek to both identify and locate desired content via some sort of description using language. The primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval.” 47
Vocabulary management is the function of defining, sourcing, importing, and maintaining any given
vocabulary. Key questions to enable vocabulary management focus on uses, consumers, standards, and
maintenance:
• Who is the audience for this vocabulary? What processes do they support? What roles do they play?
• Why is the vocabulary necessary? Will it support an application, content management, or analytics?
• What existing vocabularies do different groups use to classify this information? Where are they
located? How were they created? Who are their subject matter experts? Are there any security or
privacy concerns for any of them?
• Is there an existing standard that can fulfill this need? Are there concerns about using an external standard versus an internal one? How frequently is the standard updated, and what is the degree of change in each update? Is the standard accessible in an easy-to-import, easy-to-maintain format, in a cost-efficient manner?
The results of this assessment will enable data integration. They will also help to establish internal standards,
including associated preferred vocabulary through term and term relationship management functions.
If this kind of assessment is not done, preferred vocabularies will still be defined in an organization, but they will be defined in silos, project by project, leading to a higher cost of integration and a greater chance of data quality issues. (See Chapter 13.)
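A minimal sketch of term and term-relationship management, assuming hypothetical terms and the usual thesaurus relationship types (broader term, narrower term, “use for” variants):

    from dataclasses import dataclass, field
    from typing import List, Optional

    # Hypothetical term records; the relationship types follow common
    # thesaurus conventions (broader term, narrower terms, "use for" variants).
    @dataclass
    class Term:
        preferred: str
        use_for: List[str] = field(default_factory=list)   # variants mapped to this term
        broader: Optional[str] = None
        narrower: List[str] = field(default_factory=list)

    vocabulary = {
        "Customer": Term("Customer", use_for=["Client", "Patron"],
                         narrower=["Retail Customer"]),
        "Retail Customer": Term("Retail Customer", broader="Customer"),
    }

    def expand_for_search(term: str) -> List[str]:
        """Expand a preferred term to itself plus its narrower terms, for retrieval."""
        record = vocabulary.get(term)
        if record is None:
            return [term]
        return [record.preferred] + record.narrower

    print(expand_for_search("Customer"))   # ['Customer', 'Retail Customer']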
A vocabulary view is a subset of a controlled vocabulary, covering a limited range of topics within the domain
of the controlled vocabulary. Vocabulary views are necessary when the goal is to use a standard vocabulary
containing a large number of terms, but not all terms are relevant to some consumers of the information. For
example, a view that only contains terms relevant to a Marketing Business Unit would not contain terms
relevant only to Finance.
Vocabulary views increase information’s usability by limiting the content to what is appropriate to the users.
Construct a vocabulary view of preferred vocabulary terms manually, or through business rules that act on
preferred vocabulary term data or Metadata. Define rules for which terms are included in each vocabulary view.
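A minimal sketch of a rule-driven vocabulary view, assuming hypothetical business-unit Metadata on each preferred term; the view is simply the subset of terms whose Metadata satisfies the rule:

    # Hypothetical preferred vocabulary terms with Metadata naming the business
    # units to which each term is relevant; all values are illustrative.
    VOCABULARY = [
        {"term": "Campaign", "business_units": {"Marketing"}},
        {"term": "Lead", "business_units": {"Marketing", "Sales"}},
        {"term": "General Ledger", "business_units": {"Finance"}},
    ]

    def vocabulary_view(business_unit: str):
        """Apply a simple rule: include a term if it is relevant to the given unit."""
        return [entry["term"] for entry in VOCABULARY
                if business_unit in entry["business_units"]]

    print(vocabulary_view("Marketing"))   # ['Campaign', 'Lead']
    print(vocabulary_view("Finance"))     # ['General Ledger']

In practice the rule would act on richer term Metadata, but the principle is the same: a view restricts, rather than redefines, the underlying vocabulary.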
47 http://bit.ly/2sTaI2h.