Many big
data application implementations seem to begin with an existing data warehouse,
one or more new high-volume data streams, and some specialized hardware and
software to support data storage and business analytics. The data storage issue
is often accommodated by installing a proprietary hardware appliance that can
store huge amounts of data while providing extremely fast data access.
In these
cases, do we really need to worry about database design?
Data Modeling in a Big Data Environment
Most
database administrators agree: good database design is part of system and
application design. There are many business requirements, such as data
availability, purge processing, and application performance that are addressed
using specific database design options.
What
about big data? Interestingly, vendors that supply hardware and software
solutions for business analytics against big data claim that database design is
less important. They argue that since the data is stored in their proprietary
format, most database design considerations do not apply.
Related Articles
Confusion
over this issue usually originates from misperceptions regarding how special
solutions execute big data queries. Simply put, in most cases your data will be
physically stored in two places: your current production database management
system (DBMS) and the new special-purpose appliance. Your current production
processes that extract, transform, and load data into the current DBMS continue
to operate as-is, with an additional step: at every point that you load data
into a table you will also ensure that the new data is loaded into the
appliance as well.
Loading
data into the appliance can be done immediately after your DBMS loads, or can
be batched for later execution. The important thing is that data must be loaded
into the appliance before any big data queries can utilize it for the
advertised performance gains.
Database Design for Quality Assurance
What does
a quality database design mean? Generally speaking, a database design starts
with a model of the data and the business rules that define their
relationships. For example, an order is always associated with a
customer, and a customer may have zero, one, or many orders. Along with these
things and the data element definitions and attributes, the database design
will address, cope with, or mitigate risks in the following areas:
- Assist with defect prevention by automating data element
valid values checking;
- Permits defect detection and remediation during
application construction and testing;
- Moves data verification as close to the source as
possible;
- Provides stability, reliability, data accessibility and
system scalability.
What Will the Database Designer do Differently?
A poor
quality database design affects technical support the most. They are the ones
that must deal with system problems in real-time. This drives up the cost of
problem determination and problem resolution. It also manifests itself in
product behaviors that may annoy ... or drive away customers. The most common
problems associated with bad designs are poor application performance or data
contention.
Typical
fixes include database reorganizations or re-design, adding table indexes and
changing table partitioning or clustering. However, in a big data environment,
these options are typically not available in the appliance. They will
only exist in the base tables in the database proper. This is the crux of the
matter: despite the vendor's claims that all your data can be moved into the
appliance, this is seldom the best solution.
Having
your data co-exist between the main database management system and the
appliance is a best practice for several reasons.
Avoid
single point of failure: The appliance is a single
point of failure. Despite the best efforts of the vendor and your support staff
the hardware, software, network connections or processes within the appliance
may fail. If they do, how will queries be satisfied? With data co-located
in the database management system, query results can be satisfied by accessing
the base tables. Granted, performance may suffer; however, the alternative is
that your big data application will be unavailable until someone fixes the problem.
Provide
for data offloads: Queries are not the only consumers of your
data. One common use is offloading production data to a test environment. In
addition, some third-party vendor software tools access the data in the
database natively and directly, which is not available in the appliance because
it stores the data in a proprietary format.
Backup
and recovery: Most
common backup and recovery utilities are based on data residing in the
database. Again, third-party vendor tools are commonly used for high-performance
backups and recovery, including index recovery. These backups are executed
against the base tables and table spaces, not against the appliance.
Certain
performance situations: There are some situations in
which SQL queries are not executable in the appliance. Such limitations are
defined in the manuals, and vary across vendor appliances and versions. In
these cases, you have no choice; you must access the base tables and accept the
performance degradation. Some of these limitations involve specific SQL syntax
such as scrollable cursors, dynamic SQL, use of multiple character encoding
schemes, certain correlated table expressions, and the use of certain built-in
functions.
Database Design Decisions for Big Data
Since you
will be keeping your data in both the DBMS and in the appliance, your standard
database design rules still apply. Interestingly, some of the rules are
now expanded or more complex due to the existence of the appliance. Some
of the considerations are as follows:
The need
for indexes: Indexes serve multiple needs: they can enforce
data element uniqueness, they can enforce referential integrity relationships,
they define primary keys, and they define additional access paths. This last
item is important. In a big data environment the idea is to push long-running
queries into the appliance for high-speed processing. If certain indexes exist
simply to provide alternative access paths, they may no longer be needed. Part
of database design or re-design should involve a review of so-called
performance indexes. If the index is no longer being used by queries it can be
dropped, thereby saving disk space, processing time, and recovery time if the
table data needs to be recovered.
Removing
appliance SQL limitations: Usually the business rules
for the data determine portions of the database design. These include physical
partitioning to permit faster queries and easier data purging, data element
domain checking such as column constraints, and definition of primary and
foreign keys to support referential integrity rules. Application developers
then code SQL queries to access the data. In addition, users may have reporting
tools that automatically generate SQL for queries and reports. Since the SQL
query syntax and options will depend upon the database design, the designer
needs to keep appliance limitations in mind.
Designing
for high-speed appliance data load: Normal
database load processes now contain an extra step: loading data into the
appliance as well. How is this best accomplished? Depending upon your
application and on your data volatility, you may wish to consider variations on
the following:
- Regular bulk load (daily, hourly) of the appliance, with
the understanding that data there will not be completely current.
- Trickle load, where row updates to the base tables are
then propagated synchronously to the appliance. This keeps appliance data
current, but row processing is much slower than bulk loading.
Summary
Big data
and appliances have not removed the need for good database design.
Indeed, the designer has more things to consider: backup and recovery, index
management, multiple methods of data access, and SQL limitations. The good news
is that advances in database hardware and software can speed up data queries to
a remarkable degree.
Additional Resources
-
Carnegie Mellon Capability Maturity Model
-
IBM TechDocs library: Information on autonomics—“
A First Look at Solution Installation for Autonomic Computing”, IBM document
SG24-7099, available at the IBM Quality management solutions
-
American Productivity and Quality Center
-
American Society for Quality