Our world is awash in digital data. Continuous advances in storage technologies, data-collection methods and processing power enable our businesses to capture, aggregate and analyze information from an ever-greater spectrum of sources. The opportunities are endless, but with these opportunities come daunting challenges.
On one hand, the options for assembling the ideal solution for a given data application have never been better; on the other hand, choosing the right components for that solution is ever more confusing and complex.
When collecting requirements, concentrate on what is needed over how the need will be addressed. Focusing on the need rather than creating a requirement for a specific implementation opens the door for other methods that may provide a more optimal solution, particularly if you are forced to trade off between the strengths and weaknesses of various options. With more than 200 choices available, it’s likely that you’ll be able to find one or more databases that are a good fit for your needs. However, given their breadth of features and characteristics, identifying the best match can be complex. Consider the following attributes when comparing databases.
Type of database
From relational database management systems (RDBMS) through specialized graphics, object and columnar databases, many types of databases have evolved to meet a wide range of common and specialized applications. For most applications, the two most important categories are:
SQL RDBMS: Relational databases, the most widely used model, maintain data in a set of separate, related files (tables), and combine elements from these files when needed for queries and reports. Designed primarily to handle structured data, these databases are heavily used for workhorse business applications in areas such as finance, manufacturing and human resources, which use SQL queries to access data maintained in standard record formats. Popular SQL RDBMS examples include PostgreSQL (open source), Oracle, MySQL (also Oracle), SQL Server (Microsoft) and DB2 (IBM).
NoSQL: These databases provide a solution for handling data that is less structured than required for relational databases. Trading query processing flexibility for scalability and performance under different workloads, these databases focus on data storage and are optimized for retrieval and appending operations. These characteristics make NoSQL databases attractive for cloud deployment and Big Data applications. Popular NoSQL database examples (all open source) include MongoDB, CouchDB, Cassandra, Riak, Hadoop, Redis and Neo4j.
All databases do not operate on all available platforms. Desktop (Windows, OS X), mobile (iOS, Android), enterprise (Linux, Illumos/OpenSolaris, BSD), and cloud (Amazon EC2, Joyent, GoGrid, Rackspace) operating platforms each have their own nuances and constraints that affect database cost, capacity and performance. Even when a given database covers multiple platforms, it may support some better than others.
How a database is licensed can dramatically impact its cost, risk profile and usage flexibility. Commercially available databases charge licensing and maintenance costs for regular releases and stronger support, but can be costly to scale, restrictive in usage and require vendor resources if modifications or extensions are required. An increasing number of databases are being offered through open-source licensing. This model cuts licensing costs and usage restrictions, but requires a greater investment in internal and/or external support.
With new databases reaching the market on a frequent basis, it is tempting to try the latest and greatest to get the newest features. Unfortunately, these databases often have more defects, security vulnerabilities and performance and stability issues than more-mature offerings. Maturity provides a larger user base and a longer time period to identify and remedy inherent issues and strengthen support. Unless newly offered features are essential for the business solution, a more mature offering is usually the best choice. Investigate characteristics such as length of time on the market, number of users, and level of support to compare the maturity of database offerings.
Operations and management
The cost of maintaining and operating a database over its life often exceeds its acquisition costs. Depending on its data center environment and the size, skill sets and strategy for its internal IT organization, a company can choose to implement and support a database with its own staff, or use one of many available sourcing options. Any aspect of the database’s selection, implementation, application software development, support and operations can be sourced to external consultants as a project or as part of a broader service.
Likewise, the database can be hosted by an external solution provider. Some hosting services include regular database backup and maintenance activities, as well as disaster recovery services that provide extra security and free internal IT staff to perform more strategic tasks. Cloud deployment is a variation of external hosting where the database operates on a cloud-computing platform offered by Amazon EC2, GoGrid, Rackspace or other providers. In this model, companies can use a virtual machine image to operate their databases independently, or they can simply purchase access to a database service delivered by the provider.
Databases differ considerably in the robustness of their operational attributes. Attributes such as superfast performance and high availability come at a price in terms of software, hardware and support overhead. The true requirements of the business functions being supported are the key for finding the most effective balance between these attributes and their cost.
Performance: Database performance is judged by factors such as response time per query and transaction throughput, and is affected by the design of the database, infrastructure capacity, and the logical and physical structure of the data. For example, using a schema imposes more structure on the underlying data but provides better performance due to the constraints.
Scalability: Scalability evaluates the ability of a database to handle anticipated growth in usage, transaction rates and data volume. Considerations include inherent product limitations, ease of increasing capacity by adding hardware, and costs for additional licenses and hardware to handle expected growth.
Reliability: Reliability is a measure of the ability of a system to function correctly—including avoiding data loss and corruption—over a given time period. It is often assessed using Mean Time Between Failure and is enhanced by database features that help to avoid, detect and repair hardware- and software-driven data faults.
Availability: Availability measures whether a database is available for use when needed. A database designed for high availability would disable a malfunctioning component and continue operating at a reduced capacity, as opposed to a less capable system that might crash and become totally non-operational.
Resiliency: This factor assesses a database’s ability to recover from faults such as hardware crashes, overloads and software component failures. Simple databases may require restoration from backup files after a failure, whereas a more advanced database may employ sophisticated replication, rollback and component isolation strategies to permit continued operation and minimize the impact of faults.
#!Security: Databases offer differing levels of security features to protect the database from security vulnerabilities such as unauthorized access or disclosure of personal or proprietary data; damage to the data or programs; interruption or denial of authorized access to the database; attacks on other systems; and the unanticipated failure of database services. The level of security protection required for a given database application depends on the confidentiality/value of the data it contains and its level of accessibility to unauthorized activities.
Data integrity: A database’s data integrity features enable the specification and enforcement of rules to protect the quality of the data it stores. These rules may include checks and correction for invalid data upon entry, restrictions on data relationships, and timeframes for data retention.Ease of implementation: Choosing the database with the best features will not necessarily guarantee a successful implementation. The availability of tools, staffing and support for a widely used database can trump the technical advantages of a less common offering.
Availability of tools: A mature, widely used database will be supported by a variety of open-source and commercial tools to assist with software development, integration with other products, backup and recovery, and other critical functions. These tools increase productivity and cut time and costs throughout implementation and operation.
Extensibility: Does the database support simple ways of adding functionality outside of its standard feature set? Does your company have access to the source code, the skills and ability to make modifications? Does the database have interfaces to support externally developed functionality, or can it take advantage of the capabilities of other software applications? For example, an otherwise ideal database may not support the data encryption your company needs itself, but if it can be run using an encrypted file system, the combination becomes a viable solution.
Knowledge and staffing: Is the new database familiar to your internal IT staff? If not, it will require an investment for training or the hiring of external resources. For example, while many enterprise programmers know how to code effective SQL queries, far fewer would be able to write Map/Reduce jobs for Riak. Likewise, the market availability of experts is far greater for widely used databases such as Postgres and Hadoop.
Community: Open-source offerings, such as PostgreSQL, have active and robust user communities that provide support, and they make getting use cases and product extensions easier. In contrast, gaining the same level of assistance for closed-source offerings such as Oracle and Vertica requires expensive contracting for proprietary resources.
Successfully implementing a new database involves considerations well beyond the selection of the database itself. If your company doesn’t already have the right expertise in house, engaging specialized consulting assistance can be highly valuable. By bringing market knowledge and first-hand experience of the many available database solutions, consultants can reduce the time, resources and costs of choosing, implementing, operating and managing your database solution.
Denish Patel is the lead database architect at OmniTI, a global IT services company.