(Editor’s note: This column is based upon the forthcoming book “Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data,” www.pdbmbook.com).
In the context of database management, business continuity can be defined as an organization’s ability to guarantee its uninterrupted functioning, despite possible planned or unplanned downtime of the hardware and software supporting its database functionality.
Planned downtime can be due to backups, maintenance, upgrades, etc. Unplanned downtime can be due to malfunctioning of the server hardware, the storage devices, the operating system, the database software or business applications. A very specific, and extreme, aspect of business continuity is an organization’s endurance against human-induced or natural disasters. In that context, we speak of disaster tolerance.
Contingency planning, recovery point and recovery time
An organization’s measures with respect to business continuity and recovery from any calamities are formalized in a contingency plan. Without going into too much detail, a primary element of such a plan is the quantification of recovery objectives, considering an organization’s strategic priorities:
The Recovery Time Objective (RTO) specifies the amount of downtime that is acceptable after a calamity has occurred. The estimated cost of this downtime provides guidance as to the investments an organization is prepared to make to keep it as short as possible. The closer the RTO is to the moment of the calamity, the less downtime there will be, but also the higher the required investments in measures to restore database systems to a functioning state after planned or unplanned downtime.
The RTO differs from organization to organization. For example, a worldwide online shop will push for zero downtime, as even the slightest downtime costs vast amounts in lost sales. On the other hand, a secondary school may be able to cope with a few hours of downtime, so there is no need for it to invest in recovering from a calamity in a matter of minutes.
The Recovery Point Objective (RPO) specifies the degree to which data loss is acceptable after a calamity. Or to put it differently, it specifies which point in time the system should be restored to, once the system is up and running again. The closer the RPO is to the time of the calamity, the less data will be lost, but also the higher the required investments in state-of-the-art backup facilities, data redundancy, etc. Like the RTO, the RPO differs from organization to organization.
For example, although a higher RTO is acceptable for a secondary school, its RPO is probably closer to zero, as losing data such as the pupils’ exam results is unacceptable. On the other hand, a weather observatory is better off with a low RTO than with a low RPO, as it is probably important to be able to resume observations as soon as possible, whereas losing some data from just before the calamity may be less dramatic.
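To make the two objectives concrete, the following minimal Python sketch checks a periodic backup strategy against a given RTO and RPO. The class and function names, as well as the figures for the secondary school, are illustrative assumptions and do not come from an actual contingency plan.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets from a contingency plan (the values used below are made up)."""
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of lost data


def meets_objectives(objectives, backup_interval, restore_time):
    """Return (rpo_ok, rto_ok) for a periodic backup strategy.

    In the worst case the calamity strikes just before the next backup,
    so the window of lost data equals the full backup interval.
    """
    rpo_ok = backup_interval <= objectives.rpo
    rto_ok = restore_time <= objectives.rto
    return rpo_ok, rto_ok


# Hypothetical secondary school: hours of downtime are fine, data loss is not.
school = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=5))
nightly_backup = meets_objectives(
    school,
    backup_interval=timedelta(hours=24),
    restore_time=timedelta(hours=3),
)
print(nightly_backup)  # (False, True)
```

Under these assumed figures, a nightly backup that restores in three hours satisfies the school's RTO but not its RPO, which suggests that more frequent backups or replication would be needed.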
The aim of a contingency plan is to minimize the RPO and/or RTO, or to at least guarantee a level that is appropriate to the organization, department or process at hand. A crucial aspect in this context is to avoid single points of failure, as these represent the Achilles heels of the organization’s information systems.
With respect to database management, the following ‘points of failure’ can be identified: availability and accessibility of storage devices, availability of database functionality and availability of the data itself. In each of these domains, some form of redundancy is called for to mitigate single points of failure.
Availability and Accessibility of Storage Devices
The availability and accessibility of storage devices is typically arranged by means of RAID configurations and enterprise storage subsystems. For example, networked storage, in addition to other considerations, avoids single points of failure with respect to the connectivity between servers and storage devices. In addition, different RAID levels not only impact the RPO by avoiding data loss through redundancy, but also the RTO.
For example, the mirror set-up in RAID 1 allows for uninterrupted storage device access, as all processes can be instantaneously redirected to the mirror drive if the primary drive fails. In contrast, the redundancy in the form of parity bits in other RAID levels requires some time to reconstruct the data if the content of one drive in the RAID configuration is damaged.
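The reconstruction step for parity-based redundancy can be illustrated with a small Python sketch. It assumes a simplified single-parity scheme in which the parity block is the byte-wise XOR of the data blocks, as in RAID 5 style layouts; the block contents and the helper name xor_blocks are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks striped over three drives, plus one parity block.
data = [b"ACCT", b"0042", b"EUR!"]
parity = xor_blocks(data)

# Simulate losing the second drive: XOR the surviving blocks with the parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'0042'
```

Every surviving block has to be read to rebuild the lost one, which is why parity-based recovery takes time, whereas a mirror copy is available immediately.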
Availability of Database Functionality
Safeguarding access to storage devices is useless if the organization cannot guarantee a DBMS that is permanently up and running as well. A first, simple approach here is to provide for manual failover of DBMS functionality. This means that a spare server with DBMS software is kept on standby, possibly with shared access to the same storage devices as the primary server.
However, in case of a calamity, manual intervention is needed, such as initiating startup scripts, to transfer the workload from the primary database server to the backup server. This inevitably takes some time, which lengthens the recovery time and makes a tight RTO harder to meet.
A more complex and expensive solution, but with a much better impact on the RTO, is the use of clustering. In general, clustering refers to multiple interconnected computer systems working together so that they are perceived, in certain respects, as a single unit. The individual computer systems are denoted as the nodes in the cluster. The purpose of cluster computing is to improve performance by means of parallelism and/or availability through redundancy in hardware, software and data.
Availability is guaranteed by automated failover, in that other nodes in the cluster are configured to take over the workload of a failing node without halting the system. In the same way, planned downtime can be avoided. A typical example is a rolling upgrade, where software upgrades are applied one node at a time, with the other nodes temporarily taking over the workload.
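The following Python sketch gives a rough, toy-level impression of automated failover and a rolling upgrade. The Node and Cluster classes, the node names and the "first healthy node" routing rule are all assumptions made for illustration; real cluster managers use health checks, quorum and far more elaborate routing.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.version = 1


class Cluster:
    """Toy cluster that routes each request to the first healthy node."""

    def __init__(self, nodes):
        self.nodes = nodes

    def route(self):
        for node in self.nodes:
            if node.healthy:
                return node.name
        raise RuntimeError("no healthy node available")

    def rolling_upgrade(self, new_version):
        """Upgrade one node at a time; record which node served meanwhile."""
        served_by = []
        for node in self.nodes:
            node.healthy = False            # take the node out of rotation
            served_by.append(self.route())  # another node carries the load
            node.version = new_version
            node.healthy = True             # put it back in rotation
        return served_by


cluster = Cluster([Node("db1"), Node("db2"), Node("db3")])

cluster.nodes[0].healthy = False   # simulate an unplanned failure of db1
print(cluster.route())             # db2

cluster.nodes[0].healthy = True    # db1 has been repaired
print(cluster.rolling_upgrade(2))  # ['db2', 'db1', 'db1']
```

In both cases a request is always routed to some healthy node, which is the property that lets a cluster absorb unplanned failures and planned upgrades without halting the system.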
The coordination of DBMS nodes in a cluster can be organized at different levels. It can be the responsibility of the operating system, which then provides specific facilities for exploiting a cluster environment. Several DBMS vendors also offer tailored DBMS implementations, with the DBMS software itself taking on the responsibility of coordinating and synchronizing different DBMS instances in a distributed setting.
Availability of Data
A last concern in the context of business continuity is the availability of the data itself. Many techniques exist to safeguard data by means of backup and/or replication. These techniques all have a different impact on the RPO and RTO, resulting in a different answer to the respective questions ‘how much data will be lost since the last backup, in case of a calamity?’ and ‘how long does it take to restore the backup copy?’.
Here too, a tighter RPO and/or RTO often comes at a higher cost. We present some typical approaches:
Tape backup: with tape backup, the database files are copied periodically to a tape storage medium for safekeeping. Tape backup is still the least expensive backup solution. However, it is a time-consuming process, so backups are necessarily made less frequently, which has a negative impact on the RPO. Therefore, tape backup is often combined with other precautions. Restoring a backup copy from tape onto a functional database server after a calamity is a time-intensive process as well. Tape backup thus has an equally negative impact on the RTO as on the RPO.
Hard disk backup: the process of making and restoring backups to/from hard disk is more efficient than tape backup because of the device characteristics such as better access times and transfer rate. This has a positive impact on the RTO and possibly also on the RPO. Still, as to the latter, the frequency of backups not only depends on the characteristics of the backup medium, but also on the infrastructure where the primary copy of the data resides. For example, the workload of the source system, and the possible performance impact on the latter, may be an important factor to determine backup frequency.
Electronic vaulting: creating backups is key to business continuity, but in most cases, it is also essential to safeguard the backup copies at a remote site, at sufficient distance from the primary site to avoid them both being involved in the same incident or disaster. A simple but error-prone approach here is to manually transport the offline tape backups to the remote site. A more efficient technique is electronic vaulting. Here, backup data is transmitted over a network to hard disk or tape devices at a secure vaulting facility or at an alternate data center. This process can be largely automated.
Replication and mirroring: the techniques mentioned thus far are all asynchronous approaches; backup copies of the data are only created periodically. Therefore, whatever the frequency, there is always a certain amount of data loss and the RPO will never coincide with the moment of the calamity. To avoid data loss altogether, synchronous techniques are needed, maintaining redundant copies of the data in real time.
Two closely related terms in this context are replication and mirroring. Mirroring is the act of performing the same write operations on two or more identical disks simultaneously. Mirroring is always synchronous. Replication is the act of propagating data written to one device over a network onto another device. This can be done synchronously, semi-synchronously or asynchronously.
Synchronous replication and mirroring provide near real time redundant copies of the data and thus cater for a very strict RPO. Of course, the tradeoff is the cost of the solution, but also the performance impact that real time replication may have on the source system and sometimes the network. Asynchronous replication is more flexible in this respect.
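The difference in RPO between synchronous and asynchronous replication can be sketched in a few lines of Python. The Primary and Replica classes and the in-memory queue of pending writes are hypothetical simplifications; they only model whether a write has reached the replica before the write call returns.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    def __init__(self, replica, synchronous):
        self.data = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = deque()  # writes not yet sent to the replica

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # The write only returns after the replica has applied it.
            self.replica.apply(key, value)
        else:
            # The write returns immediately; shipping happens later.
            self.pending.append((key, value))

    def ship_pending(self):
        while self.pending:
            self.replica.apply(*self.pending.popleft())


def lost_writes_after_crash(synchronous):
    replica = Replica()
    primary = Primary(replica, synchronous)
    primary.write("a", 1)
    primary.ship_pending()
    primary.write("b", 2)  # the primary crashes before the next shipment
    return sorted(set(primary.data) - set(replica.data))


print(lost_writes_after_crash(synchronous=True))   # []
print(lost_writes_after_crash(synchronous=False))  # ['b']
```

The synchronous variant loses nothing but makes every write wait for the replica, which is the performance impact mentioned above; the asynchronous variant returns faster at the price of a nonzero window of lost data.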
Disaster tolerance: to guarantee a tight RPO and RTO under any circumstances, remote data replication is needed to an actual second data center at a sufficiently distant location. The data can be replicated over a dedicated network (e.g., a WAN) or over public lines. The same considerations apply with respect to synchronous versus asynchronous replication, with the addition that asynchronous replication is less sensitive to network latency, which may be an important factor given the distance.
The remote site should either be fully operational, in which case it may also handle some workload to relieve the primary site, or at least be able to become fully operational in a very limited amount of time. This means that not only should up-to-date data be available, but DBMS functionality should also be up and running or at least on standby.
Transaction recovery: as a final remark, it must be stressed that replicating the data alone does not always suffice to guarantee database integrity in case of calamities. The transaction context must be preserved as well.
For example, suppose disaster strikes in a bank amidst a set of database operations where money is withdrawn from one account and about to be transferred to another account. Even if the data files themselves were replicated synchronously from the primary site to the remote site, the remote database is not necessarily aware that a transaction was going on, where money was already retrieved from one account, but not yet deposited on the other account.
This information, which is vital to the consistency of the database, is what we call the transaction context. If the overall data replication is coordinated at DBMS level, and not at operating system or network level, then typically the transaction context is also transferred between the DBMSs. One popular technique here is called log shipping. It means that the log file, which keeps track of ongoing transactions, is replicated between both DBMSs. The remote DBMS can use this log file for transaction recovery, i.e. to restore the context of the transactions that were ongoing at the primary site.
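As an illustration of how a shipped log file can restore consistency, the Python sketch below undoes the updates of transactions that never committed. The log record layout, the account names and the amounts are hypothetical, and the single backward pass is a deliberate simplification; actual DBMSs use more elaborate recovery protocols.

```python
def undo_incomplete(replicated_balances, shipped_log):
    """Reverse the updates of transactions without a commit record."""
    committed = {tx for tx, action, *_ in shipped_log if action == "commit"}
    balances = dict(replicated_balances)
    # Walk the log backwards and reverse each update of an unfinished
    # transaction.
    for tx, action, *rest in reversed(shipped_log):
        if action == "update" and tx not in committed:
            account, delta = rest
            balances[account] -= delta
    return balances


# Hypothetical log shipped from the primary site to the remote site.
log = [
    ("T1", "begin"),
    ("T1", "update", "alice", -100),
    ("T1", "update", "bob", +100),
    ("T1", "commit"),
    ("T2", "begin"),
    ("T2", "update", "alice", -50),  # disaster strikes before the deposit
]

# State of the replicated data files at the moment of the disaster.
replicated = {"alice": 350, "bob": 600}
print(undo_incomplete(replicated, log))  # {'alice': 400, 'bob': 600}
```

Without the log, the remote site would keep the balance of 350, with 50 withdrawn and never deposited; with the log, the half-finished transfer T2 is rolled back while the committed transfer T1 is preserved.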
In this article, we provided a database perspective on business continuity. Starting from a contingency plan, we elaborated on various single points of failure: availability and accessibility of storage devices, availability of database functionality and availability of the data itself.