6.3 Archival Storage

6.3.1 Archival Information Package (AIP)

6.3.1.1  The definition of the term Archival Storage in OAIS includes the services and functions necessary for the storage of the Archival Information Package (AIP). Archival storage encompasses data management and includes processes such as storage media selection, transfer of the AIP to the storage system, data security and validity, backup and data restoration, and reproduction of the AIP to new media.

6.3.1.2  An AIP, as defined in the OAIS reference model (CCSDS 650.0-B-1 Reference Model for an Open Archival Information System (OAIS)), is an information package that is used to transmit archival objects into a digital archival system, store the objects within the system, and transmit objects from the system. An AIP contains both metadata that describes the structure and content of an archived essence and the actual essence itself. It consists of multiple data files that hold either a logically or physically packaged entity. The implementation of the AIP can vary from one archive to another; it specifies, however, a container that holds all the information necessary to allow long-term preservation of, and access to, archival holdings. The metadata model of OAIS is based on METS specifications.

6.3.1.3  From a physical point of view the AIP contains three parts: metadata, essence and packaging information, each of which consists of one or more files (see 6.1.3 Defining the Digital Object). Packaging information can be thought of as wrapper information that encapsulates the metadata and essence components.

6.3.2 Archival Storage basics

6.3.2.1  Archival Storage provides the means to store, preserve and provide access to archived content. In small systems the storage can stand alone and may be manually operated, but in larger systems storage is usually implemented in conjunction with cataloguing applications, asset management systems, information retrieval systems and access control systems in order to control and manage archived content and provide a controlled way to access them.

6.3.2.2  Archival Storage must be connected to equipment that ingests and creates the digital asset to be archived, and it must provide a secure and reliable interface that can be used to import assets to the storage system.

6.3.2.3  A system that is used to store archival content must be reliable in several ways: it must be available for use without any significant interruptions, and it must be able to report to the system or user who imports content whether the import was successful or not, thus enabling the importing party to delete the ingest copy of the archival file if appropriate. Archival Storage must also be able to preserve the content it manages for a long period of time and be able to protect the content from all kinds of failures and disasters.
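The import-and-report behaviour described above can be illustrated with a minimal sketch. It assumes that "successful import" means the stored copy has the same SHA-256 checksum as the ingest copy; that criterion, and the function names `import_to_storage` and `ingest_and_release`, are hypothetical choices made for illustration and are not prescribed by OAIS or by this section.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def import_to_storage(ingest_copy: Path, storage_dir: Path) -> bool:
    """Copy a file into storage and report whether the import succeeded.

    Success is defined here as matching checksums (an assumption of this sketch).
    """
    storage_dir.mkdir(parents=True, exist_ok=True)
    stored_copy = storage_dir / ingest_copy.name
    shutil.copy2(ingest_copy, stored_copy)
    return sha256_of(stored_copy) == sha256_of(ingest_copy)


def ingest_and_release(ingest_copy: Path, storage_dir: Path) -> bool:
    """Delete the ingest copy only after storage has reported success."""
    succeeded = import_to_storage(ingest_copy, storage_dir)
    if succeeded:
        ingest_copy.unlink()
    return succeeded
```

The point of the design is the ordering: the importing party keeps its own copy until the storage system has positively confirmed the import, so a failed or interrupted transfer never leaves the archive without a copy.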

6.3.2.4  An Archival Storage system should be built according to the needs of its functional owner: it must be correctly sized to carry out the tasks that are needed, and to manage the capacities that are required in everyday operations. In addition, Archival Storage must provide controlled access to the content it manages for the users who have permissions or rights to access the content.

6.3.3 Digital Mass Storage Systems (DMSS)

6.3.3.1  A Digital Mass Storage System refers to an IT based system that has been planned and built to be able to store and maintain large amounts of data for a given or extended period of time. These systems come in many forms; a basic DMSS could be a personal computer with a sufficiently large hard disk drive and some kind of catalogue that can be used to keep track of the assets the system possesses. A more complex DMSS may consist of hard disk and/or tape storage and a group of computers that control the storage entity. A DMSS can also contain many tiers of storage with different characteristics: a fast Fibre Channel based hard disk drive tier can be used to cache assets whose access time is critical, a tier built of cheaper hard disk drives can hold material whose access time is less critical, and tape based storage can be used as the most cost-effective tier.

6.3.3.2  When a number of different storage technologies are used in a large system to build the functional entity, an HSM (Hierarchical Storage Management) system is usually deployed in such a way that it supports the different technologies working together. Larger scale systems may also be distributed geographically in order to achieve better performance and make the system more fault tolerant.

6.3.4 Data Tape Types and Formats Introduction

6.3.4.1  The following is an outline of some of the main data tape formats and tape automation systems that may be used for storing AV content in data form. Data tapes are only used in conjunction with other components of a DMSS. It is prudent to commence a section on comparison of the various data tape formats with a reminder that no carrier is permanent and that, all things being equal, they will only be viable as long as the data systems in which they are incorporated continue to support them.

6.3.5 Data Tape Performance

6.3.5.1  Format geometry and dimensions govern performance. Data transfer speed, one aspect of performance, is a direct product of the number of tracks written and read simultaneously, as well as the tape-head speed, linear density and the channel-code. Similarly, physically smaller, lighter tape housings are faster to move in a robotic library. Data density is a product of:

6.3.5.1.1  tape length and thickness trade-offs
6.3.5.1.2  track width and pitch
6.3.5.1.3  linear density of data payload within each track

6.3.6 Tape Coatings

6.3.6.1  There are two main types of tape coatings: particulate and evaporated. The earliest coated data tapes used metal oxides similar to video tape, whereas more recent data tapes use metal particles (MP). Pure iron with inert ceramic and oxide passivation layers is dispersed in polymer binders which are applied evenly to a PET or PEN base-film or substrate which in turn provides dimensional stability and strength under tension. Some of the highest density data tapes currently on the market now use evaporated metal foil coatings of cobalt alloys and similar material to those used on hard disks. This achieves a much higher purity of magnetic material and allows thinner coatings. Most metal-evaporated (ME) tapes have a protective polymer coating similar to the binder material on MP tapes. The more recent formulations include a ceramic protective layer as well. Several of the early ME tapes failed during heavy usage due to de-lamination (Osaki 1993:11).

6.3.7 Tape Housing Design

6.3.7.1  Two basic styles of housings are used, dual-hub cassettes, which may enable faster access times and single-hub cartridges, which offer greater capacity within a given external volume.

6.3.7.2  Dual hub cassettes include:

3.81mm, principally DDS [derived from DAT]
QIC [quarter-inch cartridge] and Travan
8mm formats, including Exabyte and AIT
DTF
Storagetek 9840

6.3.7.3  Single-hub cartridges include:

IBM MTC and Magstar formats such as 3590, 3592 and TS1120
Quantum S-DLT and DLT-S4
LTO Ultrium [100, 200, 400 & 800 GB]
Storagetek 9940 and T10000
Sony S-AIT

6.3.7.4  Neither design is necessarily superior for long-term archiving, since the life is governed by a range of details specific to each format. For instance, some models of the single-ended half-inch cartridges have large-diameter guides within the housing, which ensure minimum friction and accurate tape guidance. Problems have been experienced with the leader latching mechanism on older single-ended cartridges, although more recent designs have improved reliability in this area. Some dual-hub cassettes can be positioned to park halfway along the tape to minimise the amount of spooling time to any particular file. This contradicts the traditional practice in AV archives of spooling tapes carefully to one end before storage so that only leader tape is exposed to the threading mechanism. Tapes generally don’t incorporate a hermetically sealed enclosure in the way that hard disks are protected.

6.3.8 Linearly and Helically Scanned Tapes

6.3.8.1 Data tapes may be written or read with a fixed head, generally described as linear, or with a rotating or helical head. Linear tapes typically follow a serpentine track layout, and it has been argued that this shuttling can lead to wear or a so-called shoe-shine effect. In practice, modern tapes are designed to last for large numbers of passes; however, it is still prudent to access frequently used content from hard disk. Tapes that experience chemical decomposition from hydrolysis and other causes will usually run better over fixed guides and components in the tape path at speeds of around 1-2 m/s or greater, which are typical of fixed-head or linear formats. Rotary-head or helical formats typically have higher tape-head speeds, which create a greater air-bearing effect between the tape surface and the read-write heads, but the linear tape speed over the fixed guides and heads is much slower, so this is where fouling often occurs.

6.3.9 Ancillary Storage and Access Devices

6.3.9.1 Formats such as AIT include solid-state ‘Memory in Cassette’ or MIC which stores file positional information similar to a Table of Contents (TOC) on Compact Disks for rapid location of data. DTF uses rf memory.

6.3.10 Format Obsolescence and Technology Cycles

6.3.10.1  The inherent nature of data storage is of constant progress and development, which means inevitable change, and ongoing obsolescence. Realistic long-term management of content must accept and build upon the continuing evolution and upgrading of hardware and media. Although central infrastructure such as data cabling or storage libraries may remain in operation for ten or twenty years, individual tape drives and media have a finite life much shorter than this. All of the main data tape formats have development roadmaps projecting upgrades every 18 months to 2 years. Backward compatibility for read-only access is sometimes assured over one or two generations of media within any common family. As a result, each generation of tape drives and media may be viable for 4 to 6 years, after which time it is essential to migrate the data and move on (see note 1 below). Also, the hardware maintenance costs of mass storage systems tend to rise notably once the system is older than its projected life or the guarantee period has ended. After this it may be difficult to obtain new spare parts for the tape libraries or tape drives, for example. A summary of projected roadmaps is presented below. Many formats have read-only compatibility with at least one previous generation.

| Family | 1st Generation | 2nd Generation | 3rd Generation | 4th Generation | 5th Generation | 6th Generation |
|---|---|---|---|---|---|---|
| Quantum SDLT | SDLT220: 110 GB | SDLT320: 160 GB | SDLT600: 300 GB | DLT-S4: 800 GB | | |
| IBM | | | 3592 (2004): 300 GB, 40 MB/s | TS1120 (2006): 700 GB, 104 MB/s | | |
| Sun - Storagetek | | 9940B (2002): 200 GB, 30 MB/s | T10000 (2006): 500 GB, 120 MB/s | T10000B (2008): 1 TB, 120 MB/s | | |
| LTO | LTO-1 (2001): 100 GB, 20 MB/s | LTO-2 (2003): 200 GB, 40 MB/s | LTO-3 (2004): 400 GB, 80 MB/s | LTO-4 (2007): 800 GB, 120 MB/s | LTO-5 (no date, 2009+): 1.6 TB, 180 MB/s (estimated) | LTO-6 (no date, 2011+): 3.2 TB, 270 MB/s (estimated) |
| Sony S-AIT | S-AIT (2003): 500 GB, 30 MB/s | S-AIT2 (2006): 800 GB, 45 MB/s | | | | |
| Sony AIT | | | AIT-3 (2003): 100 GB, 12 MB/s | AIT-4 (2005): 200 GB, 24 MB/s | | |

Table 1 Section 6.3: Projected Development Roadmap for Data Tapes


1. This implies a degree of waste and environmental pressure beyond the scope of our purely technological discussion, but in reality, a large-scale library of older data tapes will consume more polymers and require more petrochemicals for manufacture than a newer, high-density system with more energy-efficient drives and robotics, while also occupying more real estate.

6.3.11 Automated Robotics or Manual Retrieval

6.3.11.1  For small-scale operations it is possible to back up data from a single workstation onto a single data tape drive and manually load tapes for storage on traditional shelving, and even small scale networked systems will undertake manual backup of their storage (see also Chapter 7 Small Scale Approaches to Digital Storage Systems). The same guidelines for storage environments apply as for other magnetic tapes, though increased attention to minimising the presence of dust and other particulates and pollutants would be beneficial. For larger-scale operations, particularly in countries where labour costs are high and capital equipment budgets are favourable, a degree of automation is normally desirable and more economical than purely manual systems. The degree of automation depends upon the scale and consistency of the task, type of access to the content, and the relative costs of the main resources.

6.3.11.2   Autoloaders and Robotic Tape Libraries: The next step from single drives is the small-scale autoloader, which usually has one drive (occasionally two), and a single row or carousel of data tapes which are fed in sequentially to support backup operations. One of the key differences between autoloaders and large-scale robotic libraries is that, with an autoloader, the recorded tapes are not logged by the backup software in a central database that could enable automated retrieval. The task of searching, retrieving and reloading individual files still falls to a human operator. All that autoloaders do, as the name implies, is to allow a series of tapes to be written or read sequentially to overcome the size limitations of individual data media, and to negate the requirement for a human operator’s presence to load the next tape in a long backup sequence.

6.3.11.3   By way of contrast, even the smallest robotic tape libraries are programmed to behave as a single, self-contained storage system. The location of individual files on different tapes is transparent to the user, and the library controller keeps track of addresses of files on each tape, and of the physical location of tapes within the library. If tapes are removed or reloaded, the robotic sub-system re-scans the tape slots as it initialises, to update its inventory with metadata from barcodes, rf tags, or memory chips in the tape housings.
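The two kinds of bookkeeping described above (where each file sits on tape, and where each tape sits in the library) can be pictured with a toy data structure. This is only a sketch: the class `TapeLibraryInventory`, its method names and the barcode values are hypothetical, and real library controllers expose vendor-specific interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class TapeLibraryInventory:
    """Toy model of the two lookups a library controller maintains."""

    # barcode -> slot number, rebuilt whenever the slots are re-scanned
    slot_of_tape: Dict[str, int] = field(default_factory=dict)
    # file name -> (barcode, position of the file on that tape)
    address_of_file: Dict[str, Tuple[str, int]] = field(default_factory=dict)

    def rescan(self, scanned_slots: Dict[int, str]) -> None:
        """Replace the tape inventory with what the robot found in the slots."""
        self.slot_of_tape = {barcode: slot for slot, barcode in scanned_slots.items()}

    def record_file(self, name: str, barcode: str, position: int) -> None:
        """Remember on which tape, and where on it, a file was written."""
        self.address_of_file[name] = (barcode, position)

    def locate(self, name: str) -> Optional[Tuple[int, int]]:
        """Return (slot, position on tape), or None if the tape is not loaded."""
        barcode, position = self.address_of_file[name]
        slot = self.slot_of_tape.get(barcode)
        return None if slot is None else (slot, position)
```

Keeping the file addresses separate from the slot map is what lets the user ask for a file by name: the address survives when a tape is removed, and a re-scan restores access when it is reloaded into any slot.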

6.3.11.4  Large tape libraries have some benefits when compared to smaller tape libraries. They can be built to be redundant and distributed, i.e. downtime can be minimised and the read/write load can be balanced between several similar systems. A large tape library can also be used as a multi-purpose system; it can, for example, maintain a company’s normal IT backups as well as manage all archived video and audio.

6.3.11.5   Data tapes or cartridges used in a robotic system will have some system of barcoding, rf tags or other ID. These optical or electromagnetic recognition systems sometimes operate in conjunction with MIC for supplementing information about tape ID and content. Some formats have a global ID system for barcoding tapes so that a tape used in one robotic library can be recognised in another library system.

6.3.11.6   Backup and Migration Software and Schedules: Some confusion and misunderstanding exists, both in IT circles and in the wider community, as to the purpose and operation of long-term data archives. There are two popular misconceptions. The first is that archiving is the process of moving infrequently used material from expensive, on-line networked disk storage to less expensive, inaccessible offline shelving from whence it may never be retrieved. The second is that backup is a regular daily and weekly routine of making a copy of everything stored in the system.

6.3.11.7   With regard to the first misconception, the reality is that some of the most important and valuable material may not be used for months or years, but its survival must be guaranteed unequivocally. Likewise with the second, if suitable rules are established, vast amounts of material may not need to be replicated daily or weekly when only small percentages are updated. In practice, while a stringent regime of replicating data on different media in different locations is essential to minimise risks from technology failures and to ensure recovery from disasters, the particular characteristics of digital heritage material require some procedures that differ from routine IT data management.

6.3.11.8   Conventional HSM (Hierarchical Storage Management) systems may be optimised for backing up everything on a regular basis, and moving out infrequently-used content to inaccessible locations, but the better systems can be configured to suit the business rules and practices in archives of different sizes with different levels of access. A medium-sized organisation may ingest 100 GB of audio data every week or 1TB of video. It is fairly straightforward to ensure that copies are made as soon as valuable material is ingested, and that frequently used material remains accessible.

6.3.11.9   Some of the primary tasks of storage management software are to optimise the use of resources and to manage devices in the hardware layer, while regulating traffic with minimal delays to users. HSM software offers a choice of conditions for migrating files from on-line disk to tape, such as older than a certain date, larger than a nominated size, located in particular sub-folders or when available disk space falls outside certain limits (high and low watermark).
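The migration conditions listed above can be combined into a simple selection rule. The following is a minimal sketch, not the behaviour of any particular HSM product: the thresholds, the `masters/` folder name and the function `select_for_migration` are hypothetical, and it assumes that the disk holds only the files passed in. Files matching an age, size or folder rule are always selected; if usage is above the high watermark, the least recently accessed remaining files are added until usage falls to the low watermark.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass
class DiskFile:
    path: str
    size_bytes: int
    last_access: datetime


def select_for_migration(files: Sequence[DiskFile], now: datetime, capacity_bytes: int,
                         max_age: timedelta = timedelta(days=90),
                         min_size_bytes: int = 10**9,
                         folders: Sequence[str] = ("masters/",),
                         high_watermark: float = 0.85,
                         low_watermark: float = 0.70) -> List[DiskFile]:
    """Return the files to move from on-line disk to tape."""
    # Rule-based selection: old, large, or located in a nominated folder.
    selected = [
        f for f in files
        if now - f.last_access > max_age
        or f.size_bytes >= min_size_bytes
        or f.path.startswith(tuple(folders))
    ]
    used = sum(f.size_bytes for f in files)
    if used / capacity_bytes > high_watermark:
        # Watermark-based selection: free space until the low watermark is reached.
        used -= sum(f.size_bytes for f in selected)
        remaining = sorted((f for f in files if f not in selected),
                           key=lambda f: f.last_access)
        for f in remaining:
            if used / capacity_bytes <= low_watermark:
                break
            selected.append(f)
            used -= f.size_bytes
    return selected
```

Using two watermarks rather than one avoids migrating a single file every time the disk crosses a threshold: migration starts only when the high mark is exceeded and then frees enough space to leave working headroom.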

6.3.11.10  Typically, where both high resolution files, as well as low resolution access copies are produced, the larger, high resolution files used for preservation and broadcast will be migrated to tape to free up space on the more expensive hard disk array. A balance is needed to maintain availability of material, and to optimise use of tape drives and media. If tapes are being accessed very frequently, a large number of mounts and unmounts, spooling and restore operations will degrade system performance. More sophisticated content management systems sometimes incorporate lower levels of storage management so that users are less aware of individual files and components that support the system.

6.3.12 Selection and Monitoring of Data Tape Media

6.3.12.1  As with any conventional preservation system, it is important not only to have backups and redundancy in case of failures in media or components, but also to establish and to measure performance standards for key parts of the system. Software such as SCSI-Tools will allow a lower level of interrogation of individual drives and devices on a network to determine if media and hardware are performing at their optimum level. LTO tape has an interface for data monitoring; this functionality is rarely utilised, though it would be advantageous for archival systems. Some HSM systems are capable of monitoring the quality of stored assets on a regular basis. These systems monitor the error rates of tapes while users access the assets, or read the assets without user intervention if a tape has not been used during a certain period of time.
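The two monitoring behaviours described above (reading tapes that have been idle for too long, and watching error rates) can be sketched as two small selection functions. The 180-day idle period, the error-rate unit and both function names are hypothetical; suitable thresholds depend on the tape format and on the figures that the drive or HSM actually reports.

```python
from datetime import date, timedelta
from typing import Dict, List


def tapes_due_for_read_check(last_read: Dict[str, date], today: date,
                             max_idle: timedelta = timedelta(days=180)) -> List[str]:
    """Barcodes of tapes idle for longer than max_idle, longest idle first."""
    idle = {barcode: today - read_on for barcode, read_on in last_read.items()}
    due = [barcode for barcode, gap in idle.items() if gap > max_idle]
    return sorted(due, key=lambda barcode: idle[barcode], reverse=True)


def tapes_over_error_limit(corrected_errors_per_gb: Dict[str, float],
                           limit: float) -> List[str]:
    """Barcodes whose most recent read exceeded the chosen error limit."""
    return sorted(b for b, rate in corrected_errors_per_gb.items() if rate > limit)
```

A tape returned by either function is a candidate for a verification read or for early migration to new media, before errors become uncorrectable.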

6.3.13 Costs

6.3.13.1  Typically, the cost of data tape storage is spread over four areas: tape media (procurement and replacement of primary and backup tape media every 3-5 years); tape drives (procurement and replacement every 1-5 years, with support); the robotic library (purchase and maintenance within a 10 year life-cycle); and software (purchase, integration/development and maintenance).

6.3.13.2  In a manual system, the costs for shelving will be lower, although the space requirement for staff is greater, and the labour cost for manual retrieval and checking is higher. In an automated robotic system, much of the human cost is offset by up-front expense for hardware and software. Large scale robotic tape libraries can be purchased in a modular fashion to spread the cost over several years as demand for storage grows. Within the life of a robotic tape library, individual components such as tape drives will be replaced by newer technology every three to five years. If content from an archive is accessed continuously the life time of drives can be considerably shorter, even only one year or less. Older tape media and drives may be kept on hand for redundancy if required. If an archive does not grow rapidly, the present and next generation of tapes and drives can co-exist in a tape library while the archive content is migrated to the next generation of media or technology. If an archive grows continuously it may be cost-effective to create a tape library sized to store only the amount of content that will be archived during the life time of the then current technology, and then to acquire a new, larger tape library to store the content that will be archived using the next generation of technology, including the old content that will be migrated. The latter approach is also necessary if old and new technology cannot co-exist in the same unit.

6.3.13.3  It is good business practice to keep at least one redundant copy of data off-site or geographically separate. A separation of 20 to 50 km is common: it lies beyond the typical reach of a single natural or man-made disaster, and still allows manual retrieval within a few hours. To reduce risks further, redundant copies should be on different batches or sources of media, or even on different technologies. Some data tapes are manufactured by only a single supplier, which increases the chance of a single point of failure. Three copies of data are safer than two, and although costs for media increase, the hardware and software costs are only slightly higher than for the first copy.

6.3.14 Hard Disk Drives (HDD) Introduction

6.3.14.1  Hard Disk Drives (HDDs) have served as the primary memory and data storage in computers since IBM introduced the model 3340 disk drive in 1973. Nicknamed “the Winchester”, because it had 30MB of fixed memory and 30MB of removable and the working designation of 30/30 resembled, in name at least, the famous rifle, it pioneered head designs that made operation of the hard disk viable. Subsequent reduction in size and more recent developments in head and disk design have greatly increased the reliability of disk drives, leading to the robust designs in common use today.

6.3.14.2  Data managers whose responsibility it is to maintain data have considered the hard disk too unreliable to use as the sole copy of an item, and too expensive to use in multiple, and consequently more reliable, disk arrays. The data on HDDs has consequently been duplicated on multiple tape copies to ensure its survival. As stated above (6.1.4 Practical Aspects of Data Protection Strategies and 7.6 Archival Storage) all data systems must have multiple and separate copies of all data. While experts tend to agree that the most reliable data system consists of a HDD array supported by multiple duplicates on tape, the continued reduction in costs and improvement in reliability make the concept of identical duplicates of data on separate hard disks a possibility. The principle of multiple media remains, however, and disk only storage constitutes a risk.

6.3.15 Reliability

6.3.15.1  Loss of data as a consequence of disk failure and head crashes has made most data professionals suspicious of HDDs; however, manufacturers now claim annualised failure rates of less than one percent and an operational life of 40,000 hours (Plend 2003). High reliability drives may have an even longer operational life, termed by manufacturers “mean time between failure”. Though HDDs are self-contained and sealed, and so protected from damage, most failures in disk drives occur in two opposing ways: as a result of wear through extended use, or as power to the drive is turned on or off. The dilemma is whether to leave the disk on and increase wear, or to turn it on and off and increase the risk of failure.

6.3.16 System Description, Complexity and Cost

6.3.16.1 As noted in Section 2, Key Digital Principles, the more recent generations of computers have sufficient power to manipulate large audio files. All recent generation computers incorporate hard disks of adequate speed and size, and an external HDD adapter can be plugged into a USB, Firewire or SCSI port. The system complexity and the degree of expertise required to run such systems are not much greater than those necessary for desktop computer operation.

6.3.16.2 When large quantities of audio and audiovisual material required for access are stored on HDDs, the disks are usually incorporated into a RAID (Redundant Array of Inexpensive (or Independent) Disks). RAID increases the reliability of the hard disk system, and the overall access speed, by treating the array of disks as one large hard disk. If a disk fails, it can be replaced and all the data on that disk can be reconstructed with data from the rest of the disks in the array. The level of failure the system will tolerate, and the speed of recovery from such failures, is a product of the RAID level. RAID is not designed as a data preservation tool, but as a means of maintaining access through inevitable disk failures. The appropriate RAID level for any particular installation, and the requirement for duplication of controllers, is dependent on the particular circumstances and the frequency of data duplication. A RAID requires that all disks in the array be turned on when any part of the disk is in use. All RAIDs containing archival material, as with all digital data, must be duplicated more than once on other carriers.
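How the contents of a failed disk can be reconstructed from the remaining disks is easiest to see with single-parity XOR, the scheme used by parity-based levels such as RAID 5. The sketch below is an illustration under that assumption only: it works on one stripe of short byte blocks, the function names are hypothetical, and real controllers also handle striping, parity rotation and rebuild scheduling.

```python
from functools import reduce
from typing import List, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks: Sequence[bytes]) -> bytes:
    """Recover the one missing block of a stripe from all survivors (data and parity)."""
    return xor_blocks(surviving_blocks)


# One stripe spread over three data disks plus one parity disk.
data: List[bytes] = [b"AUDI", b"O_DA", b"TA__"]
parity = xor_blocks(data)

# Disk 1 fails; its block is recomputed from the other data blocks and the parity.
rebuilt = rebuild_missing([data[0], data[2], parity])
```

Because the parity block is the XOR of all data blocks, XOR-ing the survivors cancels everything except the missing block. The same property explains the limit of the scheme: with two blocks of a stripe lost, single parity cannot recover either, which is why RAID maintains access but does not replace separate copies on other carriers.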

| Capacity | Native tape capacity (GB) | # of tapes | Recommended # of tape drives | Maximum # of drives | System price (€) | Tape price (€) | Drive price (€) | Cost per GB (€) |
|---|---|---|---|---|---|---|---|---|
| 10 TB | 800 | 13 | 2 | 4 | 20.480 | 97 | 7.625 | 2,05 |
| 50 TB | 800 | 63 | 4 | 16 | 56.800 | 97 | 10.175 | 1,14 |
| 100 TB | 800 | 125 | 8 | 16 | 134.050 | 97 | 12.725 | 1,34 |
| 200 TB | 800 | 250 | 12 | 16 | 205.350 | 97 | 12.725 | 1,03 |
| 500 TB | 800 | 625 | 18 | 56 | 446.938 | 97 | 15.975 | 0,89 |
| 1000 TB | 800 | 1250 | 36 | 88 | 864.517 | 97 | 15.975 | 0,86 |
| 2000 TB | 800 | 2500 | 72 | 176 | 1.687.690 | 97 | 15.975 | 0,84 |

Table 2 Section 6.3: Investment Costs of LTO-4 technology based Storage Systems

 

| Capacity | HW maintenance, year 1 (€) | SW maintenance, year 1 (€) | HW maintenance, year 2 (€) | SW maintenance, year 2 (€) | HW maintenance, year 3 (€) | SW maintenance, year 3 (€) | HW maintenance, year 4 (€) | SW maintenance, year 4 (€) | HW maintenance, year 5 (€) | SW maintenance, year 5 (€) |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 TB | 2.420 | n/a | 2.420 | n/a | 2.420 | n/a | 2.514 | n/a | 2.514 | n/a |
| 50 TB | 3.454 | n/a | 4.958 | n/a | 4.958 | n/a | 4.958 | n/a | 4.958 | n/a |
| 100 TB | 11.808 | 490 | 13.817 | 490 | 13.817 | 490 | 13.817 | 490 | 13.817 | 490 |
| 200 TB | 15.787 | 582 | 19.323 | 582 | 19.323 | 582 | 19.323 | 582 | 19.323 | 582 |
| 500 TB | 27.380 | 1.068 | 34.111 | 1.068 | 34.111 | 1.068 | 34.111 | 1.068 | 34.111 | 1.068 |
| 1000 TB | 47.542 | 2.115 | 66.734 | 2.115 | 66.734 | 2.115 | 66.734 | 2.115 | 66.734 | 2.115 |
| 2000 TB | 99.272 | 4.221 | 99.272 | 4.221 | 99.272 | 4.221 | 99.272 | 4.221 | 99.272 | 4.221 |

Table 3 Section 6.3: Yearly Maintenance Costs of LTO-4 technology based Storage Systems

Notes to the tables:

  • Prices are averages of list prices from multiple vendors. The price that a customer has to pay is usually somewhat lower.
  • Prices indicate the price of raw capacity. At least double the amount of tape media will be needed for backup purposes.
  • The price in the system price column includes the cost of tapes and drives for the capacity in question, but does not include any HSM software or hardware.
  • The tables indicate only investment costs and maintenance fees that have to be paid to a vendor. In addition, costs for electricity, cooling, machine room, management, etc. must be included in individual calculations. Electricity and cooling of a tape library system might cost 10% of the purchase price over a five-year period.

 

| Capacity | Drive technology | Size of drive (GB) | # of drives | System price (€) | Drive price (€) | Cost per GB (€) |
|---|---|---|---|---|---|---|
| 5 TB | SATA | 500–1000 | 5–10 | 11.884 | 1.000 | 2,38 |
| 10 TB | SATA | 750–1000 | 10–14 | 19.997 | 1.000 | 2,00 |
| 50 TB | SATA/FATA | 1000 | 50 | 124.334 | 1.800 | 2,49 |
| 100 TB | SATA/FATA | 1000 | 100 | 230.914 | 1.800 | 2,31 |
| 200 TB | SATA/FATA | 1000 | 200 | 456.942 | 1.800 | 2,28 |
| 500 TB | SATA/FATA | 1000 | 500 | 1.202.726 | 1.900 | 2,41 |
| 1000 TB | SATA/FATA | 1000 | 1000 | 2.566.513 | 1.900 | 2,57 |
| 2000 TB | SATA/FATA | 1000 | 2000 | 4.782.584 | 1.900 | 2,39 |

Table 4 Section 6.3: Investment Costs of HDD Based Storage Systems

 

| Capacity | HW maintenance, year 1 (€) | SW maintenance, year 1 (€) | HW maintenance, year 2 (€) | SW maintenance, year 2 (€) | HW maintenance, year 3 (€) | SW maintenance, year 3 (€) | HW maintenance, year 4 (€) | SW maintenance, year 4 (€) | HW maintenance, year 5 (€) | SW maintenance, year 5 (€) |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 TB | 826 | 750 | 826 | 750 | 826 | 750 | 1.845 | 750 | 1.845 | 750 |
| 10 TB | 1.206 | 1.125 | 1.206 | 1.125 | 1.206 | 1.125 | 2.600 | 1.125 | 2.600 | 1.125 |
| 50 TB | 5.822 | 6.125 | 5.822 | 6.125 | 5.822 | 6.125 | 12.365 | 6.125 | 12.365 | 6.125 |
| 100 TB | 10.514 | 8.500 | 10.514 | 8.500 | 10.514 | 8.500 | 22.391 | 8.500 | 22.391 | 8.500 |
| 200 TB | 21.724 | 12.750 | 21.724 | 12.750 | 21.724 | 12.750 | 44.956 | 12.750 | 44.956 | 12.750 |
| 500 TB | 57.061 | 37.250 | 57.061 | 37.250 | 130.394 | 37.250 | 130.394 | 37.250 | 130.394 | 37.250 |
| 1000 TB | 130.203 | 66.250 | 130.203 | 66.250 | 263.537 | 66.250 | 263.537 | 66.250 | 263.537 | 66.250 |
| 2000 TB | 223.778 | 124.250 | 223.778 | 124.250 | 477.121 | 124.250 | 477.121 | 124.250 | 477.121 | 124.250 |

Table 5 Section 6.3: Yearly Maintenance Costs of HDD Based Storage Systems

Notes to the tables:

  • Prices are averages of list prices from multiple vendors. The price that a customer has to pay is usually somewhat lower.
  • The price in the system price column includes the cost of hard disk drives for the capacity in question.
  • The tables indicate only investment costs and maintenance fees that have to be paid to a vendor. In addition, costs for electricity, cooling, machine room, management, etc. must be included in individual calculations. Electricity and cooling of a hard disk drive system might cost 30% to 40% of the purchase price over a five-year period.
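The notes above imply that a fair comparison adds vendor maintenance and an energy estimate to the purchase price. As a worked example, the sketch below totals these three items over five years for the 100 TB rows of Tables 2 to 5. The function `five_year_cost` is hypothetical, the energy share is taken from the notes (10% for tape, the lower bound of 30% for disk), and the result deliberately leaves out everything the notes exclude: the second set of tape media, HSM software and hardware, machine room and staff.

```python
from typing import Sequence


def five_year_cost(system_price: float, hw_maintenance: Sequence[float],
                   sw_maintenance: Sequence[float], energy_percent: float) -> float:
    """Purchase price + five years of vendor maintenance + energy estimate (euros)."""
    if len(hw_maintenance) != 5 or len(sw_maintenance) != 5:
        raise ValueError("expected one maintenance figure per year for five years")
    energy = system_price * energy_percent / 100
    return system_price + sum(hw_maintenance) + sum(sw_maintenance) + energy


# 100 TB LTO-4 system (Tables 2 and 3)
tape_100tb = five_year_cost(
    134_050,
    hw_maintenance=[11_808, 13_817, 13_817, 13_817, 13_817],
    sw_maintenance=[490] * 5,
    energy_percent=10,
)

# 100 TB HDD system (Tables 4 and 5)
disk_100tb = five_year_cost(
    230_914,
    hw_maintenance=[10_514, 10_514, 10_514, 22_391, 22_391],
    sw_maintenance=[8_500] * 5,
    energy_percent=30,  # lower end of the 30% to 40% range given in the notes
)
```

With these inputs the totals come to roughly 217 thousand euros for the tape system and 419 thousand euros for the disk system. The gap narrows once duplicate tape media and HSM components are added, so the figures are a starting point for an individual calculation rather than a conclusion.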

6.3.17 Disk Only Storage

6.3.17.1  RAID arrays are scalable within the limits of the system; however, individual HDDs are infinitely scalable by simply adding more drives. Since the introduction of the IBM 3340 HDD, storage capacity has increased rapidly, almost exponentially, while costs have fallen. These changes, linked with an improvement in reliability, have led some to suggest that HDDs could be used as both the primary storage system and the backup copy. There are three difficulties associated with this approach. Firstly, hard disk life is estimated in terms of usage-time, that is, the number of hours of operation; there has been no testing of the life of an infrequently used HDD. Secondly, having data on different types of media is advantageous as it spreads the risk of failure, so the approach should be considered very cautiously. Finally, there is no way of monitoring the condition of the hard disk on the shelf without turning it on at regular intervals and thereby compromising the advantage gained by having the disk turned off (see section 6.3.20 below, Monitoring of Hard Disk Media). Multiple carriers (e.g. tape and hard disk) remain the preferred option. Hard disks should be implemented within an integrated system.

6.3.18 Hard Disk Storage Systems

6.3.18.1  Hard Disk Storage Systems are centralised systems that are used to maximise disk storage utilisation and to provide large capacities and/or high performance. These systems are used in conjunction with server computers, so that the servers have only a small amount of internal hard disk storage or none at all. Systems of this kind are often used in mid and large size environments as storage for an archiving system. Alternatively, an archiving system can share a centralised storage system with a number of other computer systems. The size of a system can vary from 1 terabyte to several petabytes. The performance characteristics of a storage system can vary notably according to its chosen configuration. It is therefore essential that the actual needs for a system are carefully planned beforehand, and that a qualified professional is used to configure the storage structure and interfaces of the system to produce the best value for one's investment.

6.3.18.2  Centralised disk storage systems are designed to provide better error resilience than independent hard disk drives. These systems provide several alternative levels of RAID protection, their components can be made redundant in order to avoid single points of failure, and systems can be locally or geographically distributed to protect valuable assets from different kinds of failures and disasters.

6.3.18.3  The connection between the storage system and the computers it serves plays an important role in the performance of a system. Generally speaking, the two methods used are NAS (Network Attached Storage) and SAN (Storage Area Network). While NAS uses a regular IT network such as Ethernet to move data between computer and storage system, SAN uses switched Fibre Channel connections. NAS systems can operate at 100 Mbit/s, 1 Gbit/s and 10 Gbit/s, while SANs operate at 2 Gbit/s or 4 Gbit/s. Both technologies have a clear road map and their performance can be expected to grow in the future. SAN technology is usually chosen for more demanding environments since its specific design gives better performance. For example, the input/output (I/O) block size can be controlled more effectively in SAN environments, while networking protocols tend to force NAS systems to use quite small I/O blocks. From an economic point of view, NAS technology is cheaper than SAN technology.
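
As a rough illustration of what the link speeds quoted above mean in practice, the following sketch estimates the idealised time needed to move one terabyte over each connection. It is a back-of-envelope calculation only: the function name is hypothetical, a decimal terabyte (10^12 bytes) is assumed, and protocol overhead, I/O block size and disk throughput are all ignored, so real transfers will be slower.

```python
def transfer_hours(size_terabytes: float, link_gbit_per_s: float) -> float:
    """Idealised hours needed to move data over a link, ignoring all overhead."""
    bits = size_terabytes * 1e12 * 8            # decimal terabytes -> bits
    seconds = bits / (link_gbit_per_s * 1e9)    # Gbit/s -> bit/s
    return seconds / 3600


# Link speeds quoted in the text, expressed in Gbit/s.
LINKS = {
    "NAS 100 Mbit/s": 0.1,
    "NAS 1 Gbit/s": 1.0,
    "NAS 10 Gbit/s": 10.0,
    "SAN 2 Gbit/s": 2.0,
    "SAN 4 Gbit/s": 4.0,
}

if __name__ == "__main__":
    for name, speed in LINKS.items():
        print(f"{name}: {transfer_hours(1.0, speed):.2f} h per terabyte")
```

The nominal figures alone do not decide between NAS and SAN; as noted above, the way each technology handles I/O block sizes may matter more than the headline link speed.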

6.3.19 HDD Life

6.3.19.1  As stated above, a life of 40,000 hours is estimated for many commercially available HDDs. Typical commercial use of HDDs would give these disks a replacement life of five years. With improvements such as fluid/ceramic spindle bearings, surface lubrication of disks, and special head parking techniques introduced on the most recent desktop HDDs, the life of HDDs may be somewhat longer. However, there is no reliable testing of the life span of unused HDDs, and it would be prudent to plan to replace the disks in a working system within five years.
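
The relationship between the 40,000 hour estimate and a replacement cycle of about five years can be checked with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, a 365-day year is assumed, and the result says nothing about drives that are mostly powered off, for which, as noted above, no reliable testing exists.

```python
def service_years(rated_hours: float, powered_hours_per_day: float) -> float:
    """Years taken to accumulate the rated operating hours (365-day year assumed)."""
    return rated_hours / (powered_hours_per_day * 365)


if __name__ == "__main__":
    # A drive powered continuously, 24 hours a day.
    print(f"{service_years(40_000, 24):.2f} years")
```

For a continuously powered drive the estimate works out at a little over four and a half years, which is consistent with planning replacement within five years.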

6.3.20 Monitoring of Hard Disk Media

6.3.20.1  An indication of imminent disk failure may be an increase in bad data blocks. It is typical for the latest disks to show bad block errors even when new, and most data systems manage the bad blocks by reassigning the address of the affected block. However, if the number of bad blocks increases, it may indicate that the disk is beginning to fail. Software exists which will provide a warning of increased bad data blocks, as well as measuring other physical characteristics that may indicate disk problems.
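
The kind of warning described above can be sketched as a simple trend check over successive bad block counts. This is a minimal illustration, not the behaviour of any particular monitoring product: the function name and both thresholds are hypothetical placeholders, and the counts are assumed to have been collected already by whatever disk monitoring tool is in use.

```python
from typing import Sequence


def bad_block_warning(counts: Sequence[int],
                      max_total: int = 100,
                      max_growth: int = 10) -> bool:
    """Return True when the latest reading suggests the disk may be failing.

    counts     -- bad block counts from successive checks, oldest first
    max_total  -- hypothetical ceiling on the absolute number of bad blocks
    max_growth -- hypothetical ceiling on the increase since the previous check
    """
    if not counts:
        return False                      # nothing measured yet
    latest = counts[-1]
    if latest > max_total:
        return True                       # too many bad blocks overall
    if len(counts) >= 2 and latest - counts[-2] > max_growth:
        return True                       # count is rising quickly
    return False


if __name__ == "__main__":
    print(bad_block_warning([3, 3, 4]))   # stable: a few bad blocks from new
    print(bad_block_warning([3, 4, 40]))  # sharp increase between checks
```

A small, stable count is treated as normal, in line with the observation that new disks commonly show some bad blocks; only a high or rapidly growing count raises a warning.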

6.3.21 HDD technologies

6.3.21.1  There are four main methods of connecting HDDs and other peripheral devices to computers: USB (Universal Serial Bus), IEEE 1394 (Firewire), SCSI (Small Computer System Interface) and SATA/ATA (Serial Advanced Technology Attachment/AT Attachment). Each has particular advantages in certain situations. USB and Firewire are designed as all-purpose buses that can connect an HDD, a digital video camera or an MP3 player to a personal computer. SCSI and SATA/ATA are mainly used to connect hard disk drives to a computer or disk storage system.

6.3.21.2  SCSI and its successor SAS (Serial Attached SCSI) allow faster writing and reading speeds, and facilitate access to larger numbers of drives, than SATA/ATA. SCSI disks can accept multiple commands at once on a SCSI bus and do not suffer from request queues as SATA/ATA drives do. SATA/ATA drives are comparatively cheaper. The read access speed is largely the same, and in an audio context neither interface will limit the operation of a digital audio workstation (DAW) more than the other. The performance difference between SCSI/SAS and SATA drives can become significant in a heavily utilised centralised hard disk storage system.

6.3.21.3  Fibre Channel (FC) SCSI/SAS drives are mainly used in demanding enterprise or business systems, while the cheaper SATA drives are more common in the personal market. SATA drives are, however, increasingly used in enterprise and business systems to offer more cost-effective storage capacity, e.g. in archival storage. In archival storage, the choice between (FC) SCSI/SAS and SATA technology depends on the actual load of the system. If a system is used to archive small or medium amounts of content that is not accessed intensively, a SATA-based solution might well be enough. The decision must be based on clearly identified demands and negotiations with one’s storage provider.

6.3.21.4  USB and Firewire connected disks can be used to transfer content from one environment to another, but since they are rather unreliable, difficult to monitor and easy to lose, they should not be used for archiving even though their pricing may seem very attractive.

6.3.21.5  The interface is not a completely consistent indicator of the reliability and performance of a given drive or storage system, and the purchaser should pay closer attention to the other operating and configuration parameters of a storage system. It seems to be the case that more reliable drives are associated with the FC SCSI/SAS interface. Nonetheless, HDDs are not in themselves permanently reliable, and all audio data should be backed up on suitable tape (see 6.3.5 Data Tape Performance). (For further discussion see Anderson, Dykes and Riedel 2003).

6.3.21.6  There is one emerging storage technology which may have a prominent position in the near future. Solid-state storage in the form of flash memory is developing as an alternative to moving disks and has already become an alternative to the HDD in laptop PCs. Some storage manufacturers have also introduced flash drives in their low cost or midrange storage systems and are planning to introduce them in their high end systems too. Even though flash storage still has some challenges in storage reliability to overcome, it might become a viable solution to the storage needs of the archival community: its price per gigabyte is becoming competitive, it is more environmentally friendly due to its lower demand for power, and it has no moving parts, which could mean a longer life time for storage units. A life time of ten years instead of five for a storage unit could mean lower investment and management costs for an archive, since every other migration to the next storage technology could be skipped. In terms of read and write performance, flash storage is already comparable with HDD technology.

6.3.22 Hierarchical Storage Management (HSM)

6.3.22.1  The OAIS Functions of Archival Storage embed the notion of Hierarchical Storage Management (HSM) in the conceptual model. At the time OAIS was written, it was not envisaged that large amounts of data could be affordably managed in other ways. The practical issue that underpins the need for HSM is the differing cost of storage media, e.g. where disc storage is expensive but tape storage is much cheaper. In this situation HSM provides a virtual single store of information, while in reality the copies can be spread across a number of different carrier types according to use and access speeds.

6.3.22.2  However, the cost of hard disc has fallen at a greater rate than the cost of tape, to the point where there is an equivalency in price. Consequently the use of HSM becomes an implementation choice. Under these circumstances a storage system which contains all of the data on a hard disc array, all of which is also stored on a number of tapes, is a very affordable proposition, especially for digital storage systems up to 50 terabytes (and rising every year). For a smaller digital storage facility a fully functional HSM is consequently unnecessary and instead what is required is a much simpler system which manages and maintains copy location information, media age and versions and completely replicates the stored data on hard disc and on tape.

6.3.22.3  For medium to large digital storage systems the need for HSM remains, and HSM continues to be amongst the most expensive components of such systems.

6.3.23 File Management Software in smaller systems

6.3.23.1  The purpose of file management software in systems where the entire archive is replicated both on hard disc and tape is to keep track of the location, condition, accuracy and age of the tape copies. This basic backup functionality is a lower cost alternative to a classic HSM and may, at least in theory, be more reliable for small systems. However, as large scale HSM represents a significant market, research and development in this area has been supported by the industry. Small scale file management software is being developed within the open source software development community. Examples include the three most popular open source NAS applications, FreeNAS, Openfiler and NASLite, and the Advanced Maryland Automatic Network Disk Archiver (AMANDA). As with all such open source solutions, the onus is on the user to test the suitability and reliability of such systems, and without further development this publication makes no specific recommendation.
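
The information such file management software needs to hold can be illustrated with a small record per tape copy. The sketch below is a hypothetical data structure, not the design of any of the applications named above: all field names, the function name and the default thresholds (a five-year media age and a yearly verification interval) are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass
class TapeCopy:
    file_id: str         # identifier of the archived file
    tape_label: str      # location: which tape holds this copy
    written_on: date     # age: when the copy was written
    checksum: str        # accuracy: checksum recorded at write time
    last_verified: date  # when the copy was last read back and checked
    verified_ok: bool    # condition: result of the last verification


def copies_needing_attention(copies: Iterable[TapeCopy],
                             today: date,
                             max_age_years: int = 5,
                             max_days_since_check: int = 365) -> List[TapeCopy]:
    """Return copies that failed verification, are too old, or are overdue a check."""
    flagged = []
    for copy in copies:
        age_days = (today - copy.written_on).days
        days_since_check = (today - copy.last_verified).days
        if (not copy.verified_ok
                or age_days > max_age_years * 365
                or days_since_check > max_days_since_check):
            flagged.append(copy)
    return flagged
```

A real system would also record the hard disc copy and the version history of each file; the point here is only that location, condition, accuracy and age can each be captured as a simple field and queried routinely.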

6.3.24 Verification and retrieval

6.3.24.1  In some commercial software, tape read/write errors can be reported automatically during the data backup and verification process. This function is normally implemented with a cyclic redundancy check (CRC), a technique that computes a checksum over the data in order to detect errors in transmission or storage. It is recommended that an error checking function be implemented in any archival storage system. Error checking is difficult to implement in open source software because the capability is linked to specific hardware. A commercially available stand-alone LTO Cartridge Memory Reader is the “Veritape” from MPTapes, Inc., and recently Fuji Magnetics announced a Chip Reader Diagnostics System for LTO cassettes, bundled with software.
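
The principle of a cyclic redundancy check can be shown in software, independently of any tape hardware. The following minimal sketch uses the CRC-32 function in Python's standard zlib module; the helper name and the sample data are invented for the illustration, and it does not reproduce the error reporting built into tape drives or backup products.

```python
import zlib


def crc32_of(data: bytes) -> int:
    """Return the CRC-32 of the data as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


if __name__ == "__main__":
    original = b"archival audio payload"
    stored_crc = crc32_of(original)       # value kept alongside the data

    # Simulate a single flipped bit introduced during storage or transfer.
    damaged = bytearray(original)
    damaged[0] ^= 0x01

    print(crc32_of(original) == stored_crc)        # unchanged data matches
    print(crc32_of(bytes(damaged)) == stored_crc)  # damaged data does not
```

A CRC of this kind is intended to catch accidental errors only; it offers no protection against deliberate alteration.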

6.3.25 Integrity and Checksums

6.3.25.1  A checksum is a calculated value which is used to check that all stored, transmitted or replicated data is without error. The value is calculated according to an appropriate algorithm and transmitted or stored with the data. When the data is subsequently accessed, a new checksum is calculated and compared with the original; if they match, no error is indicated. Checksum algorithms come in many types and versions, and their use is recommended, and standard, practice for the detection of accidental or intentional errors in archival files.
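
The calculate, store and compare cycle described above might look as follows in practice. This is a minimal sketch using Python's standard hashlib module; the function names are hypothetical, SHA-256 is chosen here only as an example algorithm, and where the stored checksum is kept (a database, a sidecar file, the package metadata) is left open.

```python
import hashlib


def file_checksum(path: str, algorithm: str = "sha256",
                  chunk_size: int = 1024 * 1024) -> str:
    """Calculate the checksum of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, stored_checksum: str, algorithm: str = "sha256") -> bool:
    """Recalculate the checksum and compare it with the value stored earlier."""
    return file_checksum(path, algorithm) == stored_checksum
```

The checksum would be calculated once at ingest and stored with the data; `is_intact` is then called whenever the file is accessed, copied or routinely audited.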

6.3.25.2  The cryptographic versions are the only type that has a proven record of trust when protecting against intentional damage to data, and even the simplest of these are now compromised. It has recently been shown that there are ways of creating meaningless bits that will calculate as a given MD5 checksum. This means that an external or internal intruder may replace digital content with meaningless data, and that this attack will go unnoticed by the error checking management system until the files are required for use and opened. MD5, although still useful for transmission purposes, is 128 bit and should not be used where security is the issue. SHA-1 is another cryptographic algorithm that is under threat of being compromised, and it has already been shown that it can, in theory, be circumvented. The length of SHA-1 is 160 bit; SHA-2 comes in versions with 224, 256, 384 and 512 bit lengths, which are algorithmically similar to SHA-1. The steady growth of computational power means that these checksums may, in the long run, be compromised as well.
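
The checksum lengths of the algorithms discussed above can be read directly from an implementation. The sketch below uses Python's standard hashlib module; the helper name is hypothetical, and on systems built with restricted cryptographic policies MD5 may be unavailable, in which case the call for that algorithm would raise an error.

```python
import hashlib

# Algorithms discussed in the text: MD5, SHA-1 and the SHA-2 family.
ALGORITHMS = ["md5", "sha1", "sha224", "sha256", "sha384", "sha512"]


def digest_bits(name: str) -> int:
    """Return the checksum length of the named algorithm in bits."""
    return hashlib.new(name).digest_size * 8


if __name__ == "__main__":
    for name in ALGORITHMS:
        print(f"{name}: {digest_bits(name)} bit")
```

Length alone does not establish that an algorithm is trustworthy, but it is a useful property to record alongside each stored checksum so that values calculated with older algorithms can be identified later.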

6.3.25.3  Even with these compromises, a checksum is a valid approach to detecting accidental errors and, if incorporated into a trusted digital repository, may well be sufficient to uncover intentional damage to data files in low risk scenarios. However, where risks exist, and perhaps even where they do not, monitoring checksums and their viability must be part of preservation planning.