7: Small Scale Approaches to Digital Storage Systems

7.1 Introduction

It is possible to build small scale digital storage systems that meet the requirements of archives with smaller collections and a small recurrent budget. Until recently, only large and comparatively wealthy organisations with sound archives were able to digitise their holdings on a large scale and store them by means of Digital Mass Storage Systems comprising managed hard disk and data tape. These systems tended to be large and expensive dedicated audio and audio-visual storage systems. In more recent years many national sound archives and large libraries have, together with the university and higher education sector, initiated and supported the development of open standards and open source software which support digital archiving widely. These enterprise systems are now the backbone and the model for all forms of digital archiving. Audio archiving benefits from using these systems and bringing our own discipline-specific knowledge to them.

At the same time as open source and other low cost software solutions are appearing on the market, the cost of data tape is decreasing, and the cost of hard disk drives (HDD) is dropping at an even greater rate. It is now possible to undertake digital archiving of a far more professional character than is offered by the inherently risky single carrier target formats such as recordable CD or DVD.

This chapter of the guidelines describes how a small scale digital repository meeting the requirements of an OAIS might be established and managed. Chapter 6, Preservation Target Formats and Systems, contains much that is pertinent to this chapter, as do Chapter 3, Metadata, and Chapter 4, Unique and Persistent Identifiers.

7.2 Approaches to Small Scale Digital Archiving

7.2.1 Funding and Technical Knowledge

It is quite possible to build a low cost digital preservation system, but this cannot be achieved without at least a small level of technical knowledge and some recurrent resources, albeit at a low level, to make it sustainable. Regardless of how simple or robust a system is, it must be managed and maintained, and it will need to be replaced at some time or risk losing the content it manages.

“Digital preservation is as much an economic issue as a technical one. The requirements of ongoing sustainability demand at their base a source of reliable funding, necessary to ensure that the constant, albeit potentially low level, support for the sustainability of the digital content and its supporting repositories, technologies and systems can be maintained for as long as it is required. Such constant funding is not at all typical of the many communities that build these digital collections, many of which tend to be grant funded on an episodic basis. There is therefore a need to develop costing models for sustainability of digital materials according to the specific requirements of the various classes of content, access and sustainability.” (Bradley 2004).

It is inevitable and unavoidable that the system and its hardware and software components will require maintenance and management which will demand both technical knowledge and dedicated funds. Any proposal to build and manage an archive of digital audio objects should have a strategy which includes plans for the funding of ongoing maintenance and replacement, and a listing of the risks associated with the loss of technical expertise and how that will be addressed.

7.2.2 Alternative Strategies

In the event that there is no adequate way to manage the risks described in the section above, an archive may decide to continue with the preservation and digitisation of its collection and to look to partnerships to manage the storage risks. An archive may choose to distribute the risk in a number of ways, including: by forming local partnerships so that content is distributed between a number of related collections; by establishing a relationship with a stable, well funded archive; or by engaging a commercial supplier of storage services (see Section 6.1.6 Long Term Planning).

To take advantage effectively of any of the approaches described, it would be necessary to establish an agreement about what data and content would be exchanged between the partners, and the form it would take. This agreement should be established well before the need to take advantage of it might occur. An agreement about exchange packages would consider all the relevant information necessary to continue the archival role undertaken by an archive. This would include the data that makes up the audio object itself in its archival form, the technical metadata, descriptive metadata, structural metadata, rights metadata, and the metadata created to record provenance and change history. It would need to be packaged in a standard form so that it could be used to recreate the archive if data were lost, or so that another archive could take up the role of managing the content if that were deemed necessary.

Tools to produce such profiles are available, for example those using the Metadata Encoding and Transmission Standard (METS), a widely used approach from the library community. Whether this or another strategy is used, agreement about its form is critical to the success of the strategy. Whether it is used to support remote content replication or to support a federation of cooperating archives, agreement about standard form and exchange is a most effective preservation strategy, spreading the risk of failure due to natural or man-made disaster, or just lack of resources at a critical time in the life-cycle of the digital audio object.
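
As an illustration of what an agreed exchange package has to account for, the following is a minimal sketch in Python. It is not METS and does not stand in for it: it simply writes a plain JSON manifest that lists the audio file and one file per metadata category named above, each with a checksum. The function names, the category keys and the manifest layout are all assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path

# Metadata categories named in the text above; the key names are illustrative only.
REQUIRED_METADATA = ("technical", "descriptive", "structural", "rights", "provenance")


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(identifier: str, audio_file: Path, metadata_files: dict) -> str:
    """Describe one exchange package as JSON text (hypothetical layout, not METS).

    metadata_files maps a metadata category name to the file that holds it.
    Raises ValueError if any category listed in REQUIRED_METADATA is missing.
    """
    missing = [name for name in REQUIRED_METADATA if name not in metadata_files]
    if missing:
        raise ValueError("package is incomplete, missing: " + ", ".join(missing))
    manifest = {
        "identifier": identifier,
        "audio": {"file": audio_file.name, "sha256": sha256_of(audio_file)},
        "metadata": {
            name: {"file": Path(path).name, "sha256": sha256_of(Path(path))}
            for name, path in sorted(metadata_files.items())
        },
    }
    return json.dumps(manifest, indent=2)
```

The point of the sketch is the completeness check: a package that lacks any of the agreed components is refused before it is sent to a partner, rather than being discovered to be incomplete when it is needed.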

7.3 Description of System

In Section 6.1.4 Practical Aspects of Data Protection Strategies, the need to address the functional categories defined in the Reference Model for an Open Archival Information System (OAIS, ISO 14721:2003) is argued. The same issues apply to both large and small scale collections, as this framework is critical to the development of modular storage systems with interoperable exchange of content. The following section, which deals with small scale systems, adopts the major functional components of the OAIS reference model to assist in the analysis of the available software and to develop recommendations for necessary development. They are Ingest, Access, Administration, Data Management, Preservation Planning and Archival Storage.

The system described consists of some form of repository software which manages the content, at least a minimum set of metadata, and hardware, together with some recommendations on manual approaches to managing the data’s integrity. The hardware section outlines broadly two situations in which small scale storage systems may be implemented: a single operator digitising onto a single storage device, and a situation where more than one operator requires access to the storage device. Either system assumes compliance with all other components mentioned in the Guidelines, including appropriate analogue to digital converters, adequate sound cards, digital audio workstations (DAW) and appropriate replay devices.

The following information describes systems and software that might support a small scale collection as though a single institution or collection were undertaking all the tasks. It is important to recognise that the approaches described below do not have to be undertaken by one collection. It is possible to find partners and commercial providers who might support some or all of the tasks described below. It is equally important to recognise that all of these tasks form the complete preservation and archival package and must be undertaken by someone, whether locally managed or distributed.

7.3.2 Repository Software

A well designed piece of repository software will support a number of the functions identified in the OAIS. Both commercial and open source repository software is available. The advantage of commercial software is that the provider is expected to make the system work; however, these commercial systems have ongoing expenses and may lock the user into proprietary systems from which it is hard to escape. The main advantages of open source software are that it is free, and that the developers adhere to open standards and frameworks which will allow the extraction of content in future upgrades. Its disadvantage is that, though open source communities are helpful, support is the responsibility of the user. It is, however, possible to find commercial providers who offer a support service for the open source solutions.

Most of these repository software systems will support the tasks identified in access, administration, data management and some aspects of ingest. At the time of writing, preservation planning and archival storage are generally not supported by repository software, the former being very often technology or format specific, and the latter dependent on hardware. They are discussed separately in the following sections.

Two open source software packages, DSpace and FEDORA, are briefly described. This software is under constant development, and the claims and comments made below should be checked against the latest developments made by the software providers.

The DSpace repository platform is very popular and widely adopted within the higher education and research sectors, although knowledge of its use within the museums and cultural heritage sectors is limited but growing. One of the reasons for the popularity of DSpace is that it is relatively easy to install and maintain, and has a ready made user-interface that integrates data management and access functions within the system’s architecture.
DSpace has a strong international developer community that has evolved to support it, and new features are being added constantly.

One of the strengths of DSpace is its integrated feature set, enabling institutional users to quickly establish a repository and then start adding new items to the collection. This strength, however, is also one of its major weaknesses, in that DSpace has evolved into a monolithic software application with a complex code base, which introduces potential scaling and capacity constraints for some large institutional users. This presents no problems for most small to medium scale collections, and is probably not an issue for any digital audio collection. DSpace currently uses a qualified version of the Dublin Core schema based on the Dublin Core Libraries Working Group Application Profile (LAP).

FEDORA (Flexible Extensible Digital Object and Repository Architecture) is an increasingly popular repository system that is designed as a base software architecture upon which a wide range of repository services can be built, including preservation services. Compared to the speedy adoption of DSpace, FEDORA has been slower to gain adopters because it lacks a dedicated user-interface and access service out-of-the-box. There are a number of commercial and open source providers of web-based front-ends for FEDORA.

The main strengths of FEDORA are its flexible and scalable architecture. The experiences of institutional adopters indicate that FEDORA can scale to cope with large collections, yet is sufficiently flexible to store multiple types of digital items and their complex relationships. There are few limitations to the features that can be added to FEDORA, whilst still remaining interoperable with other software applications and systems. It can be configured to support virtually any metadata profile through its METS ingest capabilities.
The main disadvantages of FEDORA are the high level of software engineering expertise required to contribute to its core development, and that it is not readily installed and implemented “out-of-the-box” (Bradley, Lei and Blackall 2007).

Tools have been developed to migrate content from DSpace to FEDORA and vice versa, which theoretically removes any future compatibility issues and supports sharing and other workflows (see http://www.apsr.edu.au/currentprojects/index.htm).

7.4 Basic Metadata

Chapter 3, Metadata, outlines the requirements of documentation and management of a collection. As has been stated, metadata is pivotal to all aspects of the life cycle of a digital audio object, and paying strict attention to describing all aspects of the collection is one of the more important steps in its preservation. A detailed metadata record of all technical, process, provenance and descriptive aspects is a vital part of the preservation process. However, it is recognised that there is often a technical imperative to preserve audio collection material, and that this may well arise before a metadata management system or policy has been developed. The following very basic recommendations are intended as a first step: a collection of data which is necessary to manage the file, or which must be captured now because it would otherwise be lost.

Unique Identifier: Should be structured, meaningful and human readable as well as unique. A meaningful identifier can also be used to relate objects such as master or preservation files and distribution copies, metadata records, series, etc., where a more sophisticated system would manage those relationships in the metadata.

Description: Description of the sound sequence. A small amount of text to simply identify the content of the audio file.

Technical Data: Format, sampling rate, bit rate, file size. Though this information can be acquired later, making it an explicit part of the record allows management and preservation planning of the collection.

Coding History: In BWF, a number of discrete lines of information describing the original item and the process and technology used to create the digital file that is being archived (see also 3.1.4 Metadata).

Process Errors: Any error data which the transfer system can collect and which describes failings in the transfer process (e.g. uncorrectable errors in CD or DAT transfers).

The information described in Unique Identifier, Description, and Technical Data can be recorded in Dublin Core records or the BWF headers. Coding History and Process Errors can be recorded in the BEXT chunk of the BWF headers or in related XML encoded documents. The date and, if necessary, the time of transfer should be recorded in the BWF header, and the date and, if necessary, the time of ingest into the repository should be recorded in the repository’s metadata management. In some circumstances the timestamp information that relates the components of a multi-part recording will be mandatory. It is generally advisable to include time and date information with every event or digital object.
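
The basic record described above can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the field names are invented for the example, the record is held in memory rather than written to Dublin Core or the BWF header, and the technical data is read with Python's standard `wave` module, which handles linear PCM WAVE files and skips chunks it does not recognise.

```python
import os
import wave
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class BasicRecord:
    """The minimum fields recommended above; field names are illustrative."""
    identifier: str
    description: str
    file_format: str
    sampling_rate_hz: int
    bit_depth: int
    channels: int
    bit_rate_bps: int
    file_size_bytes: int
    transfer_date: str
    coding_history: list
    process_errors: list


def describe_wav(path, identifier, description, coding_history,
                 process_errors=(), transfer_date=None):
    """Build a BasicRecord, reading the technical data from the file itself."""
    with wave.open(path, "rb") as audio:
        rate = audio.getframerate()
        depth = audio.getsampwidth() * 8  # sample width is reported in bytes
        channels = audio.getnchannels()
    return BasicRecord(
        identifier=identifier,
        description=description,
        file_format="WAVE (linear PCM)",
        sampling_rate_hz=rate,
        bit_depth=depth,
        channels=channels,
        bit_rate_bps=rate * depth * channels,
        file_size_bytes=os.path.getsize(path),
        # Defaults to today's date if the operator does not supply one.
        transfer_date=transfer_date or date.today().isoformat(),
        coding_history=list(coding_history),
        process_errors=list(process_errors),
    )
```

Calling `asdict()` on the returned record gives a plain dictionary that could be serialised to whatever form the collection has chosen. Capturing the technical data from the file at transfer time, rather than typing it in, avoids one source of transcription error.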

7.5 Preservation Planning

Preservation planning, as has been discussed, is the planning and preparation which goes to ensure that the digital audio object remains accessible over the long term, even if the computing storage and access environment becomes obsolete. Preservation planning for a small scale collection which is interested only in the preservation of its own digital audio objects is a relatively straightforward task. The metadata captured above informs the decisions about preservation by making clear the relationship between the original and the preservation copy in the digital repository. The technical information helps with planning. The choice of BWF as the preservation format is made to ensure the longest time possible before any format migration is necessary. It remains only for the collection managers and curators to maintain knowledge of the changes occurring in the digital archiving domain through contact with such associations as IASA.

7.6 Archival Storage

The archival storage system sits underneath the repository, technically speaking, and incorporates a suite of sub-processes such as storage media selection, transfer of the Archival Information Package (AIP) to the storage system, data security and validity, backup and data restoration, and reproduction of the AIP to new media. The basic principles of archival storage can be summarised as follows:

There should be multiple copies. The system should support a number of duplicate copies of the same item.

Copies should be remote from the main or original system and from each other. The greater the physical distance between copies, the safer they are in the event of disaster.

There should be copies on different types of media. If all the copies are on a single type of carrier, such as hard disc, the risk of a single failure mechanism destroying all the copies is great. The risk is spread by having different types of carriers. IT professionals commonly use data tape as the second (and subsequent) copy.

The major cost in data storage systems is not the hardware, but the Hierarchical Storage Management (HSM) system. The OAIS functions of Archival Storage embed the notion of HSM in the conceptual model. At the time OAIS was written, a situation in which large amounts of data could be affordably managed in other ways was not envisaged. The practical issue that underpins the need for HSM is the differing cost of storage media, e.g. where disc storage is expensive but tape storage is much cheaper. In this situation HSM provides a virtual single store of information, while in reality the copies can be spread across a number of different carrier types according to use and access speeds. However, the cost of disc has fallen at a greater rate than the cost of tape, to the point where there is an equivalency in price. Consequently the use of HSM becomes an implementation choice. Under these circumstances a storage system which contains all of the data on a hard disc array, all of which is also stored on a number of tapes, is a very affordable proposition, especially for a small to medium sized digital audio collection. For this type of system a fully functional HSM is unnecessary, and what is required instead is a much simpler system which manages and maintains copy location information, media age and versions (Bradley, Lei and Blackall 2007).
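
The "much simpler system" mentioned above can be pictured as little more than a register of copies. The following Python sketch is one hypothetical way to hold copy location, media age and version, and to test an item's copies against the three principles listed in this section. The class, its fields and the wording of the messages are assumptions made for illustration, not a description of any existing product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Copy:
    """One stored copy of an archival file; all field names are illustrative."""
    item_id: str
    media_type: str      # e.g. "hdd-raid" or "lto"
    media_label: str     # human readable label on the carrier
    location: str        # where the carrier is kept
    written_on: date     # used to work out media age
    version: int


def media_age_years(copy: Copy, today: date) -> float:
    """Approximate age of the carrier holding this copy, in years."""
    return (today - copy.written_on).days / 365.25


def check_item(copies: list) -> list:
    """Return the archival storage principles that an item's copies fail to meet."""
    problems = []
    if len(copies) < 2:
        problems.append("fewer than two copies")
    if len({c.location for c in copies}) < 2:
        problems.append("all copies in one location")
    if len({c.media_type for c in copies}) < 2:
        problems.append("all copies on one type of media")
    if len({c.version for c in copies}) > 1:
        problems.append("copies are not all the same version")
    return problems
```

An item held on the disc array and on two tapes in different buildings returns an empty list; an item held only on the disc array is reported against all three principles.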

7.7 Practical Hardware Arrangements

The following information describes how a practical system might be implemented. As has already been discussed above, the assumption is that all of the audio archival data will be stored on hard drive and all of the audio archival data will also be mirrored on data tape such as LTO.

7.7.2 Hard Disk Drives

A common and affordable approach to data storage on disk is to connect to a cluster of hard disk drives (HDD) arranged in a RAID array (see section 6.3.14 Hard Disk Drives). RAID level 1 is little more than two mirrored drives, keeping two copies of the data on different physical hardware; if one disk fails, the data is still available on the other drive. Higher level RAID arrays (2 to 5) implement increasingly complex systems of data redundancy and parity checking that ensure data integrity is maintained. The higher level RAID arrays achieve the same level of security as level 1, or mirroring, but with significantly less loss of storage space. RAID 5, for example, may have a 25% storage loss (or less depending on implementation), compared to 50% for RAID 1. Sophisticated arrays are widely available.
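
The storage loss figures quoted above follow from simple arithmetic, sketched here in Python for equal-sized drives. The sketch assumes a two-drive mirror for RAID 1 and, for RAID 5, that one drive's worth of capacity is given over to parity; the function names are illustrative.

```python
def raid_storage_loss(level: int, drives: int) -> float:
    """Fraction of raw capacity given up to redundancy, for equal-sized drives."""
    if level == 1:
        if drives != 2:
            raise ValueError("RAID 1 is modelled here as a two-drive mirror")
        return 0.5
    if level == 5:
        if drives < 3:
            raise ValueError("RAID 5 needs at least three drives")
        return 1 / drives  # one drive's worth of capacity holds parity
    raise ValueError("only RAID 1 and RAID 5 are modelled in this sketch")


def usable_capacity_tb(level: int, drives: int, drive_size_tb: float) -> float:
    """Usable capacity of the array in terabytes."""
    return drives * drive_size_tb * (1 - raid_storage_loss(level, drives))
```

A four-drive RAID 5 array gives the 25% loss mentioned in the text, and adding drives reduces the proportion further, which is why the loss may be "less depending on implementation".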

7.7.3 Tape Backup

No single component of a digital system can be considered reliable; instead, the reliability of the system is achieved through multiple redundant copies at every stage. The final and most important component in the storage chain is the data tape. In the recent past LTO has gained popularity for this purpose (see section 6.3.12 Selection and Monitoring of Data Tape Media); however, other data tape formats may be appropriate depending on the particular circumstance.

All data on disk storage should be duplicated on a suitable storage tape. A minimum of two sets of data tapes must be produced, to be stored in physically different places. As it is not unusual for the second set of tapes to be required in the restoration of the data, many established archives make three sets of copies: two to be kept near the system for ease of access and a third set stored remotely to protect against physical disasters. It has become customary to make the separate sets of data tapes using different products, and to buy a considerable quantity of each product from the same batch at one time. This makes quality control and rescue measures easier should a batch of a given product fail. Appropriate volume management software will aid in the back up and retrieval process, especially if the system incorporates a number of storage devices.

Error checking is difficult to implement in open source and low tech solutions because that capability is linked to specific hardware. Nonetheless, a possible low-tech alternative to proper error testing is described in the following paragraph.

The data management software has a catalogue (with a printer attached). The hard disc (in RAID) contains a complete set of data. All data is copied onto identical tape copies, of which there are at least two. As data is copied onto a tape, a unique, human readable identifier is printed onto a label which is attached to the tape. The same identifier can be recorded in the header of the tape.
The data management system can be scripted to prompt the user to find and insert the tape identified by the system. Rather than checking the tape for errors, the system will verify the content of the tape against the hard disc. The hard disc can check the veracity of its own data content and is aware of any failings itself. If the verification of the tape fails, the system can produce a new tape from the hard disc. Assuming 20 terabytes of storage and a system that verifies two tapes a day, every tape and its duplicate can be verified three times per year. In the event of a disc failure requiring the data tapes to replace it, there will be two tapes which have been checked within the previous four months. The risk that both tapes and the hard disc would fail is very low.
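
The verification routine described above might be scripted along the following lines. This Python sketch rests on several assumptions that are not part of the guidelines: that the tape's content is reachable as ordinary files (for example restored to a scratch directory, or mounted as a file system), that the catalogue supplies the list of files expected on each tape, and that comparing SHA-256 digests is an acceptable way to verify the tape against the disc. The second function reproduces the scheduling arithmetic; the tape capacity passed to it is an assumption chosen for the example, since the text does not state one.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_tape(catalogue: list, disk_root: Path, tape_root: Path) -> list:
    """Return the catalogued files whose tape copy is missing or differs from disc.

    catalogue lists paths relative to both roots. The disc copy is treated as
    the reference, so a file missing from disc raises FileNotFoundError.
    """
    failures = []
    for relative in catalogue:
        disk_file = disk_root / relative
        tape_file = tape_root / relative
        if not tape_file.is_file():
            failures.append(relative)
        elif file_digest(tape_file) != file_digest(disk_file):
            failures.append(relative)
    return failures


def days_per_cycle(total_gb: int, tape_capacity_gb: int,
                   tape_sets: int = 2, tapes_per_day: int = 2) -> int:
    """Days needed to verify every tape in every set once."""
    tapes_per_set = -(-total_gb // tape_capacity_gb)  # ceiling division
    total_tapes = tapes_per_set * tape_sets
    return -(-total_tapes // tapes_per_day)
```

If any file is returned by `verify_tape`, the tape is a candidate for replacement from the disc copy. For 20,000 GB of storage, two tape sets and two verifications a day, an assumed tape capacity of 200 GB gives a cycle of 100 days, which is consistent with verifying each tape about three times a year; a different tape capacity would change that figure.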

7.7.4 Single (or Double) Operator Storage System

The simplest archival storage system would be to attach a separate RAID array containing only the audio data to the primary DAW (digital audio workstation). This configuration is only possible for institutions with one operator in the digitising process. A requirement for the success of this approach is a well structured plan for digitisation and a dedicated disk array, so that the work can be carried out continuously without major interruptions. This will ensure that the data on the HDD attached to the DAW is copied to tape whenever enough data to fill the target medium has accumulated.

If two operators and workstations are undertaking the digitisation tasks, it will be necessary to provide access to a shared drive or drives. The sharing of such resources can be achieved by defining one of the computers as the server, configuring it so that it manages the drives, and implementing a single wire sharing capability. Such an approach is relatively easy to implement and allows sharing between two operators, though it requires some procedural agreements to avoid conflicts. Logical organisation of data and strict naming procedures are a necessity in small scale manual storage systems.

For a system of the size described here, it might be more effective to establish a partnership with a larger, established archival institution, or to contract a storage service provider. Nonetheless, the approach above is possible.

7.7.5 Multiple Operator Storage System

For any number of connections greater than two, a networked system of data storage and backup should be implemented. Such a networked system allows access to multiple users in accordance with the rules set down by the data management system. Small scale networks are relatively common and, with the right level of knowledge, easy and affordable to implement. Reasonable quantities of storage can be achieved with an enterprise level attached storage device. Storage technologies and products can be split into three main types: direct-attached storage (DAS), network-attached storage (NAS) and the storage area network (SAN). NAS has better performance and scalability than DAS, and it is cheaper and simpler to configure than SAN. NAS technology is, from a cost benefit view, the most appropriate scalable technology for a system of the size under discussion.

Most low cost NAS devices exhibit reduced bandwidth when compared to the more expensive devices, resulting in slower access times or a lower number of allowable simultaneous accesses. This should present no major problem to smaller collections, as the requirement for simultaneous access remains low, especially if MP3 derivatives of the preservation master copies are used for access.

A typical small scale networked storage system may comprise a server class desktop computer connected to a NAS device. The NAS would have the capability of mounting multiple hard disks in a RAID array. An average low cost NAS would hold between 0.5 and 20 terabytes of disk storage (noting that the penalty for RAID is less storage than that indicated by the raw disk size). The digital audio workstations (DAW) access the NAS via an Ethernet switch or similar device which, if configured properly, has the effect of separating the storage facility from the office LAN (local area network) and improves the security of the storage facility. The HDDs would be backed up onto data tape.
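
To relate the NAS capacities quoted above to a collection, it can help to estimate how many hours of linear PCM audio a given usable capacity will hold. The Python sketch below does this under assumptions that are not stated in this chapter: 48 kHz, 24 bit, two channel audio as the default, decimal terabytes, and no allowance for file headers, metadata or file system overhead.

```python
def pcm_bytes_per_hour(sampling_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bytes of uncompressed linear PCM audio produced in one hour."""
    return sampling_rate_hz * (bit_depth // 8) * channels * 3600


def hours_that_fit(usable_tb: float, sampling_rate_hz: int = 48_000,
                   bit_depth: int = 24, channels: int = 2) -> float:
    """Hours of audio that fit in a usable capacity given in decimal terabytes.

    The default audio parameters are an assumption for this example.
    """
    usable_bytes = usable_tb * 10**12
    return usable_bytes / pcm_bytes_per_hour(sampling_rate_hz, bit_depth, channels)
```

With these defaults one hour of audio occupies a little over one gigabyte, so the usable capacity (after the RAID penalty) in gigabytes is a rough first estimate of the number of hours that can be stored.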

7.8 Risks

Automated storage systems can be configured to constantly copy and refresh data, discarding data tapes which have become unreliable. Large-scale Digital Mass Storage Systems are professionally designed and run by well resourced organisations which can afford and guarantee all necessary measures for data security. By contrast, the danger of data loss associated with self-designed and self-managed manual and semi-automated systems, with their manual data back up and recovery, cannot be overestimated. The responsibility for ensuring that the archived audio data remains valid and accessible falls upon the individual, and requires that they physically check the data tapes on a regular basis. This situation is aggravated by the fact that most research and cultural institutions are notoriously under-financed.

Though the design of such systems seems to incorporate a very high level of redundancy, one has to bear in mind that digital components and carriers may fail at any moment without any warning. It is therefore imperative to have, at every stage of the digitisation process and of subsequent storage, at the very minimum two copies of the linear archive file. Any flaw will inevitably lead to the loss of a smaller or greater amount of data; however, if suitable strategies have been put in place, this will not be fatal because the redundant copies are available. In view of the time consuming process of transfer, not to mention the inevitable losses of older materials, every effort must be made to avoid having to re-digitise materials as a result of an inconsistent security architecture or careless working practice.

7.8.2 Complexity of the System

Once implemented and installed, data storage systems are relatively easy to operate and maintain. However, at the initial stages of implementation, and whenever a problem or upgrade subsequently arises, specialised IT support is strongly recommended to reduce the risk of poor set up.

7.8.3 Partnerships and Backup

As has already been discussed, a partnership which provides data backup capability with an institution with established and trusted digital archival practices is a major means of managing risk. A network of repositories which can create and accept such organised packages of information will be a most effective preservation strategy, spreading the risk of failure due to natural or man-made disaster, or just lack of resources at a critical time in the life-cycle of the digital object.

7.8.4 Cost and Scalability

A small scale system such as that described above can be extended to create larger storage and management capabilities. Relatively small tape drives which can handle a number of data tapes are available, and larger scale robotic systems may make the system expandable. If HDD costs continue to fall, replacing and expanding the disk arrays will remain affordable. Partnerships between commercial suppliers and open source providers mean that the sophistication of the repository software can be integrated with the safety of a commercial service provider. DSpace and FEDORA, for example, have both released an open source system that operates with a commercial storage solution company. The cost of establishing a small scale data storage system may seem relatively high in comparison to purchasing an individual CD burner; however, on a bit for bit comparison for the storage of more than a few hundred hours of audio, the relative difference is greatly reduced when all the requirements of an archive are costed. In addition, a properly managed data storage facility is an altogether more reliable system and will allow the future transfer of audio data to the next storage solution when that inevitability occurs.