Report on an Investigation to Digitize a Selection from the Toronto Telegram Photograph Collection
Rob van der Bliek / November 2000
/ Initial Considerations / Development of expertise / Investigating options for digitization / Selection /
/ Digitization and file conversion / Database and web interface / Costs / Conclusion and further recommendations /
The Toronto Telegram Photo Collection held by Archives and Special Collections (ASC) consists of approximately 830,000 negatives of photos taken for the newspaper during the period 1876-1971, with the bulk of the photos being from the mid-1950s through the 1970s. It is a collection of complete runs of photographs (with multiple shots), the majority in the form of cellulose acetate negatives. The collection represents a large and rich aggregate of images relating to world events as well as local historical documentation, useful for many areas of research. It has been manually indexed by means of Canadian Council of Archives grants, with resulting subject and name indexes. Since York owns the copyright to the collection, it provides us with an opportunity to digitize selected photographs and make them available on the web as a means to expose the collection and to promote the archives and libraries to the community at large.
The goal of this digitization project was to examine and evaluate techniques, procedures, and costs for digitizing collections, in this case through a small, representative sample from the collection. One intended outcome is a means of securing ongoing funding for further digitization of the collection. The project also provides an opportunity for the libraries to develop expertise in digitization and the mounting of locally produced databases on the web. Not only have we been able to gather information about the process, but we will also, once the pilot database has been exposed to the public for a period of time, be able to gauge outside interest through usage statistics and anecdotal feedback.
1.1.1 Vinegar syndrome
The Telegram collection suffers from "vinegar syndrome," a condition whereby the acetate plastic base of a negative reacts with oxygen, producing a vinegar-like odor. The base deteriorates, leaving the negative curled and flaking at the surface, and more or less unusable for further reproduction. A conservation assessment of the collection was done in 1995, with the conclusion that there is no realistic way to stabilize the negatives. Reduction of the rate of chemical decay would be possible through eliminating oxygen and lowering the ambient temperature to minus 18 degrees Celsius, but this would be impractical for a collection of this size.
1.1.2 Archival media
Given the fact that stabilizing the negatives is unrealistic, the alternative is to transfer the negatives to another medium, preferably an archival medium. At present, there are no sanctioned standards for transferring information--photographic or other--to a digital format and subsequent storage on CD-ROM, tape, or hard disk space. Not only are the media unstable, but as far as the archival community is concerned, for photographs there is no agreed-upon resolution and bit-depth sufficient to capture and preserve all of the information necessary to ensure that future reproductions will be faithful. The ANSI standard for preserving photographic information is an analogue copy preserved under strictly specified environmental conditions on approved 35 mm film stock. Quotes for copying the negatives and prints onto ANSI compliant 35 mm film were obtained from Xerox and Kodak, but costs were prohibitively high, in the range of $1 million. But it is likely that a digital storage standard will be developed and approved sometime in the future. In fact, practical but unapproved digital standards have already been adopted by such institutions as the Library of Congress and the National Library of Canada to provide service images for display and copying, the main constraint being file size as related to print-quality resolution.
1.2 Formats and physical condition
The majority of items in the Telegram collection are 35 mm negatives, but there are also 4 x 5 negatives, glossy prints, and glass plate negatives. Vinegar syndrome has primarily affected the 35 mm negatives because of their chemical makeup. For purposes of batch digitization it is more cost-effective to stick as closely as possible to one format, since machinery is format-specific: flat-bed scanners are used for glossy prints while photo scanners are used for negatives.
Digitization is an area that has grown in tandem with the web, the result being that most technical and administrative information about digitization projects is readily available on the web. The two primary clearinghouse sources are RLG DigiNews and Ariadne, online journals where reports on projects are published. The Library of Congress has published all of its RFPs and technical background papers on the web, several of which deal with photograph digitization projects. The National Library of Canada is sponsoring the Canadian Initiative on Digital Libraries site, an alliance of Canadian libraries formed to promote and coordinate digital library activities.
2.2 Etext institute
To get some sense of the practical problems of digitization I attended the Third Summer Institute, "Creating Electronic Texts and Images: a practical 'hands-on' exploration of the research, preservation and pedagogical uses of electronic texts and images in the humanities," held at the University of New Brunswick from August 15-20, 1999. The course was taught by David Seaman, founder and director of the Electronic Text Center at the University of Virginia. Seaman writes and lectures frequently on SGML throughout the world.
3.1 Technical specifications
3.1.1 File formats
Archivists have informally accepted the use of the TIFF file format as a means of achieving the best possible reproduction and having the least possible chance of losing information when converting to other file formats. There are different specifications of the TIFF format, some with minimal compression, but it is a flexible format that allows for a close, non-interpolated mapping of the data. TIFF files can be read by a variety of proprietary and non-proprietary software, in the same way that ASCII files can be read by a variety of text processors. The main disadvantage of TIFF files is their size: a 35 mm black and white negative, scanned at a resolution and bit depth sufficient to produce high quality 3 x 5 prints, is about 9 MB. Other formats, such as JPEG and GIF, can discard information (JPEG through "lossy" compression, GIF by reducing an image to a palette of at most 256 colors). This may not impact their use in the short run (with current display and print practices), but if such a file is used as a master from which web images are converted, changing practices may require reconstitution of the lost data. Kodak's PhotoCD format has also been used successfully to produce high-quality images, but because of its proprietary nature the archival community has used it sparingly.
3.1.2 Resolution and bit-depth
Specifications for scanning images vary greatly, the two main variables being resolution, or the number of actual pixels used to map the image onto the screen, and bit depth, which determines the number of values possible for a given pixel. For grayscale imaging the bit depth is usually set at 8, allowing for one out of 256 different shades of gray to be stored for each pixel. This approximates the actual information available on the negative, although a one-to-one mapping is not possible since the analogue information on the negative is a continuous scale with infinite gradations. Resolution needs to approach the information on the negative as well; if you were to magnify a negative to the point at which you begin to see a "grain," you can more or less correlate this with a dots-per-inch (dpi) measure. The choice of resolution comes down to the question of how the digital masters will be used. If they are to be used to print high-quality glossies comparable to those produced from the original negatives, then a resolution of 3000 x 4000 pixels for a regular 3 x 5 will produce satisfactory results at about 1200 dpi. But the resulting file size will be close to 20 MB. Resolutions in this order are completely redundant for web display, since the average computer monitor displays at an equivalent of 72 dpi.
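The two variables above reduce to simple arithmetic. The following is a minimal sketch, not part of the project's tooling; the function names are illustrative, and the 72 dpi default is the monitor figure quoted above.

```python
def gray_levels(bit_depth: int) -> int:
    """Number of distinct values one pixel can hold at a given bit depth."""
    return 2 ** bit_depth


def displayed_width_inches(pixel_width: int, screen_dpi: int = 72) -> float:
    """Width an image would occupy if shown pixel-for-pixel on a monitor."""
    return pixel_width / screen_dpi


# An 8-bit grayscale scan stores one of 256 shades per pixel.
print(gray_levels(8))

# A 4000-pixel-wide master shown at 72 dpi would span well over four feet,
# which is why print-quality resolutions are redundant for web display.
print(round(displayed_width_inches(4000), 1))
```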
3.2 Scanning
3.2.1 In-house vs outsourcing
Since the majority of photographs in the Telegram collection are in negative format, household flatbed scanners were not suitable. Negative scanners capable of doing bulk operations are priced in the range of $2000-$3000, making a purchase unfeasible for a pilot project of this size. Scanning negatives also presents significant problems in terms of quality control, since there are so many variables, such as lighting and tonal adjustments, that determine the final outcome. Finally, training and supervision of students to perform the scanning, particularly in view of these complications, would add significant expense to the project.
Quotes for scanning negatives range from $2 - $20 per negative, variables being quality, size, and added metadata. We had two vendors digitize samples from the collection, with the request that the scanned images be of a quality that conceivably could be used as masters for hard-copy glossy prints. The first batch we examined was scanned with the specification that it would be possible to print the negatives at 800 dpi. The vendor proposed a rate of $20 per image (this included multiple copies of the images at different resolutions). A typical image from the collection had the following attributes:
|Vendor 1 sample scan|
|Pixel dimension||4000 x 3000 pixels|
|Print size||4 x 5 inches|
|Printing resolution||800 dpi|
|File size||12 MB|
Most office laser printers are capable of printing documents at 600 dpi while a printer used for printing photographs needs 1200 dpi, so 800 dpi for a 4 x 5 print will not produce an acceptable image. Increasing resolution to the point of producing images suitable for printing, and possibly suitable for storing archival digital masters, results in a corresponding increase in file size and scanning cost. A black and white image scanned to produce a 1200 dpi print will increase to over 20 MB and cost twice as much to scan.
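The relationship between pixel dimensions, print resolution, and file size can be checked with a short calculation. This is a rough sketch under stated assumptions: 8-bit grayscale, no compression, TIFF header overhead ignored, and 1 MB taken as 1,000,000 bytes. The function names are illustrative only.

```python
def uncompressed_megabytes(width_px: int, height_px: int, bit_depth: int = 8) -> float:
    """Approximate raw image size in MB (1 MB = 1,000,000 bytes), headers ignored."""
    return width_px * height_px * bit_depth / 8 / 1_000_000


def print_resolution_dpi(side_px: int, side_inches: float) -> float:
    """Dots per inch when a given number of pixels is spread over a printed side."""
    return side_px / side_inches


def pixels_for_print(width_in: int, height_in: int, dpi: int) -> tuple:
    """Pixel dimensions needed to print at a target resolution."""
    return (width_in * dpi, height_in * dpi)


# Vendor 1 sample: 4000 x 3000 pixels, printed with the long side at 5 inches.
print(uncompressed_megabytes(4000, 3000))   # 12.0
print(print_resolution_dpi(4000, 5))        # 800.0

# The same 4 x 5 print at 1200 dpi needs far more pixels and storage.
needed = pixels_for_print(4, 5, 1200)
print(needed, uncompressed_megabytes(*needed))
```

Under these assumptions the Vendor 1 figures come out at 12 MB and 800 dpi, and a 1200 dpi version of the same print comes out well above 20 MB. Real files will differ somewhat depending on how the vendor writes them.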
The second vendor's samples were scanned at a slightly lower resolution, and at a much lower cost, about $6 per image. These images were delivered as TIFF but were originally produced as Kodak 5-PAC images. The Kodak format, although much more versatile and efficient than TIFF, requires special software for reading and manipulation, and since we were looking for a long-term solution it was important to store the images in a more common and non-proprietary format. Scanning costs were lower because of a combination of lower vendor markup and a quicker scanning process, a direct result of the lower resolution. Image quality, in terms of harder-to-measure attributes such as tone reproduction, optical density, and flare, is lower, but since we were not actually testing the reproductive qualities of the scans, it made no difference.
|Vendor 2 sample scan (derived from Kodak 5-PAC images)|
|Pixel dimension||3000 x 2000 pixels|
|Print size||4 x 5 inches|
|Printing resolution||600 dpi|
|File size||8 MB|
4.1 Selection tools
The Telegram collection functioned as a working collection for reporters, covering the news of the day, special features, and selected historical images that, in one way or another, ended up as part of the collection. Organization and access is through assigned headings used by the Telegram staff, the result being a practical but sometimes inconsistent list of subject headings handwritten on envelopes and folders. Annotations or "captions" often include information about why the photo was taken and whether or not it was used in the paper. In addition, ASC has thoroughly indexed the collection through name and subject access.
For the pilot project, we decided to select about 1000 photographs, since this seemed to be a number that could be handled by a student during the course of a summer. We decided to select based on four themes:
- Ethnic Groups in Toronto
- Toronto Streets and Architecture
- Ships and Shipping
- Labour
These subjects were all easily accessed through both the headings assigned originally by the newspaper and those in the finding aids devised by ASC. The idea was to select photographs that would be of local interest or at least have some social or geographic component. As it turned out, these large general subjects turned up some remarkable subsets. In the Toronto Streets and Architecture series, for example, it became clear that there was a need to focus on pictures that were recognizably Toronto, with some bias towards pictures that document Toronto's growth and development. As a result, the student ended up selecting a fair number of aerials. The headings also covered unanticipated aspects or dimensions of the subjects, exposing the weaknesses of the classification system. For example, under Streets and Architecture, the scenes of several murder cases form the foreground while the street itself plays a minor role. Luckily, these "access points" are augmented by item captions and notes, giving more meaningful descriptions. There was also an emphasis on selecting pictures that were framed properly.
4.3 Selection process
Part of the process of selection involved weeding the collection for duplicates, mostly in the form of alternate shots. The table below shows the total number of negatives examined (broken down by subject), the number of negatives culled from the collection for future preservation, and the number selected for potential digitization. This work was done during the summer of 1999 by a graduate student.
|Ethnic Groups||Labour||Ships and Shipping||Streets and Architecture||Total|
Essentially, this was the most difficult phase of the project since it required making decisions for which no hard and fast rules were available. Consultation with archives staff was crucial and continuous, and as the student progressed, selection criteria evolved.
The negatives were left in their envelopes, and in cases where they were part of a strip of three negatives, some indication was given about which negative should be digitized. For the digitization, Vallillee Digital Imaging Solutions was selected because they were the most competitive vendor and because they were very interested in making inroads into the academic community. After the initial run of samples, we requested that they increase the pixel count to give us more resolution, allowing for potential use of the masters for printing, since the first batch had noticeably rough edges when expanded on screen to more than 100%. Additional adjustments were made for brightness, contrast, and opacity, attributes which can be manipulated very easily after the image has been scanned. We agreed that the images would not be cropped and would be scanned in as neutral a manner as possible.
We discussed a file-naming convention with them which would allow us to easily correlate the files with their physical counterparts. For example, the file name 1974-002-252-285-001.TIF contains information from the accession number assigned by ASC, the box and envelope numbers, and a negative series number.
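A convention like this lends itself to mechanical parsing. The sketch below is illustrative only: it assumes the first two hyphen-separated groups together form the accession number and that the remaining three are box, envelope, and negative number in that order, which the example file name suggests but does not confirm. The names `NegativeId` and `parse_master_filename` are hypothetical.

```python
from typing import NamedTuple


class NegativeId(NamedTuple):
    accession: str  # assumed to be the first two groups, e.g. "1974-002"
    box: str
    envelope: str
    negative: str


def parse_master_filename(filename: str) -> NegativeId:
    """Split a master file name into its physical-location components."""
    stem, _, extension = filename.rpartition(".")
    if extension.upper() != "TIF":
        raise ValueError(f"not a TIFF master file name: {filename!r}")
    parts = stem.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected five hyphen-separated groups: {filename!r}")
    year, series, box, envelope, negative = parts
    return NegativeId(f"{year}-{series}", box, envelope, negative)


print(parse_master_filename("1974-002-252-285-001.TIF"))
```

Keeping the components as strings preserves the leading zeros, so a parsed name can be reassembled exactly.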
5.2 Image conversion and marking
We received the files from Vallillee on CD-ROM discs in both TIFF and Kodak 5-PAC formats (separate sets of discs), with disc labeling corresponding to the batch in which the files were found. This may seem arbitrary but the final destination of the discs and the master files on them will need to be determined before assigning ASC designated accession numbers.
Since the TIFF images were intended as master copies, to be stored off-line on CD-ROMs until needed, the first step was to create thumbnails and full screen display or service images for the web. Furthermore, the display images would need some sort of copyright notice imprinted on them to prevent unauthorized use (in so far as this is possible). Fortunately, there are many inexpensive image conversion utilities available that can process images in batches. Jasc Software, Inc.'s "Image Robot" was used to create the thumbnails and display images. Each CD-ROM, containing 60 images, was given a two run pass to produce batches of thumbnails and display images. The thumbnail pass used a script that made a reduced copy of the image at 130 pixels wide (with ratios maintained), cropped the image, and added a black border for definition. The display pass reduced each image to 800 pixels wide, cropped the image, added a "York red" border and a copyright notice indicating that the image is the property of York University Libraries.
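The resizing step of each pass amounts to scaling to a fixed width while keeping the aspect ratio. The following sketch shows only that calculation; it is not the Image Robot script, it omits the cropping, border, and copyright notice, and the function name is illustrative.

```python
THUMBNAIL_WIDTH = 130  # pixels, as used in the thumbnail pass
DISPLAY_WIDTH = 800    # pixels, as used in the display pass


def scaled_size(width_px: int, height_px: int, target_width_px: int) -> tuple:
    """Return (width, height) scaled to the target width, aspect ratio maintained.

    Integer arithmetic with round-half-up avoids floating-point surprises.
    """
    if width_px <= 0 or height_px <= 0 or target_width_px <= 0:
        raise ValueError("dimensions must be positive")
    height = (height_px * target_width_px + width_px // 2) // width_px
    return (target_width_px, max(1, height))


# A 4000 x 3000 master reduced for the two web derivatives.
print(scaled_size(4000, 3000, THUMBNAIL_WIDTH))
print(scaled_size(4000, 3000, DISPLAY_WIDTH))
```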
6.1 Software considerations
Mounting online databases with a web interface can be done using a variety of technologies, most of which have some connection to standards and common programming practice. Microsoft ASP and SQL, PHP, and Oracle products are the main contenders for flexible relational database connectivity, but all require programming expertise and development time. Since no programming resources were available for this project, the choice was narrowed to out-of-the-box solutions where the primary advantage is ease of design and implementation, particularly with respect to database connections. InMagic DB/TextWorks, an inverted-file text database program with image linking capabilities, was selected on the basis that designing and implementing screens and setting up a database connection through its add-on program DB/Text Web Publisher could be done without programming support from library computing services. Support was limited to provision of hardware and installation and troubleshooting of InMagic.
6.2 Database design and population
The data structure was designed to accommodate future searching enhancements, such as use of a controlled vocabulary for image description, and with an eye on maintaining flexibility so that the file eventually may be exported to another DBMS. The information from folders, notes, subject lists, etc. was kept separate so as to preserve the integrity of the collection. In other words, folder titles, captions and the like, however inconsistent or dated they may seem, were carefully transcribed and added as separate fields in the database. The resulting structure is as follows:
|1||Call number||the file name, derived from accession, box and negative number|
|2||Access point||heading assigned by ASC|
|3||Folder title||a broad designation, originally assigned by the Telegram librarian|
|4||Item caption||text that may have been used as part of the caption if used|
|5||Date 1||first date in a date range or if there is only one date|
|6||Date 2||second date in a date range, if applicable|
|7||Photographer's note||information supplied by the photographer about the assignment, the circumstances, etc.|
|8||Original format||physical description|
|12||Keyword 1||free text describing the image|
|13||Keyword 2||controlled vocabulary text describing the image (for future implementation)|
|15||Digitization resolution||a rough pixel dimension count|
|17||Record creation date||date|
|18||Editorial note||administrative notes about the photograph|
Metadata can be broken down as follows:
- fields 1 and 2 are the key identifying elements assigned by ASC
- fields 3-8 contain data that came with the collection
- fields 9-11 contain location information for the online and master files
- fields 12-13 contain enhanced searching information
- fields 14-18 contain administrative information
If and when the database is exported to a relational DBMS, it will be necessary to construct new tables and establish relations between them, but for the time being this structure is more than sufficient.
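As an indication of what such an export might involve, the sketch below sets up one possible relational layout using SQLite. It is an assumption-laden illustration, not a design adopted by the project: the table and column names are invented, only the fields listed above are included, and the sample record and keywords are hypothetical. Moving keywords into their own table is one way to let free-text and controlled terms share a structure.

```python
import sqlite3

SCHEMA = """
CREATE TABLE photograph (
    call_number             TEXT PRIMARY KEY,
    access_point            TEXT,
    folder_title            TEXT,
    item_caption            TEXT,
    date_1                  TEXT,
    date_2                  TEXT,
    photographers_note      TEXT,
    original_format         TEXT,
    digitization_resolution TEXT,
    record_creation_date    TEXT,
    editorial_note          TEXT
);
CREATE TABLE keyword (
    call_number TEXT NOT NULL REFERENCES photograph(call_number),
    term        TEXT NOT NULL,
    controlled  INTEGER NOT NULL DEFAULT 0  -- 1 for controlled vocabulary terms
);
"""


def build_database() -> sqlite3.Connection:
    """Create an in-memory database with the sketched schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


conn = build_database()
# Hypothetical sample record and keywords, for illustration only.
conn.execute(
    "INSERT INTO photograph (call_number, access_point) VALUES (?, ?)",
    ("1974-002-252-285-001", "Ships and Shipping"),
)
conn.executemany(
    "INSERT INTO keyword (call_number, term, controlled) VALUES (?, ?, ?)",
    [("1974-002-252-285-001", "harbour", 0),
     ("1974-002-252-285-001", "freighter", 0)],
)
rows = conn.execute(
    "SELECT p.call_number FROM photograph p "
    "JOIN keyword k ON k.call_number = p.call_number WHERE k.term = ?",
    ("harbour",),
).fetchall()
print(rows)
```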
6.3 Web interface
InMagic provides a set of scripts for constructing a web interface to their databases. Creating the web forms and output pages is cumbersome and restrictive, in part because the interface to the scripting tools is a modified version of the one used to create hard-copy reports. Placement of elements on the screen is limited to prescribed layouts, so that if you want to design your own layout, you have to write the HTML and embed it in the input boxes provided by the program. Viewing the results of your code then requires following a series of steps, which slows development further. Nevertheless, establishing the database connection and using the indexes is very fast.
7.1 Projecting Costs
Comparative costs for digitization have been difficult to assess until recently. The October 1999 edition of RLG DigiNews (vol. 3, no. 5) contains one of the first general overviews in The Costs of Digital Imaging Projects. The overview concludes that the cost range for digitizing photographs varies greatly, as illustrated (figures are US dollars):
Average costs to digitize and process photographs seem to be around $20 US per item, which, considering that this cost does not include a preservation component, is very high.
As this was intended as a pilot project which ultimately could be used to attract funding from outside of the libraries, funding was drawn from within the libraries as follows:
- $6000 from the University Librarian's Office
- $4000 (2 x $2000) in research grants from the Librarian's Research Awards Committee
- $1500 from the ASC trust fund
- $2100 training funds from the University Librarian's Office
Of these costs, the training funds are outside the scope of the project since they pertain to professional development expenses related to general development of expertise within the libraries. The other three items are direct costs associated with the project.
7.3 Breakdown of costs
Directly measurable costs for digitizing and mounting 1000 photographs with a web interface are as follows:
|Phase I: Photograph selection and preparation||$3525||$3.52|
|Phase II: Digitization||$6180||$6.18|
|Phase III: Database setup and population||$1800||$1.80|
Phase I: We hired a graduate student for the summer of 1999 at $15 per hour to work 235 hours, which is the number of hours graduate students are allowed to work per term. She actually selected about 1200 negatives, from which the 1000 were taken. The selection process involved looking at a variety of formats, including 35 mm negatives, large format negatives, and a few glass plate negatives. About 10% of her time was spent on administrative matters such as searching, planning, keeping statistics, and writing a report. The cost includes all of these factors.
Phase II: Vallillee picked up the boxes (at no cost) in two week intervals and held to their initial quote of $6180, including taxes. Unfortunately, our instructions for indicating which pictures needed to be digitized were inadequate and they digitized too many multiple shots, which were part of the negative strips. This was corrected midway through the process.
Phase III: This phase went faster than anticipated, with time to add free text descriptions to the photos as a search enhancement. Much of the data input was done through macros and shortcut keys. Adding controlled vocabulary terms would be expensive, perhaps adding several dollars per photograph to the overall cost. The student worked about 120 hours on this section, or roughly 8 records per hour.
Indirect costs, absorbed in other budgets, include:
- Supervision of the student in phases I and III: supervision is a piecemeal activity, perhaps best expressed as a percentage added on to the number of hours; so in this case, with 235 hours, about 10% or 23 hours added on would probably account for the supervision. Adding entries to the database was facilitated through a number of predetermined field entries.
- Image conversion and marking: this was done in small batches, resulting in a total of about 6 hours.
- Development of the data structure and web interface: these are one-time-only costs, since further digitization will build on the existing database. In this case, about 25 hours were spent on learning the InMagic program and designing the interface.
- Hardware and software: server space was available and the software had already been purchased and installed for another project.
A very rough estimate of these indirect costs would add about $3500 to the project, making the average price of digitizing an image about $15. However, as a figure for future planning, it does not take into account an economy of scale that will be achieved now that the groundwork has been laid.
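The per-image figure follows directly from the numbers reported above. The short sketch below simply reproduces that arithmetic; the variable and function names are illustrative, and the indirect amount is the rough estimate given in the text rather than a measured cost.

```python
# Direct costs by phase, in dollars, as reported in the breakdown above.
DIRECT_COSTS = {
    "selection and preparation": 3525,
    "digitization": 6180,
    "database setup and population": 1800,
}
INDIRECT_ESTIMATE = 3500  # rough estimate from the text
PHOTOGRAPHS = 1000


def cost_per_item(total_dollars: float, items: int) -> float:
    """Average cost per photograph."""
    return total_dollars / items


direct_total = sum(DIRECT_COSTS.values())
overall_total = direct_total + INDIRECT_ESTIMATE

print(direct_total)                                      # 11505
print(round(cost_per_item(overall_total, PHOTOGRAPHS)))  # 15

# Phase I check: 235 hours at $15 per hour.
print(235 * 15)                                          # 3525
```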
Things learned from the project:
- Selection of photos remains the largest hurdle, as there are many variables, including suitability, interest, quality, representation; it is also the most intellectually demanding part of the process, and cannot be unconditionally delegated
- Preservation needs of the collection are not addressed through digitization at manageable and cost-effective file sizes, let alone in terms of archival standards
Our biggest mistake was that we did not clearly communicate with the digitization vendor regarding which negatives needed to be digitized and as a result we ended up with a selection less interesting than it could have been.
This was a pilot project intended to test the feasibility of digitizing primary source materials in the libraries, and as such it has proven very useful. The basic procedures for digitizing the Toronto Telegram Photograph Collection have been tested and can be easily established as routines. In relation to other means of providing access to library and archival material, digitization is expensive, but it is not entirely appropriate to compare conventional, in-house access or hard-copy publication to web access. Web access should be viewed as something that is much more pervasive than conventional in-house access, perhaps with a large upfront investment but ultimately having a very beneficial effect through exposure of previously inaccessible materials, the development of a web presence for the libraries, and service to the community at large.