UCSF Guides: Archives as Data: Home

About this guide

This guide provides an overview of the archival collection datasets (archives as data) made available by UCSF Archives and Special Collections. It includes guidance for accessing and using the data, along with descriptions of its form and content. The guide also points to archival collection datasets from other institutions and organizations that may be of interest for health sciences research, and gives an overview of digital methods that can be used to analyze archives as data.

Archives as data

What is archives as data?

“Archives as Data” refers to archival collection materials in digital form that can be shared, accessed, analyzed, and referenced as data. Using digital tools, researchers can work with archives as data to explore and evaluate characteristics of collection materials and analyze trends.

What can you do with archives as data?

Computational methods can be applied to archives as data to, for example, calculate word frequencies in text, measure visual characteristics in images, identify place, event, personal, or corporate names, propose common topics within or across textual corpora, or assign sentiment to the language used in a subset of documents from a collection. Beyond these analyses, archives as data can support research that surfaces previously unarticulated relationships between people, institutions, places, ideas, and events, or that maps any of these across place and time.
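As an illustration of the first method mentioned above, word frequencies can be calculated with a few lines of Python. This is a minimal sketch using only the standard library. The function name `word_frequencies`, the sample sentence, and the small stopword set are made up for illustration and are not drawn from any UCSF collection or tool.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords=frozenset()):
    """Count how often each word appears in text, ignoring case and stopwords."""
    # Lowercase the text and keep alphabetic words, allowing one internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


# Made-up sample text standing in for a document from a collection.
sample = "The clinic opened in 1906. The clinic served the city."
freqs = word_frequencies(sample, stopwords={"the", "in"})

# Show the most frequent remaining words with their counts.
print(freqs.most_common(3))
```

Real projects typically use a fuller stopword list and a tokenizer suited to the language and period of the material, but the counting step works the same way.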

Considerations for working with archives as data

Digital content from archival collections often needs to be prepared or processed before it can be worked with as data. While archives as data allow for useful and innovative digital analysis, researchers should be aware that the source materials behind a dataset have likely already been subject to some processing or other preparation. Some examples include:

Aggregation:
- Archives as data may be made available in different quantities and configurations. Researchers may be invited to access individual records, records aggregated into a dataset that represents a single archival collection, or records that share topical or formal characteristics, aggregated from multiple collections and presented as a curated corpus.
- Independent of how archives as data may be initially accessed by researchers, there are many approaches that can be taken to build a corpus specific to research goals and scope.
Format-specific processing:
- Archival collections can include a range of different materials (textual, visual, audio/visual) that may be digitized from their original, physical form or that may be born-digital. For example, when textual material has been digitized, the resulting images need to undergo Optical Character Recognition (OCR) to extract text that can then be analyzed. Similarly, audio files may need to go through speech-to-text processing to produce transcripts for textual analysis.
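The aggregation step described above, building a corpus specific to a research question from records in more than one collection, could be sketched as follows. This is one simple approach under stated assumptions: the text of each record has already been extracted (for example by OCR or speech-to-text), and a keyword match is enough to decide relevance. The `Record` class, the `build_corpus` function, the collection names, identifiers, and sample texts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    collection: str  # hypothetical collection name
    identifier: str  # hypothetical record identifier
    text: str        # text already extracted, e.g. by OCR or speech-to-text


def build_corpus(records, keywords):
    """Return the records whose text mentions any keyword (case-insensitive)."""
    wanted = [keyword.lower() for keyword in keywords]
    return [r for r in records if any(k in r.text.lower() for k in wanted)]


# Made-up records standing in for items drawn from two collections.
records = [
    Record("Collection A", "a-001", "Minutes of the nursing school committee."),
    Record("Collection A", "a-002", "Invoice for laboratory glassware."),
    Record("Collection B", "b-001", "Letter about nursing staff shortages."),
]

corpus = build_corpus(records, ["nursing"])
print([r.identifier for r in corpus])
```

Keyword matching is only one way to scope a corpus. Selection by date range, creator, or material type follows the same pattern of filtering records on a chosen attribute.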