Archives as Data: Publications Collections Datasets

This guide provides an overview of archival collections datasets (“Archives as Data”), primarily those made available by UCSF Archives and Special Collections, and includes guidance for accessing and using such data.

UCSF University Publications Collections Dataset

Publications from or relating to UCSF schools, programs, and research institutes have been digitized and deposited in HathiTrust. A selection of these materials, including course catalogs, announcements, student publications, annual reports, and newsletters, among others, is available as a HathiTrust collection of 712 volumes, UCSF University Publications. Also included in this collection are yearbooks from the UCSF Archives, dating back to 1864.

Data from this collection is also available for download and analysis via the HathiTrust Research Center (HTRC) as the UCSF University Publications workset. The HathiTrust Research Center Analytics website includes a range of tools and environments for conducting text and data mining research on materials in the HathiTrust repository. Using HTRC-provided algorithms, researchers can conduct initial topic modeling, Named Entity Recognition, and other processes to identify trends and patterns across this set of UCSF University Publications. Additionally, researchers can download "extracted features" from this collection, which include non-consumptive publications data such as metadata, unigram tokens (i.e., words), token counts, and other calculated or algorithmically derived data from each HathiTrust volume. Extracted features are described further in the HTRC documentation.
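As a rough illustration of what working with token counts from extracted features can look like, the Python sketch below totals word counts across the pages of one volume. The sample record is hand-made and hypothetical; its field names (features, pages, body, tokenPosCount) are an assumption modeled on the general shape of HTRC Extracted Features files and should be checked against the files actually downloaded. The function name volume_token_counts is likewise invented for this example.

```python
from collections import Counter

# Hypothetical, hand-made record imitating the general shape of an HTRC
# Extracted Features file: per-page token counts, split by part-of-speech tag.
# The field names are an assumption; verify them against real downloads.
sample_volume = {
    "id": "example.volume",
    "features": {
        "pages": [
            {
                "seq": "00000001",
                "body": {"tokenPosCount": {"Medicine": {"NNP": 2}, "the": {"DT": 3}}},
            },
            {
                "seq": "00000002",
                "body": {
                    "tokenPosCount": {
                        "medicine": {"NN": 1},
                        "the": {"DT": 4},
                        "nursing": {"NN": 2},
                    }
                },
            },
        ]
    },
}


def volume_token_counts(volume, lowercase=True):
    """Sum token counts over every page body, ignoring part-of-speech tags."""
    totals = Counter()
    for page in volume.get("features", {}).get("pages", []):
        body = page.get("body") or {}
        for token, pos_counts in (body.get("tokenPosCount") or {}).items():
            key = token.lower() if lowercase else token
            totals[key] += sum(pos_counts.values())
    return totals


counts = volume_token_counts(sample_volume)
print(counts.most_common(3))  # [('the', 7), ('medicine', 3), ('nursing', 2)]
```

Because only counts are used, this kind of analysis stays non-consumptive: the full text of the volume is never reconstructed. For a real workset, the same function could be applied to each downloaded file after parsing it with Python's json module.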

UC ClioMetric History Project Data

Housed by the UC Berkeley Center for Studies in Higher Education in partnership with the UC Office of the President, and directed by Zach Bleemer (Assistant Professor of Economics, Yale University), the UC ClioMetric History Project aggregates historical data about campuses of the University of California system and provides it as raw data, in selected dashboards and visualizations, and through analyses and policy briefs. Data, visualizations, and analyses from the UC ClioMetric History Project provide insight into UC's role as California's, and the country's, premier public university system.

The UC-CHP database currently includes directory records for all UC, Stanford, Caltech, and Mills College students (1893-1946); UC and Stanford faculty and courses (1900-2011); detailed annual UC budgets (1911-2012); digitized student transcripts for UC San Francisco (1947-2017), UC Santa Cruz (1965-2017), and UC Berkeley (1951-2017); and digital student transcript records for UC Irvine (1965-2020), UC Davis (1980-2018), UC Riverside (1981-2018), and UC Santa Barbara (1985-2018). Post-1946 student records are not currently available to outside researchers.