Archives as Data: Digital Archives Collections Datasets

This guide provides an overview of archival collections datasets (“Archives as Data”), primarily those made available by UCSF Archives and Special Collections, including guidance for accessing and using such data.

Industry Documents Library: Data from Selected Collections

The Industry Documents Library (IDL) is a digital archive of documents created by industries which influence public health, hosted by the University of California, San Francisco Library. Originally established in 2002 to house the millions of documents publicly disclosed in litigation against the tobacco industry in the 1990s, the Library has expanded to include documents from the drug, chemical, food, and fossil fuel industries to preserve open access to this information and to support research on the commercial determinants of public health.

Industry Documents Library collections are made available as data, either as downloadable dataset files organized by collection or via API when only limited amounts of targeted data are of interest. Researchers may also download the entire IDL database, but should consult the readme file for instructions and note that this requires 50GB of disk space. The IDL website’s user interface provides access to the most current data, as the website undergoes a new release each month. In contrast, due to time constraints, the IDL dataset files will be updated only twice a year.

The downloadable collection-specific datasets from IDL are provided as delimited data files (comma-separated .csv). For each collection, these files include both structured data (metadata describing individual digital files, which are most often digitized collection folder contents) and unstructured data (most often OCR’ed full text derived from digitized items). OCR quality can vary, and the OCR’ed text is unlikely to represent document text with complete precision and accuracy.
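As a rough illustration of working with such a file, the following minimal Python sketch reads delimited records and computes a crude indicator of OCR quality for each one. The column names (id, title, ocr_text) and the sample rows are hypothetical stand-ins, not the actual IDL schema, and the "word-like ratio" is an illustrative heuristic rather than a method used by IDL; check the header row of a real dataset file before adapting it.

```python
import csv
import io
import re

# OCR'ed full text can be very long; raise the csv module's per-field limit.
csv.field_size_limit(10**9)


def load_records(handle):
    """Return each row of a delimited file as a dict keyed by the header row."""
    return list(csv.DictReader(handle))


def wordlike_ratio(text):
    """Crude OCR-quality proxy: the share of whitespace-separated tokens
    made up only of letters, optionally followed by one punctuation mark."""
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = sum(1 for t in tokens if re.fullmatch(r"[A-Za-z]+[.,;:]?", t))
    return wordlike / len(tokens)


# Hypothetical sample standing in for a downloaded collection file.
# The column names here are assumptions, not the real IDL field names.
SAMPLE = io.StringIO(
    "id,title,ocr_text\n"
    'doc1,Memo,"Please review the attached report."\n'
    'doc2,Scan,"Pl3ase r#vi~w t|e att@ched rep0rt"\n'
)

records = load_records(SAMPLE)
for row in records:
    print(row["id"], round(wordlike_ratio(row["ocr_text"]), 2))
```

For a real download, the in-memory sample would be replaced by a file handle opened with newline="" and an explicit encoding, as the csv module documentation recommends. A low ratio only flags text that may deserve a closer look; it does not measure accuracy against the original document.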

Selected IDL collection-specific datasets are described below:

  • British American Tobacco (BAT) Africa Collection (Tobacco Collections)

Consult the Collection Description, individual items in this Digital Collection, which includes 273 documents, or go directly to this collection as a dataset, which includes metadata and OCR’ed full text. 

This collection contains documents related to British American Tobacco (BAT) operations in Africa. The records document evidence of payments made by BAT to African government officials (including officials working on tobacco control laws), legislators and civil servants, and payments intended to damage competitors. Documents include internal correspondence, reports, project plans, proposals and presentations prepared for African governments, and memos relating to BAT’s Anti Illicit Trade Unit. The records relate to BAT’s anti-illicit trade investigations and operations in Africa and the Middle East, particularly focused on the East African Community (EAC) including Kenya, Uganda, Burundi, Tanzania, and Rwanda, and on South Africa. Topics also include work with and payments to service providers; invoices; discussion of counterfeited products; smuggling; Track and Trace; Digital Tax Verification (DTV) and security products produced by SICPA, FractureCode (Codentify), and ATOS. Other documents are copies of FCTC proceedings; service provider agreements; affidavits and witness statements; MOUs between BAT and government agencies outlining agreements to address smuggling and contraband.

  • Tobacco Industry Influence in Public Policy Collection (Tobacco Collections)

Consult the Collection Description, individual items in this Digital Collection, which includes 2,259 documents, or go directly to this collection as a dataset, which includes metadata and OCR’ed full text.

Documents concerning tobacco industry activities to influence public policy in the United States. This collection is a subset of documents pulled from the larger Tobacco Master Settlement Agreement (MSA) collections for research into industry influence in state legislation. Some documents have been enhanced with descriptions tagged by researchers.

  • Kentucky Opioid Litigation Documents (UCSF-JHU Opioid Industry Documents Archive)

Consult the Collection Description, individual items in this Digital Collection, which includes 281 documents, or go directly to this collection as a dataset, which includes metadata and OCR’ed full text.

The records in this collection come from a lawsuit brought by the Commonwealth of Kentucky against Purdue Pharma alleging improper marketing of the drug OxyContin. The suit was settled in late 2015 for $24 million. Documents were acquired through Stat News' effort to unseal records in the case as well as an open records request made by Professor Antoine Lentacker of UC Riverside. Documents include court motions, filings and depositions of employees as well as internal company documents that have been publicly filed in the court’s docket as exhibits: emails, memos, reports, sales and marketing materials, and articles. Major issues, people, and companies represented: marketing plans; sales data; Purdue Pharma; Abbott Laboratories; US Department of Health and Human Services; Richard Sackler; Michael Friedman.

  • Oklahoma Opioid Litigation Documents (UCSF-JHU Opioid Industry Documents Archive)

Consult the Collection Description, individual items in this Digital Collection, which includes 505 documents, or go directly to this collection as a dataset, which includes metadata and OCR’ed full text.

The records in this collection are selected Johnson & Johnson defendant exhibits and State exhibits admitted during trial in a lawsuit brought by the State of Oklahoma against Purdue, Johnson & Johnson, and other drug companies. The exhibits produced by Johnson & Johnson include agendas, minutes, and transcripts of the Oklahoma Health Care Authority Drug Utilization Review Board; data summaries and confidential reports on opioids, including investigations of use, abuse, misuse, and diversion of specific drugs (fentanyl, tapentadol [brand name Nucynta]).

  • Sales Visits by Purdue in Massachusetts (UCSF-JHU Opioid Industry Documents Archive)

Consult the Collection Description, the Digital Collection, which is a single spreadsheet file, or go directly to this collection as a dataset.

The data in this file represent over 150,000 visits by Purdue Pharma sales representatives to Massachusetts prescribers and pharmacists between 2007 and 2018. The file was filed as an exhibit in Commonwealth v. Purdue Pharma et al., Civil Action No. 1884-cv-01808 (BLS2) (Exhibit 1).

  • Vioxx Litigation Documents (Drug Collections)

Consult the Collection Description, individual items in the Digital Collection, which includes 467 documents, or go directly to this collection as a dataset, which includes metadata and OCR’ed full text.

The Vioxx Litigation Documents collection is comprised of documents from four cases where plaintiffs alleged Merck failed to warn patients of the risk of heart attack and possible death with Vioxx use. Plaintiffs claimed that Merck committed consumer fraud by misleading doctors and patients and by intentionally suppressing, concealing or omitting information about the risks of Vioxx.

  • DC Leaks Coca-Cola Emails (Food Collections)

Consult the Collection Description, individual items in the Digital Collection, which includes 346 documents, or go directly to this collection as a dataset, which includes metadata and OCR’ed full text.

This collection contains a set of internal emails between Coca-Cola executives and Capricia Marshall, a communications consultant working with Coca-Cola as well as the Clinton campaign. Documents describe a variety of strategies to defeat local and national public health policies regarding sugary beverages.

  • Roundup Litigation Documents (Chemical Collections)

Consult the Collection Description, individual items in the Digital Collection, which includes 386 documents, or go directly to this collection as a dataset, which includes metadata and OCR’ed full text.

"The Monsanto Papers" - This collection contains a significant set of documents obtained during the Roundup Products Liability Litigation - lead Case No. 3:16-md-02741-VC. Manufactured by Monsanto, Roundup (glyphosate) is an herbicide widely used by farmers, agricultural workers and the public throughout the United States. Studies have shown that exposure to Roundup can cause cancer and other serious health problems, but Monsanto has repeatedly denied these claims while actively attempting to influence regulations that would address these harms.

  • Climate Investigations Center Collection (Fossil Fuel Collections)

Consult the Collection Description, individual items in the Digital Collection, which includes 1,161 documents, or go directly to this collection as a dataset, which includes metadata and OCR’ed full text. Items may be protected by copyright and are made accessible for fair use purposes, including criticism, comment, news reporting, teaching, scholarship, and/or research.

This collection is comprised of documents obtained by the Climate Investigations Center (CIC) via Freedom of Information Act and other Public Records Requests; contact with industry companies and affiliates; industry archives and document repositories; contact with journalists, academics, and activists who shared their primary research; and litigation preparation.  Spanning 1953 to 2017, the documents come from corporate fossil fuel producers and users, their trade associations and front groups, and think tanks with significant funding from those entities. The corporate documents contain internal reports, memos, correspondence, and scientific studies into anthropogenic climate change and its impacts. The rest of the collection is comprised of news bulletins, mailers, and other information created for public consumption regarding climate change, environmental policy, and international climate negotiations.

COVID Tracking Project Datasets

The COVID Tracking Project (CTP) was a volunteer organization launched from The Atlantic and dedicated to collecting and publishing the data required to understand the COVID-19 outbreak in the United States.

Every day from March 7, 2020 to March 7, 2021, they collected data on COVID-19 testing and patient outcomes from all 50 states, 5 territories, and the District of Columbia. Their dataset was in use by national and local news organizations across the United States and by research projects and agencies worldwide. The over 800 volunteers and project staff built a large online-only community of people from a broad set of backgrounds, and developed a unique culture of work and community.

The UCSF Archives and Special Collections department is working with former CTP volunteers and staff to document the public work and internal community of the project. A finding aid describes collection materials including public websites, white papers, and datasets that are critical to understanding the COVID-19 pandemic outbreak in the United States as well as the organization responsible for data gathering and publication.

Online platforms such as Slack and Google Apps were critical to the daily operation of CTP, but these systems have never been archived for future researchers at a large scale. The archival process includes building open-source tools for the retrieval of data from hosted products and storing them in a format that future researchers can access and understand. Digital oral history materials about the creation of CTP are available, as are datasets published by the COVID Tracking Project archive in Dryad.

In addition, the COVID Tracking Project data explorer provides researchers with access to all of the structured data captured by CTP, state by state, as well as field definitions to interpret project metadata. A collection of tools used to archive the COVID Tracking Project is made available on GitHub.