Person:

Boyd, Ceilyn


Search Results

  • Publication

    "Is this data?" Investigating How Curators Define, Recognize, and Repair Research Data in Data Repositories

    (2025-04-25) Boyd, Ceilyn

    This dissertation project empirically answers the question: What is research data?

Data has many definitions, meanings, and forms; the concept of data as recorded evidence is popular among scientists. However, what counts as data depends upon the discipline, the data's role in a research study, how data is created or collected, and who defines it. Furthermore, definitions are not neutral; they are informed by scientific discourse and use. Images, numerical tabular files, and other commonly accepted examples are abundant. In contrast, edge cases may highlight unusual, dissenting, or marginalized viewpoints and the politics underlying consensus formation. This project investigated the characteristics of these data anomalies and curators’ engagement with them.

To understand how data curators recognize anomalous datasets deposited in research data repositories and transform them into acceptable research datasets, I used quantitative and qualitative methods to answer two research questions: RQ1: What are the characteristics of anomalous datasets? and RQ2: How do data curators identify and repair anomalous datasets? Acceptable research datasets meet minimum repository-specific curation expectations for metadata, data, and documentation files. The term does not imply that observations or values within the files are valid or correct, nor that the dataset meets a Platonic ideal. In contrast, anomalous datasets do not contain data files, are missing other essential elements, or otherwise leave curators uncertain about the presence or absence of research data. During this study, while working as a Dataverse Project software manager at the Institute for Quantitative Social Science (IQSS), I used trace information analysis and quantitative methods to answer RQ1, examining the characteristics of 89,625 datasets deposited in the Harvard Dataverse Repository from 2007 to August 22, 2023. I combined this metadata with information from 315 user support tickets about anomalous datasets in the IQSS Request Tracker (RT) system to understand the differences between acceptable and flawed datasets.

My nine quantitative hypotheses investigated the relationships among datasets classified as Acceptable, Anomalous, or Unknown, as well as among subjects, file formats, the presence of optional metadata blocks, collection categories, publication status, the existence of restricted files, and the overall distribution of anomalous dataset types (e.g., Not data, Missing data). All study hypotheses were supported, indicating that anomalous datasets display distinct characteristics whose presence points to opportunities for workflow, repository software, and machine-assisted dataset quality improvements. As expected, anomalous datasets were more likely to be associated with rarer subject areas and file formats, to use fewer subjects and fewer optional metadata blocks, to have restricted files, and to be deaccessioned more frequently. Additionally, my analysis revealed a relationship between dataset classification and collection category. However, group collections were unexpectedly more likely than individual collections to contain anomalous datasets. Also, as expected, anomalous dataset types were not equally distributed, with more Not data datasets present than other types. My quantitative investigation also indicated how well-curated collections can affect overall repository data quality; I also discussed the benefits and limitations of trace information analysis for capturing and interpreting past critical incidents and curator interventions.

During the qualitative phase of the study, I used critical incident technique (CIT) and semi-structured interviews to answer RQ2. I asked 19 data curators at North American, European, and African repositories what they look for and how they assess the “dataness” of repository users’ datasets. During each hour-long interview, participants shared their data inspection, recognition, and repair procedures and described indicators signaling the presence or absence of research data in users' deposits. They also shared how repository policies, academic and professional experience, curation goals, and perceptions of future data reusers shaped their definitions of research data, acceptable dataset characteristics, and data curation. Participants reported 93 encounters with anomalous datasets in 10 categories, including Missing content, Perhaps data, Not data, Broken data, and other types not mentioned in the literature. They also noted 19 specific indicators of anomalous datasets involving the presence or absence of file- and deposit-level characteristics, six categories of data sensemaking procedures, and four data repair categories. These results indicate that data curation involves sensemaking and that curators perform invisible work while recognizing and repairing datasets.

Results from both streams of inquiry enhance research and practice in data curation, showing how repository policies, technical requirements, capabilities, and curators’ knowledge influence curators' research data operationalization procedures. The findings deepen our understanding of how curators implement acceptable research data in repositories and how repository infrastructure and data curators' competencies shape these engagements and outcomes. Furthermore, they contribute to research on curation workflow modeling, the characteristics of research data, and the often-overlooked work of curators’ invisible data sensemaking and repair. The results also underscore how curation tends to homogenize research datasets’ forms and characteristics, revealing that acceptable dataset definitions are not neutral but are political acts involving power dynamics. This confirmation furthers the discussion about how information and data workers' methods and goals may silence, devalue, or eliminate ways of knowing, such as traditional knowledge systems whose outputs may not conform to common curation practices. Additionally, I make practical recommendations for reducing the number of anomalous repository deposits, improving dataset findability and reusability, accounting for and supporting data curators’ work, and improving data curation training. These developments will help improve dataset reusability, reduce curation workflow frictions, and raise awareness of curation costs.