Publication: Essays on Statistics and Data Science Education
Abstract
Statistics and data science are growing, rapidly evolving fields that are increasingly important for an informed citizenry in a data-saturated world. In this dissertation, I address two central questions: (1) who is taking statistics? and (2) what are statistics courses teaching?
I estimate that 920,000 US students take statistics in high school each year, but this population has not yet been well studied. Using a rich set of survey responses describing 15,727 students’ demographics, career interests and values, STEM identity, grades, and test scores, my first study compares four groups of high-school course-takers: those who take statistics, calculus, both, or neither. I then employ latent profile analysis to shed light on who these students are, showing that students with different profiles take statistics at surprisingly similar rates: statistics is an important part of the academic pathway for a wide range of students and serves a demographically diverse population.
In my second study, I build upon tools from natural language processing and psychometric measurement to develop a human-in-the-loop methodology for measuring latent constructs in large text corpora, and present a framework for doing so. I construct a lexicon-based instrument to measure the extent to which syllabi from college statistics and data science courses align with a vision for modernizing instruction set forth in the Guidelines for Assessment and Instruction in Statistics Education (GAISE) project and across 145 journal articles spanning almost a century. In so doing, I illustrate an approach that researchers can take in bringing measurement questions to text data, a method that I believe strikes a useful balance between interpretability, communicability, validity, and scalability.
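As a rough illustration of what a lexicon-based measure can look like, the sketch below scores a document by the share of its tokens that appear in a term list. It is a minimal sketch under stated assumptions, not the instrument developed in the dissertation: the term list `MODERN_TERMS`, the helper names `tokenize` and `lexicon_score`, and the sample syllabus text are all hypothetical, and a human-in-the-loop workflow would presumably revise the lexicon after reviewing scored documents.

```python
import re
from collections import Counter

# Hypothetical lexicon: illustrative placeholder terms only, not the
# dissertation's actual instrument.
MODERN_TERMS = {"simulation", "visualization", "reproducible", "project"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexicon_score(text, lexicon=MODERN_TERMS):
    """Return the share of tokens in `text` that appear in `lexicon`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # an empty document has no evidence either way
    counts = Counter(tokens)
    hits = sum(counts[term] for term in lexicon)
    return hits / len(tokens)


# Hypothetical syllabus excerpt used only to exercise the function.
syllabus = "Students complete a simulation project and a data visualization."
score = lexicon_score(syllabus)
```

A simple proportion like this is easy to explain and to audit term by term, which is one reason a lexicon approach can be attractive when interpretability matters; it ignores context, so any real instrument would need validation against human judgments.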
My final study applies these instruments to 32,483 syllabi from US statistics and data science courses taught between 2010 and 2018. I find a modest overall increase in modern approaches over this period. Finally, I explore differences between institution types using multilevel models, finding that private and four-year institutions, as well as those with higher admissions rates and Pell-recipient populations, have more modern syllabi, though two-year institutions and schools serving fewer Pell recipients seem to be gaining ground.