Publication: Statistical Advances in Bayesian Logistic Regression, Sports Forecasting, and the Theory of Distributed Estimation
Abstract
Chapter 1: The Pólya-gamma sampler was introduced by Polson (2013) for posterior sampling via data augmentation in Bayesian logistic regression models. In this article we generalize the Pólya-gamma sampler by introducing a family of probability densities parameterized by $a \in (0,1/2]$, with $a = 1/2$ corresponding to the original Pólya-gamma distribution in Polson (2013). We derive properties of this distribution and provide stochastic representations that allow for easy sampling. While it is possible to use any $a \in (0,1/2]$ to construct a data augmentation scheme analogous to the Pólya-gamma data augmentation scheme used in Polson (2013), we show that the samplers constructed using $a < 1/2$ have higher autocorrelation and are thus less efficient than the sampler with $a = 1/2$. We also show that our family of probability densities does not yield a valid probability distribution for $a > 1/2$, and in this sense the Pólya-gamma sampling scheme in Polson (2013) is optimal.
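For readers unfamiliar with the baseline case, the following is a minimal sketch of how one might draw approximately from the original Pólya-gamma distribution $PG(b, c)$, that is, the $a = 1/2$ member of the family. It truncates the standard infinite sum-of-gammas representation of $PG(b, c)$, so it is only an approximation. It does not implement the generalized family or the stochastic representations derived in this chapter. The function name `sample_polya_gamma` and the truncation level `n_terms` are assumptions made for illustration.

```python
import math
import random


def sample_polya_gamma(b, c, n_terms=200, rng=random):
    """Approximate draw from PG(b, c) via a truncated sum of gammas.

    Uses the representation
        omega = (1 / (2 * pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 * pi^2)),
    with g_k ~ Gamma(b, 1) independent. Truncating at n_terms (an
    illustrative choice) introduces a small downward bias.
    """
    total = 0.0
    for k in range(1, n_terms + 1):
        g = rng.gammavariate(b, 1.0)  # Gamma with shape b and scale 1
        total += g / ((k - 0.5) ** 2 + c * c / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)


if __name__ == "__main__":
    random.seed(0)
    draws = [sample_polya_gamma(1.0, 0.0) for _ in range(2000)]
    # The mean of PG(1, 0) is 1/4, so the sample mean should be close to it.
    print(sum(draws) / len(draws))
```

In a data augmentation scheme for logistic regression, draws of this kind would serve as the latent variables conditional on the current linear predictor. Exact samplers exist and would be preferable to truncation in practice.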
Chapter 2: Many people interested in sports, from coaches to fans, want to predict how athletes will perform in future competitions. The main tools for prediction are ratings systems, such as Elo or Glicko. While ratings systems provide a natural framework for comparing athletes, they do not have predictive power because they do not provide any structure for future ratings. In McKeough (2020), growth curves were used to model observed ratings, which provides a flexible approach to predicting future athlete skills. Here, we revisit the model from McKeough (2020) in order to improve upon the computational strategy proposed in that work. We also provide examples on data from men's slalom and women's luge to demonstrate the benefits of this approach.
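To make the limitation of ratings systems concrete, here is a minimal sketch of the textbook two-player Elo update. It is not the ratings procedure or the growth-curve model used in this chapter, and slalom and luge are multi-competitor events that would need an adaptation not described here. The function names and the default `k=32.0` are assumptions for illustration.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, and 0.0 for a loss.
    k is the step size (an illustrative default).
    """
    expected_a = elo_expected(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    print(elo_update(1500.0, 1500.0, 1.0))
```

The update reacts only to results that have already occurred. Nothing in it describes how a rating should evolve between or beyond observed competitions, which is the gap a model for the rating trajectory is meant to fill.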
Chapter 3: In the era of big data, it is necessary to split extremely large data sets across multiple computing nodes and construct estimators using the distributed data. When designing distributed estimators, it is desirable to minimize the amount of communication across the network, because transmission between computers is slow in comparison to computation within a single computer. Our work provides a general framework for understanding the behavior of distributed estimation under communication constraints for nonparametric problems. We provide results for a broad class of models, moving beyond the Gaussian framework that dominates the literature. As concrete examples, we derive minimax lower bounds and matching upper bounds in the distributed regression, density estimation, classification, Poisson regression, and volatility estimation models under communication constraints. To assist with this, we provide sufficient conditions that can be easily verified in all of our examples.
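As a toy illustration of what a communication constraint means, the sketch below estimates a mean from data split across nodes, where each node may transmit only a fixed number of bits. It is a simple parametric example with uniform quantization, not one of the nonparametric estimators or bounds studied in this chapter. All names (`quantize`, `dequantize`, `distributed_mean`) and the assumption that the data lie in $[0, 1]$ are made up for illustration.

```python
def quantize(value, bits, lo=0.0, hi=1.0):
    """Encode a value in [lo, hi] as an integer code using `bits` bits (bits >= 1)."""
    levels = 2 ** bits
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * (levels - 1))


def dequantize(code, bits, lo=0.0, hi=1.0):
    """Map an integer code back to the grid point in [lo, hi] that it represents."""
    levels = 2 ** bits
    return lo + code / (levels - 1) * (hi - lo)


def distributed_mean(shards, bits):
    """Average the b-bit messages sent by each node.

    Each node computes its local mean and transmits only the quantized
    code; the central machine decodes the codes and averages them.
    """
    codes = [quantize(sum(shard) / len(shard), bits) for shard in shards]
    return sum(dequantize(code, bits) for code in codes) / len(codes)


if __name__ == "__main__":
    shards = [[0.2, 0.4], [0.6, 0.8]]
    print(distributed_mean(shards, bits=8))
```

With more bits per node the quantization error shrinks, at the cost of more communication. Tradeoffs of this kind are what the minimax bounds under communication constraints quantify, in far more general settings than this example.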