A Forest for the Trees: Using Random Forests for Small Area Estimation on US Forest Inventory Data

Schmitt, Julian Francis

Citation

Schmitt, Julian Francis. 2023. A Forest for the Trees: Using Random Forests for Small Area Estimation on US Forest Inventory Data. Bachelor's thesis, Harvard University Engineering and Applied Sciences.

Abstract

Methods that estimate population parameters of interest across small areas are a growing field of research. These problems arise frequently in election prediction, healthcare monitoring, and environmental studies. The Forest Inventory and Analysis Program (FIA) of the US Forest Service tracks forest metrics, such as basal area and above-ground carbon, to ensure sustainable stewardship of the nation's forests and preserve her resources for future generations. Their estimates combine expensive ground plot observations of the variables of interest with inexpensive and plentiful auxiliary data collected by remote sensing. Historically, estimators in this setting have relied on either means or linear parametric models, such as the post-stratified estimator, the area-level empirical best linear unbiased predictor (area-EBLUP), and the unit-level empirical best linear unbiased predictor (unit-EBLUP). Here, we present the results of a simulation study comparing these standard estimators to a new problem-specific estimator, as well as to machine learning models. The problem-specific zero-inflated estimator is introduced to address the overabundance of zero observations in FIA ground plot data, while the machine learning methods, including the random forest and the mixed-effects random forest (SMERF), seek to flexibly capture non-linear relationships between the predictors and the response variable, improving performance while also addressing the zero-inflation problem. We track both bias and root mean squared error across the six estimators to assess their performance and find that there is no universal "best model." Instead, we find a complex story: the post-stratified and area-EBLUP models have exceptionally low bias, particularly across areas with low carbon levels, but when examining root mean squared error, the zero-inflation model performs well. At higher carbon levels, model performance is even more complex. We close with the implications of these results and avenues for improving estimation at scale across the US.
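
The two performance measures named in the abstract, bias and root mean squared error, can be illustrated with a minimal sketch. This is not the thesis's code: the function name `bias_and_rmse`, the toy zero-inflated population, and the sample sizes are all hypothetical, and the plain sample mean is used only as a stand-in for an estimator, not as one of the six estimators compared in the study.

```python
import math
import random


def bias_and_rmse(estimates, truth):
    """Empirical bias and RMSE of repeated estimates of one known quantity."""
    errors = [est - truth for est in estimates]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse


# Hypothetical toy population for a single small area. Roughly 60% of the
# values are exactly zero to mimic a zero-inflated response (assumption).
rng = random.Random(0)
population = [
    0.0 if rng.random() < 0.6 else rng.uniform(10.0, 50.0)
    for _ in range(2000)
]
true_mean = sum(population) / len(population)

# Simulation loop: draw a small sample, estimate the area mean with the
# sample mean, and repeat to build up a distribution of estimates.
estimates = []
for _ in range(500):
    sample = rng.sample(population, 10)
    estimates.append(sum(sample) / len(sample))

bias, rmse = bias_and_rmse(estimates, true_mean)
print(f"bias={bias:.3f}, rmse={rmse:.3f}")
```

In a simulation study of this general shape, the same two numbers would be computed per estimator and per area, which is what makes it possible for one estimator to have low bias while another has lower root mean squared error.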

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Citable link to this page

https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37378277

Collections

FAS Theses and Dissertations
