Font Size:

Logistic Methods for Resource Selection Functions and Presence-Only Species Distribution Models

Last modified: 2011-08-04

#### Abstract

In order to better protect and conserve biodiversity, ecologists use machine learning and statistics to understand how species respond to their environment and to predict how they will respond to future climate change, habitat loss and other threats. A fundamental modeling task is to estimate the probability that a given species is present in (or uses) a site, conditional on environmental variables such as precipitation and temperature. For a limited number of species, survey data consisting of both presence and absence records are available, and can be used to fit a variety of conventional classification and regression models. For most species, however, the available data consist only of occurrence records --- locations where the species has been observed. In two closely-related but separate bodies of ecological literature, diverse special-purpose models have been developed that contrast occurrence data with a random sample of available environmental conditions. The most widespread statistical approaches involve either fitting an exponential model of species' conditional probability of presence, or fitting a naive logistic model in which the random sample of available conditions is treated as absence data; both approaches have well-known drawbacks, and do not necessarily produce valid probabilities. After summarizing existing methods, we overcome their drawbacks by introducing a new scaled binomial loss function for estimating an underlying logistic model of species presence/absence. Like the Expectation-Maximization approach of Ward et al. and the method of Steinberg and Cardell, our approach requires an estimate of population prevalence, $\Pr(y=1)$, since prevalence is not identifiable from occurrence data alone. In contrast to the latter two methods, our loss function is straightforward to integrate into a variety of existing modeling frameworks such as generalized linear and additive models and boosted regression trees. We also demonstrate that approaches by Lele and Keim and by Lancaster and Imbens that surmount the identifiability issue by making parametric data assumptions do not typically produce valid probability estimates.

Full Text:
PDF