Friday, January 30, 2015

I see you: presence-only data.

While I've been reading a lot of black bear papers recently in preparation for a grant proposal, today I'm going to post about the issue of presence-only data: data that include only information about where individuals have been detected and do not include information about where individuals have not been detected. Data that include both are often referred to as presence-absence data, which in most cases, if not all, are preferred because information about where individuals are not is just as informative for inferences about range distributions, population size, etc. However, sometimes such data can't be collected, because the data come from museum or herbarium collections, or are collected opportunistically or incidentally from sightings or reports. Opportunistic citizen science data are an example of presence-only data. You take what you can get. In those cases, perhaps a lot of data can be collected, but the 'compromised' quality requires some thought and creative work-arounds. I've only just started thinking and reading about the problem, but my understanding of it is as follows:

The problem: The data collected consist of instances and locations where individuals or species have been observed or detected, i.e., y=1. Basing range distributions solely on these data ignores places that were not sampled but truly have y=1, and there is also no information about where individuals are absent, i.e., y=0. There is no reference or background against which to assess the collected y=1's. Thus, only naive, cursory estimates of occupancy are possible, or at least that has been the prevailing thought.

Current proposed and adopted methods: In addition to the observed y=1's, environmental or covariate data are often also collected at the sampled locations, in an attempt to relate presence, occupancy, or distribution to environmental attributes like elevation, percent forest cover or urban densities.

There are envelope models that describe the distribution of the presence-only data. Methods like BIOCLIM, HABITAT, and SVM are examples, and I know nothing about these.

An option is to determine a reference or background against which to compare the observed y=1's. In other words, lacking y=0 data, an option is to generate them, or create pseudo-absences. But post-hoc y=0's could contain both true y=0's and some y=1's that appear to be y=0 if detection probability is less than 100%. Furthermore, a researcher then has to determine how many of these pseudo-absences to create, and how many are created can greatly affect the estimated probability of occurrence, that is, of the state y=1. One approach to creating pseudo-absences is a case-control design, where the logistic regression for the probability of state y=1 given the environmental covariate data is adjusted with the ln of the proportions of the occupied and unoccupied locations. But we don't know how those occupied and unoccupied locations are actually split in the real world. That background can be viewed as an unsampled matrix of unused landscape (as for plants, which either use a spot or don't), or as landscape that is available for use (a bear moving across a landscape could use agricultural areas, but perhaps just less often); this is apparently a subtle but methodologically important distinction. Some have suggested that the ratio of sampled to unsampled background locations be several orders of magnitude in size to minimize sampling errors (Manly et al. 2002 and McDonald 2003 via Pearce and Boyce 2006). An exponential model to estimate the relative likelihood of occupancy or occurrence can be used instead of the logistic model, and finally another approach is to use a logistic regression to approximate a logistic discrimination model. When relative likelihoods of occupancy are estimated, they are not constrained to be less than 1, which is weird. All of these approaches attempt to account for the background or landscape from which the observed y=1's were collected, but each takes a slightly different approach that I have not read enough on to describe.
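To see why the number of pseudo-absences matters so much, here is a minimal sketch with made-up numbers (the habitat classes, the counts, and the function name are all my own assumptions, not from any of the papers). With a single categorical covariate, a logistic regression with one parameter per class just reproduces the observed proportion of y=1 in each class, so no fitting library is needed. The fitted "probability" of presence swings wildly with the number of pseudo-absences, while the forest-versus-open contrast on the log-odds scale stays put; only the intercept moves.

```python
import math

def saturated_logistic_fit(presences, background_share, n_background):
    """Fitted P(y=1 | habitat) when pseudo-absences are treated as real zeros.

    With one parameter per habitat class, the logistic-regression MLE equals
    the observed proportion of presences among all points in that class.
    """
    fitted = {}
    for habitat, n_presence in presences.items():
        n_pseudo_absence = background_share[habitat] * n_background
        fitted[habitat] = n_presence / (n_presence + n_pseudo_absence)
    return fitted

def log_odds(p):
    return math.log(p / (1.0 - p))

# Hypothetical data: 100 presence records, on a landscape that is half forest.
presences = {"forest": 80, "open": 20}
background_share = {"forest": 0.5, "open": 0.5}

few = saturated_logistic_fit(presences, background_share, n_background=100)
many = saturated_logistic_fit(presences, background_share, n_background=10_000)

# The habitat contrast (slope) is the same in both fits...
contrast_few = log_odds(few["forest"]) - log_odds(few["open"])
contrast_many = log_odds(many["forest"]) - log_odds(many["open"])

# ...but the intercept shifts by the log of the ratio of background sizes.
intercept_shift = log_odds(few["open"]) - log_odds(many["open"])

print("P(y=1 | forest), 100 pseudo-absences:   ", round(few["forest"], 3))
print("P(y=1 | forest), 10,000 pseudo-absences:", round(many["forest"], 3))
print("log-odds contrast (few, many):", contrast_few, contrast_many)
print("intercept shift:", intercept_shift)
```

In this toy setup the ranking of habitats is stable, but the absolute probability of occurrence is essentially whatever the analyst makes it by choosing a background size, which is my understanding of why these fits are usually read as relative rather than absolute.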

Pearce and Boyce (2006) state, "We are unaware of any application explicitly modelling abundance given presence only". Although not about abundance, Royle et al. (2012) came out with a likelihood approach for occurrence/occupancy probability with presence-only data, arguing that the popular and widely used MaxEnt doesn't actually estimate that, but instead provides habitat suitability indices that are quite different from estimates of occurrence probability. Royle et al. provide a parametric approach that can be implemented via maxlike, an R package, and that invokes Bayes' rule, which requires random sampling and constant detection probability. The major problem with MaxEnt, as I understand their argument, is that it uses a penalized, exponential version of the detection probability given occurrence, based on the maximum entropy distribution. This penalization shrinks the regression coefficients toward 0, but Royle et al. argue that this approach biases the estimator because the intercept, Beta0, is set to an arbitrarily determined number. In comparing MaxEnt to their approach, they found that MaxEnt variably under- and overestimated occurrence probability. They caution that effort and detection probability are in fact often not constant, such as with roadside surveys or where density of the study population or effort is high.
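Here is how I picture the intercept issue, as a toy sketch rather than the authors' implementation. Under random sampling and constant detection, Bayes' rule gives the probability that a presence record falls on pixel i as psi(x_i) divided by the sum of psi over all pixels in the landscape. The four-pixel landscape, the covariate values, and the function names below are hypothetical. With an exponential form for psi the intercept cancels out of that ratio, so the data cannot say anything about it; with a logistic form it does not cancel.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def presence_density(psi_values):
    """P(pixel | y=1) under equal sampling of pixels: psi_i / sum(psi)."""
    total = sum(psi_values)
    return [psi / total for psi in psi_values]

# Hypothetical landscape: a standardized covariate on four pixels.
elevation = [-1.0, 0.0, 1.0, 2.0]
beta1 = 1.0  # assumed slope, held fixed

def density_under(link, beta0):
    return presence_density([link(beta0 + beta1 * x) for x in elevation])

# Two very different intercepts, same slope.
exp_a = density_under(math.exp, -2.0)
exp_b = density_under(math.exp, 0.5)
logit_a = density_under(logistic, -2.0)
logit_b = density_under(logistic, 0.5)

# Exponential form: the two densities match, so beta0 leaves no trace.
exp_gap = max(abs(a - b) for a, b in zip(exp_a, exp_b))
# Logistic form: the densities differ, so the data carry information on beta0.
logit_gap = max(abs(a - b) for a, b in zip(logit_a, logit_b))

print("largest difference, exponential form:", exp_gap)
print("largest difference, logistic form:   ", logit_gap)
```

The flip side, as far as I can tell, is that the information about the intercept in the logistic case comes entirely from assuming that particular functional form.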

Of course, a reply came quickly from Hastie and Fithian (2013), and the debate about making inferences from presence-only data is not over yet. They claim Royle et al. have performed "statistical alchemy" by imposing parametric assumptions to estimate overall occurrence probability. This is shaky ground on which to build inference.

Needless to say, presence-only data are tricky to work with, and presence-absence data seem preferable in every comparison. On the other hand, presence-only data can be cheaper to collect and therefore yield larger datasets, which makes them attractive to use. In particular, I have been thinking about studies that may try to combine presence-absence data, such as from capture-recapture and occupancy efforts, with presence-only data. It seems like finding a way to generate pseudo-absences to make the presence-only data mirror the presence-absence data would be one approach. I can imagine studies where presence-absence data are collected via capture-recapture, while presence-only data come from depauperate occupancy approaches. How to combine, then? Blanc et al. (2014) provide one such example with Eurasian lynx, by making abundance an explicit instead of derived parameter in the estimating models, and hinging the connection between abundance and occupancy on the fact that occupancy is only possible when abundance is >0. They mention that their approach is a development on Freeman and Besbeas (2012) with the addition of imperfect detection. But then they mention that this is all for non-spatial capture-recapture, because their abundance N ~ homogeneous Poisson(lambda) and N is explicit, whereas spatial capture-recapture approaches use an inhomogeneous process and N is derived.
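The abundance-occupancy link is easy to write down if abundance at a site really is Poisson with mean lambda: a site is occupied when N > 0, and P(N > 0) = 1 - exp(-lambda). The sketch below is just that identity and its inverse; the function names are mine, and I have not checked that this is exactly how Blanc et al. parameterize their model.

```python
import math

def occupancy_from_abundance(lam):
    """P(site occupied) = P(N > 0) when N ~ Poisson(lam)."""
    if lam < 0:
        raise ValueError("lambda must be non-negative")
    return 1.0 - math.exp(-lam)

def abundance_from_occupancy(psi):
    """Inverse link: the Poisson mean implied by occupancy probability psi."""
    if not 0.0 <= psi < 1.0:
        raise ValueError("psi must be in [0, 1)")
    return -math.log(1.0 - psi)

# Occupancy saturates quickly as expected abundance grows.
for lam in (0.1, 0.5, 1.0, 3.0):
    print(lam, round(occupancy_from_abundance(lam), 3))
```

One thing this makes plain is that occupancy data alone become uninformative about abundance once lambda is large, because psi is already close to 1, which may be part of why pairing them with capture-recapture data is appealing.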


Pearce, J. L. and Boyce, M. S. (2006), Modelling distribution and abundance with presence-only data. Journal of Applied Ecology, 43: 405–412. doi: 10.1111/j.1365-2664.2005.01112.x

Royle, J. A., Chandler, R. B., Yackulic, C. and Nichols, J. D. (2012), Likelihood analysis of species occurrence probability from presence-only data for modelling species distributions. Methods in Ecology and Evolution, 3: 545–554. doi: 10.1111/j.2041-210X.2011.00182.x

Hastie, T. and Fithian, W. (2013), Inference from presence-only data; the ongoing controversy. Ecography, 36: 864–867. doi: 10.1111/j.1600-0587.2013.00321.x

Blanc, L., Marboutin, E., Gatti, S., Zimmermann, F., Gimenez, O. (2014), Improving abundance estimation by combining capture–recapture and occupancy data: example with a large carnivore. Journal of Applied Ecology, 51: 1733–1739. doi: 10.1111/1365-2664.12319
