Ensemble Random Forests as a tool for modeling rare occurrences

Relative to target species, priority conservation species occur rarely in fishery interactions, resulting in imbalanced, overdispersed data. We present Ensemble Random Forests (ERFs) as an intuitive extension of the Random Forest algorithm to handle rare event bias. Each Random Forest receives individual stratified randomly sampled training/test sets, then downsamples the majority class for each decision tree. Results are averaged across Random Forests to generate an ensemble prediction. Through simulation, we show that ERFs outperform Random Forest with and without down-sampling, as well as with the synthetic minority over-sampling technique, for highly class imbalanced to balanced datasets. Spatial covariance greatly impacts ERFs’ perceived performance, as shown through simulation and case studies. In case studies from the Hawaii deep-set longline fishery, giant manta ray Mobula birostris syn. Manta birostris and scalloped hammerhead Sphyrna lewini presence had high spatial covariance and high model test performance, while false killer whale Pseudorca crassidens had low spatial covariance and low model test performance. Overall, we find ERFs have 4 advantages: (1) reduced successive partitioning effects; (2) prediction uncertainty propagation; (3) better accounting for interacting covariates through balancing; and (4) minimization of false positives, as the majority of Random Forests within the ensemble vote correctly. As ERFs can readily mitigate rare event bias without requiring large presence sample sizes or imparting considerable balancing bias, they are likely to be a valuable tool in bycatch and species distribution modeling, as well as spatial conservation planning, especially for protected species where presence can be rare.


INTRODUCTION
Machine learning algorithms have proven to be a ubiquitous tool for modeling species distributions. A subset of these approaches are nonparametric and have the potential to model complex, non-linear environmental responses that drive some species distributions (Breiman 2001b, Olden et al. 2008, Stock et al. 2020). For marine species, many environmental covariates are proxies for a given organism's environmental cues, increasing the importance of modeling non-linear responses (Pielke et al. 2003, Robinson et al. 2011). Unfortunately, many machine learning algorithms are designed for 'big data', which is lacking for many species (Martin et al. 2015), especially those that are priority conservation concerns, where observations are often a rare occurrence. Modeling these rare events requires special consideration, but the resulting species distribution is critical for determining key environmental covariates of the distribution, establishing conservation areas, and identifying potential conservation threats (Millar et al. 2007, Elith & Leathwick 2009, Evans et al. 2011).
Rare events induce an intrinsic imbalanced data problem where the presences of a species are underrepresented, skewing the distribution of classes (e.g. the presence and absence of a given species) (He & Garcia 2009, Kuhn & Johnson 2013). This has been broadly referred to as 'rare event bias', where the data are typically overdispersed, exhibiting greater variability than expected from standard probability distributions (Martin et al. 2015). A variety of methods have been utilized in ecology to deal with overdispersion (e.g. zero-inflated, hurdle, or delta-generalized linear models), most of which implement a mixture model in some fashion (Zuur et al. 2009, Campbell 2015, Stock et al. 2020). While these models can be well suited to modeling rare event data, large sample sizes are often needed to overcome the low proportion of presences and estimate model parameters with adequate precision (He & Garcia 2009, Rodriguez-Galiano et al. 2012). Further, many of these models are conditioned on linear responses that can be inadequate for modeling species distributions (Breiman 2001b, Austin 2007, Olden et al. 2008, Merow et al. 2014). Decision tree approaches, in particular Random Forest (RF) (Breiman 2001a), are well suited to modeling rare event data (Strobl et al. 2007). Random Forest is a supervised machine-learning algorithm based on generating a 'forest' of decision trees grown in parallel, referred to as bagging in machine learning. For each decision tree in the forest, classified data are passed through the tree and covariates are used to separate the data into classified groups (typically binomial data but can be categorical or continuous) (Breiman 2001a). Categorical and continuous covariates are then used (with cutoffs applied to continuous covariates) to make decisions at every node in individual decision trees to attempt to best separate the classes.
A random subset of data is provided to each tree (the in-the-bag dataset), and a random subset of covariates is tried at each node in each tree. Each classification tree votes on a datum's class, and the proportion of trees voting for a given class equals the predicted probability of that class for each observation.
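The vote-to-probability relationship can be checked directly in code. Below is a small scikit-learn sketch (Python rather than the R randomForest package used in this study; the data and settings are purely illustrative): with fully grown trees, averaging the per-tree class votes matches the forest's predicted probability.

```python
# Sketch: a Random Forest's class probability is the fraction of trees
# voting for that class (holds when trees are grown to pure leaves).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # a simple toy decision rule

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Probability from predict_proba vs. the mean vote across the trees
p_forest = rf.predict_proba(X[:5])[:, 1]
p_votes = np.mean([t.predict(X[:5]) for t in rf.estimators_], axis=0)
```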
Relative to other machine learning approaches, RF tends to balance overfitting and prediction well through the generation of many 'weak learners' (the individual decision trees) and the creation of an ensemble of their predictions (the leaves) to generate a 'strong learner' (the forest) (Breiman 2001b, Evans et al. 2011). Additionally, the performance of RF has been generally high in building species distribution models with rare species (Williams et al. 2009, Rodriguez-Galiano et al. 2012). In such models, randomness is imparted in 2 ways: (1) a random subset of the data is passed through a given tree in the 'forest'; and (2) random covariates are drawn at each node split when growing the tree, resulting in high diversity within the 'forest' as well as accounting for covariate interactions. Nonetheless, rare event bias and the imbalanced data problem result in low performing RFs due to 2 phenomena: (1) successive partitioning of the data when growing the decision trees causes them to 'see' fewer and fewer of the rare events, thus fitting more and more to the majority class (the absences); and (2) interactions between covariates can go unlearned by the decision trees due to the sparseness of the data induced by partitioning (Fig. 1A) (He & Garcia 2009).
A variety of sampling methods have been proposed to deal with the class imbalance problem in RF (Kuhn & Johnson 2013). Down-sampling methods select data points from the minority and majority classes such that both classes end up with approximately the same sample size. Internally down-sampled approaches accomplish this balancing via a stratified random sample of the full dataset to generate the subsets passed to each tree (Kuhn & Johnson 2013). However, down-sampling imparts a bias against the majority class, resulting in weak-learning RFs and poor model performance (Fig. 1B) (He & Garcia 2009). Up-sampling methods generate pseudo-replicates of the minority class to correct the class imbalance. Chawla et al. (2002) describe the synthetic minority over-sampling technique (SMOTE), which randomly selects a minority class datum, calculates the K nearest neighbors, and randomly blends data from these nearest neighbors into pseudo-replicates of the minority class. Additionally, the SMOTE approach can down-sample the majority class to further help balance the classes. As Kuhn & Johnson (2013) discuss, these resampling methods can resolve class imbalances, but there is little consensus on the best approach.
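The two balancing strategies can be sketched in a few lines. The following is a hand-rolled numpy illustration, not a production implementation (libraries such as imbalanced-learn provide the latter); the function names and toy data are ours.

```python
# Sketch of down-sampling and SMOTE-style up-sampling for a binary
# presence (1) / absence (0) dataset. Illustrative only.
import numpy as np

rng = np.random.default_rng(1)

def downsample(X, y):
    """Randomly subsample the majority class to the minority class size."""
    pres, abs_ = np.where(y == 1)[0], np.where(y == 0)[0]
    keep = rng.choice(abs_, size=pres.size, replace=False)
    idx = np.concatenate([pres, keep])
    return X[idx], y[idx]

def smote_like(X_min, n_new, k=1):
    """SMOTE-style pseudo-replicates: blend each sampled minority point
    with one of its k nearest minority neighbors."""
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]              # k nearest neighbors
    i = rng.integers(0, X_min.shape[0], size=n_new)
    j = nn[i, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                   # random blend weight
    return X_min[i] + lam * (X_min[j] - X_min[i])

X = rng.normal(size=(300, 2))
y = (rng.random(300) < 0.1).astype(int)            # ~10% presences
Xb, yb = downsample(X, y)                          # balanced subset
X_new = smote_like(X[y == 1], n_new=int(y.sum()))  # pseudo-replicates
```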
In this study, we present Ensemble Random Forests (ERF) as a method to mitigate the bias against the majority class that arises from balancing imbalanced datasets and, as a result, reduce rare event bias. The overall concept is simple: train an ensemble of RFs with down-sampling and average over the predictions of each RF in the ensemble. Using simulations, we compare ERF to other RF-based rare event approaches and assess the performance of ERF across a range of spatial covariance in the simulated data. Specifically, this study (1) compares, using simulation, the performance of the standard, down-sampling, SMOTE, and ensemble implementations of RFs as a function of class imbalance for 3 performance metrics; and (2) evaluates the importance of spatial contrast in ERF performance through simulation. To demonstrate the functionality of ERF to model real rare events, we modeled the distribution of pelagic species bycatch in the central North Pacific using fisheries-dependent data from the Hawaii-based deep-set longline fishery. Modeling the distributions of pelagic species is particularly difficult: few scientific surveys are directed at capturing species presence-absence, resulting in a reliance on bycatch in fisheries. However, these bycatch events are often rare (between 100:1 and 1000:1) or ultra-rare (between 1000:1 and 10 000:1) (Martin et al. 2015). We used 2 case studies to (1) evaluate the performance and prediction uncertainty of ERF as a function of class imbalance using a commonly caught, secondary target species in the Hawaiian deep-set longline fishery, wahoo Acanthocybium solandri; and (2) compare spatial contrast and ERF predictive ability for 3 Endangered Species Act listed species: giant manta ray Mobula birostris syn. Manta birostris (White et al. 2018) (Threatened), scalloped hammerhead Sphyrna lewini (Endangered), and false killer whale Pseudorca crassidens (Endangered). The purpose of these simulation and case study objectives was to demonstrate the effects of sample size and sample covariation on model performance and the utility of ERFs as a bycatch and species distribution modeling tool.

Fig. 1. A schematic of a classical Random Forest implementation for a balanced dataset (A) using random draws (1) to generate the training/test sets. (B) Bootstrapping (2) and balancing (3) are techniques used to modify the classical implementation to deal with imbalanced class datasets. (C) The Ensemble Random Forests modeling framework proposed in this study iterates these modifications to generate an ensemble of Random Forest models. Observations (presence-absence data) are split into training and test datasets using bootstrapping (2); the former is balanced (3) and used to train a Random Forest model, while the latter is used to assess model performance, and the combined datasets are used to make model predictions. This process can be repeated k times to generate k models and distributions of model performance metrics and predictions. The mean and uncertainty of model performance metrics and predictions (4) can then be extracted to generate the results of the Ensemble Random Forests models

Background
We refer to ERF as an intuitive extension of RF for 2 reasons. The first is that the ERF method builds upon the existing algorithmic architecture of RF, which bags the data into subsets for each decision tree within the forest (Fig. 1A). ERFs extend this bagging procedure to double bagging: bagging first within each RF to grow decision trees and then by generating train/test data subsets for each RF within the ensemble (Fig. 1). The second reason is that ensemble modeling is recommended as common practice for species distribution models (Araújo & New 2007) and, in practice, each individual RF is already an ensemble of decision trees. The procedure is roughly equivalent to a k-fold cross-validation procedure where k paired training and test partitions are drawn from the full dataset. ERFs differ from k-fold cross-validation in that each model's predictions are used to form an ensemble prediction.
Overall, ERFs are straightforward to implement, employing many of the same 'best practices' recommended for RF when using imbalanced class datasets (Cutler et al. 2007, Rogan et al. 2008, Evans et al. 2011, Kuhn & Johnson 2013). Each RF receives mutually exclusive random subsets of the data: one to grow the decision trees, the training set, and a second to conduct internal validation, the test set. These datasets are created by stratified random draws with replacement from the total dataset, in order to generate training and test datasets with class proportions close to those of the total dataset. Each RF grows decision trees on subsets of their individual training set, called the in-the-bag dataset, and holds out some data, the out-of-bag dataset, used to internally measure the prediction error of the RF. When growing the decision trees, each RF uses a balanced weighting scheme (i.e. equal proportions of each class), also called down-sampling (Kuhn & Johnson 2013), to reduce the bias imparted by class imbalance (Fig. 1B). Each RF then predicts on the test dataset, to generate individual RF performance metrics, and on the full dataset. The predictions of each RF on the full dataset are then averaged to form the ERF predictions (Fig. 1C) (Marmion et al. 2009). Performance metrics are also calculated for the ensemble predictions.

Implementation
The base RF procedure, implemented using the randomForest package (Liaw & Wiener 2002) in R 3.6.1 (R Core Team 2018), was extended to create the ERF routine. Each RF received a training set containing 90% of the data and a test set of 10%, where the train and test subsets received a representative sample (i.e. the proportion of classes, presence and absence, was equal to their proportion in the whole dataset). The individual decision trees were trained on balanced in-bag datasets (i.e. an equal number of presences and absences was provided to each tree). Each RF model grew 500 decision trees and randomly drew 3 covariates at each node in each tree. The number of decision trees to fit and the number of covariates to try at each node were optimized using a single RF model for each case study. The number of RFs in the ensemble was optimized by determining the minimum number of RFs that resulted in asymptotic behavior in the variance of model performance across a suite of case studies.
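For readers working outside R, the double-bagging routine might be sketched as follows in Python/scikit-learn. This is an illustrative analogue, not the study's implementation: the ensemble size and tree counts are reduced from the paper's 100 forests and 500 trees for speed, the data are synthetic, and `class_weight='balanced_subsample'` stands in for randomForest's per-tree down-sampling (`sampsize`), which has no direct scikit-learn equivalent.

```python
# Sketch of Ensemble Random Forests: stratified train/test split per
# forest, balanced trees, predictions averaged across the ensemble.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def ensemble_rf(X, y, n_forests=10, seed=0):
    preds, aucs = [], []
    for k in range(n_forests):
        # Stratified 90/10 train/test split for each forest in the ensemble
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.1, stratify=y, random_state=seed + k)
        rf = RandomForestClassifier(
            n_estimators=100, max_features=3,
            class_weight='balanced_subsample',  # weighting analogue of down-sampling
            random_state=seed + k).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
        preds.append(rf.predict_proba(X)[:, 1])  # predict on the full set
    # Ensemble prediction = mean across forests; per-forest AUCs give
    # a distribution of internal test performance
    return np.mean(preds, axis=0), np.array(aucs)

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
y = (rng.random(500) < 1 / (1 + np.exp(-2 * X[:, 0]))).astype(int)
p_hat, aucs = ensemble_rf(X, y)
```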

Performance
The test dataset for each RF in the ensemble was used for internal validation and to generate metrics of individual model performance. We chose to focus on 3 performance metrics: (1) the area under the curve (AUC), (2) the root mean squared error (RMSE), and (3) the true skill statistic (TSS). The receiver operating characteristic curve (ROC), which plots 1 − specificity against the sensitivity for various thresholds (binary cutoff values) between 0 and 1, was used to determine AUC. ROC metrics were calculated using the ROCR package (Sing et al. 2005) in R. AUC values range between 0 and 1; models with values less than 0.5 perform worse than random, models with values of 0.5 equal random, and models with values more than 0.5 perform better than random. Typically, models with AUCs > 0.7 are deemed useful (Phillips 2005), and models with AUCs > 0.95 are deemed to perform exceptionally well. RMSE was determined using:

\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{x}_i - x_i)^2}    (1)

where \hat{x}_i is the prediction of a given model, x_i are the observations, and N is the total number of observations. TSS was calculated as the maximum, across thresholds, of the sum of the true positive rate (TPR) and true negative rate (TNR) minus 1:

\mathrm{TSS} = \max(\mathrm{TPR} + \mathrm{TNR} - 1)    (2)

Code to generate the simulations and the ERF routine is provided at https://osf.io/q9wfn/.
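The three metrics can be computed in a few lines. Below is a Python sketch with illustrative data (the study itself used the ROCR package in R); note that maximizing TPR + TNR − 1 over thresholds is equivalent to maximizing TPR − FPR along the ROC curve, since TNR = 1 − FPR.

```python
# Sketch of AUC, RMSE (Eq. 1), and TSS (Eq. 2) on toy predictions.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def rmse(y_obs, p_hat):
    # Eq. 1: sqrt of the mean squared difference between predictions
    # and observations
    return np.sqrt(np.mean((p_hat - y_obs) ** 2))

def tss(y_obs, p_hat):
    # Eq. 2: max over thresholds of TPR + TNR - 1 = TPR - FPR
    fpr, tpr, _ = roc_curve(y_obs, p_hat)
    return np.max(tpr - fpr)

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=200)
p = np.clip(0.7 * y + 0.3 * rng.random(200), 0, 1)  # well-separated toy scores

auc = roc_auc_score(y, p)
```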

Simulated covariates
Ten covariates were simulated using the RandomFields package (Schlather et al. 2020) in R over a 100 × 100 cell grid. Each simulated covariate, X_n, was assumed to be a multivariate normal process (MVN) (Eq. 3) with stationary isotropic Matérn covariance, Σ_Xn, based on the distance r between 2 points (Eq. 4) (following Thorson et al. 2017), where σ²_Σ controls the pointwise variance, κ_Xn controls the geostatistical range of correlations (Eq. 5), ν controls the smoothness of the covariance matrix (we assumed ν = 1), and K_ν is the Bessel function. Variance parameters, σ²_Σ, were fixed at 0.0001 and scale parameters, l, were drawn from a uniform distribution (Eq. 6). Each scale parameter draw was used to simulate the Matérn covariance function over the 100 × 100 grid. We assumed all simulated covariates were moderately correlated, and the correlations between covariates, ρ_Xn,Xn, were drawn from a beta distribution (Eq. 7). The standard deviation of each covariate, σ_Xn, was determined from the variation of the simulated spatial field X_n and used to convert the correlations between covariates, ρ_Xn,Xn, to a covariance matrix (Eq. 8). The means of each covariate, μ_Xn, were drawn from a standard normal distribution (Eq. 9) and, along with the covariance matrix Σ_X,X, were used to draw coefficients for the interactions between environmental covariates from a multivariate normal distribution (Eq. 10). A probability of presence field, p_1, was calculated following a linear combination of environmental covariates, X_n (Eq. 11).
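A scaled-down sketch of simulating one such covariate is shown below, using Python with scipy in place of R's RandomFields. The grid size and parameter values (σ² = 1, κ = 0.3) are illustrative, not the study's values, and only the ν = 1 Matérn form described in the text is implemented.

```python
# Sketch: one Gaussian random field with a Matern (nu = 1) covariance,
# C(r) = sigma2 * 2^(1-nu)/Gamma(nu) * (kappa*r)^nu * K_nu(kappa*r),
# on a small grid (100 x 100 in the paper; 20 x 20 here for speed).
import numpy as np
from scipy.special import kv, gamma
from scipy.spatial.distance import cdist

def matern_cov(r, sigma2=1.0, kappa=0.3, nu=1.0):
    r = np.asarray(r, dtype=float)
    rp = np.where(r == 0, 1.0, r)  # dummy value; r = 0 handled separately
    term = sigma2 * (2 ** (1 - nu) / gamma(nu)) * (kappa * rp) ** nu * kv(nu, kappa * rp)
    return np.where(r == 0, sigma2, term)  # C(0) = sigma2

n = 20
xy = np.array([(i, j) for i in range(n) for j in range(n)], dtype=float)
Sigma = matern_cov(cdist(xy, xy))
rng = np.random.default_rng(4)
# Small jitter keeps the covariance numerically positive definite
field = rng.multivariate_normal(np.zeros(n * n),
                                Sigma + 1e-8 * np.eye(n * n)).reshape(n, n)
```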
To generate point patterns from the probability of presence maps, a random cell was drawn and subsequent detections in that cell were assumed to have process and observation components (Royle & Kéry 2007) (Eqs. 12 & 13, respectively).
where ŷ_n is the true presence, y_n is the observed presence, δ is the background detection probability, and δ × p is the effective detection probability, which is always ≤ δ.
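The two-stage process and observation structure can be sketched as follows (parameter values are illustrative): true presence is a Bernoulli draw with probability p, and an observed presence additionally requires detection with probability δ, so observations occur at rate δ × p ≤ δ.

```python
# Sketch of the process (Eq. 12) and observation (Eq. 13) components.
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
p = rng.random(n) * 0.8          # probability of presence at sampled cells
delta = 0.05                     # background detection probability

y_true = rng.random(n) < p                # process component: true presence
y_obs = y_true & (rng.random(n) < delta)  # observation component: detection

# The empirical observation rate is close to delta * mean(p)
rate = y_obs.mean()
```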

Titration simulation
Point patterns with 50 000 samples were generated from p_1 (Eqs. 12 & 13) across a suite of detection probabilities ranging from ultra-rare events to an equal number of presences and absences (0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5). For each detection probability, 25 point patterns were generated (resulting in 225 point patterns) and 10% of the full dataset was held out from all models for external validation. The standard RF, an RF with down-sampling (RF-DS), an RF with the synthetic minority over-sampling technique (RF-SMOTE), and an ERF were fit to each point pattern. The standard RF was implemented with a stratified random sample (90% training, 10% test) from the full dataset (minus the external validation set) and no internal resampling. The RF-DS was implemented with the same stratified random sampling from the full dataset and internal down-sampling (i.e. balancing the subset passed to each decision tree). The RF-SMOTE was implemented in the same manner as the RF-DS but after generating pseudo-replicates of the minority class using a single nearest neighbor following Stock et al. (2020). The number of pseudo-replicates generated was equal to the minority class size. The ERF model was implemented with 100 forests in the ensemble, and each RF was implemented in the same manner as the RF-DS approach. For each model and detection probability combination, model performance metrics (i.e. AUC, RMSE, and TSS) were calculated using the external holdout dataset as well as the whole dataset to evaluate changes in model performance as a function of the degree of class imbalance. For the ERF model, performance metrics were assessed on the ensemble predictions (i.e. the mean across RF predictions within the ensemble). Uncertainty in performance metrics for each model and detection probability combination was assessed across the 25 point pattern samples by calculating the 95% confidence intervals.

Spatial contrast simulation
Two probability of presence fields, p_2 and p_3, were calculated from p_1 by rescaling to shrink probabilities towards 0.5, with ranges 0.25−0.75 and 0.45−0.55, respectively (see Fig. S1 in the Supplement at www.int-res.com/articles/suppl/n043p183_supp.pdf). Rescaling and shrinking the range of the probability of presence decreased the degree of clustering (Fig. S1, right panel) while preserving the relative autocorrelation in the environmental covariates (Fig. S1, left panel). We assumed the detection probability for the point pattern process, δ, was 2%, or a 50:1 chance of detection (i.e. a semi-rare event). For each probability of presence map, a point pattern was generated using this detection probability (Eqs. 12 & 13), 10% of the presences and absences were held out completely from any model fitting, and an ERF was fit to the remaining 90% of the point pattern data, from which 90/10% train/test sets were generated for each RF in the ensemble. Each RF in the ensemble made predictions on the individual test sets, the overall test set, the total point pattern, and the gridded covariates used to generate the probability of presence maps, p_x. Model performance metrics (i.e. AUC, RMSE, and TSS) were calculated to evaluate changes in model performance as a function of the degree of spatial clustering of detections for the test sets generated for each RF, the held-out test set that no RFs in the ensemble trained on, and the overall point pattern. Gridded residuals for each individual RF in the ensemble were calculated as the difference between individual RF predictions over the grid and the ensemble predictions (mean across RFs). The gridded coefficient of variation was calculated as the standard deviation of the individual RF predictions divided by the ensemble prediction in a grid cell.
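The rescaling toward 0.5 and the gridded coefficient of variation might be sketched as below. The fields and the mock "ensemble" of noisy predictions are purely illustrative stand-ins for the simulated p fields and the per-RF prediction grids.

```python
# Sketch: shrink a probability field toward 0.5 (as for p2 and p3), and
# compute the per-cell coefficient of variation across ensemble members.
import numpy as np

def rescale(p, lo, hi):
    """Linearly map p from [p.min(), p.max()] into [lo, hi]."""
    return lo + (p - p.min()) * (hi - lo) / (p.max() - p.min())

rng = np.random.default_rng(6)
p1 = rng.random((50, 50))            # stand-in for the simulated p1 field
p2 = rescale(p1, 0.25, 0.75)
p3 = rescale(p1, 0.45, 0.55)

# Mock ensemble: 20 noisy "RF" prediction grids around p1
preds = np.stack([np.clip(p1 + 0.05 * rng.normal(size=p1.shape), 0, 1)
                  for _ in range(20)])
ens = preds.mean(axis=0)                       # ensemble prediction per cell
cv = preds.std(axis=0) / np.maximum(ens, 1e-12)  # coefficient of variation
```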

Fisheries-dependent data
Fisheries-dependent data from the Hawaii-based deep-set longline fishery targeting bigeye tuna Thunnus obesus were used to demonstrate the functionality of ERF. Data for 2005−2017 were provided by the National Oceanic and Atmospheric Administration (NOAA) Fisheries' National Observer Program (NOP). The NOP conducts independent, at-sea data collection for commercial fishing and processing vessels (Allen & Gough 2007). During fishing trips, observers record information about catch by species, the location of fishing effort, and the rigging of fishing gear. Observers currently accompany a target of 20% of deep-set trips, which are selected quasi-randomly. Of the 2 pelagic longline fisheries in Hawaii, the deep-set fishery comprises approximately 96−99% of the total trips (NMFS 2017). Typically, the deep-set fishery sets gear at ~230 m depth and fishes an area roughly 10−35°N and 180−135°W (Bigelow et al. 2006). In 2017, a 10 yr peak was reached in effort, with 145 active vessels making 1539 trips and setting 19 647 longlines, roughly 13 sets per trip (WPRFMC 2018).
From the NOP-provided data, the beginning and end coordinates where the longline was set, as well as the beginning and end coordinates where the longline was hauled, were extracted. These 4 coordinates were converted into a polygon for each set using the sp package (Pebesma & Bivand 2005) in R, and the average across the polygon was extracted for each covariate. All NOP sets from 2005−2017 were used, and datasets were generated for wahoo, giant manta ray, scalloped hammerhead, and false killer whale by dummy coding all sets without a given species as absences and sets with a given species as presences. Multiple individuals in a set were ignored, such that any set with at least one individual of a given species was treated as a presence record.

Covariates
To estimate bycatch distribution models from the NOP-provided data, we extracted environmental covariates at multiple spatial and temporal scales (Wiens 1989) (Table S1). Additionally, the RF algorithm can provide metrics of variable importance, and we were interested in determining the variability of the variable importance rank across the ensemble of RFs. The environmental covariates we extracted can be largely split into 4 groups: (1) static (e.g. bathymetry, distance to shore, and distance to seamounts); (2) time dynamic (i.e. lunar phase and El Niño Southern Oscillation); (3) surficial spatiotemporally dynamic (e.g. sea surface temperature [SST], chlorophyll a [chl a], sea level anomaly, and blended sea winds); and (4) subsurface spatiotemporally dynamic (e.g. mixed layer depth, temperature at the mixing layer, and depth to the dissolved oxygen [DO] minimum). Gear covariates (e.g. set length, float length, number of floats, and bait type) were not included as model covariates, nor were set locations. These covariates were omitted to focus on the environmental drivers of bycatch interactions. Most covariates were used directly in the model; however, a subset of covariates required additional processing or derivation from existing covariates. Both SST and chl a were extracted as level-3 data products, meaning that cloud interference resulted in occasional data gaps. Fixed rank kriging, implemented using the FRK package (Zammit-Mangion 2018) in R, was used to interpolate the data layer and create a cloud-free product. Frontal structures and the gradient across identified structures were calculated from cloud-free products using the grec package (Lau-Medrano 2018) in R. Geostrophic current and wind divergence and vorticity were calculated from the N−S and E−W current speeds of adjacent cells based on the equations shown in Fig. S2.
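The divergence and vorticity derivations from gridded N−S (v) and E−W (u) components can be sketched with centered differences. This is an illustrative numpy version, not the study's Fig. S2 implementation; grid spacing is assumed to be 1 rather than the true projected cell sizes, and a solid-body rotation field is included as a sanity check.

```python
# Sketch: divergence = du/dx + dv/dy, vorticity = dv/dx - du/dy,
# via centered finite differences on a regular grid.
import numpy as np

rng = np.random.default_rng(7)
u = rng.normal(size=(40, 40))  # E-W (zonal) component
v = rng.normal(size=(40, 40))  # N-S (meridional) component

# np.gradient returns derivatives along axis 0 (rows ~ y) then axis 1 (cols ~ x)
du_dy, du_dx = np.gradient(u)
dv_dy, dv_dx = np.gradient(v)

divergence = du_dx + dv_dy
vorticity = dv_dx - du_dy

# Sanity check: solid-body rotation u = -y, v = x has vorticity 2,
# divergence 0, everywhere
ygrid, xgrid = np.mgrid[0:10, 0:10].astype(float)
u0, v0 = -ygrid, xgrid
vort0 = np.gradient(v0)[1] - np.gradient(u0)[0]
div0 = np.gradient(u0)[1] + np.gradient(v0)[0]
```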

Titration case study
As a commonly caught species, wahoo were an ideal case study to titrate rare event bias into the dataset and monitor the effects of rare events on model performance and uncertainty. Using the wahoo dataset from the fisheries-dependent data and environmental covariates, we derived rare event sub-datasets. For each sub-dataset, a given number of presences was drawn randomly from the original dataset and combined with the full set of absences. Presence sample sizes were 15, 25, 35, 50, 75, 100, 150, 250, 500, 1000, 2500, 5000, and 10 000. Each rare event dataset was partitioned into 90/10% training/testing subsets and used to model the distribution of wahoo bycatch using ERF. All models were implemented in the same manner as outlined above (see Section 2.1.2), with 20 RFs comprising each ensemble to reduce the computational overhead. Model performance metrics (i.e. AUC, RMSE, and TSS) were assessed on the full dataset, while the standard deviation in AUC (a measure of inter-model variability) and the standard deviation in variable importance were assessed for individual RFs in the ensemble. For each possible presence sample size, 10 random draws were conducted and the results were averaged to reduce the effect of noise from a single random draw.

Spatial contrast case studies
From the NOP-provided dataset, datasets were constructed for giant manta ray, scalloped hammerhead, and false killer whale presence and absence. The RF, RF-DS, RF-SMOTE, and ERF models were fit to each dataset. Each ERF model had 100 RFs implemented as outlined in Section 2.1.2. As so few presences occurred, a new stratified random draw was made from the full dataset for each species, and model performance metrics (i.e. AUC, RMSE, and TSS) were evaluated for this new dataset as well as for the internal test datasets for all models. Probability of presence maps were generated for each species by making predictions on the full species' dataset and then averaging predictions over a hexagonal grid with cells roughly 100 km² in area. To protect the confidentiality of fishing locations, hexagonal grid cells with fewer than 3 vessels fishing the cell (279 of 3018 cells) were removed. Similarly, the presence points of each species were visualized by highlighting grid cells that contained presences rather than individual set locations. Ripley's K functions, a measure of spatial clustering, were calculated from the probability of presence maps using the spatstat package (Baddeley et al. 2015) in R.
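Ripley's K measures how many neighbors a typical point has within distance r relative to the study area. A naive estimator is sketched below (the edge corrections that spatstat applies are omitted for brevity, so values near the boundary are slightly underestimated); the points are illustrative complete-spatial-randomness (CSR) draws, for which K(r) ≈ πr².

```python
# Sketch of an edge-uncorrected Ripley's K estimator:
# K(r) = area / (n*(n-1)) * (number of ordered point pairs within r)
import numpy as np
from scipy.spatial.distance import pdist

def ripley_k(points, r, area):
    n = len(points)
    d = pdist(points)  # unordered pair distances; double for ordered pairs
    return np.array([area * 2 * np.sum(d <= ri) / (n * (n - 1)) for ri in r])

rng = np.random.default_rng(8)
pts = rng.random((500, 2))          # CSR points on the unit square
r = np.linspace(0.01, 0.1, 10)
k_hat = ripley_k(pts, r, area=1.0)  # under CSR, roughly pi * r^2
```

Clustered patterns sit above the CSR curve (more close pairs than expected), which is how the case-study species' clustering was compared against chance.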

Titration simulation
For each of the point patterns resulting from different detection probabilities, the spatial autocorrelation in the point pattern presences was similar (Fig. S3), mitigating the spatial autocorrelation effect on model performance. The ERF approach did as well as or slightly better than the RF-DS and RF-SMOTE approaches for the AUC, RMSE, and TSS performance metrics across the range of detection probabilities for the external holdout dataset (Fig. 2A−C) and the full dataset (Fig. 2D−F). The RF approach outperformed all other approaches for the RMSE performance metric for all detection probabilities that resulted in class imbalance (δ ≤ 0.25) (Fig. 2B,E). This is a result of the RF approach learning the absences better than the other tested models as a function of the declining number of presences, as exemplified by the co-occurring decline in the TSS statistic for the RF approach. By random chance, the RF-DS and RF-SMOTE approaches perform on par with ERF for ultra-rare events (δ = 0.001, or 1000:1 absences:presences), but on average ERFs perform better according to the AUC and TSS performance metrics (Fig. 2A,C,D,F). In general, uncertainty generated from different point pattern samples declined for the RF-DS, RF-SMOTE, and ERF approaches as class imbalance decreased. The same held for the RF approach, with the excep-

Spatial contrast simulation
The random fields simulation of spatial clustering (Fig. 3) yielded 10 simulated environmental covariates (Fig. 3A−J) and 3 probability of presence maps (Fig. 3K−M) with similar spatial autocorrelation (Moran's I between 0.26 and 0.27; Fig. S1), declining spatial clustering from p_1 to p_3 (Fig. S1, right panel), and declining contrast in the probability of presence (Figs. 3K−M & S1, left panel). Residuals between individual RFs (Fig. 3N−P) and the true probability of presence maps (Fig. 3K−M) highlight the variability in learning the majority and minority classes between individual RFs. By treating p_1 as the true probability of presence pattern and comparing p_1 to the ERF predictions (Fig. 3Q−S), the decline in model performance can be visually assessed. The ERF predictions (Fig. 3Q−S) resulted in the test AUC declining from p_1 to p_3 as a function of declining spatial clustering (test AUC ± SD: 0.74 ± 0.03, 0.64 ± 0.04, and 0.52 ± 0.04, respectively). Performance on the held-out test set was similar for AUC (overall test AUC: p_1 = 0.72, p_2 = 0.70, and p_3 = 0.51), RMSE (overall test RMSE: p_1 = 0.38, p_2 = 0.41, and p_3 = 0.42), and TSS (overall test TSS: p_1 = 0.37, p_2 = 0.25, and p_3 = 0.02). However, prediction uncertainty also increased from p_1 to p_3, indicating poorer model performance as spatial clustering declines. Comparing the ERF predictions (Fig. 3Q−S) to the true probability of presence pattern (Fig. 3K−M), the ability of the ensemble to smooth across the individual RF predictions (Fig. 3N−P) to generate a central tendency can be observed. As spatial clustering declines, the ERF predictions gradually fill in areas of low probability of presence. Despite this, the ERF still delineates areas of high and low probability of presence and inflates the difference between these areas (as shown by comparing Fig. 3L & Fig. 3R).
Differences in the test AUC do not equate to one prob ability of presence map being more 'true' than any other. Rather, declining AUC corresponds to the model's ability to discriminate between the true/false positives and negatives. As spatial clustering de clines, the ability of an individual RF to identify patterns within the data that match those in the environmental covariates can be expected to decline.

Titration case study
Titrating rare event bias (decreasing number of presences) into the NOP-provided deep-set longline data for wahoo (21 279 presences:26 076 absences) resulted in declining model performance in terms of raw metrics and increased metric uncertainty across RFs in the ensemble (Fig. 4). The median AUC derived from ROCs decreased as a function of increasing rare event bias (Fig. 4A). All presence sample sizes resulted in models with AUC values greater than 0.5 (better than random). RMSE was similar from 15 to 2500 presences and decreased as presences increased over 5000 (Fig. 4B). The median TSS decreased as a function of increasing rare event bias (Fig. 4C). The top 5 covariates (in order) were day of year, distance to current front, distance to chl a front, distance to SST front, and current divergence. The standard deviation in the variable importance metric increased exponentially with increasing rare event bias (Fig. 4D). All covariates had similar uncertainty in variable importance.

Spatial contrast case studies
Available presences for giant manta ray (n = 24) and scalloped hammerhead (n = 23) were ultra-rare (~1:2000 presence:absence), while false killer whale presences (n = 62) were considered rare (~1:750 presence:absence) in the 47 355 observed longline sets. The presence-absence point patterns of these 3 case study species (Fig. 5A−C) resulted in contrasting ERF predictions and, as expected, different areas of high probability of presence. Giant manta ray and scalloped hammerhead had a high probability of presence concentrated south of the main Hawaiian Islands (Fig. 5A,B). The probability of presence map for giant manta ray also had an area of high probability southwest of the main Hawaiian Islands near 12°N, 165°W (Fig. 5A). False killer whale probability of presence was dispersed, but an area north−northeast of the main Hawaiian Islands had higher probability than the surrounding areas (Fig. 5C). Test AUC declined from giant manta ray (0.78 ± 0.09, mean ± SD), to scalloped hammerhead (0.75 ± 0.11), to false killer whale (0.55 ± 0.09) (Fig. 5D−F). External model performance test metrics (Table 1) and internal test metrics (Table S2) showed ERFs performed better than the other models, with RF-SMOTE a close second, followed by RF-DS, and then RF. This follows the simulation spatial clustering results, as test AUC for each RF in the ensemble declined as a function of spatial clustering (Fig. 5G). However, the ensemble AUC was high across all 3 species, with false killer whale having the highest (0.999), followed by scalloped hammerhead (0.918), and then giant manta ray (0.916). Spatial clustering of giant manta ray and scalloped hammerhead was similar and greater than that of false killer whale, while all 3 species had greater clustering than expected by chance (Fig. 5G). RMSE was highest for false killer whale (0.401), then giant manta ray (0.384), and lowest for scalloped hammerhead (0.374).
As with AUC, TSS was high for all species but highest for false killer whale (0.999), then scalloped hammerhead (0.996), and then giant manta ray (0.987).

DISCUSSION
Ensemble Random Forests (ERF) is an intuitive method for reducing rare event bias: it generates stratified randomly sampled training/test sets and trains multiple RFs (Breiman 2001a,b, Cutler et al. 2007) with down-sampling (Kuhn & Johnson 2013) to produce an ensemble of 'strong learners'. Here, the performance of ERF as a function of rare event bias is demonstrated through simulation and case studies. In simulations, ERF performed better on average than 3 alternative RF approaches for rare events across highly class imbalanced to balanced datasets. However, the performance of individual RFs in the ensemble is related to the strength of spatial clustering in a given dataset. In the case studies, class imbalance affected the overall performance of the ensemble, while contrast in the presence-absence point patterns (Evans et al. 2011) negatively affected individual RF models but did not greatly affect the ensemble performance, similar to the simulations. Nonetheless, the ERF routine performs admirably even for very low presence sample sizes, given enough absences.
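The core routine just described can be outlined in a few lines. The following is an illustrative Python/scikit-learn sketch under simplifying assumptions (fixed 75/25 stratified splits, absences down-sampled to the presence count, default tree settings), not the authors' implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def ensemble_rf(X, y, n_models=25, seed=0):
    """Illustrative ERF: each member RF gets its own stratified train/test
    split and is trained on a down-sampled (1:1 presence:absence) set."""
    rng = np.random.default_rng(seed)
    prob = np.zeros((n_models, len(X)))  # per-member probability of presence
    for m in range(n_models):
        # (1) stratified random training/test split for this member
        train_idx, _ = train_test_split(
            np.arange(len(X)), test_size=0.25, stratify=y,
            random_state=int(rng.integers(1_000_000)))
        pres = train_idx[y[train_idx] == 1]
        absn = train_idx[y[train_idx] == 0]
        # (2) down-sample the majority class (absences) to the presence count
        absn = rng.choice(absn, size=len(pres), replace=False)
        balanced = np.concatenate([pres, absn])
        # (3) fit one RF on the balanced subset
        rf = RandomForestClassifier(
            n_estimators=100, random_state=int(rng.integers(1_000_000)))
        rf.fit(X[balanced], y[balanced])
        prob[m] = rf.predict_proba(X)[:, 1]
    # (4) ensemble mean prediction and between-member spread (uncertainty)
    return prob.mean(axis=0), prob.std(axis=0)
```

Averaging the member predictions yields the ensemble probability of presence, and the between-member standard deviation is what propagates prediction uncertainty forward, as discussed above.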
In the titration simulation, ERF performed better on average, in terms of overall performance, than the RF with a stratified training/test set, RF with down-sampling (RF-DS), and RF with the synthetic minority over-sampling technique (RF-SMOTE) for AUC and TSS. On average, ERF, RF-DS, and RF-SMOTE converged as class imbalance increased for all holdout performance metrics. Not surprisingly, the base RF approach performed the worst with respect to AUC and TSS and the best for RMSE. This performance reflects the influence of successive partitioning on the growth of decision trees, where fewer presences are retained as trees grow, resulting in the model training predominantly on absences (He & Garcia 2009). Effects from successive partitioning are most noticeable when δ ≤ 0.005, where the effects become so severe that multiple RFs trained on independent point pattern samples all perform similarly. Interestingly, the RF-DS and RF-SMOTE methods can perform on par with or, in the case of RF-SMOTE, better than ERF by random chance. This is similar to the conclusion of Kuhn & Johnson (2013) that the performance of different approaches to mitigating rare event bias can differ between datasets and that there is little consensus on a 'best' approach.

Fig. 4. For each of the wahoo case study titration sub-models (see Section 2.3.3), the median (line) and 95% confidence interval (gray shading) of (A) the area under the curve (AUC) model performance metric calculated using individual receiver operator characteristic curves, (B) the root mean square error (RMSE), and (C) the true skill statistic (TSS), as a function of the number of presences. (D) Standard deviation of the variable importance metric (mean decrease in accuracy) for the 5 most important covariates (determined across all titration sub-models), with warmer colors indicating higher average importance
However, ERF was more consistent in the AUC and TSS performance metrics than the RF-DS and RF-SMOTE approaches across independent point pattern samples over the whole range of class imbalances. Thus, ERF may be an option to mitigate rare event bias regardless of the dataset. Across a range of spatial clustering in presences, ERF overall performance actually increased for AUC and TSS as spatial clustering declined. However, the test performance of individual RFs within the ensemble and the overall test performance of the ERF declined. This indicates that ERF tends to overfit as a function of declining spatial clustering (the difference between the overall performance metric and the test performance metric increases). Poor test performance is not unanticipated, as there is little to distinguish an absence from a presence with regard to the true probability of presence in the p3 probability of presence maps, and it would be surprising if models performed well in these circumstances. Thus, caution should be exercised in interpreting performance metrics generated for the ensemble predictions, as they tend to inflate the ability of the model to predict to test sets.
In the titration of rare event bias into the wahoo dataset, models performed better than random according to AUC for all presence sample sizes. Uncertainty in performance metrics increased as sample sizes fell below 100 presences (250:1 absences:presences). As expected, as presence sample sizes declined, the uncertainty in variable importance increased. This should advise users of ERF, and realistically of any model with rare event data, to interpret variable importance metrics with caution, as the ranking changes considerably between individual RFs in the ensemble. ERFs for giant manta ray and scalloped hammerhead had greater test model performance than the ERF for false killer whale, even with only 24 and 23 presences, respectively. This highlights the tradeoff between spatial covariation and sample size: fewer samples are needed for individual RFs to discern patterns as covariation increases. However, as demonstrated by the spatial contrast simulation results, the high performance of the ensemble predictions can be misleading, and mean test performance metrics should be used in interpreting the performance of ERF in the case studies.
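The between-RF spread in variable importance can be summarized directly from the ensemble members. A simplified sketch follows, using scikit-learn's impurity-based `feature_importances_` as a stand-in for the mean-decrease-in-accuracy metric used in this study, and plain bootstrap resampling in place of the ERF's stratified splits:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def importance_spread(X, y, n_models=10, seed=0):
    """Mean and SD of variable importance across ensemble members,
    mirroring the between-RF importance uncertainty discussed above."""
    rng = np.random.default_rng(seed)
    imp = []
    for _ in range(n_models):
        # resample rows so each member RF sees a different sample
        idx = rng.integers(0, len(X), len(X))
        rf = RandomForestClassifier(
            n_estimators=50, random_state=int(rng.integers(1_000_000)))
        rf.fit(X[idx], y[idx])
        imp.append(rf.feature_importances_)
    imp = np.array(imp)                   # (n_models, n_features)
    return imp.mean(axis=0), imp.std(axis=0)
```

A large standard deviation relative to the mean flags covariates whose importance ranking is unstable between members, which is exactly the situation that worsens as presence sample sizes shrink.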
There are 4 distinct advantages of ERF for rare events: (1) no individual RF trains on the whole dataset, but the ERF does 'see' the whole dataset, reducing the effects of successive partitioning (e.g. there is minimal data loss from setting aside test sets) (He & Garcia 2009); (2) uncertainty in the RF algorithm can be propagated forward; (3) a balanced weighting scheme allows for better accounting of interacting covariates when growing the decision trees (Kuhn & Johnson 2013); and (4) the propagation of false positives (absences predicted as presences) from rebalancing the dataset to a balanced weighting scheme is minimized in the ensemble, as the majority of RFs will vote correctly on absences (He & Garcia 2009). These advantages allow further gains over the base RF implementation for small sample sizes. In the case of our applied example, the Hawaii-based deep-set longline fishery, 78% of species caught by the gear had at least a 10:1 absence to presence ratio, 62% had at least 100:1, and 44% had at least 1000:1. These bycatch occurrences are fairly typical in a specialized fishery with specific target species. Fisheries of this nature occur globally, drastically increasing the utility of ERF in directly analogous systems. As sample size increases, the performance of ERF converges towards the base RF method. Similarly, the uncertainty between RFs declines as sample size increases, reducing the necessity of incorporating inter-model variability. Overall, ERF is likely to be a useful tool for generating probability of presence maps (relative to the sampling regime) for any species, as it propagates uncertainty in model fit and includes interactions between covariates. However, the highest model gains are likely to occur with rarely occurring species.
There are disadvantages to ERF relative to other statistical approaches. Linear models offer an explicit ease of interpretation that is lacking in 'black box' machine-learning approaches (Breiman 2001b, Olden et al. 2008). Relative to RF or the ERF implementation presented here, the commonly used maximum entropy species distribution models have an advantage: total sample size (presences and absences) can be considerably smaller, as these models rely only on presence data points and a background sample (from the environmental covariates) (Phillips et al. 2006). Despite high model performance with ultra-rare presences, each of our case studies had the advantage of a plethora of absence records (presences + absences = 47 355 records). This difference currently limits the application of ERF to datasets with known absences.
In the marine environment, where industry-based and related sampling regimes are common, many conservation priority species have both presences and absences. In terrestrial systems, ERF is likely to be most applicable to modeling rare species with extremely few occurrences within a larger sampling scheme (e.g. quadrat-based sampling), due to its reliance on informative absences. RFs have been used to classify presence-only data and performed better than or on par with MaxEnt (Williams et al. 2009). It stands to reason that ERF may perform similarly on presence-only datasets; however, testing is necessary to demonstrate such capabilities. Broadly, there is a pressing need to evaluate the performance of species distribution models using simulation and tools like RandomFields, which can greatly ease the computational overhead in generating semi-realistic covariates. Future avenues for ERF include extension to presence-only datasets as well as management strategy evaluations on the inclusion of species distribution models in conservation planning.
Rare event bias is a difficult statistical property of many datasets and is challenging to model appropriately, often requiring large sample sizes of the rare event or conditioning on linear responses (He & Garcia 2009). For many species, rare occurrences prohibit the former, and the use of proxy or indirect covariates often necessitates interacting covariates or non-linear responses. ERF is a useful tool to reduce biases resulting from the imbalanced data problem in machine-learning algorithms. Additionally, ERF reduces the reliance on the high number of positives needed in GLMs, GAMs, or other mixture models, and incorporates interacting and non-linear covariate responses (even though we used linear effects for simplicity in our simulation). The application of ERF to rare event presence-absence data is a straightforward bagging of a down-sampled RF routine and mostly involves capturing the individual RF predictions and collating them to generate inter-model uncertainty and model averages. For the rarely occurring, ESA-listed species in the Hawaii-based deep-set longline dataset, application of ERF generated datum-specific and spatially averaged predictions of interactions with the fishery gear. These maps have direct utility in evaluating threats to species across their ranges and in the rebuilding plans mandated by the Endangered Species Act. Within the datasets of similar fisheries and other large surveys around the world, there are likely many species for which the application of ERF could be informative in assessing their spatial distribution and determining the influence of the environment in driving ecological patterns.
Data availability. Due to the confidential nature of the NOAA NOP Hawaii-based deep-set longline fishery observer data, data for the case studies are solely available upon request from the NOAA NOP. All code for the simulation has been provided at https://osf.io/q9wfn/