A machine-learning approach to assign species to ‘unidentified’ entangled whales

Whale entanglements in US west coast fishing gear are largely represented by opportunistic sightings, and some reports lack species identifications due to rough seas, distance from whales, or a lack of cetacean identification expertise. Unidentified entanglements are often ignored in species risk assessments, and thus entanglement risk is underestimated. To address this negative bias, a species identification model was built from random forest (RF) classification trees using 199 identified entanglements (‘model data’). Humpback Megaptera novaeangliae and gray whales Eschrichtius robustus represented 92% of identified entanglements; the remaining 8% were minke whales Balaenoptera acutorostrata, fin whales B. physalus, blue whales B. musculus, and sperm whales Physeter macrocephalus. Predictor variables included year, gear type, location, season, sea surface temperature, water depth, and a multivariate El Niño index. Cross-validated species classifications were correct in 78% (155/199) of cases, significantly higher (p < 0.001, permutation test) than the 49% correct classification rate expected by chance. The RF model correctly classified 91% of humpback whale cases, 64% of gray whale cases, and 100% of sperm whale cases, but misclassified all minke, blue, and fin whale cases. The cross-validated RF classification-tree species model was used to classify 35 entanglements without species identifications (‘novel data’), and each case was assigned a probability of belonging to each of the 6 model data species. This approach eliminates the negative bias associated with ignoring unidentified entanglements in species risk assessments. Applications to other wildlife studies where some detections are unidentified include fisheries bycatch, line-transect surveys, and large-whale vessel strikes.


INTRODUCTION
Entanglement of large whales in fishing gear and marine debris is a source of anthropogenic mortality and serious injury worldwide (Read et al. 2006, Bradford et al. 2009, Cassoff et al. 2011, Meÿer et al. 2011, Groom & Coughran 2012, Knowlton et al. 2012, Moore 2014, van der Hoop et al. 2017). Documented entanglements represent a minimum accounting of impacts, because not all at-sea entanglements are detected; either the whale is never seen or observers fail to recognize that a whale is entangled. Negative reporting biases are not limited to at-sea sightings. Beach-stranded carcasses may go undetected along remote coastlines, or detected carcasses may lack visible evidence of entanglement due to decomposition and thus are not categorized as anthropogenic mortality. Studies of recovery rates of cetacean carcasses suggest that observed levels of anthropogenic mortality and injury grossly underestimate actual levels (Knowlton & Kraus 2001, Kraus et al. 2005, Williams et al. 2011), even for extremely coastal species (Prado et al. 2013, Wells et al. 2015, Carretta et al. 2016a).
Compounding the problem of incomplete detection is that not all at-sea sightings of entanglements are identified to species. Approximately 15% of US west coast whale entanglement cases lack species identifications due to rough seas, observer distance to whales, or a lack of whale identification expertise (Carretta et al. 2016b). Quantitative methods to prorate unidentified cases to species are lacking in US marine mammal stock assessments (Muto et al. 2016, Waring et al. 2016, Carretta et al. 2017); thus the perceived entanglement risk to some species is negatively biased via omission of these cases. To better account for entanglement risk, I developed a species classification model using random forest (RF) classification trees (Breiman 2001a,b, Liaw & Wiener 2002), which are used to classify unidentified sightings of entangled whales to species.

Data and model overview
Data on large-whale entanglements are compiled by the National Oceanic and Atmospheric Administration (NOAA) through regional marine mammal stranding networks and disentanglement teams (Carretta et al. 2016b). Reports and sightings are verified with photos and/or video when possible, but many records are opportunistically reported; thus species identification and the type of fishing gear involved in entanglements are sometimes based on first-hand accounts. Only entanglement records with photo/video documentation or those received from reporting parties considered reliable (i.e. whale-watching companies, researchers, members of the public who sufficiently describe the entanglement and species involved) are included in an entanglement database of known-species identifications (hereafter referred to as 'model data'). Records lacking supporting species identification evidence are categorized as 'unidentified whale' cases (hereafter referred to as 'novel data'). All sighting locations for model data and novel data entanglements are shown in Fig. 1.
The RF machine-learning method, using classification trees (Breiman 2001a,b, Liaw & Wiener 2002), was used to evaluate if known species entanglements (model data) could be accurately classified to species via cross-validation. Once an accurate species ID classifier is created, it is used to classify unidentified large-whale entanglements (novel data) to species. Variables in the RF model included geographic location, season, observation year, sea surface temperature (SST), water depth, an annual multivariate El Niño index, and the type of fishing gear (Table 1). All analyses were performed in the R programming environment, version 3.2.3 (R Development Core Team 2017), using the R package randomForest, version 4.6-12 (Liaw & Wiener 2002). Known-species entanglements (n = 199) documented from 2007 to 2016 (Carretta et al. 2013, 2014, 2015, 2016b) served as model data, from which a classification tree RF was generated. For 35 entanglement cases lacking a species identification (novel data), the variable characteristics associated with these cases were used to classify the species from constructed RF model trees. Details on the RF model parameters and variables used to construct it are summarized below and in Table 1.
Most documented large-whale entanglements along the US west coast result from fishing gear interactions that include pot/trap fisheries, gillnets, marine debris, and unidentified fisheries (Carretta et al. 2016b). The ability to identify entanglement sources depends upon the level of detail provided by reporting parties and opportunities for whale disentanglement teams to approach the animals. Some entanglement cases are linked to specific fisheries (e.g. 'California Dungeness crab pot'), based on identification of permit tag numbers on buoys associated with the entangled line. Most cases can only be categorized to a generic entanglement category such as 'pot/trap fishery', 'gillnet', or 'unknown fishery interaction' due to the opportunistic nature of reporting and lack of recovered gear (Carretta et al. 2016b). For the purposes of creating an entanglement species model, I treated the variable Interaction.Type as a categorical variable, with values limited to the generic categories 'pot/trap fishery', 'net fishery', and 'unknown fishery interaction'.
Latitude ('LAT'), longitude ('LON'), and water depth ('Depth')
Specific latitude and longitude coordinates of entangled whales were used when available, but such locations were not always recorded because reports and narratives reflect opportunistic sightings (e.g. 'entangled whale seen halfway between Catalina Island and mainland'). In those cases where entanglement narratives lacked latitude and longitude coordinates, there was enough information (e.g. '3 miles [5 km] offshore of Point Loma, San Diego') to infer approximate locations and assign latitude/longitude coordinates. Water depth (in meters) was interpolated for latitude and longitude point data using a geographic information system (GIS) with a world ocean depth raster in ArcGIS software, version 10.4.1. Some depths were assigned a value of zero because they involved entanglements extremely close to shore or beach-stranded animals where GIS water depth interpolations resulted in positive values above sea level.
'SST'
SST data were obtained for each entanglement record from archived data at NOAA's National Data Buoy Center (www.ndbc.noaa.gov/obs.shtml). SST data were obtained from the nearest buoy location to the entanglement and were based on the noon-time temperature for that day.
'Year' and 'MEI.mean'
The calendar year ('Year') of the observed entanglement was included as a categorical RF model variable. In addition to Year, a multivariate El Niño index variable was included to serve as a measure of the broad-scale oceanographic conditions along the US west coast in a given year. The 'MEI.mean' was calculated for each calendar year as the annual mean of 12 bimonthly values (one for each overlapping 2-month period) obtained from NOAA's Earth System Research Laboratory (NOAA 2017).
Season ('Day.of.Year')
The seasonality of large-whale entanglements varies by species; thus the sequential calendar day of the year (Day.of.Year) was included as a candidate variable. Simultaneous use of both month and Day.of.Year variables is not recommended, as they are highly correlated, which can negatively impact classification accuracy of RF models (Strobl et al. 2008).

RF model construction and cross-validation
The RF model consists of classification trees, since the response is 'Species', a category to be classified. Classification trees are recursive partitioning algorithms. Random subsets of variables (default = √n, where n is the number of variables) are selected at each tree node, and the variable that results in the greatest variance reduction of the response is used to split the data into successive daughter nodes. Such variable splits continue until all observations in each terminal node contain the same response variable value or the terminal nodes each contain only a single sample. Each classification tree is built from a bootstrap sample of model data entanglements, and those model data omitted from construction of individual RF trees are referred to as 'out-of-bag' (OOB) data. Due to bootstrap sampling with replacement, OOB data represent approximately 1/3 of all data (Efron & Tibshirani 1997). Evaluation of the RF model is based on classification accuracy, i.e. how often cross-validated OOB model data are correctly classified. The OOB data are introduced to constructed RF trees and species classifications are made for all OOB data, based on variable characteristics of the OOB data. The number of RF trees (n = 500 in this study) is based on the approximate number of RF trees required to return an asymptotic OOB error rate. Cross-validated species classifications for OOB data are summarized as a confusion matrix that includes the number of correctly and incorrectly classified cases by species (see Table 2). Only species for which there were at least 2 documented entanglements were included in the analysis, due to the need for model data cross-validation for each species.
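The 'approximately 1/3' OOB fraction follows from bootstrap sampling: the chance that a given case is never drawn in n draws with replacement is (1 − 1/n)^n, which approaches e^−1 ≈ 0.368. The short Python sketch below is an illustration only (the analysis itself was performed in R with randomForest); the function name oob_fraction is hypothetical, and it simulates 500 bootstrap samples of 199 cases to compare against the closed-form value.

```python
import random


def oob_fraction(n_cases, n_trees, seed=0):
    """Mean fraction of cases left out of a bootstrap sample (hypothetical helper)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # Indices drawn with replacement; the set keeps the distinct in-bag cases.
        in_bag = {rng.randrange(n_cases) for _ in range(n_cases)}
        total += (n_cases - len(in_bag)) / n_cases
    return total / n_trees


# Closed-form expectation for n cases: (1 - 1/n)**n, close to exp(-1).
expected = (1 - 1 / 199) ** 199
simulated = oob_fraction(n_cases=199, n_trees=500)
print(round(expected, 3), round(simulated, 3))
```

Both values should fall near 0.37, i.e. slightly more than one-third of the model data are out-of-bag for any single tree.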
I optimized the RF model by exhaustively searching for the number and combination of variables that maximized OOB correct classification rates for a RF of 500 trees. This strategy was implemented by randomly selecting subsets of all 8 candidate variables, ranging from 2 (the minimum required) to all 8 variables, and recording the OOB correct classification rate for each variable combination. The OOB correct classification rates for the optimized RF model were compared to correct classification rates expected by chance when all model data cases are randomly assigned a species in proportions equal to the observed entanglements (i.e. permutation of the response variable 'Species'). This was done 1000 times to generate a null distribution of correct classification rates. The 1-tailed probability of observing the correct classification rate from OOB model data was calculated as the observed fraction of null distribution correct classification rates greater than or equal to the observed correct classification rate (Fig. 2).
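The chance-level comparison can be sketched in a few lines of Python. This is a simplified illustration, not the R code used in the study: it assumes the null distribution is obtained by directly shuffling the species labels, and it uses the species counts reported in the Results (2 minke, 5 blue, 5 fin, 59 gray, 126 humpback, 2 sperm). The names null_rates and expected_chance are hypothetical.

```python
import random

# Known-species counts reported in the Results (total = 199).
counts = {"minke": 2, "blue": 5, "fin": 5, "gray": 59, "humpback": 126, "sperm": 2}
labels = [species for species, n in counts.items() for _ in range(n)]
n_cases = len(labels)

# Expected rate when labels are assigned in observed proportions:
# the sum of squared species proportions.
expected_chance = sum(n * n for n in counts.values()) / n_cases ** 2


def null_rates(true_labels, n_perm=1000, seed=1):
    """Correct classification rates under random permutation of the response."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_perm):
        shuffled = true_labels[:]
        rng.shuffle(shuffled)
        correct = sum(a == b for a, b in zip(true_labels, shuffled))
        rates.append(correct / len(true_labels))
    return rates


null = null_rates(labels)
observed = 155 / 199  # OOB correct classification rate of the RF model
# One-tailed p-value: fraction of null rates >= the observed rate.
p_value = sum(rate >= observed for rate in null) / len(null)
print(round(expected_chance, 2), p_value)
```

With 1000 permutations, a fraction of zero is reported as p < 0.001, and the sum of squared proportions reproduces the 49% chance rate quoted in the text.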
RF offers many tuning parameters for model evaluation. The major ones are: maximum tree depth, number of variables tested at each node, and number of forest trees. These parameters were assessed during model-building and the RF model that was used in this study ultimately included trees grown to full extent and the default number of variables considered for splitting at each node, or √n variables.
Variable importance for the optimized RF model was assessed by permuting variables individually, running a RF model with the permuted variable, and comparing OOB correct classification rates between the RF model run with and without permutation. Negligible declines in classification accuracy with permutation indicate that a given variable is no more important than random noise in predicting species identifications. Conversely, a large decline in classification accuracy indicates that the permuted variable is informative. Variable importance was quantified as a 'permutation cost', which is equal to the number of additional OOB entanglement cases misclassified when a given variable was permuted.
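The idea of a permutation cost can be illustrated with a deliberately simplified Python sketch. Two assumptions differ from the study: a fixed stand-in classifier replaces the random forest (the study re-ran the RF with each permuted variable), and the data are synthetic. The names permutation_cost, correct_count, and stand_in_classifier are hypothetical.

```python
import random


def correct_count(predict, rows, labels):
    """Number of rows whose predicted label matches the known label."""
    return sum(predict(row) == label for row, label in zip(rows, labels))


def permutation_cost(predict, rows, labels, column, seed=0):
    """Additional misclassified cases after shuffling one column of the data."""
    rng = random.Random(seed)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted = [row[:column] + (value,) + row[column + 1:]
                for row, value in zip(rows, shuffled)]
    return correct_count(predict, rows, labels) - correct_count(predict, permuted, labels)


# Synthetic data: column 0 (a scaled day of year) determines the species,
# column 1 is unrelated noise.
rows = [(i / 40, ((i * 7) % 40) / 40) for i in range(40)]
labels = ["gray" if day < 0.5 else "humpback" for day, _ in rows]


def stand_in_classifier(row):
    # Stand-in for a fitted model; it uses only column 0.
    return "gray" if row[0] < 0.5 else "humpback"


cost_day = permutation_cost(stand_in_classifier, rows, labels, column=0)
cost_noise = permutation_cost(stand_in_classifier, rows, labels, column=1)
print(cost_day, cost_noise)
```

Permuting the informative column produces a positive cost, whereas permuting the noise column costs nothing, which mirrors the interpretation of large versus negligible declines described above.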

Application of RF model to novel data
The RF model with the lowest OOB classification error rate was applied to 35 novel data entanglement cases lacking a species identification. For each novel data case, a species assignment is generated, based on the consensus predictions of all 500 RF trees (also referred to as the plurality vote; Svetnik et al. 2003). For each novel data case, the number of trees that classify a given species varies from a minimum of zero to the number of trees in the RF. The distribution of species classifications over all 500 RF trees is analogous to a species probability assignment. For example, a RF of 500 trees constructed from model data consisting of 6 species, when applied to a novel data case where the species is unknown, might yield the following classifications: 'Species.1' = 300 trees; 'Species.2' = 100 trees; 'Species.3' = 50 trees; 'Species.4' = 25 trees; 'Species.5' = 25 trees; and 'Species.6' = zero trees. The assigned species in this novel data example is 'Species.1' (300/500 trees = plurality vote) and the probabilities of assignment to Species 1−6 are 0.60, 0.20, 0.10, 0.05, 0.05, and 0, respectively.
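The worked example above can be reproduced with a minimal Python sketch that converts per-tree votes into assignment probabilities and a plurality-vote species. The function name vote_probabilities is hypothetical; in R, comparable vote fractions are available from the randomForest predict method.

```python
from collections import Counter


def vote_probabilities(tree_votes, species):
    """Return the plurality-vote species and the fraction of trees voting for each species."""
    tally = Counter(tree_votes)
    n_trees = len(tree_votes)
    # Species that receive no votes are kept with probability zero.
    probs = {sp: tally.get(sp, 0) / n_trees for sp in species}
    assigned = max(probs, key=probs.get)
    return assigned, probs


species = ["Species.%d" % i for i in range(1, 7)]
# Votes from the 500-tree example in the text.
votes = (["Species.1"] * 300 + ["Species.2"] * 100 + ["Species.3"] * 50
         + ["Species.4"] * 25 + ["Species.5"] * 25)

assigned, probs = vote_probabilities(votes, species)
print(assigned, probs)
```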

RESULTS
The RF model that minimized model data OOB error rates included 5 of 8 variables evaluated (Day.of.Year + Interaction.Type + LAT + SST + Year) and correctly classified the species in 78% (155/199) of model data cases (Tables 2 & 3). This correct classification rate for all 6 species combined was significantly higher (p < 0.001, permutation test) than the rate expected by chance (49%). Humpback whale cases were classified correctly as Megaptera novaeangliae 91% of the time, which was significantly higher (p < 0.001) than the 63% rate expected by chance. Correct classification (as Eschrichtius robustus) of gray whale cases (64%) was significantly higher (p < 0.001) than the 29% rate expected by chance. None of the minke (n = 2), blue (n = 5), or fin whale (n = 5) model data cases were correctly classified to species (Balaenoptera acutorostrata, B. musculus, and B. physalus, respectively). Poor classification accuracy for minke, blue, and fin whale cases is not unexpected, given that these species collectively represent only 6% of all model data cases and share many of the same variable attributes as humpback and gray whale records. Both sperm whale cases were correctly classified to species Physeter macrocephalus (see 'Discussion'). The 3 most important variables in the RF model were Day.of.Year + Interaction.Type + LAT, based on a comparison of correct classification rates of OOB model data using intact versus permuted versions of each variable (Table 3). Day.of.Year was the most important variable, based on 15 additional misclassified OOB cases when this variable was permuted. The next most important variable was Interaction.Type, with 12 additional misclassified cases, followed by LAT, with 9 additional misclassified cases.
Differences in the documented types of fishing gear entangling humpback and gray whales were evident (Table 4). Entanglements in net fishery gear were relatively rare for humpback whales (7/126 = 5%), compared with gray whales (14/59 = 24%). Entanglements in pot or trap gear were greater for humpback (67/126 = 53%) and gray whales (23/59 = 39%) compared to net gear. The fraction of entanglements where the gear type could not be identified was similar (~40%) for humpbacks (52/126) and gray whales (22/59). Differences in gear types between the 2 species may reflect multiple factors, including the spatial/temporal overlap of each species with different fisheries and possible observation biases in the ability to detect one gear type versus another. For example, monofilament gillnet entanglements are more difficult to detect at a distance than pot/trap gear entanglements; the latter usually include highly visible buoys trailing behind the whale.
Species classifications for 35 unidentified novel data cases included 24 humpback whales and 11 gray whales (Table 5). These classifications are based on the plurality vote of 500 RF trees. For example, novel data case #1 in Table 5 shows that the overall classification was gray whale, based on 86% of forest trees assigning this species. No novel data cases resulted in an overall classification of minke, blue, fin, or sperm whale, but most novel cases include a small percentage of RF trees assigning these species. Despite the lack of minke, blue, fin, or sperm whale plurality vote classifications, the proportion of trees predicting each species is analogous to a species assignment probability, where higher values imply greater confidence. For example, novel case #20 in Table 5 has the following assignment probabilities for minke, blue, fin, gray, humpback, and sperm whales, respectively: 0.002, 0.002, 0.01, 0.042, 0.944, and 0.00. Thus, the assigned species is humpback whale with a 94% probability. However, all 6 species assignment probabilities can be used to prorate this novel data case. One alternative to accepting species classifications based on the plurality votes is to sum individual species classification probabilities over all 35 novel data cases. This yields fractional species classifications, resulting in 0.218 minke, 0.462 blue, 0.77 fin, 11.97 gray, 21.5 humpback, and 0.078 sperm whale entanglements (Table 5). This approach yields approximately the same number of gray and humpback entanglements as the plurality vote approach, but it does a better job of representing the rare species classes by assigning them some small probability of occurrence, which is otherwise zero with the plurality vote results. The uncertainty of the plurality vote classifications can also be expressed as the range of summed species classifications for each of the 500 individual RF trees. For example, summing the species classifications from tree #1 of the RF model results in the following classifications for the 35 novel data cases: 4 fin whales, 11 gray whales, 20 humpback whales, and zero classifications for the remaining species. Tree #500 yields 1 fin whale, 9 gray whale, 24 humpback whale, and 1 sperm whale classification. Confidence intervals (95%) for all species classifications were calculated by summing species classifications individually for all 500 RF trees and identifying the 2.5th and 97.5th percentiles of the sums over all 35 novel data cases.
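A minimal Python sketch of the two summaries described in this section, fractional species totals and per-tree percentile intervals, is given below. It relies on stated assumptions: the vote matrix is synthetic (randomly generated with arbitrary weights, not the study's data), the percentile rule is a simple nearest-rank choice, and the names fractional_totals and per_tree_interval are hypothetical.

```python
import random

SPECIES = ["minke", "blue", "fin", "gray", "humpback", "sperm"]


def fractional_totals(votes):
    """Sum of per-case vote fractions; votes[t][c] is the species tree t assigns to case c."""
    n_trees = len(votes)
    return {sp: sum(tree.count(sp) for tree in votes) / n_trees for sp in SPECIES}


def per_tree_interval(votes, species, lower=0.025, upper=0.975):
    """Nearest-rank percentiles of the per-tree count of cases assigned to a species."""
    sums = sorted(tree.count(species) for tree in votes)
    last = len(sums) - 1
    return sums[round(lower * last)], sums[round(upper * last)]


# Synthetic stand-in for 500 trees classifying 35 novel cases (weights are arbitrary).
rng = random.Random(42)
votes = [rng.choices(SPECIES, weights=[1, 1, 2, 34, 61, 1], k=35) for _ in range(500)]

totals = fractional_totals(votes)
gray_interval = per_tree_interval(votes, "gray")
print(totals, gray_interval)
```

Because every tree classifies every case, the fractional totals always sum to the number of novel cases, which is why the RF totals reported above add to approximately 35.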

DISCUSSION
Nearly all of the known-species entanglement cases (92%) involved humpback and gray whales, 2 species that tend to utilize the California Current in different seasons and that are documented in net and pot/trap gear at different rates. High rates of correct species classification from the RF model are largely due to differences in seasonal occurrence of gray and humpback whales, proportions of entanglements involving net versus pot/trap gear, and the locations of the observed entanglements (Table 3). These differences are reflected by the identification of Day.of.Year, Interaction.Type, and LAT as the 3 most important variables in terms of their numerical contribution to correct classification rates (Table 3). Low classification accuracy for minke, blue, and fin whale model data cases is expected, given that these cases comprise only 6% of the observations. The correct classification of both sperm whale model data cases was initially surprising because […] (Efron & Tibshirani 1997). When the 2 cases are split between tree construction and OOB sample roles, the OOB sample will be assigned to the terminal node occupied by the first sperm whale case, because the variables are identical for each. This represents a special case of overfitting, which could be addressed by excluding the 2 sperm whale entanglements from analysis. However, the value of including these cases is that a RF data model lacking sperm whale entanglements would assign a zero risk of such entanglements in the novel data, which is known a priori to be untrue. Despite poor classification accuracy for a few species with small sample sizes, their inclusion in the RF model is worthwhile because fractional estimates of entanglements can be produced for the novel data, despite the lack of any plurality vote assignments for these minor species. The classification accuracy for humpback and gray whales is, however, encouraging, in terms of prorating unknown species entanglement cases, the majority of which should comprise these 2 species.
The importance of accurately assigning unknown cases to species can be considered as a form of risk management. For humpback and gray whale populations along the US west coast, there is a greater penalty for misclassifying a humpback entanglement. This is because humpbacks are less abundant than gray whales (estimated population sizes ~2000 and ~20 000, respectively) and humpbacks have lower allowable anthropogenic injury and mortality thresholds (potential biological removal or PBR; Wade 1998) under the Marine Mammal Protection Act. Current PBR levels for each population are 11 humpbacks versus 624 gray whales (Carretta et al. 2017).
The variable Day.of.Year was identified as the most important predictor variable, based on the greatest permutation cost to correct classification rates, but the context of variable importance is worth discussion. Algorithms such as RFs are designed to simultaneously handle many predictors and automatically deal with interactions between variables (Breiman 2001a,b). However, variable importance in the context of RF usually measures the effect on classification accuracy of permuting a single variable at a time. Some methods used to assess RF variable importance, such as rfPermute (Archer 2016), include statistical p-values for each variable. This is a useful tool for considering variables for model inclusion. However, analysts may be tempted to arbitrarily eliminate candidate variables that do not meet default p-value thresholds (p < 0.05). Such an approach may unnecessarily exclude multiple nonsignificant predictors whose collective classification power is superior to a smaller set of significant predictors (Breiman 2001a,b). It is recommended that analysts consider wider inclusion of candidate variables in RF models and examine cross-validated correct classification rates under different suites of variable numbers and combinations.
Species classifications for novel data could also be obtained via simple proration: multiplying observed model data species proportions by the number of novel cases (n = 35). This results in the following number of estimated entanglements for unidentified cases: 0.01 × 35 = 0.35 minke whales, 0.025 × 35 = 0.88 blue whales, 0.025 × 35 = 0.88 fin whales, 0.296 × 35 = 10.4 gray whales, 0.63 × 35 = 22 humpback whales, and 0.01 × 35 = 0.35 sperm whales. The RF model resulted in similar species classifications (0.218 minke, 0.462 blue, 0.77 fin, 11.97 gray, 21.5 humpback, and 0.078 sperm whales). The similarity in species classifications using simple proration and the RF model suggests that the 35 novel data cases may reflect an unbiased sample of the known-species model data observations. However, it is unknown whether or not the model data are representative of all large-whale entanglements. For example, gray whales generally occur closer to shore, compared to other species. This may introduce a positive detection bias for gray whales in the model data, as they may be more likely to be detected and reported from observers on shore or whale-watching vessels. Additionally, recreational vessel traffic is generally concentrated closer to shore, which would amplify this bias. If a positive gray-whale bias exists, the model data may represent an underestimation of other species' entanglements as a fraction of total entanglements. While the simple proration is easy to implement, it is crude and forfeits potential insights into predictor variables that may be related to entanglement risk. However, if a suitable species identification model cannot be generated using RF or some other method, then at a minimum, unidentified cases should be prorated to fully account for entanglement risks to all species.
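The simple proration amounts to one line of arithmetic per species. The Python sketch below is illustrative only; it uses the known-species counts reported in the Results (2 minke, 5 blue, 5 fin, 59 gray, 126 humpback, 2 sperm) rather than the rounded proportions quoted above, so the values differ slightly in the second decimal place.

```python
# Known-species counts from the model data (total = 199).
counts = {"minke": 2, "blue": 5, "fin": 5, "gray": 59, "humpback": 126, "sperm": 2}
n_novel = 35  # unidentified entanglement cases

total = sum(counts.values())
# Prorate the unidentified cases by the observed species proportions.
prorated = {species: n_novel * n / total for species, n in counts.items()}

for species, estimate in prorated.items():
    print(species, round(estimate, 2))
```

As with the RF fractional totals, the prorated estimates sum to the 35 unidentified cases, which makes the two approaches directly comparable species by species.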
The RF species assignment approach described here has applications to other wildlife studies, particularly transect surveys, where a non-trivial fraction of detections may lack species identifications: raptors (Andersen et al. 1985), seabirds (Piatt et al. 2011), large whales (Barlow & Forney 2007), and sea turtles (Seminoff et al. 2014). When unidentified detections are not prorated to species, they are often omitted from analyses and can result in underestimates of animal abundance. Other applications may include species proration of unidentified bycatch in commercial fisheries and species assignments of large-whale vessel strikes.

Fig. 1. Locations of (a−c) model data (identified species; a: gray whale, b: humpback whale, c: sperm, minke, fin, and blue whales) and (d) novel data (unidentified) whale entanglements used in this analysis

Fig. 2. Expected (null) and observed correct classification rates for cross-validated out-of-bag (OOB) model data. Expected values are based on permuting the response variable 'Species' 1000 times. This is equivalent to random assignment of a species to each model data observation based on observed species proportions. Observed correct classification rates from the random forest model are shown as a vertical red line for (a) all species combined, (b) gray whale, and (c) humpback whale. The probability that observed correct classification rates were less than or equal to null distribution correct classification rates was < 0.001 in all cases

Table 1. Variables tested and used in the random forest large-whale entanglement species-identification model. Variables used in the final ID model are in bold. Day.of.Year was used instead of calendar month, as it represents a finer measure of seasonality.

Table 3. Variable importance as measured by the decrease in classification accuracy when each variable is permuted. Variables appear in increasing order of importance, where the cost of permutation is the decrease in the number of correctly classified out-of-bag (OOB) cases. Permuting the variable Day.of.Year had the largest cost (8%) to classification accuracy, resulting in 15 fewer correct classifications than a random forest model with all variables intact. BA = minke whale, BM = blue whale, BP = fin whale, ER = gray whale, MN = humpback whale, PM = sperm whale. See Table 1 for description of variables

Table 2. Random forest confusion matrix and correct classification rates for cross-validated out-of-bag (OOB) large-whale entanglement cases of known species. Rows represent known species and columns represent number of classifications of each species. The overall correct classification rate for OOB entanglement cases was 0.78, or 155 of 199 model data cases. The last column shows expected correct classification rates under the condition of permuting the response variable ('Species'). This is equivalent to a null distribution of OOB correct classification rates where all variables lack predictive value

Table 5. Random forest (RF) species classifications for novel data cases. Columns 2−6 represent variables used in the RF model. Values in species columns are the fraction of RF trees resulting in a given species classification. The overall classification for a novel data case is based on the plurality vote of all RF trees and appears in the last column. BA = minke whale, BM = blue whale, BP = fin whale, ER = gray whale, MN = humpback whale, PM = sperm whale. See Table 1 for description of variables