Overview

The IUCN Red List is the world's most comprehensive inventory of species' conservation status. It classifies species from Least Concern through to Extinct — but a critical subset are labelled Data Deficient (DD): species where available information is too poor to even assign a threat category.

This is not a safe status. A DD classification often reflects under-resourced field research rather than genuine absence of threat. Left unaddressed, these species fall through the cracks of conservation policy.

Distribution of IUCN Red List categories across all 39,024 species in the dataset
Distribution of IUCN Red List categories. Least Concern dominates, while Data Deficient (far right) represents 9.3% of the dataset — species with insufficient information to classify.

The goal: Train a classification model on species with known threat status, then use it to predict whether Kenya's 335 Data Deficient species are likely to be Threatened or Non-Threatened — surfacing candidates for urgent field research.


The Dataset

Data was sourced from the IUCN Red List bulk download, comprising 8 CSV files covering different aspects of species records:

File                 Rows     Description
simple_summary       41,949   Core taxonomy, Red List category, population trend
all_other_fields     41,949   Extended metadata fields
habitats             109,594  Habitat type associations per species
threats              88,999   Threat codes and stresses per species
research_needed      82,263   Research priority codes
conservation_needed  50,414   Conservation action codes
usetrade             18,524   Human use and trade records
countries            431,396  Country-level occurrence records

Missingness & Deduplication

Before analysis, the data needed cleaning. Initial inspection revealed significant missing values across several fields.

Missingness matrix showing gaps across fields before cleaning
Initial missingness matrix. Several fields — including population trend, habitat associations, and threat data — have substantial gaps, particularly for Data Deficient species.

The primary join key (internalTaxonId) appeared in 2,925 duplicated rows. The first occurrence of each ID was kept and the duplicates were removed. After deduplication:

Missingness matrix after cleaning and deduplication
Missingness after cleaning. The dataset was consolidated to 39,024 unique species, with 7 processed output files prepared for downstream analysis.
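
The deduplication step can be sketched with pandas. This is a minimal illustration on a toy table, not the project's actual code: only internalTaxonId comes from the write-up, while the redlistCategory column name and the sample rows are assumptions. On the real data the same operation would take 41,949 rows down to 41,949 - 2,925 = 39,024.

```python
import pandas as pd

# Toy stand-in for simple_summary. Only internalTaxonId is a name from the
# write-up; redlistCategory and the rows themselves are hypothetical.
summary = pd.DataFrame({
    "internalTaxonId": [101, 102, 102, 103, 103, 103],
    "redlistCategory": ["Least Concern", "Vulnerable", "Vulnerable",
                        "Data Deficient", "Data Deficient", "Data Deficient"],
})

# Count rows whose ID has already appeared earlier in the table.
n_duplicates = int(summary.duplicated(subset="internalTaxonId", keep="first").sum())

# Keep the first row for each ID and drop the later repeats.
deduped = (
    summary.drop_duplicates(subset="internalTaxonId", keep="first")
    .reset_index(drop=True)
)

print(f"{n_duplicates} duplicate rows removed, {len(deduped)} unique species remain")
```

Keeping the first occurrence assumes the repeated rows carry the same information; if they differed, an explicit tie-breaking rule would be safer.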

Exploratory Data Analysis

Red List Category Imbalance

For modelling purposes, the 13 IUCN categories were collapsed into three simplified groups:

  • Threatened — Vulnerable, Endangered, Critically Endangered
  • Non-Threatened — Least Concern, Near Threatened, Lower Risk/*
  • Extinct — Extinct, Extinct in the Wild, Regionally Extinct

Data Deficient species were excluded from training and reserved as prediction targets.
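
The grouping above can be expressed as a small lookup function. This is a sketch under assumptions: the exact label strings used in the IUCN export (for example the spelling of the Lower Risk sub-categories) are guesses, and the "Other" fallback is a hypothetical catch-all for labels the list does not mention.

```python
# Simplified groups as listed in the text. The label strings are assumed
# to match the export; adjust them to the actual values in the data.
THREATENED = {"Vulnerable", "Endangered", "Critically Endangered"}
NON_THREATENED = {"Least Concern", "Near Threatened"}
EXTINCT = {"Extinct", "Extinct in the Wild", "Regionally Extinct"}


def simplify_category(category: str) -> str:
    """Return the simplified group for one Red List category label."""
    if category in THREATENED:
        return "Threatened"
    # "Lower Risk/*" covers every sub-category that starts with this prefix.
    if category in NON_THREATENED or category.startswith("Lower Risk/"):
        return "Non-Threatened"
    if category in EXTINCT:
        return "Extinct"
    if category == "Data Deficient":
        return "Data Deficient"  # held out as prediction targets
    return "Other"  # hypothetical fallback for labels not listed above


examples = ["Endangered", "Lower Risk/least concern",
            "Extinct in the Wild", "Data Deficient"]
simplified = [simplify_category(c) for c in examples]
print(simplified)
```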

Simplified Red List category groupings — Threatened vs Non-Threatened vs Extinct
Simplified groupings used for binary classification. Non-Threatened (Least Concern + Near Threatened) vastly outnumbers Threatened, creating a class imbalance problem addressed later with SMOTE.

Taxonomic Signals

Not all taxonomic groups face equal risk. Threat rates vary substantially by phylum and class — making taxonomy a potentially powerful predictor.

Species count and threat rate by phylum
Species count (left) and threat rate (right) by phylum. Chordata dominates in species count, but some smaller phyla show disproportionately high threat rates.

Red List category proportions by taxonomic class (stacked bar chart)
Simplified Red List category proportions within each taxonomic class. Classes like AMPHIBIA and CYCADOPSIDA have markedly higher threatened fractions than ACTINOPTERYGII (fish) or INSECTA.

Geographic Range

Species occurring in fewer countries tend to have narrower ranges, making them more vulnerable to localised threats and habitat loss.

Country count distribution and boxplot comparing threatened vs non-threatened species
Distribution of country occurrences per species (left) and boxplot by threat status (right). Threatened species occur in markedly fewer countries on average, which makes range size a useful signal.

Habitat Specialisation

Similarly, species dependent on fewer habitat types have less adaptive buffer against environmental change.

Habitat count distribution and boxplot by threat category
Habitat count distribution (left) and boxplot by category (right). Habitat specialists — those relying on 1–2 habitat types — skew toward Threatened status.

Threat Documentation

Species with more documented threats are, unsurprisingly, more likely to be classified as Threatened. But this also reveals a critical bias:

Documented threat count distribution and boxplot by category
Threat count by species. Threatened species have substantially more documented threats. However, Data Deficient species have only 35.4% threat documentation coverage, versus 63.8% for non-DD species — meaning the model may underestimate risk for poorly documented DD species.

Data Availability for DD Species

A key challenge: Data Deficient species are deficient not just in their Red List assessment, but in the features the model relies on.

Data availability comparison between DD and non-DD species across all feature sources
Feature data availability for DD (orange) vs non-DD (blue) species. Countries (100%) and habitats (97%) are well covered. Threats (35.4%), uses (17.8%), and conservation records (20.2%) are sparse for DD species — a systematic bias that predictions must be interpreted against.
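
One way such coverage figures could be computed is to check, for each species, whether it has at least one row in a given feature table, then average that flag within the DD and non-DD groups. The sketch below uses toy data; the group column and the sample IDs are assumptions, not the project's code.

```python
import pandas as pd

# Toy species table: three DD and three non-DD species (hypothetical IDs).
species = pd.DataFrame({
    "internalTaxonId": [1, 2, 3, 4, 5, 6],
    "group": ["DD", "DD", "DD", "non-DD", "non-DD", "non-DD"],
})

# Toy threats table: one row per (species, threat) record.
threats = pd.DataFrame({"internalTaxonId": [1, 4, 4, 5]})

# A species counts as covered if it has at least one threat record.
species["has_threats"] = species["internalTaxonId"].isin(threats["internalTaxonId"])

# The mean of a boolean flag is the share of covered species in each group.
coverage = species.groupby("group")["has_threats"].mean()
print((coverage * 100).round(1))
```

The same pattern applies to the habitats, usetrade, and conservation tables by swapping in the relevant table.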

Feature Engineering

Eight candidate features were selected based on EDA findings and data availability across DD species:

Feature          Type         Source                       Rationale
className        Categorical  simple_summary               Taxonomic group is a proxy for ecological vulnerability
populationTrend  Categorical  simple_summary               Declining trend strongly associated with threatened status
n_countries      Numeric      countries (count)            Proxy for range size
n_habitats       Numeric      habitats (count)             Habitat breadth / specialisation
n_threats        Numeric      threats (count)              Documented threat burden
n_uses           Numeric      usetrade (count)             Human exploitation pressure
n_conservation   Numeric      conservation_needed (count)  Level of conservation attention
n_research       Numeric      research_needed (count)      Documented knowledge gaps

className and populationTrend were one-hot encoded, producing 52 total features (6 numeric + 46 binary). The dataset was split 80/20 into training (28,163 species) and test (7,041 species) sets.
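
The feature pipeline described above (per-species counts, one-hot encoding, 80/20 split) might look roughly like the following. It is a sketch on toy data: the threatened label column, the sample rows, the stratified split, and the random seed are all assumptions, and only one count feature is shown.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy species table. className, populationTrend and internalTaxonId are names
# from the write-up; the "threatened" label column is hypothetical.
species = pd.DataFrame({
    "internalTaxonId": list(range(1, 11)),
    "className": ["AMPHIBIA", "AVES", "INSECTA", "AVES", "AMPHIBIA"] * 2,
    "populationTrend": ["Decreasing", "Stable", "Unknown", "Stable", "Decreasing"] * 2,
    "threatened": [1, 0, 0, 0, 1] * 2,
})

# Toy countries table: one row per (species, country) occurrence record.
countries = pd.DataFrame({"internalTaxonId": [1, 1, 2, 3, 3, 3, 5, 7, 9, 9]})

# Count feature: rows per species; species with no records get 0.
n_countries = countries.groupby("internalTaxonId").size()
species["n_countries"] = (
    species["internalTaxonId"].map(n_countries).fillna(0).astype(int)
)

# One-hot encode the two categorical features, keep the numeric one as is.
X = pd.get_dummies(
    species[["className", "populationTrend", "n_countries"]],
    columns=["className", "populationTrend"],
)
y = species["threatened"]

# 80/20 split. Stratifying on the label is an assumption, not stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X.shape, len(X_train), len(X_test))
```

The other count features (n_habitats, n_threats, and so on) would follow the same groupby-and-map pattern against their own tables.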


Handling Class Imbalance

Non-Threatened species significantly outnumber Threatened ones — a ratio that would bias a naïve classifier toward the majority class. SMOTE (Synthetic Minority Over-sampling Technique) was applied to the training set to generate synthetic Threatened examples and balance the distribution before fitting.

Class distribution before and after SMOTE oversampling
Training set class distribution before (left) and after (right) SMOTE. The synthetic samples bring Threatened species up to parity with Non-Threatened, allowing the model to learn from both classes equally.
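
To make the idea behind SMOTE concrete, here is a simplified NumPy sketch of its core step: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustration only. The project lists imbalanced-learn among its libraries, and that library's SMOTE class (applied via fit_resample to the training split) is the more likely tool; the function name smote_sketch and the toy data are hypothetical.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority-class (Threatened) feature rows.
X_threatened = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [2.5, 2.5]])
synthetic = smote_sketch(X_threatened, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic row is an interpolation, it always lies between existing minority samples. On one-hot columns this produces fractional values, which is one reason the synthetic examples should be read as approximations rather than plausible species.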

Model Training + Evaluation

Two classifiers were trained and evaluated: Logistic Regression and Random Forest.

Model                Test Accuracy  Test F1 (weighted)  5-Fold CV F1
Logistic Regression  83.4%          83.7%               84.0% ± 0.5%
Random Forest        87.5%          87.6%               90.1% ± 1.8%

Random Forest outperformed Logistic Regression on all metrics and was selected for final predictions.
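
A comparison of this kind can be set up in scikit-learn roughly as follows. The sketch runs on synthetic stand-in data, not the IUCN features, so its scores say nothing about the results in the table; the hyperparameters (max_iter, n_estimators), the seed, and the stratified split are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, roughly 70/30 imbalanced stand-in for the species features.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated weighted F1 on the training split.
    cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1_weighted")
    # Fit on the full training split, then score on the held-out test split.
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1_weighted": f1_score(y_test, pred, average="weighted"),
        "cv_f1_mean": cv_f1.mean(),
        "cv_f1_std": cv_f1.std(),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

If oversampling is used, cross-validating on already-resampled data lets synthetic neighbours of a validation point leak into the training folds, which can inflate the CV score relative to the test score.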

Logistic Regression confusion matrix on test set
Logistic Regression confusion matrix. Recall for Threatened species is 87%, but precision is only 72%: the model over-predicts threat status.

Random Forest confusion matrix on test set
Random Forest confusion matrix. Higher precision (79%) and similar recall (87%) for Threatened species, giving a better balance with fewer false alarms.
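
For readers less familiar with the two metrics, precision and recall for the Threatened class follow directly from the confusion-matrix cells. The counts below are hypothetical and chosen only to show the arithmetic; they are not the project's actual confusion-matrix values.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive (Threatened) class."""
    precision = tp / (tp + fp)  # of species flagged Threatened, the share that truly are
    recall = tp / (tp + fn)     # of truly Threatened species, the share that were flagged
    return precision, recall


# Hypothetical cell counts: true positives, false positives, false negatives.
precision, recall = precision_recall(tp=80, fp=20, fn=10)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Low precision with high recall means many false alarms but few missed Threatened species, which for research prioritisation is usually the cheaper error.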

Feature Importance

Top 20 most important features in the Random Forest model
Top 20 Random Forest feature importances. Numeric count features — particularly n_countries, n_habitats, and n_threats — dominate. Population trend categories and taxonomic class are the most informative categorical signals.
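
A ranking like this can be read off a fitted scikit-learn Random Forest through its feature_importances_ attribute. The sketch below fits a forest on synthetic data and merely borrows the six numeric feature names from the write-up as labels, so the resulting order is meaningless; it only shows the mechanics.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the column labels are borrowed from the feature table
# purely for illustration and have no relation to the generated values.
feature_names = ["n_countries", "n_habitats", "n_threats",
                 "n_uses", "n_conservation", "n_research"]
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalised by scikit-learn to sum to 1.
importances = (
    pd.Series(rf.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
top = importances.head(3)
print(top)
```

Impurity-based importances tend to favour high-cardinality numeric features over binary one-hot columns, so the dominance of count features should be interpreted with that in mind.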

Results: Kenya's Data Deficient Species

335 species occurring in Kenya were identified as Data Deficient. Using the Random Forest model, each was assigned a predicted threat status.
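
Selecting the prediction targets amounts to intersecting two filters: species recorded in Kenya and species labelled Data Deficient. A minimal pandas sketch, assuming hypothetical column names (country, redlistCategory) and toy rows, with a hard-coded stand-in where the fitted model's predictions would go:

```python
import pandas as pd

# Toy tables. Only internalTaxonId is a name from the write-up; the other
# column names and all rows are assumptions.
summary = pd.DataFrame({
    "internalTaxonId": [1, 2, 3, 4],
    "redlistCategory": ["Data Deficient", "Least Concern",
                        "Data Deficient", "Data Deficient"],
})
countries = pd.DataFrame({
    "internalTaxonId": [1, 1, 2, 3, 4],
    "country": ["Kenya", "Tanzania", "Kenya", "Uganda", "Kenya"],
})

# IDs of every species with at least one occurrence record in Kenya.
kenya_ids = countries.loc[countries["country"] == "Kenya", "internalTaxonId"].unique()

# Keep species that are both Data Deficient and present in Kenya.
is_dd = summary["redlistCategory"] == "Data Deficient"
kenya_dd = summary[is_dd & summary["internalTaxonId"].isin(kenya_ids)]

# Stand-in for model.predict(...) on these species: 1 = Threatened, 0 = Non-Threatened.
predictions = pd.Series([1, 0], index=kenya_dd["internalTaxonId"])
share = (
    predictions.map({1: "Threatened", 0: "Non-Threatened"})
    .value_counts(normalize=True)
)
print(share)
```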

Baseline distribution of known threat status for Kenya species with Red List categories
Known threat status distribution for Kenya species that already have Red List categories (i.e. non-DD). This provides a reference baseline for interpreting the DD predictions.

Random Forest and Logistic Regression predictions for Kenya's 335 Data Deficient species
Prediction distributions for Kenya's 335 DD species from both models. The Random Forest (right) predicts 291 Non-Threatened (86.9%) and 44 Threatened (13.1%). Logistic Regression flags more species as Threatened: 79 (23.6%), in line with its lower precision on the test set.

The 44 species the Random Forest flags as Threatened span multiple taxonomic classes, with fish (ACTINOPTERYGII), invertebrates, and mammals the most represented.

Per-class breakdown of predicted Threatened vs Non-Threatened for Kenya's DD species
Predicted threat status by taxonomic class for Kenya's 335 Data Deficient species. Each panel shows both models' outputs side by side, highlighting where they agree and diverge. Fish and invertebrate classes show the highest absolute counts of predicted-Threatened species.

Limitations

These predictions are best treated as research prioritisation hypotheses, not definitive assessments.

  • Sparse data bias: DD species are data-deficient by definition. With only 35% threat documentation coverage (vs 64% for non-DD), the model sees a less complete picture and may systematically underestimate risk.
  • SMOTE caveats: Synthetic oversampling generates plausible but artificial training examples. Edge cases near class boundaries may not reflect real ecological patterns.
  • Collapsed categories: Binary Threatened / Non-Threatened classification loses meaningful nuance — a Vulnerable and a Critically Endangered species are treated identically.
  • Count features only: Threat type and habitat type were reduced to simple counts. Qualitative differences (e.g., habitat destruction vs invasive species) are invisible to the model.
  • No external validation: Predictions have not been validated against expert assessments or field surveys. They should prompt investigation, not replace it.

References + Acknowledgements

Data Source

IUCN Red List of Threatened Species. Version 2024-2. www.iucnredlist.org. Downloaded November 2024.

Methods

  • Chawla, N.V., Bowyer, K.W., Hall, L.O., & Kegelmeyer, W.P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357.
  • Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.

Libraries

pandas · scikit-learn · imbalanced-learn · matplotlib · seaborn

Acknowledgements

This project was completed as part of the Machine Learning module at Zindua School, Nairobi. Thanks to the IUCN Red List team for making bulk species data publicly available to researchers.