Methods

Data Sources

Input sequences come from metagenomic assemblies in public repositories (ENA and SRA). Each biome is represented by sequences drawn from representative studies of extreme environments.

ESM-2 Scoring

We use Meta's ESM-2 (the 650M-parameter model, esm2_t33_650M_UR50D) to generate protein embeddings. These embeddings are passed to a binary XGBoost classifier trained on known AMPs from the DRAMP and APD3 databases.

Threshold: AMP probability of 0.87 or higher
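
The scoring step can be sketched as below. This is a minimal illustration, not the pipeline's actual code. It assumes the `fair-esm` and `torch` packages are installed, and it assumes per-residue embeddings are mean-pooled into one vector per sequence, which the text does not specify. The function names `embed_sequences` and `select_candidates` are hypothetical.

```python
AMP_THRESHOLD = 0.87  # cutoff stated in the text


def embed_sequences(sequences):
    """Return one mean-pooled ESM-2 embedding per sequence.

    Assumption: mean pooling over residues; the text does not say how
    per-residue embeddings are reduced to a single vector.
    """
    # Imported lazily so the thresholding helper below works without them.
    import torch
    import esm

    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    model.eval()
    batch_converter = alphabet.get_batch_converter()
    data = [(f"seq{i}", seq) for i, seq in enumerate(sequences)]
    _, _, tokens = batch_converter(data)
    with torch.no_grad():
        out = model(tokens, repr_layers=[33], return_contacts=False)
    reps = out["representations"][33]
    # Position 0 is the BOS token; residues occupy positions 1..len(seq).
    return [
        reps[i, 1:len(seq) + 1].mean(0).numpy()
        for i, seq in enumerate(sequences)
    ]


def select_candidates(ids, amp_probabilities, threshold=AMP_THRESHOLD):
    """Keep the ids whose AMP probability is at or above the threshold."""
    return [i for i, p in zip(ids, amp_probabilities) if p >= threshold]
```

With a scikit-learn-style XGBoost classifier, the probabilities would typically come from `classifier.predict_proba(X)[:, 1]`, where `X` stacks the embeddings row by row.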

Biophysical Filtering

Candidates must pass physicochemical criteria consistent with known AMPs:

Parameter	Threshold
Length	10-50 amino acids
Net charge	at least +2
Hydrophobic moment	at least 1.0
Helix propensity	at least 1.0
Non-standard amino acids	None
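
As a rough sketch, the criteria in the table could be applied as follows. The text does not say which scales are used for hydrophobic moment or helix propensity, so this example takes both as precomputed inputs rather than guessing. The net charge rule (+1 for K/R, -1 for D/E, ignoring histidine and the termini) is a simplifying assumption, and the function names are hypothetical.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def net_charge(seq):
    """Approximate net charge at neutral pH.

    Assumption: +1 per K/R, -1 per D/E; histidine and termini are ignored.
    """
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def passes_biophysical_filter(seq, hydrophobic_moment, helix_propensity):
    """Apply the table's thresholds to one candidate peptide.

    hydrophobic_moment and helix_propensity must be computed upstream
    with whatever scales the pipeline uses.
    """
    seq = seq.upper()
    return (
        10 <= len(seq) <= 50              # length 10-50 amino acids
        and set(seq) <= STANDARD_AA       # no non-standard residues
        and net_charge(seq) >= 2          # net charge at least +2
        and hydrophobic_moment >= 1.0
        and helix_propensity >= 1.0
    )
```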

Safety Stack

Multi-tool safety screening:

HemoPi3 (SVM): Hemolysis prediction (~85% accuracy)
ToxinPred3: General toxicity prediction
Candidates are flagged rather than removed, so researchers can make informed decisions.
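
The flag-rather-than-remove policy can be illustrated with a small sketch. The `Candidate` class and `apply_safety_flags` function are hypothetical, and the sets of flagged ids stand in for the outputs of the hemolysis and toxicity predictors, whose real interfaces are not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    seq_id: str
    sequence: str
    flags: list = field(default_factory=list)


def apply_safety_flags(candidates, hemolytic_ids, toxic_ids):
    """Annotate candidates with safety flags without dropping any.

    hemolytic_ids and toxic_ids are sets of seq_id values that the
    upstream predictors marked as positive (hypothetical inputs).
    """
    for cand in candidates:
        if cand.seq_id in hemolytic_ids:
            cand.flags.append("hemolytic")
        if cand.seq_id in toxic_ids:
            cand.flags.append("toxic")
    return candidates
```

Every input candidate is still present in the output, so downstream steps can decide how to treat flagged entries.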

Novelty Assessment

BLAST against three databases:

DRAMP v3 (2026-01-15)
APD3 (2026-01-20)
AMPSphere (2024-Zenodo)

Candidates with max identity below 40% are classified as "database-novel."
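
A minimal sketch of the novelty rule, assuming BLAST results from all three databases are available in the standard tabular format (`-outfmt 6`), where the first column is the query id and the third is percent identity. Treating a query with no hits as 0% identity is an assumption, and the function names are hypothetical.

```python
NOVELTY_CUTOFF = 40.0  # percent identity, from the text


def max_identity_per_query(blast_lines):
    """Return the highest percent identity seen for each query id."""
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, pident = fields[0], float(fields[2])
        best[query] = max(best.get(query, 0.0), pident)
    return best


def classify_novelty(query_ids, blast_lines, cutoff=NOVELTY_CUTOFF):
    """Map each query id to True if it is database-novel.

    Assumption: a query with no hits counts as 0% identity, hence novel.
    """
    best = max_identity_per_query(blast_lines)
    return {q: best.get(q, 0.0) < cutoff for q in query_ids}
```

Concatenating the hit tables from all three databases before calling `classify_novelty` gives the maximum identity across databases.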

Structure Prediction

Top candidates receive structure prediction via:

ESMFold for fast screening
ColabFold (MMseqs2 + AlphaFold2) for high-confidence structures

Tiering

Tier	Criteria
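Tier-1 Lead	AMP probability above 0.95, predicted non-hemolytic, passes physicochemical filters, database-novel
Tier-1 (risk-flagged)	Meets the other Tier-1 criteria but is predicted hemolytic or otherwise safety-flagged
Tier-2 Exploratory	AMP probability of 0.87 or higher, passes basic filters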
Hold	Borderline scores
Reject	Below thresholds
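
The tier table can be read as a decision function along the following lines. This is one possible interpretation, not the pipeline's definitive logic. The table does not define "borderline", so the lower edge of the Hold band is a hypothetical `hold_floor` parameter that the caller must supply. The sketch uses the 0.87-or-higher cutoff from the ESM-2 Scoring section for Tier-2 and leaves safety-flagged Tier-2 candidates in Tier-2, which is also an assumption.

```python
def assign_tier(amp_prob, passes_filters, is_novel, safety_flagged, hold_floor):
    """Assign a tier label to one candidate.

    amp_prob:        classifier probability in [0, 1]
    passes_filters:  True if the biophysical criteria are met
    is_novel:        True if classified as database-novel
    safety_flagged:  True if predicted hemolytic or otherwise flagged
    hold_floor:      hypothetical lower bound of the "borderline" band
    """
    if passes_filters and amp_prob > 0.95 and is_novel:
        return "Tier-1 (risk-flagged)" if safety_flagged else "Tier-1 Lead"
    if passes_filters and amp_prob >= 0.87:
        return "Tier-2 Exploratory"
    if passes_filters and hold_floor <= amp_prob < 0.87:
        return "Hold"
    return "Reject"
```

For example, with an illustrative `hold_floor` of 0.80, a candidate scoring 0.85 that passes the filters would be placed on Hold rather than rejected.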

caution

All predictions are computational. The pipeline has known limitations including potential training data bias and limited hemolysis prediction accuracy.
