Methods

Data Sources

Input sequences are metagenomic assemblies from public repositories (ENA/SRA). For each biome, sequences are drawn from representative studies of extreme environments.

ESM-2 Scoring

We use Meta's ESM-2 (the 650M-parameter model, esm2_t33_650M_UR50D) to generate protein embeddings, which are then scored by a binary XGBoost classifier trained on known AMPs from the DRAMP and APD3 databases.

Threshold: AMP probability of 0.87 or higher
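
The two-stage scoring described above (embed, then classify, then apply the 0.87 threshold) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `score_candidates`, `embed`, and `classify` are hypothetical names, and the two callables stand in for the ESM-2 embedder and the trained XGBoost classifier, which are not shown here.

```python
from typing import Callable, Dict, Sequence

AMP_THRESHOLD = 0.87  # threshold stated in this section

# Hypothetical stand-ins: any callable mapping a sequence to a vector,
# and any callable mapping a vector to an AMP probability.
Embedder = Callable[[str], Sequence[float]]
Classifier = Callable[[Sequence[float]], float]


def score_candidates(
    sequences: Dict[str, str],
    embed: Embedder,
    classify: Classifier,
    threshold: float = AMP_THRESHOLD,
) -> Dict[str, float]:
    """Return {candidate_id: AMP probability} for candidates at or above the threshold."""
    kept: Dict[str, float] = {}
    for cand_id, seq in sequences.items():
        prob = classify(embed(seq))
        if prob >= threshold:  # "0.87 or higher" is inclusive
            kept[cand_id] = prob
    return kept
```

Passing the embedder and classifier in as arguments keeps the thresholding logic testable without loading the 650M-parameter model.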

Biophysical Filtering

Candidates must pass physicochemical criteria consistent with known AMPs:

| Parameter | Threshold |
| --- | --- |
| Length | 10-50 amino acids |
| Net charge | at least +2 |
| Hydrophobic moment | at least 1.0 |
| Helix propensity | at least 1.0 |
| Non-standard amino acids | None |
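
As a rough sketch, the criteria in the table could be applied like this. The function names are hypothetical. The net charge here is a simple residue count (+1 for K/R, -1 for D/E, ignoring histidine and the termini), which is an assumption rather than the pipeline's documented method. Hydrophobic moment and helix propensity are taken as precomputed inputs because the hydrophobicity and propensity scales are not specified above.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def net_charge(seq: str) -> int:
    """Approximate net charge: +1 per K/R, -1 per D/E.

    Assumption: histidine and terminal charges are ignored.
    """
    seq = seq.upper()
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def passes_biophysical_filter(
    seq: str,
    hydrophobic_moment: float,  # precomputed elsewhere; scale not specified here
    helix_propensity: float,    # precomputed elsewhere; scale not specified here
) -> bool:
    """Apply the thresholds from the table to one candidate."""
    seq = seq.upper()
    if any(aa not in STANDARD_AA for aa in seq):
        return False  # non-standard amino acids are not allowed
    return (
        10 <= len(seq) <= 50
        and net_charge(seq) >= 2
        and hydrophobic_moment >= 1.0
        and helix_propensity >= 1.0
    )
```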

Safety Stack

Multi-tool safety screening:

  • HemoPi3 (SVM): Hemolysis prediction (~85% accuracy)
  • ToxinPred3: General toxicity prediction
  • Candidates are flagged but not removed, so researchers can make informed decisions
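
The flag-but-keep policy can be illustrated with a small sketch. `SafetyRecord` and `annotate_safety` are hypothetical names, and the boolean fields are assumed to hold the calls already produced by the hemolysis and toxicity predictors. The point is only that every candidate is returned, with flags attached.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class SafetyRecord:
    seq_id: str
    predicted_hemolytic: bool  # assumed output of the hemolysis predictor
    predicted_toxic: bool      # assumed output of the toxicity predictor
    flags: List[str] = field(default_factory=list)


def annotate_safety(records: Iterable[SafetyRecord]) -> List[SafetyRecord]:
    """Attach safety flags to every record; nothing is filtered out."""
    annotated = []
    for rec in records:
        rec.flags = []
        if rec.predicted_hemolytic:
            rec.flags.append("hemolysis")
        if rec.predicted_toxic:
            rec.flags.append("toxicity")
        annotated.append(rec)
    return annotated
```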

Novelty Assessment

Each candidate is searched with BLAST against three databases:

  • DRAMP v3 (2026-01-15)
  • APD3 (2026-01-20)
  • AMPSphere (2024-Zenodo)

Candidates whose maximum identity to any entry in these databases is below 40% are classified as "database-novel."
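
One way to apply the 40% rule is sketched below. It assumes the BLAST hits from all three databases are available as tab-separated tabular output (`-outfmt 6`, where the first column is the query ID and the third is percent identity). It also assumes that a candidate with no hits counts as novel and that no coverage or E-value filter is applied; none of these details are stated above, and the function names are hypothetical.

```python
from typing import Dict, Iterable

NOVELTY_CUTOFF = 40.0  # percent identity, from the rule above


def max_identity(blast_lines: Iterable[str]) -> Dict[str, float]:
    """Highest percent identity per query across all tabular BLAST hits."""
    best: Dict[str, float] = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, pident = fields[0], float(fields[2])
        best[query] = max(best.get(query, 0.0), pident)
    return best


def classify_novelty(
    candidate_ids: Iterable[str],
    blast_lines: Iterable[str],
    cutoff: float = NOVELTY_CUTOFF,
) -> Dict[str, bool]:
    """Map each candidate to True if it is database-novel.

    Assumption: candidates with no hits are treated as 0% identity.
    """
    best = max_identity(blast_lines)
    return {cid: best.get(cid, 0.0) < cutoff for cid in candidate_ids}
```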

Structure Prediction

Top candidates receive structure prediction via:

  • ESMFold for fast screening
  • ColabFold (MMseqs2 + AlphaFold2) for high-confidence structures

Tiering

| Tier | Criteria |
| --- | --- |
| Tier-1 Lead | AMP probability above 0.95, non-hemolytic, good physicochemical profile, novel |
| Tier-1 (risk-flagged) | Meets Tier-1 criteria but is hemolytic or otherwise safety-flagged |
| Tier-2 Exploratory | AMP probability above 0.87, passes basic filters |
| Hold | Borderline scores |
| Reject | Below thresholds |

Caution

All predictions are computational. The pipeline has known limitations including potential training data bias and limited hemolysis prediction accuracy.
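
The tiering rules in the table above can be read as a simple decision cascade, sketched here under several assumptions. The range for "Borderline scores" is not defined on this page, so `hold_floor` is a hypothetical parameter with an arbitrary default. The Tier-2 cutoff is treated as inclusive to match the "0.87 or higher" threshold in the scoring section. "Good physicochemical profile" and "passes basic filters" are collapsed into a single `passes_filters` flag. `TierInput` and `assign_tier` are illustrative names only.

```python
from dataclasses import dataclass


@dataclass
class TierInput:
    amp_prob: float        # AMP probability from the classifier
    passes_filters: bool   # result of the biophysical filter
    novel: bool            # database-novel per the 40% identity rule
    safety_flagged: bool   # hemolytic or otherwise safety-flagged


def assign_tier(c: TierInput, hold_floor: float = 0.80) -> str:
    """Assign a tier label; hold_floor is a hypothetical borderline cutoff."""
    if c.amp_prob > 0.95 and c.passes_filters and c.novel:
        # Safety flags change the label but do not demote the candidate.
        return "Tier-1 (risk-flagged)" if c.safety_flagged else "Tier-1 Lead"
    if c.amp_prob >= 0.87 and c.passes_filters:
        return "Tier-2 Exploratory"
    if hold_floor <= c.amp_prob < 0.87 and c.passes_filters:
        return "Hold"
    return "Reject"
```

For example, `assign_tier(TierInput(0.97, True, True, True))` yields the risk-flagged Tier-1 label, reflecting the policy of flagging rather than removing candidates.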