Methods
Data Sources
Metagenomic assemblies are drawn from public repositories (ENA/SRA). For each biome, sequences come from representative studies of extreme environments.
ESM-2 Scoring
We use Meta's ESM-2 (650M-parameter model, esm2_t33_650M_UR50D) to generate protein embeddings, which are then scored by a binary XGBoost classifier trained on known AMPs from the DRAMP and APD3 databases.
Threshold: AMP probability of 0.87 or higher
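The scoring step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the `fair-esm` and `torch` packages, and it assumes per-sequence embeddings are obtained by mean-pooling the final-layer (layer 33) token representations, which the text above does not specify. The helper names `embed_sequences` and `select_candidates` are hypothetical.

```python
AMP_THRESHOLD = 0.87  # probability cutoff stated above


def embed_sequences(named_seqs):
    """Return one mean-pooled ESM-2 embedding per (name, sequence) pair.

    Assumption: mean pooling over residues of the last layer (33).
    Heavy imports are kept local so the rest of the module stays lightweight.
    """
    import torch
    import esm  # provided by the fair-esm package

    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    model.eval()
    batch_converter = alphabet.get_batch_converter()
    _, _, tokens = batch_converter(named_seqs)
    with torch.no_grad():
        out = model(tokens, repr_layers=[33], return_contacts=False)
    reps = out["representations"][33]
    # Index 0 is the BOS token; pool only over the real residues.
    return [
        reps[i, 1 : len(seq) + 1].mean(0).numpy()
        for i, (_, seq) in enumerate(named_seqs)
    ]


def select_candidates(ids, probabilities, threshold=AMP_THRESHOLD):
    """Keep the ids whose AMP probability is at or above the threshold."""
    return [i for i, p in zip(ids, probabilities) if p >= threshold]
```

With a fitted `xgboost.XGBClassifier` named `clf` (hypothetical), the probabilities would come from `clf.predict_proba(X)[:, 1]`, where `X` stacks the embeddings row by row.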
Biophysical Filtering
Candidates must pass physicochemical criteria consistent with known AMPs:
| Parameter | Threshold |
|---|---|
| Length | 10-50 amino acids |
| Net charge | at least +2 |
| Hydrophobic moment | at least 1.0 |
| Helix propensity | at least 1.0 |
| Non-standard amino acids | None |
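A minimal sketch of how these criteria might be applied to a single candidate. Several details are assumptions rather than the pipeline's documented behaviour: net charge is approximated as (K + R) minus (D + E), ignoring histidine and terminal groups, and the hydrophobic moment and helix propensity are taken as precomputed inputs because the scale and normalization behind the 1.0 thresholds are not stated here. The function names are hypothetical.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def net_charge(seq):
    """Approximate integer charge: +1 per K/R, -1 per D/E (assumption)."""
    seq = seq.upper()
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def passes_biophysical_filter(seq, hydrophobic_moment, helix_propensity):
    """Return (overall_pass, per-criterion results) for one peptide."""
    seq = seq.upper()
    checks = {
        "length": 10 <= len(seq) <= 50,
        "net_charge": net_charge(seq) >= 2,
        "hydrophobic_moment": hydrophobic_moment >= 1.0,
        "helix_propensity": helix_propensity >= 1.0,
        "standard_residues": set(seq) <= STANDARD_AA,
    }
    return all(checks.values()), checks
```

Returning the per-criterion dictionary alongside the overall result makes it easy to see which filter rejected a given peptide.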
Safety Stack
Multi-tool safety screening:
- HemoPi3 (SVM): Hemolysis prediction (~85% accuracy)
- ToxinPred3: General toxicity prediction
- Candidates are flagged but not removed, so researchers can make informed decisions
Novelty Assessment
BLAST against three databases:
- DRAMP v3 (2026-01-15)
- APD3 (2026-01-20)
- AMPSphere (2024 Zenodo release)
Candidates with max identity below 40% are classified as "database-novel."
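The novelty rule can be illustrated with a short sketch that reads BLAST tabular output (`-outfmt 6`, where the third column is percent identity) and takes the maximum identity per query across all hits. It assumes the hits from the three databases have been concatenated into one text block, that queries with no hits count as 0% identity, and that raw percent identity is used without any coverage requirement; none of these details are stated above. The label `known-similar` and the function names are hypothetical.

```python
import csv
import io

NOVELTY_CUTOFF = 40.0  # percent identity, as stated above


def max_identity(blast_tsv, query_ids):
    """Map each query id to its highest percent identity over all hits."""
    best = {q: 0.0 for q in query_ids}  # no hits -> 0.0 (assumption)
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, pident = row[0], float(row[2])
        if query_id in best:
            best[query_id] = max(best[query_id], pident)
    return best


def classify_novelty(best):
    """Label each query by comparing its max identity to the cutoff."""
    return {
        q: "database-novel" if ident < NOVELTY_CUTOFF else "known-similar"
        for q, ident in best.items()
    }
```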
Structure Prediction
Top candidates receive structure prediction via:
- ESMFold for fast screening
- ColabFold (MMseqs2 + AlphaFold2) for high-confidence structures
Tiering
| Tier | Criteria |
|---|---|
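| Tier-1 Lead | AMP probability above 0.95, non-hemolytic, passes physicochemical filters, database-novel |
| Tier-1 (risk-flagged) | Meets the Tier-1 score, physicochemical, and novelty criteria but is predicted hemolytic or otherwise safety-flagged |
| Tier-2 Exploratory | AMP probability of 0.87 or higher, passes basic filters |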
| Hold | Borderline scores |
| Reject | Below thresholds |
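One possible reading of the tier table as code is sketched below. The table does not define "borderline", so the `hold_floor` parameter (default 0.80) is a purely illustrative assumption, as is treating Tier-2 as "0.87 or higher" to match the scoring threshold. The `Candidate` class and `assign_tier` function are hypothetical names, not part of the pipeline.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    amp_probability: float
    passes_physicochemical: bool
    is_novel: bool
    safety_flagged: bool  # predicted hemolytic or flagged by another safety tool


def assign_tier(c, hold_floor=0.80):
    """Assign a tier label; hold_floor is an illustrative assumption."""
    if c.passes_physicochemical and c.amp_probability >= 0.87:
        if c.amp_probability > 0.95 and c.is_novel:
            # Safety flags downgrade the label but never remove the candidate.
            return "Tier-1 (risk-flagged)" if c.safety_flagged else "Tier-1 Lead"
        return "Tier-2 Exploratory"
    if c.passes_physicochemical and c.amp_probability >= hold_floor:
        return "Hold"
    return "Reject"
```

In this sketch a safety flag only changes the Tier-1 label, which mirrors the flag-but-do-not-remove policy of the safety stack.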
All predictions are computational. The pipeline has known limitations, including potential training-data bias and limited hemolysis-prediction accuracy.