Monitor a run
After you submit a run, you follow it to completion: stream live metrics, watch the status lifecycle, read the result, and download the model. Every endpoint here is polled; there are no webhooks to configure.
The run lifecycle
A run moves through these statuses (from GET /jobs/{id}):
| Status | Meaning |
|---|---|
| queued | Submitted, waiting for a worker. Usually seconds. |
| provisioning | A GPU worker is being allocated. |
| staging | The worker is loading the base model and your dataset. |
| running | Actively growing and training. Live metrics stream during this phase. |
| checkpointing | Writing a durable checkpoint (the resume point). |
| uploading | Saving the finished model to storage. |
| completed | ✅ Done. The result and model are available. |
| failed | ❌ Stopped after exhausting retries. The error field explains why. |
| preempted | A worker was reclaimed; the run automatically requeues and resumes from its last checkpoint. Not an error: you'll just see the retries count increase. |
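The lifecycle above can be followed with a simple polling loop. The sketch below is a minimal illustration, not an official client: it assumes the job body exposes the current status under a `status` field (the field name is not confirmed here), and `fetch_job`, `wait_for_run`, and the 10-second interval are hypothetical choices. Only `completed` and `failed` are treated as terminal, since `preempted` runs requeue on their own.

```python
import json
import time
import urllib.request

API_BASE = "https://console.axomlabs.ai/api"
# Only these two statuses end a run; preempted runs requeue automatically.
TERMINAL = {"completed", "failed"}


def fetch_job(job_id, api_key):
    """GET /jobs/{id} and return the decoded JSON body."""
    req = urllib.request.Request(
        f"{API_BASE}/jobs/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_run(get_job, interval=10.0, sleep=time.sleep):
    """Poll get_job() until the run reaches a terminal status, then return the job."""
    while True:
        job = get_job()
        status = job["status"]  # assumed field name
        if status in TERMINAL:
            return job
        sleep(interval)


# Usage (performs real requests, so it is left commented out):
# import os
# job = wait_for_run(lambda: fetch_job("<job-id>", os.environ["AXOM_KEY"]))
```

Passing the fetcher in as a callable keeps the loop easy to test with canned responses.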
Interruptions are safe
Fusion runs are checkpointed and preemption-safe. If a worker vanishes mid-run, the run resumes from its last checkpoint on a fresh worker with no lost progress. A rising retries count is normal on spot capacity — it does not mean anything is wrong.
bash
curl https://console.axomlabs.ai/api/jobs/<job-id> -H "Authorization: Bearer $AXOM_KEY"

Live metrics
Stream per-step telemetry while a run trains. Pass since_step with the last step you have, and append the new points — this is how the console draws the live loss curve.
bash
curl "https://console.axomlabs.ai/api/jobs/<job-id>/metrics?since_step=600" \
  -H "Authorization: Bearer $AXOM_KEY"

json
{ "job_id": "…", "count": 50, "latest_step": 1000,
"points": [
{ "ts": "2026-06-04T02:45:20Z", "step": 620, "cycle": 1, "phase": "train",
"loss": 0.78, "lr": 2.1e-05, "throughput": 1.39, "vram_gb": 23.6, "grad_norm": 1.02 }
] }

| Field | What it tells you |
|---|---|
| step / latest_step | Progress through training. |
| cycle | Which training cycle this point is in. |
| phase | The current phase (e.g. train). |
| loss | Training loss, the number to watch. It should trend down; a smooth decline means healthy training. |
| lr | Learning rate (follows a schedule across the run). |
| throughput | Tokens/sec (processing speed). |
| vram_gb | GPU memory in use. |
| grad_norm | Gradient norm, a stability indicator; wild spikes can signal instability. |
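The "pass since_step and append" pattern can be sketched as a small merge step. This is an illustration under assumptions: `metrics_url` and `merge_points` are hypothetical helper names, starting from `since_step=0` on the first request is a guess, and the code drops any point it already holds because it is not stated here whether `since_step` is inclusive.

```python
from urllib.parse import urlencode

API_BASE = "https://console.axomlabs.ai/api"


def metrics_url(job_id, since_step):
    """Build the metrics URL for everything after the given step."""
    query = urlencode({"since_step": since_step})
    return f"{API_BASE}/jobs/{job_id}/metrics?{query}"


def merge_points(history, response):
    """Append unseen points from one metrics response; return the next since_step."""
    last_step = history[-1]["step"] if history else -1
    for point in sorted(response.get("points", []), key=lambda p: p["step"]):
        if point["step"] > last_step:  # skip anything already stored
            history.append(point)
            last_step = point["step"]
    # Assumption: 0 is a valid starting cursor when nothing has arrived yet.
    return history[-1]["step"] if history else 0
```

Each poll would then fetch `metrics_url(job_id, cursor)`, call `merge_points`, and carry the returned value into the next request.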
Reading the loss curve: a steady downward trend is what you want. A flat curve from the start may mean the learning rate or dataset needs attention; loss that rises and diverges suggests instability (rare with defaults).
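One rough way to apply this reading automatically is to compare the average loss at the start of the series with the average at the end. The sketch below is a heuristic, not part of the API: `classify_loss`, the 20-point window, and the 2% tolerance are all illustrative assumptions that you would tune for your own runs.

```python
from statistics import mean


def classify_loss(losses, window=20, flat_tol=0.02):
    """Label a loss series 'improving', 'flat', 'rising', or 'insufficient'."""
    if len(losses) < 2 * window:
        return "insufficient"  # not enough points for two separate windows
    start = mean(losses[:window])
    end = mean(losses[-window:])
    # Relative change between the first and last window averages.
    change = (end - start) / abs(start) if start else 0.0
    if change < -flat_tol:
        return "improving"
    if change > flat_tol:
        return "rising"
    return "flat"
```

Averaging over windows keeps a single noisy step from flipping the label; it does not detect short spikes, for which `grad_norm` is the better signal.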
Live logs
Watch the run's progress as a console stream — phase messages and live per-step training output. Poll with ?since=<latest_seq> and append, exactly like metrics:
bash
curl "https://console.axomlabs.ai/api/jobs/<job-id>/logs?since=1840" \
  -H "Authorization: Bearer $AXOM_KEY"

json
{ "job_id": "…", "latest_seq": 1862,
"lines": [
{ "seq": 1841, "ts": "…", "stream": "stdout", "text": "step 2000/6000 · loss 1.83 · lr 3.1e-05" },
{ "seq": 1842, "ts": "…", "stream": "stdout", "text": "Training complete — saving your model…" }
] }

Keep latest_seq and pass it as the next since to follow the tail. See GET /jobs/{id}/logs.
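Following the tail can be sketched as a generator that carries `latest_seq` forward as the cursor. This is a minimal illustration: `follow_logs` and its parameters are hypothetical, `fetch_page` stands in for whatever function performs the GET and returns the decoded JSON, and starting from `since=0` is an assumption.

```python
import time


def follow_logs(fetch_page, since=0, interval=2.0, sleep=time.sleep, max_idle_polls=None):
    """Yield log lines in seq order, re-polling with latest_seq as the cursor."""
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        page = fetch_page(since)
        # Keep only lines newer than the cursor, in case the server repeats one.
        lines = [ln for ln in page.get("lines", []) if ln["seq"] > since]
        for line in sorted(lines, key=lambda ln: ln["seq"]):
            yield line
        since = max(since, page.get("latest_seq", since))
        if lines:
            idle = 0
        else:
            idle += 1
            sleep(interval)  # nothing new: wait before asking again
```

With `max_idle_polls=None` the generator follows the log indefinitely; a caller would typically stop once the run reaches a terminal status.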
Status & timeline
bash
curl https://console.axomlabs.ai/api/jobs/<job-id>/events -H "Authorization: Bearer $AXOM_KEY"

Returns the full event history as [ { ts, kind, payload } ]: every transition from submitted to completed, including checkpoints and any requeues. Useful for audit and debugging.
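For a quick audit, the event list can be reduced to a count per `kind` plus the wall-clock span of the run. The sketch below assumes event timestamps use the same ISO 8601 form as the metrics points (for example `2026-06-04T02:45:20Z`); `parse_ts` and `summarize_events` are hypothetical helpers, and the exact `kind` strings are not listed here, so the code treats them as opaque labels.

```python
from collections import Counter
from datetime import datetime


def parse_ts(ts):
    """Parse an ISO 8601 timestamp such as 2026-06-04T02:45:20Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def summarize_events(events):
    """Return (count of each kind, seconds from first to last event)."""
    kinds = Counter(e["kind"] for e in events)
    if not events:
        return kinds, 0.0
    times = sorted(parse_ts(e["ts"]) for e in events)
    return kinds, (times[-1] - times[0]).total_seconds()
```

A high count for whichever kind marks a requeue, together with a long span, would point at heavy preemption rather than slow training.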
The result
Once a run completes, the result summary carries the headline outcome. The endpoint returns 404 until the result exists:
bash
curl https://console.axomlabs.ai/api/jobs/<job-id>/result -H "Authorization: Bearer $AXOM_KEY"

json
{ "job_id": "…",
  "ppl_baseline": 11023, "ppl_final": 324, "ppl_delta_pct": -97.06,
  "loss_final": 0.327,
  "params_before": 1540000000, "params_after": 2360000000, "params_delta": 820000000,
  "function_preservation": 0.0009,
  "gpu_seconds": 1418, "tokens_processed": 1536000,
  "cost_usd": 25.00 }

| Field | What it tells you |
|---|---|
| ppl_baseline → ppl_final | Perplexity before and after; lower is better. The headline quality number. |
| ppl_delta_pct | Percentage change in perplexity (negative = better). |
| loss_final | Final training loss. |
| params_before → params_after | Model size before and after. params_delta is what was added (train) or removed (contract). |
| function_preservation | How cleanly capacity was added; near-zero means the grown model started out behaving identically to the original. |
| gpu_seconds / tokens_processed | Compute the run used. |
| cost_usd | What you were billed; see Pricing. |
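The derived fields can be recomputed from the raw ones as a sanity check. The formulas below are inferred from the sample response rather than documented: they reproduce its `ppl_delta_pct` of -97.06 and `params_delta` of 820000000, but the sign convention for contract runs is an assumption. `check_result` and `params_growth_pct` are hypothetical names.

```python
def check_result(result):
    """Recompute the derived fields of a result summary from its raw fields."""
    ppl_delta_pct = (
        (result["ppl_final"] - result["ppl_baseline"]) / result["ppl_baseline"] * 100
    )
    params_delta = result["params_after"] - result["params_before"]
    params_growth_pct = params_delta / result["params_before"] * 100
    return {
        "ppl_delta_pct": round(ppl_delta_pct, 2),
        "params_delta": params_delta,
        "params_growth_pct": round(params_growth_pct, 2),
    }


sample = {
    "ppl_baseline": 11023, "ppl_final": 324,
    "params_before": 1540000000, "params_after": 2360000000,
}
print(check_result(sample))
# {'ppl_delta_pct': -97.06, 'params_delta': 820000000, 'params_growth_pct': 53.25}
```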
Download the model
A completed run produces a new model in your account. Get a download manifest — presigned URLs for each file — and pull the weights directly from storage:
bash
# list your models, find the new one
curl https://console.axomlabs.ai/api/models -H "Authorization: Bearer $AXOM_KEY"
# get its download manifest, then fetch each file
curl https://console.axomlabs.ai/api/models/<model-id>/download -H "Authorization: Bearer $AXOM_KEY"
curl -o model-00001.safetensors "<url-from-manifest>"

The manifest's URLs are short-lived; re-request the manifest if they expire. See GET /models/{id}/download.
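The "re-request the manifest if the URLs expire" advice can be sketched as a retry loop. Several details here are assumptions rather than documented behavior: the manifest shape (`{"files": [{"name": ..., "url": ...}]}`) is hypothetical, an expired presigned URL is assumed to surface as HTTP 403, and `download_all` is an illustrative helper. `get_manifest` and `fetch_file` are callables you supply.

```python
import urllib.error


def download_all(get_manifest, fetch_file, max_refreshes=3):
    """Download every file in the manifest, refreshing it when a URL has expired."""
    done = set()
    refreshes = 0
    while True:
        manifest = get_manifest()
        try:
            for entry in manifest["files"]:  # hypothetical manifest shape
                if entry["name"] not in done:
                    fetch_file(entry["url"], entry["name"])
                    done.add(entry["name"])
            return sorted(done)
        except urllib.error.HTTPError as err:
            # Assumption: an expired presigned URL answers with 403.
            if err.code != 403 or refreshes >= max_refreshes:
                raise
            refreshes += 1  # loop again with a fresh manifest
```

In practice `fetch_file` could be `urllib.request.urlretrieve`, which raises `HTTPError` on a failed request. Files that already finished are skipped after a refresh, so only the remaining ones are fetched again.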
Next
- Pricing — what the run cost and why.
- Settings reference — tune the next run.