Monitor a run

After you submit a run, you follow it to completion: stream live metrics, watch the status lifecycle, read the result, and download the model. Every endpoint here is polled; there are no webhooks to configure.

The run lifecycle

A run moves through these statuses (from GET /jobs/{id}):

| Status | Meaning |
| --- | --- |
| queued | Submitted, waiting for a worker. Usually seconds. |
| provisioning | A GPU worker is being allocated. |
| staging | The worker is loading the base model and your dataset. |
| running | Actively growing and training. Live metrics stream during this phase. |
| checkpointing | Writing a durable checkpoint (the resume point). |
| uploading | Saving the finished model to storage. |
| completed | ✅ Done: result and model are available. |
| failed | ❌ Stopped after exhausting retries. error explains why. |
| preempted | A worker was reclaimed; the run automatically requeues and resumes from its last checkpoint. Not an error: you'll just see retries increase. |

Interruptions are safe

Fusion runs are checkpointed and preemption-safe. If a worker vanishes mid-run, the run resumes from its last checkpoint on a fresh worker with no lost progress. A rising retries count is normal on spot capacity — it does not mean anything is wrong.

bash
curl https://console.axomlabs.ai/api/jobs/<job-id> -H "Authorization: Bearer $AXOM_KEY"
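
Because there are no webhooks, following a run to completion means calling this endpoint in a loop. The sketch below is one way that could look in Python using only the standard library. It assumes the response exposes the lifecycle value under a `status` key (the field name is not shown on this page), and the `wait_for_run` and `get_json` helpers and the 10-second interval are illustrative choices, not part of the API.

```python
import json
import time
import urllib.request

BASE_URL = "https://console.axomlabs.ai/api"
# Only these two statuses end a run; preempted runs requeue, so keep polling.
TERMINAL_STATUSES = {"completed", "failed"}


def get_json(path, api_key):
    """GET BASE_URL + path with the bearer token and decode the JSON body."""
    request = urllib.request.Request(
        BASE_URL + path, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_run(job_id, api_key, fetch=get_json, interval=10.0, sleep=time.sleep):
    """Poll GET /jobs/{id} until the run is completed or failed; return the job."""
    while True:
        job = fetch(f"/jobs/{job_id}", api_key)
        # ASSUMPTION: the lifecycle value lives under a "status" key.
        if job.get("status") in TERMINAL_STATUSES:
            return job
        sleep(interval)
```

A call such as `wait_for_run("<job-id>", os.environ["AXOM_KEY"])` would block until the run finishes. The `fetch` and `sleep` parameters exist only so the loop can be exercised without network access.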

Live metrics

Stream per-step telemetry while a run trains. Pass since_step with the last step you have, and append the new points — this is how the console draws the live loss curve.

bash
curl "https://console.axomlabs.ai/api/jobs/<job-id>/metrics?since_step=600" \
  -H "Authorization: Bearer $AXOM_KEY"
json
{ "job_id": "…", "count": 50, "latest_step": 1000,
  "points": [
    { "ts": "2026-06-04T02:45:20Z", "step": 620, "cycle": 1, "phase": "train",
      "loss": 0.78, "lr": 2.1e-05, "throughput": 1.39, "vram_gb": 23.6, "grad_norm": 1.02 }
  ] }

| Field | What it tells you |
| --- | --- |
| step / latest_step | Progress through training. |
| cycle | Which training cycle this point is in. |
| phase | The current phase (e.g. train). |
| loss | Training loss, the number to watch. It should trend down; a smooth decline means healthy training. |
| lr | Learning rate (follows a schedule across the run). |
| throughput | Tokens/sec: processing speed. |
| vram_gb | GPU memory in use. |
| grad_norm | Gradient norm, a stability indicator; wild spikes can signal instability. |
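
The since_step cursor described above can be sketched as a small Python helper that fetches one batch and returns the cursor for the next call. The `get_json` and `poll_metrics` names are hypothetical, and the sketch assumes that since_step returns only points after the given step.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://console.axomlabs.ai/api"


def get_json(path, api_key, params=None):
    """Authenticated GET that returns the decoded JSON body."""
    url = BASE_URL + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def poll_metrics(job_id, api_key, since_step, fetch=get_json):
    """Fetch points newer than since_step; return (points, next_since_step)."""
    body = fetch(f"/jobs/{job_id}/metrics", api_key, {"since_step": since_step})
    points = body.get("points", [])
    # The next cursor is the last step we hold; it stays put when nothing new arrived.
    next_step = max([since_step] + [point["step"] for point in points])
    return points, next_step
```

Calling `poll_metrics` on a timer and extending a local list with each batch gives the same append-only series the console plots.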

Reading the loss curve: a steady downward trend is what you want. A flat curve from the start may mean the learning rate or dataset needs attention; loss that rises and diverges suggests instability (rare with defaults).
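
As a rough illustration of that reading, the following sketch compares the average loss at the start of a series with the average at the end. The `loss_trend` function, the window size, and the 2% tolerance are arbitrary assumptions chosen for the example; this is not how the console evaluates runs.

```python
from statistics import mean


def loss_trend(losses, window=10, tolerance=0.02):
    """Label a loss series 'falling', 'flat', or 'rising' from its window means."""
    if len(losses) < 2 * window:
        return "insufficient data"
    start = mean(losses[:window])
    end = mean(losses[-window:])
    if start == 0:
        return "flat" if end == 0 else "rising"
    # Relative change between the first and last windows of the series.
    change = (end - start) / start
    if change < -tolerance:
        return "falling"
    if change > tolerance:
        return "rising"
    return "flat"
```

Feeding it the `loss` values collected from the metrics endpoint gives a quick check between polls. A "flat" or "rising" label is only a prompt to look at the curve itself.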

Live logs

Watch the run's progress as a console stream — phase messages and live per-step training output. Poll with ?since=<latest_seq> and append, exactly like metrics:

bash
curl "https://console.axomlabs.ai/api/jobs/<job-id>/logs?since=1840" \
  -H "Authorization: Bearer $AXOM_KEY"
json
{ "job_id": "…", "latest_seq": 1862,
  "lines": [
    { "seq": 1841, "ts": "…", "stream": "stdout", "text": "step 2000/6000 · loss 1.83 · lr 3.1e-05" },
    { "seq": 1842, "ts": "…", "stream": "stdout", "text": "Training complete — saving your model…" }
  ] }

Keep latest_seq and pass it as the next since to follow the tail. See GET /jobs/{id}/logs.
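
A minimal Python sketch of that tail-following pattern, assuming the response shape shown above. The `get_json` and `poll_logs` helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://console.axomlabs.ai/api"


def get_json(path, api_key, params=None):
    """Authenticated GET that returns the decoded JSON body."""
    url = BASE_URL + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def poll_logs(job_id, api_key, since, fetch=get_json):
    """Fetch log lines after `since`; return (lines, next_since)."""
    body = fetch(f"/jobs/{job_id}/logs", api_key, {"since": since})
    lines = body.get("lines", [])
    # latest_seq becomes the next cursor; fall back to the old one if it is absent.
    return lines, body.get("latest_seq", since)
```

Printing `line["text"]` for each returned line and calling `poll_logs` again with the returned cursor reproduces the console stream.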

Status & timeline

bash
curl https://console.axomlabs.ai/api/jobs/<job-id>/events -H "Authorization: Bearer $AXOM_KEY"

Returns the full event history — [ { ts, kind, payload } ] — every transition from submitted to completed, including checkpoints and any requeues. Useful for audit and debugging.
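
For debugging, it can help to order the history and count how often each kind occurs. The sketch below works on a list already decoded from the response. It assumes every ts is a UTC ISO 8601 string in the same format (so string order matches time order), and the kind values used in the usage note are invented for illustration.

```python
from collections import Counter


def summarize_events(events):
    """Return (events sorted by ts, Counter of event kinds)."""
    # Same-format UTC ISO 8601 timestamps sort correctly as plain strings.
    ordered = sorted(events, key=lambda event: event["ts"])
    counts = Counter(event["kind"] for event in ordered)
    return ordered, counts
```

With hypothetical kinds such as "checkpoint" and "requeued", `counts["requeued"]` would show how many times a run was rescheduled.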

The result

Once a run completes, the result summary carries the headline outcome. The endpoint returns 404 until the result exists:

bash
curl https://console.axomlabs.ai/api/jobs/<job-id>/result -H "Authorization: Bearer $AXOM_KEY"
json
{ "job_id": "…",
  "ppl_baseline": 11023, "ppl_final": 324, "ppl_delta_pct": -97.06,
  "loss_final": 0.327,
  "params_before": 1540000000, "params_after": 2360000000, "params_delta": 820000000,
  "function_preservation": 0.0009,
  "gpu_seconds": 1418, "tokens_processed": 1536000,
  "cost_usd": 25.00 }

| Field | What it tells you |
| --- | --- |
| ppl_baseline / ppl_final | Perplexity before and after; lower is better. The headline quality number. |
| ppl_delta_pct | The percentage improvement (negative = better). |
| loss_final | Final training loss. |
| params_before / params_after | Model size before and after; params_delta is what was added (train) or removed (contract). |
| function_preservation | How cleanly capacity was added; near-zero means the grown model started out behaving identically to the original. |
| gpu_seconds / tokens_processed | Compute the run used. |
| cost_usd | What you were billed; see Pricing. |
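
Two details of this endpoint lend themselves to a short sketch: treating 404 as "not ready yet", and seeing how the derived fields relate to the raw ones. The helper names below are hypothetical. The formulas in `derived_fields` are assumptions that happen to reproduce the sample response above; the page does not state how the service computes these values.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "https://console.axomlabs.ai/api"


def get_result(job_id, api_key, opener=urllib.request.urlopen):
    """Return the result summary, or None while the endpoint still answers 404."""
    request = urllib.request.Request(
        f"{BASE_URL}/jobs/{job_id}/result",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with opener(request, timeout=30) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # result not written yet
        raise


def derived_fields(result):
    """Recompute ppl_delta_pct and params_delta from the raw fields (assumed formulas)."""
    baseline, final = result["ppl_baseline"], result["ppl_final"]
    return {
        "ppl_delta_pct": round((final - baseline) / baseline * 100, 2),
        "params_delta": result["params_after"] - result["params_before"],
    }
```

For the sample response, (324 - 11023) / 11023 × 100 rounds to -97.06 and 2360000000 - 1540000000 is 820000000, matching the ppl_delta_pct and params_delta values shown.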

Download the model

A completed run produces a new model in your account. Get a download manifest — presigned URLs for each file — and pull the weights directly from storage:

bash
# list your models, find the new one
curl https://console.axomlabs.ai/api/models -H "Authorization: Bearer $AXOM_KEY"

# get its download manifest, then fetch each file
curl https://console.axomlabs.ai/api/models/<model-id>/download -H "Authorization: Bearer $AXOM_KEY"
curl -o model-00001.safetensors "<url-from-manifest>"

The manifest's URLs are short-lived; re-request the manifest if they expire. See GET /models/{id}/download.
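
Fetching every file in a manifest could be scripted along these lines. The manifest layout is not shown on this page, so the `{"files": [{"name": ..., "url": ...}]}` shape used here is purely an assumption, as is the `download_files` helper; adjust the keys to the actual response. The presigned URLs are fetched without the Authorization header, mirroring the curl example above.

```python
import shutil
import urllib.request
from pathlib import Path


def download_files(manifest, dest_dir):
    """Download each manifest entry into dest_dir and return the saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    # ASSUMPTION: the manifest is {"files": [{"name": ..., "url": ...}, ...]}.
    for entry in manifest["files"]:
        # Keep only the final path component so a name cannot escape dest_dir.
        target = dest / Path(entry["name"]).name
        with urllib.request.urlopen(entry["url"], timeout=60) as response:
            with open(target, "wb") as out:
                shutil.copyfileobj(response, out)  # stream to disk in chunks
        saved.append(target)
    return saved
```

If a download fails because a URL has expired, request a fresh manifest and call the helper again.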
