Monitor a run

After you submit a run, you follow it to completion: stream live metrics, watch the status lifecycle, read the result, and download the model. Every endpoint here is polled; there are no webhooks to configure.

The run lifecycle

A run moves through these statuses (from GET /jobs/{id}):

Status	Meaning
`queued`	Submitted, waiting for a worker. Usually seconds.
`provisioning`	A GPU worker is being allocated.
`staging`	The worker is loading the base model and your dataset.
`running`	Actively growing and training. Live metrics stream during this phase.
`checkpointing`	Writing a durable checkpoint (the resume point).
`uploading`	Saving the finished model to storage.
`completed`	✅ Done — result and model are available.
`failed`	❌ Stopped after exhausting retries. `error` explains why.
`preempted`	A worker was reclaimed; the run automatically requeues and resumes from its last checkpoint. Not an error — you'll just see `retries` increase.

Interruptions are safe

Fusion runs are checkpointed and preemption-safe. If a worker vanishes mid-run, the run resumes from its last checkpoint on a fresh worker, so checkpointed progress is not lost. A rising `retries` count is normal on spot capacity and does not mean anything is wrong.

bash

# quote the URL so the shell does not treat < and > as redirections
curl "https://console.axomlabs.ai/api/jobs/<job-id>" -H "Authorization: Bearer $AXOM_KEY"
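
Since everything is polled, a client typically wraps the status call in a loop that stops on `completed` or `failed` and treats every other status, including `preempted`, as still in flight. The following is a minimal Python sketch under stated assumptions: the `status` field name in the job body is assumed (the page lists the status values but not the response shape), and `get_job` and `wait_for_run` are illustrative helper names, not part of any official client.

```python
import json
import time
import urllib.request

API = "https://console.axomlabs.ai/api"
TERMINAL = {"completed", "failed"}  # every other status means the run is still in flight


def get_job(job_id, api_key):
    """GET /jobs/{id} and return the decoded JSON body."""
    req = urllib.request.Request(
        f"{API}/jobs/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_run(job_id, fetch, interval=10.0, sleep=time.sleep):
    """Poll until the run is completed or failed, then return the last job body.

    `fetch` is any callable that takes a job id and returns the job as a dict.
    Assumption: the body carries the lifecycle value in a `status` field.
    """
    while True:
        job = fetch(job_id)
        if job["status"] in TERMINAL:
            return job
        # queued, provisioning, staging, running, checkpointing, uploading,
        # and preempted all fall through to another poll.
        sleep(interval)
```

A call might look like `wait_for_run("<job-id>", lambda j: get_job(j, os.environ["AXOM_KEY"]))`. Passing `fetch` and `sleep` as parameters keeps the loop easy to exercise without network access.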

Live metrics

Stream per-step telemetry while a run trains. Pass `since_step` with the last step you have and append the new points; this is how the console draws the live loss curve.

bash

curl "https://console.axomlabs.ai/api/jobs/<job-id>/metrics?since_step=600" \
  -H "Authorization: Bearer $AXOM_KEY"

json

{ "job_id": "…", "count": 50, "latest_step": 1000,
  "points": [
    { "ts": "2026-06-04T02:45:20Z", "step": 620, "cycle": 1, "phase": "train",
      "loss": 0.78, "lr": 2.1e-05, "throughput": 1.39, "vram_gb": 23.6, "grad_norm": 1.02 }
  ] }

Field	What it tells you
`step` / `latest_step`	Progress through training.
`cycle`	Which training cycle this point is in.
`phase`	The current phase (e.g. `train`).
`loss`	Training loss — the number to watch. It should trend down; a smooth decline means healthy training.
`lr`	Learning rate (follows a schedule across the run).
`throughput`	Tokens/sec — processing speed.
`vram_gb`	GPU memory in use.
`grad_norm`	Gradient norm — stability indicator; wild spikes can signal instability.

Reading the loss curve: a steady downward trend is what you want. A flat curve from the start may mean the learning rate or dataset needs attention; loss that rises and diverges suggests instability (rare with defaults).
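
The append-as-you-poll pattern and the "loss should trend down" check can be sketched in a few lines of Python. This is an illustration rather than the console's implementation: `merge_points` and `loss_trend` are hypothetical helper names, the deduplication guards against `since_step` being inclusive (the page does not say), and the windowed comparison is just one simple way to judge the trend.

```python
def merge_points(curve, response):
    """Append points newer than the last one in `curve`; return how many were added."""
    last_step = curve[-1]["step"] if curve else -1
    added = 0
    for point in sorted(response["points"], key=lambda p: p["step"]):
        if point["step"] > last_step:  # skip anything we already hold
            curve.append(point)
            last_step = point["step"]
            added += 1
    return added


def loss_trend(curve, window=20):
    """Mean loss of the last `window` points minus the mean of the `window` before.

    Negative means loss is falling. Returns None until 2 * window points exist.
    """
    if len(curve) < 2 * window:
        return None
    losses = [p["loss"] for p in curve]
    recent = losses[-window:]
    previous = losses[-2 * window:-window]
    return sum(recent) / window - sum(previous) / window
```

After each merge, the next request would pass `since_step=curve[-1]["step"]`. A trend that stays near zero from the start, or turns positive and keeps growing, corresponds to the flat and diverging cases described above.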

Live logs

Watch the run's progress as a console stream: phase messages and live per-step training output. Poll with `?since=<latest_seq>` and append, exactly like metrics:

bash

curl "https://console.axomlabs.ai/api/jobs/<job-id>/logs?since=1840" \
  -H "Authorization: Bearer $AXOM_KEY"

json

{ "job_id": "…", "latest_seq": 1862,
  "lines": [
    { "seq": 1841, "ts": "…", "stream": "stdout", "text": "step 2000/6000 · loss 1.83 · lr 3.1e-05" },
    { "seq": 1842, "ts": "…", "stream": "stdout", "text": "Training complete — saving your model…" }
  ] }

Keep `latest_seq` and pass it as the next `since` to follow the tail. See GET /jobs/{id}/logs.
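
Following the tail amounts to carrying `latest_seq` from one response into the next request. Here is a small Python sketch of that bookkeeping. `drain_logs` and `follow` are illustrative names, and `fetch` stands in for whatever function performs the HTTP call with `?since=<n>`; starting from `since=0` is an assumption.

```python
import time


def drain_logs(response, since):
    """Return (new_lines, next_since) for one /logs response."""
    lines = [line for line in response.get("lines", []) if line["seq"] > since]
    lines.sort(key=lambda line: line["seq"])
    # Never move the cursor backwards, even if a response reports an older seq.
    next_since = max(since, response.get("latest_seq", since))
    return lines, next_since


def follow(fetch, since=0, polls=3, interval=2.0, sleep=time.sleep):
    """Yield new log lines over a fixed number of polls.

    `fetch(since)` must return a decoded /logs response body.
    """
    for i in range(polls):
        lines, since = drain_logs(fetch(since), since)
        yield from lines
        if i < polls - 1:
            sleep(interval)
```

In practice the loop would run until the job reaches a terminal status rather than for a fixed number of polls.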

Status & timeline

bash

curl "https://console.axomlabs.ai/api/jobs/<job-id>/events" -H "Authorization: Bearer $AXOM_KEY"

Returns the full event history — [ { ts, kind, payload } ] — every transition from submitted to completed, including checkpoints and any requeues. Useful for audit and debugging.
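
For debugging, it can help to turn the event list into a timeline with the gap between consecutive events. The sketch below assumes `ts` uses the same ISO 8601 UTC format as the metrics example (`2026-06-04T02:45:20Z`); the `timeline` helper and the `kind` values used when trying it out are hypothetical, since this page does not enumerate them.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse an ISO 8601 timestamp with a trailing Z (assumed format)."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def timeline(events):
    """Return (kind, seconds_since_previous_event) pairs in chronological order."""
    ordered = sorted(events, key=lambda e: parse_ts(e["ts"]))
    rows = []
    previous = None
    for event in ordered:
        now = parse_ts(event["ts"])
        gap = 0.0 if previous is None else (now - previous).total_seconds()
        rows.append((event["kind"], gap))
        previous = now
    return rows
```

Large gaps between events are a quick way to see where a run spent its wall-clock time, for example waiting to be requeued.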

The result

Once a run completes, the result summary carries the headline outcome. The endpoint returns 404 until the result exists:

bash

curl "https://console.axomlabs.ai/api/jobs/<job-id>/result" -H "Authorization: Bearer $AXOM_KEY"

json

{ "job_id": "…",
  "ppl_baseline": 11023, "ppl_final": 324, "ppl_delta_pct": -97.06,
  "loss_final": 0.327,
  "params_before": 1540000000, "params_after": 2360000000, "params_delta": 820000000,
  "function_preservation": 0.0009,
  "gpu_seconds": 1418, "tokens_processed": 1536000,
  "cost_usd": 25.00 }

Field	What it tells you
`ppl_baseline` → `ppl_final`	Perplexity before and after — lower is better. The headline quality number.
`ppl_delta_pct`	Percentage change in perplexity from baseline to final (negative = better).
`loss_final`	Final training loss.
`params_before` → `params_after`	Model size before and after. `params_delta` is what was added (train) or removed (contract).
`function_preservation`	How cleanly capacity was added — near-zero means the grown model started out behaving identically to the original.
`gpu_seconds` / `tokens_processed`	Compute the run used.
`cost_usd`	What you were billed — see Pricing.
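
The derived fields in the sample can be checked by hand. The Python sketch below recomputes them from the example values above; the formulas are inferred from those numbers rather than taken from a specification, and `derived` is an illustrative helper name.

```python
sample = {
    "ppl_baseline": 11023, "ppl_final": 324, "ppl_delta_pct": -97.06,
    "params_before": 1540000000, "params_after": 2360000000, "params_delta": 820000000,
    "gpu_seconds": 1418, "tokens_processed": 1536000,
}


def derived(result):
    """Recompute the derived numbers from the raw fields of a result body."""
    # Relative change in perplexity, as a percentage (inferred formula).
    ppl_delta_pct = (result["ppl_final"] - result["ppl_baseline"]) / result["ppl_baseline"] * 100
    # Parameters added by the run (inferred: after minus before).
    params_delta = result["params_after"] - result["params_before"]
    # Average tokens processed per GPU-second over the whole run.
    tokens_per_gpu_second = result["tokens_processed"] / result["gpu_seconds"]
    return round(ppl_delta_pct, 2), params_delta, round(tokens_per_gpu_second, 1)


print(derived(sample))
```

For the sample this gives a perplexity change of about -97.06 %, 820,000,000 added parameters, and roughly 1,083 tokens per GPU-second averaged over the run.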

Download the model

A completed run produces a new model in your account. Get a download manifest — presigned URLs for each file — and pull the weights directly from storage:

bash

# list your models, find the new one
curl https://console.axomlabs.ai/api/models -H "Authorization: Bearer $AXOM_KEY"

# get its download manifest, then fetch each file
curl "https://console.axomlabs.ai/api/models/<model-id>/download" -H "Authorization: Bearer $AXOM_KEY"
curl -o model-00001.safetensors "<url-from-manifest>"

The manifest's URLs are short-lived; re-request the manifest if they expire. See GET /models/{id}/download.
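
Because the URLs expire, a download script usually needs to re-request the manifest partway through a large model. The Python sketch below shows one way to structure that. Several things are assumptions and not documented on this page: the manifest shape `{"files": [{"name": ..., "url": ...}]}`, the use of HTTP 403 to signal an expired URL (common for presigned storage URLs), and the helper name `download_all`.

```python
import os
import urllib.error
import urllib.request


def download_all(get_manifest, download=urllib.request.urlretrieve, max_refreshes=3):
    """Fetch every file in the manifest, re-requesting it when a URL has expired.

    `get_manifest()` returns the decoded manifest; `download(url, path)` saves one file.
    Returns the sorted list of file names that were downloaded.
    """
    done = set()
    for _ in range(max_refreshes + 1):
        manifest = get_manifest()
        try:
            for entry in manifest["files"]:  # hypothetical manifest shape
                name = os.path.basename(entry["name"])  # never write outside the cwd
                if name not in done:
                    download(entry["url"], name)
                    done.add(name)
            return sorted(done)
        except urllib.error.HTTPError as err:
            if err.code != 403:  # assumption: expired presigned URLs answer 403
                raise
            # Otherwise loop: request a fresh manifest and resume with the remaining files.
    raise RuntimeError("manifest URLs kept expiring")
```

Files that were already downloaded are skipped after a refresh, so only the remaining weights are fetched with the new URLs.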

Next

Pricing: what the run cost and why.
Settings reference: tune the next run.
