
Jobs & runs

A job is one run: a training run (type: "train"), a fine-tune run (type: "finetune"), or a contraction run (type: "contract"). This page is the endpoint reference; for the how-and-why, see the guides: Train a model, Contract a model, Monitor & retrieve.

Submit a run — POST /jobs

Returns the created job immediately (status queued); the run executes asynchronously.

Training run

json
// request
{ "type": "train", "idempotency_key": "run-1",
  "base_model_id": "<uuid>", "dataset_id": "<uuid>",
  "hyperparams": { "turns": 2, "output_name": "my-model" } }
| Field | Type | Notes |
| --- | --- | --- |
| type | "train" | Required. |
| idempotency_key | string | Required. Re-using a key returns the same job. |
| base_model_id | uuid | Required. The model to grow, from GET /base-models or your own GET /models. |
| dataset_id | uuid | Required. From GET /datasets. |
| hyperparams.turns | int | Quality lever: full training passes (default 1; 2 recommended). |
| hyperparams.output_name | string | Name for the new model. |
| hyperparams.cycles / steps_per_cycle / lr / seq_len | num | Advanced (optional). Standard tuning knobs; all have defaults. |

→ Full setting guidance: Settings reference.
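Because re-using an idempotency_key returns the same job, a client can safely resend POST /jobs after a timeout as long as it resends the same key. The sketch below shows one possible way to get a stable key by hashing the request contents. The helper names and the hashing scheme are illustrative assumptions, not part of the API.

```python
import hashlib
import json


def derive_idempotency_key(body: dict, prefix: str = "run") -> str:
    """Hypothetical helper: derive a stable key from the request contents."""
    # Canonical JSON (sorted keys, no extra spaces) so equal bodies hash equally.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}-{digest}"


def build_train_request(base_model_id: str, dataset_id: str,
                        turns: int = 2, output_name: str = "my-model") -> dict:
    """Assemble a POST /jobs body for a training run from the documented fields."""
    body = {
        "type": "train",
        "base_model_id": base_model_id,
        "dataset_id": dataset_id,
        "hyperparams": {"turns": turns, "output_name": output_name},
    }
    # The key is added last so it does not feed into its own hash.
    body["idempotency_key"] = derive_idempotency_key(body)
    return body


first = build_train_request("<uuid>", "<uuid>")
retry = build_train_request("<uuid>", "<uuid>")
print(first["idempotency_key"] == retry["idempotency_key"])  # prints True
```

A content-derived key also means that deliberately re-running an identical configuration would return the earlier job; in that case, mix something unique (for example a run counter) into the hashed body.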

Fine-tune run

Adapt a model's behavior (LoRA) — no size change. → Fine-tune guide.

json
// request
{ "type": "finetune", "idempotency_key": "ft-1",
  "source_model_id": "<uuid>", "dataset_id": "<uuid>",
  "hyperparams": { "lora_rank": 32, "steps": 2000, "output_name": "my-model" } }
| Field | Type | Notes |
| --- | --- | --- |
| type | "finetune" | Required. |
| source_model_id | uuid | The model to fine-tune (top-level or in hyperparams). Or use base_model_id for a catalog base. |
| dataset_id | uuid | Required. |
| hyperparams.lora_rank | int | Adapter capacity (default 32). |
| hyperparams.steps | int | Training steps (default 2000). |
| hyperparams.output_name | string | Name for the new model. |
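The model to fine-tune can be referenced either as source_model_id or, for a catalog base, as base_model_id. A minimal sketch of a request builder follows; it assumes exactly one of the two should be sent, which this reference does not state explicitly. The function name is hypothetical.

```python
from typing import Optional


def build_finetune_request(idempotency_key: str, dataset_id: str, *,
                           source_model_id: Optional[str] = None,
                           base_model_id: Optional[str] = None,
                           lora_rank: int = 32, steps: int = 2000,
                           output_name: str = "my-model") -> dict:
    """Assemble a POST /jobs body for a fine-tune run (illustrative helper)."""
    # Assumption: send exactly one model reference, never both or neither.
    if (source_model_id is None) == (base_model_id is None):
        raise ValueError("pass exactly one of source_model_id or base_model_id")
    body = {
        "type": "finetune",
        "idempotency_key": idempotency_key,
        "dataset_id": dataset_id,
        "hyperparams": {"lora_rank": lora_rank, "steps": steps,
                        "output_name": output_name},
    }
    # Place the model reference at the top level, as in the request example.
    if source_model_id is not None:
        body["source_model_id"] = source_model_id
    else:
        body["base_model_id"] = base_model_id
    return body


body = build_finetune_request("ft-1", "<uuid>", source_model_id="<uuid>")
print(sorted(body))
```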

Contraction run

json
// request
{ "type": "contract", "idempotency_key": "contract-1",
  "hyperparams": { "source_model_id": "<uuid>",
                   "contraction_ratio": 0.5, "num_layers_to_contract": 8 } }
| Field | Type | Notes |
| --- | --- | --- |
| type | "contract" | Required. |
| idempotency_key | string | Required. |
| hyperparams.source_model_id | uuid | Required. The model to shrink. |
| hyperparams.contraction_ratio | float | Prune aggressiveness, 0–1 (default 0.5). |
| hyperparams.num_layers_to_contract | int | Layers to process (default 8). |

No dataset_id — contraction operates on the model itself.
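A contraction request differs from the other two in shape: source_model_id sits inside hyperparams and there is no dataset_id. The sketch below builds such a body and checks values client-side before sending. The function name is hypothetical, and treating the 0–1 range as inclusive and the layer count as a positive integer are assumptions.

```python
def build_contract_request(idempotency_key: str, source_model_id: str,
                           contraction_ratio: float = 0.5,
                           num_layers_to_contract: int = 8) -> dict:
    """Assemble a POST /jobs body for a contraction run (illustrative helper)."""
    # Documented range is 0-1; allowing both endpoints is an assumption.
    if not 0.0 <= contraction_ratio <= 1.0:
        raise ValueError(f"contraction_ratio must be within 0-1, got {contraction_ratio}")
    # Assumption: a positive whole number of layers is expected.
    if not isinstance(num_layers_to_contract, int) or num_layers_to_contract < 1:
        raise ValueError("num_layers_to_contract must be a positive integer")
    return {
        "type": "contract",
        "idempotency_key": idempotency_key,
        # source_model_id goes inside hyperparams; no dataset_id is sent.
        "hyperparams": {
            "source_model_id": source_model_id,
            "contraction_ratio": contraction_ratio,
            "num_layers_to_contract": num_layers_to_contract,
        },
    }


body = build_contract_request("contract-1", "<uuid>")
print("dataset_id" in body)  # prints False
```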

Response (201)

json
{ "id": "<uuid>", "type": "train", "status": "queued", "retries": 0,
  "last_checkpoint_step": 0, "output_model_id": null,
  "created_at": "…", "started_at": null, "finished_at": null }

Recommended config — POST /training/recommend

Returns a tailored starting configuration and a safe tuning window for a given model and dataset, so you don't start from one-size-fits-all defaults. The recommended learning rate scales with model size; the recommended amount of training scales with dataset size.

json
// request
{ "base_model_id": "<uuid>", "dataset_id": "<uuid>" }

// response
{ "model_params": 7000000000, "dataset_tokens_est": 2100000,
  "recommended": { "turns": 2, "lr": 7e-5, "cycles": 6, "steps_per_cycle": 1000, "seq_len": 256 },
  "windows": { "lr": {"min":3.5e-5,"max":1.4e-4}, "cycles": {"min":3,"max":10}, "...": {} },
  "notes": ["≈3 epochs over your ~2.1M-token dataset", "Learning rate tuned for a ~7B model"],
  "disclaimer": "Recommended starting points — tune within the window for your model and data." }

You can also pass params / dataset_tokens / dataset_bytes directly instead of IDs.
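One way to use the windows object is to clamp any manual override back into the safe range before submitting a run. This is a minimal sketch under that assumption; clamp_to_windows is a hypothetical client-side helper, and the sample values are copied from the response above.

```python
def clamp_to_windows(hyperparams: dict, windows: dict) -> dict:
    """Return a copy of hyperparams with each value clamped to its window."""
    clamped = dict(hyperparams)
    for name, window in windows.items():
        # Skip windows for settings we are not sending, or without min/max bounds.
        if name in clamped and "min" in window and "max" in window:
            clamped[name] = min(max(clamped[name], window["min"]), window["max"])
    return clamped


# Values taken from the sample response.
recommended = {"turns": 2, "lr": 7e-5, "cycles": 6,
               "steps_per_cycle": 1000, "seq_len": 256}
windows = {"lr": {"min": 3.5e-5, "max": 1.4e-4},
           "cycles": {"min": 3, "max": 10}}

# Start from the recommendation, then apply two out-of-window overrides.
chosen = dict(recommended, lr=3e-4, cycles=2)
safe = clamp_to_windows(chosen, windows)
print(safe["lr"], safe["cycles"])  # prints 0.00014 3
```

Settings without a window entry (turns in this sample) pass through unchanged.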

The job object

Returned by GET /jobs (list) and GET /jobs/{id} (single).

| Field | Meaning |
| --- | --- |
| id | Job ID, used for all follow-up calls. |
| type | train, finetune, or contract. |
| status | Lifecycle state (see the lifecycle). |
| retries | Number of times the job was reclaimed after an interruption. Each reclaim resumes from the last checkpoint and is not an error. |
| last_checkpoint_step | Step of the last durable checkpoint, which is the resume point. |
| output_model_id | null until complete, then your new model's ID. |
| created_at / started_at / finished_at | Timestamps. |
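Since a run executes asynchronously, a client typically polls GET /jobs/{id} until the job is done. The sketch below stops when output_model_id becomes non-null, which is the documented completion signal. It does not hard-code any status names: stop_statuses is a hypothetical parameter you would fill in from the lifecycle guide so that failed runs also end the wait. get_job stands in for whatever HTTP call you use.

```python
import time
from typing import Callable, Iterable


def wait_for_job(get_job: Callable[[str], dict], job_id: str, *,
                 stop_statuses: Iterable[str] = (),
                 interval_s: float = 5.0, max_polls: int = 720,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job has an output model or reaches a caller-supplied status."""
    stop = set(stop_statuses)
    for attempt in range(max_polls):
        job = get_job(job_id)
        # output_model_id is null until the run completes.
        if job.get("output_model_id") is not None or job.get("status") in stop:
            return job
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")


# Demo with a fake fetcher; "<final status>" is a placeholder, not a real status name.
_responses = iter([
    {"id": "job-1", "status": "queued", "retries": 0, "output_model_id": None},
    {"id": "job-1", "status": "queued", "retries": 1, "output_model_id": None},
    {"id": "job-1", "status": "<final status>", "retries": 1, "output_model_id": "model-123"},
])
job = wait_for_job(lambda _id: next(_responses), "job-1", sleep=lambda s: None)
print(job["output_model_id"])  # prints model-123
```

A rising retries count during polling only means the run was reclaimed and resumed from its checkpoint, so the loop deliberately ignores it.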

GET /jobs

List your jobs, newest first.

GET /jobs/{id}

A single job object.

Telemetry

GET /jobs/{id}/metrics?since_step=N

Per-step training telemetry. Pass since_step to fetch only points newer than the last you have, and append client-side (this drives the live loss curve).

json
{ "job_id": "…", "count": 50, "latest_step": 1000,
  "points": [
    { "ts": "…", "step": 620, "cycle": 1, "phase": "train",
      "loss": 0.78, "lr": 2.1e-05, "throughput": 1.39, "vram_gb": 23.6, "grad_norm": 1.02 }
  ] }

Point fields: step, cycle, phase, loss, lr, throughput (tok/s), vram_gb, grad_norm. → How to read these.
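The incremental pattern described above can be sketched as a small client-side merge: keep the curve in a list, append only points with a higher step, and reuse the last step as the next since_step. The function name is hypothetical, and the sketch assumes at most one point per step; if a job can emit several points for the same step (for example in different phases), key on more than step.

```python
from typing import Optional


def append_new_points(curve: list, response: dict) -> Optional[int]:
    """Append unseen points to curve; return the step to pass as since_step next."""
    seen = curve[-1]["step"] if curve else None
    for point in sorted(response["points"], key=lambda p: p["step"]):
        # Defensive de-duplication in case the server repeats the boundary point.
        if seen is None or point["step"] > seen:
            curve.append(point)
            seen = point["step"]
    return seen  # None means nothing fetched yet: omit since_step


curve = []
page1 = {"points": [{"step": 600, "loss": 0.81}, {"step": 620, "loss": 0.78}]}
page2 = {"points": [{"step": 620, "loss": 0.78}, {"step": 640, "loss": 0.75}]}

since = append_new_points(curve, page1)
since = append_new_points(curve, page2)
print([p["step"] for p in curve], since)  # prints [600, 620, 640] 640
```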

GET /jobs/{id}/logs?since=<seq>

The live console tail — curated progress lines (phases + per-step training output). Poll with ?since=<latest_seq> and append the new lines, identical to the metrics pattern.

json
{ "job_id": "…", "latest_seq": 1862,
  "lines": [
    { "seq": 1841, "ts": "…", "stream": "stdout", "text": "step 2000/6000 · loss 1.83 · lr 3.1e-05" },
    { "seq": 1842, "ts": "…", "stream": "stdout", "text": "Training complete — saving your model…" }
  ] }

The tail is capped to the most recent lines for live viewing.
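For a quick progress indicator, a client could parse the per-step lines in the tail. The sketch below assumes the text format seen in the sample ("step 2000/6000 · loss 1.83 · lr 3.1e-05"); that format is inferred from one example and is not a documented contract, so unmatched lines are ignored rather than treated as errors. For anything beyond display, the structured metrics endpoint is the safer source.

```python
import re
from typing import Optional

# Pattern inferred from the sample log line; an assumption, not a stable format.
PROGRESS = re.compile(r"step (\d+)/(\d+) · loss ([0-9.]+) · lr ([0-9.eE+-]+)")


def parse_progress(text: str) -> Optional[dict]:
    """Extract step, total, loss, and lr from a progress line, or return None."""
    match = PROGRESS.search(text)
    if match is None:
        return None
    step, total = int(match.group(1)), int(match.group(2))
    return {
        "step": step,
        "total": total,
        "loss": float(match.group(3)),
        "lr": float(match.group(4)),
        "fraction": step / total if total else 0.0,
    }


info = parse_progress("step 2000/6000 · loss 1.83 · lr 3.1e-05")
print(info["step"], info["total"], info["loss"])  # prints 2000 6000 1.83
```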

GET /jobs/{id}/events

The run timeline: an array of { ts, kind, payload } entries, one per state transition.

GET /jobs/{id}/result

Headline result for a finished run (404 until it exists).

json
{ "job_id": "…",
  "ppl_baseline": 11023, "ppl_final": 324, "ppl_delta_pct": -97.06, "loss_final": 0.327,
  "params_before": 1540000000, "params_after": 2360000000, "params_delta": 820000000,
  "function_preservation": 0.0009,
  "gpu_seconds": 1418, "tokens_processed": 1536000, "cost_usd": 25.00 }
| Field | Notes |
| --- | --- |
| ppl_baseline / ppl_final / ppl_delta_pct | Quality: perplexity before → after, and % change. |
| params_before / params_after / params_delta | Size before/after and the change. |
| function_preservation | How cleanly capacity was added (near-zero = clean). |
| gpu_seconds / tokens_processed | Compute meters. |
| cost_usd | Billed amount (see Pricing). |

→ Field-by-field interpretation.
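The derived fields can be cross-checked from the raw ones. The sketch below recomputes ppl_delta_pct and params_delta for the sample result; the formulas are inferred from the sample numbers (they reproduce -97.06 and 820000000) rather than taken from a stated definition, and the function name is hypothetical.

```python
def recompute_deltas(result: dict) -> dict:
    """Recompute the derived result fields from the raw before/after values."""
    # Percentage change in perplexity relative to the baseline (negative = better).
    ppl_delta_pct = (result["ppl_final"] - result["ppl_baseline"]) / result["ppl_baseline"] * 100
    # Parameter count change (positive for a run that grows the model).
    params_delta = result["params_after"] - result["params_before"]
    return {"ppl_delta_pct": round(ppl_delta_pct, 2), "params_delta": params_delta}


# Values copied from the sample response.
sample = {
    "ppl_baseline": 11023, "ppl_final": 324, "ppl_delta_pct": -97.06,
    "params_before": 1540000000, "params_after": 2360000000,
    "params_delta": 820000000,
}
derived = recompute_deltas(sample)
print(derived)  # prints {'ppl_delta_pct': -97.06, 'params_delta': 820000000}
```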
