Training Arguments

This guide explains TrainingArguments in detail. For a conceptual overview and minimal usage, see Tutorial: Training. For the full API (including defaults), see API reference.

TrainingArguments follows Hugging Face Trainer conventions where applicable (e.g. num_train_epochs, learning_rate, eval_steps).


Output and paths

Argument Default Description
experiment_dir None Root directory for this experiment. When set, checkpoints, caches, and plots use subpaths under it. Required for saving models and using encoder/decoder cache. With run_id, output goes under experiment_dir/run_id/.
output_dir None Directory to save the trained model. If None and experiment_dir is set, uses experiment_dir/model (or experiment_dir/run_id/model).
use_cache False When True, skip training if a saved model exists at the output path; skip encoder/decoder recomputation when cache exists. Use False to force retrain or recompute.
add_identity_for_other_classes False When True, add identity examples (factual == alternative) for classes not in the target pair. Important for multi-class data so extra classes do not push the model arbitrarily.
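
The output_dir fallback described in the table can be pictured as a small path helper. This is only a sketch of the documented rule: `resolve_output_dir` is a hypothetical name used for illustration, not a function of the library, and the handling of an unset experiment_dir is an assumption.

```python
from pathlib import Path
from typing import Optional


def resolve_output_dir(
    experiment_dir: Optional[str],
    output_dir: Optional[str] = None,
    run_id: Optional[str] = None,
) -> Optional[Path]:
    """Hypothetical helper mirroring the documented output_dir fallback."""
    # An explicit output_dir always wins.
    if output_dir is not None:
        return Path(output_dir)
    # Assumption: without experiment_dir there is nowhere to save.
    if experiment_dir is None:
        return None
    root = Path(experiment_dir)
    # With run_id, output goes under experiment_dir/run_id/.
    if run_id is not None:
        root = root / run_id
    return root / "model"
```

For example, `resolve_output_dir("./out", run_id="run1")` yields the path `out/run1/model`, matching the experiment_dir/run_id/model layout described above.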

GRADIEND interpretation (source / target)

Argument Default Description
source "alternative" Which gradient feeds the encoder: "factual", "alternative", or "diff". The common choice is "alternative", so the encoder sees the alternative gradient.
target "diff" Which quantity the decoder predicts: "factual", "alternative", or "diff". The common choice is "diff", so the decoder predicts the difference; at inference you combine base model + decoder(diff).

Training loop (HF-style)

Argument Default Description
train_batch_size 32 Batch size for training.
train_max_size None If set, cap training samples per feature class (for speed/memory). None = use all data.
learning_rate 1e-5 Peak learning rate.
num_train_epochs 3 Number of training epochs. Ignored when max_steps > 0.
max_steps -1 If > 0, total training steps; overrides num_train_epochs. -1 = use epochs.
weight_decay 1e-2 Weight decay for the optimizer.
adam_epsilon 1e-8 Epsilon for Adam/AdamW.
optim "adamw" Optimizer: "adamw" or "adam".
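
The interaction between max_steps and num_train_epochs can be sketched as follows. The helper name `total_training_steps` is hypothetical, and the sketch assumes one optimizer step per batch with the last partial batch kept; the library's actual step accounting may differ.

```python
import math


def total_training_steps(
    num_samples: int,
    train_batch_size: int = 32,
    num_train_epochs: int = 3,
    max_steps: int = -1,
) -> int:
    """Hypothetical helper: how many steps a run takes under the documented rule."""
    # max_steps > 0 overrides num_train_epochs entirely.
    if max_steps > 0:
        return max_steps
    # Assumption: one step per batch, partial final batch included.
    steps_per_epoch = math.ceil(num_samples / train_batch_size)
    return steps_per_epoch * num_train_epochs
```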

Evaluation during training

Argument Default Description
eval_strategy "steps" When to run evaluation: "steps" (every eval_steps) or "no".
eval_steps 250 Run evaluation every eval_steps (when eval_strategy == "steps").
do_eval True Whether to run evaluation during training.
encoder_eval_train_max_size None Max samples for in-training encoder evaluation. None = use encoder_eval_max_size.
encoder_eval_max_size None Max samples for encoder evaluation outside training (e.g. evaluate_encoder, analysis). None = use all.
encoder_eval_balance True If True, balance encoder eval data per feature class.
seed_selection_eval_max_size None Max samples for encoder evaluation when selecting the best seed. None = use encoder_eval_max_size.
decoder_eval_max_size_training_like None Max samples for decoder training-like evaluation.
decoder_eval_max_size_neutral None Max samples for decoder neutral evaluation (LMS).
eval_batch_size 32 Batch size for evaluation.
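
Several of the size caps above fall back to encoder_eval_max_size when left at None. A minimal sketch of that fallback chain, using a hypothetical helper name and assuming the cap is applied as a simple upper bound on the number of samples:

```python
from typing import Optional


def effective_encoder_eval_size(
    dataset_size: int,
    encoder_eval_max_size: Optional[int] = None,
    override_max_size: Optional[int] = None,
) -> int:
    """Hypothetical helper for the documented None-fallback.

    override_max_size stands for encoder_eval_train_max_size or
    seed_selection_eval_max_size; None falls back to encoder_eval_max_size,
    and None there means "use all".
    """
    cap = override_max_size if override_max_size is not None else encoder_eval_max_size
    return dataset_size if cap is None else min(dataset_size, cap)
```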

Checkpointing and saving

Argument Default Description
save_strategy "best" "best" = keep only best checkpoint; "steps" = also save every save_steps; "no" = no checkpointing.
save_steps 5000 Save every save_steps when save_strategy == "steps".
save_only_best True If True, keep only the best checkpoint (by evaluation metric).
delete_models False If True, delete intermediate model files at end (to save disk).

Multi-seed training

Argument Default Description
max_seeds 3 Maximum number of seeds to try.
min_convergent_seeds 1 Stop once this many seeds have converged. None = run max_seeds.
convergent_metric "correlation" Metric for convergence: "correlation" (encoder) or "loss".
convergent_score_threshold 0.5 Threshold; a seed counts as “convergent” if its metric passes this value. For correlation, typical values are 0.5–0.6.
seed None Base seed; seeds used are seed+i for i = 0..max_seeds-1. None = 0..max_seeds-1.
seed_runs_dir None Directory for per-seed runs. Default: experiment_dir/seeds.
keep_seed_runs False If True, keep all per-seed model directories; else delete model files, keep metrics only.
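
The seed schedule and early stopping described in the table can be sketched as a plain loop. The function names and the `train_one_seed` callback are hypothetical, and the sketch assumes a correlation-style metric where "passes" means greater than or equal to the threshold; it is not the library's implementation.

```python
from typing import Callable, List, Optional, Tuple


def candidate_seeds(seed: Optional[int], max_seeds: int) -> List[int]:
    """Seeds are seed+i for i = 0..max_seeds-1; None behaves like base 0."""
    base = 0 if seed is None else seed
    return [base + i for i in range(max_seeds)]


def run_seeds(
    train_one_seed: Callable[[int], float],  # hypothetical: returns the seed's score
    seed: Optional[int] = None,
    max_seeds: int = 3,
    min_convergent_seeds: Optional[int] = 1,
    threshold: float = 0.5,
) -> List[Tuple[int, float]]:
    """Try seeds in order and stop once enough of them have converged."""
    tried: List[Tuple[int, float]] = []
    convergent = 0
    for s in candidate_seeds(seed, max_seeds):
        score = train_one_seed(s)
        tried.append((s, score))
        # Assumption: higher is better and ">=" counts as passing.
        if score >= threshold:
            convergent += 1
        # None means "always run max_seeds".
        if min_convergent_seeds is not None and convergent >= min_convergent_seeds:
            break
    return tried
```

With the defaults, a run stops at the first seed whose score reaches 0.5; setting min_convergent_seeds=None makes the loop try every seed.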

Seed report format

When max_seeds > 1, Trainer.train() writes a JSON report to <experiment_dir>/seeds/seed_report.json summarizing how each seed performed and which seed was selected.

Top-level keys: convergence_metric, threshold, min_convergent_seeds, max_seeds, seeds_tried, convergent_count, best_seed, best_selection_score, early_stop_reason, runs.

Per-seed entries in runs: seed, output_dir, trained / used_cache, training_score, eval_correlation, selection_score, convergence_metric, convergence_metric_value, threshold, converged.

  • training_score: Best training-time score (for correlation = best encoder correlation; for loss = negative loss, higher is better).
  • eval_correlation: Post-hoc evaluate_encoder(split="val") correlation used for seed selection; may be null when skipped.
  • selection_score: Score used to pick best seed (eval_correlation when available, else training_score).
  • convergence_metric_value: Raw metric (for loss = un-negated loss, lower is better for threshold checks).

This report helps debug why a particular seed was chosen and how training-time metrics compare to validation metrics.
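
A short sketch of how the report might be inspected. It relies only on the keys listed above (runs, seed, converged, eval_correlation, training_score); the helper names are hypothetical, and the fallback in `selection_score` simply restates the documented rule rather than reproducing library code.

```python
import json
from typing import List, Optional, Tuple


def load_report(path: str) -> dict:
    """Read a seed_report.json file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def selection_score(run: dict) -> Optional[float]:
    """eval_correlation when available, else training_score (as documented)."""
    eval_corr = run.get("eval_correlation")
    return eval_corr if eval_corr is not None else run.get("training_score")


def summarize_report(report: dict) -> List[Tuple[int, Optional[bool], Optional[float]]]:
    """Return (seed, converged, selection score) per run, best score first."""
    rows = [(r["seed"], r.get("converged"), selection_score(r)) for r in report["runs"]]
    # Runs without any score sort last.
    return sorted(rows, key=lambda row: (row[2] is None, -(row[2] or 0.0)))
```

Calling `summarize_report(load_report(path))` on a report file gives a quick view of which seeds converged and how their selection scores compare to the reported best_seed.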


Pre- and post-pruning

Argument Default Description
pre_prune_config None If set, pre-pruning runs before training (gradient-based importance, reduces input dimension). See Pruning.
post_prune_config None If set, post-pruning runs after training (weight-based, keeps top-k dimensions). See Pruning.

Model parameters (which layers to use)

By default, GRADIEND is trained over all backbone parameters of the base model (everything except prediction heads, e.g. MLM head). You can restrict which parameters (layers) are included in two ways.

params (TrainingArguments)

When using the Trainer with a model path string, set params to a list of parameter names or wildcards. Only backbone parameters whose names match are included in the GRADIEND param map.

  • Exact names: e.g. ["bert.encoder.layer.0.attention.self.query.weight"]
  • Wildcards: * matches any substring; . in the pattern is literal. Examples:
      • ["bert.encoder.layer.0.*"]: only the first encoder layer
      • ["*.encoder.layer.0.*", "*.encoder.layer.1.*"]: first two layers
      • ["*.encoder.layer.*"]: all encoder layers
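from gradiend import TextPredictionTrainer, TrainingArguments

args = TrainingArguments(
    experiment_dir="./out",
    # Restrict GRADIEND to the first two encoder layers of the backbone.
    params=["bert.encoder.layer.0.*", "bert.encoder.layer.1.*"],
)
# Passing a model path string lets the Trainer build the model and apply params.
trainer = TextPredictionTrainer("bert-base-cased", data=..., training_args=args)
trainer.train()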
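
To make the wildcard rule concrete (* matches any substring, . is literal), here is a minimal matching sketch built on Python's `re` module. The helper names are hypothetical and the library's own matcher may be implemented differently; the sketch also assumes a pattern must match the whole parameter name.

```python
import re
from typing import Iterable, List


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard pattern: '*' -> any substring, everything else literal."""
    # re.escape keeps '.' (and other regex metacharacters) literal.
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))


def select_params(names: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Keep the names that fully match at least one pattern, in input order."""
    regexes = [wildcard_to_regex(p) for p in patterns]
    return [n for n in names if any(r.fullmatch(n) for r in regexes)]
```

Because the dot is literal, a pattern such as "bert.encoder.layer.1.*" selects layer 1 but not layers 10 to 19 under this sketch, since those names continue with a digit rather than a dot.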

param_map (when creating the model)

When you create the model yourself and pass it to the Trainer, use param_map in create_model_with_gradiend(). Same name/wildcard rules as params. This is the only way to restrict layers when you pass a pre-built model.

from gradiend import create_model_with_gradiend, TextPredictionTrainer, TrainingArguments
from gradiend.trainer.text.prediction.model_with_gradiend import TextPredictionModelWithGradiend

model = create_model_with_gradiend(
    "bert-base-cased",
    model_class=TextPredictionModelWithGradiend,
    param_map=["bert.encoder.layer.0.*", "bert.encoder.layer.1.*"],
)
trainer = TextPredictionTrainer(model, data=..., training_args=TrainingArguments(experiment_dir="./out"))
trainer.train()

Backbone vs head: The code uses the Hugging Face convention (base_model_prefix or base_model) to identify the backbone. Prediction heads (e.g. the MLM head) are always excluded; only backbone parameters are considered for params / param_map.


Advanced

Argument Default Description
trust_remote_code False Passed through to Hugging Face when loading models.
params None If set, only these parameter names or wildcards are included in the GRADIEND param map (backbone only). See Model parameters (which layers to use) above.
normalize_gradiend True Normalize encodings (first target class → +1, second → -1).
torch_dtype torch.float32 Model dtype.
supervised_encoder False If True, train only the encoder (baseline mode).
supervised_decoder False If True, train only the decoder (baseline mode).
use_cached_gradients False Use cached gradients if available (faster, higher memory/disk).
class_merge_map None Optional mapping from base feature classes to merged classes (e.g. {"singular": ["1SG","3SG"], "plural": ["1PL","3PL"]}). When set, training and evaluation use the merged ids; with exactly two merged keys, target_classes can be omitted.
class_merge_transition_groups None Optional list of base‑class clusters that limit which raw transitions are created before merging. Only transitions where both raw classes lie in the same cluster (and differ) are kept (e.g. [["1SG","1PL"], ["3SG","3PL"]] keeps 1SG↔1PL and 3SG↔3PL, but drops 1SG→3PL).
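
The effect of class_merge_transition_groups and class_merge_map can be illustrated with plain Python. The function names are hypothetical and the sketch only mirrors the rule stated in the table (keep ordered pairs of distinct classes within the same cluster, then map base classes to merged classes); it is not the library's code.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def allowed_transitions(groups: List[List[str]]) -> List[Tuple[str, str]]:
    """All ordered pairs of distinct base classes that share a cluster."""
    return [pair for group in groups for pair in permutations(group, 2)]


def invert_merge_map(class_merge_map: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each base class to its merged class."""
    return {base: merged for merged, bases in class_merge_map.items() for base in bases}


groups = [["1SG", "1PL"], ["3SG", "3PL"]]
merge_map = {"singular": ["1SG", "3SG"], "plural": ["1PL", "3PL"]}

to_merged = invert_merge_map(merge_map)
# Each raw transition expressed in merged class ids.
merged_transitions = [(to_merged[a], to_merged[b]) for a, b in allowed_transitions(groups)]
```

With the example values from the table, the raw transitions are 1SG↔1PL and 3SG↔3PL in both directions, and each of them becomes a singular↔plural transition after merging.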

Device placement

Device placement is automatic based on GPU count (no model-size heuristic):

GPUs Placement
1 encoder, decoder, base model all on cuda:0
2 encoder + base model on cuda:0, decoder on cuda:1
≥3 encoder on cuda:0, decoder on cuda:1, base model on cuda:2
0 all on CPU (automatic when no GPUs)

CPU mode: To force CPU when GPUs are available, pass device="cpu" when creating the model (e.g. via ModelWithGradiend.from_pretrained(..., device="cpu") or the trainer's model-loading kwargs).

Override individual devices: Use device_encoder, device_decoder, device_base_model to override specific components.
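
The placement table can be read as a simple function of the GPU count. The sketch below uses a hypothetical helper name and returns plain device strings; how the library merges the device_encoder, device_decoder, and device_base_model overrides is assumed here to be a straightforward per-component replacement.

```python
from typing import Dict, Optional


def default_placement(
    num_gpus: int,
    device_encoder: Optional[str] = None,
    device_decoder: Optional[str] = None,
    device_base_model: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical helper reproducing the placement table above."""
    if num_gpus <= 0:
        placement = {"encoder": "cpu", "decoder": "cpu", "base_model": "cpu"}
    elif num_gpus == 1:
        placement = {"encoder": "cuda:0", "decoder": "cuda:0", "base_model": "cuda:0"}
    elif num_gpus == 2:
        placement = {"encoder": "cuda:0", "decoder": "cuda:1", "base_model": "cuda:0"}
    else:
        placement = {"encoder": "cuda:0", "decoder": "cuda:1", "base_model": "cuda:2"}
    # Assumption: an explicit override replaces only its own component.
    overrides = {
        "encoder": device_encoder,
        "decoder": device_decoder,
        "base_model": device_base_model,
    }
    placement.update({k: v for k, v in overrides.items() if v is not None})
    return placement
```

In practice the GPU count would typically come from `torch.cuda.device_count()`; it is passed in as an argument here so the sketch runs without PyTorch installed.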