Training Arguments
This guide explains TrainingArguments in detail. For a conceptual overview and minimal usage, see Tutorial: Training. For the full API (including defaults), see API reference.
TrainingArguments follows Hugging Face Trainer conventions where applicable (e.g. num_train_epochs, learning_rate, eval_steps).
Output and paths
| Argument | Default | Description |
|---|---|---|
| experiment_dir | None | Root directory for this experiment. When set, checkpoints, caches, and plots use subpaths under it. Required for saving models and using the encoder/decoder cache. With run_id, output goes under experiment_dir/run_id/. |
| output_dir | None | Directory to save the trained model. If None and experiment_dir is set, uses experiment_dir/model (or experiment_dir/run_id/model). |
| use_cache | False | When True, skip training if a saved model exists at the output path, and skip encoder/decoder recomputation when a cache exists. Use False to force retraining or recomputation. |
| add_identity_for_other_classes | False | When True, add identity examples (factual == alternative) for classes not in the target pair. Important for multi-class data so extra classes do not push the model arbitrarily. |
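The interplay of experiment_dir, run_id, and output_dir described above can be sketched as a small resolution helper. This is a hypothetical illustration of the documented fallback rules, not library code:

```python
import os

def resolve_model_dir(experiment_dir=None, run_id=None, output_dir=None):
    """Sketch of how the model output path is derived (assumed semantics)."""
    if output_dir is not None:
        return output_dir  # an explicit output_dir wins
    if experiment_dir is None:
        return None  # nowhere to save
    if run_id is not None:
        return os.path.join(experiment_dir, run_id, "model")
    return os.path.join(experiment_dir, "model")
```

For example, `resolve_model_dir("./out", run_id="run1")` yields `./out/run1/model`, matching the documented layout.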
GRADIEND interpretation (source / target)
| Argument | Default | Description |
|---|---|---|
| source | "alternative" | Which gradient feeds the encoder: "factual", "alternative", or "diff". Common choice: "alternative", so the encoder sees the alternative gradient. |
| target | "diff" | Which quantity the decoder predicts: "factual", "alternative", or "diff". Common choice: "diff", so the decoder predicts the difference; at inference you combine the base model with decoder(diff). |
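The source / target options above select which gradient signal each component sees. A toy sketch of that selection (a hypothetical helper operating on flat lists; the real code works on per-parameter gradient tensors, and the sign convention of "diff" is an assumption):

```python
def select_gradient(factual, alternative, mode):
    """Pick which gradient signal to use, per the source/target options (sketch)."""
    if mode == "factual":
        return factual
    if mode == "alternative":
        return alternative
    if mode == "diff":
        # element-wise difference; factual-minus-alternative is assumed here
        return [f - a for f, a in zip(factual, alternative)]
    raise ValueError(f"unknown mode: {mode!r}")

# With source="alternative" the encoder input is the alternative gradient;
# with target="diff" the decoder's training target is the difference.
encoder_input = select_gradient([1.0, 2.0], [0.5, 1.0], "alternative")
decoder_target = select_gradient([1.0, 2.0], [0.5, 1.0], "diff")
```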
Training loop (HF-style)
| Argument | Default | Description |
|---|---|---|
| train_batch_size | 32 | Batch size for training. |
| train_max_size | None | If set, cap training samples per feature class (for speed/memory). None = use all data. |
| learning_rate | 1e-5 | Peak learning rate. |
| num_train_epochs | 3 | Number of training epochs. Ignored when max_steps > 0. |
| max_steps | -1 | If > 0, total training steps; overrides num_train_epochs. -1 = use epochs. |
| weight_decay | 1e-2 | Weight decay for the optimizer. |
| adam_epsilon | 1e-8 | Epsilon for Adam/AdamW. |
| optim | "adamw" | Optimizer: "adamw" or "adam". |
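The interaction between num_train_epochs and max_steps follows the usual Hugging Face convention, which can be sketched as follows (hypothetical helper; the exact step accounting in the library may differ, e.g. with gradient accumulation):

```python
import math

def total_training_steps(num_examples, train_batch_size, num_train_epochs, max_steps=-1):
    """Sketch: max_steps > 0 overrides the epoch count; -1 means derive from epochs."""
    if max_steps > 0:
        return max_steps
    steps_per_epoch = math.ceil(num_examples / train_batch_size)
    return steps_per_epoch * num_train_epochs
```

With 100 examples, batch size 32, and 3 epochs this gives 4 steps per epoch, so 12 steps total; setting max_steps=500 ignores the epochs entirely.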
Evaluation during training
| Argument | Default | Description |
|---|---|---|
| eval_strategy | "steps" | When to run evaluation: "steps" (every eval_steps) or "no". |
| eval_steps | 250 | Run evaluation every eval_steps (when eval_strategy == "steps"). |
| do_eval | True | Whether to run evaluation during training. |
| encoder_eval_train_max_size | None | Max samples for in-training encoder evaluation. None = use encoder_eval_max_size. |
| encoder_eval_max_size | None | Max samples for encoder evaluation outside training (e.g. evaluate_encoder, analysis). None = use all. |
| encoder_eval_balance | True | If True, balance encoder eval data per feature class. |
| seed_selection_eval_max_size | None | Max samples for encoder evaluation when selecting the best seed. None = use encoder_eval_max_size. |
| decoder_eval_max_size_training_like | None | Max samples for decoder training-like evaluation. |
| decoder_eval_max_size_neutral | None | Max samples for decoder neutral evaluation (LMS). |
| eval_batch_size | 32 | Batch size for evaluation. |
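Several of the eval-size arguments above fall back to encoder_eval_max_size when unset, which in turn falls back to "use all" (None). A minimal sketch of that chain, assuming the fallback semantics stated in the table:

```python
def effective_eval_max_size(specific_max_size=None, encoder_eval_max_size=None):
    """Sketch: a specific limit (e.g. encoder_eval_train_max_size or
    seed_selection_eval_max_size) falls back to the general encoder_eval_max_size;
    a final None means evaluate on all samples."""
    if specific_max_size is not None:
        return specific_max_size
    return encoder_eval_max_size
```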
Checkpointing and saving
| Argument | Default | Description |
|---|---|---|
| save_strategy | "best" | "best" = keep only the best checkpoint; "steps" = also save every save_steps; "no" = no checkpointing. |
| save_steps | 5000 | Save every save_steps when save_strategy == "steps". |
| save_only_best | True | If True, keep only the best checkpoint (by evaluation metric). |
| delete_models | False | If True, delete intermediate model files at the end (to save disk). |
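The save_strategy options above can be summarized as a small decision function. This is an assumed reading of the table ("steps" saves best checkpoints *and* periodic ones), not the library's actual checkpointing code:

```python
def should_save(step, is_best, save_strategy="best", save_steps=5000):
    """Sketch of when a checkpoint is written, per the documented strategies."""
    if save_strategy == "no":
        return False
    if save_strategy == "best":
        return is_best
    if save_strategy == "steps":
        # periodic saves in addition to best-checkpoint saves
        return is_best or (step % save_steps == 0)
    raise ValueError(f"unknown save_strategy: {save_strategy!r}")
```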
Multi-seed training
| Argument | Default | Description |
|---|---|---|
| max_seeds | 3 | Maximum number of seeds to try. |
| min_convergent_seeds | 1 | Stop once this many seeds have converged. None = run all max_seeds. |
| convergent_metric | "correlation" | Metric for convergence: "correlation" (encoder) or "loss". |
| convergent_score_threshold | 0.5 | Threshold; a seed is "convergent" if its metric passes this. For correlation, typical values are 0.5–0.6. |
| seed | None | Base seed; seeds used are seed+i for i = 0..max_seeds-1. None = 0..max_seeds-1. |
| seed_runs_dir | None | Directory for per-seed runs. Default: experiment_dir/seeds. |
| keep_seed_runs | False | If True, keep all per-seed model directories; else delete model files and keep metrics only. |
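The seed enumeration and early-stop behavior described above can be sketched as follows. This is a hypothetical simplification (scores are given up front; real training computes them per run):

```python
def plan_seeds(max_seeds, seed=None):
    """Seeds tried are seed+i for i in 0..max_seeds-1; None means base seed 0."""
    base = 0 if seed is None else seed
    return [base + i for i in range(max_seeds)]

def run_seeds(scores, threshold, max_seeds, min_convergent_seeds=1):
    """Sketch of the early-stop loop: try seeds in order, stop once enough converge.
    `scores` maps seed -> metric value (assumed higher-is-better, e.g. correlation)."""
    converged = []
    for s in plan_seeds(max_seeds):
        if scores[s] >= threshold:
            converged.append(s)
        if min_convergent_seeds is not None and len(converged) >= min_convergent_seeds:
            break  # enough convergent seeds; stop early
    return converged
```

With the defaults (max_seeds=3, min_convergent_seeds=1, threshold 0.5), training stops at the first seed whose correlation reaches 0.5.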
Seed report format
When max_seeds > 1, Trainer.train() writes a JSON report to <experiment_dir>/seeds/seed_report.json summarizing how each seed performed and which seed was selected.
Top-level keys: convergence_metric, threshold, min_convergent_seeds, max_seeds, seeds_tried, convergent_count, best_seed, best_selection_score, early_stop_reason, runs.
Per-seed entries in runs: seed, output_dir, trained / used_cache, training_score, eval_correlation, selection_score, convergence_metric, convergence_metric_value, threshold, converged.
- `training_score`: best training-time score (for `correlation`, the best encoder correlation; for `loss`, the negative loss, higher is better).
- `eval_correlation`: post-hoc `evaluate_encoder(split="val")` correlation used for seed selection; may be `null` when skipped.
- `selection_score`: score used to pick the best seed (`eval_correlation` when available, else `training_score`).
- `convergence_metric_value`: raw metric value (for `loss`, the un-negated loss; lower is better for threshold checks).
This report helps debug why a particular seed was chosen and how training-time metrics compare to validation metrics.
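The selection rule can be demonstrated on a hypothetical minimal report (fields mirror the list above; the values here are invented for illustration):

```python
import json

# Hypothetical excerpt of a seed_report.json, invented for this example.
report = {
    "convergence_metric": "correlation",
    "threshold": 0.5,
    "runs": [
        {"seed": 0, "training_score": 0.61, "eval_correlation": None, "selection_score": 0.61},
        {"seed": 1, "training_score": 0.58, "eval_correlation": 0.72, "selection_score": 0.72},
    ],
}

def pick_best(report):
    """Select the run with the highest selection_score
    (eval_correlation when available, else training_score)."""
    return max(report["runs"], key=lambda r: r["selection_score"])

# Round-trip through JSON as if the report had been read from disk.
best = pick_best(json.loads(json.dumps(report)))
```

Here seed 1 wins: its validation correlation (0.72) beats seed 0's training-only score (0.61), even though seed 0 trained better.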
Pre- and post-pruning
| Argument | Default | Description |
|---|---|---|
| pre_prune_config | None | If set, pre-pruning runs before training (gradient-based importance, reduces input dimension). See Pruning. |
| post_prune_config | None | If set, post-pruning runs after training (weight-based, keeps top-k dimensions). See Pruning. |
Model parameters (which layers to use)
By default, GRADIEND is trained over all backbone parameters of the base model (everything except prediction heads, e.g. MLM head). You can restrict which parameters (layers) are included in two ways.
params (TrainingArguments)
When using the Trainer with a model path string, set params to a list of parameter names or wildcards. Only backbone parameters whose names match are included in the GRADIEND param map.
- Exact names, e.g. `["bert.encoder.layer.0.attention.self.query.weight"]`.
- Wildcards: `*` matches any substring; `.` in the pattern is literal. Examples:
  - `["bert.encoder.layer.0.*"]`: only the first encoder layer
  - `["*.encoder.layer.0.*", "*.encoder.layer.1.*"]`: first two layers
  - `["*.encoder.layer.*"]`: all encoder layers
```python
args = TrainingArguments(
    experiment_dir="./out",
    params=["bert.encoder.layer.0.*", "bert.encoder.layer.1.*"],  # first two layers only
)
trainer = TextPredictionTrainer("bert-base-cased", data=..., training_args=args)
trainer.train()
```
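The wildcard rule ("`*` matches any substring, `.` is literal") can be sketched as a small matcher. This is an assumed implementation of the documented semantics; the library's actual matcher may differ in details:

```python
import re

def matches_param(pattern, name):
    """Sketch: translate the pattern into a regex where '*' becomes '.*'
    and everything else (including '.') is escaped as a literal."""
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*")) + "$"
    return re.match(regex, name) is not None
```

For example, `"bert.encoder.layer.0.*"` matches `bert.encoder.layer.0.attention.self.query.weight` but not `bert.encoder.layer.10.output.dense.weight`, because the literal `.` after `0` prevents matching layer 10.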
param_map (when creating the model)
When you create the model yourself and pass it to the Trainer, use param_map in create_model_with_gradiend(). Same name/wildcard rules as params. This is the only way to restrict layers when you pass a pre-built model.
```python
from gradiend import create_model_with_gradiend, TextPredictionTrainer, TrainingArguments
from gradiend.trainer.text.prediction.model_with_gradiend import TextPredictionModelWithGradiend

model = create_model_with_gradiend(
    "bert-base-cased",
    model_class=TextPredictionModelWithGradiend,
    param_map=["bert.encoder.layer.0.*", "bert.encoder.layer.1.*"],
)
trainer = TextPredictionTrainer(model, data=..., training_args=TrainingArguments(experiment_dir="./out"))
trainer.train()
```
Backbone vs head: The code uses Hugging Face convention (base_model_prefix or base_model) to define the backbone. Prediction heads (e.g. MLM head) are always excluded; only backbone parameters are considered for params / param_map.
Advanced
| Argument | Default | Description |
|---|---|---|
| trust_remote_code | False | Passed to Hugging Face when loading models. |
| params | None | If set, only these parameter names or wildcards are included in the GRADIEND param map (backbone only). See Model parameters (which layers to use) above. |
| normalize_gradiend | True | Normalize encodings (first target class → +1, second → -1). |
| torch_dtype | torch.float32 | Model dtype. |
| supervised_encoder | False | If True, train only the encoder (baseline mode). |
| supervised_decoder | False | If True, train only the decoder (baseline mode). |
| use_cached_gradients | False | Use cached gradients if available (faster, but higher memory/disk usage). |
| class_merge_map | None | Optional mapping from base feature classes to merged classes (e.g. {"singular": ["1SG", "3SG"], "plural": ["1PL", "3PL"]}). When set, training and evaluation use the merged ids; with exactly two merged keys, target_classes can be omitted. |
| class_merge_transition_groups | None | Optional list of base-class clusters that limit which raw transitions are created before merging. Only transitions where both raw classes lie in the same cluster (and differ) are kept (e.g. [["1SG", "1PL"], ["3SG", "3PL"]] keeps 1SG↔1PL and 3SG↔3PL, but drops 1SG→3PL). |
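The combination of class_merge_map and class_merge_transition_groups can be sketched as follows. This is a hypothetical reading of the documented semantics (cluster filtering on raw classes first, then mapping to merged ids), not the library's actual data pipeline:

```python
def build_transitions(labels, merge_map=None, transition_groups=None):
    """Sketch: enumerate transitions between distinct raw classes, optionally
    restricted to pairs within the same cluster, then rename via the merge map."""
    # invert {"merged": ["raw", ...]} into raw -> merged
    to_merged = {}
    for merged, raws in (merge_map or {}).items():
        for raw in raws:
            to_merged[raw] = merged
    transitions = []
    for a in labels:
        for b in labels:
            if a == b:
                continue
            if transition_groups is not None and not any(
                a in g and b in g for g in transition_groups
            ):
                continue  # both raw classes must lie in the same cluster
            transitions.append((to_merged.get(a, a), to_merged.get(b, b)))
    return transitions
```

With the example values from the table, 1SG↔1PL and 3SG↔3PL survive and are renamed to singular↔plural, while cross-cluster pairs like 1SG→3PL are dropped.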
Device placement
Device placement is automatic based on GPU count (no model-size heuristic):
| GPUs | Placement |
|---|---|
| 1 | encoder, decoder, base model all on cuda:0 |
| 2 | encoder + base model on cuda:0, decoder on cuda:1 |
| ≥3 | encoder on cuda:0, decoder on cuda:1, base model on cuda:2 |
| 0 | all on CPU (automatic when no GPUs) |
CPU mode: To force CPU when GPUs are available, pass device="cpu" when creating the model (e.g. via ModelWithGradiend.from_pretrained(..., device="cpu") or the trainer's model-loading kwargs).

Override individual devices: Use device_encoder, device_decoder, and device_base_model to override the placement of specific components.
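The placement table can be restated as a small function. This is an illustrative transcription of the table, not the library's device logic:

```python
def place_components(num_gpus):
    """Sketch of automatic device placement by GPU count (mirrors the table above)."""
    if num_gpus == 0:
        return {"encoder": "cpu", "decoder": "cpu", "base_model": "cpu"}
    if num_gpus == 1:
        return {"encoder": "cuda:0", "decoder": "cuda:0", "base_model": "cuda:0"}
    if num_gpus == 2:
        return {"encoder": "cuda:0", "decoder": "cuda:1", "base_model": "cuda:0"}
    # three or more GPUs: one device per component
    return {"encoder": "cuda:0", "decoder": "cuda:1", "base_model": "cuda:2"}
```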