FAQ / troubleshooting
Which modalities are supported?
Currently only gradients based on text prediction (MLM/CLM) are supported. All documentation and examples use TextPredictionTrainer. Other modalities may be added in the future.
My model does not converge (bad correlations). What parameters should I tweak?
You should check the training log and inspect the mean probabilities per label class.
- Mean encoded values of the two target classes are similar and close to +1 or -1: the encoder is not separating the classes (it likely only separates target classes from neutral inputs). Try a smaller learning rate and/or different seeds (increase `TrainingArguments.max_seeds`).
- Mean encoded values of the two target classes are near zero and barely move during training: add more data (increase `train_max_size`), train longer (increase `max_steps` or `num_train_epochs`), and/or use a larger learning rate.
- Mean encoded values do not change during training even with a larger learning rate: most likely only zero gradients are computed from the provided data. Check whether your mask identifier (default `[MASK]`) matches the one used in your data, and whether the mask falls outside the model's context window (texts too long).
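The three symptoms above can be turned into a simple triage helper. The following is an illustrative, hypothetical snippet (the function name and thresholds are my own, not part of the GRADIEND API) that maps the mean encoded values of the two target classes to the suggested remedy:

```python
def diagnose_encoder(mean_class_a: float, mean_class_b: float) -> str:
    """Map mean encoded values of the two target classes to a likely remedy.

    Hypothetical helper; the thresholds are illustrative only.
    """
    near_zero = abs(mean_class_a) < 0.1 and abs(mean_class_b) < 0.1
    similar = abs(mean_class_a - mean_class_b) < 0.2
    saturated = min(abs(mean_class_a), abs(mean_class_b)) > 0.8

    if near_zero:
        # Values near zero and barely moving: not enough signal yet.
        return "add data / train longer / larger learning rate"
    if similar and saturated:
        # Both classes pushed to the same extreme: classes not separated.
        return "smaller learning rate / more seeds"
    return "looks fine: classes are separating"

print(diagnose_encoder(0.95, 0.92))   # -> smaller learning rate / more seeds
print(diagnose_encoder(0.02, -0.03))  # -> add data / train longer / larger learning rate
print(diagnose_encoder(0.90, -0.85))  # -> looks fine: classes are separating
```

In practice you would read the two mean values off the training log and compare against the cases above by eye; the helper just encodes that decision table.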
More general hints:
- Source: using `source="alternative"` typically yields simpler conversion than `source="factual"`.
- Data balance and size: ensure both classes in your pair have enough examples; use `train_max_size` per class if needed. See Data handling for balancing.
- Encoder evaluation: run `trainer.evaluate_encoder(return_df=True, plot=True)` and check the convergence plots for debugging; use `trainer.plot_training_convergence()` to see the correlation over steps.
- Data quality: check the quality of your data and labels (spaCy misclassifications, etc.).
- Pruning: first try training the full (un-pruned) model to check whether pre-pruning accidentally removes important weights that are required for convergence.
See TrainingArguments and the start here / detailed workflow tutorials for all training options.
I get a CUDA out of memory error. How can I reduce memory usage?
The GRADIEND model itself holds 3n+1 parameters, where n is the number of (considered) parameters of the base model. During training, the optimizer state and the large input parameter space (n) require a multiple of the base model's memory. To reduce memory usage, you can:
- Apply pre-pruning: typically, a top-k of 0.01 (i.e., retaining 1% of the base model's parameters) still yields full GRADIEND performance and reduces the GRADIEND size significantly.
- Use a smaller base model (e.g. `bert-base-uncased` instead of `bert-large-uncased`).
- Reduce the batch size (`train_batch_size`) and/or the sequence length of your data.
- Use mixed precision training (`TrainingArguments.torch_dtype = torch.bfloat16`), which typically halves memory usage with minimal impact on convergence. Note: this requires a compatible GPU (e.g. NVIDIA Ampere or later for bfloat16).
- Use multiple GPUs: device placement is automatic based on GPU count. See Device placement in the training arguments guide.
- Restrict GRADIEND training to fewer base model parameters (e.g., only parameters in the last few layers) via `TrainingArguments.params`; see TrainingArguments for details.
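As a back-of-the-envelope check, the 3n+1 parameter count and the effect of pre-pruning can be estimated directly. The helper below is illustrative only (it is not part of the GRADIEND API), and the bert-base-uncased parameter count is a rough figure:

```python
def gradiend_param_count(n_base: int, prune_top_k: float = 1.0) -> int:
    """GRADIEND holds 3*n + 1 parameters, where n is the number of
    considered base-model parameters (optionally reduced by pre-pruning)."""
    n = int(n_base * prune_top_k)
    return 3 * n + 1

def approx_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    """Rough footprint of the parameters alone (fp32 = 4 bytes,
    bfloat16 = 2 bytes); optimizer state adds a further multiple."""
    return num_params * bytes_per_param

n = 110_000_000  # roughly bert-base-uncased
full = gradiend_param_count(n)          # 3n + 1
pruned = gradiend_param_count(n, 0.01)  # top-k of 1% of base parameters
print(f"full:   {approx_bytes(full) / 1e9:.2f} GB (fp32 params only)")
print(f"pruned: {approx_bytes(pruned) / 1e9:.3f} GB (fp32 params only)")
```

Switching `bytes_per_param` to 2 models the bfloat16 option from the list above; combining pruning and bfloat16 multiplies the savings.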
Is neutral data (`eval_neutral_data`) required?
No. `eval_neutral_data` is optional. When omitted:
- Encoder evaluation still runs; it simply has no `neutral_dataset` variant (only training and, optionally, `neutral_training_masked`).
- Decoder evaluation uses a fallback: training-like data (the test split with masks filled by the factual token). Target tokens are automatically ignored in the LMS (language modeling score) to avoid distorting perplexity.
For best practice, provide true neutral data (e.g. `TextPredictionDataCreator.generate_neutral_data()` or datasets like `aieng-lab/wortschatz-leipzig-de-grammar-neutral`) when available. See Data handling and Evaluation (intra-model).
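The decoder-evaluation fallback described above (fill each mask with the factual token, then ignore the target tokens when scoring) can be sketched in isolation. This is an illustrative, hypothetical re-implementation on whitespace tokens, not GRADIEND's actual code:

```python
MASK = "[MASK]"

def build_fallback_example(text, factual_token, target_tokens):
    """Fill the mask with the factual token and report which token
    positions should be ignored when computing the LMS, so the
    inserted target token does not distort perplexity."""
    filled = text.replace(MASK, factual_token)
    tokens = filled.split()
    ignore = [i for i, tok in enumerate(tokens) if tok in target_tokens]
    return tokens, ignore

tokens, ignore = build_fallback_example(
    "the doctor said [MASK] would help", "she", {"she", "he"}
)
print(tokens)  # ['the', 'doctor', 'said', 'she', 'would', 'help']
print(ignore)  # [3]
```

A real implementation would operate on tokenizer ids rather than whitespace tokens, but the principle is the same: positions holding target tokens are excluded from the language modeling score.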
I have a different issue
Write a GitHub issue with a clear description of the problem, steps to reproduce, and any error messages or logs. We will try to help as soon as possible!