
Documentation

What is GRADIEND?

GRADIEND is a method for learning features within neural networks by training an encoder-decoder architecture on model gradients. With this library you can find where a language model encodes a feature (e.g. gender, race, religion) and rewrite the model's weights to strengthen or weaken that feature (for example, to debias the model) while preserving its other behaviour.

GRADIEND overview

GRADIEND works by:

  1. Training an encoder-decoder network on gradients computed from masked text predictions
  2. Learning a single latent feature neuron that encodes the desired interpretation (e.g., gender bias)
  3. Using the decoder to modify the base model's weights, enabling targeted feature manipulation
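The three steps above can be sketched numerically. The following is a minimal, dependency-free toy, not the GRADIEND API: all names, dimensions, and the training setup are illustrative. Gradients from two classes are fed to an autoencoder with a single latent neuron; the neuron learns to separate the classes, and the decoder output then serves as a weight-update direction.

```python
import random

random.seed(0)
DIM = 8  # size of the flattened toy gradient/weight vector

# Toy stand-in for model gradients: two classes of masked predictions
# (e.g. "he"/"she") produce gradients that differ along one direction.
feature_dir = [random.gauss(0, 1) for _ in range(DIM)]

def toy_gradient(label):
    """Gradient for a sample of class `label` (+1 or -1), plus noise."""
    return [label * f + random.gauss(0, 0.1) for f in feature_dir]

# Encoder and decoder around a single latent feature neuron h.
enc = [random.gauss(0, 0.1) for _ in range(DIM)]
dec = [random.gauss(0, 0.1) for _ in range(DIM)]

def encode(g):
    return sum(e * gi for e, gi in zip(enc, g))  # gradient -> scalar h

def decode(h):
    return [h * d for d in dec]  # scalar h -> weight-space direction

# Train as a gradient autoencoder: decode(encode(g)) reconstructs g.
lr = 0.01
for _ in range(2000):
    g = toy_gradient(random.choice([+1, -1]))
    h = encode(g)
    err = [r - gi for r, gi in zip(decode(h), g)]  # reconstruction error
    common = 2 * sum(e * d for e, d in zip(err, dec))
    enc = [e - lr * common * gi for e, gi in zip(enc, g)]  # SGD on encoder
    dec = [d - lr * 2 * ei * h for d, ei in zip(dec, err)]  # SGD on decoder

# The latent neuron now separates the two classes by the sign of h.
h_pos = encode(toy_gradient(+1))
h_neg = encode(toy_gradient(-1))

# Rewrite: move toy weights along the decoded direction; a negative
# scale weakens the encoded feature (e.g. for debiasing).
W = [0.0] * DIM
scale = -1.0
W_new = [w + scale * d for w, d in zip(W, decode(1.0))]
```

In the real method the gradients come from masked-token predictions of a transformer, and the decoded direction is applied to the model's actual weight matrices.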

The method is described in detail in the paper: GRADIEND: Feature Learning within Neural Networks Exemplified through Biases (ICLR 2026, Drechsel & Herbold, 2025).

While GRADIEND is methodologically defined for any model that is trained via gradients and stores its knowledge in weights, this library currently supports only text prediction models (specifically transformers.AutoModelForMaskedLM and transformers.AutoModelForCausalLM). However, the library is designed to be modular, and we plan to support further model types in the future (e.g., text classification, vision, ...).

Example use cases (gradiend/examples on GitHub):

  • English pronouns — notebook; script
  • English gender
  • German gender–case
  • Race and religion

Links: GitHub · PyPI · arXiv (main paper) · arXiv (German articles)


Get started


Tutorials

Step-by-step workflows, covering the five steps of the overview above:

  1. Workflow Overview — An overview of the complete tutorial workflow.
  2. Feature Selection and Data Generation — Build training and neutral data from raw text (syncretism, spaCy, one filter per grammatical cell). Part 1 of the detailed workflow.
  3. GRADIEND Training — Experiment layout, pruning (pre/post), multi-seed, convergence plot, and training options in detail.
  4. Intra-Model Evaluation — Encoder analysis (are the target classes separated?) and decoder evaluation (determining the parameters for updating the model's feature behaviour under a language modeling constraint).
  5. Model Rewrite — Using decoder-selected settings to rewrite base-model weights in memory or on disk.
  6. Inter-Model Evaluation — Comparing multiple runs: top-k overlap and heatmap.
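As a concrete (if toy) illustration of the rewrite step, the snippet below modifies a weight dictionary in memory and persists the result to disk. This is a plain-Python sketch, not the library's rewrite API: the parameter names, the decoded deltas, and the JSON format are illustrative stand-ins (a real checkpoint would use the model's own serialization).

```python
import json
import os
import tempfile

# Illustrative decoded feature direction: one delta per parameter tensor.
decoded_delta = {"layer.0.weight": [0.25, -0.5], "layer.0.bias": [0.25]}

# Toy base-model weights, standing in for a state dict.
base_weights = {"layer.0.weight": [1.0, 1.0], "layer.0.bias": [0.0]}

def rewrite(weights, delta, scale):
    """Return rewritten weights W' = W + scale * delta (in memory)."""
    return {
        name: [w + scale * d for w, d in zip(values, delta[name])]
        for name, values in weights.items()
    }

# In-memory rewrite: a negative scale weakens the encoded feature.
rewritten = rewrite(base_weights, decoded_delta, scale=-1.0)

# On-disk rewrite: persist the modified weights for later reloading.
path = os.path.join(tempfile.gettempdir(), "rewritten_weights.json")
with open(path, "w") as f:
    json.dump(rewritten, f)
```

Keeping the rewrite out-of-place (returning a new dict rather than mutating the base weights) makes it easy to compare several decoder-selected scales against the unmodified base model.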

Guides

When you need to understand a topic or look up options:

  • Core classes and use cases — Overview of the most important classes and when to use them.
  • Data handling — Data formats, columns, and balancing (DataFrames, per-class dicts, Hugging Face datasets).
  • Pruning — Pre-pruning (from gradients) and post-pruning (from weights); when and how to use them.
  • Evaluation & visualization — Encoder and decoder evaluation, convergence and top-k plots, and how to customize plots.
  • Saving & loading — Where results are stored and how to reload a trained model.
  • Training arguments — Full parameter reference, including multi-seed training and seed report format.
  • Decoder-only models — Use causal (decoder-only) LMs with the same API; optional MLM head for better mask gradients.

Reference

  • Examples — All example scripts with short descriptions.
  • API reference — Auto-generated from docstrings; main classes and entry points.
  • FAQ — Troubleshooting and common pitfalls.
  • Citation — How to cite the paper and library.