TextPreprocessConfig

Configuration for preprocessing text: sentence splitting and length/character filters.

Used by gradiend.data.text.prediction.creator.TextPredictionDataCreator and preprocess_texts. When preprocess is None (e.g. in the data creator), no preprocessing is applied and texts are used as-is.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| split_to_sentences | bool | If True, split input texts into sentences (via regex on .!? or, when a spaCy model is provided, via spaCy sentencizer). If False, each input string is treated as one segment. Default False. |
| min_chars | Optional[int] | Drop segments (sentences or whole texts) with strictly fewer than this many characters. None means no minimum. Default None. |
| max_chars | Optional[int] | Drop segments with strictly more than this many characters. None means no maximum. Default None. |
| exclude_chars | Optional[str] | Drop segments that contain any character in this string (e.g. "\x00" to drop nulls). None means no character exclusion. Default None. |
| custom_filter | Optional[Callable[[str], bool]] | Optional callable (str) -> bool. Only segments for which it returns True are kept. None means no custom filter. Default None. |
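
The attribute semantics above can be illustrated with a small, self-contained sketch. It is not the gradiend implementation: SketchConfig and sketch_preprocess are hypothetical stand-ins, and both the splitting regex and the order in which the filters run are assumptions made for illustration. The field names, types, and defaults mirror the documented ones.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in mirroring the documented fields (not the gradiend class)."""
    split_to_sentences: bool = False
    min_chars: Optional[int] = None
    max_chars: Optional[int] = None
    exclude_chars: Optional[str] = None
    custom_filter: Optional[Callable[[str], bool]] = field(default=None, repr=False)


def sketch_preprocess(texts: List[str], cfg: SketchConfig) -> List[str]:
    """Split (optionally) and filter texts; the filter order is an assumption."""
    segments: List[str] = []
    for text in texts:
        if cfg.split_to_sentences:
            # Assumed regex: split on whitespace that follows ., ! or ?
            parts = [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]
        else:
            # Without splitting, each input string is one segment.
            parts = [text]
        segments.extend(parts)

    kept: List[str] = []
    for seg in segments:
        # "Strictly fewer": a segment of exactly min_chars characters is kept.
        if cfg.min_chars is not None and len(seg) < cfg.min_chars:
            continue
        # "Strictly more": a segment of exactly max_chars characters is kept.
        if cfg.max_chars is not None and len(seg) > cfg.max_chars:
            continue
        # Drop the segment if it contains any excluded character.
        if cfg.exclude_chars is not None and any(c in cfg.exclude_chars for c in seg):
            continue
        # Keep only segments for which the custom filter returns True.
        if cfg.custom_filter is not None and not cfg.custom_filter(seg):
            continue
        kept.append(seg)
    return kept


texts = ["Hi. This is a longer sentence! Bad\x00one?", "ok"]
cfg = SketchConfig(split_to_sentences=True, min_chars=3, max_chars=40, exclude_chars="\x00")
result = sketch_preprocess(texts, cfg)
print(result)  # ['Hi.', 'This is a longer sentence!']
```

In this sketch "Hi." survives because its length equals min_chars, the segment containing a null character is dropped by exclude_chars, and "ok" is dropped for being shorter than min_chars. With an all-default SketchConfig the input list comes back unchanged.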

custom_filter class-attribute instance-attribute

custom_filter = field(default=None, repr=False)
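
The repr=False argument is standard dataclasses behavior: the field is left out of the generated repr, which keeps an opaque function object out of logs and printed configs. A minimal sketch, using a hypothetical two-field class (ReprDemo) rather than the real config:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ReprDemo:
    """Hypothetical class used only to show the effect of repr=False."""
    min_chars: Optional[int] = None
    custom_filter: Optional[Callable[[str], bool]] = field(default=None, repr=False)


demo = ReprDemo(min_chars=5, custom_filter=str.isalpha)
demo_repr = repr(demo)
print(demo_repr)  # the repr lists min_chars but omits custom_filter

# The field is still stored and usable; only its display is suppressed.
print(demo.custom_filter("abc"))
```

Note that repr=False does not affect comparison: unless compare=False is also set, the field still takes part in the generated __eq__.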

exclude_chars class-attribute instance-attribute

exclude_chars = None

max_chars class-attribute instance-attribute

max_chars = None

min_chars class-attribute instance-attribute

min_chars = None

split_to_sentences class-attribute instance-attribute

split_to_sentences = False