UFDPreprocessor

class UFDPreprocessor(tokenizer: Optional[transformers.tokenization_utils.PreTrainedTokenizer] = None, embedding_model: Optional[transformers.modeling_utils.PreTrainedModel] = None, tokenizer_name: str = 'xlm-roberta-large', embedding_model_name: str = 'xlm-roberta-large', device: torch.device = device(type='cpu'))

Preprocesses raw text into a batch of tensors for the UFDModel to predict on. Either inject tokenizer and/or embedding model instances via the ‘tokenizer’ and ‘embedding_model’ arguments, or pass a tokenizer name and/or embedding model name via the ‘tokenizer_name’ and ‘embedding_model_name’ arguments so that the instances are created with from_pretrained.

__call__(data_batch: List[str]) → transformers.tokenization_utils_base.BatchEncoding

Main entry point: preprocesses a batch of raw text.

Parameters

data_batch (List[str]) – list of raw text strings to process.

Returns

A BatchEncoding instance with the key ‘input_ids’, whose value holds the embedded values of the data batch.

Return type

BatchEncoding
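
Example

The constructor and call contract described above can be illustrated with a minimal, dependency-free sketch. It is not the UFDPreprocessor implementation: ToyTokenizer, ToyEmbedder and SketchPreprocessor are hypothetical stand-ins, the "embedding" is a toy two-number feature, and a plain dict takes the place of BatchEncoding. It only shows the inject-or-create-by-name pattern and the shape of the returned mapping, assuming an injected instance takes precedence over the corresponding name.

```python
from typing import Dict, List, Optional


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer (not a transformers class)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def tokenize(self, text: str) -> List[str]:
        return text.lower().split()


class ToyEmbedder:
    """Hypothetical stand-in for an embedding model (not a transformers class)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def embed(self, tokens: List[str]) -> List[float]:
        # Toy "embedding": token count and mean token length.
        if not tokens:
            return [0.0, 0.0]
        return [float(len(tokens)), sum(len(t) for t in tokens) / len(tokens)]


class SketchPreprocessor:
    """Illustrative only: mirrors the argument names of UFDPreprocessor."""

    def __init__(
        self,
        tokenizer: Optional[ToyTokenizer] = None,
        embedding_model: Optional[ToyEmbedder] = None,
        tokenizer_name: str = "xlm-roberta-large",
        embedding_model_name: str = "xlm-roberta-large",
    ) -> None:
        # Assumption: use an injected instance if given, else create one by name.
        self.tokenizer = (
            tokenizer if tokenizer is not None else ToyTokenizer(tokenizer_name)
        )
        self.embedding_model = (
            embedding_model
            if embedding_model is not None
            else ToyEmbedder(embedding_model_name)
        )

    def __call__(self, data_batch: List[str]) -> Dict[str, List[List[float]]]:
        # One embedded row per input string, stored under 'input_ids'.
        embedded = [
            self.embedding_model.embed(self.tokenizer.tokenize(text))
            for text in data_batch
        ]
        return {"input_ids": embedded}


# Inject a tokenizer instance; let the embedding model be created by name.
injected = ToyTokenizer("my-custom-tokenizer")
preprocessor = SketchPreprocessor(tokenizer=injected)
batch = preprocessor(["Hello world", "UFD"])
```

In this sketch the injected tokenizer is used as-is, while the embedding model falls back to the default name. With the real class, the name-based path would load pretrained weights through from_pretrained rather than construct a toy object.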