RstPreprocessor

class RstPreprocessor(tokenizer: Optional[transformers.tokenization_utils.PreTrainedTokenizer] = None)

Preprocesses a list of raw texts into a batch of tensors.

__call__(sentences: List[str])

Main entry point: runs RST preprocessing on a list of input texts.

Parameters: sentences (List[str]) – list of input texts
Returns: a BatchEncoding instance with key ‘data_batch’ holding the embedded values of the data batch, together with a list containing the length of each text in the batch.
Return type: Tuple[BatchEncoding, List[int]]
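
The call contract above (a list of texts in, a pair of batch and lengths out) can be illustrated with a minimal sketch. ToyPreprocessor below is a hypothetical stand-in, not the RstPreprocessor implementation: it assumes whitespace tokenization, a plain dict in place of BatchEncoding, a made-up "<pad>" token, and lengths counted in tokens, none of which is confirmed by this reference.

```python
from typing import Dict, List, Tuple


class ToyPreprocessor:
    """Hypothetical stand-in that mimics only the documented call contract.

    Assumptions (not taken from RstPreprocessor): whitespace tokenization,
    a plain dict instead of BatchEncoding, and lengths counted in tokens.
    """

    PAD = "<pad>"  # assumed padding token, for illustration only

    def __call__(
        self, sentences: List[str]
    ) -> Tuple[Dict[str, List[List[str]]], List[int]]:
        # Split each raw text into tokens.
        tokenized = [sentence.split() for sentence in sentences]
        # Record the unpadded length of each text.
        lengths = [len(tokens) for tokens in tokenized]
        max_len = max(lengths, default=0)
        # Pad every token list to the longest one so the batch is rectangular.
        padded = [
            tokens + [self.PAD] * (max_len - len(tokens)) for tokens in tokenized
        ]
        return {"data_batch": padded}, lengths


preprocessor = ToyPreprocessor()
batch, lengths = preprocessor(["The cat sat .", "It rained ."])
print(batch["data_batch"])
print(lengths)
```

With the real class, the same unpacking pattern would apply: the first element of the returned tuple is indexed by ‘data_batch’, and the second gives one length per input text.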

get_elmo_char_ids(tokenized_sentences: List[str])

Gets the ELMo embedding for a batch of tokenized texts.