RstPreprocessor

class RstPreprocessor(tokenizer: Optional[transformers.tokenization_utils.PreTrainedTokenizer] = None)

Preprocesses a list of raw texts into a batch of tensors.

__call__(sentences: List[str])

Main entry point: runs RST preprocessing on the input texts.

Parameters

sentences (List[str]) – list of input texts

Returns

A BatchEncoding instance whose ‘data_batch’ key holds the embedded values of the batch, together with a list giving the length of each text in the batch.

Return type

Tuple[BatchEncoding, List[int]]
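
The call contract above can be illustrated with a minimal stand-in. This is a sketch under stated assumptions, not the library's implementation: `ToyRstPreprocessor` is hypothetical, whitespace splitting replaces the real tokenizer, lengths are assumed to be counted in tokens, and a plain dict replaces `BatchEncoding`. It only shows the shape of the result, a mapping with a `data_batch` key plus one length per input text.

```python
from typing import Dict, List, Tuple


class ToyRstPreprocessor:
    """Hypothetical stand-in that mimics the documented return shape only."""

    def __call__(
        self, sentences: List[str]
    ) -> Tuple[Dict[str, List[List[str]]], List[int]]:
        # Assumption: whitespace splitting stands in for the real tokenizer.
        tokenized = [sentence.split() for sentence in sentences]
        # One length per input text, measured in tokens (an assumption).
        lengths = [len(tokens) for tokens in tokenized]
        # Pad every text to the longest one so the batch is rectangular.
        max_len = max(lengths, default=0)
        padded = [
            tokens + ["<pad>"] * (max_len - len(tokens)) for tokens in tokenized
        ]
        # A plain dict stands in for BatchEncoding; the real values are tensors.
        return {"data_batch": padded}, lengths


preprocess = ToyRstPreprocessor()
batch, lengths = preprocess(["The cat sat .", "It purred"])
print(lengths)
print(batch["data_batch"][1])
```

The returned lengths are what allow a downstream consumer to tell real tokens from padding once the texts have been batched together.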

get_elmo_char_ids(tokenized_sentences: List[str])

Gets the ELMo embeddings for a batch of texts.

Parameters

tokenized_sentences (List[str]) – list of input texts

Returns

A dictionary of ELMo embeddings.

Return type

Dict[str, List]
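
The method name suggests that texts are converted to per-token character ids before reaching ELMo. The sketch below shows that general idea only and is not the library's code: `toy_char_ids`, the `char_ids` key, the width `MAX_CHARS`, the padding id, and the byte-based id scheme are all assumptions made for illustration.

```python
from typing import Dict, List

MAX_CHARS = 8  # hypothetical per-token width, kept small for illustration
PAD_ID = 0     # hypothetical padding id


def toy_char_ids(tokenized_sentences: List[str]) -> Dict[str, List]:
    """Hypothetical helper: map each token to a fixed-width list of character ids."""
    batch = []
    for sentence in tokenized_sentences:
        rows = []
        for token in sentence.split():
            # Shift UTF-8 byte values by one so that 0 stays reserved for padding.
            ids = [b + 1 for b in token.encode("utf-8")[:MAX_CHARS]]
            rows.append(ids + [PAD_ID] * (MAX_CHARS - len(ids)))
        batch.append(rows)
    # "char_ids" is an illustrative key name, not one confirmed by the documentation.
    return {"char_ids": batch}


result = toy_char_ids(["It purred"])
print(result["char_ids"][0][0])
```

Each token becomes a row of exactly `MAX_CHARS` ids, so a text with two tokens yields two rows of equal width regardless of how long the individual words are.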