RumourDetectionTwitterTokenizer
class RumourDetectionTwitterTokenizer(*args, vocab_file, **kwargs)[source]

This Tokenizer class performs word-level tokenization to generate tokens.
- Parameters
vocab_file (str) – path to the local vocabulary file used to build the tokenizer's vocabulary
Example:

# 1. From local vocab file
vocab_file = 'vocab.txt'
tokenizer = RumourDetectionTwitterTokenizer(vocab_file=vocab_file)
tokenizer.build_vocab()
token_ids, token_attention_masks = tokenizer.tokenize_threads(
    [
        ["The quick brown fox", "jumped over the lazy dog"],
        [
            "Are those shy Eurasian footwear",
            "cowboy chaps",
            "or jolly earthmoving headgear?",
        ],
    ]
)

# 2. Download pretrained from storage
#TODO
tokenize_threads(threads, max_length=None, max_posts=None, **kwargs)[source]

This method performs tokenization on a batch of Twitter threads and returns the token ids and attention masks for each Tweet.
- Parameters
threads (List[List[str]]) – A batch of threads containing the raw text from the Tweets to be tokenized.
max_length (int) – Maximum number of tokens in a single Tweet
max_posts (int) – Maximum number of Tweets in a single thread
- Returns
token_ids (List[List[int]]) – token ids for each token in each Tweet. Each Tweet/thread is padded (or truncated) to max_length/max_posts respectively.
token_attention_masks (List[List[int]]) – attention mask (0 or 1) for each token in each Tweet, padded (or truncated) in the same way.
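To make the padding behaviour concrete, the snippet below is a minimal sketch. It assumes the tokenizer is importable from sgnlp.models.rumour_detection_twitter, that a local vocab.txt exists, and that the returned values are indexed as [thread][post][token]; verify these against your installed version.

# A minimal sketch, assuming the import path below and a local 'vocab.txt'.
from sgnlp.models.rumour_detection_twitter import RumourDetectionTwitterTokenizer

tokenizer = RumourDetectionTwitterTokenizer(vocab_file="vocab.txt")
tokenizer.build_vocab()

threads = [["first tweet in the thread", "a short reply"]]
token_ids, token_attention_masks = tokenizer.tokenize_threads(
    threads, max_length=8, max_posts=4
)

# Assuming outputs are indexed as [thread][post][token]:
print(len(token_ids[0]))            # posts in the first thread, padded to max_posts (4)
print(len(token_ids[0][0]))         # tokens in the first Tweet, padded to max_length (8)
print(token_attention_masks[0][0])  # 1 for real tokens, 0 for padding

Because every thread is padded to max_posts and every Tweet to max_length, the outputs form a rectangular batch that can be passed directly to a downstream model.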