RumourDetectionTwitterTokenizer

class RumourDetectionTwitterTokenizer(*args, vocab_file, **kwargs)

This tokenizer class performs word-level tokenization to convert raw Tweet text into tokens.

Parameters

vocab_file (str) – path to the vocabulary file used to build the tokenizer's word-level vocabulary

Example:

# 1. From a local vocab file
# (import path assumes the sgnlp package layout)
from sgnlp.models.rumour_detection_twitter import RumourDetectionTwitterTokenizer

vocab_file = 'vocab.txt'
tokenizer = RumourDetectionTwitterTokenizer(vocab_file=vocab_file)
tokenizer.build_vocab()
token_ids, token_attention_masks = tokenizer.tokenize_threads(
    [
        ["The quick brown fox", "jumped over the lazy dog"],
        [
            "Are those shy Eurasian footwear",
            "cowboy chaps",
            "or jolly earthmoving headgear?",
        ],
    ]
)

# 2. Download pretrained from storage
#TODO
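# Hedged sketch (not from the library docs): assuming the tokenizer exposes
# the standard Hugging Face PreTrainedTokenizer.from_pretrained interface,
# loading a published vocabulary could look like the line below;
# "path/to/pretrained_vocab" is a hypothetical placeholder, not a published asset.
tokenizer = RumourDetectionTwitterTokenizer.from_pretrained("path/to/pretrained_vocab")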
tokenize_threads(threads, max_length=None, max_posts=None, **kwargs)

This method tokenizes a batch of Twitter threads and returns the token ids and attention masks for each Tweet.

Parameters
  • threads (List[List[str]]) – A batch of threads containing the raw text from the Tweets to be tokenized.

  • max_length (int) – Maximum number of tokens in a single Tweet.

  • max_posts (int) – Maximum number of Tweets in a single thread.

Returns

  • token_ids (List[List[int]]) – token ids for each token in each Tweet. Each Tweet/thread would have been padded (or truncated) to max_length/max_posts respectively.

  • token_attention_masks (List[List[int]]) – attention mask (0 or 1) for each token in each Tweet, padded (or truncated) in the same way.
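
The sketch below illustrates this padding/truncation contract, reusing the tokenizer built above. The values max_length=5 and max_posts=2 are hypothetical, and the output nesting assumed here matches the batched example earlier.

# Not library output; shapes assume one extra level of nesting per thread,
# as in the batched example above.
threads = [["The quick brown fox", "jumped over the lazy dog"]]
token_ids, token_attention_masks = tokenizer.tokenize_threads(
    threads, max_length=5, max_posts=2
)
assert len(token_ids[0]) == 2     # thread padded/truncated to max_posts
assert len(token_ids[0][0]) == 5  # each Tweet padded/truncated to max_length
assert all(m in (0, 1) for m in token_attention_masks[0][0])  # 1 = token, 0 = pad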