Bert

class Bert(model_name, device=None)
subword_tokenize(tokens)

Segment each token into subwords while keeping track of token boundaries.

Parameters

tokens: a sequence of strings representing the input tokens.

Returns

A tuple consisting of:

  • A list of subwords, flanked by the special symbols required by Bert (CLS and SEP).

  • An array of indices into the list of subwords, where each index marks a subword that starts a new token. For example, [1, 3, 4, 7] means that subwords 1, 3, 4, and 7 are token starts, while all other subwords (0, 2, 5, 6, 8…) are either special symbols or lie inside or at the end of a token. These indices allow selecting the Bert hidden states that represent tokens, which is necessary in sequence labeling.

Return type

tuple
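The relationship between the subword list and the token-start indices can be illustrated with a minimal sketch. It does not use Bert's actual tokenizer: `toy_subword_split` is a hypothetical stand-in that simply chops tokens into three-character pieces, `subword_tokenize_sketch` is an illustrative name, and the sketch returns plain Python lists rather than an array.

```python
def toy_subword_split(token, max_len=3):
    """Hypothetical stand-in for a subword tokenizer (NOT Bert's):
    chop a token into pieces of at most max_len characters and
    mark every non-initial piece with '##'."""
    pieces = [token[i:i + max_len] for i in range(0, len(token), max_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def subword_tokenize_sketch(tokens, split=toy_subword_split):
    """Return (subwords, token_start_idxs) in the shape described above."""
    subwords = ["[CLS]"]
    token_start_idxs = []
    for token in tokens:
        # The next subword appended is the first subword of this token.
        token_start_idxs.append(len(subwords))
        subwords.extend(split(token))
    subwords.append("[SEP]")
    return subwords, token_start_idxs


subwords, token_start_idxs = subword_tokenize_sketch(["Berlin", "is", "big"])
# subwords         == ['[CLS]', 'Ber', '##lin', 'is', 'big', '[SEP]']
# token_start_idxs == [1, 3, 4]

# Fake "hidden states", one vector per subword, to show the selection step.
hidden_states = [[float(i), float(i)] for i in range(len(subwords))]
token_states = [hidden_states[i] for i in token_start_idxs]
# token_states == [[1.0, 1.0], [3.0, 3.0], [4.0, 4.0]]
```

Selecting by `token_start_idxs` yields exactly one vector per input token, which is what a sequence labeling model needs in order to predict one label per token.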

subword_tokenize_to_ids(tokens)

Segment each token into subwords while keeping track of token boundaries, and convert the subwords into IDs.

Parameters

tokens: a sequence of strings representing the input tokens.

Returns

A tuple consisting of:

  • A list of subword IDs, including the IDs of the special symbols (CLS and SEP) required by Bert.

  • A mask indicating padding tokens.

  • An array of indices into the list of subwords. See the documentation of subword_tokenize.

Return type

tuple
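As a rough illustration of the ID conversion and the padding mask, the following sketch maps subwords to IDs with a toy vocabulary and pads to a fixed length. Everything here is an assumption made for illustration: the vocabulary, the helper name `to_ids_sketch`, the `max_len` parameter, and the mask convention (1 for real subwords, 0 for padding). The actual method may use a different convention.

```python
def to_ids_sketch(subwords, vocab, max_len, pad_id=0, unk_id=1):
    """Map subwords to IDs and pad to max_len (illustrative helper only).

    Returns (ids, mask). The mask convention is an assumption:
    1 marks a real subword, 0 marks a padding position.
    """
    ids = [vocab.get(s, unk_id) for s in subwords]
    if len(ids) > max_len:
        raise ValueError("sequence is longer than max_len")
    n_pad = max_len - len(ids)
    mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, mask


# Toy vocabulary, not Bert's real one.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "Ber": 4, "##lin": 5, "is": 6, "big": 7}

subwords = ["[CLS]", "Ber", "##lin", "is", "big", "[SEP]"]
ids, mask = to_ids_sketch(subwords, vocab, max_len=8)
# ids  == [2, 4, 5, 6, 7, 3, 0, 0]
# mask == [1, 1, 1, 1, 1, 1, 0, 0]
```

Padding is appended after SEP, so the token-start indices computed on the unpadded subword list remain valid for the padded ID list.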