MultiHeadAttention

class MultiHeadAttention(config, d_model, n_head, attention_mask=None)

Based on the paper, each layer has two sublayers:
a multi-head attention mechanism and a position-wise fully connected feed-forward network.
Each sublayer employs a residual connection, y = f(x) + id(x) = f(x) + x, followed by layer normalization.
This Python module defines the multi-head attention network.
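Below is a minimal sketch of the attention sublayer described above, written with PyTorch. The fields of the config argument are not documented here, so the sketch takes d_model and n_head directly and omits config; the class name MultiHeadAttentionSketch and its internals are illustrative assumptions, not this repository's actual implementation. The residual connection and layer normalization follow y = LayerNorm(f(x) + x) as stated above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttentionSketch(nn.Module):
    # Illustrative sketch only; the repository's MultiHeadAttention also takes a config object.
    def __init__(self, d_model, n_head, attention_mask=None):
        super().__init__()
        assert d_model % n_head == 0, "d_model must be divisible by n_head"
        self.d_head = d_model // n_head
        self.n_head = n_head
        # One projection each for queries, keys, and values, plus the output projection
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.layer_norm = nn.LayerNorm(d_model)
        self.attention_mask = attention_mask

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape

        def split_heads(t):
            # (batch, seq_len, d_model) -> (batch, n_head, seq_len, d_head)
            return t.view(batch, seq_len, self.n_head, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention per head
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        if self.attention_mask is not None:
            scores = scores.masked_fill(self.attention_mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        context = attn @ v  # (batch, n_head, seq_len, d_head)

        # Merge the heads back together and apply the output projection
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        out = self.w_o(context)

        # Residual connection followed by layer normalization: y = LayerNorm(f(x) + x)
        return self.layer_norm(out + x)

Usage (shapes only): a tensor of shape (batch, seq_len, d_model) goes in and a tensor of the same shape comes out, so the sublayer can be stacked and followed by the position-wise feed-forward sublayer mentioned above.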