MultiHeadAttention

class MultiHeadAttention(config, d_model, n_head, attention_mask=None)[source]
Following the original Transformer paper, each layer has two sublayers:

A multi-head attention mechanism and a position-wise fully connected feed-forward network

Each sublayer employs a residual connection, y = f(x) + id(x) = f(x) + x, followed by layer normalization. This Python file defines the multi-head attention network.
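
The residual-plus-normalization pattern described above can be illustrated with a minimal PyTorch sketch. The SublayerConnection class, its parameters, and the dropout are illustrative assumptions, not part of this module's API:

    import torch
    import torch.nn as nn

    class SublayerConnection(nn.Module):
        """Residual connection followed by layer normalization: y = LayerNorm(x + f(x))."""

        def __init__(self, d_model, dropout=0.1):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, sublayer):
            # Apply the sublayer (e.g. multi-head attention), add the residual, then normalize.
            return self.norm(x + self.dropout(sublayer(x)))

The sublayer argument can be the attention mechanism or the feed-forward network, so the same wrapper covers both sublayers of a layer.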

forward(query, key, val, key_structure=None, val_structure=None, attention_mask=None)[source]

Computes the forward pass of the multi-head attention network for the given query, key, and value tensors, optionally applying an attention mask.
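
As a conceptual reference, a minimal scaled dot-product multi-head attention forward pass might look like the sketch below. The tensor shapes, the mask convention, and the omission of key_structure/val_structure are assumptions; this is not the class's actual implementation:

    import math
    import torch
    import torch.nn as nn

    class SimpleMultiHeadAttention(nn.Module):
        # Illustrative only: the documented class is configured via config, d_model, n_head.
        def __init__(self, d_model, n_head):
            super().__init__()
            assert d_model % n_head == 0
            self.n_head = n_head
            self.d_k = d_model // n_head
            self.w_q = nn.Linear(d_model, d_model)
            self.w_k = nn.Linear(d_model, d_model)
            self.w_v = nn.Linear(d_model, d_model)
            self.w_o = nn.Linear(d_model, d_model)

        def forward(self, query, key, val, attention_mask=None):
            batch = query.size(0)
            # Project and split into heads: (batch, n_head, seq_len, d_k).
            q = self.w_q(query).view(batch, -1, self.n_head, self.d_k).transpose(1, 2)
            k = self.w_k(key).view(batch, -1, self.n_head, self.d_k).transpose(1, 2)
            v = self.w_v(val).view(batch, -1, self.n_head, self.d_k).transpose(1, 2)
            # Scaled dot-product attention scores.
            scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
            if attention_mask is not None:
                # Assumed convention: positions marked 0 in the mask are excluded.
                scores = scores.masked_fill(attention_mask == 0, float("-inf"))
            weights = torch.softmax(scores, dim=-1)
            context = torch.matmul(weights, v)
            # Re-merge heads and apply the output projection.
            context = context.transpose(1, 2).contiguous().view(batch, -1, self.n_head * self.d_k)
            return self.w_o(context)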