Multi-Head Self Attention - Transformer Encoder

Multi-Head Self-Attention

The description above covers (single-head) Self-Attention;

for Multi-Head attention, see the following.

Each head performs Scaled Dot-Product Attention independently,
and the different heads attend in parallel to different kinds of relations (position relation, semantic relation, syntactic relation, and so on).
The heads also split hidden_size ($d_\text{model}$) among themselves:
with hidden_size $d_\text{model}$ and $h$ heads,
each head typically has dimension $\frac{d_\text{model}}{h}$,
and the per-head outputs are concatenated and then projected again to form the final representation.
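
The split, per-head attention, concat, and projection steps described above can be sketched in NumPy as follows. This is a minimal illustration and not the reference implementation: the function name `multi_head_self_attention`, the weight names `w_q`, `w_k`, `w_v`, `w_o`, and the toy sizes are assumptions made for this example, and biases, masking, and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads  # each head gets d_model / h dimensions

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product attention, computed independently for each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, seq, seq)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                                # (h, seq, d_head)

    # Concatenate the heads back to (seq_len, d_model), then project.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 8, 2, 5
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

The output has the same shape as the input, `(seq_len, d_model)`, and `weights` holds one `(seq_len, seq_len)` attention map per head, so each head can end up with a different pattern over the same tokens.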

For details on Self Attention and Scaled Dot-Product Attention, see:

Self Attention

ds31x.github.io

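In the actual implementation of Vaswani et al. (2017),

the Multi-head Attention process can be summarized as in the following figure:

This Multi-head Attention Layer is
connected to a short-cut connection and Layer Normalization as follows,
and the combination as a whole is referred to as Multi-head Attention (actual implementations also include dropout):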

In the Transformer (2017), which handles sequential text (BERT included),
Layer Normalization is placed after Multi-head Attention (post-normalization),

whereas ViT (Vision Transformer, 2020) uses pre-normalization, with Layer Normalization placed before Multi-head Attention (this is also used in GPT-2 and later GPT models).
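
The difference between the two placements can be sketched as below. This is a simplified illustration under stated assumptions: `layer_norm` omits the learned gain and bias, dropout is left out, and `toy_sublayer` is a hypothetical stand-in for Multi-head Attention (a plain matrix multiply), not the real layer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position over the feature dimension
    # (learned gain and bias are omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_sublayer(x, sublayer):
    # Post-normalization: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

def pre_ln_sublayer(x, sublayer):
    # Pre-normalization: x + Sublayer(LayerNorm(x))
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # (seq_len, d_model), toy sizes
w = rng.normal(size=(8, 8))

def toy_sublayer(t):
    # Hypothetical stand-in for Multi-head Attention.
    return t @ w

post_out = post_ln_sublayer(x, toy_sublayer)
pre_out = pre_ln_sublayer(x, toy_sublayer)
```

In the post-normalization form the block output itself is normalized, while in the pre-normalization form the short-cut path carries `x` through unchanged and only the sublayer input is normalized.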

Connecting the structure above with the following Position-wise Feed Forward Layer gives the Encoder Block of the Transformer:

In MHA,

each token (Query) uses its relations to the other tokens (Keys) to compute attention weights,
and those weights are used to compute a weighted sum of the value vectors.
Each value vector is a linear transformation of the input representation,
and the attention-weighted sum is, for given attention weights, also a linear transformation of the value vectors.
In other words, attention consists mainly of linear transformations.
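
The linearity of the weighted sum can be checked numerically. The sketch below holds the attention weights `A` fixed and verifies that the output is linear in the value vectors; the names `A`, `V1`, `V2` and the toy sizes are assumptions for this example. Note that the weights themselves come from a softmax over the input, so the check covers only the weighted-sum step, which matches the "mainly linear" wording above.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 3  # toy sizes

# Fixed attention weights: each row is non-negative and sums to 1.
logits = rng.normal(size=(seq_len, seq_len))
A = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

V1 = rng.normal(size=(seq_len, d))
V2 = rng.normal(size=(seq_len, d))
a, b = 2.0, -0.5

# Linearity in the values: A(a*V1 + b*V2) == a*(A V1) + b*(A V2)
lhs = A @ (a * V1 + b * V2)
rhs = a * (A @ V1) + b * (A @ V2)
is_linear_in_values = np.allclose(lhs, rhs)
```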

The Position-wise Feed-Forward Layer that follows

applies a feed-forward network with the same parameters to the representation at each position independently,
adding the non-linearity and feature transformation that attention alone does not provide.
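
A minimal sketch of what "position-wise" means, using the two-layer ReLU form $\max(0, xW_1 + b_1)W_2 + b_2$ from Vaswani et al. (2017). The function name `position_wise_ffn`, the toy sizes, and the zero-initialized biases are assumptions for this example.

```python
import numpy as np

def position_wise_ffn(x, w1, b1, w2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, with the same weights at every position.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32  # toy sizes
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Applying the network to the whole sequence at once ...
whole = position_wise_ffn(x, w1, b1, w2, b2)
# ... gives the same result as applying it to each position separately.
per_position = np.stack(
    [position_wise_ffn(x[i], w1, b1, w2, b2) for i in range(seq_len)]
)
```

Because no information moves between positions here, any mixing across tokens in the Encoder Block has to come from the attention sublayer.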

The following figure, taken from the paper by Vaswani et al. (2017), gives a simplified view of the Encoder Block built from MHA and FF in the Transformer:

Attention is All You Need (2017)

Transformer

Attention is all you need

ds31x.github.io
