https://huggingface.co/transformers/model_summary.html
Pretraining methods
- MLM: masked language modeling (BERT, RoBERTa, ...); see the masking sketch after this list
- PMLM: pseudo-masked language modeling (UniLM v2)
- WWM: whole word masking
- CLM: causal language modeling (GPT, GPT-2, ...)
- PLM: permutation language modeling (XLNet)
- RTD: replaced token detection (ELECTRA)
- NSP: next sentence prediction (is the second sentence really the one that follows the first in the original text?)
- SOP: sentence order prediction (are the first and second sentences in the original order or swapped?)
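A minimal sketch of BERT-style MLM masking, assuming the standard recipe (15% of positions selected; of those, 80% become [MASK], 10% a random token, 10% left unchanged); mlm_mask and its arguments are hypothetical names, not library API:

    import torch

    def mlm_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
        # mutates input_ids in place; labels are -100 except at selected positions,
        # so the loss is computed only on the tokens we try to predict
        labels = input_ids.clone()
        selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
        labels[~selected] = -100

        # 80% of selected positions -> [MASK]
        mask_pos = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
        input_ids[mask_pos] = mask_token_id

        # half of the remaining 20% -> a random token (i.e. 10% overall)
        rand_pos = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~mask_pos
        input_ids[rand_pos] = torch.randint(vocab_size, input_ids.shape)[rand_pos]

        # the last 10% stay unchanged
        return input_ids, labels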
ELMo
Embeddings from Language Models: token embeddings taken after a recurrent network, so they incorporate sentence-level context.
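A minimal sketch of the idea only (not the actual ELMo architecture, which uses character CNNs and stacked LSTMs): run token embeddings through a bidirectional LSTM so each output vector depends on the whole sentence; all sizes below are hypothetical:

    import torch
    import torch.nn as nn

    vocab_size, emb_dim, hid = 1000, 128, 256          # hypothetical sizes
    emb = nn.Embedding(vocab_size, emb_dim)
    bilstm = nn.LSTM(emb_dim, hid, batch_first=True, bidirectional=True)

    tokens = torch.randint(vocab_size, (1, 9))         # (batch, seq_len)
    ctx, _ = bilstm(emb(tokens))                       # (1, 9, 2*hid): contextual embeddings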
BERT
Bidirectional Encoder Representations from Transformers
- Pretraining tasks: masked LM and next sentence prediction
- Embeddings (summed as in the sketch below)
    token embeddings: word vectors
    segment embeddings: whether the token belongs to the first or second sentence
    position embeddings: the token's position in the sequence
- A stack of Transformer encoder layers
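A minimal sketch of how the three embeddings combine, mirroring the BertEmbeddings block in the printout below: they are summed elementwise, then LayerNorm and dropout are applied:

    import torch
    import torch.nn as nn

    hidden = 768
    word_emb = nn.Embedding(30522, hidden, padding_idx=0)
    pos_emb = nn.Embedding(512, hidden)
    seg_emb = nn.Embedding(2, hidden)
    norm, drop = nn.LayerNorm(hidden, eps=1e-12), nn.Dropout(0.1)

    input_ids = torch.randint(30522, (1, 9))
    positions = torch.arange(9).unsqueeze(0)          # 0 .. seq_len-1
    segments = torch.zeros(1, 9, dtype=torch.long)    # all first-sentence here

    x = word_emb(input_ids) + pos_emb(positions) + seg_emb(segments)
    x = drop(norm(x))                                 # (1, 9, 768), input to the encoder stack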
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)  # with elementwise_affine=True, LayerNorm has learnable weight and bias parameters: https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)  # followed by an activation function (GELU)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      ...
    )
  )
)
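For reference, the printout above is simply the repr of a pretrained model:

    from transformers import BertModel

    model = BertModel.from_pretrained("bert-base-uncased")
    print(model)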
Output layers
BertOutput and BertSelfOutput have the same module structure; only the parameter dimensions differ.
- In BertOutput, the LayerNorm takes the hidden states from before and after the BertIntermediate -> BertOutput.dense -> BertOutput.dropout path, adds them (a residual connection), and normalizes the sum.
- In BertSelfOutput, the LayerNorm likewise adds the hidden states from before and after the BertSelfAttention -> BertSelfOutput.dense -> BertSelfOutput.dropout path and normalizes the sum.
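A sketch of that shared residual pattern, following the Hugging Face implementation (dimensions for bert-base):

    import torch.nn as nn

    class BertSelfOutput(nn.Module):
        def __init__(self, hidden=768):
            super().__init__()
            self.dense = nn.Linear(hidden, hidden)
            self.LayerNorm = nn.LayerNorm(hidden, eps=1e-12)
            self.dropout = nn.Dropout(0.1)

        def forward(self, hidden_states, input_tensor):
            # input_tensor is the tensor from *before* self-attention;
            # the sum of the before/after tensors is what gets normalized
            hidden_states = self.dropout(self.dense(hidden_states))
            return self.LayerNorm(hidden_states + input_tensor)

    # BertOutput is identical except dense maps 3072 -> 768 and
    # input_tensor is the tensor from before BertIntermediate.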
Splitting attention Q/K/V across heads
def transpose_for_scores(self, x):
    print(x.shape)
    sz = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
    # (batch, pos, head, head_hid)
    x = x.view(*sz)
    print(x.shape)
    # (batch, head, pos, head_hid)
    return x.permute(0, 2, 1, 3)
torch.Size([1, 9, 768])
torch.Size([1, 9, 12, 64])
- The 768-dim hidden vector is split into 12 heads, 64 dims per head
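A self-contained version of the same reshape, reproducing the shapes printed above:

    import torch

    batch, seq_len, hidden = 1, 9, 768
    num_heads, head_size = 12, 64                      # 12 * 64 == 768

    x = torch.randn(batch, seq_len, hidden)            # torch.Size([1, 9, 768])
    x = x.view(batch, seq_len, num_heads, head_size)   # torch.Size([1, 9, 12, 64])
    x = x.permute(0, 2, 1, 3)                          # (batch, head, pos, head_hid)
    print(x.shape)                                     # torch.Size([1, 12, 9, 64])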
ERNIE
Enhanced Representation through kNowledge IntEgration: masks out whole words, entities, and phrases rather than individual subword tokens.
RoBERTa
- Dynamic masking (the original BERT samples the mask only once per epoch, not per batch); see the sketch below
- Drops the next sentence prediction (NSP) task
- Byte-Pair Encoding (BPE)
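A sketch of the difference, reusing the hypothetical mlm_mask helper from the MLM sketch above (103 is [MASK] in the bert-base-uncased vocab):

    # static masking: sample the mask once during preprocessing and reuse it
    # dynamic masking: sample a fresh mask every time a batch is built
    def collate(batch_ids):
        return mlm_mask(batch_ids.clone(), mask_token_id=103, vocab_size=30522)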
XLNet
- Masking inside the Transformer (attention masks instead of [MASK] tokens in the input); see the sketch below
- Permutation Language Modeling
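A simplified sketch of the internal masking idea (the real model adds two-stream attention on top of this): sample a factorization order, then let position i attend only to positions that come before it in that order:

    import torch

    seq_len = 5
    perm = torch.randperm(seq_len)            # a random factorization order
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[perm] = torch.arange(seq_len)        # rank[i] = position of token i in the order

    # attn_mask[i, j] is True where token i may attend to token j
    attn_mask = rank.unsqueeze(1) > rank.unsqueeze(0)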
ALBERT
A Lite BERT: shrinks the overall parameter count and speeds up training.
- Embedding dim != hidden dim: in BERT and RoBERTa, the embedding size (E) and the Transformer hidden size (H) are equal; ALBERT makes H (which models context, arguably the more important part) larger than E; see the factorization sketch below
- Parameters shared across all 12 layers
- SOP instead of NSP
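A sketch of the embedding factorization: one V x H matrix becomes V x E plus E x H, which shrinks the parameter count when E is much smaller than H (E=128, H=768 as in albert-base):

    import torch.nn as nn

    V, E, H = 30000, 128, 768
    bert_style = nn.Embedding(V, H)            # V*H        = ~23.0M parameters
    albert_style = nn.Sequential(              # V*E + E*H  = ~3.9M parameters
        nn.Embedding(V, E),
        nn.Linear(E, H, bias=False),
    )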
GPT
Generative Pre-Training: built on the Transformer decoder
- Unidirectional (not bidirectional); see the causal-mask sketch below
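A sketch of the causal (decoder-side) attention mask that makes the model unidirectional:

    import torch

    seq_len = 5
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # causal[i, j] is True only for j <= i; attention scores where the mask
    # is False are set to -inf before the softmax, so no token sees the future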
ELECTRA
- RTD (replaced token detection); see the sketch below
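A sketch of the RTD setup, with hypothetical names: a small MLM generator proposes tokens at masked positions, and the discriminator classifies every token as original vs. replaced:

    import torch

    def rtd_targets(input_ids, selected, generator_logits):
        # sample the generator's predictions at the selected (masked) positions
        sampled = torch.distributions.Categorical(logits=generator_logits).sample()
        corrupted = input_ids.clone()
        corrupted[selected] = sampled[selected]
        # label 1 where the token actually changed, 0 elsewhere
        labels = (corrupted != input_ids).long()
        return corrupted, labels   # discriminator inputs and per-token targets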