https://huggingface.co/transformers/model_summary.html
Pretraining methods
- MLM: masked language modeling (BERT, RoBERTa, ...); see the masking sketch after this list
- PMLM: pseudo-masked language modeling (UniLM v2)
- WWM: whole word masking
- CLM: causal language modeling (GPT, GPT-2, ...)
- PLM: permutation language modeling (XLNet)
- RTD: replaced token detection (ELECTRA)
- NSP: next sentence prediction (is the second sentence really the one that follows the first in the original text?)
- SOP: sentence order prediction (are the first and second sentences in the original order or swapped?)
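A minimal sketch of BERT-style MLM masking, assuming the standard recipe (15% of positions selected; of those, 80% become [MASK], 10% a random token, 10% left unchanged); mlm_mask and its arguments are hypothetical names, not library API:

    import torch

    def mlm_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
        # mutates input_ids in place; labels are -100 except at selected positions,
        # so the loss is computed only on the tokens we try to predict
        labels = input_ids.clone()
        selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
        labels[~selected] = -100

        # 80% of selected positions -> [MASK]
        mask_pos = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
        input_ids[mask_pos] = mask_token_id

        # half of the remaining 20% -> a random token (i.e. 10% overall)
        rand_pos = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~mask_pos
        input_ids[rand_pos] = torch.randint(vocab_size, input_ids.shape)[rand_pos]

        # the last 10% stay unchanged
        return input_ids, labels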
ELMo
Embeddings from Language Models: token embeddings taken after a recurrent network, so they incorporate sentence-level context.
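A minimal sketch of the idea only (not the actual ELMo architecture, which uses character CNNs and stacked LSTMs): run token embeddings through a bidirectional LSTM so each output vector depends on the whole sentence; all sizes below are hypothetical:

    import torch
    import torch.nn as nn

    vocab_size, emb_dim, hid = 1000, 128, 256          # hypothetical sizes
    emb = nn.Embedding(vocab_size, emb_dim)
    bilstm = nn.LSTM(emb_dim, hid, batch_first=True, bidirectional=True)

    tokens = torch.randint(vocab_size, (1, 9))         # (batch, seq_len)
    ctx, _ = bilstm(emb(tokens))                       # (1, 9, 2*hid): contextual embeddings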
BERT
Bidirectional Encoder Representations from Transformers
- Pretraining tasks: masked LM and next sentence prediction
- Embeddings (summed as in the sketch below)
    token embeddings: word vectors
    segment embeddings: whether the token belongs to the first or second sentence
    position embeddings: the token's position in the sequence
- A stack of Transformer encoder layers
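A minimal sketch of how the three embeddings combine, mirroring the BertEmbeddings block in the printout below: they are summed elementwise, then LayerNorm and dropout are applied:

    import torch
    import torch.nn as nn

    hidden = 768
    word_emb = nn.Embedding(30522, hidden, padding_idx=0)
    pos_emb = nn.Embedding(512, hidden)
    seg_emb = nn.Embedding(2, hidden)
    norm, drop = nn.LayerNorm(hidden, eps=1e-12), nn.Dropout(0.1)

    input_ids = torch.randint(30522, (1, 9))
    positions = torch.arange(9).unsqueeze(0)          # 0 .. seq_len-1
    segments = torch.zeros(1, 9, dtype=torch.long)    # all first-sentence here

    x = word_emb(input_ids) + pos_emb(positions) + seg_emb(segments)
    x = drop(norm(x))                                 # (1, 9, 768), input to the encoder stack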
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)  # with elementwise_affine=True, LayerNorm has learnable weight and bias parameters: https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)  # followed by an activation function (GELU)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      ...
    )
  )
)
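For reference, the printout above is simply the repr of a pretrained model:

    from transformers import BertModel

    model = BertModel.from_pretrained("bert-base-uncased")
    print(model)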
Output layers
BertOutput and BertSelfOutput have the same module structure; only the parameter dimensions differ.
- In BertOutput, the LayerNorm takes the hidden states from before and after the BertIntermediate -> BertOutput.dense -> BertOutput.dropout path, adds them (a residual connection), and normalizes the sum.
- In BertSelfOutput, the LayerNorm likewise adds the hidden states from before and after the BertSelfAttention -> BertSelfOutput.dense -> BertSelfOutput.dropout path and normalizes the sum.
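A sketch of that shared residual pattern, following the Hugging Face implementation (dimensions for bert-base):

    import torch.nn as nn

    class BertSelfOutput(nn.Module):
        def __init__(self, hidden=768):
            super().__init__()
            self.dense = nn.Linear(hidden, hidden)
            self.LayerNorm = nn.LayerNorm(hidden, eps=1e-12)
            self.dropout = nn.Dropout(0.1)

        def forward(self, hidden_states, input_tensor):
            # input_tensor is the tensor from *before* self-attention;
            # the sum of the before/after tensors is what gets normalized
            hidden_states = self.dropout(self.dense(hidden_states))
            return self.LayerNorm(hidden_states + input_tensor)

    # BertOutput is identical except dense maps 3072 -> 768 and
    # input_tensor is the tensor from before BertIntermediate.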
Splitting attention Q/K/V across heads
def transpose_for_scores(self, x):
    print(x.shape)
    sz = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
    # (batch, pos, head, head_hid)
    x = x.view(*sz)
    print(x.shape)
    # (batch, head, pos, head_hid)
    return x.permute(0, 2, 1, 3)
torch.Size([1, 9, 768])
torch.Size([1, 9, 12, 64])
- The 768-dim hidden vector is split into 12 heads, 64 dims per head
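A self-contained version of the same reshape, reproducing the shapes printed above:

    import torch

    batch, seq_len, hidden = 1, 9, 768
    num_heads, head_size = 12, 64                      # 12 * 64 == 768

    x = torch.randn(batch, seq_len, hidden)            # torch.Size([1, 9, 768])
    x = x.view(batch, seq_len, num_heads, head_size)   # torch.Size([1, 9, 12, 64])
    x = x.permute(0, 2, 1, 3)                          # (batch, head, pos, head_hid)
    print(x.shape)                                     # torch.Size([1, 12, 9, 64])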
ERNIE
Enhanced Representation through kNowledge IntEgration: masks out whole words, entities, and phrases rather than individual subword tokens.
RoBERTa
- Dynamic masking (the original BERT samples the mask only once per epoch, not per batch); see the sketch below
- Drops the next sentence prediction (NSP) task
- Byte-Pair Encoding (BPE)
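A sketch of the difference, reusing the hypothetical mlm_mask helper from the MLM sketch above (103 is [MASK] in the bert-base-uncased vocab):

    # static masking: sample the mask once during preprocessing and reuse it
    # dynamic masking: sample a fresh mask every time a batch is built
    def collate(batch_ids):
        return mlm_mask(batch_ids.clone(), mask_token_id=103, vocab_size=30522)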
XLNet
- Masking inside the Transformer (attention masks instead of [MASK] tokens in the input); see the sketch below
- Permutation Language Modeling
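A simplified sketch of the internal masking idea (the real model adds two-stream attention on top of this): sample a factorization order, then let position i attend only to positions that come before it in that order:

    import torch

    seq_len = 5
    perm = torch.randperm(seq_len)            # a random factorization order
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[perm] = torch.arange(seq_len)        # rank[i] = position of token i in the order

    # attn_mask[i, j] is True where token i may attend to token j
    attn_mask = rank.unsqueeze(1) > rank.unsqueeze(0)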
ALBERT
A Lite BERT: shrinks the overall parameter count and speeds up training.
- Embedding dim != hidden dim: in BERT and RoBERTa, the embedding size (E) and the Transformer hidden size (H) are equal; ALBERT makes H (which models context, arguably the more important part) larger than E; see the factorization sketch below
- Parameters shared across all 12 layers
- SOP instead of NSP
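A sketch of the embedding factorization: one V x H matrix becomes V x E plus E x H, which shrinks the parameter count when E is much smaller than H (E=128, H=768 as in albert-base):

    import torch.nn as nn

    V, E, H = 30000, 128, 768
    bert_style = nn.Embedding(V, H)            # V*H        = ~23.0M parameters
    albert_style = nn.Sequential(              # V*E + E*H  = ~3.9M parameters
        nn.Embedding(V, E),
        nn.Linear(E, H, bias=False),
    )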
GPT
Generative Pre-Training: built on the Transformer decoder
- Unidirectional (not bidirectional); see the causal-mask sketch below
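A sketch of the causal (decoder-side) attention mask that makes the model unidirectional:

    import torch

    seq_len = 5
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # causal[i, j] is True only for j <= i; attention scores where the mask
    # is False are set to -inf before the softmax, so no token sees the future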
ELECTRA
- RTD (replaced token detection); see the sketch below
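A sketch of the RTD setup, with hypothetical names: a small MLM generator proposes tokens at masked positions, and the discriminator classifies every token as original vs. replaced:

    import torch

    def rtd_targets(input_ids, selected, generator_logits):
        # sample the generator's predictions at the selected (masked) positions
        sampled = torch.distributions.Categorical(logits=generator_logits).sample()
        corrupted = input_ids.clone()
        corrupted[selected] = sampled[selected]
        # label 1 where the token actually changed, 0 elsewhere
        labels = (corrupted != input_ids).long()
        return corrupted, labels   # discriminator inputs and per-token targets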