https://huggingface.co/transformers/model_summary.html

Pretraining methods

  • MLM
    • masked language modeling (see the masking sketch after this list)
      • BERT
      • RoBERTa, ...
  • PMLM
    • pseudo-masked language modeling
      • UniLMv2
  • WWM
    • Whole word masking
  • CLM
    • causal language modeling
      • GPT, GPT-2, ...
  • PLM
    • permutation language modeling
      • XLNet
  • RTD
    • replaced token detection
      • ELECTRA
  • NSP
    • next sentence prediction
    • predict whether the second sentence is the one that actually follows the first sentence in the original text (BERT)
  • SOP
    • sentence order prediction
    • predict whether the two sentences appear in their original order or have been swapped (ALBERT)
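As a concrete illustration of MLM (referenced in the list above), here is a minimal sketch of BERT-style masking with the 80/10/10 replacement rule; the toy vocabulary and mask_prob are illustrative assumptions, not tied to any real tokenizer.

    import random

    MASK, VOCAB = "[MASK]", ["cat", "dog", "sat", "mat", "the", "on"]

    def mlm_mask(tokens, mask_prob=0.15):
        """BERT-style masking: pick ~15% of positions; of those,
        80% -> [MASK], 10% -> a random token, 10% -> left unchanged.
        Labels keep the original token only at masked positions."""
        corrupted, labels = list(tokens), [None] * len(tokens)
        for i, tok in enumerate(tokens):
            if random.random() < mask_prob:
                labels[i] = tok
                r = random.random()
                if r < 0.8:
                    corrupted[i] = MASK
                elif r < 0.9:
                    corrupted[i] = random.choice(VOCAB)
                # else: keep the original token
        return corrupted, labels

    print(mlm_mask("the cat sat on the mat".split()))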

ELMO

Embeddings from Language Models: the embedding is taken after a recurrent network has run over the sentence, so it incorporates sentence-level context (a sketch follows).
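A minimal sketch of the idea, assuming a single bidirectional LSTM layer; the real ELMo uses a character CNN plus a two-layer biLM, so the sizes and layers here are only illustrative.

    import torch
    import torch.nn as nn

    vocab_size, emb_dim, hidden = 1000, 128, 256
    embed = nn.Embedding(vocab_size, emb_dim)
    bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    token_ids = torch.randint(0, vocab_size, (1, 9))  # (batch, seq_len)
    static = embed(token_ids)                         # context-independent vectors
    contextual, _ = bilstm(static)                    # (1, 9, 2*hidden): each vector now depends on the whole sentence
    print(static.shape, contextual.shape)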

BERT

Bidirectional Encoder Representations from Transformers

  1. Pretraining tasks: Masked LM + Next Sentence Prediction
  2. Embeddings: token embedding (the word vector), segment embedding (marks whether a token belongs to the first or second sentence), position embedding (the token's position in the sequence)
  3. A stack of Transformer encoder layers:
    BertModel(
    (embeddings): BertEmbeddings(
     (word_embeddings): Embedding(30522, 768, padding_idx=0)
     (position_embeddings): Embedding(512, 768)
     (token_type_embeddings): Embedding(2, 768)
     (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
     (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
     (layer): ModuleList(
       (0): BertLayer(
         (attention): BertAttention(
           (self): BertSelfAttention(
             (query): Linear(in_features=768, out_features=768, bias=True)
             (key): Linear(in_features=768, out_features=768, bias=True)
             (value): Linear(in_features=768, out_features=768, bias=True)
             (dropout): Dropout(p=0.1, inplace=False)
           )
           (output): BertSelfOutput(
             (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) # https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html with elementwise_affine=True, LayerNorm has learnable weight and bias parameters
             (dropout): Dropout(p=0.1, inplace=False)
           )
         )
         (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)  # followed by an activation function (GELU)
         )
         (output): BertOutput(
           (dense): Linear(in_features=3072, out_features=768, bias=True)
           (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
           (dropout): Dropout(p=0.1, inplace=False)
         )
       )
       ...
     )
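The module tree above is what printing the model shows; a minimal sketch of reproducing it with the transformers library, assuming a recent version and the bert-base-uncased checkpoint.

    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    print(model)  # prints the module tree shown above (truncated there)

    inputs = tokenizer("the cat sat on the mat", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch, seq_len, 768)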
    
  4. Output layers

    • BertOutput and BertSelfOutput have the same structure; only the parameter dimensions differ
    • In the BertOutput layer, LayerNorm normalizes the sum of the hidden states before and after the BertIntermediate → BertOutput.dense → BertOutput.dropout path (a residual connection)
    • In the BertSelfOutput layer, LayerNorm normalizes the sum of the hidden states before and after the BertSelfAttention → BertSelfOutput.dense → BertSelfOutput.dropout path (see the sketch below)
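The pattern described above is the standard project, dropout, add-residual, LayerNorm block. A minimal sketch with the BertSelfOutput dimensions (BertOutput is identical except its dense layer maps 3072 back to 768); the class name is illustrative.

    import torch
    import torch.nn as nn

    class SelfOutputSketch(nn.Module):
        """Same structure as BertSelfOutput: project, drop out, then
        LayerNorm over the sum of the projection and its input (residual)."""
        def __init__(self, hidden=768, drop=0.1):
            super().__init__()
            self.dense = nn.Linear(hidden, hidden)
            self.LayerNorm = nn.LayerNorm(hidden, eps=1e-12)
            self.dropout = nn.Dropout(drop)

        def forward(self, hidden_states, input_tensor):
            hidden_states = self.dense(hidden_states)
            hidden_states = self.dropout(hidden_states)
            # residual connection, then normalization
            return self.LayerNorm(hidden_states + input_tensor)

    x = torch.randn(1, 9, 768)
    print(SelfOutputSketch()(x, x).shape)  # torch.Size([1, 9, 768])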
  5. Splitting the attention q/k/v into heads

# from transformers' BertSelfAttention (print calls added to show the shapes)
def transpose_for_scores(self, x):
    print(x.shape)                        # (batch, seq_len, hidden)
    sz = x.size()[:-1] + (self.num_attention_heads,
                          self.attention_head_size)
    x = x.view(*sz)                       # (batch, seq_len, head, head_dim)
    print(x.shape)
    return x.permute(0, 2, 1, 3)          # (batch, head, seq_len, head_dim)

Output for a (1, 9, 768) input:

torch.Size([1, 9, 768])
torch.Size([1, 9, 12, 64])

  • The 768-dimensional hidden vector is split into 12 heads of 64 dimensions each (12 × 64 = 768)
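A standalone sketch of the same head split with plain tensors (the names and random inputs are illustrative); the permute matters because the attention scores then become a single batched matmul, one seq_len × seq_len map per head.

    import math
    import torch

    batch, seq_len, hidden, heads = 1, 9, 768, 12
    head_dim = hidden // heads                               # 64

    def split_heads(x):
        # (batch, seq_len, hidden) -> (batch, heads, seq_len, head_dim)
        return x.view(batch, seq_len, heads, head_dim).permute(0, 2, 1, 3)

    q = split_heads(torch.randn(batch, seq_len, hidden))
    k = split_heads(torch.randn(batch, seq_len, hidden))

    scores = q @ k.transpose(-1, -2) / math.sqrt(head_dim)  # (1, 12, 9, 9)
    print(scores.shape)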

ERNIE

Enhanced Representation through Knowledge Integration: masks whole words, entities, and phrases rather than individual subword pieces

RoBERTa

  1. Dynamic masking (the original BERT generates the mask once, statically, during data preprocessing)
  2. Drops the next sentence prediction (NSP) task
  3. Byte-Pair Encoding (BPE) vocabulary (see the tokenizer snippet below)
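To see the BPE vocabulary in action, a small snippet assuming the transformers library and the roberta-base checkpoint; the exact subword splits depend on the learned merges.

    from transformers import RobertaTokenizer

    tok = RobertaTokenizer.from_pretrained("roberta-base")
    print(tok.tokenize("pretraining transformers"))  # each word becomes one or more BPE pieces
    print(tok("pretraining transformers")["input_ids"])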

XLNET

  1. The masking happens inside the Transformer attention, without putting [MASK] tokens in the input
  2. Permutation Language Modeling (a simplified mask sketch follows)
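A simplified sketch of turning a sampled factorization order into an attention mask (it ignores XLNet's two-stream attention, so this is only the core idea): a position may attend to the positions that come no later than it in the sampled order.

    import torch

    seq_len = 5
    perm = torch.randperm(seq_len)             # a sampled factorization order z
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[perm] = torch.arange(seq_len)         # rank[i] = where position i sits in the order

    # mask[i, j] = True if position i may attend to position j,
    # i.e. j comes no later than i in the factorization order
    mask = rank.unsqueeze(1) >= rank.unsqueeze(0)
    print(perm)
    print(mask.int())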

ALBERT

A Lite BERT: shrinks the overall parameter count and speeds up training

  1. Embedding dim != hidden dim: in BERT/RoBERTa the embedding size (E) and the Transformer hidden size (H) are equal; ALBERT keeps E small and makes H (which models context and arguably matters more) larger, via a factorized embedding (see the parameter-count sketch below)
  2. Parameters are shared across all 12 layers
  3. SOP (sentence order prediction) instead of NSP
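A minimal sketch of the factorized embedding in item 1, using the BERT-base vocabulary size and the albert-base sizes E=128, H=768 for illustration.

    import torch.nn as nn

    vocab, E, H = 30522, 128, 768

    # BERT/RoBERTa: embedding size equals hidden size
    tied = nn.Embedding(vocab, H)

    # ALBERT: small embedding size E, then project up to the hidden size H
    factorized = nn.Sequential(nn.Embedding(vocab, E), nn.Linear(E, H))

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(tied))        # 30522 * 768 ≈ 23.4M parameters
    print(count(factorized))  # 30522 * 128 + 128 * 768 + 768 ≈ 4.0M parameters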

GPT

Generative Pre-Training

  1. Uses the Transformer decoder
  2. Not bidirectional: left-to-right, causal attention (see the mask sketch below)
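A minimal sketch of the causal (left-to-right) attention mask that makes the decoder non-bidirectional; the sizes and random scores are illustrative.

    import torch

    seq_len = 5
    # position i may only attend to positions j <= i
    causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    print(causal_mask.int())

    scores = torch.randn(seq_len, seq_len)
    scores = scores.masked_fill(~causal_mask, float("-inf"))  # future positions get zero weight
    print(scores.softmax(dim=-1))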

ELECTRA

  1. RTD (replaced token detection): for every token, a discriminator predicts whether it is the original or a replacement produced by a small generator (see the sketch below)
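A minimal sketch of the replaced-token-detection labels. In ELECTRA the replacements come from a small MLM generator; here random in-vocabulary replacements stand in, so the vocabulary and replacement rule are assumptions.

    import random

    VOCAB = ["the", "cat", "dog", "sat", "ran", "on", "mat", "rug"]

    def rtd_example(tokens, replace_prob=0.15):
        """Return (corrupted tokens, labels): label 1 marks a replaced token,
        label 0 an original one. The discriminator is trained on every position."""
        corrupted, labels = [], []
        for tok in tokens:
            if random.random() < replace_prob:
                corrupted.append(random.choice([t for t in VOCAB if t != tok]))
                labels.append(1)
            else:
                corrupted.append(tok)
                labels.append(0)
        return corrupted, labels

    print(rtd_example("the cat sat on the mat".split()))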
