Fine-tuning DistilBertForSequenceClassification: the model is not learning. Why does the loss not change? Are the weights not being updated?

Question

I am relatively new to PyTorch and Hugging Face transformers, and I experimented with DistilBertForSequenceClassification on this Kaggle dataset.

import torch
import torch.nn as nn
import torch.optim as optim
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from transformers import get_linear_schedule_with_warmup

n_epochs = 5 # or whatever
batch_size = 32 # or whatever

bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
#bert_distil.classifier = nn.Sequential(nn.Linear(in_features=768, out_features=1), nn.Sigmoid())
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(bert_distil.parameters(), lr=0.1)

X_train = []
Y_train = []

for row in train_df.iterrows():
    seq = tokenizer.encode(preprocess_text(row[1]['text']),  add_special_tokens=True, pad_to_max_length=True)
    X_train.append(torch.tensor(seq).unsqueeze(0))
    Y_train.append(torch.tensor([row[1]['target']]).unsqueeze(0))
X_train = torch.cat(X_train)
Y_train = torch.cat(Y_train)

running_loss = 0.0
bert_distil.cuda()
bert_distil.train(True)
for epoch in range(n_epochs):
    permutation = torch.randperm(len(X_train))
    j = 0
    for i in range(0,len(X_train), batch_size):
        optimizer.zero_grad()
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X_train[indices], Y_train[indices]
        batch_x.cuda()
        batch_y.cuda()
        outputs = bert_distil.forward(batch_x.cuda())
        loss = criterion(outputs[0],batch_y.squeeze().cuda())
        loss.requires_grad = True
   
        loss.backward()
        optimizer.step()
   
        running_loss += loss.item()  
        j+=1
        if j == 20:   
            #print(outputs[0])
            print('[%d, %5d] running loss: %.3f loss: %.3f ' %
              (epoch + 1, i*1, running_loss / 20, loss.item()))
            running_loss = 0.0
            j = 0

No matter what I tried, the loss never decreased (it sometimes even increased), and the predictions did not get better. It seems to me that I forgot something, so the weights are not actually being updated. Does anyone have an idea?

What I tried

  • Different loss functions
    • BCE
    • CrossEntropy
    • even MSE loss

Recommended answer

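Looking at the running loss and the minibatch loss is easily misleading, because each value is computed on a different subset of the shuffled data. You should look at the epoch loss instead: it is computed over the same inputs every time, so its values are comparable from one epoch to the next.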
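To illustrate the difference, here is a minimal sketch in plain Python. The helper names `running_averages` and `epoch_loss` and all loss values are hypothetical, chosen only for illustration; they do not come from the original post.

```python
def running_averages(batch_losses, window):
    """Average the minibatch losses over consecutive windows of `window` batches."""
    averages = []
    for start in range(0, len(batch_losses) - window + 1, window):
        chunk = batch_losses[start:start + window]
        averages.append(sum(chunk) / window)
    return averages


def epoch_loss(batch_losses, batch_sizes):
    """Per-sample mean loss over the whole epoch, weighting each batch by its size."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical minibatch losses for two epochs. The data is reshuffled each
# epoch, so position i covers different samples in epoch 1 and epoch 2.
epoch_1 = [0.90, 0.50, 0.80, 0.60]
epoch_2 = [0.60, 0.85, 0.45, 0.70]
sizes = [32, 32, 32, 32]

windows_1 = running_averages(epoch_1, window=2)
windows_2 = running_averages(epoch_2, window=2)
print("running averages:", windows_1, windows_2)
print("epoch losses:", epoch_loss(epoch_1, sizes), epoch_loss(epoch_2, sizes))
```

In this made-up data the first running average of epoch 2 is higher than that of epoch 1, even though the loss over the full epoch went down. Comparing windows that cover different samples can therefore suggest that nothing is being learned when the epoch loss says otherwise.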

Besides, there are some problems in your code. After fixing all of them, the behavior is as expected: the loss slowly decreases after each epoch, and the model can also overfit a small minibatch. Please look at the code below; the changes include calling model(x) instead of model.forward(x), calling cuda() only once per tensor, using a smaller learning rate, etc.
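One way to test the "weights are not updated" suspicion directly is to snapshot the parameters before training and then try to overfit one small fixed batch. The following is a minimal sketch of that check, not the answer's code: it assumes a tiny `nn.Linear` classifier and synthetic data as stand-ins for DistilBERT and the Kaggle data, so that it runs on a CPU in well under a second. The same pattern should carry over to the real model.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic stand-in data (an assumption for illustration): 16 samples with
# 4 features each, and a binary label that depends on the first feature.
x = torch.randn(16, 4)
y = (x[:, 0] > 0).long()

model = nn.Linear(4, 2)  # stand-in for the real classifier
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-2)

# Snapshot the weights so we can verify that training changes them.
before = [p.detach().clone() for p in model.parameters()]

model.train()
losses = []
for step in range(200):
    optimizer.zero_grad()
    logits = model(x)            # call the module, not model.forward(x)
    loss = criterion(logits, y)  # logits: (16, 2) floats, y: (16,) int64 labels
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# True if at least one parameter tensor differs from its snapshot.
weights_changed = any(
    not torch.equal(b, p.detach()) for b, p in zip(before, model.parameters())
)
print("weights changed:", weights_changed)
print("first loss: %.3f, last loss: %.3f" % (losses[0], losses[-1]))
```

If the weights change and the loss on the fixed batch falls, the optimization loop itself is wired correctly, and the remaining suspects are hyperparameters such as the learning rate or the way the loss is being monitored.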

Tuning and fine-tuning ML models is difficult work.

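    import torch
    import torch.nn as nn
    import torch.optim as optim
    from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

    n_epochs = 5
    batch_size = 1

    bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    criterion = nn.CrossEntropyLoss()
    # Much smaller learning rate than the 0.1 used in the question
    optimizer = optim.Adam(bert_distil.parameters(), lr=1e-3)

    # train_df is the DataFrame from the question, with 'text' and 'target' columns
    X_train = []
    Y_train = []
    for row in train_df.iterrows():
        # Pad every sequence to the same length, then keep only the first 100 token ids
        seq = tokenizer.encode(row[1]['text'],  add_special_tokens=True, pad_to_max_length=True)[:100]
        X_train.append(torch.tensor(seq).unsqueeze(0))
        Y_train.append(torch.tensor([row[1]['target']]))
    X_train = torch.cat(X_train)  # shape: (n_samples, seq_len)
    Y_train = torch.cat(Y_train)  # shape: (n_samples,)

    running_loss = 0.0
    bert_distil.cuda()
    bert_distil.train(True)
    for epoch in range(n_epochs):
        # Reshuffle the sample order every epoch
        permutation = torch.randperm(len(X_train))
        for i in range(0, len(X_train), batch_size):
            optimizer.zero_grad()
            indices = permutation[i:i+batch_size]
            # Move each batch to the GPU exactly once
            batch_x, batch_y = X_train[indices].cuda(), Y_train[indices].cuda()
            # Call the model directly instead of model.forward(...)
            outputs = bert_distil(batch_x)
            # outputs[0] holds the classification logits
            loss = criterion(outputs[0], batch_y)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        # Average minibatch loss over the whole epoch
        print('[%d] epoch loss: %.3f' %
          (epoch + 1, running_loss / len(X_train) * batch_size))
        running_loss = 0.0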
    

Output:

    [1] epoch loss: 0.695
    [2] epoch loss: 0.690
    [3] epoch loss: 0.687
    [4] epoch loss: 0.685
    [5] epoch loss: 0.684
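
For context when reading these numbers: assuming the task is binary (DistilBertForSequenceClassification defaults to two labels), a model that assigns equal probability to both classes has a cross-entropy of ln 2 ≈ 0.693. The sketch below computes this reference value; `chance_level_cross_entropy` is a hypothetical helper name introduced only for illustration.

```python
import math


def chance_level_cross_entropy(num_classes):
    """Cross-entropy (in nats) of a uniform prediction over `num_classes` classes."""
    return -math.log(1.0 / num_classes)


baseline = chance_level_cross_entropy(2)
print("chance-level loss for 2 classes: %.3f" % baseline)
```

The reported epoch losses start near this value and end only slightly below it, which is consistent with the answer's remark that the loss decreases slowly.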
    

