Fine-tuning a pre-trained large language model (LLM) involves adjusting the model parameters with a new dataset to customize its performance for specific tasks. Below is a step-by-step guide with a Python code snippet to help you fine-tune an LLM using the Hugging Face Transformers library, which is one of the most popular and well-supported frameworks for this purpose.
Step 1: Set Up the Environment
First, ensure you have the necessary libraries installed (scikit-learn is used later for the evaluation metrics). You can install these using pip:

```bash
pip install transformers datasets torch scikit-learn
```
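If you plan to train on a GPU, it's worth confirming up front that PyTorch can actually see it. The snippet below is an optional sanity check and isn't required by the rest of the guide:

```python
import torch

# Optional sanity check: confirm whether a CUDA-capable GPU is visible to PyTorch.
# The Trainer falls back to CPU automatically, but fine-tuning BERT on CPU is slow.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Training will run on: {device}")
```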
Step 2: Prepare the Dataset
For the purpose of this example, let's assume we are fine-tuning a model on a text classification task. We will use the datasets library to load a sample dataset; here, we'll use the imdb movie review dataset, which is a popular binary sentiment classification dataset.
Here's how you can load and preprocess the data:
```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset('imdb')

# Split the dataset into train and test sets
train_dataset = dataset['train']
test_dataset = dataset['test']
```
Step 3: Tokenize the Data
Next, we need to tokenize the text data using the tokenizer associated with the pre-trained model we plan to fine-tune. Here, we will use the bert-base-uncased model as an example:
```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the train and test sets
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=512)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Set format for PyTorch
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
```
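The full IMDB split has 25,000 training examples, so a complete fine-tuning run can take a while. If you just want to verify the pipeline end to end first, one option is to work with a small random subset; the sketch below is optional, and the subset sizes of 2,000 and 500 are arbitrary choices for illustration:

```python
# Optional: take a small random subset to iterate quickly.
# The sizes (2,000 train / 500 eval) are arbitrary and only for quick experiments.
small_train_dataset = train_dataset.shuffle(seed=42).select(range(2000))
small_eval_dataset = test_dataset.shuffle(seed=42).select(range(500))
```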
Step 4: Load the Pre-trained Model
Now that our data is tokenized, let's load the pre-trained model. We'll use AutoModelForSequenceClassification from transformers, which resolves to BertForSequenceClassification for this checkpoint:
```python
from transformers import AutoModelForSequenceClassification

# Load the model with a classification head for two labels (positive/negative)
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
```
Step 5: Define the Training Arguments
We'll now define the training arguments using the TrainingArguments class. These arguments control the behavior of the training loop.
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # Directory for checkpoints and outputs
    num_train_epochs=3,              # Number of training epochs
    per_device_train_batch_size=16,  # Batch size for training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    warmup_steps=500,                # Number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # Strength of weight decay
    logging_dir='./logs',            # Directory for storing logs
    logging_steps=10,
)
```
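By default the Trainer only computes metrics when you call trainer.evaluate() explicitly (Step 8). If you also want metrics reported during training, you can request evaluation at the end of every epoch. Note that the argument name differs between transformers releases (older versions use evaluation_strategy, newer ones use eval_strategy), so treat this as a version-dependent sketch:

```python
# Optional: evaluate at the end of every epoch during training.
# Assumption: on recent transformers releases the argument is named eval_strategy;
# on older ones it is evaluation_strategy. Use whichever your version accepts.
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
)
```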
Step 6: Create the Trainer
Using the Trainer API makes it convenient to handle the training process. We provide the model, arguments, datasets, and evaluation metrics to the Trainer.
```python
from transformers import Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(p):
    preds = p.predictions.argmax(-1)
    labels = p.label_ids
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

trainer = Trainer(
    model=model,                      # The model to be trained
    args=training_args,               # Training arguments
    train_dataset=train_dataset,      # Training dataset
    eval_dataset=test_dataset,        # Evaluation dataset
    compute_metrics=compute_metrics   # Function to compute metrics
)
```
Step 7: Train the Model
Finally, you can start the training process:
```python
trainer.train()
```
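If a run is interrupted, the Trainer will have saved checkpoints to the output_dir defined above ('./results'), and you can pick up from the most recent one. A minimal sketch:

```python
# Optional: resume training from the latest checkpoint saved in output_dir.
trainer.train(resume_from_checkpoint=True)
```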
Step 8: Evaluate the Model
After training, you can evaluate the model on the test set:
```python
evaluation_results = trainer.evaluate()
print(evaluation_results)
```
Step 9: Save the Model
Once the model is fine-tuned, save it for later use:
```python
trainer.save_model('./fine-tuned-bert')
tokenizer.save_pretrained('./fine-tuned-bert')
```
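To sanity-check the saved artifacts, you can reload them later and run a quick prediction. The snippet below is a minimal sketch using the text-classification pipeline; the example sentence is arbitrary:

```python
from transformers import pipeline

# Reload the fine-tuned model and tokenizer from disk and run a quick prediction.
classifier = pipeline('text-classification', model='./fine-tuned-bert', tokenizer='./fine-tuned-bert')
print(classifier("This movie was absolutely wonderful!"))
```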
Summary
This step-by-step guide provided an overview of fine-tuning a pre-trained large language model using the transformers library. You learned how to set up the environment, prepare and tokenize your dataset, define training arguments, and train and evaluate the model using the Trainer API.