Movie Genre Predictions with Hugging Face Transformers
I just finished Part 1 of the Hugging Face NLP course and wanted to put my new skills to the test. I stumbled upon a Hugging Face competition called Movie Genre Prediction, where the challenge is to guess a movie's genre from its synopsis and title. In this blog post, I’ll share my journey and findings. Let’s dive in!
Let's install the following packages. Uncomment the lines below if they are not already installed.
# !pip install datasets
# !pip install transformers -U
# !pip install huggingface_hub
# !pip install rich
# !pip install accelerate -U
# !pip install evaluate
Below are the steps to create the Hugging Face credentials token that notebook_login will ask for:
- Create a Hugging Face account (if you don’t have one): If you don’t already have an account on the Hugging Face website, you’ll need to create one. Visit the Hugging Face website (https://huggingface.co/) and sign up for an account.
- Log in to your Hugging Face account: Use your credentials to log in to your Hugging Face account.
- Generate an API token: Hugging Face provides API tokens for authentication. To generate an API token, go to your account settings on the Hugging Face website. You can usually find this in your account dashboard or profile settings.
- Generate the token: Once you’re in your account settings, look for an option related to API tokens or credentials. You should find an option to generate a new token. Click on it, and the system will generate a unique API token for you.
- Copy the API token: Once the token is generated, you’ll typically see it displayed on the screen. It might be a long string of characters. Copy this token to your clipboard.
- Store the token securely: API tokens are sensitive credentials, so it’s essential to store them securely. You should never share your API token publicly or expose it in your code repositories.
Now, you have your Hugging Face API token, which you can use for authentication when making requests to the Hugging Face API or accessing resources on the Hugging Face Model Hub.
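One common way to follow the "store the token securely" advice is to keep the token in an environment variable instead of pasting it into the notebook. The sketch below assumes the variable is named HF_TOKEN; the helper names read_hf_token and mask_token are made up for this example, and the code only reads the value without contacting the Hub.

```python
import os

def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the environment, or None if it is not set."""
    token = os.environ.get(env_var)
    # Treat an empty or whitespace-only value as "not set"
    if token is None or not token.strip():
        return None
    return token.strip()

def mask_token(token, visible=4):
    """Hide all but the first few characters so the token is safe to print."""
    return token[:visible] + "*" * max(len(token) - visible, 0)
```

Printing mask_token(...) instead of the raw value keeps the secret out of notebook outputs that may end up in a repository.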
from huggingface_hub import notebook_login
notebook_login()
Let's import the following packages
from transformers import TrainingArguments, Trainer, AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset, Dataset
from collections import Counter
import evaluate
import numpy as np
from rich import print
import pandas as pd
Datasets
We will be using the datadrivenscience/movie-genre-prediction competition dataset for model training. You can read more about the competition here and the dataset here.
dataset = load_dataset("datadrivenscience/movie-genre-prediction"); dataset
DatasetDict({
train: Dataset({
features: ['id', 'movie_name', 'synopsis', 'genre'],
num_rows: 54000
})
test: Dataset({
features: ['id', 'movie_name', 'synopsis', 'genre'],
num_rows: 36000
})
})
The dataset has train and test splits with the following features:
- id
- movie_name
- synopsis
- genre
print(dataset['train'][:3])
{ 'id': [44978, 50185, 34131], 'movie_name': ['Super Me', 'Entity Project', 'Behavioral Family Therapy for Serious Psychiatric Disorders'], 'synopsis': [ 'A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. Selling them makes him rich.', 'A director and her friends renting a haunted house to capture paranormal events in order to prove it and become popular.', 'This is an educational video for families and family therapists that describes the Behavioral Family Therapy approach to dealing with serious psychiatric illnesses.' ], 'genre': ['fantasy', 'horror', 'family'] }
Above we have sliced and printed 3 rows of training dataset
labels = set(dataset['train']['genre'])
num_labels = len(labels)
labels
{'action',
'adventure',
'crime',
'family',
'fantasy',
'horror',
'mystery',
'romance',
'scifi',
'thriller'}
There are 10 genres: action, adventure, crime, family, fantasy, horror, mystery, romance, scifi and thriller.
labels_count = Counter(dataset['train']['genre']); print(labels_count)
Counter({ 'fantasy': 5400, 'horror': 5400, 'family': 5400, 'scifi': 5400, 'action': 5400, 'crime': 5400, 'adventure': 5400, 'mystery': 5400, 'romance': 5400, 'thriller': 5400 })
The labels are evenly sampled: every genre has exactly 5400 rows. That's good, because a balanced dataset makes accuracy a meaningful metric.
dataset = dataset.rename_column('genre', 'labels')
dataset = dataset.class_encode_column("labels")
In the above steps we rename the genre column to labels to mark it as the target variable. Then we convert labels to the ClassLabel type.
dataset['train'].features['labels']
ClassLabel(names=['action', 'adventure', 'crime', 'family', 'fantasy', 'horror', 'mystery', 'romance', 'scifi', 'thriller'], id=None)
Converting labels to the ClassLabel type in the Hugging Face library helps with:
- Consistency: It makes your labels work smoothly with the library’s tools and models.
- Number Conversion: It turns text labels into numbers, which some models need.
- Easy Mapping: It simplifies translating between text labels and numbers.
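To make the number conversion and mapping concrete, here is a plain-Python stand-in for what ClassLabel keeps track of. It is only an illustration, not the library's implementation; the dictionary names str2int and int2str simply mirror the ClassLabel methods of the same name.

```python
# Genre names in the order shown by the ClassLabel above (alphabetical)
names = ['action', 'adventure', 'crime', 'family', 'fantasy',
         'horror', 'mystery', 'romance', 'scifi', 'thriller']

# Text label -> integer id, and back again
str2int = {name: idx for idx, name in enumerate(names)}
int2str = {idx: name for name, idx in str2int.items()}

encoded = [str2int[g] for g in ['fantasy', 'horror', 'family']]
decoded = [int2str[i] for i in encoded]
print(encoded)  # [4, 5, 3]
print(decoded)  # ['fantasy', 'horror', 'family']
```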
Remove Duplicates
I could not find a better way to remove duplicates than converting to pandas, dropping the duplicates and converting back to a Hugging Face dataset.
train_df = pd.DataFrame(dataset['train']) # converting to pandas
train_df = train_df.drop_duplicates(['movie_name', 'synopsis']) # Removes duplicates based on `movie_name` and `synopsis` attributes
ds = Dataset.from_pandas(train_df) # converting back to dataset
ds.features['labels']
Value(dtype='int64', id=None)
When a dataset is created from pandas, the label column comes back with the type Value, so we need to convert it to ClassLabel again.
ds = ds.class_encode_column("labels"); ds
Dataset({
features: ['id', 'movie_name', 'synopsis', 'labels', '__index_level_0__'],
num_rows: 46344
})
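For readers who want to see what the de-duplication does row by row, the sketch below reproduces the "keep the first occurrence" behaviour in plain Python. The function name drop_duplicate_rows and the sample rows are invented for illustration; this is not how pandas implements it.

```python
def drop_duplicate_rows(rows, keys=("movie_name", "synopsis")):
    """Keep the first row for each distinct combination of `keys`."""
    seen = set()
    unique_rows = []
    for row in rows:
        fingerprint = tuple(row[k] for k in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique_rows.append(row)
    return unique_rows

# Made-up rows for illustration
rows = [
    {"id": 1, "movie_name": "Super Me", "synopsis": "A young scriptwriter...", "labels": 4},
    {"id": 2, "movie_name": "Super Me", "synopsis": "A young scriptwriter...", "labels": 0},
    {"id": 3, "movie_name": "Entity Project", "synopsis": "A director...", "labels": 5},
]
unique_rows = drop_duplicate_rows(rows)
print([r["id"] for r in unique_rows])  # [1, 3]
```

Note that the row with id 2 is dropped even though its label differs, because labels is not part of the comparison. The pandas call above behaves the same way.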
Tokenization
checkpoint = "bert-base-uncased"
A checkpoint is a saved model state, including its architecture and trained weights, which can be used for various NLP tasks and fine-tuning.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer('Movie Genre Predictions with Hugging Face Transformers')
{'input_ids': [101, 3185, 6907, 20932, 2007, 17662, 2227, 19081, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
Above we load the tokenizer and run it on a sentence. The tokenizer must come from the same checkpoint as the pretrained model, so that the text is split and numbered exactly as it was during pretraining. Otherwise the token IDs would not line up with what the model has learned.
What is attention_mask?
> The attention mask tells the model which positions hold real tokens and which should be ignored. It’s a list of 1s and 0s, where 1 means “pay attention” and 0 means “ignore.” Zeros mostly mark padding that is added to make the sequences in a batch the same length. Our sentence above has no padding, so its mask is all 1s: [1, 1, 1, 1, 1, 1, 1, 1, 1].
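The zeros in an attention mask typically appear when shorter texts are padded to match the longest text in a batch. Here is a minimal sketch of that idea in plain Python. The helper pad_batch is made up for this example, and the pad id of 0 is an assumption that matches the [PAD] token of bert-base-uncased.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 3185, 6907, 102], [101, 3185, 102]])
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```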
What is token_type_ids?
> If you pass two sentences together, the model needs to know which sentence each token belongs to. Token type IDs do that. For a single sentence they are all 0s. For a pair, the tokens of the first sentence get 0s and the tokens of the second sentence get 1s.
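A small sketch of how the two-sentence case is laid out for BERT-style models: [CLS] sentence A [SEP] sentence B [SEP]. The helper build_pair_inputs is invented for illustration; 101 and 102 are the ids of [CLS] and [SEP] in bert-base-uncased, and the other ids are borrowed from the tokenizer output above.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Join two tokenized sentences as [CLS] A [SEP] B [SEP] with segment ids."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}

pair = build_pair_inputs([3185, 6907], [17662, 2227])
print(pair["token_type_ids"])  # [0, 0, 0, 0, 1, 1, 1]
```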
Let’s break down the process of creating input_ids into the following steps:
1. Tokenize:
Imagine you have a sentence, “Hugging Face is awesome!” To help a computer understand it, you first split it into smaller parts, like words: [“Hugging”, “Face”, “is”, “awesome”, “!”]. These smaller parts are called tokens.
We can tokenize the synopsis of the first row of training set
dataset['train'][0]['synopsis']
'A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. Selling them makes him rich.'
tokens = tokenizer.tokenize(dataset['train'][0]['synopsis']); tokens
['a',
'young',
'script',
'##writer',
'starts',
'bringing',
'valuable',
'objects',
'back',
'from',
'his',
'short',
'nightmares',
'of',
'being',
'chased',
'by',
'a',
'demon',
'.',
'selling',
'them',
'makes',
'him',
'rich',
'.']
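Notice that "scriptwriter" came out as 'script' and '##writer'. BERT's tokenizer splits words it does not know as a whole into sub-word pieces, trying the longest matching piece first. The sketch below shows that greedy matching with a tiny made-up vocabulary; the function name wordpiece and toy_vocab are assumptions for illustration, and the real tokenizer also handles lowercasing, punctuation and a much larger vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"script", "##writer", "young", "starts"}
print(wordpiece("scriptwriter", toy_vocab))  # ['script', '##writer']
print(wordpiece("young", toy_vocab))         # ['young']
```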
2. Conversion to IDs:
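Computers prefer numbers, so we need to convert these tokens into numbers. Each token in the tokenizer's vocabulary has its own ID. For example, “Hugging” might be ID 101, “Face” might be ID 102, and so on, so the sentence becomes a list of IDs: [101, 102, 103, 104, 105]. These numbers are only illustrative; the real IDs come from the tokenizer's vocabulary, as shown below.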
ids = tokenizer.convert_tokens_to_ids(tokens); ids
[1037,
2402,
5896,
15994,
4627,
5026,
7070,
5200,
2067,
2013,
2010,
2460,
15446,
1997,
2108,
13303,
2011,
1037,
5698,
1012,
4855,
2068,
3084,
2032,
4138,
1012]
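The list above has no 101 or 102 in it, while the earlier tokenizer(...) output started with 101 and ended with 102. Those two are the [CLS] and [SEP] special tokens, which are added when the tokenizer is called directly. A minimal sketch, using a tiny vocabulary copied from the outputs above (the encode helper is made up for this example):

```python
# Token -> id pairs copied from the outputs above; 101 and 102 are the ids of
# [CLS] and [SEP] in bert-base-uncased
vocab = {"[CLS]": 101, "[SEP]": 102, "a": 1037, "young": 2402,
         "script": 5896, "##writer": 15994}

def encode(tokens, vocab):
    """Look up each token and wrap the sequence in [CLS] ... [SEP]."""
    body = [vocab[token] for token in tokens]
    return [vocab["[CLS]"]] + body + [vocab["[SEP]"]]

print(encode(["a", "young", "script", "##writer"], vocab))
# [101, 1037, 2402, 5896, 15994, 102]
```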
In summary, Hugging Face tokenization takes your text, breaks it into tokens (smaller parts), gives each token a unique ID, creates an attention mask to say what’s important, and token type IDs to track different sentences if needed.
ds = ds.train_test_split(test_size=0.2, stratify_by_column="labels")
The above code splits the dataset (ds) into two parts: a training set and a testing set.
- test_size=0.2 means that 20% of the data will be used for testing and the remaining 80% for training.
- stratify_by_column="labels" means that both the training and testing sets keep the same distribution of labels (class proportions) as the original dataset. This is useful for maintaining balance in classification tasks.
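To show what stratification means in practice, here is a rough plain-Python sketch that splits each class separately so the proportions carry over. The function stratified_split and the toy labels are made up; the library's own implementation differs in its details.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with the same label mix in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class share for the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = ["horror"] * 10 + ["family"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 12 3
```

The test part ends up with 2 horror rows and 1 family row, the same 2:1 ratio as the full list.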
ds
DatasetDict({
train: Dataset({
features: ['id', 'movie_name', 'synopsis', 'labels', '__index_level_0__'],
num_rows: 37075
})
test: Dataset({
features: ['id', 'movie_name', 'synopsis', 'labels', '__index_level_0__'],
num_rows: 9269
})
})
def tokenize(sample):
    sample['text'] = list(map(lambda x: ': '.join(x), zip(sample['movie_name'], sample['synopsis'])))
    return tokenizer(sample['text'], truncation=True)

tokenized_ds = ds.map(tokenize, batched=True); tokenized_ds
DatasetDict({
train: Dataset({
features: ['id', 'movie_name', 'synopsis', 'labels', '__index_level_0__', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 37075
})
test: Dataset({
features: ['id', 'movie_name', 'synopsis', 'labels', '__index_level_0__', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 9269
})
})
The above code defines a tokenize function that combines the text of two columns, movie_name and synopsis, and then tokenizes the combined text.
- sample['text'] is created by joining movie_name and synopsis with a colon and a space.
- The tokenizer is applied to the text column with truncation enabled, so inputs longer than the model's maximum length are cut off.
After defining the function, we apply it to the dataset (ds) using .map() with batched=True to tokenize the text in batches, and store the result in tokenized_ds.
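With batched=True, each column arrives in the function as a list, which is why the join uses zip. Here is the joining step on its own, on a tiny hand-made batch (the synopses are shortened for the example):

```python
batch = {
    "movie_name": ["Super Me", "Entity Project"],
    "synopsis": ["A young scriptwriter starts bringing valuable objects back.",
                 "A director and her friends rent a haunted house."],
}

# Same joining logic as the tokenize function, written as a comprehension
texts = [f"{name}: {synopsis}"
         for name, synopsis in zip(batch["movie_name"], batch["synopsis"])]
print(texts[0])  # Super Me: A young scriptwriter starts bringing valuable objects back.
```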
Training
training_args = TrainingArguments('movie-genre-predictions',
                                  evaluation_strategy='epoch',  # newer transformers releases call this eval_strategy
                                  per_device_train_batch_size=32,
                                  per_device_eval_batch_size=64,
                                  save_strategy='epoch',
                                  push_to_hub=True,
                                  learning_rate=1e-5)
The above code sets up the configuration for training a Hugging Face model for the movie genre prediction task. Let’s break it down step by step:
- TrainingArguments: An object that holds the settings and options for training a model.
- 'movie-genre-predictions': The output directory where checkpoints are written. With push_to_hub=True it is also used as the name of the repository on the Hub, so it works like a name for the training run.
- evaluation_strategy = 'epoch': How often the model’s performance is evaluated. Here it is set to epoch, which means after every complete pass through the training data.
- per_device_train_batch_size = 32: How many examples are processed at once on each device (GPU or CPU) during training. It’s set to 32, so 32 examples go through the model together.
- per_device_eval_batch_size = 64: The same, but for evaluation. It’s set to 64, so 64 examples are evaluated at once.
- save_strategy = 'epoch': When the model’s checkpoints (saves of the model’s progress) are written. Again it’s set to epoch, meaning after each pass through the training data.
- push_to_hub = True: Uploads the model checkpoints to the Hugging Face Model Hub, a place to store and share models.
- learning_rate: The step size the model uses when it learns from data. Imagine you’re trying to find the lowest point in a hilly area by taking steps downhill. The learning rate determines how big or small those steps are.
In simple terms, this code is configuring how a machine learning model should be trained for movie genre prediction. It sets up details like when to check how well the model is doing, how much data to process at a time, and where to save the model’s progress. It also says that the model checkpoints should be uploaded to the Hugging Face Model Hub.
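The downhill analogy for the learning rate can be tried out on a toy problem. The sketch below minimises f(x) = (x - 3)^2 with plain gradient descent. It has nothing to do with BERT itself and the learning rates are picked only to show the effect.

```python
def gradient_descent(learning_rate, steps=50, start=0.0):
    """Minimise f(x) = (x - 3)^2 by repeatedly stepping against the gradient."""
    x = start
    for _ in range(steps):
        gradient = 2 * (x - 3)
        x -= learning_rate * gradient
    return x

for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With the smallest rate the result is still far from the minimum at 3 after 50 steps, with 0.1 it lands very close to 3, and with 1.1 the steps overshoot and the value runs away. Fine-tuning a pretrained model usually calls for a small rate such as the 1e-5 used here.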
You can find more details about TrainingArguments here.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
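Above we load the model for sequence classification with 10 labels. The warning is expected: the classification head (classifier.weight and classifier.bias) is not part of the pretrained checkpoint, so it starts from random values and is learned during fine-tuning.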
clf_metrics = evaluate.load("accuracy")
The evaluate library provides the metrics used to score the model on the validation set. Above I have chosen accuracy as the metric (the competition uses accuracy too).
def compute_metrics(batch):
    logits, labels = batch
    predictions = np.argmax(logits, axis=-1) # index of the highest logit for each example, i.e. the predicted class
    return clf_metrics.compute(predictions=predictions, references=labels)
I have defined compute_metrics to compute the metrics on the validation set after each epoch.
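To see what happens inside compute_metrics, here is a plain-Python stand-in for the argmax and accuracy steps. The function accuracy_from_logits and the numbers are invented for illustration; the real code uses np.argmax and the evaluate metric.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the true labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Three made-up examples with three classes each
logits = [[0.1, 2.0, -1.0],
          [1.5, 0.2, 0.3],
          [0.0, 0.1, 0.9]]
labels = [1, 0, 1]
print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.6666666666666666}
```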
trainer = Trainer(model,
                  args=training_args,
                  train_dataset=tokenized_ds['train'],
                  eval_dataset=tokenized_ds['test'],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
The Trainer class in Hugging Face simplifies fine-tuning pre-trained NLP models for specific tasks. It handles batching (including padding, since we passed the tokenizer), training, evaluation, and model saving, so we don't have to write the training loop ourselves.
trainer.train() # the model is training
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | 1.664900 | 1.585141 | 0.446542 |
2 | 1.522100 | 1.557735 | 0.456576 |
3 | 1.422900 | 1.559032 | 0.458733 |
TrainOutput(global_step=3477, training_loss=1.5756116926858643, metrics={'train_runtime': 568.5141, 'train_samples_per_second': 195.642, 'train_steps_per_second': 6.116, 'total_flos': 3706787213357172.0, 'train_loss': 1.5756116926858643, 'epoch': 3.0})
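The numbers in the TrainOutput can be checked by hand. The small calculation below assumes a single device and no gradient accumulation, which is consistent with the reported step count.

```python
import math

train_rows = 37075      # size of the training split
batch_size = 32         # per_device_train_batch_size
epochs = 3              # matches 'epoch': 3.0 in the TrainOutput
runtime_s = 568.5141    # 'train_runtime' from the TrainOutput

steps_per_epoch = math.ceil(train_rows / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / runtime_s
print(steps_per_epoch, total_steps)  # 1159 3477
```

The total matches global_step=3477, and samples_per_second comes out at roughly 195.6, in line with train_samples_per_second in the log.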
trainer.push_to_hub()
'https://huggingface.co/anubhavmaity/movie-genre-predictions/tree/main/'
The above code pushes the model to the Model Hub and creates the model card. The model card is here.
Submitting to the competition
tokenized_test_ds = dataset["test"].map(tokenize, batched=True)
We apply the same tokenization method to the test dataset as we did for the training dataset above.
test_logits = trainer.predict(tokenized_test_ds)
test_logits.predictions.shape
(36000, 10)
Calling predict on tokenized_test_ds returns the logits: one row per test example and one column per genre.
test_predictions = np.argmax(test_logits.predictions, axis=-1)
We get the index with the highest value along the last dimension
predicted_genre = dataset["train"].features["labels"].int2str(test_predictions)
Here we convert the index to corresponding genre
df = pd.DataFrame({'id': tokenized_test_ds['id'],
                   'genre': predicted_genre}) # creating dataframe with `id` and `genre` as columns for submission
df.to_csv('submission.csv')
Submitting submission.csv got me the following score:
- Public Score: 0.4176
- Private Score: 0.4162
This puts me at around 27th place. The 1st-place entry has an accuracy of 0.4456 and 0.4412 on the public and private leaderboards respectively.
We can improve the score with the following strategies:
1. Training with a bigger architecture
2. Training for more epochs
3. Ensembling with other models
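As a rough idea of what ensembling could look like, one simple option is to average the logits of several models and take the argmax of the average. The sketch below uses made-up logits from two hypothetical models; I have not tried this on the competition data, and averaging probabilities instead of logits is another common choice.

```python
def ensemble_predict(logits_per_model):
    """Average the logits of several models, then take the argmax per example."""
    n_models = len(logits_per_model)
    predictions = []
    for rows in zip(*logits_per_model):  # rows: one logit list per model
        averaged = [sum(values) / n_models for values in zip(*rows)]
        predictions.append(max(range(len(averaged)), key=averaged.__getitem__))
    return predictions

# Made-up logits from two hypothetical models, two examples, three classes
model_a = [[2.0, 0.5, 0.1], [0.2, 0.3, 0.4]]
model_b = [[1.0, 1.2, 0.0], [0.1, 1.5, 0.2]]
print(ensemble_predict([model_a, model_b]))  # [0, 1]
```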