Movie Genre Predictions with Hugging Face Transformers

My attempt at the Movie Genre Prediction competition
Author

Anubhav Maity

Published

September 6, 2023

I just finished Part 1 of the Hugging Face NLP course and wanted to put my new skills to the test. I stumbled upon a Hugging Face competition called Movie Genre Prediction, where the challenge is to guess a movie’s genre from its title and synopsis. In this blog post, I’ll share my journey and findings. Let’s dive in!

Let’s install the required packages. Uncomment the lines below for any package you don’t already have installed.

# !pip install datasets
# !pip install transformers -U
# !pip install huggingface_hub
# !pip install rich
# !pip install accelerate -U
# !pip install evaluate

Following are the steps to create a Hugging Face credentials token, which will be needed when using notebook_login below.

  1. Create a Hugging Face account (if you don’t have one): If you don’t already have an account on the Hugging Face website, you’ll need to create one. Visit the Hugging Face website (https://huggingface.co/) and sign up for an account.
  2. Log in to your Hugging Face account: Use your credentials to log in to your Hugging Face account.
  3. Navigate to the API token settings: Hugging Face provides API tokens for authentication. Open your account settings on the Hugging Face website; the tokens page is usually found in your account dashboard or profile settings.
  4. Generate the token: In your settings, look for the option related to API tokens or credentials and click it to generate a new token. The system will create a unique API token for you.
  5. Copy the API token: Once the token is generated, you’ll typically see it displayed on the screen. It might be a long string of characters. Copy this token to your clipboard.
  6. Store the token securely: API tokens are sensitive credentials, so it’s essential to store them securely. You should never share your API token publicly or expose it in your code repositories.

Now, you have your Hugging Face API token, which you can use for authentication when making requests to the Hugging Face API or accessing resources on the Hugging Face Model Hub.

from huggingface_hub import notebook_login

notebook_login()

Let’s import the following packages:

from transformers import TrainingArguments, Trainer, AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset, Dataset
from collections import Counter
import evaluate
import numpy as np
from rich import print
import pandas as pd

Datasets

We will be using the datadrivenscience/movie-genre-prediction competition dataset for model training. You can read more about the competition here and the dataset here.

dataset = load_dataset("datadrivenscience/movie-genre-prediction"); dataset
DatasetDict({
    train: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'genre'],
        num_rows: 54000
    })
    test: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'genre'],
        num_rows: 36000
    })
})

The dataset has train and test splits with the following features:

  • id
  • movie_name
  • synopsis
  • genre

print(dataset['train'][:3])
{
    'id': [44978, 50185, 34131],
    'movie_name': ['Super Me', 'Entity Project', 'Behavioral Family Therapy for Serious Psychiatric Disorders'],
    'synopsis': [
        'A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a 
demon. Selling them makes him rich.',
        'A director and her friends renting a haunted house to capture paranormal events in order to prove it and 
become popular.',
        'This is an educational video for families and family therapists that describes the Behavioral Family 
Therapy approach to dealing with serious psychiatric illnesses.'
    ],
    'genre': ['fantasy', 'horror', 'family']
}

Above we have sliced and printed the first 3 rows of the training dataset.

labels = set(dataset['train']['genre'])
num_labels = len(labels)
labels
{'action',
 'adventure',
 'crime',
 'family',
 'fantasy',
 'horror',
 'mystery',
 'romance',
 'scifi',
 'thriller'}

There are 10 genres: action, adventure, crime, family, fantasy, horror, mystery, romance, scifi, and thriller.

labels_count = Counter(dataset['train']['genre']); print(labels_count)
Counter({
    'fantasy': 5400,
    'horror': 5400,
    'family': 5400,
    'scifi': 5400,
    'action': 5400,
    'crime': 5400,
    'adventure': 5400,
    'mystery': 5400,
    'romance': 5400,
    'thriller': 5400
})

Looks like the labels are evenly sampled; each genre has a count of 5400. That’s good.

dataset = dataset.rename_column('genre', 'labels')
dataset = dataset.class_encode_column("labels")

In the above steps we rename the genre column to labels to mark the genre as the target variable.

Then we convert the labels column to the ClassLabel type.

dataset['train'].features['labels']
ClassLabel(names=['action', 'adventure', 'crime', 'family', 'fantasy', 'horror', 'mystery', 'romance', 'scifi', 'thriller'], id=None)

Converting labels to the ClassLabel type in the Hugging Face library helps with the following (a short sketch follows this list):

  • Consistency: It makes your labels work smoothly with the library’s tools and models.
  • Number Conversion: It turns text labels into numbers, which some models need.
  • Easy Mapping: It simplifies translating between text labels and numbers.
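
As a quick illustration of that mapping, here is a minimal sketch using the ClassLabel feature we just created:

label_feature = dataset['train'].features['labels']
label_feature.str2int('horror')   # genre name -> integer id
label_feature.int2str(0)          # integer id -> genre name ('action', since the names are sorted alphabetically)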

Remove Duplicates

I could not find a better way to remove duplicates than converting to pandas, dropping the duplicates, and converting back to a Hugging Face dataset.

train_df = pd.DataFrame(dataset['train']) # converting to pandas
train_df = train_df.drop_duplicates(['movie_name', 'synopsis']) # Removes duplicates based on `movie_name` and `synopsis` attributes
ds = Dataset.from_pandas(train_df) # converting back to dataset
ds.features['labels']
Value(dtype='int64', id=None)

When a Dataset is created from pandas, the labels column comes back with the generic Value type, so we need to convert it to ClassLabel again.

ds = ds.class_encode_column("labels"); ds
Dataset({
    features: ['id', 'movie_name', 'synopsis', 'labels', '__index_level_0__'],
    num_rows: 46344
})

Tokenization

checkpoint = "bert-base-uncased"

A checkpoint is a saved model state, including its architecture and trained weights, which can be used for various NLP tasks and fine-tuning.

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer('Movie Genre Predictions with Hugging Face Transformers')
{'input_ids': [101, 3185, 6907, 20932, 2007, 17662, 2227, 19081, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Above we load the tokenizer and use it on a sentence. Loading the tokenizer checkpoint associated with a pretrained language model is necessary to maintain consistency in the tokenization process. This ensures that your input text is processed in a way that aligns with the model’s pre-existing knowledge and allows you to use the pretrained model effectively.

What is attention_mask? > Sometimes, we want to tell the computer which parts of the sentence are important and which are not. The attention mask is like a spotlight. It’s a list of 1s and 0s, where 1 means “pay attention” and 0 means “ignore.” For our sentence, it could be [1, 1, 1, 1, 1] because we want the computer to pay attention to all tokens.
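
The mask becomes more useful with padding. Here is a minimal sketch using the tokenizer loaded above: when a batch of two sentences of different lengths is padded to the same length, the padded positions of the shorter sentence get 0s.

batch = tokenizer(["Super Me", "A young scriptwriter starts bringing valuable objects back."], padding=True)
batch['attention_mask']  # the shorter sentence ends in 0s for the padding tokens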

What is token_type_ids? > If you have multiple sentences, you’d want the computer to know which sentence each token belongs to. Token Type IDs help with that. For one sentence, it’s all 0s. If you had two sentences, the first sentence would have 0s, and the second sentence would have 1s.
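
A minimal sketch of the two-sentence case, passing a sentence pair to the same tokenizer:

pair = tokenizer("Super Me", "A young scriptwriter starts bringing valuable objects back.")
pair['token_type_ids']  # 0s for the tokens of the first sentence, 1s for the second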

Let’s break down the process of creating input_ids into the following steps:

1. Tokenize:

Imagine you have a sentence, “Hugging Face is awesome!” To help a computer understand it, you first split it into smaller parts, like words: [“Hugging”, “Face”, “is”, “awesome”, “!”]. These smaller parts are called tokens.

We can tokenize the synopsis of the first row of the training set:

dataset['train'][0]['synopsis']
'A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. Selling them makes him rich.'
tokens = tokenizer.tokenize(dataset['train'][0]['synopsis']); tokens
['a',
 'young',
 'script',
 '##writer',
 'starts',
 'bringing',
 'valuable',
 'objects',
 'back',
 'from',
 'his',
 'short',
 'nightmares',
 'of',
 'being',
 'chased',
 'by',
 'a',
 'demon',
 '.',
 'selling',
 'them',
 'makes',
 'him',
 'rich',
 '.']

2. Conversion to IDs:

Computers prefer numbers, so we need to convert these tokens into unique numbers. Each token gets a special ID. For example, “Hugging” might be ID 101, “Face” might be ID 102, and so on. The sentence becomes a list of IDs: [101, 102, 103, 104, 105].

ids = tokenizer.convert_tokens_to_ids(tokens); ids
[1037,
 2402,
 5896,
 15994,
 4627,
 5026,
 7070,
 5200,
 2067,
 2013,
 2010,
 2460,
 15446,
 1997,
 2108,
 13303,
 2011,
 1037,
 5698,
 1012,
 4855,
 2068,
 3084,
 2032,
 4138,
 1012]

In summary, Hugging Face tokenization takes your text, breaks it into tokens (smaller parts), gives each token a unique ID, creates an attention mask to say what’s important, and token type IDs to track different sentences if needed.
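
The two manual steps above (tokenize, then convert_tokens_to_ids), plus adding the special [CLS] and [SEP] tokens, are exactly what calling the tokenizer directly does. A small sketch:

encoded = tokenizer(dataset['train'][0]['synopsis'])
encoded['input_ids'][0], encoded['input_ids'][-1]  # 101 and 102, the ids of the special [CLS] and [SEP] tokens
tokenizer.decode(encoded['input_ids'])             # decodes back to the (lower-cased) text wrapped in [CLS] ... [SEP]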

ds = ds.train_test_split(test_size=0.2, stratify_by_column="labels")

The above code splits a dataset (ds) into two parts: a training set and a testing set.

  • The test_size=0.2 means that 20% of the data will be used for testing, and the remaining 80% for training.
  • stratify_by_column="labels" means that the split will ensure both the training and testing sets have a similar distribution of labels (class proportions) as in the original dataset. This is useful for maintaining balance in classification tasks.
ds
DatasetDict({
    train: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'labels', '__index_level_0__'],
        num_rows: 37075
    })
    test: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'labels', '__index_level_0__'],
        num_rows: 9269
    })
})
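
As a quick sanity check (a small sketch, not part of the original run), we can confirm that stratification kept the genre proportions similar in both splits:

Counter(ds['train']['labels'])  # label ids, roughly 80% of each genre
Counter(ds['test']['labels'])   # label ids, roughly 20% of each genre
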
def tokenize(sample):
    # Combine the movie name and synopsis into a single `text` field: "<movie_name>: <synopsis>"
    sample['text'] = list(map(lambda x: ': '.join(x), zip(sample['movie_name'], sample['synopsis'])))
    return tokenizer(sample['text'], truncation=True)
tokenized_ds = ds.map(tokenize, batched=True); tokenized_ds
DatasetDict({
    train: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'labels', '__index_level_0__', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 37075
    })
    test: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'labels', '__index_level_0__', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9269
    })
})

The above code defines a tokenize function that combines text from two columns, movie_name and synopsis, and then tokenizes the combined text using a tokenizer.

  • sample['text'] is created by joining movie_name and synopsis with a colon and a space.
  • The tokenizer is applied to the text column with truncation enabled.

After defining the function, it’s applied to a dataset (ds) using .map() to tokenize the text in batches, and the result is stored in tokenized_ds.
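
To see what the combined field looks like, here is a quick sketch printing the text column for the first training row; the format is '<movie_name>: <synopsis>'.

print(tokenized_ds['train'][0]['text'])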

Training

training_args = TrainingArguments('movie-genre-predictions', 
                                  evaluation_strategy = 'epoch',
                                  per_device_train_batch_size = 32,
                                  per_device_eval_batch_size = 64,
                                  save_strategy = 'epoch',
                                  push_to_hub = True, 
                                  learning_rate = 1e-5
                                 )

The above code sets up the configuration for training a Hugging Face model for the movie genre prediction task. Let’s break it down step by step:

  1. TrainingArguments: This is a special object or data structure that holds various settings and options for training a machine learning model.

  2. 'movie-genre-predictions': This is the output directory where the Trainer saves checkpoints. Because push_to_hub is enabled, it also serves as the default repository name on the Hugging Face Hub, so the run is easy to identify later.

  3. evaluation_strategy = 'epoch': This line specifies how often the model’s performance should be evaluated. In this case, it’s set to epoch, which means after every complete pass through the training data. An epoch is like a full round of training.

  4. per_device_train_batch_size = 32: This indicates how many examples or data points should be processed at once on each processing unit during training. It’s set to 32, so 32 data points will be processed together in parallel.

  5. per_device_eval_batch_size = 64: Similar to the previous line, but this one specifies the batch size for evaluation (measuring how well the model is doing). It’s set to 64, so 64 examples will be evaluated at once.

  6. save_strategy = 'epoch': This determines when the model’s checkpoints (saves of the model’s progress) should be saved. Again, it’s set to epoch, meaning after each training round.

  7. push_to_hub = True: If set to True, the Trainer uploads the model checkpoints to the Hugging Face Model Hub, a place to store and share models.

  8. learning_rate: This is like a step size or a pace setter for a machine learning model when it’s trying to learn from data. Imagine you’re trying to find the lowest point in a hilly area by taking small steps downhill. The learning rate determines how big or small those steps should be. Here it is set to 1e-5, a relatively small step size that works well when fine-tuning a pretrained model.

In simple terms, this code is configuring how a machine learning model should be trained for movie genre prediction. It sets up details like when to check how well the model is doing, how much data to process at a time, and where to save the model’s progress. It also says that the model checkpoints should be uploaded to the Hugging Face Model Hub.

You may see more details about TrainingArguments here

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = num_labels)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Above we load the model for sequence classification with 10 labels. The warning tells us that the classification head is newly initialized, which is exactly why we need to fine-tune the model on our task.

clf_metrics = evaluate.load("accuracy")

The evaluate library provides the metrics on which to evaluate the validation set. Above I have chosen accuracy as the metric (accuracy is also the metric used in the competition).
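
A minimal sketch of how the metric object works: it compares a list of predictions against a list of reference labels and returns the accuracy.

clf_metrics.compute(predictions=[0, 1, 1], references=[0, 1, 0])  # 2 of 3 correct -> {'accuracy': 0.666...}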

def compute_metrics(batch):
    logits, labels = batch
    predictions = np.argmax(logits, axis=-1) # finds the position (index) of the highest value in a NumPy array and returns that index as an integer.
    return clf_metrics.compute(predictions=predictions, references=labels)

I have defined compute_metrics to compute the metric on the validation set after each epoch.
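
To make the argmax step concrete, here is a tiny sketch with made-up logits for a single example:

fake_logits = np.array([[0.1, 2.3, -0.5, 0.0, 1.1, 0.2, -1.0, 0.4, 0.9, 0.3]])  # hypothetical scores for the 10 genres
np.argmax(fake_logits, axis=-1)  # array([1]): the index of the highest logit becomes the predicted class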

trainer = Trainer(model, 
                  args = training_args,
                  train_dataset = tokenized_ds['train'],
                  eval_dataset = tokenized_ds['test'], 
                  tokenizer = tokenizer,
                  compute_metrics = compute_metrics
                 )

The Trainer class in Hugging Face simplifies the process of fine-tuning pretrained NLP models for specific tasks. It handles data loading, training, evaluation, and model saving, making it easier to customize and use these models for various NLP tasks.

trainer.train() # the model is training
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[3477/3477 09:27, Epoch 3/3]
Epoch   Training Loss   Validation Loss   Accuracy
1       1.664900        1.585141          0.446542
2       1.522100        1.557735          0.456576
3       1.422900        1.559032          0.458733

TrainOutput(global_step=3477, training_loss=1.5756116926858643, metrics={'train_runtime': 568.5141, 'train_samples_per_second': 195.642, 'train_steps_per_second': 6.116, 'total_flos': 3706787213357172.0, 'train_loss': 1.5756116926858643, 'epoch': 3.0})
trainer.push_to_hub()
'https://huggingface.co/anubhavmaity/movie-genre-predictions/tree/main/'

The above code pushes the model to the Model Hub and creates the model card. The model card is here.
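
Once the model is on the Hub, it can be loaded back for inference. A sketch (the repo id comes from the URL printed above; note the predictions come back as LABEL_0 … LABEL_9 because we did not pass an id2label mapping to the model):

from transformers import pipeline

clf = pipeline("text-classification", model="anubhavmaity/movie-genre-predictions")
clf("Super Me: A young scriptwriter starts bringing valuable objects back from his nightmares.")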

Submitting to the competition

tokenized_test_ds = dataset["test"].map(tokenize, batched=True)

We apply the same tokenization method to the test dataset as we did for the training dataset above.

test_logits = trainer.predict(tokenized_test_ds)
test_logits.predictions.shape
(36000, 10)

Calling predict on tokenized_test_ds returns the logits: one row per test example and one column per genre.

test_predictions = np.argmax(test_logits.predictions, axis=-1)

We get the index with the highest value along the last dimension

predicted_genre = dataset["train"].features["labels"].int2str(test_predictions)

Here we convert each predicted index back to the corresponding genre name using int2str.

df = pd.DataFrame({'id':tokenized_test_ds['id'],
                  'genre': predicted_genre})   # creating dataframe with `id` and `genre` as columns for submission
df.to_csv('submission.csv')

Submitting submission.csv got me the following scores:

  • Public Score: 0.4176
  • Private Score: 0.4162

This put me at around 27th place. The first-place entry has an accuracy of 0.4456 on the public leaderboard and 0.4412 on the private leaderboard.

We can improve the score with the following strategies (a rough sketch of the first two follows this list):

  1. Training with a bigger architecture
  2. Training for more epochs
  3. Ensembling with other models
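
For illustration, here is a minimal sketch of strategies 1 and 2; the checkpoint roberta-large and the settings below are hypothetical choices, not something I ran for this post.

checkpoint = "roberta-large"  # hypothetical larger model; any sequence-classification checkpoint works
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)

training_args = TrainingArguments('movie-genre-predictions-large',   # hypothetical output directory
                                  evaluation_strategy='epoch',
                                  per_device_train_batch_size=16,    # larger models usually need a smaller batch size
                                  per_device_eval_batch_size=32,
                                  save_strategy='epoch',
                                  num_train_epochs=5,                # train for more epochs (the default is 3)
                                  learning_rate=1e-5)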