Few-shot learning in NLP: many-classes classification from few examples


Posted on Sun 19 August 2018

If you're doing machine learning and meet a classification problem with many categories and only a few examples per category, it is usually thought that you're in trouble 😨. Acquiring new data to solve this issue is not always easy or even doable. Luckily, we'll see that Siamese Neural Networks offer an efficient way to deal with this situation 🕺.

This problem of learning with only a few examples per category is called "few-shot learning", and "one-shot learning" in the extreme case of only one example per class (yes, you can even do this and obtain decent results!).

Most of the machine learning research on one-shot learning involves images, but some recent research papers address the same problem in the Natural Language Processing (NLP) realm.

In this blog post, I will use a siamese neural network to tackle a few-shot learning problem, following a method that was originally applied to images and that is nicely explained here.

Job title classification provides a good example of a few-shot learning problem in NLP. Imagine you want to group job titles in different categories or "occupations" (e.g. gather "Programmer" and "Software engineer" under the same occupation, and "Sales manager" and "Account executive" under another one). Unless you have hundreds of job title examples per occupation, you are facing a few-shot learning problem. The U.S. government provides such a job title/occupation taxonomy: the Standard Occupational Classification. I'll use it as a toy dataset to understand how few-shot learning with siamese neural networks works.

Let's start by downloading the taxonomy and checking what's in there.

In [1]:
from io import StringIO
import requests
import pandas as pd

# Set random seeds for reproducibility (the value 42 is an arbitrary choice)
from random import seed
import numpy as np

seed(42)
np.random.seed(42)

# Import a home-made dictionary that maps SOC codes to "occupations" names
# The corresponding file can be downloaded here: https://gist.github.com/nkthiebaut/c2895b8bb77bdf3253fb581622dca51b
from soc import SOC_MINOR_GROUPS

# Download the Standard Occupation Classification with job titles examples
file_url = 'https://www.onetcenter.org/dl_files/database/db_20_1_text/Sample%20of%20Reported%20Titles.txt'
csv = StringIO(requests.get(file_url).text)

# Load it in a pandas DataFrame and drop a useless column
df = pd.read_csv(csv, sep='\t').drop('Shown in My Next Move', axis=1)

# Get the occupation name from the code and remove the original code column
df['SOC minor group'] = df['O*NET-SOC Code'].apply(lambda x: SOC_MINOR_GROUPS[x[:4]])
df.drop('O*NET-SOC Code', axis=1, inplace=True)

# Lower all job titles for simplicity
df['Reported Job Title'] = df['Reported Job Title'].str.lower()

# Display a few examples
df.iloc[[1, 2, 3, 100, 101, 102, 301, 302, 303]]
     Reported Job Title                    SOC minor group
1    chief executive officer (ceo)         Top Executives
2    chief financial officer (cfo)         Top Executives
3    chief nursing officer                 Top Executives
100  banking center manager (bcm)          Operations Specialties Managers
101  banking officer                       Operations Specialties Managers
102  branch manager                        Operations Specialties Managers
301  project engineering manager           Other Management Occupations
302  project manager                       Other Management Occupations
303  analytical research program manager   Other Management Occupations

The downloaded file contains job category codes (mapped here to "SOC minor group" names) and samples of job titles that belong to those categories. The category descriptions are available on the Standard Occupational Classification website.

Let's investigate our dataset:

In [2]:
df.nunique()  # Count the number of different modalities in each column
Reported Job Title    7174
SOC minor group         94
dtype: int64

So we have 94 categories ("SOC minor groups") for 7174 distinct job titles, i.e. about 76 per category on average, but some categories have as few as 10 examples (df['SOC minor group'].value_counts() would tell you that). This is not the most extreme example of few-shot learning, but it is still one that is better tackled by siamese models than by standard multi-class classification approaches.
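
To see how unevenly examples are spread over categories, you can count the titles in each one. Here is a minimal standard-library sketch on made-up data (the `toy_rows` list is hypothetical and only stands in for the real DataFrame, where `df['SOC minor group'].value_counts()` gives the same information):

```python
from collections import Counter

# Hypothetical (job title, category) rows standing in for the real DataFrame
toy_rows = [
    ('chief executive officer', 'Top Executives'),
    ('chief financial officer', 'Top Executives'),
    ('chief nursing officer', 'Top Executives'),
    ('branch manager', 'Operations Specialties Managers'),
    ('project manager', 'Other Management Occupations'),
    ('program manager', 'Other Management Occupations'),
]

# Count how many examples each category has
examples_per_category = Counter(category for _, category in toy_rows)

# Average number of examples per category, and the rarest category
average = len(toy_rows) / len(examples_per_category)
rarest_category, rarest_count = min(examples_per_category.items(), key=lambda kv: kv[1])

print(f'{average:.1f} examples per category on average')
print(f'Rarest category: {rarest_category} ({rarest_count} example)')
```

The gap between the average and the rarest category is what makes the problem "few-shot" for some classes even when the dataset as a whole is not tiny.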

Before proceeding with modelling, let's create train and test sets by putting one example of each class in the test set.

In [3]:
test_set = df.groupby('SOC minor group', as_index=False)['Reported Job Title'].first()
train_set = df[~df['Reported Job Title'].isin(test_set['Reported Job Title'])]

x_train, y_train = train_set['Reported Job Title'], train_set['SOC minor group']
x_test, y_test = test_set['Reported Job Title'], test_set['SOC minor group']

We then encode the targets as numbers.

In [4]:
from sklearn.preprocessing import LabelEncoder

classes_encoder = LabelEncoder()

y_train = classes_encoder.fit_transform(y_train)
y_test = classes_encoder.transform(y_test)

Building a baseline 🐣

Before experimenting with fancy models, let's establish a strong baseline. We can start by using word embeddings to get a vector representation of each job title, and feed it to a nearest neighbor classifier, which is less likely to overfit than tree-based models or parametric classifiers.

To get the representation of a sentence from pre-trained word embeddings I'll use Zeugma, an NLP python library I've written that conveniently provides pre-trained word embeddings in the form of scikit-learn transformers.
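
To make the baseline idea concrete, here is a toy sketch in plain Python: sum the word vectors of a title, then return the label of the closest training title. The two-dimensional `toy_vectors`, the category labels and the helper names are all made up for illustration; this is not how Zeugma or scikit-learn are implemented, only the same principle at a small scale. The distance is Euclidean, which is scikit-learn's default for `KNeighborsClassifier`.

```python
import math

# Made-up 2-d word vectors (real GloVe vectors have many more dimensions)
toy_vectors = {
    'software': [1.0, 0.1], 'engineer': [0.9, 0.2], 'programmer': [1.0, 0.0],
    'sales': [0.0, 1.0], 'manager': [0.2, 0.8], 'account': [0.1, 0.9],
    'executive': [0.1, 1.0],
}

def embed(title):
    """Sum the vectors of the known words of a job title."""
    total = [0.0, 0.0]
    for word in title.split():
        vector = toy_vectors.get(word, [0.0, 0.0])  # Unknown words contribute nothing
        total = [t + v for t, v in zip(total, vector)]
    return total

def nearest_neighbor_label(title, train_titles, train_labels):
    """Return the label of the training title closest to `title` (Euclidean distance)."""
    query = embed(title)
    distances = [math.dist(query, embed(t)) for t in train_titles]
    return train_labels[distances.index(min(distances))]

train_titles = ['software engineer', 'sales manager']
train_labels = ['Computer Occupations', 'Sales Occupations']  # Hypothetical labels

print(nearest_neighbor_label('programmer', train_titles, train_labels))
print(nearest_neighbor_label('account executive', train_titles, train_labels))
```

Note that a new title can be classified correctly even if none of its words appear in the training titles, as long as its words have vectors close to theirs.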

In [5]:
from zeugma import EmbeddingTransformer

# We'll use the GloVe pre-trained embeddings, using the sum of the word embeddings
# of a job title as the embedding vector
embedding = EmbeddingTransformer('glove', aggregation='sum')
Using TensorFlow backend.
In [6]:
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

# Our model is a nearest neighbor classifier, the input of which is the sum of the 
# embeddings of words in the job title.
clf = KNeighborsClassifier(n_neighbors=1)

baseline = make_pipeline(embedding, clf)
In [7]:
baseline.fit(x_train, y_train)
print(f'Train accuracy (baseline): {100*baseline.score(x_train, y_train):.2f} %')
print(f'Test accuracy (baseline): {100*baseline.score(x_test, y_test):.2f} %')
Train accuracy (baseline): 84.94 %
Test accuracy (baseline): 25.53 %

Not great... but not too bad for a simple baseline model, considering a random guess would have a $\frac{1}{n_{\text{classes}}} = \frac{1}{94} \simeq 1 \%$ accuracy. I have tried a few other simple models but, surprisingly, I could not easily beat this baseline. Let me know if you find a better model!

Note that the test accuracy is usually expected to be as high as the train one for a nearest neighbors classifier, since no parameters are fitted. Here this is not the case because the test set contains one example per category, such that rare categories have the same weight as prominent categories in the test accuracy.
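
The effect of class weighting can be illustrated with a small made-up example: a classifier that always predicts the dominant class looks good when every example counts equally, and much worse when every class counts equally, which is what a one-example-per-class test set measures. The labels and helper names below are hypothetical.

```python
from collections import defaultdict

def overall_accuracy(y_true, y_pred):
    """Fraction of correct predictions, every example counting equally."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_average_accuracy(y_true, y_pred):
    """Average of the per-class accuracies, every class counting equally."""
    hits, counts = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        counts[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / counts[c] for c in counts) / len(counts)

# A prominent class 'a' (8 examples) and a rare class 'b' (2 examples);
# the toy classifier always answers 'a'
y_true = ['a'] * 8 + ['b'] * 2
y_pred = ['a'] * 10

print(f'Every example counts equally: {overall_accuracy(y_true, y_pred):.2f}')
print(f'Every class counts equally:   {per_class_average_accuracy(y_true, y_pred):.2f}')
```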

It may seem hard to beat this simple baseline with a deep learning model due to the high chances of overfitting with such a small dataset, but here come siamese networks to the rescue.

Few-shot learning with siamese neural networks 👯‍♀️

The nearest neighbor model of the previous section performs quite well despite its simplicity because it uses word embeddings learned on a huge NLP dataset from Twitter. Using these word embeddings is a basic form of transfer learning, which reduces overfitting by allowing smaller models to perform well, ultimately giving better performance on the test set.

Although the pre-trained embeddings are valuable, the embedding space used to determine nearest neighbors knows nothing about job titles in particular. There must be a way to learn an embedding space in which jobs that belong to the same occupation category are closer. This is where siamese networks come into play.

The main idea of siamese networks is to learn the above-mentioned vector representation by training a model that discriminates between pairs of examples that are in the same category and pairs of examples that come from different categories.

Building the pairs dataset

Let's create positive samples with pairs of job titles from the same SOC category, and negative examples with pairs of job titles sampled from different SOC codes.

In [8]:
import itertools
from random import sample

jobs_left = []
jobs_right = []
target = []

soc_codes = train_set['SOC minor group'].unique()
for code in soc_codes:
    # 1) create similar categories pairs, with a corresponding target of 1
    similar_jobs = train_set[train_set['SOC minor group'] == code]['Reported Job Title']
    # Pick 1000 random pairs from the SOC group's job titles combinations 
    group_pairs = list(itertools.combinations(similar_jobs, 2)) 
    positive_pairs = sample(group_pairs, 1000) if len(group_pairs) > 1000 else group_pairs
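    jobs_left.extend([p[0] for p in positive_pairs])
    jobs_right.extend([p[1] for p in positive_pairs])
    target.extend([1.] * len(positive_pairs))
    # 2) create pairs of examples with jobs from different categories, with a target set to 0
    other_jobs = train_set[train_set['SOC minor group'] != code]['Reported Job Title']
    for i in range(len(positive_pairs)):
        # Pair a random job title from this category with a random one from another
        # category (reconstructed step: one negative pair per positive pair)
        jobs_left.append(np.random.choice(similar_jobs.values))
        jobs_right.append(np.random.choice(other_jobs.values))
    target.extend([0.] * len(positive_pairs))
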
dataset = pd.DataFrame({
        'job_left': jobs_left,
        'job_right': jobs_right,
        'target': target
    }).sample(frac=1)  # Shuffle dataset

# Display a few pairs
dataset.head()
job_left job_right target
116767 carman marine design engineer 0.0
81897 childcare worker recreation supervisor 1.0
100046 brood hatchery manager sow farm manager 1.0
64558 public safety officer traffic control officer 1.0
14386 senior planner compiler 1.0

Here we see that "childcare worker" and "recreation supervisor" belong to the same occupation category (second example, target = 1), while "carman" and "marine design engineer" are in different occupation categories (first example, target = 0). Note that we end up with a much bigger dataset by creating pairs. The synthetic dataset contains 218,669 pairs while the original dataset has only 8,921 samples. Of course, we have only artificially increased the dataset size because we have not generated new data, but we'll see that this technique is still very powerful.
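
The growth in dataset size follows from simple counting: a category with $n$ titles yields $n(n-1)/2$ same-category pairs, capped at 1000 in the code above, and the sampling loop draws as many different-category pairs. A small sketch of that arithmetic (the function name and the example group sizes are illustrative, not taken from the dataset):

```python
from math import comb

def pairs_for_group(n_titles, max_positive_pairs=1000):
    """Number of (positive, negative) pairs generated for one category."""
    positives = min(comb(n_titles, 2), max_positive_pairs)
    negatives = positives  # One negative pair is drawn per positive pair
    return positives, negatives

for n in (10, 76, 200):
    positives, negatives = pairs_for_group(n)
    print(f'{n} titles -> {positives} positive + {negatives} negative pairs')
```

Even a category with only 10 titles contributes 45 positive pairs, which is why the pairs dataset ends up so much larger than the list of titles.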


The general architecture of the model is based on this very good tutorial. The preprocessing is fairly simple:

  • we remove the parts of the job titles that are between parentheses and lowercase them,
  • then we turn the examples into index sequences, and
  • "pad" them to get a valid input for the neural network classifier.
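
The three steps above can be sketched in plain Python as follows. The helper names, the use of 0 as the padding index and the choice to pad on the left are assumptions made for this illustration; the library transformers may differ in their details.

```python
import re
from collections import Counter

def clean(title):
    """Drop anything between parentheses, then lowercase and strip a job title."""
    return re.sub(r'\(.*\)', '', title).lower().strip()

def build_vocab(titles, num_words):
    """Map the `num_words` most frequent words to indexes 1, 2, ... (0 is kept for padding)."""
    counts = Counter(word for title in titles for word in title.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(num_words))}

def to_padded_sequence(title, vocab, max_length):
    """Turn a title into a fixed-length list of word indexes, left-padded with 0."""
    indexes = [vocab[word] for word in title.split() if word in vocab][:max_length]
    return [0] * (max_length - len(indexes)) + indexes

titles = [clean(t) for t in ['Chief Executive Officer (CEO)', 'Chief Nursing Officer']]
vocab = build_vocab(titles, num_words=100)
padded = [to_padded_sequence(t, vocab, max_length=5) for t in titles]

print(titles)
print(padded)
```

Every title ends up as a list of the same length, so a batch of titles forms a rectangular array that a neural network can consume.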

Here, again, I use convenient transformers from Zeugma to perform all those steps.

In [9]:
import re
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from zeugma import TextsToSequences, Padder, ItemSelector

max_words_job_title = 10  # To avoid very long job titles we limit them to 10 words
vocab_size = 10000  # Number of most-frequent words kept in the vocabulary

def preprocess_job_titles(job_titles):
    """ Return a list of clean job titles """
    def preprocess_job_title(raw_job_title):
        """ Clean a single job title"""
        job_title = re.sub(r'\(.*\)', '', raw_job_title)  # Remove everything between parenthesis
        return job_title.lower().strip()
    return [preprocess_job_title(jt) for jt in job_titles]

pipeline = make_pipeline(
    FunctionTransformer(preprocess_job_titles, validate=False),  # Preprocess the text
    TextsToSequences(num_words=vocab_size),  # Turn word sequences into indexes sequences
    Padder(max_length=max_words_job_title),  # Pad shorter job titles with a dummy index
)

# Note that the preprocessing pipeline must be fit on both the right and left examples
# simultaneously
pipeline.fit(list(dataset['job_left']) + list(dataset['job_right']));
In [10]:
x_left = pipeline.transform(dataset['job_left'])
x_right = pipeline.transform(dataset['job_right'])
x_pairs = [x_left, x_right]   # this will be the input of the siamese network

y_pairs = dataset['target'].values
In [11]:
# We re-use the same embedding as with the baseline model
embedding_layer = embedding.model.get_keras_embedding()

Now that we have created the pairs dataset and preprocessed the job titles, we can turn our attention to the siamese model itself. It consists of a cloned sequential network, the input of which is a pair of vectors x_left and x_right. The last layer of the sequential network applied to x_left gives the vector representation of the left job title, and likewise for x_right and the right job title. The representations of the left and right inputs are used to compute the similarity between the job titles:

$$\text{sim}\left(x_{l}, x_{r}\right) = \exp \left(-\| f(x_l) - f(x_r) \|_1\right),$$

where $\text{sim} \in [0, 1]$, $\|\cdot\|_1$ is the L1 norm, and $f$ is the function corresponding to the application of the cloned sequential network to the left/right input.
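
As a quick numerical illustration of this formula, the sketch below computes the similarity of small made-up vectors standing in for $f(x_l)$ and $f(x_r)$ (plain Python, independent of the Keras model; the function name is hypothetical):

```python
import math

def manhattan_similarity(left, right):
    """exp(-L1 distance): 1 for identical vectors, approaching 0 as they move apart."""
    l1_distance = sum(abs(a - b) for a, b in zip(left, right))
    return math.exp(-l1_distance)

identical = manhattan_similarity([0.2, 0.5], [0.2, 0.5])
close = manhattan_similarity([0.2, 0.5], [0.3, 0.4])
far = manhattan_similarity([0.0, 0.0], [1.0, 1.0])

print(f'identical: {identical:.3f}, close: {close:.3f}, far: {far:.3f}')
```

Because the similarity lies between 0 and 1, it can be compared directly with the 0/1 pair targets through a binary cross-entropy loss.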

This setting is called the Manhattan LSTM because we'll use LSTMs as the sequential network, and the L1 norm (used to compute the distance between two samples of a pair) is also called the Manhattan distance. Here is the corresponding code.

In [12]:
from keras.layers import LSTM, Bidirectional
from keras import Model, Sequential
from keras.layers import Input, Dense, Dropout, Lambda, Subtract
from keras import backend as K

def exponent_neg_manhattan_distance(arms_difference):
    """ Compute the exponent of the opposite of the L1 norm of a vector, to get the left/right inputs
    similarity from the inputs differences. This function is used to turn the unbounded
    L1 distance into a similarity measure between 0 and 1"""
    return K.exp(-K.sum(K.abs(arms_difference), axis=1, keepdims=True))

def siamese_lstm(max_length, embedding_layer):
    """ Define, compile and return a siamese LSTM model """
    input_shape = (max_length,)
    left_input = Input(input_shape, name='left_input')
    right_input = Input(input_shape, name='right_input')

    # Define a single sequential model for both arms.
    # In this example I've chosen a simple bidirectional LSTM with no dropout,
    # applied on top of the pre-trained embedding layer
    seq = Sequential(name='sequential_network')
    seq.add(embedding_layer)
    seq.add(Bidirectional(LSTM(32, dropout=0., recurrent_dropout=0.)))
    left_output = seq(left_input)
    right_output = seq(right_input)

    # Here we subtract the neuron values of the last layer from the left arm 
    # with the corresponding values from the right arm
    subtracted = Subtract(name='pair_representations_difference')([left_output, right_output])
    malstm_distance = Lambda(exponent_neg_manhattan_distance,
                             name='masltsm_distance')(subtracted)

    siamese_net = Model(inputs=[left_input, right_input], outputs=malstm_distance)
    siamese_net.compile(loss="binary_crossentropy", optimizer='adam', metrics=['accuracy'])
    return siamese_net

siamese_lstm = siamese_lstm(max_words_job_title, embedding_layer)

# Print a summary of the model mainly to know the number of trainable parameters
siamese_lstm.summary()
Layer (type)                    Output Shape         Param #     Connected to                     
left_input (InputLayer)         (None, 10)           0                                            
right_input (InputLayer)        (None, 10)           0                                            
sequential_network (Sequential) (None, 64)           29852698    left_input[0][0]                 
pair_representations_difference (None, 64)           0           sequential_network[1][0]         
masltsm_distance (Lambda)       (None, 1)            0           pair_representations_difference[0
Total params: 29,852,698
Trainable params: 14,848
Non-trainable params: 29,837,850
In [13]:
siamese_lstm.fit(x_pairs, y_pairs, validation_split=0.1, epochs=1);
Train on 132636 samples, validate on 14738 samples
Epoch 1/1
132636/132636 [==============================] - 43s 321us/step - loss: 0.6732 - acc: 0.5754 - val_loss: 0.6548 - val_acc: 0.5948

Without much effort (light preprocessing, only one epoch, no early stopping, no hyper-parameter optimization) we obtain a decent ~60 % accuracy on the validation set. But remember that this is not the final task: here we are only solving the binary classification problem of telling pairs of job titles that belong to the same occupation category from pairs sampled from different occupation categories.

To address the initial problem of finding each job title's category we have to compute, for each example in the test set, its similarity score with all the examples in the training set. The predicted category is that of the most similar example in the training set.

In [14]:
x_references = pipeline.transform(x_train)  # Preprocess the training set examples

def get_prediction(job_title):
    """ Get the predicted job title category, and the most similar job title
    in the train set. Note that this way of computing a prediction is far
    from optimal, but it'll be sufficient for us now. """
    x = pipeline.transform([job_title])
    # Compute similarities of the job title with all job titles in the train set
    x_repeated = np.repeat(x, len(x_references), axis=0)  # One copy per reference
    similarities = siamese_lstm.predict([x_repeated, x_references])
    most_similar_index = np.argmax(similarities)
    # The predicted category is the one of the most similar example from the train set
    prediction = train_set['SOC minor group'].iloc[most_similar_index]
    most_similar_example = train_set['Reported Job Title'].iloc[most_similar_index]
    return prediction, most_similar_example
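
One possible way to speed this up, sketched here under assumptions, relies on the fact that the similarity $\exp(-\| f(x_l) - f(x_r) \|_1)$ decreases with the L1 distance: the most similar training example is simply the one at the smallest L1 distance in the learned representation space. If the representations of the training titles are computed once and stored, each prediction only needs a distance search. The vectors, labels and function names below are made up, and extracting the real representations from the trained network is not shown.

```python
def l1_distance(u, v):
    """Manhattan distance between two vectors of the same length."""
    return sum(abs(a - b) for a, b in zip(u, v))

def predict_from_representations(query, reference_vectors, reference_labels):
    """Return the label and index of the reference closest to `query` in L1 distance."""
    distances = [l1_distance(query, ref) for ref in reference_vectors]
    best = min(range(len(distances)), key=distances.__getitem__)
    return reference_labels[best], best

# Hypothetical pre-computed representations of three training job titles
reference_vectors = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3], [0.0, 0.2, 0.9]]
reference_labels = ['Top Executives', 'Agricultural Workers', 'Other Management Occupations']

label, index = predict_from_representations([0.2, 0.7, 0.2], reference_vectors, reference_labels)
print(label, index)
```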

Let's check an example prediction:

In [15]:
sample_idx = 1
pred, most_sim = get_prediction(x_test[sample_idx])

print(f'Sampled test job title: {x_test[sample_idx]}')
print(f'True occupation: {test_set["SOC minor group"].iloc[sample_idx]}')
print(f'Occupation prediction: {pred}')
print(f'Most similar example in train set: {most_sim}') 
Sampled test job title: brand inspector
True occupation: Agricultural Workers
Occupation prediction: Agricultural Workers
Most similar example in train set: deputy brand inspector
In [16]:
from sklearn.metrics import accuracy_score

y_pred = [get_prediction(job_title)[0] for job_title in test_set['Reported Job Title']]
accuracy = accuracy_score(classes_encoder.transform(y_pred), y_test)

print(f'Test accuracy (siamese model): {100*accuracy:.2f} %')
Test accuracy (siamese model): 39.36 %

The siamese model thus outperforms the random guess (accuracy ~1 %) and the nearest neighbor baseline (~25 %) by a substantial margin 🤹‍♀️. Even though it is far from perfect, it predicts the right category for a job title roughly 2 times out of 5, while it has to choose between roughly a hundred of them.

What this means is that the siamese model managed to squeeze some juice out of all the examples in the dataset, even across different categories. The original multi-class classification approach does not allow the model to learn "across categories" because the categorical cross-entropy that is typically used for those problems actually treats the multi-class classification task as a set of independent binary classification tasks.

The siamese network approach to the few-shot learning problem is definitely a way out with textual data 📚. At the cost of more complex modelling, it gives better results than standard classification methods. Give it a shot if you want to classify your data with machine learning but don't have many examples per category; you won't regret it 🎊.

Please don't hesitate to leave a comment, I'd be happy to learn about your experiments with few-shot learning problems, to clarify some parts, or to have any feedback on this post. Thanks for reading!
