Recognizing human facial expressions with machine learning

Commonly used FER system architectures

The image preprocessing stage can include image transformations such as scaling, cropping, or filtering. It is often used to accentuate relevant image information, for example by cropping an image to remove the background. It can also be used to augment a dataset, for example by generating multiple versions of an original image with different crops or transformations applied.
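As a rough illustration of the augmentation idea, the sketch below produces several randomly cropped, and sometimes mirrored, versions of one image using NumPy. The function name `augment`, the 48x48 stand-in image, and the crop size are assumptions made for the example, not part of any particular FER toolkit.

```python
import numpy as np

def augment(image, crop_size, n_versions, seed=0):
    """Return n_versions random crops of `image`, each mirrored with probability 0.5."""
    rng = np.random.default_rng(seed)
    height, width = image.shape[:2]
    crop_h, crop_w = crop_size
    versions = []
    for _ in range(n_versions):
        # Pick a top-left corner so the crop stays inside the image.
        top = rng.integers(0, height - crop_h + 1)
        left = rng.integers(0, width - crop_w + 1)
        crop = image[top:top + crop_h, left:left + crop_w]
        if rng.random() < 0.5:
            crop = crop[:, ::-1]  # horizontal mirror
        versions.append(crop.copy())
    return np.stack(versions)

# Stand-in for a grayscale face image (hypothetical data).
face = np.arange(48 * 48, dtype=np.float32).reshape(48, 48)
batch = augment(face, crop_size=(40, 40), n_versions=5)
print(batch.shape)
```

Each augmented version keeps the label of the original image, so one labeled example becomes several training examples.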

The feature extraction stage goes further in finding the more descriptive parts of an image. Often this means finding information which can be most indicative of a particular class, such as the edges, textures, or colors.
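One common way to extract edge information is a Sobel filter. The sketch below is a minimal NumPy version, shown only to make the idea concrete; the helper names `filter2d` and `edge_magnitude` and the toy 8x8 image are assumptions for the example.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = SOBEL_X.T

def filter2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum the weighted pixels."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out

def edge_magnitude(image):
    """Combine horizontal and vertical gradients into one edge-strength map."""
    gx = filter2d(image, SOBEL_X)
    gy = filter2d(image, SOBEL_Y)
    return np.hypot(gx, gy)

# Toy image: dark left half, bright right half, so there is one vertical edge.
image = np.zeros((8, 8), dtype=np.float32)
image[:, 4:] = 1.0
edges = edge_magnitude(image)
```

The resulting map is non-zero only in the columns where the filter window straddles the boundary between the dark and bright halves.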

The training stage takes place according to the defined training architecture, which determines how the layers of the neural network feed into each other. Architectures must be designed with the composition of the feature extraction and image preprocessing stages in mind. This matters because some components work well together, while others are better applied separately.

For example, certain types of feature extraction are not useful in conjunction with deep learning algorithms. Both the feature extraction step and the deep learning algorithm find relevant features in images, such as edges, so using the two together is redundant. Applying feature extraction prior to a deep learning algorithm is not only unnecessary, but can even negatively impact the performance of the architecture.

A comparison of training algorithms

Once any feature extraction or image preprocessing stages are complete, the training algorithm produces a trained prediction model. A number of options exist for training FER models, each of which has strengths and weaknesses that make it more or less suitable for particular situations.

In this article we will compare some of the most common algorithms used in FER:


  • Multiclass Support Vector Machines (SVM)
  • Convolutional Neural Networks (CNN)
  • Recurrent Neural Networks (RNN)
  • Convolutional Long Short-Term Memory (ConvLSTM)

Multiclass Support Vector Machines (SVM) are supervised learning algorithms that analyze and classify data, and they perform well when classifying human facial expressions. However, they only do so when the images are created in a controlled lab setting with consistent head poses and illumination.

SVMs perform less well when classifying images captured “in the wild,” or in spontaneous, uncontrolled settings. Therefore, the latest training architectures being explored are all deep neural networks which perform better under those circumstances. Convolutional Neural Networks (CNN) are currently considered the go-to neural networks for image classification, because they pick up on patterns in small parts of an image, such as the curve of an eyebrow.

CNNs apply kernels, which are matrices smaller than the image, to chunks of the input image. By applying kernels to inputs, new activation matrices, sometimes referred to as feature maps, are generated and passed as inputs to the next layer of the network. In this way, CNNs process more granular elements within an image, making them better at distinguishing between two similar emotion classifications.
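The kernel step can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions (one grayscale input, no padding, stride 1, hand-picked 2x2 kernels, ReLU activation); in a trained CNN the kernel values are learned, and the name `conv_layer` is made up for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_layer(image, kernels):
    """Apply each kernel to every patch of `image` and return ReLU feature maps.

    image:   (H, W) array
    kernels: (N, kh, kw) array
    returns: (N, H - kh + 1, W - kw + 1) array, one feature map per kernel
    """
    patches = sliding_window_view(image, kernels.shape[1:])  # (out_h, out_w, kh, kw)
    activations = np.einsum("ijkl,nkl->nij", patches, kernels)
    return np.maximum(activations, 0.0)

# Toy image containing a horizontal bright bar.
image = np.zeros((6, 6))
image[2:4, :] = 1.0

kernels = np.array([
    [[-1, -1], [1, 1]],   # responds to dark-above, bright-below transitions
    [[-1, 1], [-1, 1]],   # responds to dark-left, bright-right transitions
], dtype=float)

feature_maps = conv_layer(image, kernels)
print(feature_maps.shape)
```

Here the first feature map lights up along the top edge of the bar, while the second stays at zero because the image contains no left-to-right brightness change. Stacked feature maps like these are what the next layer receives as input.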

Alternatively, Recurrent Neural Networks (RNN) use dynamic temporal behavior when classifying an image. This means that when an RNN processes an input example, it doesn’t just look at the data from that example; it also looks at the data from previous inputs, which provide further context. In FER, the context could be the previous image frames of a video clip.

The idea of this approach is to capture the transitions between facial patterns over time, allowing these changes to become additional data points supporting classification. For example, it is possible to capture the changes in the edges of the lips as an expression goes from neutral to happy by smiling, rather than just the edges of a smile from an individual image frame.
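To make the role of context concrete, here is a minimal sketch of a simple recurrent layer in NumPy. The weights are random and untrained, and the four-element "neutral" and "smile" vectors are placeholder per-frame features invented for the example; the point is only that the hidden state after the final frame depends on the frames that came before it.

```python
import numpy as np

def rnn_forward(frames, w_in, w_rec, bias):
    """Run a simple recurrent layer over per-frame feature vectors.

    Returns the hidden state after each frame, shape (n_frames, n_hidden).
    """
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for x in frames:
        # The new state mixes the current frame with the previous state.
        hidden = np.tanh(w_in @ x + w_rec @ hidden + bias)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
n_features, n_hidden = 4, 3
w_in = rng.normal(scale=0.5, size=(n_hidden, n_features))   # untrained, random
w_rec = rng.normal(scale=0.5, size=(n_hidden, n_hidden))    # untrained, random
bias = np.zeros(n_hidden)

neutral = np.zeros(n_features)  # placeholder features for a neutral frame
smile = np.ones(n_features)     # placeholder features for a smiling frame

neutral_to_smile = rnn_forward([neutral, neutral, smile], w_in, w_rec, bias)
smile_throughout = rnn_forward([smile, smile, smile], w_in, w_rec, bias)

# Same final frame, different history, different final hidden state.
print(np.allclose(neutral_to_smile[-1], smile_throughout[-1]))
```

A classifier reading the final hidden state can therefore distinguish a face that just started smiling from one that has been smiling all along, which a single-frame model cannot do.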

Combinations for greater effect

A CNN’s strength in extracting local data can be combined with an RNN’s ability to use temporal context in a Convolutional Long Short-Term Memory (ConvLSTM) network. These systems use convolutional layers to extract features and LSTM layers to capture changes in image sequences.
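The sketch below follows that description literally: each frame passes through a convolutional step, and the resulting features feed an LSTM step that carries state from frame to frame. It is a toy NumPy illustration with random, untrained weights, and it does not reproduce any specific library's ConvLSTM layer; all function names, sizes, and the random "video clip" are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_features(frame, kernels):
    """Convolve one frame with each kernel, apply ReLU, max-pool each map to a scalar."""
    patches = sliding_window_view(frame, kernels.shape[1:])
    maps = np.maximum(np.einsum("ijkl,nkl->nij", patches, kernels), 0.0)
    return maps.max(axis=(1, 2))  # one feature per kernel

def lstm_step(x, h, c, weights, bias):
    """One LSTM step; `weights` maps [x, h] to the four stacked gate pre-activations."""
    z = weights @ np.concatenate([x, h]) + bias
    i, f, o, g = np.split(z, 4)                   # input, forget, output, candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the cell memory
    h = sigmoid(o) * np.tanh(c)                   # expose part of it as the new state
    return h, c

def classify_clip(frames, kernels, weights, bias, w_out):
    """Return class probabilities for a sequence of frames."""
    n_hidden = w_out.shape[1]
    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    for frame in frames:
        h, c = lstm_step(conv_features(frame, kernels), h, c, weights, bias)
    logits = w_out @ h
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
n_kernels, n_hidden, n_classes = 4, 8, 3
kernels = rng.normal(size=(n_kernels, 3, 3))
weights = rng.normal(scale=0.1, size=(4 * n_hidden, n_kernels + n_hidden))
bias = np.zeros(4 * n_hidden)
w_out = rng.normal(size=(n_classes, n_hidden))

clip = rng.random(size=(5, 16, 16))  # 5 random frames standing in for a video clip
probabilities = classify_clip(clip, kernels, weights, bias, w_out)
print(probabilities.shape)
```

Because the weights are untrained, the output probabilities are meaningless; the example only shows how the data flows from frames, through convolutional features, into a state that persists across the sequence.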

Since deep neural networks are good at identifying patterns in images, they can also be used for feature extraction. Some FER approaches use CNNs to produce feature vectors that are then sent to an SVM for classification. This approach can lead to more accurate results, but the architecture is more complex, requires extra programming effort, and increases the processing time for each classified image.

The performance of any of these approaches varies depending on the input data, training parameters, emotion set and the system requirements. For these reasons it is important to experiment with various training architectures and datasets, assessing the accuracy and usefulness of each combination.
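One simple way to compare candidate models is to score each on the same held-out labeled images and look at both overall accuracy and the per-class confusion matrix. The sketch below uses made-up labels and predictions purely to show the bookkeeping; the emotion names, `model_a`, and `model_b` are hypothetical.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t, p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of examples on the diagonal, i.e. classified correctly."""
    return np.trace(matrix) / matrix.sum()

EMOTIONS = ["neutral", "happy", "angry"]  # hypothetical emotion set
true_labels = [0, 0, 1, 1, 2, 2]          # made-up held-out labels

# Made-up predictions from two hypothetical models on the same images.
predictions = {
    "model_a": [0, 0, 1, 1, 2, 1],
    "model_b": [0, 1, 1, 0, 2, 1],
}

scores = {
    name: accuracy(confusion_matrix(true_labels, predicted, len(EMOTIONS)))
    for name, predicted in predictions.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

The confusion matrix is worth inspecting alongside the accuracy, since two models with similar accuracy can confuse very different pairs of emotions.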

Finding the right data

As described above, FER models must be trained on a set of labeled images before they can be used to classify new input images. Training for these applications requires large datasets of facial images, each displaying a discrete emotion. The more labeled images, the better.

Deciding which dataset to train the network on is no easy task, particularly because high-quality FER datasets are hard to find. Few such datasets are publicly available, and those that are come with idiosyncrasies that must be understood and taken into account.