Introducing Optimum: The Optimization Toolkit for Transformers at Scale

Transformers have revolutionized the field of natural language processing (NLP) and have expanded to other modalities such as speech and vision. However, taking these massive models into production and making them run fast at scale is a significant challenge for any machine learning engineering team. In this article, we will introduce Optimum, a new open-source library that aims to build the definitive toolkit for transformers production performance.

Scaling Transformers is Hard

What do Tesla, Google, Microsoft, and Facebook all have in common? They all run billions of transformer model predictions every day. Transformers have brought a step-change improvement in the accuracy of machine learning models and have conquered NLP. At that volume, every millisecond of latency and every unit of compute per prediction counts, which is why production performance is such a hard problem for machine learning engineering teams.

The Complexity of Model Acceleration

To get optimal performance when training and serving models, model acceleration techniques need to be compatible with the targeted hardware. Each hardware platform offers specific software tooling, features, and knobs that can have a huge impact on performance. Similarly, to take advantage of advanced model acceleration techniques like sparsity and quantization, optimized kernels need to be compatible with the operators on silicon, and specific to the neural network graph derived from the model architecture.

The 3D Compatibility Matrix

Diving into this 3D compatibility matrix (hardware platform, acceleration technique, and model architecture) and learning how to use model acceleration libraries is daunting work, which few machine learning engineers have experience with. Optimum aims to make this work easy by providing performance optimization tools that target efficient AI hardware, built in collaboration with our hardware partners, and by turning machine learning engineers into ML optimization wizards.

Optimum in Practice: How to Quantize a Model for Intel Xeon CPU

Pre-trained language models such as BERT have achieved state-of-the-art results on a wide range of natural language processing tasks. However, putting transformer-based models into production can be tricky and expensive as they need a lot of compute power to work. To solve this, many techniques exist, the most popular being quantization.
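To make the idea concrete, here is a minimal sketch of 8-bit affine quantization in plain Python. The function names and the example values are illustrative assumptions, not part of Optimum or any library. Each float is mapped to an integer in [0, 255] through a scale and a zero point, then mapped back with a small rounding error.

```python
def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers: q = clamp(round(x / scale) + zero_point)."""
    return [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats: x ~ (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in q_values]


# Hypothetical weights covering the range [-1.0, 1.0]
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale = (1.0 - (-1.0)) / 255  # float width of one integer step
zero_point = 128              # integer chosen to represent 0.0

q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)

# Largest difference between an original weight and its restored value
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The stored values now fit in one byte each instead of four, and the error introduced on any weight in range is bounded by roughly half a step.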

The Challenges of Quantization

Quantizing a model requires a lot of work, for many reasons:

The model needs to be edited: some ops need to be replaced by their quantized counterparts, new ops need to be inserted (quantization and dequantization nodes), and others need to be adapted to the fact that weights and activations will be quantized.
This part can be very time-consuming because frameworks such as PyTorch work in eager mode, meaning that the changes mentioned above need to be added to the model implementation itself.
PyTorch now provides a tool called torch.fx that allows you to trace and transform your model without having to actually change the model implementation, but it is tricky to use when tracing is not supported for your model out of the box.
On top of the actual editing, it is also necessary to find which parts of the model need to be edited, which ops have an available quantized kernel counterpart and which ops don't, and so on.
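The search for editable parts of a model can be sketched with torch.fx on a toy module. `TinyBlock` and `find_linear_targets` are hypothetical names introduced for this illustration. The sketch only locates the `nn.Linear` calls that a quantization pass would typically target; it does not perform the rewrite itself.

```python
import torch
import torch.fx
import torch.nn as nn


class TinyBlock(nn.Module):
    """Hypothetical stand-in for one feed-forward block of a larger model."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))


def find_linear_targets(module):
    """Trace the module and return the submodule names of all nn.Linear calls."""
    traced = torch.fx.symbolic_trace(module)
    submodules = dict(traced.named_modules())
    return [
        node.target
        for node in traced.graph.nodes
        if node.op == "call_module" and isinstance(submodules[node.target], nn.Linear)
    ]


linear_targets = find_linear_targets(TinyBlock())
```

Because the graph is extracted by tracing, the model's source code stays untouched. Models with data-dependent control flow are the typical case where tracing fails out of the box.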

The Trade-Off Between Quantization and Accuracy

Once the model has been edited, there are many parameters to play with to find the best quantization settings:

Which kind of observers should I use for range calibration?
Which quantization scheme should I use?
Which quantization-related data types (int8, uint8, int16) are supported on my target device?
Beyond these settings, two steps remain: balancing the trade-off between quantization and an acceptable accuracy loss, and exporting the quantized model for the target device.
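The observer and scheme questions above can be illustrated with a minimal min/max observer written in plain Python. This is a sketch under stated assumptions: `MinMaxObserver` is a hypothetical class defined here, not the PyTorch or Neural Compressor implementation, and the calibration values are made up. It derives the scale and zero point for an affine uint8 scheme and for a symmetric int8 scheme from the same observed range.

```python
class MinMaxObserver:
    """Hypothetical observer that tracks the running min and max of activations."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, values):
        self.lo = min(self.lo, min(values))
        self.hi = max(self.hi, max(values))

    def affine_uint8(self):
        """Asymmetric scheme: the observed range maps onto [0, 255]."""
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)  # range must contain 0.0
        scale = (hi - lo) / 255
        zero_point = round(-lo / scale)
        return scale, zero_point

    def symmetric_int8(self):
        """Symmetric scheme: [-m, m] maps onto [-127, 127], zero point is 0."""
        m = max(abs(self.lo), abs(self.hi))
        return m / 127, 0


observer = MinMaxObserver()
# Made-up calibration batches, e.g. post-ReLU activations (all non-negative)
for batch in ([0.0, 0.8, 2.1], [0.3, 1.7, 4.0]):
    observer.observe(batch)

affine_scale, affine_zp = observer.affine_uint8()
symmetric_scale, symmetric_zp = observer.symmetric_int8()
```

For this skewed range the affine scheme yields a step roughly half the size of the symmetric one, because the symmetric scheme reserves half of its integer range for negative values that never occur.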

How Intel is Solving Quantization and More with Neural Compressor

Intel Neural Compressor (formerly referred to as Low Precision Optimization Tool or LPOT) is an open-source Python library designed to help users deploy low-precision inference solutions. It applies low-precision recipes to deep-learning models to meet product objectives, such as inference performance and memory usage, while satisfying expected performance criteria. Neural Compressor supports post-training quantization, quantization-aware training, and dynamic quantization.
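The difference between a statically calibrated range and a dynamic one can be sketched in a few lines of plain Python. This is an illustration of the general technique with made-up numbers and hypothetical helper names; it does not show how Neural Compressor implements these approaches.

```python
def int8_scale(values):
    """Symmetric int8 scale covering the largest magnitude in `values`."""
    return max(abs(v) for v in values) / 127


def fake_quantize(values, scale):
    """Quantize to int8 and immediately dequantize, exposing the rounding error."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def max_error(values, scale):
    """Largest absolute difference introduced by quantizing with `scale`."""
    return max(abs(a - b) for a, b in zip(values, fake_quantize(values, scale)))


# Made-up data: calibration batches seen ahead of time, and one runtime input
calibration = [[0.2, -0.9, 0.5], [1.1, -0.4, 0.7]]
runtime_input = [0.05, -0.10, 0.08]

# Static: one scale is fixed from calibration data before deployment
static_scale = max(int8_scale(batch) for batch in calibration)
# Dynamic: the scale is recomputed from each input at inference time
dynamic_scale = int8_scale(runtime_input)

static_error = max_error(runtime_input, static_scale)
dynamic_error = max_error(runtime_input, dynamic_scale)
```

When a runtime input is much smaller than the calibrated range, the dynamic scale fits it more tightly and the error drops, at the cost of computing the range on every call.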

The Configuration File

To specify the quantization approach, objective, and performance criteria, the user must provide a YAML configuration file containing the tuning parameters. The configuration file can either be hosted on the Hugging Face Model Hub or be given through a local directory path.
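As a rough picture of what such a configuration expresses, the sketch below encodes the three settings as a Python dictionary and checks a relative accuracy criterion. The key names, the 3% tolerance, and the accuracy numbers are all assumptions made for illustration; they are not Neural Compressor's actual schema, which should be taken from its documentation.

```python
# Hypothetical settings mirroring what the configuration file specifies.
# Key names are illustrative assumptions, not Neural Compressor's schema.
config = {
    "approach": "post_training_dynamic",
    "objective": "performance",
    "accuracy_criterion": {"relative": 0.03},  # tolerate up to a 3% relative drop
}


def meets_accuracy_criterion(baseline, quantized, config):
    """Return True if the relative accuracy drop stays within the tolerance."""
    tolerance = config["accuracy_criterion"]["relative"]
    relative_drop = (baseline - quantized) / baseline
    return relative_drop <= tolerance


# Made-up accuracy numbers for two candidate quantized models
accepted = meets_accuracy_criterion(baseline=0.90, quantized=0.88, config=config)
rejected = meets_accuracy_criterion(baseline=0.90, quantized=0.85, config=config)
```

A tuning loop would try quantization settings until a candidate passes this kind of check, then keep the fastest one that does.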

How to Easily Quantize Transformers for Intel Xeon CPUs with Optimum

Optimum will focus on achieving optimal production performance on dedicated hardware, where software and hardware acceleration techniques can be applied for maximum efficiency. We will work hand in hand with our hardware partners to enable, test, and maintain acceleration, and deliver it in an easy and accessible way through Optimum, as we did with Intel and Neural Compressor.

The Collaboration with Hardware Partners

The collaboration with our hardware partners will yield hardware-specific optimized model configurations and artifacts, which we will make available to the AI community via the Hugging Face Model Hub. We hope that Optimum and hardware-optimized models will accelerate the adoption of efficiency in production workloads, which represent most of the aggregate energy spent on machine learning.

Code Blocks:

import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple neural network model
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5, 10)  # input layer (5) -> hidden layer (10)
        self.fc2 = nn.Linear(10, 5)  # hidden layer (10) -> output layer (5)

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # activation function for hidden layer
        x = self.fc2(x)
        return x

# Initialize the model, loss function, and optimizer
model = Net()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Train the model
for epoch in range(100):
    inputs = torch.randn(10, 5)
    labels = torch.randn(10, 5)
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

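// Fully connected layer with a hand-written backward pass (plain JavaScript, no library)
class Linear {
  constructor(inSize, outSize) {
    this.inSize = inSize;
    this.outSize = outSize;
    this.weights = Float32Array.from({ length: inSize * outSize }, () => 0.2 * (Math.random() - 0.5));
    this.bias = new Float32Array(outSize);
  }
  forward(x) {
    this.input = x;  // cached for the backward pass
    return Float32Array.from({ length: this.outSize }, (_, o) => {
      let sum = this.bias[o];
      for (let i = 0; i < this.inSize; i++) sum += this.weights[o * this.inSize + i] * x[i];
      return sum;
    });
  }
  // Apply one SGD step and return the gradient with respect to the input
  backward(gradOut, lr) {
    const gradIn = new Float32Array(this.inSize);
    for (let o = 0; o < this.outSize; o++) {
      for (let i = 0; i < this.inSize; i++) {
        const k = o * this.inSize + i;
        gradIn[i] += this.weights[k] * gradOut[o];
        this.weights[k] -= lr * gradOut[o] * this.input[i];
      }
      this.bias[o] -= lr * gradOut[o];
    }
    return gradIn;
  }
}

// Define a simple neural network model
class Net {
  constructor() {
    this.fc1 = new Linear(5, 10);  // input layer (5) -> hidden layer (10)
    this.fc2 = new Linear(10, 5);  // hidden layer (10) -> output layer (5)
  }
  forward(x) {
    this.hidden = this.fc1.forward(x).map((v) => Math.max(0, v));  // ReLU activation
    return this.fc2.forward(this.hidden);
  }
  backward(gradOut, lr) {
    // ReLU passes the gradient only where the hidden activation was positive
    const gradHidden = this.fc2.backward(gradOut, lr).map((g, j) => (this.hidden[j] > 0 ? g : 0));
    this.fc1.backward(gradHidden, lr);
  }
}

// Train the model on random data with a mean squared error loss and SGD
const model = new Net();
const lr = 0.01;
const randomVector = (n) => Float32Array.from({ length: n }, () => 2 * Math.random() - 1);
for (let epoch = 0; epoch < 100; epoch++) {
  const inputs = randomVector(5);
  const labels = randomVector(5);
  const outputs = model.forward(inputs);
  const diff = outputs.map((y, i) => y - labels[i]);
  const loss = diff.reduce((acc, d) => acc + d * d, 0) / diff.length;
  model.backward(diff.map((d) => (2 * d) / diff.length), lr);
  console.log(`Epoch ${epoch + 1}, Loss: ${loss}`);
}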

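#include <stdio.h>
#include <stdlib.h>

// A single fully connected layer: outputs = weights * inputs + bias
typedef struct {
  int input_size;
  int output_size;
  float *weights;  // row-major, output_size x input_size
  float *bias;     // output_size entries
} Linear;

// Uniform random float in [-1, 1]
static float rand_uniform(void) {
  return 2.0f * (float)rand() / (float)RAND_MAX - 1.0f;
}

Linear *linear_create(int input_size, int output_size) {
  Linear *linear = malloc(sizeof(Linear));
  linear->input_size = input_size;
  linear->output_size = output_size;
  linear->weights = malloc((size_t)input_size * output_size * sizeof(float));
  linear->bias = calloc((size_t)output_size, sizeof(float));
  // Small random weights; the bias starts at zero
  for (int i = 0; i < input_size * output_size; i++) {
    linear->weights[i] = 0.1f * rand_uniform();
  }
  return linear;
}

void linear_free(Linear *linear) {
  free(linear->weights);
  free(linear->bias);
  free(linear);
}

void linear_forward(const Linear *linear, const float *inputs, float *outputs) {
  for (int o = 0; o < linear->output_size; o++) {
    float sum = linear->bias[o];
    for (int i = 0; i < linear->input_size; i++) {
      sum += linear->weights[o * linear->input_size + i] * inputs[i];
    }
    outputs[o] = sum;
  }
}

int main(void) {
  enum { IN = 5, OUT = 10 };
  const float lr = 0.01f;
  Linear *linear = linear_create(IN, OUT);
  float inputs[IN], labels[OUT], outputs[OUT];

  // Train the model
  for (int epoch = 0; epoch < 100; epoch++) {
    // Fresh random inputs and targets each epoch
    for (int i = 0; i < IN; i++) inputs[i] = rand_uniform();
    for (int o = 0; o < OUT; o++) labels[o] = rand_uniform();

    linear_forward(linear, inputs, outputs);

    // Mean squared error, followed by one SGD step on weights and bias
    float loss = 0.0f;
    for (int o = 0; o < OUT; o++) {
      float diff = outputs[o] - labels[o];
      loss += diff * diff / OUT;
      float grad = 2.0f * diff / OUT;  // d(loss) / d(outputs[o])
      for (int i = 0; i < IN; i++) {
        linear->weights[o * IN + i] -= lr * grad * inputs[i];
      }
      linear->bias[o] -= lr * grad;
    }
    printf("Epoch %d, Loss: %f\n", epoch + 1, loss);
  }

  linear_free(linear);
  return 0;
}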

Source: https://huggingface.co/blog/hardware-partners-program
