Faster TensorFlow models in Hugging Face Transformers

November 26, 2025
By ZadeNor AI Team

In recent months, the Hugging Face team has been working diligently to improve the performance of TensorFlow models in Transformers. The primary focus of these improvements has been on two key aspects: computational performance and TensorFlow Serving.

Computational Performance

To demonstrate the improvements in computational performance, a thorough benchmark was conducted, comparing BERT served through TensorFlow Serving in Transformers v4.2.0 with the official Google implementation. The benchmark was run on a V100 GPU with a sequence length of 128 (times are in milliseconds).

Batch size   Google implementation (ms)   v4.2.0 implementation (ms)   Relative difference
1            6.7                          6.26                         6.79%
2            9.4                          8.68                         7.96%
4            14.4                         13.1                         9.45%
8            24                           21.5                         10.99%
16           46.6                         42.3                         9.67%
32           83.9                         80.4                         4.26%
64           171.5                        156                          9.47%
128          338.5                        309                          9.11%

The current implementation of BERT in v4.2.0 is faster than the Google implementation by up to ~10%. Additionally, it is twice as fast as the implementations in the 4.1.1 release.
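
The source does not say how the relative difference column is computed. The numbers are consistent with a symmetric relative difference (the gap divided by the mean of the two timings), so the sketch below assumes that definition and recomputes the column from the timings in the table. The function name `relative_difference` is illustrative only.

```python
# Benchmark rows copied from the table above:
# (batch size, Google ms, v4.2.0 ms, reported relative difference in %).
ROWS = [
    (1, 6.7, 6.26, 6.79),
    (2, 9.4, 8.68, 7.96),
    (4, 14.4, 13.1, 9.45),
    (8, 24.0, 21.5, 10.99),
    (16, 46.6, 42.3, 9.67),
    (32, 83.9, 80.4, 4.26),
    (64, 171.5, 156.0, 9.47),
    (128, 338.5, 309.0, 9.11),
]


def relative_difference(baseline_ms, candidate_ms):
    """Symmetric relative difference in percent (assumed definition):
    the gap between the two timings divided by their mean."""
    mean = (baseline_ms + candidate_ms) / 2
    return (baseline_ms - candidate_ms) / mean * 100


if __name__ == "__main__":
    for batch, google_ms, hf_ms, reported in ROWS:
        computed = relative_difference(google_ms, hf_ms)
        print(f"batch={batch:>3}  computed={computed:.2f}%  reported={reported:.2f}%")
```

Under this definition, a plain "percent faster than Google" figure would be slightly larger than the reported value, since it divides by the Google timing alone rather than by the mean.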

TensorFlow Serving

The previous section demonstrated the significant improvement in computational performance of the BERT model in Transformers. In this section, we will walk through a step-by-step guide on how to deploy a BERT model with TensorFlow Serving to take advantage of this increased performance in a production environment.

What is TensorFlow Serving?

TensorFlow Serving is a tool provided by TensorFlow Extended (TFX) that simplifies the task of deploying a model to a server. TensorFlow Serving provides two APIs: one that can be called using HTTP requests and another one using gRPC to run inference on the server.

What is a SavedModel?

A SavedModel is a standalone TensorFlow model that includes its weights and architecture. It does not require the original source code of the model to be run, making it useful for sharing or deploying with any backend that supports reading a SavedModel, such as Java, Go, C++, or JavaScript, among others.

The internal structure of a SavedModel is represented as follows:

savedmodel
    /assets
        -> assets needed by the model (if any)
    /variables
        -> model checkpoints containing the weights
    saved_model.pb -> protobuf file representing the model graph
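
As a quick sanity check, the layout above can be verified before handing a directory to a serving backend. This is a minimal sketch using only the standard library; the helper name `looks_like_savedmodel` is hypothetical, and it only checks for the files listed above rather than validating the protobuf itself.

```python
from pathlib import Path


def looks_like_savedmodel(export_dir):
    """Return True if export_dir has the minimal SavedModel layout:
    a saved_model.pb file and a variables/ directory.
    The assets/ directory is optional, so it is not checked."""
    root = Path(export_dir)
    has_graph = (root / "saved_model.pb").is_file()
    has_variables = (root / "variables").is_dir()
    return has_graph and has_variables


if __name__ == "__main__":
    # "path/to/savedmodel" is a placeholder; point it at a real export.
    print(looks_like_savedmodel("path/to/savedmodel"))
```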

How to Install TensorFlow Serving?

There are three ways to install and use TensorFlow Serving:

  • through a Docker container,
  • through an apt package,
  • or using pip.

To keep things simple and compatible with all major operating systems, we will use Docker in this tutorial.

How to Create a SavedModel?

SavedModel is the format expected by TensorFlow Serving. Since Transformers v4.2.0, creating a SavedModel has three additional features:

  • The sequence length can be modified freely between runs.
  • All model inputs are available for inference.
  • Hidden states or attention are now grouped into a single output when returning them with output_hidden_states=True or output_attentions=True.

Below, you can find the input and output representations of a TFBertForSequenceClassification saved as a TensorFlow SavedModel:

The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_attention_mask:0
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_input_ids:0
  inputs['token_type_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_token_type_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['attentions'] tensor_info:
      dtype: DT_FLOAT
      shape: (12, -1, 12, -1, -1)
      name: StatefulPartitionedCall:0
  outputs['logits'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 2)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
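
The (-1, -1) shapes in the signature mean that both the batch size and the sequence length are dynamic. A single request still has to form a rectangular tensor, so examples of different lengths need to be padded to a common length before they are sent together. The tokenizer can normally do this for you; the sketch below only illustrates the idea in plain Python. The helper name `pad_instances` is hypothetical, and the padding token ID of 0 is an assumption that should be checked against the tokenizer in use.

```python
def pad_instances(instances, pad_token_id=0):
    """Pad tokenized examples to the length of the longest one.

    Each instance is a dict with the three inputs from the signature above.
    pad_token_id=0 is an assumption; check your tokenizer's pad token ID.
    """
    max_len = max(len(inst["input_ids"]) for inst in instances)
    padded = []
    for inst in instances:
        gap = max_len - len(inst["input_ids"])
        padded.append({
            # Padding positions get the pad token ...
            "input_ids": inst["input_ids"] + [pad_token_id] * gap,
            # ... are masked out of attention ...
            "attention_mask": inst["attention_mask"] + [0] * gap,
            # ... and are assigned to segment 0.
            "token_type_ids": inst["token_type_ids"] + [0] * gap,
        })
    return padded


if __name__ == "__main__":
    # Token IDs below are illustrative placeholders, not real tokenizer output.
    short = {"input_ids": [101, 7, 102], "attention_mask": [1, 1, 1], "token_type_ids": [0, 0, 0]}
    long_ = {"input_ids": [101, 7, 8, 9, 102], "attention_mask": [1] * 5, "token_type_ids": [0] * 5}
    for row in pad_instances([short, long_]):
        print(row)
```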

To directly pass inputs_embeds (the token embeddings) instead of input_ids (the token IDs) as input, we need to subclass the model to have a new serving signature. The following snippet of code shows how to do so:

from transformers import TFBertForSequenceClassification
import tensorflow as tf

class MyOwnModel(TFBertForSequenceClassification):
    @tf.function(input_signature=[{
        "inputs_embeds": tf.TensorSpec((None, None, 768), tf.float32, name="inputs_embeds"),
        "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
        "token_type_ids": tf.TensorSpec((None, None), tf.int32, name="token_type_ids"),
    }])
    def serving(self, inputs):
        output = self.call(inputs)
        return self.serving_output(output)

model = MyOwnModel.from_pretrained("bert-base-cased")
model.save_pretrained("my_model", saved_model=True)

The serving method has to be overridden so that a new input_signature argument can be passed to the tf.function decorator. See the official documentation to learn more about the input_signature argument. The serving method defines how a SavedModel behaves when deployed with TensorFlow Serving.

The SavedModel now looks as expected. Note the new inputs_embeds input:

The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_attention_mask:0
  inputs['inputs_embeds'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, -1, 768)
      name: serving_default_inputs_embeds:0
  inputs['token_type_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_token_type_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['attentions'] tensor_info:
      dtype: DT_FLOAT
      shape: (12, -1, 12, -1, -1)
      name: StatefulPartitionedCall:0
  outputs['logits'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 2)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
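
To make the new signature concrete, here is a minimal sketch of a REST request body that matches it. It uses only the standard library and placeholder all-zero embeddings; real embeddings would come from your own embedding step. The helper name `build_embeds_instance` is hypothetical, and the all-ones attention mask and all-zeros token type IDs are assumptions suited to a single unpadded sentence.

```python
import json

HIDDEN_SIZE = 768  # last dimension of inputs_embeds in the signature above


def build_embeds_instance(embeddings):
    """Wrap one (seq_len, 768) nested list of floats as a request instance."""
    for row in embeddings:
        if len(row) != HIDDEN_SIZE:
            raise ValueError(f"expected rows of length {HIDDEN_SIZE}, got {len(row)}")
    seq_len = len(embeddings)
    return {
        "inputs_embeds": embeddings,
        "attention_mask": [1] * seq_len,   # attend to every position
        "token_type_ids": [0] * seq_len,   # single-segment input
    }


# Placeholder embeddings for a sequence of 4 tokens (all zeros, for illustration).
dummy_embeddings = [[0.0] * HIDDEN_SIZE for _ in range(4)]
payload = json.dumps({"instances": [build_embeds_instance(dummy_embeddings)]})
```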

How to Deploy and Use a SavedModel?

Let's see step by step how to deploy and use a BERT model for sentiment classification.

Step 1

Create a SavedModel. To create a SavedModel, the Transformers library lets you load a PyTorch model called nateraw/bert-base-uncased-imdb trained on the IMDB dataset and convert it to a TensorFlow Keras model for you:

from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained("nateraw/bert-base-uncased-imdb", from_pt=True)
model.save_pretrained("my_model", saved_model=True)

Step 2

Create a Docker container with the SavedModel and run it. First, pull the TensorFlow Serving Docker image for CPU (for GPU, replace serving with serving:latest-gpu):

docker pull tensorflow/serving

Next, run a serving image as a daemon named serving_base:

docker run -d --name serving_base tensorflow/serving

Copy the newly created SavedModel into the serving_base container's models folder:

docker cp my_model/saved_model serving_base:/models/bert

Commit the container that serves the model, setting MODEL_NAME to the name we want to give our SavedModel (here bert):

docker commit --change "ENV MODEL_NAME bert" serving_base my_bert_model

Then kill the serving_base container, which was run as a daemon, because we don't need it anymore:

docker kill serving_base

Finally, run the image as a daemon to serve our SavedModel, mapping ports 8501 (REST API) and 8500 (gRPC API) in the container to the host, and name the container bert:

docker run -d -p 8501:8501 -p 8500:8500 --name bert my_bert_model
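
Before sending predictions, it can help to confirm that the model has finished loading. TensorFlow Serving exposes a model status endpoint on the REST port; the sketch below queries it with the standard library. The helper names are hypothetical, and the response shape handled by `is_available` (a `model_version_status` list whose entries carry a `state` field) reflects my understanding of the API rather than anything stated in this article, so treat it as an assumption.

```python
import json
from urllib.request import urlopen


def model_status_url(model_name, host="localhost", port=8501):
    """Build the REST URL for the model status endpoint."""
    return f"http://{host}:{port}/v1/models/{model_name}"


def is_available(status):
    """Return True if any model version reports the AVAILABLE state.
    The response shape is an assumption; see the lead-in."""
    versions = status.get("model_version_status", [])
    return any(version.get("state") == "AVAILABLE" for version in versions)


def fetch_status(model_name, host="localhost", port=8501, timeout=5.0):
    """Fetch and decode the status document (requires a running server)."""
    with urlopen(model_status_url(model_name, host, port), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires the bert container started above to be running.
    print(is_available(fetch_status("bert")))
```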

Step 3

Query the model through the REST API:

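from transformers import BertTokenizerFast, BertConfig
import requests
import json
import numpy as np

sentence = "I love the new TensorFlow update in transformers."

# Load the tokenizer and the configuration of the fine-tuned model
tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")

# Tokenize the sentence and convert the result to a plain dict of lists
batch = dict(tokenizer(sentence))

# The REST API expects a list of instances, one dict per example
input_data = {"instances": [batch]}

# Query the REST API and fail early on an HTTP error
r = requests.post("http://localhost:8501/v1/models/bert:predict", data=json.dumps(input_data))
r.raise_for_status()

# Logits for the first (and only) instance
result = r.json()["predictions"][0]

# The predicted class is the index of the largest logit.
# Taking the absolute value first would let a large negative logit win.
label_id = int(np.argmax(result))

print(config.id2label[label_id])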

This should return POSITIVE. It is also possible to go through the gRPC API to get the same result:

from transformers import BertTokenizerFast, BertConfig
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import grpc

sentence = "I love the new TensorFlow update in transformers."

# Load the tokenizer and the configuration of the fine-tuned model
tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")

# Tokenize the sentence, this time returning TensorFlow tensors
batch = tokenizer(sentence, return_tensors="tf")

# Open a channel to the gRPC port of the container and create a stub
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Build the prediction request for the model named bert
request = predict_pb2.PredictRequest()
request.model_spec.name = "bert"
request.model_spec.signature_name = "serving_default"

# Fill in each input of the serving signature as a TensorProto
request.inputs["input_ids"].CopyFrom(tf.make_tensor_proto(batch["input_ids"]))
request.inputs["attention_mask"].CopyFrom(tf.make_tensor_proto(batch["attention_mask"]))
request.inputs["token_type_ids"].CopyFrom(tf.make_tensor_proto(batch["token_type_ids"]))

# Send the request and read back the logits as a flat list of floats
result = stub.Predict(request)
output = result.outputs["logits"].float_val

# The predicted class is the index of the largest logit
print(config.id2label[int(np.argmax(output))])
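
Both clients above reduce the returned logits to a single label. When a confidence score is also useful, or when several sentences are sent in one gRPC request, a little more postprocessing is needed: float_val is a flat list, so it has to be split into one row per example, and a softmax turns each row of logits into probabilities. The sketch below does this in plain Python. The helper names are hypothetical, and the NEGATIVE label in the usage example is an assumed counterpart to the POSITIVE label mentioned above.

```python
import math


def split_rows(flat_values, num_labels):
    """Split a flat list of logits into rows of num_labels values each."""
    if num_labels <= 0 or len(flat_values) % num_labels != 0:
        raise ValueError("flat_values length must be a multiple of num_labels")
    return [list(flat_values[i:i + num_labels])
            for i in range(0, len(flat_values), num_labels)]


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_labels(flat_values, id2label):
    """Return one (label, probability) pair per example."""
    results = []
    for row in split_rows(flat_values, len(id2label)):
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((id2label[best], probs[best]))
    return results


if __name__ == "__main__":
    # Illustrative logits for two examples; label names are assumptions.
    labels = {0: "NEGATIVE", 1: "POSITIVE"}
    print(predict_labels([-2.0, 2.0, 1.0, -1.0], labels))
```

With the REST client, each entry of "predictions" is already one row, so only the softmax step applies there.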

Conclusion

Thanks to the latest updates to the TensorFlow models in Transformers, you can now easily deploy your models in production using TensorFlow Serving. One of the next steps we are considering is to integrate the preprocessing directly into the SavedModel to make things even easier.


Source: https://huggingface.co/blog/tf-serving
