Faster TensorFlow models in Hugging Face Transformers
In recent months, the Hugging Face team has been working diligently to improve the performance of TensorFlow models in Transformers. The primary focus of these improvements has been on two key aspects: computational performance and TensorFlow Serving.
Computational Performance
To demonstrate the improvements in computational performance, we ran a benchmark comparing BERT served with TensorFlow Serving in Transformers v4.2.0 against the official implementation from Google. The benchmark was run on a V100 GPU with a sequence length of 128; all times are in milliseconds.
| Batch Size | Google Implementation | v4.2.0 Implementation | Relative Difference |
|---|---|---|---|
| 1 | 6.7 | 6.26 | 6.79% |
| 2 | 9.4 | 8.68 | 7.96% |
| 4 | 14.4 | 13.1 | 9.45% |
| 8 | 24 | 21.5 | 10.99% |
| 16 | 46.6 | 42.3 | 9.67% |
| 32 | 83.9 | 80.4 | 4.26% |
| 64 | 171.5 | 156 | 9.47% |
| 128 | 338.5 | 309 | 9.11% |
The current BERT implementation in v4.2.0 is faster than the Google implementation by up to roughly 10%. It is also twice as fast as the implementation in the 4.1.1 release.
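The Relative Difference column is consistent with a symmetric relative difference, that is, the gap between the two timings divided by their mean. The sketch below recomputes the column from the timings in the table. The formula is inferred from the numbers rather than stated by the benchmark, and the function name is illustrative.

```python
# Timings in milliseconds copied from the table above:
# batch size -> (Google implementation, v4.2.0 implementation)
TIMINGS_MS = {
    1: (6.7, 6.26),
    2: (9.4, 8.68),
    4: (14.4, 13.1),
    8: (24.0, 21.5),
    16: (46.6, 42.3),
    32: (83.9, 80.4),
    64: (171.5, 156.0),
    128: (338.5, 309.0),
}


def relative_difference(reference_ms, candidate_ms):
    """Symmetric relative difference in percent (assumed formula)."""
    mean = (reference_ms + candidate_ms) / 2
    return abs(reference_ms - candidate_ms) / mean * 100


for batch_size, (google_ms, hf_ms) in TIMINGS_MS.items():
    print(f"batch {batch_size:>3}: {relative_difference(google_ms, hf_ms):.2f}%")
```

Dividing by the mean instead of by one of the two timings keeps the figure independent of which implementation is treated as the baseline.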
TensorFlow Serving
The previous section demonstrated the significant improvement in computational performance of the BERT model in Transformers. In this section, we will walk through a step-by-step guide on how to deploy a BERT model with TensorFlow Serving to take advantage of this increased performance in a production environment.
What is TensorFlow Serving?
TensorFlow Serving is a tool provided by TensorFlow Extended (TFX) that simplifies the task of deploying a model to a server. TensorFlow Serving provides two APIs: one that can be called using HTTP requests and another one using gRPC to run inference on the server.
What is a SavedModel?
A SavedModel is a standalone TensorFlow model that includes its weights and architecture. It does not require the original source code of the model to be run, making it useful for sharing or deploying with any backend that supports reading a SavedModel, such as Java, Go, C++, or JavaScript, among others.
The internal structure of a SavedModel is represented as follows:
savedmodel
    /assets
        -> any assets needed by the model
    /variables
        -> the model checkpoints that contain the weights
    saved_model.pb -> protobuf file representing the model graph
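As a quick sanity check before deploying, the layout above can be verified with the standard library alone. This is a minimal sketch: the helper name `looks_like_saved_model` and the example path are hypothetical, and the check only looks at the directory structure, not at whether the graph itself is valid.

```python
from pathlib import Path


def looks_like_saved_model(directory):
    """Return True if `directory` has the SavedModel layout described above."""
    root = Path(directory)
    has_graph = (root / "saved_model.pb").is_file()
    has_variables = (root / "variables").is_dir()
    # The assets folder is optional, so it is deliberately not checked.
    return has_graph and has_variables


# Hypothetical path; replace it with the folder that contains saved_model.pb.
print(looks_like_saved_model("path/to/savedmodel"))
```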
How to Install TensorFlow Serving?
There are three ways to install and use TensorFlow Serving:
- through a Docker container,
- through an apt package,
- or using pip.
To keep things simple and work on any operating system, we will use Docker in this tutorial.
How to Create a SavedModel?
SavedModel is the format expected by TensorFlow Serving. Since Transformers v4.2.0, creating a SavedModel has three additional features:
- The sequence length can be modified freely between runs.
- All model inputs are available for inference.
- Hidden states or attentions are now grouped into a single output when returning them with output_hidden_states=True or output_attentions=True.
Below are the input and output representations of a TFBertForSequenceClassification model saved as a TensorFlow SavedModel:
The given SavedModel SignatureDef contains the following input(s):
inputs['attention_mask'] tensor_info:
dtype: DT_INT32
shape: (-1, -1)
name: serving_default_attention_mask:0
inputs['input_ids'] tensor_info:
dtype: DT_INT32
shape: (-1, -1)
name: serving_default_input_ids:0
inputs['token_type_ids'] tensor_info:
dtype: DT_INT32
shape: (-1, -1)
name: serving_default_token_type_ids:0
The given SavedModel SignatureDef contains the following output(s):
outputs['attentions'] tensor_info:
dtype: DT_FLOAT
shape: (12, -1, 12, -1, -1)
name: StatefulPartitionedCall:0
outputs['logits'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 2)
name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
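A listing in this format is what TensorFlow's `saved_model_cli show` command prints for a signature. The sketch below only builds the command line, so it runs without TensorFlow installed. The helper name and the directory path are hypothetical; point `--dir` at the folder that contains saved_model.pb.

```python
import subprocess


def show_signature_command(saved_model_dir, signature="serving_default"):
    """Build the saved_model_cli call that prints a signature's inputs and outputs."""
    return [
        "saved_model_cli", "show",
        "--dir", saved_model_dir,
        "--tag_set", "serve",
        "--signature_def", signature,
    ]


# Hypothetical location of an exported SavedModel.
command = show_signature_command("path/to/saved_model/1")
print(" ".join(command))

# Uncomment to actually run it; this requires TensorFlow to be installed.
# subprocess.run(command, check=True)
```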
To directly pass inputs_embeds (the token embeddings) instead of input_ids (the token IDs) as input, we need to subclass the model to have a new serving signature. The following snippet of code shows how to do so:
from transformers import TFBertForSequenceClassification
import tensorflow as tf


class MyOwnModel(TFBertForSequenceClassification):
    # Replace the default serving signature: accept precomputed embeddings
    # (hidden size 768) instead of token IDs.
    @tf.function(input_signature=[{
        "inputs_embeds": tf.TensorSpec((None, None, 768), tf.float32, name="inputs_embeds"),
        "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
        "token_type_ids": tf.TensorSpec((None, None), tf.int32, name="token_type_ids"),
    }])
    def serving(self, inputs):
        output = self.call(inputs)
        return self.serving_output(output)


model = MyOwnModel.from_pretrained("bert-base-cased")
# saved_model=True also exports a TensorFlow SavedModel.
model.save_pretrained("my_model", saved_model=True)
The serving method must be overridden and decorated with tf.function using the new input_signature argument. See the official TensorFlow documentation to learn more about input_signature. The serving method defines how a SavedModel behaves when deployed with TensorFlow Serving.
The SavedModel now looks as expected; note the new inputs_embeds input:
The given SavedModel SignatureDef contains the following input(s):
inputs['attention_mask'] tensor_info:
dtype: DT_INT32
shape: (-1, -1)
name: serving_default_attention_mask:0
inputs['inputs_embeds'] tensor_info:
dtype: DT_FLOAT
shape: (-1, -1, 768)
name: serving_default_inputs_embeds:0
inputs['token_type_ids'] tensor_info:
dtype: DT_INT32
shape: (-1, -1)
name: serving_default_token_type_ids:0
The given SavedModel SignatureDef contains the following output(s):
outputs['attentions'] tensor_info:
dtype: DT_FLOAT
shape: (12, -1, 12, -1, -1)
name: StatefulPartitionedCall:0
outputs['logits'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 2)
name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
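A request to a model exported with this signature would carry embeddings instead of token IDs. The sketch below builds such a request body in the row-oriented "instances" JSON format accepted by TensorFlow Serving's REST predict API. The helper name is hypothetical and the embedding values are random placeholders, not real token embeddings; in practice they would come from whatever component produces the embeddings.

```python
import json
import random

HIDDEN_SIZE = 768  # matches the (-1, -1, 768) shape in the signature above


def build_embeds_instance(sequence_length, hidden_size=HIDDEN_SIZE, seed=0):
    """Build one row-format instance for the inputs_embeds signature.

    The embedding values are random placeholders for illustration only.
    """
    rng = random.Random(seed)
    return {
        "inputs_embeds": [
            [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]
            for _ in range(sequence_length)
        ],
        "attention_mask": [1] * sequence_length,
        "token_type_ids": [0] * sequence_length,
    }


# One instance per example in the batch; here a single 4-token example.
payload = {"instances": [build_embeds_instance(sequence_length=4)]}
body = json.dumps(payload)
print(len(payload["instances"][0]["inputs_embeds"]), "tokens,", HIDDEN_SIZE, "values each")
```

Each key in an instance matches an input name from the signature, which is why the names given in the TensorSpec entries matter.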
How to Deploy and Use a SavedModel?
Let's see step by step how to deploy and use a BERT model for sentiment classification.
Step 1
Create a SavedModel. Transformers can load the PyTorch model nateraw/bert-base-uncased-imdb, which was trained on the IMDB dataset, and convert it to a TensorFlow Keras model for you:
from transformers import TFBertForSequenceClassification
model = TFBertForSequenceClassification.from_pretrained("nateraw/bert-base-uncased-imdb", from_pt=True)
model.save_pretrained("my_model", saved_model=True)
Step 2
Create a Docker container with the SavedModel and run it. First, pull the TensorFlow Serving Docker image for CPU (for GPU, replace serving with serving:latest-gpu):
docker pull tensorflow/serving
Next, run a serving image as a daemon named serving_base:
docker run -d --name serving_base tensorflow/serving
Copy the newly created SavedModel into the serving_base container's models folder:
docker cp my_model/saved_model serving_base:/models/bert
Commit the container that serves the model, setting MODEL_NAME to the name we want to give our SavedModel (here, bert):
docker commit --change "ENV MODEL_NAME bert" serving_base my_bert_model
Then kill the serving_base container, which was running as a daemon, because we no longer need it:
docker kill serving_base
Finally, run the image as a daemon to serve our SavedModel, mapping ports 8501 (REST API) and 8500 (gRPC API) in the container to the host, and name the container bert:
docker run -d -p 8501:8501 -p 8500:8500 --name bert my_bert_model
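Before sending predictions, it can help to confirm that the model has finished loading. TensorFlow Serving's REST API exposes a model status endpoint at /v1/models/<name>. The sketch below uses only the standard library; the helper names are hypothetical, and the sample response is illustrative rather than captured from a real server.

```python
import json
import urllib.request


def status_url(model_name, host="localhost", rest_port=8501):
    """URL of TensorFlow Serving's model status endpoint."""
    return f"http://{host}:{rest_port}/v1/models/{model_name}"


def is_available(status):
    """True if any version in a parsed status response is in the AVAILABLE state."""
    versions = status.get("model_version_status", [])
    return any(version.get("state") == "AVAILABLE" for version in versions)


def check_model(model_name, timeout=5.0):
    """Query the running container; needs the server started in the step above."""
    with urllib.request.urlopen(status_url(model_name), timeout=timeout) as response:
        return is_available(json.load(response))


# Illustrative response shape, not captured from a real server.
sample = {"model_version_status": [{"version": "1", "state": "AVAILABLE"}]}
print(status_url("bert"), is_available(sample))
```

Calling `check_model("bert")` once the container is running performs the real request against port 8501.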
Step 3
Query the model through the REST API:
from transformers import BertTokenizerFast, BertConfig
import requests
import json
import numpy as np
sentence = "I love the new TensorFlow update in transformers."
tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")
batch = tokenizer(sentence)
batch = dict(batch)
batch = [batch]
input_data = {"instances": batch}
r = requests.post("http://localhost:8501/v1/models/bert:predict", data=json.dumps(input_data))
result = json.loads(r.text)["predictions"][0]
abs_scores = np.abs(result)
label_id = np.argmax(abs_scores)
print(config.id2label[label_id])
This should return POSITIVE. It is also possible to use the gRPC (Google Remote Procedure Call) API to get the same result:
from transformers import BertTokenizerFast, BertConfig
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import grpc
sentence = "I love the new TensorFlow update in transformers."
tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")
batch = tokenizer(sentence, return_tensors="tf")
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = "bert"
request.model_spec.signature_name = "serving_default"
request.inputs["input_ids"].CopyFrom(tf.make_tensor_proto(batch["input_ids"]))
request.inputs["attention_mask"].CopyFrom(tf.make_tensor_proto(batch["attention_mask"]))
request.inputs["token_type_ids"].CopyFrom(tf.make_tensor_proto(batch["token_type_ids"]))
result = stub.Predict(request)
output = result.outputs["logits"].float_val
print(config.id2label[np.argmax(np.abs(output))])
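The float_val field of a tensor in a gRPC response holds its values as one flat sequence, so a response for several sentences has to be split back into one row of logits per sentence. The sketch below shows one way to do that and to turn each row into probabilities with a softmax. This is an alternative post-processing step under stated assumptions, not the scoring used in the snippets above; the helper names and the logit values are made up for illustration.

```python
import math


def rows_from_flat(flat_values, num_labels):
    """Split a flat list of logits into one row per example."""
    if len(flat_values) % num_labels != 0:
        raise ValueError("flat_values length must be a multiple of num_labels")
    return [
        list(flat_values[start:start + num_labels])
        for start in range(0, len(flat_values), num_labels)
    ]


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up logits for two examples with two labels each.
flat = [-1.5, 2.0, 0.3, -0.7]
for row in rows_from_flat(flat, num_labels=2):
    probabilities = softmax(row)
    best = probabilities.index(max(probabilities))
    print(best, [round(p, 3) for p in probabilities])
```

With the response from the gRPC example, `flat` would be `list(result.outputs["logits"].float_val)` and `num_labels` would be 2, matching the (-1, 2) logits shape in the signature.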
Conclusion
Thanks to the latest updates to the TensorFlow models in Transformers, you can now easily deploy your models in production using TensorFlow Serving. One of the next steps we are considering is to integrate the preprocessing directly into the SavedModel to make things even easier.




