Why we’re switching to Hugging Face Inference Endpoints, and maybe you should too

Introduction

As a company that specializes in Natural Language Processing (NLP) solutions, we are always on the lookout for ways to improve our deployment process and reduce the time spent on managing our models. Recently, we discovered Hugging Face Inference Endpoints, a managed service that allows us to deploy transformer models in production with ease. In this article, we will discuss our experience with Hugging Face Inference Endpoints, its benefits, and why we think it's a great solution for companies like ours.

Our Previous Deployment Process

Before switching to Hugging Face Inference Endpoints, we were using Amazon Elastic Container Service (ECS) backed by AWS Fargate. Our process involved training our models on a GPU instance, uploading them to Hugging Face Hub, building an API to serve the model using FastAPI, wrapping the API in a container using Docker, uploading the container to Amazon Elastic Container Registry (ECR), and finally deploying the model to an ECS cluster.

While this process worked for us, it had its limitations. Managing our own ECS cluster and containers required a significant amount of time and effort, which could have been better spent on developing new NLP solutions. Moreover, our deployment process was not scalable, and as our model inventory grew, it became increasingly difficult to manage.

Hugging Face Inference Endpoints

Hugging Face Inference Endpoints is a managed service for deploying transformer models to production. With it, we can deploy our models on AWS, Azure, or GCP, on a range of instance types (including GPU), and with minimal configuration.

Our deployment process with Hugging Face Inference Endpoints is significantly simpler than our previous process. We train our models on a GPU instance, upload them to Hugging Face Hub, and then deploy them using Hugging Face Inference Endpoints. This process eliminates the need for managing our own ECS cluster and containers, freeing up time for us to focus on developing new NLP solutions.

Latency and Stability

Before switching to Hugging Face Inference Endpoints, we load-tested different CPU endpoint types using ApacheBench (ab). On the Large instance size, the vanilla Hugging Face container was more than twice as fast as our bespoke container running on ECS (about 80 ms versus roughly 200 ms). The slowest response we received from the Large Inference Endpoint was just 108 ms.

Here is a summary of our test results:

Endpoint Type	vCPU	Memory (GB)	ECS (ms)	Hugging Face (ms)
Small	1	2	_	296
Medium	2	4	_	156 ± 51
Large	4	8	~200	80 ± 30
X-Large	8	16	_	43 ± 31
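As a rough illustration of how "mean ± deviation" figures like those in the table can be derived from raw request timings, here is a minimal Python sketch. The sample latencies are made up for the example (they are not our measurements), and the helper name summarize_latencies is hypothetical. The speedup at the end uses the approximate Large-size means from the table.

```python
import statistics
from typing import Dict, List


def summarize_latencies(samples_ms: List[float]) -> Dict[str, float]:
    """Return mean, sample standard deviation, and worst case in milliseconds."""
    return {
        "mean": statistics.mean(samples_ms),
        "stdev": statistics.stdev(samples_ms),
        "max": max(samples_ms),
    }


# Hypothetical per-request latencies in ms (not measurements from this article).
samples = [70.0, 80.0, 90.0]
summary = summarize_latencies(samples)
print(f"{summary['mean']:.0f} ± {summary['stdev']:.0f} ms, slowest {summary['max']:.0f} ms")

# Speedup on the Large size, using the approximate means from the table above.
ecs_large_ms = 200.0
hf_large_ms = 80.0
speedup = ecs_large_ms / hf_large_ms
print(f"Hugging Face container is about {speedup:.1f}x faster")
```

Whether a benchmarking tool reports the sample or the population standard deviation varies, so treat the choice of statistics.stdev here as an assumption.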

Cost

Hugging Face Inference Endpoints are more expensive than our previous ECS approach at every instance size we compared. However, for us the additional cost is small, and we believe that the benefits of using Hugging Face Inference Endpoints far outweigh it.

Here is a summary of our cost comparison:

Endpoint Type	vCPU	Memory (GB)	ECS	Hugging Face	Diff (fraction)
Small	1	2	$33.18	$43.80	0.24
Medium	2	4	$60.38	$87.61	0.31
Large	4	8	$114.78	$175.22	0.34
X-Large	8	16	$223.59	$350.44	0.5
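The difference column can be checked with a few lines of Python. This sketch assumes the difference is expressed as a fraction of the Hugging Face price, that is (Hugging Face - ECS) / Hugging Face. That formula is our reading of the numbers and is not stated in the table itself. The function name diff_fraction is hypothetical.

```python
from typing import Dict, Tuple

# (ECS price, Hugging Face price) in dollars, copied from the table above.
prices: Dict[str, Tuple[float, float]] = {
    "Small": (33.18, 43.80),
    "Medium": (60.38, 87.61),
    "Large": (114.78, 175.22),
    "X-Large": (223.59, 350.44),
}


def diff_fraction(ecs: float, hugging_face: float) -> float:
    """Price difference as a fraction of the Hugging Face price (assumed formula)."""
    return (hugging_face - ecs) / hugging_face


# Round to two decimals to match the precision used in the table.
computed = {size: round(diff_fraction(ecs, hf), 2) for size, (ecs, hf) in prices.items()}
for size, value in computed.items():
    print(f"{size}: {value}")
```

Under this assumption the Small, Medium, and Large rows reproduce the listed values. The X-Large row comes out closer to 0.36 than to the listed 0.5, so that figure may have been rounded differently or computed another way.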

Deployment Options

Hugging Face Inference Endpoints can be deployed through the GUI, through a RESTful API, or from the command line with the hugie tool. We run hugie in a GitHub Action to deploy our Inference Endpoints automatically.
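A deployment step of this kind could be scripted roughly as follows. This is a sketch under assumptions: the config fields mirror the simplified example JSON at the end of this article and may not match the real hugie schema, and the helper names (write_endpoint_config, build_hugie_command, deploy) are hypothetical. Only the "hugie endpoint create <file>" command form is taken from this article.

```python
import json
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Dict, List


def write_endpoint_config(path: Path, config: Dict[str, str]) -> Path:
    """Write the endpoint config as JSON, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4))
    return path


def build_hugie_command(config_path: Path) -> List[str]:
    """Build the CLI call in the form shown in this article."""
    return ["hugie", "endpoint", "create", str(config_path)]


def deploy(config_path: Path) -> None:
    """Run hugie; raises if the tool is missing or exits with an error."""
    if shutil.which("hugie") is None:
        raise RuntimeError("hugie is not installed or not on PATH")
    subprocess.run(build_hugie_command(config_path), check=True)


# Simplified, illustrative config (field names are assumptions).
config = {
    "model_name": "bert-base-uncased",
    "model_version": "main",
    "endpoint_type": "cpu",
    "instance_type": "small",
}
config_path = write_endpoint_config(
    Path(tempfile.mkdtemp()) / "example" / "development.json", config
)
command = build_hugie_command(config_path)
print(" ".join(command))  # deploy(config_path) would actually run it
```

Keeping the command construction separate from the call to subprocess.run makes the step easy to dry-run in CI before it is allowed to create real endpoints.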

Hosting Multiple Models on a Single Endpoint

Hugging Face Inference Endpoints also lets us host multiple models on a single endpoint, which can help reduce costs and improve scalability. This is done with a custom Endpoint Handler class.
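To make the idea concrete, here is a minimal sketch of what such a handler might look like. It follows the custom handler convention of a class named EndpointHandler with __init__ and __call__ methods. Everything else is an assumption for illustration: the two stand-in "models" are plain Python functions so the example runs without extra dependencies, and the model_id request field used for routing is hypothetical. A real handler would load actual models in __init__.

```python
from typing import Any, Callable, Dict, List


# Stand-ins for real models so the sketch runs without extra dependencies.
def fake_sentiment_model(text: str) -> Dict[str, Any]:
    label = "POSITIVE" if "good" in text.lower() else "NEGATIVE"
    return {"label": label}


def fake_length_model(text: str) -> Dict[str, Any]:
    return {"num_tokens": len(text.split())}


class EndpointHandler:
    def __init__(self, path: str = "") -> None:
        # `path` is unused here; a real handler would load model files from it.
        self.models: Dict[str, Callable[[str], Dict[str, Any]]] = {
            "sentiment": fake_sentiment_model,
            "length": fake_length_model,
        }

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        inputs = data.get("inputs", "")
        # Hypothetical routing field: the caller names the model to use.
        model_id = data.get("model_id", "sentiment")
        if model_id not in self.models:
            raise ValueError(f"Unknown model_id: {model_id}")
        return [self.models[model_id](inputs)]


handler = EndpointHandler()
print(handler({"inputs": "a good day", "model_id": "sentiment"}))
print(handler({"inputs": "one two three", "model_id": "length"}))
```

Routing on a field in the request body keeps a single URL for all models. The trade-off is that every model shares the same instance's memory and compute.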

Conclusion

In conclusion, we believe that Hugging Face Inference Endpoints is a great solution for companies like ours that specialize in NLP solutions. It simplifies our deployment process, reduces the time spent on managing our models, and improves scalability. While it may be more expensive than our previous ECS approach, we believe that the benefits far outweigh the costs.

If you're interested in learning more about Hugging Face Inference Endpoints or would like to discuss how it can help your company, please don't hesitate to contact us.

Code Blocks

import requests

# Placeholder values: the URL and the payload field names below are
# illustrative. Check the Inference Endpoints API reference for the
# real route and request schema before running this.
api_endpoint = "https://api.huggingface.co/models"
api_key = "YOUR_API_KEY"

# Model to deploy and the revision (branch) to use
model_name = "bert-base-uncased"
model_version = "main"

# Hardware for the endpoint
endpoint_type = "cpu"
instance_type = "small"

# Authenticate with a bearer token and send JSON
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

# Request body describing the endpoint to create
body = {
    "model_name": model_name,
    "model_version": model_version,
    "endpoint_type": endpoint_type,
    "instance_type": instance_type
}

# Send the request; the timeout avoids hanging on a stalled connection
response = requests.post(api_endpoint, headers=headers, json=body, timeout=30)

# 201 Created means the endpoint was accepted; otherwise show what went wrong
if response.status_code == 201:
    print("Inference Endpoint created successfully")
else:
    print(f"Error creating Inference Endpoint: {response.status_code} {response.text}")

# Create Inference Endpoint using hugie command-line tool
hugie endpoint create example/development.json

{
    "model_name": "bert-base-uncased",
    "model_version": "main",
    "endpoint_type": "cpu",
    "instance_type": "small"
}

Note: Replace YOUR_API_KEY with your actual API key.

Source: https://huggingface.co/blog/mantis-case-study
