Introducing the Synthetic Data Generator - Build Datasets with Natural Language
Introduction
The Synthetic Data Generator is a user-friendly application that takes a no-code approach to creating custom datasets with Large Language Models (LLMs). This innovative tool allows anyone to create datasets and models in minutes without requiring any coding knowledge.
What is Synthetic Data?
Synthetic data is artificially generated information that mimics real-world data. It allows overcoming data limitations by expanding or enhancing datasets. With the Synthetic Data Generator, users can create high-quality synthetic data that can be used for various applications, such as training machine learning models, testing, and validation.
From Prompt to Dataset to Model
The Synthetic Data Generator takes a description of the data you want (your custom prompt) and returns a dataset for your use case, using a synthetic data pipeline. In the background, this is powered by distilabel and the free Hugging Face text-generation API, but we don’t need to worry about these complexities and can focus on using the UI.
Supported Tasks
The tool currently supports text classification and chat datasets. These tasks will determine the type of dataset you will generate, classification requires categories, while chat data requires a conversation. Based on demand, we will add tasks like evaluation and RAG over time.
Text Classification
Text classification is common for categorizing text like customer reviews, social media posts, or news articles. Generating a classification dataset relies on two different steps that we address with LLMs. We first generate diverse texts, and then we add labels to them. A good example of a synthetic text classification dataset is argilla/synthetic-text-classification-news, which classifies synthetic news articles into 8 different classes.
Chat Datasets
This type of dataset can be used for supervised fine-tuning (SFT), which is the technique that allows LLMs to work with conversational data, allowing the user to interact with LLMs via a chat interface. A good example of a synthetic chat dataset is argilla/synthetic-sft-customer-support-single-turn, which highlights an example of an LLM designed to handle customer support. In this example, the customer support topic is the synthetic data generator itself.
Generating Datasets
To generate a dataset, you need to follow these steps:
- Describe Your Dataset: Start by providing a description of the dataset you want to create, including example use cases to help the generator understand your needs. Make sure to describe the goal and type of assistant in as much detail as possible.
- Configure and Refine: Refine your generated sample dataset by adjusting the system prompt, which has been generated based on your description and by adjusting the task-specific settings. This will help you get to the specific results you’re after.
- Generate and Push: Fill out general information about the dataset name and organisation. Additionally, you can define the number of samples to generate and the temperature to use for the generation. This temperature represents the creativity of the generations.
Reviewing the Dataset
Even when dealing with synthetic data, it is essential to understand and look at your data, which is why we created a direct integration with Argilla, a collaboration tool for AI engineers and domain experts to build high-quality datasets. This allows you to effectively explore and evaluate the synthetic dataset through powerful features like semantic search and composable filters.
Training a Model
Don’t worry; even creating powerful AI models can be done without code nowadays using AutoTrain. To understand AutoTrain, you can look at its documentation. Here, we will create our own AutoTrain deployment and log in as we’ve done before for the synthetic data generator.
Advanced Features
Even though you can go from prompts to dedicated models without knowing anything about coding, some people might like the option to customize and scale their deployment with some more advanced technical features.
Improving Speed and Accuracy
You can improve speed and accuracy by creating your own deployment of the tool and configuring it to use different parameters or models. First, you must duplicate the synthetic data generator. Make sure you create it as a private Space to ensure nobody else can access it. Next, you can change the default values of some environment variables.
Local Deployment
Besides hosting the tool on Hugging Face Spaces, we also offer it as an open-source tool under an Apache 2 license, which means you can go to GitHub and use, modify, and adapt it however you need. You can install it as a Python package through a simple pip install synthetic-dataset-generator. Make sure to configure the right environment variables when creating your deployment.
Customising Pipelines
Each synthetic data pipeline is based on distilabel, the framework for synthetic data and AI feedback. distilabel is open source; the cool thing about the pipeline code is that it is sharable and reproducible. You can, for example, find the pipeline for the argilla/synthetic-text-classification-news dataset within the repository on the Hub. Alternatively, you can find many other distilabel datasets along with their pipelines.
What’s Next?
The Synthetic Data Generator already offers many cool features that make it useful for any data or model lover. Still, we have some interesting directions for improvements on our GitHub, and we invite you to contribute, leave a star, and open issues too! Some things we are working on are:
- Retrieval Augmented Generation (RAG)
- Custom evals with LLMs as a Judge
- Start synthesizing
Conclusion
The Synthetic Data Generator is a powerful tool that allows users to create high-quality synthetic data without requiring any coding knowledge. With its user-friendly interface and advanced features, it is an essential tool for anyone working with machine learning models, testing, and validation. We invite you to try it out and explore its capabilities.
Code Examples
import huggingface_hub
from huggingface_hub import HfApi
# Create a new dataset
dataset_name = "my_new_dataset"
dataset_description = "This is a new dataset created with the Synthetic Data Generator"
# Create a new HfApi instance
api = HfApi()
# Create a new dataset
dataset = api.create_dataset(
dataset_name,
dataset_description,
"text-classification",
"https://example.com/dataset-description"
)
# Get the dataset ID
dataset_id = dataset["id"]
# Print the dataset ID
print(f"Dataset ID: {dataset_id}")
import huggingface_hub
from huggingface_hub import HfApi
# Get a list of all datasets
datasets = api.list_datasets()
# Print the list of datasets
for dataset in datasets:
print(f"Dataset ID: {dataset['id']}")
print(f"Dataset Name: {dataset['name']}")
print(f"Dataset Description: {dataset['description']}")
print(f"Dataset Type: {dataset['type']}")
print(f"Dataset URL: {dataset['url']}")
print(f"Dataset Tags: {dataset['tags']}")
print(f"Dataset Created At: {dataset['created_at']}")
print(f"Dataset Updated At: {dataset['updated_at']}")
print(f"Dataset Size: {dataset['size']}")
print(f"Dataset Language: {dataset['language']}")
print(f"Dataset License: {dataset['license']}")
print(f"Dataset Maintainers: {dataset['maintainers']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Names: {dataset['maintainer_names']}")
print(f"Dataset Maintainer URLs: {dataset['maintainer_urls']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']}")
print(f"Dataset Maintainer Titles: {dataset['maintainer_titles']}")
print(f"Dataset Maintainer Emails: {dataset['maintainer_emails']}")
print(f"Dataset Maintainer Avatars: {dataset['maintainer_avatars']}")
print(f"Dataset Maintainer Status: {dataset['maintainer_status']}")
print(f"Dataset Maintainer Roles: {dataset['maintainer_roles']}")
print(f"Dataset Maintainer Bio: {dataset['maintainer_bio']}")
print(f"Dataset Maintainer Organizations: {dataset['maintainer_organizations']
---
*Source: [https://huggingface.co/blog/synthetic-data-generator](https://huggingface.co/blog/synthetic-data-generator)*




