Pioneer Overview
What is Pioneer?
👷♀️ What can I build with Pioneer?
Currently, Pioneer offers fine-tuning for GLiNER2. This means you can fine-tune your own models for:
Named Entity Recognition: Identify and extract entities (e.g. people, places, companies, dates) from text.
Structured data extraction: Retrieve specific information (e.g. medication, dosage, symptoms) from unstructured text in a neat JSON format.
Relation extraction: Identify and label semantic relationships between entities (e.g. founder of, works for, works with).
Text classification (single- and multi-label): Analyze and assign text to one or more predefined categories (e.g. positive/negative sentiment, spam/not spam).
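For instance, structured data extraction turns free text into a flat JSON record. The field names below are hand-written for illustration (Pioneer's actual output schema may differ):

```python
import json

# Free text in, a JSON record out. The fields ("medication", "dosage",
# "symptoms") are illustrative, not Pioneer's documented schema.
note = "Patient reports a headache; prescribed ibuprofen 400 mg."

# What a structured-extraction model would aim to produce for `note`:
extracted = {
    "medication": "ibuprofen",
    "dosage": "400 mg",
    "symptoms": ["headache"],
}

record = json.dumps(extracted, sort_keys=True)
print(record)
```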
💫 Example projects for inspiration
| Task | Projects |
|---|---|
| Text classification | Prompt hack detection, user intent classification, agentic model routing |
| NER | PII redaction for domain-specific data (e.g. financial, medical, security logs) |
| Relation extraction | Knowledge graph construction from research papers, genomic relation extraction |
| Structured data extraction | Receipt/invoice extraction, reasoning traces extraction, geopolitical event tracking |
🏁 Getting started
Here are some steps to get you started fine-tuning your first model.
Create a dataset. Click on the “Data” tab and click “Create Dataset”. You can create a dataset several ways:
Synthesize: Generate synthetic training data based only on a natural language prompt.
Upload: Upload an existing (labeled) dataset in CSV, JSON, or JSONL format.
Auto Relabel: Upload an existing (labeled) dataset and automatically relabel it with new categories. This is best when you have existing labeled data but want to change the label schema.
Manual Relabel: Upload an existing (labeled) dataset and manually reassign labels. This is best for fixing noisy labels.
External: Load a dataset from the HuggingFace Hub.
Modify an existing dataset. Click on “Create Dataset” and then click on “Grow”.
Grow: Expand an existing labeled dataset with more synthetic examples. This is best when you already have high-quality labeled data but need more of it to train a model.
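Pioneer's exact upload schema isn't spelled out here, but a common layout for labeled NER data in JSONL is one example per line with character-offset spans. A sketch assuming that convention:

```python
import json

# One example per line; "start"/"end" are character offsets into "text".
# This span layout is a common convention, not Pioneer's documented schema.
examples = [
    {
        "text": "Send the invoice to Acme Corp by Friday.",
        "entities": [
            {"start": 20, "end": 29, "label": "company"},
            {"start": 33, "end": 39, "label": "date"},
        ],
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Sanity check: each span should slice out the intended surface text.
ex = examples[0]
assert ex["text"][20:29] == "Acme Corp"
assert ex["text"][33:39] == "Friday"
```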
Review seed data (if generating synthetic data). The first step in generating synthetic data is creating a small number of seed examples from the prompt you entered. You’ll be asked to approve (or change) the inferred labels for these examples before the full dataset is generated. To add or modify labels for NER data, select the desired span, then right-click to choose a label.
Vibe-check your dataset. There are several ways to assess the quality of your dataset before training.
Check the Vendi score: In the Data Overview section, you can check the Vendi score, a diversity metric, for your dataset. Data that is not diverse enough can lead to overfitting and memorization, while data that is too diverse may produce a model that fails to converge or generalizes poorly due to high noise.
Check for duplicates: The dataset overview will also have a section that identifies duplicate data examples and allows you to edit or delete them.
Edit the dataset: You can edit your dataset by clicking on the “Editor” button.
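Intuitively, the Vendi score is the "effective number of distinct examples" in a dataset: the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix. Pioneer computes it for you, but a minimal numpy sketch shows the idea (the similarity kernel here is left abstract):

```python
import numpy as np

def vendi_score(K: np.ndarray) -> float:
    """Vendi score of a similarity matrix K (PSD, K[i, i] = 1):
    exp of the Shannon entropy of the eigenvalues of K / n."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

# Four identical examples -> effective dataset size ~1;
# four completely dissimilar examples -> effective size ~4.
print(vendi_score(np.ones((4, 4))))  # ~1.0
print(vendi_score(np.eye(4)))        # ~4.0
```

A score far below the dataset size signals near-duplicates; a score near the dataset size with noisy labels can signal the "too diverse" failure mode described above.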
Start fine-tuning. Once you’re confident with your dataset, start fine-tuning your model by clicking on the “Train Model” tab!
Vibe-check your fine-tuned model. There are several ways to get a quick overview of your fine-tuned model.
Check the training logs: As the model trains, the training loss and learning rate should decrease steadily, and the validation loss should decrease and eventually plateau.
Run inference on the models: Click on the “Inference” tab to vibe-check your model on unseen examples.
Run proper evals on an unseen evaluation dataset. Our evals let you benchmark your newly fine-tuned model against GLiNER Base and other (larger) models, as well as against custom validation datasets or industry-standard benchmarks. Under the “Evals” tab, select the model you want to evaluate.
Compare against baseline models: Select all the models you wish to compare performance against under “Baseline Models”.
Run on evaluation benchmarks: Select all the datasets you wish to evaluate your model against. We recommend against evaluating on your training dataset. To create a new validation dataset, create a new dataset like you did in Step #1.
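As a reference point for reading NER eval numbers: the standard strict metric counts a predicted span as correct only on an exact boundary-and-label match. A minimal sketch of that computation (Pioneer's Evals tab may compute variants of it):

```python
def span_f1(predicted, gold):
    """Strict span-level precision/recall/F1.

    Each argument is a set of (start, end, label) tuples; a prediction
    counts only on an exact boundary-and-label match.
    """
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# One of two predictions matches one of two gold spans exactly.
p, r, f = span_f1({(0, 4, "person"), (10, 16, "city")},
                  {(0, 4, "person"), (20, 26, "date")})
print(p, r, f)  # 0.5 0.5 0.5
```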
🪄 Tips and tricks
Give specific definitions to individual classes, entities, relationships, and data you want to extract when generating synthetic data
Upload real-world datasets to use as seed examples for generating synthetic data
Explore and edit your data directly in the platform with the dataset viewer
Ensure training data is relatively balanced, including “edge cases” and examples where the target class labels or entities are not present
Ask the agent for help if you get stuck or need some ideas!
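The balance tip above can be checked mechanically before training: count how often each label appears, treating entity-free "negative" examples as their own bucket. A sketch assuming a simple list-of-dicts layout (not Pioneer's internal format):

```python
from collections import Counter

# Toy dataset; the "entities" layout is illustrative only.
examples = [
    {"text": "Ada met Babbage on Monday.",
     "entities": [{"label": "person"}, {"label": "person"}, {"label": "date"}]},
    {"text": "See you on Friday.",
     "entities": [{"label": "date"}]},
    {"text": "Nothing to extract here.",
     "entities": []},  # negative example: no target entities present
]

counts = Counter()
for ex in examples:
    # Count each entity label; entity-free examples get their own bucket
    # so you can verify edge cases are represented too.
    labels = [e["label"] for e in ex["entities"]] or ["<no entities>"]
    counts.update(labels)

print(counts.most_common())
```

A heavily skewed distribution here is a hint to grow or rebalance the dataset before training.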
🐛 Reporting bugs and feedback
We love feedback! Use the feedback button on the platform to report bugs or share suggestions.
❓Still have questions?
Full documentation is available here
We’re happy to help! Please post your questions about using Pioneer in the group channel here so that as many people as possible can see them. 🙂 If you have questions about using Pioneer for a specific business use case, feel free to reach out to George.
Join our Discord Community