SEEDRA

SEEDRA is a data augmentation AI designed to overcome the challenges of building diverse training datasets.

MODEL DOWNLOAD

Model: devJy/SEEDRA-zero-small
Training: 1500 steps on an A100, loss 0.518100

What is SEEDRA?

SEEDRA is an AI for data augmentation. It expands datasets effectively by varying sentence structure, reordering phrases, substituting synonyms, and more.

Why use SEEDRA?

Reducing Financial Barriers to Entry

Collecting naturally occurring data is generally out of reach for individuals and small companies, so most teams generate data with a model such as GPT or DeepSeek. Building 3,000 unique data pairs this way typically costs around $20 on average (varying with the token volume of the actual dataset).

Assuming you need at least 100,000 data points to fine-tune a small model for a proof of concept, that works out to roughly $700 (about 33 batches of 3,000 pairs at $20 each). As the dataset grows, duplicated entries become more frequent, so actual costs can run even higher.

Moreover, for heterogeneous or highly specialized domains, where you may need 300,000 data points or more, costs scale up accordingly.
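
As a rough back-of-the-envelope check of these figures (assuming the ~$20 per 3,000 pairs estimate above):

# Back-of-the-envelope API cost estimate, using the ~$20 per 3,000 unique pairs figure above.
COST_PER_BATCH = 20      # USD per batch of generated pairs
BATCH_SIZE = 3_000       # unique data pairs per batch

for target in (100_000, 300_000):
    cost = target / BATCH_SIZE * COST_PER_BATCH
    print(f"{target:,} pairs -> ~${cost:,.0f}")

# 100,000 pairs -> ~$667    (roughly $700 once duplicate generations are factored in)
# 300,000 pairs -> ~$2,000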

SEEDRA runs on a single GPU with 10 GiB of memory, so even if you or your company don't own a GPU, you can run it in Colab.
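
If your GPU offers less than 10 GiB, one common workaround (our suggestion here, not part of SEEDRA itself) is 4-bit quantization with bitsandbytes. A minimal sketch, assuming bitsandbytes is installed and a CUDA GPU is available:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Optional: load SEEDRA in 4-bit to shrink its footprint well below the 10 GiB budget.
# Quantization trades a little output quality for memory; acceptable for most augmentation runs.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("devJy/SEEDRA-zero-small")
model = AutoModelForCausalLM.from_pretrained(
    "devJy/SEEDRA-zero-small",
    quantization_config=bnb_config,
    device_map="auto",
)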

Securing Subtoken Coverage via Data Augmentation

By exposing the model to diverse synonyms and sentence patterns during training, you improve its ability to recognize and consistently generate rare tokens (including domain-specific ones) at inference time.
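
To see why this matters, inspect how a tokenizer splits a rare term. A minimal sketch (the exact splits below are illustrative and depend on the tokenizer):

from transformers import AutoTokenizer

# Rare or domain-specific words are split into several subtokens, so a model
# needs to see them in varied contexts to learn to emit them reliably.
tok = AutoTokenizer.from_pretrained("devJy/SEEDRA-zero-small")
print(tok.tokenize("pharmacokinetics"))  # e.g. ['ph', 'arm', 'aco', 'kinetics'] (tokenizer-dependent)
print(tok.tokenize("work"))              # common words usually map to a single token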

Mid-scale models (27B, 70B+) perform very well on general-domain text but may see accuracy drop when they encounter specialized terms or expressions that were not included in their training data.

By contrast, smaller "core" models (1B, 3B, 8B, 14B) often struggle to infer reliably. In particular, 1B and 3B variants remain popular with enterprises because of their lower resource requirements, but their out-of-the-box performance can be limited. Data augmentation (e.g., synonym expansion, alternative sentence structures) can effectively upgrade these small models' inference capabilities.

How to use SEEDRA


  1. If you need to build a special instruction dataset, or already have a domain-specific training dataset, you can augment it with SEEDRA.
  2. For example, Qwen 2.5 is a very powerful model, strong in Chinese and English but comparatively weak in Korean. In such cases, you can use SEEDRA to augment Korean data, using varied expressions and sentence orders to secure more subtoken coverage and strengthen its Korean capabilities (a concrete sketch follows this list).
  3. In a multimodal workflow, you can convert SEEDRA-generated text into speech with a tool like OuteTTS, then feed the synthesized audio into an ASR (automatic speech recognition) model as additional training or validation data.
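
As a concrete sketch of steps 1 and 2, the loop below feeds each seed sentence to SEEDRA and collects the paraphrases into one expanded dataset. The seed list is illustrative; the model, prompt, and output format follow the DEMO section below.

import json
from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="devJy/SEEDRA-zero-small",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

SYSTEM_PROMPT = """You are a similar sentence generation Assistant.
For the instructions below, output **only** an array of similar sentences in JSON format.
Example output: ["similar_sentence1","similar_sentence2"]"""

# Illustrative seeds; replace with your own domain-specific sentences.
seeds = [
    "I have to go to work, but I'm so sleepy that I want to sleep more instead of going to work.",
]

augmented = []
for seed in seeds:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": seed},
    ]
    reply = pipe(messages, max_new_tokens=4096)[0]["generated_text"][-1]["content"]
    augmented.extend(json.loads(reply))  # prompted to be a JSON array; guard with try/except in practice

print(f"{len(seeds)} seeds -> {len(augmented)} augmented sentences")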


DEMO

from transformers import pipeline
import torch

# Load SEEDRA as a chat-style text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model="devJy/SEEDRA-zero-small",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # place the model on the available GPU automatically
)

sentence = "I have to go to work, but I'm so sleepy that I want to sleep more instead of going to work."

SYSTEM_PROMPT = """You are a similar sentence generation Assistant.
For the instructions below, output **only** an array of similar sentences in JSON format.
Example output: ["similar_sentence1","similar_sentence2"]"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": sentence},
]

output = pipe(messages, max_new_tokens=4096)

# The pipeline returns the full chat history; the last message is the assistant's reply.
print(output[0]["generated_text"][-1]["content"])


# terminal output
# ["I need to go to work, but I'm feeling so tired that I'd rather sleep than go to work.",
#  "I have to go to work, but I'm so drowsy that I'd rather sleep than go to work.",
#  "I have to go to work, but I'm so fatigued that I'd rather sleep than go to work.",
#  "I need to go to work, but I'm so exhausted that I'd rather sleep than go to work.",
#  "I have to go to work, but I'm so sleepy that I'd rather sleep than go to work."]

SFT LOSS Comparison Using SEEDRA

(To be updated later.)
