GraphGen: Fine-tuning Language Models Using Knowledge Graphs to Generate Synthetic Data

General Introduction

GraphGen is an open-source framework from OpenScienceLab, an AI lab in Shanghai, hosted on GitHub. It optimizes supervised fine-tuning of Large Language Models (LLMs) by using knowledge graphs to guide synthetic data generation. GraphGen constructs fine-grained knowledge graphs from source text, identifies a model's knowledge blind spots with the expected calibration error (ECE) metric, and prioritizes generating Q&A pairs that target high-value, long-tail knowledge. It supports multi-hop neighborhood sampling to capture complex relational information and produces diversified data through style control. The project is released under the Apache 2.0 license, so the code can be used for academic research and commercial development. The generation process can be configured flexibly from the command line or a Gradio interface, and the generated data can be used directly for model training.

Function List

  • Fine-grained knowledge graph construction: extracts entities and relationships from source text to build a structured knowledge graph.
  • Knowledge blind spot identification: locates weaknesses in a language model's knowledge using the expected calibration error (ECE) metric (see the sketch after this list).
  • High-value Q&A generation: prioritizes Q&A pairs that target long-tail knowledge to improve model performance.
  • Multi-hop neighborhood sampling: captures multi-level relationships in the knowledge graph to increase data complexity.
  • Style-controlled generation: supports diverse Q&A styles, such as concise or detailed, for different scenarios.
  • Custom configuration: adjusts data type, input files, and output paths via a YAML file.
  • Gradio interface: provides a visual interface that simplifies data generation.
  • Model compatibility: supports multiple language models (e.g. Qwen, OpenAI) for data generation and training.
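
For reference, expected calibration error measures the gap between a model's stated confidence and its actual accuracy, aggregated over confidence bins. The snippet below is a minimal, generic ECE computation in Python; it is an illustration only, not code from the GraphGen repository, and the bin count is an arbitrary choice.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE: weighted gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
        ece += (in_bin.sum() / len(confidences)) * gap
    return ece

# Example: an overconfident model yields a non-trivial ECE
print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 0, 1, 1]))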

 

Usage Guide

Installation process

GraphGen is a Python project that supports installation from PyPI or running from source. Here are the detailed installation steps:

Installing from PyPI

  1. Install GraphGen
    Make sure Python 3.8 or later is installed, then run:

    pip install graphg
  2. Configure environment variables
    GraphGen calls an external language model API (such as Qwen or OpenAI). Set the environment variables in your terminal:

    export SYNTHESIZER_MODEL="your_synthesizer_model_name"
    export SYNTHESIZER_BASE_URL="your_base_url"
    export SYNTHESIZER_API_KEY="your_api_key"
    export TRAINEE_MODEL="your_trainee_model_name"
    export TRAINEE_BASE_URL="your_base_url"
    export TRAINEE_API_KEY="your_api_key"
    
    • SYNTHESIZER_MODEL: the model used to build the knowledge graph and generate data.
    • TRAINEE_MODEL: the model to be trained, whose knowledge blind spots are probed.
  3. Run the command-line tool
    Run the following command to generate data:

    graphg --output_dir cache
    

Installation from source

  1. Clone the repository
    Clone the GraphGen repository locally:

    git clone https://github.com/open-sciencelab/GraphGen.git
    cd GraphGen
    
  2. Create a virtual environment
    Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # Linux/Mac
    venv\Scripts\activate     # Windows
    
  3. Install dependencies
    Install project dependencies:

    pip install -r requirements.txt
    

    Make sure PyTorch (recommended 1.13.1 or higher) and related libraries (e.g. LiteLLM, DSPy) are installed. If using a GPU, install a CUDA-compatible version:

    pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
    
  4. Configure environment variables
    Copy the example environment file and edit it:

    cp .env.example .env
    

    Then, in the .env file, set the model information:

    SYNTHESIZER_MODEL=your_synthesizer_model_name
    SYNTHESIZER_BASE_URL=your_base_url
    SYNTHESIZER_API_KEY=your_api_key
    TRAINEE_MODEL=your_trainee_model_name
    TRAINEE_BASE_URL=your_base_url
    TRAINEE_API_KEY=your_api_key
    
  5. Prepare the input data
    GraphGen expects input text in JSONL format. Example data is located at resources/examples/raw_demo.jsonl. Prepare your own data in the same format; a minimal sketch for writing such a file follows.
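
As a rough illustration of the JSONL shape (one JSON object per line), the sketch below writes a small input file. The "content" field name is a hypothetical placeholder; mirror the actual keys used in resources/examples/raw_demo.jsonl.

import json

# Hypothetical records; the key "content" is a placeholder —
# match the fields found in resources/examples/raw_demo.jsonl.
records = [
    {"content": "Aspirin is a nonsteroidal anti-inflammatory drug used to relieve pain and fever."},
    {"content": "A knowledge graph represents entities as nodes and relationships as edges."},
]

with open("my_input.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")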

Usage

GraphGen supports both command line and Gradio interface. Here are the detailed steps:

Command-line operation

  1. Edit the configuration file
    Edit the configs/graphgen_config.yaml file to set the data generation parameters (a small script for checking this file is sketched after these steps):

    data_type: "raw"
    input_file: "resources/examples/raw_demo.jsonl"
    output_dir: "cache"
    ece_threshold: 0.1
    sampling_hops: 2
    style: "detailed"
    
    • data_type: Input data type (e.g. raw).
    • input_file: Path to the input file.
    • output_dir: Output directory.
    • ece_threshold: ECE threshold used for knowledge blind spot identification.
    • sampling_hops: Multi-hop sampling depth.
    • style: Q&A generation style (e.g. detailed or concise).
  2. Run the generation script
    Run the following command to generate data:

    bash scripts/generate.sh
    

    or run the Python module directly:

    python -m graphg --config configs/graphgen_config.yaml
    
  3. View the generated results
    The generated Q&A pairs are saved as JSONL files in the cache/data/graphgen directory:

    ls cache/data/graphgen
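
Before a long run, it can help to sanity-check the configuration. The snippet below is a generic PyYAML sketch, not part of GraphGen's CLI; the parameter names are the ones shown in the example configuration above.

import yaml  # pip install pyyaml

with open("configs/graphgen_config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Print the parameters discussed above; missing keys show up as None
for key in ("data_type", "input_file", "output_dir",
            "ece_threshold", "sampling_hops", "style"):
    print(f"{key}: {config.get(key)}")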
    

Gradio Interface Operation

  1. Launch the Gradio interface
    Run the following command to start the visual interface:

    python webui/app.py
    

    The browser will open the Gradio interface showing the data generation process.

  2. Workflow
    • Upload a JSONL-formatted input file in the interface.
    • Configure the generation parameters (e.g., ECE threshold, sampling depth, generation style).
    • Click the "Generate" button; the system processes the input and outputs Q&A pairs.
    • Download the generated JSONL file.

Feature Highlights

  • Knowledge graph construction: GraphGen automatically extracts entities and relationships from the input text, generates a knowledge graph, and saves it in JSON format. No manual intervention is required.
  • Knowledge blind spot identification: the ECE metric is used to analyze the model's prediction bias and generate targeted Q&A pairs. Adjust ece_threshold to control how strictly blind spots are screened.
  • Multi-hop neighborhood sampling: captures multi-level relationships in the knowledge graph to generate complex Q&A pairs. Set sampling_hops to control the sampling depth (a small sampling sketch follows this list).
  • Style-controlled generation: multiple Q&A styles are supported for different scenarios. Select a style with the style parameter.
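
To make the multi-hop idea concrete, here is a small sketch using networkx. It only illustrates the concept of collecting every node within sampling_hops hops of a seed entity; GraphGen's internal implementation may differ, and the toy graph below is made up.

import networkx as nx

def multi_hop_neighborhood(graph, seed, hops=2):
    """Return all nodes reachable from `seed` within `hops` edges."""
    lengths = nx.single_source_shortest_path_length(graph, seed, cutoff=hops)
    return set(lengths) - {seed}

# Toy knowledge graph: entities as nodes, relations as edges
g = nx.Graph()
g.add_edges_from([
    ("aspirin", "NSAID"),
    ("NSAID", "anti-inflammatory drug"),
    ("anti-inflammatory drug", "prostaglandin"),
])

print(multi_hop_neighborhood(g, "aspirin", hops=2))
# -> {'NSAID', 'anti-inflammatory drug'}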

Training a model

The generated data can be used for supervised fine-tuning (SFT). Import the output file into a framework that supports SFT, such as XTuner (adapt the command below to your framework and version); a hypothetical conversion sketch follows the example command:

xtuner train --data cache/data/graphgen/output.jsonl --model qwen-7b
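
If your SFT framework expects a different schema, the output can be converted first. The sketch below is hypothetical: it assumes each generated line contains question and answer fields, which may not match GraphGen's actual output keys, so inspect a generated file before using it.

import json

# Hypothetical conversion to a simple instruction-tuning format.
# The keys "question"/"answer" are assumptions; check the real output schema.
with open("cache/data/graphgen/output.jsonl", encoding="utf-8") as src, \
     open("sft_data.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        item = json.loads(line)
        sample = {
            "instruction": item.get("question", ""),
            "output": item.get("answer", ""),
        }
        dst.write(json.dumps(sample, ensure_ascii=False) + "\n")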

Caveats

  • The generation process calls an external model API, so make sure your API key is valid and your network connection is stable.
  • The input data should be in JSONL format; refer to raw_demo.jsonl.
  • GPU devices are recommended for large-scale data generation to optimize performance.
  • Check dependency versions to avoid conflicts; update requirements.txt if necessary.

Supplementary resources

  • OpenXLab Application Center: try GraphGen online through OpenXLab.
  • Official FAQ: refer to the GitHub FAQ for solutions to common problems.
  • Technical analysis: the DeepWiki system architecture analysis describes GraphGen's workflow in detail.

 

Application scenarios

  1. Academic research
    Researchers can use GraphGen to generate Q&A data for specialized domains. For example, generating training data for a chemistry or medical domain model improves the model's knowledge coverage.
  2. Enterprise AI optimization
    Enterprises can use GraphGen to generate customized Q&A pairs for customer service or recommender systems, improving the response quality of conversational models.
  3. Education platform development
    Developers can generate diverse teaching Q&A data to build intelligent educational tools that support personalized learning.

 

QA

  1. What models does GraphGen support?
    GraphGen supports OpenAI, Qwen, Ollama, and other models through LiteLLM. A model API key and base URL are required.
  2. How do I prepare the input data?
    The input data should be in JSONL format, with each line containing text content. Refer to resources/examples/raw_demo.jsonl.
  3. How long does it take to generate data?
    A small dataset (around 100 entries) takes a few minutes; large datasets can take hours, depending on input size and hardware performance.
  4. How does the Gradio interface work?
    Run python webui/app.py, then upload the input file in your browser and configure the parameters to generate the data.