Project Overview

Understanding THETA's architecture and workflow will help you use it effectively.

Architecture Overview

THETA consists of three main components:

Embedding Module: Generates contextual embeddings using Qwen3-Embedding
Topic Model: Neural variational inference for topic discovery
Evaluation & Visualization: Computes evaluation metrics and generates figures

Data flow:

Raw Text → Data Cleaning → Preprocessing → Training → Evaluation → Visualization
              ↓              ↓               ↓           ↓            ↓
         Cleaned CSV    Embeddings+BOW   Model Ckpt  Metrics     Figures
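
The data flow above can be read as an ordered list of stages, each leaving an artifact for the next. The sketch below only restates the diagram as data; the `Stage` type and the helper function are illustrative names and are not part of THETA's code.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    name: str
    artifact: str  # what the stage leaves behind for the next one


# Stage names and artifacts are copied from the data-flow diagram.
PIPELINE = [
    Stage("Data Cleaning", "Cleaned CSV"),
    Stage("Preprocessing", "Embeddings + BOW"),
    Stage("Training", "Model checkpoint"),
    Stage("Evaluation", "Metrics"),
    Stage("Visualization", "Figures"),
]


def artifacts_available_after(stage_name: str) -> list:
    """Return every artifact that exists once `stage_name` has finished."""
    names = [stage.name for stage in PIPELINE]
    if stage_name not in names:
        raise ValueError(f"unknown stage: {stage_name!r}")
    last = names.index(stage_name)
    return [stage.artifact for stage in PIPELINE[: last + 1]]
```

For example, `artifacts_available_after("Training")` lists the cleaned CSV, the embeddings and BOW files, and the model checkpoint.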

Supported Models

THETA supports multiple topic modeling approaches:

THETA Model (Our Method)

Architecture:

- Variational autoencoder with Qwen3-Embedding
- Neural encoder for topic distribution inference
- Reconstruction via topic-word distributions
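
To make the three architectural pieces concrete, here is a minimal sketch of the loss a variational-autoencoder topic model typically optimizes for one document. It assumes the standard formulation (Gaussian latent variable, softmax to topic proportions, closed-form KL term); THETA's actual loss and encoder are not described in this page, so treat every name below as hypothetical. In a real model, `mu` and `log_var` would come from the neural encoder applied to the document embedding.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def negative_elbo(bow, mu, log_var, beta, rng):
    """One-sample negative ELBO for a single document.

    bow      -- word counts, length V
    mu       -- encoder mean, length K (assumed to come from the encoder)
    log_var  -- encoder log-variance, length K
    beta     -- K rows of topic-word probabilities, each summing to 1
    rng      -- random.Random instance used for the Gaussian noise
    """
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    theta = softmax(z)  # document-topic proportions

    # Reconstruction via topic-word distributions: p(w) = sum_k theta_k * beta[k][w].
    word_probs = [
        sum(theta[k] * beta[k][w] for k in range(len(theta)))
        for w in range(len(bow))
    ]
    recon = -sum(count * math.log(p + 1e-12) for count, p in zip(bow, word_probs))

    # KL(q(z|x) || N(0, I)) in closed form.
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + kl, theta
```

The function returns both the loss and the inferred topic proportions, so the same forward pass can serve training and topic inspection.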

Training Modes:

Mode	Description	Use Case	Requirements
zero_shot	Unsupervised learning	No labels available	Text column only
supervised	Label-guided learning	Labels available	Text + label columns
unsupervised	Unsupervised (ignores labels)	Compare with zero_shot	Text column only
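
The Requirements column can be checked before training starts. The sketch below is a hedged example: the column names `text` and `label` are placeholders (this page does not state the names THETA expects), and `missing_columns` is a hypothetical helper, not a THETA function.

```python
import csv
import io

# Requirements column of the table above, expressed as data.
# "text" and "label" are assumed column names for illustration only.
REQUIRED_COLUMNS = {
    "zero_shot": {"text"},
    "supervised": {"text", "label"},
    "unsupervised": {"text"},
}


def missing_columns(csv_file, mode):
    """Return the required columns that the CSV header lacks for `mode`."""
    if mode not in REQUIRED_COLUMNS:
        raise ValueError(f"unknown training mode: {mode!r}")
    header = next(csv.reader(csv_file), [])
    return sorted(REQUIRED_COLUMNS[mode] - set(header))


# Usage: a CSV with only a text column is fine for zero_shot
# but lacks the label column that supervised mode needs.
sample = "text\nfirst document\nsecond document\n"
print(missing_columns(io.StringIO(sample), "zero_shot"))
print(missing_columns(io.StringIO(sample), "supervised"))
```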

Model Sizes:

All three sizes share the same architecture but differ in embedding quality:

- 0.6B: Fastest, suitable for development and testing
- 4B: Balanced performance for production use
- 8B: Best quality for research and high-stakes applications

Baseline Models

LDA (Latent Dirichlet Allocation)

- Classic probabilistic topic model
- No neural components
- Fast and interpretable

ETM (Embedded Topic Model)

- Uses Word2Vec embeddings
- Neural topic model
- Better than LDA, faster than THETA

CTM (Contextualized Topic Model)

- Uses SBERT embeddings
- Contextualized representations
- Good balance of quality and speed

DTM (Dynamic Topic Model)

- Temporal topic modeling
- Tracks topic evolution over time
- Requires timestamp information

Directory Structure

THETA organizes files in the following structure:

Project Directory

/root/autodl-tmp/ETM/
├── main.py                   # THETA training script
├── run_pipeline.py           # Unified entry point
├── prepare_data.py           # Data preprocessing
├── config.py                 # Configuration
├── requirements.txt          # Dependencies
├── dataclean/               # Data cleaning module
│   └── main.py
├── src/
│   ├── bow/                 # BOW generation
│   ├── model/               # Model definitions
│   │   ├── etm.py          # THETA/ETM model
│   │   ├── lda.py          # LDA model
│   │   ├── ctm.py          # CTM model
│   │   └── baseline_trainer.py
│   ├── evaluation/          # Evaluation metrics
│   │   ├── topic_metrics.py
│   │   └── unified_evaluator.py
│   ├── visualization/       # Visualization
│   │   ├── run_visualization.py
│   │   ├── topic_visualizer.py
│   │   └── visualization_generator.py
│   └── utils/               # Utilities
│       └── result_manager.py
└── scripts/
    └── download_models.py

Data Directory

/root/autodl-tmp/data/
└── {dataset_name}/
    └── {dataset_name}_cleaned.csv

Results Directory

/root/autodl-tmp/result/
├── 0.6B/                    # THETA 0.6B results
│   └── {dataset_name}/
│       ├── bow/             # Shared by all modes
│       ├── zero_shot/       # Zero-shot results
│       │   ├── checkpoints/
│       │   ├── metrics/
│       │   └── visualizations/
│       ├── supervised/      # Supervised results
│       └── unsupervised/    # Unsupervised results
├── 4B/                      # THETA 4B results
├── 8B/                      # THETA 8B results
└── baseline/                # Baseline results
    └── {dataset_name}/
        ├── bow/
        ├── lda/
        │   └── K20/        # 20 topics
        ├── etm/
        ├── ctm/
        └── dtm/
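
When post-processing results in your own scripts, it can help to build these paths in one place. The helpers below are a sketch that mirrors the tree above; they are not part of THETA. The `K{num_topics}` subdirectory is inferred from the single `K20` example under `lda/` and may not apply to every baseline.

```python
from pathlib import PurePosixPath

RESULT_ROOT = PurePosixPath("/root/autodl-tmp/result")
MODEL_SIZES = {"0.6B", "4B", "8B"}
MODES = {"zero_shot", "supervised", "unsupervised"}
BASELINES = {"lda", "etm", "ctm", "dtm"}


def theta_result_dir(model_size, dataset, mode):
    """Directory for one THETA run, e.g. .../0.6B/{dataset}/zero_shot."""
    if model_size not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {model_size!r}")
    if mode not in MODES:
        raise ValueError(f"unknown training mode: {mode!r}")
    return RESULT_ROOT / model_size / dataset / mode


def baseline_result_dir(dataset, model, num_topics=None):
    """Directory for one baseline model, optionally with a K{n} subfolder."""
    if model not in BASELINES:
        raise ValueError(f"unknown baseline: {model!r}")
    path = RESULT_ROOT / "baseline" / dataset / model
    # Assumption: the K{n} naming generalizes from the K20 example in the tree.
    return path / f"K{num_topics}" if num_topics is not None else path


# Usage with a hypothetical dataset name:
print(theta_result_dir("0.6B", "my_dataset", "zero_shot") / "metrics")
print(baseline_result_dir("my_dataset", "lda", num_topics=20))
```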

Embedding Models Directory

/root/embedding_models/
├── qwen3_embedding_0.6B/
├── qwen3_embedding_4B/
└── qwen3_embedding_8B/
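
The directory names follow a regular pattern, so a model size can be mapped to its local folder. This is a small sketch based only on the listing above; `embedding_model_dir` is a hypothetical helper, and how THETA itself locates the models is not covered here.

```python
from pathlib import PurePosixPath

EMBEDDING_ROOT = PurePosixPath("/root/embedding_models")
SUPPORTED_SIZES = ("0.6B", "4B", "8B")


def embedding_model_dir(size):
    """Map a model size such as "4B" to its directory in the listing above."""
    if size not in SUPPORTED_SIZES:
        raise ValueError(f"unsupported size {size!r}; expected one of {SUPPORTED_SIZES}")
    return EMBEDDING_ROOT / f"qwen3_embedding_{size}"
```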

Workflow Summary

The typical THETA workflow consists of four stages:

Stage 1: Data Preparation

1. Collect raw text data
2. Clean and format as CSV
3. Ensure the CSV has the columns your training mode requires (a text column, plus a label column for supervised mode)

Stage 2: Preprocessing

1. Run prepare_data.py to generate embeddings
2. Create bag-of-words representations
3. Build vocabulary
4. Save preprocessed files
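
For readers unfamiliar with bag-of-words preprocessing, the sketch below shows the idea behind building a vocabulary and turning a document into a count vector. It uses whitespace tokenization for brevity and is not the logic in `prepare_data.py` or `src/bow/`; the function names and the `min_count` parameter are illustrative.

```python
from collections import Counter


def build_vocab(docs, min_count=1):
    """Map each sufficiently frequent token to a stable integer index."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    return {tok: idx for idx, tok in enumerate(kept)}


def to_bow(doc, vocab):
    """Count vector over `vocab`; out-of-vocabulary tokens are ignored."""
    vec = [0] * len(vocab)
    for tok in doc.lower().split():
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec


docs = ["topic models find topics", "neural topic models"]
vocab = build_vocab(docs)
bow_matrix = [to_bow(doc, vocab) for doc in docs]
```

Each row of `bow_matrix` has one entry per vocabulary word, which is the input a topic model reconstructs during training.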

Stage 3: Training

1. Run run_pipeline.py to train the model
2. Training uses early stopping
3. Evaluation on multiple metrics runs automatically
4. Visualizations are generated automatically
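
Early stopping halts training once a monitored value stops improving. The sketch below shows the common patience-based variant; it is a generic illustration, and the monitored quantity, `patience`, and `min_delta` are assumptions rather than THETA's actual settings.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Usage with made-up loss values: the loss stops improving after epoch 2.
stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([1.0, 0.9, 0.95, 0.97], start=1):
    if stopper.step(loss):
        print(f"stopping at epoch {epoch}, best loss {stopper.best}")
        break
```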

Stage 4: Analysis

1. Review evaluation metrics
2. Examine visualizations
3. Analyze discovered topics
4. Compare with baseline models
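
When analyzing discovered topics by hand, one simple check is how much the topics overlap. The sketch below computes topic diversity, a commonly used measure defined as the fraction of unique words among the top words of all topics. It is offered as an example only; this page does not list which metrics THETA reports, and the function name is hypothetical.

```python
def topic_diversity(top_words_per_topic):
    """Fraction of unique words across all topics' top-word lists (0 to 1)."""
    all_words = [word for topic in top_words_per_topic for word in topic]
    if not all_words:
        raise ValueError("no topic words supplied")
    return len(set(all_words)) / len(all_words)


# Made-up top words for two topics; "market" appears in both.
topics = [
    ["market", "stock", "price"],
    ["market", "policy", "election"],
]
diversity = topic_diversity(topics)
```

A value near 1 means the topics use largely distinct vocabulary, while a low value suggests redundant topics. The same function can be applied to THETA and to each baseline for a like-for-like comparison.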

Next Steps

Now that you understand the architecture, you can:

Explore the User Guide for detailed documentation on each component
Try different training modes (supervised, unsupervised)
Experiment with different model sizes (4B, 8B)
Learn about hyperparameter tuning in the Advanced Usage section
Compare THETA with baseline models (LDA, ETM, CTM)
Process Chinese text data with specialized pipelines