# run_pipeline.py

Unified training, evaluation, and visualization pipeline.

## Basic Usage

```bash
python run_pipeline.py --dataset DATASET --models MODELS [OPTIONS]
```
## Required Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `--dataset` | string | Dataset name |
| `--models` | string | Comma-separated model list: `theta`, `lda`, `etm`, `ctm`, `dtm` |
## Model Configuration (THETA)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--model_size` | string | `0.6B` | Qwen model size: `0.6B`, `4B`, or `8B` |
| `--mode` | string | `zero_shot` | Training mode: `zero_shot`, `supervised`, or `unsupervised` |
## Topic Model Parameters

| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `--num_topics` | int | 20 | 5-100 | Number of topics to discover |
| `--epochs` | int | 100 | 10-500 | Maximum training epochs |
| `--batch_size` | int | 64 | 8-512 | Training batch size |
## Neural Network Architecture

| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `--hidden_dim` | int | 512 | 128-1024 | Encoder hidden dimension |
## Optimization

| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `--learning_rate` | float | 0.002 | 0.00001-0.1 | Learning rate for the optimizer |
## KL Annealing

| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `--kl_start` | float | 0.0 | 0.0-1.0 | Initial KL divergence weight |
| `--kl_end` | float | 1.0 | 0.0-1.0 | Final KL divergence weight |
| `--kl_warmup` | int | 50 | 0-200 | Number of warmup epochs for KL annealing |
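Together these three flags define a warmup schedule for the KL term of the topic model's loss. The pipeline's exact interpolation is not documented here; a linear ramp, the common choice, can be sketched as:

```python
def kl_weight(epoch: int,
              kl_start: float = 0.0,
              kl_end: float = 1.0,
              kl_warmup: int = 50) -> float:
    """Linear KL annealing sketch (assumed schedule, not the pipeline's source).

    Ramps the KL weight from kl_start to kl_end over the first
    kl_warmup epochs, then holds it at kl_end.
    """
    if kl_warmup <= 0:
        return kl_end
    t = min(epoch / kl_warmup, 1.0)  # fraction of warmup completed
    return kl_start + t * (kl_end - kl_start)
```

With the defaults, the weight rises from 0.0 at epoch 0 to 1.0 at epoch 50 and stays there for the remainder of training.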
## Early Stopping

| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `--patience` | int | 10 | 1-50 | Epochs to wait without improvement before early stopping |
| `--no_early_stopping` | flag | False | N/A | Disable early stopping |
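How `--patience` behaves can be sketched with a generic patience counter (this assumes the pipeline monitors a validation loss; the actual monitored metric is not documented here):

```python
class EarlyStopping:
    """Generic patience-based early stopping sketch."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0   # improvement resets the counter
        else:
            self.bad_epochs += 1  # no improvement this epoch
        return self.bad_epochs >= self.patience
```

With `--patience 10`, training stops once ten consecutive epochs fail to improve on the best validation loss seen so far; `--no_early_stopping` disables this check entirely.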
## Hardware Configuration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--gpu` | int | 0 | GPU device ID |
## Output Configuration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--language` | string | `en` | Visualization language: `en` or `zh` |
## Pipeline Control

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--skip-train` | flag | False | Skip training, evaluate only |
| `--skip-eval` | flag | False | Skip evaluation |
| `--skip-viz` | flag | False | Skip visualization |
| `--check-only` | flag | False | Check data files only |
| `--prepare` | flag | False | Run preprocessing before training |
## Examples

Basic THETA training:

```bash
python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --epochs 100 \
    --gpu 0
```

Multiple baseline models:

```bash
python run_pipeline.py \
    --dataset my_dataset \
    --models lda,etm,ctm \
    --num_topics 20 \
    --epochs 100 \
    --gpu 0
```

Custom hyperparameters:

```bash
python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 30 \
    --epochs 150 \
    --batch_size 32 \
    --hidden_dim 768 \
    --learning_rate 0.001 \
    --kl_start 0.0 \
    --kl_end 1.0 \
    --kl_warmup 80 \
    --patience 15 \
    --gpu 0
```

Evaluate an existing model:

```bash
python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --skip-train \
    --gpu 0
```
## Output Files

THETA models:

```text
/root/autodl-tmp/result/{model_size}/{dataset}/{mode}/
├── checkpoints/
│   ├── best_model.pt
│   └── training_history.json
├── metrics/
│   └── evaluation_results.json
└── visualizations/
    ├── topic_words_bars.png
    ├── topic_similarity.png
    ├── doc_topic_umap.png
    ├── topic_wordclouds.png
    ├── metrics.png
    └── pyldavis.html
```

Baseline models:

```text
/root/autodl-tmp/result/baseline/{dataset}/{model}/K{num_topics}/
├── checkpoints/
├── metrics/
└── visualizations/
```
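When post-processing results in scripts, the two directory layouts above can be resolved with small helpers (a sketch; `theta_result_dir` and `baseline_result_dir` are our own names, not part of the pipeline):

```python
from pathlib import Path

RESULT_ROOT = "/root/autodl-tmp/result"  # root documented above

def theta_result_dir(model_size: str, dataset: str, mode: str,
                     root: str = RESULT_ROOT) -> Path:
    """Output directory for a THETA run: {root}/{model_size}/{dataset}/{mode}."""
    return Path(root) / model_size / dataset / mode

def baseline_result_dir(dataset: str, model: str, num_topics: int,
                        root: str = RESULT_ROOT) -> Path:
    """Output directory for a baseline run: {root}/baseline/{dataset}/{model}/K{num_topics}."""
    return Path(root) / "baseline" / dataset / model / f"K{num_topics}"
```

For example, the evaluation results of a default THETA run would live under `theta_result_dir("0.6B", "my_dataset", "zero_shot") / "metrics" / "evaluation_results.json"`.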
## Return Codes

| Exit Code | Meaning |
|-----------|---------|
| 0 | Success |
| 1 | General error |
| 2 | File not found |
| 3 | Invalid parameters |
| 4 | CUDA out of memory |
| 5 | Convergence failure |
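For scripting around the pipeline, the table above can be turned into a small lookup helper (a sketch; the advice strings are our own, not emitted by the pipeline):

```python
# Documented exit codes of run_pipeline.py, with suggested follow-up actions.
EXIT_CODES = {
    0: "success",
    1: "general error",
    2: "file not found: verify inputs with --check-only",
    3: "invalid parameters",
    4: "CUDA out of memory: retry with a smaller --batch_size",
    5: "convergence failure: try more --epochs or a lower --learning_rate",
}

def describe_exit(rc: int) -> str:
    """Map a pipeline exit code to a human-readable description."""
    return EXIT_CODES.get(rc, f"unknown exit code {rc}")
```

In a wrapper script you would call `describe_exit(proc.returncode)` after running the pipeline via `subprocess.run` to log a meaningful failure reason.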