run_pipeline.py
Unified training, evaluation, and visualization pipeline.
Basic Usage
python run_pipeline.py --dataset DATASET --models MODELS [ OPTIONS]
Required Parameters
Parameter
Type
Description
--dataset
string
Dataset name
--models
string
Comma-separated model list: theta,lda,hdp,stm,btm,etm,ctm,dtm,nvdm,gsm,prodlda,bertopic
Common Parameters
Shared across all or most models. Parameters marked * apply to neural network–based models only.
Parameter
Type
Default
Range
Description
--num_topics
int
20
5–100
Number of topics K (upper bound for HDP; optional for BERTopic)
--vocab_size
int
5000
1000–20000
Vocabulary size
--epochs *
int
100
10–500
Training epochs
--batch_size *
int
64
8–512
Mini-batch size
--learning_rate *
float
0.002
1e-5–0.1
Learning rate
--dropout *
float
0.2
0–0.9
Encoder dropout rate
--hidden_dim *
int
512
128–2048
Hidden units per layer (NVDM/GSM/ProdLDA default: 256)
--num_layers *
int
2
1–5
Number of encoder hidden layers
--patience *
int
10
1–50
Early stopping patience
Model-Specific Additional Parameters
THETA
Additional parameters beyond common defaults:
Parameter
Type
Default
Range
Description
--model_size
str
0.6B
0.6B / 4B / 8B
Qwen model size
--mode
str
zero_shot
zero_shot / supervised / unsupervised
Embedding mode
--kl_start
float
0.0
0–1
KL annealing start weight
--kl_end
float
1.0
0–1
KL annealing end weight
--kl_warmup
int
50
0–epochs
KL warmup epochs
--language
str
zh
en / zh
Visualization language
LDA
Parameter
Type
Default
Range
Description
--max_iter
int
100
10–500
Maximum EM iterations
--alpha
float
1/K (auto)
>0
Document-topic Dirichlet prior
HDP
Parameter
Type
Default
Range
Description
--max_topics
int
150
50–300
Upper bound on number of topics (replaces --num_topics)
--alpha
float
1.0
>0
Document-level concentration parameter
STM
Parameter
Type
Default
Range
Description
--max_iter
int
100
10–500
Maximum EM iterations
BTM
Parameter
Type
Default
Range
Description
--n_iter
int
100
10–500
Gibbs sampling iterations (replaces --epochs)
--alpha
float
1.0
>0
Topic distribution Dirichlet prior
--beta
float
0.01
>0
Word distribution Dirichlet prior
ETM
Parameter
Type
Default
Range
Description
--embedding_dim
int
300
50–1024
Word embedding dimension (Word2Vec)
CTM
Parameter
Type
Default
Range
Description
--inference_type
str
zeroshot
zeroshot / combined
Inference mode: SBERT only or SBERT + BOW
--hidden_dim
int
100
32–1024
Overrides common default (512 → 100)
DTM
Parameter
Type
Default
Range
Description
--embedding_dim
int
300
50–1024
Word embedding dimension
Note : DTM does not use --num_layers, --dropout, or --patience.
Data requirement : DTM requires a timestamp column. Run python prepare_data.py --dataset your_data --model dtm before training.
NVDM / GSM / ProdLDA
No additional parameters — all settings covered by common defaults.
Note : --hidden_dim defaults to 256 for these models.
BERTopic
Parameter
Type
Default
Range
Description
--min_cluster_size
int
10
2–100
HDBSCAN minimum cluster size; controls topic granularity
--min_samples
int
None
1–100
HDBSCAN min_samples (defaults to min_cluster_size)
--top_n_words
int
10
1–30
Top words displayed per topic
--n_neighbors
int
15
2–100
UMAP number of neighbors
--n_components
int
5
2–50
UMAP reduced dimensions
--random_state
int
42
any int
UMAP random seed for reproducibility
Note : BERTopic does not use --epochs, --batch_size, --learning_rate, or other neural training parameters.
--num_topics is optional; set to None for auto-detection.
Pipeline Control Flags
Parameter
Type
Default
Range
Description
--kl_start
float
0.0
0.0-1.0
Initial KL divergence weight
--kl_end
float
1.0
0.0-1.0
Final KL divergence weight
--kl_warmup
int
50
0-200
Number of warmup epochs for KL annealing
Early Stopping
Parameter
Type
Default
Range
Description
--patience
int
10
1-50
Epochs to wait before early stopping
--no_early_stopping
flag
False
N/A
Disable early stopping
Hardware Configuration
Parameter
Type
Default
Description
--gpu
int
0
GPU device ID
Output Configuration
Parameter
Type
Default
Description
--language
string
en
Visualization language: en or zh
Pipeline Control
Parameter
Type
Default
Description
--skip-train
flag
False
Skip training, evaluate only
--skip-eval
flag
False
Skip evaluation
--skip-viz
flag
False
Skip visualization
--check-only
flag
False
Check data files only
--prepare
flag
False
Run preprocessing before training
Examples
Basic THETA training:
python run_pipeline.py \
--dataset my_dataset \
--models theta \
--model_size 0 .6B \
--mode zero_shot \
--num_topics 20 \
--epochs 100 \
--gpu 0
Multiple baseline models:
python run_pipeline.py \
--dataset my_dataset \
--models lda,etm,ctm \
--num_topics 20 \
--epochs 100 \
--gpu 0
Custom hyperparameters:
python run_pipeline.py \
--dataset my_dataset \
--models theta \
--model_size 0 .6B \
--mode zero_shot \
--num_topics 30 \
--epochs 150 \
--batch_size 32 \
--hidden_dim 768 \
--learning_rate 0 .001 \
--kl_start 0 .0 \
--kl_end 1 .0 \
--kl_warmup 80 \
--patience 15 \
--gpu 0
Evaluate existing model:
python run_pipeline.py \
--dataset my_dataset \
--models theta \
--model_size 0 .6B \
--mode zero_shot \
--num_topics 20 \
--skip-train \
--gpu 0
Output Files
THETA models:
./result/{model_size}/{dataset}/{mode}/
├── checkpoints/
│ ├── best_model.pt
│ └── training_history.json
├── metrics/
│ └── evaluation_results.json
└── visualizations/
├── topic_words_bars.png
├── topic_similarity.png
├── doc_topic_umap.png
├── topic_wordclouds.png
├── metrics.png
└── pyldavis.html
Baseline models:
./result/baseline/{dataset}/{model}/K{num_topics}/
├── checkpoints/
├── metrics/
└── visualizations/
Return Codes
Exit Code
Meaning
0
Success
1
General error
2
File not found
3
Invalid parameters
4
CUDA out of memory
5
Convergence failure