run_pipeline.py

Unified training, evaluation, and visualization pipeline.


Basic Usage

python run_pipeline.py --dataset DATASET --models MODELS [OPTIONS]

Required Parameters

| Parameter | Type | Description |
|---|---|---|
| --dataset | string | Dataset name |
| --models | string | Comma-separated model list: theta,lda,hdp,stm,btm,etm,ctm,dtm,nvdm,gsm,prodlda,bertopic |
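
As an illustration of how a comma-separated --models value could be split and checked, here is a minimal sketch. The helper name parse_models, the case-insensitive matching, and the de-duplication are assumptions made for this example; they are not confirmed behaviour of run_pipeline.py.

```python
# Model names copied from the --models description above.
KNOWN_MODELS = (
    "theta", "lda", "hdp", "stm", "btm", "etm",
    "ctm", "dtm", "nvdm", "gsm", "prodlda", "bertopic",
)


def parse_models(value: str) -> list[str]:
    """Split a comma-separated --models value and reject unknown names.

    Hypothetical helper: lowercasing and de-duplication are assumptions.
    """
    names = [part.strip().lower() for part in value.split(",") if part.strip()]
    if not names:
        raise ValueError("--models needs at least one model name")
    unknown = [name for name in names if name not in KNOWN_MODELS]
    if unknown:
        raise ValueError("unknown model(s): " + ", ".join(unknown))
    # Drop duplicates while keeping the order given on the command line.
    return list(dict.fromkeys(names))
```

For example, parse_models("lda, etm,ctm") yields the three names in order, while an unrecognised name raises ValueError before any training starts.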

Common Parameters

Shared across all or most models. Parameters marked * apply to neural network–based models only.

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --num_topics | int | 20 | 5–100 | Number of topics K (upper bound for HDP; optional for BERTopic) |
| --vocab_size | int | 5000 | 1000–20000 | Vocabulary size |
| --epochs * | int | 100 | 10–500 | Training epochs |
| --batch_size * | int | 64 | 8–512 | Mini-batch size |
| --learning_rate * | float | 0.002 | 1e-5–0.1 | Learning rate |
| --dropout * | float | 0.2 | 0–0.9 | Encoder dropout rate |
| --hidden_dim * | int | 512 | 128–2048 | Hidden units per layer (NVDM/GSM/ProdLDA default: 256) |
| --num_layers * | int | 2 | 1–5 | Number of encoder hidden layers |
| --patience * | int | 10 | 1–50 | Early stopping patience |
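
The documented ranges can be checked before launching a long run. The sketch below copies the ranges from the common-parameter table; the helper name out_of_range is hypothetical, and whether run_pipeline.py itself enforces these ranges is not stated here, so treat this as an external sanity check only.

```python
# Ranges copied from the common-parameter table (inclusive on both ends).
COMMON_RANGES = {
    "num_topics": (5, 100),
    "vocab_size": (1000, 20000),
    "epochs": (10, 500),
    "batch_size": (8, 512),
    "learning_rate": (1e-5, 0.1),
    "dropout": (0.0, 0.9),
    "hidden_dim": (128, 2048),
    "num_layers": (1, 5),
    "patience": (1, 50),
}


def out_of_range(params: dict) -> list[str]:
    """Return one message per value that falls outside its documented range.

    Hypothetical helper: unknown names and None values are skipped.
    """
    problems = []
    for name, value in params.items():
        if name not in COMMON_RANGES or value is None:
            continue
        low, high = COMMON_RANGES[name]
        if not low <= value <= high:
            problems.append(f"--{name}={value} is outside {low}..{high}")
    return problems
```

Note that some models document their own range for a shared flag (CTM lists 32–1024 for --hidden_dim), so a per-model table would be needed for a stricter check.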

Model-Specific Additional Parameters

THETA

Additional parameters beyond common defaults:

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --model_size | str | 0.6B | 0.6B / 4B / 8B | Qwen model size |
| --mode | str | zero_shot | zero_shot / supervised / unsupervised | Embedding mode |
| --kl_start | float | 0.0 | 0–1 | KL annealing start weight |
| --kl_end | float | 1.0 | 0–1 | KL annealing end weight |
| --kl_warmup | int | 50 | 0–epochs | KL warmup epochs |
| --language | str | zh | en / zh | Visualization language |

LDA

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --max_iter | int | 100 | 10–500 | Maximum EM iterations |
| --alpha | float | 1/K (auto) | >0 | Document-topic Dirichlet prior |

HDP

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --max_topics | int | 150 | 50–300 | Upper bound on number of topics (replaces --num_topics) |
| --alpha | float | 1.0 | >0 | Document-level concentration parameter |

STM

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --max_iter | int | 100 | 10–500 | Maximum EM iterations |

BTM

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --n_iter | int | 100 | 10–500 | Gibbs sampling iterations (replaces --epochs) |
| --alpha | float | 1.0 | >0 | Topic distribution Dirichlet prior |
| --beta | float | 0.01 | >0 | Word distribution Dirichlet prior |

ETM

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --embedding_dim | int | 300 | 50–1024 | Word embedding dimension (Word2Vec) |

CTM

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --inference_type | str | zeroshot | zeroshot / combined | Inference mode: SBERT only or SBERT + BOW |
| --hidden_dim | int | 100 | 32–1024 | Overrides common default (512 → 100) |

DTM

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --embedding_dim | int | 300 | 50–1024 | Word embedding dimension |

Note: DTM does not use --num_layers, --dropout, or --patience.
Data requirement: DTM requires a timestamp column. Run python prepare_data.py --dataset your_data --model dtm before training.
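
Before running the DTM preparation step, it can help to confirm that the raw data actually carries a timestamp column. The sketch below assumes the dataset is a CSV file with a header row and that the column is literally named timestamp; both the file layout and the column name are assumptions, since neither is specified here.

```python
import csv


def has_column(csv_path: str, column: str = "timestamp") -> bool:
    """Return True if the CSV header contains the given column name.

    Hypothetical check: the CSV layout and the default column name
    "timestamp" are assumptions, not documented requirements.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return column in header
```

If the check fails, fix the data first; prepare_data.py with --model dtm is the documented step that expects the timestamp information to be present.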

NVDM / GSM / ProdLDA

No additional parameters; all settings are covered by the common defaults.

Note: --hidden_dim defaults to 256 for these models.
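
The per-model --hidden_dim defaults documented above (512 in general, 100 for CTM, 256 for NVDM/GSM/ProdLDA) can be summarised in a small lookup. This is a sketch of the documented defaults only; the function name resolve_hidden_dim is hypothetical and the real pipeline may resolve defaults differently.

```python
from typing import Optional

COMMON_HIDDEN_DIM = 512  # common default from the shared parameter table
HIDDEN_DIM_OVERRIDES = {"ctm": 100, "nvdm": 256, "gsm": 256, "prodlda": 256}


def resolve_hidden_dim(model: str, cli_value: Optional[int] = None) -> int:
    """Pick the hidden size: an explicit CLI value wins, else the model default."""
    if cli_value is not None:
        return cli_value
    return HIDDEN_DIM_OVERRIDES.get(model.lower(), COMMON_HIDDEN_DIM)
```

This matters when several models are trained in one run: leaving --hidden_dim unset gives each model its own default, whereas passing it explicitly would presumably apply the same value to all of them.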

BERTopic

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --min_cluster_size | int | 10 | 2–100 | HDBSCAN minimum cluster size; controls topic granularity |
| --min_samples | int | None | 1–100 | HDBSCAN min_samples (defaults to min_cluster_size) |
| --top_n_words | int | 10 | 1–30 | Top words displayed per topic |
| --n_neighbors | int | 15 | 2–100 | UMAP number of neighbors |
| --n_components | int | 5 | 2–50 | UMAP reduced dimensions |
| --random_state | int | 42 | any int | UMAP random seed for reproducibility |

Note: BERTopic does not use --epochs, --batch_size, --learning_rate, or other neural training parameters.
--num_topics is optional; set to None for auto-detection.
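
To make the flag-to-component mapping concrete, the sketch below groups the BERTopic flags by the component they most plausibly configure. The grouping and the helper name bertopic_kwargs are assumptions about how the pipeline wires things together; the keyword names themselves (n_neighbors, n_components, random_state for UMAP; min_cluster_size, min_samples for HDBSCAN; top_n_words, nr_topics for BERTopic) are the ones those libraries accept.

```python
from typing import Optional


def bertopic_kwargs(
    num_topics: Optional[int] = None,
    min_cluster_size: int = 10,
    min_samples: Optional[int] = None,
    top_n_words: int = 10,
    n_neighbors: int = 15,
    n_components: int = 5,
    random_state: int = 42,
) -> dict:
    """Group CLI values into keyword arguments per component (hypothetical helper)."""
    return {
        "umap": {
            "n_neighbors": n_neighbors,
            "n_components": n_components,
            "random_state": random_state,
        },
        "hdbscan": {
            "min_cluster_size": min_cluster_size,
            # As documented above, min_samples falls back to min_cluster_size.
            "min_samples": min_samples if min_samples is not None else min_cluster_size,
        },
        "bertopic": {
            "top_n_words": top_n_words,
            # None leaves the number of topics to be determined automatically.
            "nr_topics": num_topics,
        },
    }
```

With the libraries installed, the groups could be passed as BERTopic(umap_model=UMAP(**k["umap"]), hdbscan_model=HDBSCAN(**k["hdbscan"]), **k["bertopic"]), where k is the returned dictionary.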


Pipeline Control Flags

KL Annealing

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --kl_start | float | 0.0 | 0.0–1.0 | Initial KL divergence weight |
| --kl_end | float | 1.0 | 0.0–1.0 | Final KL divergence weight |
| --kl_warmup | int | 50 | 0–200 | Number of warmup epochs for KL annealing |
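
One common way to use these three flags is a linear ramp of the KL weight from --kl_start to --kl_end over --kl_warmup epochs, holding at --kl_end afterwards. The sketch below assumes that linear shape and zero-based epoch counting; the actual schedule used by THETA is not specified here, and the function name kl_weight is hypothetical.

```python
def kl_weight(epoch: int, kl_start: float = 0.0, kl_end: float = 1.0,
              kl_warmup: int = 50) -> float:
    """Linearly interpolate the KL weight during warmup (assumed schedule).

    epoch is zero-based. With kl_warmup <= 0 the final weight is used at once.
    """
    if kl_warmup <= 0 or epoch >= kl_warmup:
        return kl_end
    progress = max(epoch, 0) / kl_warmup
    return kl_start + (kl_end - kl_start) * progress
```

Under this schedule the training loss at each epoch would be the reconstruction term plus kl_weight(epoch) times the KL term, so a longer warmup gives the encoder more epochs before the KL penalty reaches full strength.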

Early Stopping

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --patience | int | 10 | 1–50 | Epochs to wait before early stopping |
| --no_early_stopping | flag | False | N/A | Disable early stopping |
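
Patience-based early stopping is usually implemented as a counter of consecutive epochs without improvement. The sketch below shows that pattern under the assumption that the monitored quantity is a validation loss where lower is better; the class name EarlyStopping and the monitored metric are assumptions, not details confirmed for run_pipeline.py.

```python
class EarlyStopping:
    """Stop after `patience` consecutive epochs without a new best loss."""

    def __init__(self, patience: int = 10, enabled: bool = True):
        self.patience = patience
        self.enabled = enabled  # False mimics --no_early_stopping
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's loss and return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.enabled and self.bad_epochs >= self.patience
```

In a training loop, step() would be called once per epoch after validation, and the loop would break as soon as it returns True.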

Hardware Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| --gpu | int | 0 | GPU device ID |

Output Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| --language | string | en | Visualization language: en or zh |

Pipeline Control

| Parameter | Type | Default | Description |
|---|---|---|---|
| --skip-train | flag | False | Skip training, evaluate only |
| --skip-eval | flag | False | Skip evaluation |
| --skip-viz | flag | False | Skip visualization |
| --check-only | flag | False | Check data files only |
| --prepare | flag | False | Run preprocessing before training |
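
The way these flags combine can be pictured as selecting stages from a fixed sequence. The sketch below is one plausible reading: --check-only short-circuits everything else, --prepare adds a preprocessing stage first, and each --skip-* flag removes its stage. The stage names, their order, and the precedence of --check-only are assumptions for illustration.

```python
def plan_stages(skip_train: bool = False, skip_eval: bool = False,
                skip_viz: bool = False, check_only: bool = False,
                prepare: bool = False) -> list[str]:
    """Return the stages that would run for a given flag combination.

    Hypothetical model of the pipeline; stage names are illustrative.
    """
    if check_only:
        return ["check"]
    stages = []
    if prepare:
        stages.append("prepare")
    if not skip_train:
        stages.append("train")
    if not skip_eval:
        stages.append("evaluate")
    if not skip_viz:
        stages.append("visualize")
    return stages
```

For instance, plan_stages(skip_train=True) gives ["evaluate", "visualize"], which corresponds to re-evaluating an existing checkpoint.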

Examples

Basic THETA training:

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --epochs 100 \
    --gpu 0

Multiple baseline models:

python run_pipeline.py \
    --dataset my_dataset \
    --models lda,etm,ctm \
    --num_topics 20 \
    --epochs 100 \
    --gpu 0

Custom hyperparameters:

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 30 \
    --epochs 150 \
    --batch_size 32 \
    --hidden_dim 768 \
    --learning_rate 0.001 \
    --kl_start 0.0 \
    --kl_end 1.0 \
    --kl_warmup 80 \
    --patience 15 \
    --gpu 0

Evaluate existing model:

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --skip-train \
    --gpu 0
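
When the same command is launched many times (for example, sweeping --num_topics), it can be convenient to assemble the argument list in Python. The sketch below only builds the list; the helper name build_command is hypothetical, and it assumes the script is invoked with the current Python interpreter from the directory containing run_pipeline.py.

```python
import sys


def build_command(dataset: str, models: list[str], options: dict) -> list[str]:
    """Build an argv list for run_pipeline.py (hypothetical helper).

    Keys in `options` are flag names without the leading dashes, written
    exactly as documented (e.g. "num_topics", "skip-train").
    True adds a bare flag; None and False are omitted.
    """
    cmd = [sys.executable, "run_pipeline.py",
           "--dataset", dataset,
           "--models", ",".join(models)]
    for name, value in options.items():
        flag = "--" + name
        if value is True:
            cmd.append(flag)
        elif value is not None and value is not False:
            cmd.extend([flag, str(value)])
    return cmd
```

The resulting list can be passed directly to subprocess.run, which avoids shell quoting issues with dataset names.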


Output Files

THETA models:

./result/{model_size}/{dataset}/{mode}/
├── checkpoints/
│   ├── best_model.pt
│   └── training_history.json
├── metrics/
│   └── evaluation_results.json
└── visualizations/
    ├── topic_words_bars.png
    ├── topic_similarity.png
    ├── doc_topic_umap.png
    ├── topic_wordclouds.png
    ├── metrics.png
    └── pyldavis.html

Baseline models:

./result/baseline/{dataset}/{model}/K{num_topics}/
├── checkpoints/
├── metrics/
└── visualizations/
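
The two directory layouts above can be expressed as small path helpers, which is handy when collecting evaluation_results.json files across runs. The function names are hypothetical; the path patterns are taken directly from the layouts shown.

```python
from pathlib import Path


def theta_result_dir(model_size: str, dataset: str, mode: str,
                     root: str = "./result") -> Path:
    """./result/{model_size}/{dataset}/{mode}/"""
    return Path(root) / model_size / dataset / mode


def baseline_result_dir(dataset: str, model: str, num_topics: int,
                        root: str = "./result") -> Path:
    """./result/baseline/{dataset}/{model}/K{num_topics}/"""
    return Path(root) / "baseline" / dataset / model / f"K{num_topics}"


def theta_metrics_file(model_size: str, dataset: str, mode: str,
                       root: str = "./result") -> Path:
    """Location of evaluation_results.json for a THETA run."""
    return theta_result_dir(model_size, dataset, mode, root) / "metrics" / "evaluation_results.json"
```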


Return Codes

| Exit Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | File not found |
| 3 | Invalid parameters |
| 4 | CUDA out of memory |
| 5 | Convergence failure |
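
A wrapper script can translate these exit codes into readable messages. The sketch below mirrors the table as an enum; the names ExitCode and describe_exit are chosen for this example and are not part of run_pipeline.py.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as listed in the table above."""
    SUCCESS = 0
    GENERAL_ERROR = 1
    FILE_NOT_FOUND = 2
    INVALID_PARAMETERS = 3
    CUDA_OUT_OF_MEMORY = 4
    CONVERGENCE_FAILURE = 5


def describe_exit(code: int) -> str:
    """Return a lowercase description, or a fallback for undocumented codes."""
    try:
        return ExitCode(code).name.replace("_", " ").lower()
    except ValueError:
        return f"unknown exit code {code}"
```

A caller would typically read the code from subprocess.run(...).returncode and, for example, retry with a smaller --batch_size when the result is ExitCode.CUDA_OUT_OF_MEMORY.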