Training Models¶
This guide covers training THETA and baseline models with various configurations.
THETA Model Training¶
Basic Training¶
Train a THETA model with default hyperparameters:
cd /root/autodl-tmp/ETM
python run_pipeline.py \
--dataset my_dataset \
--models theta \
--model_size 0.6B \
--mode zero_shot \
--num_topics 20 \
--epochs 100 \
--batch_size 64 \
--hidden_dim 512 \
--learning_rate 0.002 \
--kl_start 0.0 \
--kl_end 1.0 \
--kl_warmup 50 \
--patience 10 \
--gpu 0 \
--language en
Training typically completes in 20-40 minutes depending on dataset size and hardware.
Topic Number Selection¶
The number of topics is a key hyperparameter that affects granularity:
| Topics | Appropriate For |
|---|---|
| 10-15 | Small corpora, broad categories, high-level overview |
| 20-30 | Medium corpora, balanced granularity, default choice |
| 40-100 | Large diverse corpora, fine-grained analysis |
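If you are unsure which granularity suits your corpus, one option is to train several configurations and compare the resulting topics. The loop below is a minimal sketch that reuses the zero-shot command from above; adjust the dataset name and other flags to match your setup:
cd /root/autodl-tmp/ETM
for k in 10 20 40; do
python run_pipeline.py \
--dataset my_dataset \
--models theta \
--model_size 0.6B \
--mode zero_shot \
--num_topics $k \
--epochs 100 \
--batch_size 64 \
--gpu 0 \
--language en
done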
Learning Rate Tuning¶
| Learning Rate | Use When |
|---|---|
| 0.001 | Training is unstable, loss oscillates |
| 0.002 | Default choice for most datasets |
| 0.005 | Training is too slow, need faster convergence |
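For example, if the loss oscillates at the default rate, a run with the more conservative 0.001 setting looks like this (a sketch; all other flags as in the basic command above):
python run_pipeline.py \
--dataset my_dataset \
--models theta \
--model_size 0.6B \
--mode zero_shot \
--num_topics 20 \
--epochs 100 \
--batch_size 64 \
--learning_rate 0.001 \
--gpu 0 \
--language en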
KL Annealing Configuration¶
KL annealing gradually increases the KL divergence weight during training to prevent posterior collapse.
Standard KL annealing: the weight increases linearly from 0.0 to 1.0 over the first 50 epochs (the defaults used in the basic command above).
Slow KL annealing: --kl_warmup 80 gives a longer warmup period, which helps prevent early posterior collapse.
Partial KL annealing: --kl_start 0.1 --kl_end 0.9 --kl_warmup 30 starts at a non-zero weight and stops short of the full weight.
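Putting the flags together, a slow-annealing run keeps the default start and end weights and only lengthens the warmup. This is a sketch that assumes the same dataset and settings as the basic command above:
python run_pipeline.py \
--dataset my_dataset \
--models theta \
--model_size 0.6B \
--mode zero_shot \
--num_topics 20 \
--epochs 100 \
--batch_size 64 \
--kl_start 0.0 \
--kl_end 1.0 \
--kl_warmup 80 \
--gpu 0 \
--language en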
Hidden Dimension Configuration¶
| Hidden Dim | Use When |
|---|---|
| 256 | Small datasets, faster training, limited VRAM |
| 512 | Default choice for most datasets |
| 768-1024 | Large complex datasets, sufficient VRAM available |
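For instance, on a small dataset or a GPU with limited VRAM, you could drop to the smaller hidden dimension while keeping the rest of the basic command unchanged; a sketch:
python run_pipeline.py \
--dataset my_dataset \
--models theta \
--model_size 0.6B \
--mode zero_shot \
--num_topics 20 \
--epochs 100 \
--batch_size 64 \
--hidden_dim 256 \
--gpu 0 \
--language en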
Early Stopping¶
Early stopping prevents overfitting by monitoring validation performance:
- Default: --patience 10 stops training if there is no improvement for 10 epochs.
- Disabled: --no_early_stopping trains for all specified epochs.
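For a fixed-length run, for example when comparing configurations at an identical epoch budget, early stopping can be turned off; a sketch based on the basic command above:
python run_pipeline.py \
--dataset my_dataset \
--models theta \
--model_size 0.6B \
--mode zero_shot \
--num_topics 20 \
--epochs 100 \
--batch_size 64 \
--no_early_stopping \
--gpu 0 \
--language en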
Chinese Data Training¶
python run_pipeline.py \
--dataset chinese_dataset \
--models theta \
--model_size 0.6B \
--mode zero_shot \
--num_topics 20 \
--epochs 100 \
--batch_size 64 \
--gpu 0 \
--language zh
The language parameter affects visualization rendering (fonts, layout) but does not change the training algorithm.
Supervised Training¶
For datasets with labels:
python run_pipeline.py \
--dataset labeled_dataset \
--models theta \
--model_size 0.6B \
--mode supervised \
--num_topics 20 \
--epochs 100 \
--batch_size 64 \
--gpu 0 \
--language en
The model incorporates label information to guide topic discovery.
Baseline Model Training¶
LDA Training¶
python run_pipeline.py \
--dataset my_dataset \
--models lda \
--num_topics 20 \
--epochs 100 \
--batch_size 64 \
--gpu 0 \
--language en
LDA uses Gibbs sampling and does not utilize GPU acceleration.
ETM Training¶
python run_pipeline.py \
--dataset my_dataset \
--models etm \
--num_topics 20 \
--epochs 100 \
--batch_size 64 \
--hidden_dim 512 \
--learning_rate 0.002 \
--gpu 0 \
--language en
ETM uses Word2Vec embeddings (300 dimensions).
CTM Training¶
python run_pipeline.py \
--dataset my_dataset \
--models ctm \
--num_topics 20 \
--epochs 100 \
--batch_size 64 \
--hidden_dim 512 \
--learning_rate 0.002 \
--gpu 0 \
--language en
CTM uses SBERT embeddings (768 dimensions).
DTM Training¶
python run_pipeline.py \
--dataset temporal_dataset \
--models dtm \
--num_topics 20 \
--epochs 100 \
--batch_size 64 \
--hidden_dim 512 \
--learning_rate 0.002 \
--gpu 0 \
--language en
DTM models topic evolution across time slices defined by the time column in preprocessing.
Training Multiple Models¶
Compare multiple models in a single run:
python run_pipeline.py \
--dataset my_dataset \
--models lda,etm,ctm \
--num_topics 20 \
--epochs 100 \
--batch_size 64 \
--gpu 0 \
--language en
Models train sequentially. Results are saved in separate directories for comparison.