
Hyperparameter Tuning

Systematic guide to optimizing THETA hyperparameters.


Parameter Reference

Common Parameters

Shared across all or most models. Parameters marked * apply to neural network–based models only.

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --num_topics | int | 20 | 5–100 | Number of topics K (upper bound for HDP; optional for BERTopic) |
| --vocab_size | int | 5000 | 1000–20000 | Vocabulary size |
| --epochs * | int | 100 | 10–500 | Training epochs |
| --batch_size * | int | 64 | 8–512 | Mini-batch size |
| --learning_rate * | float | 0.002 | 1e-5–0.1 | Learning rate |
| --dropout * | float | 0.2 | 0–0.9 | Encoder dropout rate |
| --hidden_dim * | int | 512 | 128–2048 | Hidden units per layer (NVDM/GSM/ProdLDA default: 256) |
| --num_layers * | int | 2 | 1–5 | Number of encoder hidden layers |
| --patience * | int | 10 | 1–50 | Early stopping patience |

Model-Specific Parameters

THETA

Additional parameters beyond common defaults:

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --model_size | str | 0.6B | 0.6B / 4B / 8B | Qwen model size |
| --mode | str | zero_shot | zero_shot / supervised / unsupervised | Embedding mode |
| --kl_start | float | 0.0 | 0–1 | KL annealing start weight |
| --kl_end | float | 1.0 | 0–1 | KL annealing end weight |
| --kl_warmup | int | 50 | 0–epochs | KL warmup epochs |
| --language | str | zh | en / zh | Visualization language |

LDA

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --max_iter | int | 100 | 10–500 | Maximum EM iterations |
| --alpha | float | 1/K (auto) | >0 | Document-topic Dirichlet prior |

HDP

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --max_topics | int | 150 | 50–300 | Upper bound on number of topics (replaces --num_topics) |
| --alpha | float | 1.0 | >0 | Document-level concentration parameter |

STM

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --max_iter | int | 100 | 10–500 | Maximum EM iterations |

BTM

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --n_iter | int | 100 | 10–500 | Gibbs sampling iterations (replaces --epochs) |
| --alpha | float | 1.0 | >0 | Topic distribution Dirichlet prior |
| --beta | float | 0.01 | >0 | Word distribution Dirichlet prior |

ETM

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --embedding_dim | int | 300 | 50–1024 | Word embedding dimension (Word2Vec) |

CTM

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --inference_type | str | zeroshot | zeroshot / combined | Inference mode: SBERT only or SBERT + BOW |
| --hidden_dim | int | 100 | 32–1024 | Overrides common default (512 → 100) |

DTM

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --embedding_dim | int | 300 | 50–1024 | Word embedding dimension |

Note: DTM does not use --num_layers, --dropout, or --patience.
Data requirement: DTM requires a timestamp column. Run python prepare_data.py --dataset your_data --model dtm before training.

NVDM / GSM / ProdLDA

No additional parameters — all settings covered by common defaults.

Note: --hidden_dim defaults to 256 for these models.

BERTopic

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| --min_cluster_size | int | 10 | 2–100 | HDBSCAN minimum cluster size; controls topic granularity |
| --min_samples | int | None | 1–100 | HDBSCAN min_samples (defaults to min_cluster_size) |
| --top_n_words | int | 10 | 1–30 | Top words displayed per topic |
| --n_neighbors | int | 15 | 2–100 | UMAP number of neighbors |
| --n_components | int | 5 | 2–50 | UMAP reduced dimensions |
| --random_state | int | 42 | any int | UMAP random seed for reproducibility |

Note: BERTopic does not use --epochs, --batch_size, --learning_rate, or other neural training parameters.
--num_topics is optional; set to None for auto-detection.


Tuning Strategies

Learning Rate Scheduling

Conservative approach (use when training is unstable):

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --learning_rate 0.0005 \
    --epochs 150 \
    --gpu 0

Standard approach:

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --learning_rate 0.002 \
    --epochs 100 \
    --gpu 0

Aggressive approach (use when convergence is slow):

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --learning_rate 0.01 \
    --epochs 80 \
    --gpu 0

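Monitor the training loss curve to decide which setting fits: a loss that oscillates or diverges points to the conservative settings, while a loss that decreases steadily but slowly points to the aggressive ones.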


Batch Size Optimization

| Batch Size | Advantages | Disadvantages |
|---|---|---|
| 32 | Lower memory, better exploration | Noisy updates, slower convergence |
| 64 | Balanced (default) | |
| 128 | Stable updates, faster epochs | Higher memory, may overfit |
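
The "faster epochs" entry follows from simple arithmetic: a larger batch means fewer optimizer steps per pass over the data. The sketch below illustrates this for a hypothetical corpus of 10,000 documents. It assumes the final partial batch is kept rather than dropped, which may not match what the pipeline does.

```python
import math


def steps_per_epoch(num_docs: int, batch_size: int) -> int:
    """Optimizer steps per epoch, assuming the last partial batch is kept."""
    return math.ceil(num_docs / batch_size)


# Hypothetical corpus size, used only for illustration.
NUM_DOCS = 10_000

for batch_size in (32, 64, 128):
    print(f"batch_size={batch_size}: {steps_per_epoch(NUM_DOCS, batch_size)} steps per epoch")
```

Doubling the batch size roughly halves the number of updates per epoch, so when moving to 128 it can be worth checking that the total number of updates is still sufficient.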

KL Annealing Strategies

No annealing (immediate full KL): --kl_start 1.0 --kl_end 1.0 --kl_warmup 0
Risk: posterior collapse and poor topic quality.

Standard annealing (recommended): --kl_start 0.0 --kl_end 1.0 --kl_warmup 50

Slow annealing (complex data): --kl_start 0.0 --kl_end 1.0 --kl_warmup 80

Partial annealing (fine-tuning): --kl_start 0.2 --kl_end 0.8 --kl_warmup 40
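
To make the four strategies concrete, the sketch below computes the KL weight per epoch under the assumption of a linear ramp from --kl_start to --kl_end over --kl_warmup epochs. The linear shape and the `kl_weight` helper are illustrative assumptions; the schedule THETA actually implements may differ.

```python
def kl_weight(epoch: int, kl_start: float = 0.0, kl_end: float = 1.0,
              kl_warmup: int = 50) -> float:
    """KL weight at a 0-based epoch, assuming a linear warmup (hypothetical)."""
    if kl_warmup <= 0:
        # No warmup: the final weight applies from the first epoch.
        return kl_end
    progress = min(1.0, epoch / kl_warmup)
    return kl_start + (kl_end - kl_start) * progress


# (kl_start, kl_end, kl_warmup) for each strategy listed above.
strategies = {
    "none": (1.0, 1.0, 0),
    "standard": (0.0, 1.0, 50),
    "slow": (0.0, 1.0, 80),
    "partial": (0.2, 0.8, 40),
}

for name, (start, end, warmup) in strategies.items():
    weights = [round(kl_weight(e, start, end, warmup), 3) for e in (0, 20, 40, 80)]
    print(f"{name:8s} weights at epochs 0/20/40/80: {weights}")
```

Under this assumption, partial annealing never drops the KL term below 0.2 and never raises it above 0.8, which is the main difference from the standard schedule.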


Hidden Dimension Tuning

| Hidden Dim | Use Case |
|---|---|
| 256 | Small datasets or memory constrained |
| 512 | Default choice for most applications |
| 1024 | Large complex datasets when VRAM permits |

Early Stopping Configuration

| Patience | Behavior |
|---|---|
| 5 | Stops quickly if validation loss plateaus |
| 10 | Default setting |
| 20 | Allows longer training before stopping |
| Disabled (--no_early_stopping) | Trains for all specified epochs |
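
The effect of the patience value can be sketched with the usual early-stopping rule: stop once the validation loss has failed to improve for `patience` consecutive epochs. The `stopping_epoch` helper and the loss values below are hypothetical, and the pipeline's exact criterion (for example, a minimum improvement threshold) is not specified here.

```python
def stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up curve: improves for 3 epochs, then plateaus for 30 more.
losses = [1.0, 0.9, 0.8] + [0.85] * 30

for patience in (5, 10, 20):
    print(f"patience={patience}: stops at epoch {stopping_epoch(losses, patience)}")
```

On this curve the best loss occurs at epoch 3, so each setting stops exactly `patience` epochs later.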

Vocabulary Size Selection

| Corpus Size | Vocabulary Size | Coverage |
|---|---|---|
| < 1K docs | 2000–3000 | ~85% |
| 1K–10K docs | 5000 | ~90% |
| 10K–100K docs | 8000–10000 | ~92% |
| > 100K docs | 10000–15000 | ~95% |
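
The coverage figures above are rough guides, so it can help to measure coverage on your own corpus before fixing --vocab_size. The sketch below assumes "coverage" means the share of all token occurrences that fall inside the most frequent `vocab_size` terms. The `token_coverage` helper is illustrative and not part of the pipeline.

```python
from collections import Counter


def token_coverage(tokenized_docs, vocab_size):
    """Fraction of token occurrences covered by the top `vocab_size` terms."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    kept = sum(count for _, count in counts.most_common(vocab_size))
    return kept / total


# Toy corpus, already tokenized; replace with your own documents.
docs = [["topic", "model", "topic"], ["topic", "word", "prior"]]

for size in (1, 2, 4):
    print(f"vocab_size={size}: coverage={token_coverage(docs, size):.2f}")
```

If measured coverage falls well below the value in the table for your corpus size, increasing --vocab_size is a reasonable first adjustment.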

Using Different Model Sizes

Scaling Strategy

Development workflow:

1. Start with the 0.6B model.
2. Optimize hyperparameters.
3. Scale to 4B for production.
4. Use 8B for final results if needed.

Quick comparison:

for size in 0.6B 4B 8B; do
    python run_pipeline.py \
        --dataset my_dataset \
        --models theta \
        --model_size $size \
        --mode zero_shot \
        --num_topics 20 \
        --gpu 0
done

Quality vs Cost Analysis

0.6B → 4B:
- Topic diversity: +3–5%
- Coherence (NPMI): +10–15%
- Training time: +60–80%

4B → 8B:
- Topic diversity: +1–2%
- Coherence (NPMI): +5–8%
- Training time: +80–100%

Diminishing returns suggest 4B is often the best choice for production.
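
To see what skipping 4B and going straight from 0.6B to 8B would imply, the two steps can be chained. The sketch below assumes the percentages are relative changes that compound multiplicatively, which the figures above do not state explicitly, so treat the result as a rough estimate.

```python
def compound(first, second):
    """Compound two relative increases, each given as a (low, high) pair of fractions."""
    low = (1 + first[0]) * (1 + second[0]) - 1
    high = (1 + first[1]) * (1 + second[1]) - 1
    return low, high


# Ranges taken from the two steps above (0.6B -> 4B, then 4B -> 8B).
time_low, time_high = compound((0.60, 0.80), (0.80, 1.00))
npmi_low, npmi_high = compound((0.10, 0.15), (0.05, 0.08))

print(f"0.6B -> 8B training time: +{time_low:.0%} to +{time_high:.0%}")
print(f"0.6B -> 8B coherence (NPMI): +{npmi_low:.1%} to +{npmi_high:.1%}")
```

Under that assumption, 8B costs roughly +188% to +260% training time over 0.6B for about +15.5% to +24.2% NPMI, compared with +60–80% time for +10–15% NPMI at 4B.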


Systematic hyperparameter exploration:

#!/bin/bash
topics=(15 20 25 30)
learning_rates=(0.001 0.002 0.005)
hidden_dims=(256 512 768)

for K in "${topics[@]}"; do
    for lr in "${learning_rates[@]}"; do
        for hd in "${hidden_dims[@]}"; do
            echo "Training K=$K, lr=$lr, hd=$hd"

            python run_pipeline.py \
                --dataset my_dataset \
                --models theta \
                --model_size 0.6B \
                --mode zero_shot \
                --num_topics $K \
                --learning_rate $lr \
                --hidden_dim $hd \
                --epochs 100 \
                --batch_size 64 \
                --gpu 0

            # Save this run's outputs under a per-configuration directory
            out_dir="results_grid/K${K}_lr${lr}_hd${hd}"
            mkdir -p "$out_dir"
            cp -r result/0.6B/my_dataset/zero_shot/* "$out_dir"/
        done
    done
done
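
Before launching a grid like the one above, it is worth counting the runs it implies. The sketch below enumerates the same three lists with `itertools.product` and builds the matching `run_pipeline.py` argument lists without executing them. The `grid_commands` helper is illustrative only.

```python
from itertools import product

topics = [15, 20, 25, 30]
learning_rates = [0.001, 0.002, 0.005]
hidden_dims = [256, 512, 768]


def grid_commands(dataset="my_dataset"):
    """Build one argument list per (K, lr, hidden_dim) combination."""
    commands = []
    for k, lr, hd in product(topics, learning_rates, hidden_dims):
        commands.append([
            "python", "run_pipeline.py",
            "--dataset", dataset,
            "--models", "theta",
            "--model_size", "0.6B",
            "--mode", "zero_shot",
            "--num_topics", str(k),
            "--learning_rate", str(lr),
            "--hidden_dim", str(hd),
            "--epochs", "100",
            "--batch_size", "64",
            "--gpu", "0",
        ])
    return commands


commands = grid_commands()
print(f"{len(commands)} runs in total")
print(" ".join(commands[0]))
```

The grid is 4 × 3 × 3 = 36 full training runs, so trimming one of the lists is often worthwhile when time on the GPU is limited.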

Batch Processing Multiple Datasets

#!/bin/bash
datasets=("news" "reviews" "papers" "social")

for dataset in "${datasets[@]}"; do
    echo "Processing $dataset..."

    python prepare_data.py \
        --dataset $dataset \
        --model theta \
        --model_size 0.6B \
        --mode zero_shot \
        --vocab_size 5000 \
        --gpu 0

    python run_pipeline.py \
        --dataset $dataset \
        --models theta \
        --model_size 0.6B \
        --mode zero_shot \
        --num_topics 20 \
        --gpu 0
done

Parallel Processing on Multiple GPUs

# Terminal 1
CUDA_VISIBLE_DEVICES=0 python run_pipeline.py \
    --dataset dataset1 --models theta --gpu 0 &

# Terminal 2  
CUDA_VISIBLE_DEVICES=1 python run_pipeline.py \
    --dataset dataset2 --models theta --gpu 0 &

# Terminal 3
CUDA_VISIBLE_DEVICES=2 python run_pipeline.py \
    --dataset dataset3 --models theta --gpu 0 &

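Each process uses a different GPU: CUDA_VISIBLE_DEVICES exposes a single physical device to each process, which that process then sees as device 0, so --gpu 0 is correct in all three commands.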
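
The same pattern can be scripted from Python when the number of datasets varies. The sketch below pairs each dataset with one GPU by setting CUDA_VISIBLE_DEVICES in the child environment. The `build_jobs` and `launch` helpers are hypothetical, and the launch call is left commented out so the example only prints what it would run.

```python
import os
import subprocess


def build_jobs(datasets):
    """Pair each dataset with its own GPU via CUDA_VISIBLE_DEVICES."""
    jobs = []
    for gpu_id, dataset in enumerate(datasets):
        # Each child sees exactly one GPU, which it addresses as device 0.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
        cmd = ["python", "run_pipeline.py",
               "--dataset", dataset, "--models", "theta", "--gpu", "0"]
        jobs.append((env, cmd))
    return jobs


def launch(jobs):
    """Start all jobs concurrently and return their exit codes."""
    procs = [subprocess.Popen(cmd, env=env) for env, cmd in jobs]
    return [proc.wait() for proc in procs]


jobs = build_jobs(["dataset1", "dataset2", "dataset3"])
for env, cmd in jobs:
    print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']}", " ".join(cmd))

# exit_codes = launch(jobs)  # uncomment to actually start the runs
```

This assumes at least as many GPUs as datasets. With fewer GPUs, the jobs would need to be queued rather than started all at once.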