
Hyperparameter Tuning

Systematic guide to optimizing THETA hyperparameters.


Learning Rate Scheduling

Conservative approach (use when training is unstable):

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --learning_rate 0.0005 \
    --epochs 150 \
    --gpu 0

Standard approach:

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --learning_rate 0.002 \
    --epochs 100 \
    --gpu 0

Aggressive approach (use when convergence is too slow):

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --learning_rate 0.01 \
    --epochs 80 \
    --gpu 0

Monitor the training loss curves to decide whether the learning rate needs adjustment: a loss that oscillates or diverges suggests lowering it, while a loss that plateaus very early suggests raising it.
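
As a rough check (the log path and line format below are assumptions, not something the pipeline guarantees), you can pull recent loss values out of a training log:

# Assumed log location and "loss: <value>" line format -- adjust to your run.
LOG=result/0.6B/my_dataset/zero_shot/training.log

# Show the last 20 reported loss values to see whether the curve is still
# decreasing, has plateaued, or is oscillating.
grep -io "loss[: ]*[0-9.]*" "$LOG" | tail -n 20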


Batch Size Optimization

| Batch Size | Advantages | Disadvantages |
|------------|------------|---------------|
| 32 | Lower memory, better exploration | Noisy updates, slower convergence |
| 64 | Balanced (default) | |
| 128 | Stable updates, faster epochs | Higher memory, may overfit |
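
For example, a run with the larger batch size, keeping the other settings from the standard learning-rate example above:

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --batch_size 128 \
    --gpu 0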

KL Annealing Strategies

No annealing (immediate full KL): --kl_start 1.0 --kl_end 1.0 --kl_warmup 0 (risk: posterior collapse and poor topic quality)

Standard annealing (recommended): --kl_start 0.0 --kl_end 1.0 --kl_warmup 50

Slow annealing (complex data): --kl_start 0.0 --kl_end 1.0 --kl_warmup 80

Partial annealing (fine-tuning): --kl_start 0.2 --kl_end 0.8 --kl_warmup 40
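
Put together with the standard run from above, the recommended schedule looks like this:

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --kl_start 0.0 \
    --kl_end 1.0 \
    --kl_warmup 50 \
    --gpu 0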


Hidden Dimension Tuning

| Hidden Dim | Use Case |
|------------|----------|
| 256 | Small datasets or memory-constrained setups |
| 512 | Default choice for most applications |
| 1024 | Large, complex datasets when VRAM permits |
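
For example, to train with the larger hidden dimension on a big corpus:

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --hidden_dim 1024 \
    --gpu 0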

Early Stopping Configuration

| Patience | Behavior |
|----------|----------|
| 5 | Stops quickly if validation loss plateaus |
| 10 | Default setting |
| 20 | Allows longer training before stopping |
| Disabled (--no_early_stopping) | Trains for all specified epochs |
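
For example, to disable early stopping and train for the full epoch budget:

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --epochs 100 \
    --no_early_stopping \
    --gpu 0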

Vocabulary Size Selection

| Corpus Size | Vocabulary Size | Coverage |
|-------------|-----------------|----------|
| < 1K docs | 2000-3000 | ~85% |
| 1K-10K docs | 5000 | ~90% |
| 10K-100K docs | 8000-10000 | ~92% |
| > 100K docs | 10000-15000 | ~95% |
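
Vocabulary size is set during data preparation. For a corpus in the 10K-100K range, for example:

python prepare_data.py \
    --dataset my_dataset \
    --model theta \
    --model_size 0.6B \
    --mode zero_shot \
    --vocab_size 10000 \
    --gpu 0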

Using Different Model Sizes

Scaling Strategy

Development workflow:

1. Start with the 0.6B model
2. Optimize hyperparameters
3. Scale to 4B for production
4. Use 8B for final results if needed

Quick comparison:

for size in 0.6B 4B 8B; do
    python run_pipeline.py \
        --dataset my_dataset \
        --models theta \
        --model_size $size \
        --mode zero_shot \
        --num_topics 20 \
        --gpu 0
done

Quality vs Cost Analysis

0.6B → 4B:

- Topic diversity: +3-5%
- Coherence (NPMI): +10-15%
- Training time: +60-80%

4B → 8B:

- Topic diversity: +1-2%
- Coherence (NPMI): +5-8%
- Training time: +80-100%

Diminishing returns suggest 4B is often the best choice for production.


Grid Search

Systematic hyperparameter exploration:

#!/bin/bash
topics=(15 20 25 30)
learning_rates=(0.001 0.002 0.005)
hidden_dims=(256 512 768)

for K in "${topics[@]}"; do
    for lr in "${learning_rates[@]}"; do
        for hd in "${hidden_dims[@]}"; do
            echo "Training K=$K, lr=$lr, hd=$hd"

            python run_pipeline.py \
                --dataset my_dataset \
                --models theta \
                --model_size 0.6B \
                --mode zero_shot \
                --num_topics $K \
                --learning_rate $lr \
                --hidden_dim $hd \
                --epochs 100 \
                --batch_size 64 \
                --gpu 0

            mkdir -p results_grid/K${K}_lr${lr}_hd${hd}
            cp -r result/0.6B/my_dataset/zero_shot/* results_grid/K${K}_lr${lr}_hd${hd}/
        done
    done
done
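
To compare runs after the sweep, list the copied result directories and inspect whatever evaluation files the pipeline writes (the file name below is a placeholder, not the pipeline's actual output name):

# List every grid configuration that produced output.
ls -d results_grid/K*_lr*_hd*

# Print one metric file per run, if present.
# "evaluation.txt" is a placeholder -- substitute the file your pipeline writes.
for dir in results_grid/K*_lr*_hd*; do
    echo "== $dir =="
    cat "$dir/evaluation.txt" 2>/dev/null
done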

Batch Processing Multiple Datasets

#!/bin/bash
datasets=("news" "reviews" "papers" "social")

for dataset in "${datasets[@]}"; do
    echo "Processing $dataset..."

    python prepare_data.py \
        --dataset $dataset \
        --model theta \
        --model_size 0.6B \
        --mode zero_shot \
        --vocab_size 5000 \
        --gpu 0

    python run_pipeline.py \
        --dataset $dataset \
        --models theta \
        --model_size 0.6B \
        --mode zero_shot \
        --num_topics 20 \
        --gpu 0
done

Parallel Processing on Multiple GPUs

# Terminal 1
CUDA_VISIBLE_DEVICES=0 python run_pipeline.py \
    --dataset dataset1 --models theta --gpu 0 &

# Terminal 2  
CUDA_VISIBLE_DEVICES=1 python run_pipeline.py \
    --dataset dataset2 --models theta --gpu 0 &

# Terminal 3
CUDA_VISIBLE_DEVICES=2 python run_pipeline.py \
    --dataset dataset3 --models theta --gpu 0 &

Each process runs on its own physical GPU. Because CUDA_VISIBLE_DEVICES restricts visibility, --gpu 0 refers to the first (and only) device each process can see.
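
The same idea can run from a single script; the sketch below assumes three datasets and three visible GPUs:

#!/bin/bash
# Assign each dataset to its own GPU and launch the runs in parallel.
datasets=("dataset1" "dataset2" "dataset3")

for i in "${!datasets[@]}"; do
    CUDA_VISIBLE_DEVICES=$i python run_pipeline.py \
        --dataset "${datasets[$i]}" \
        --models theta \
        --model_size 0.6B \
        --mode zero_shot \
        --num_topics 20 \
        --gpu 0 &
done
wait   # Block until all background runs finish.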