# Hyperparameter Tuning

Systematic guide to optimizing THETA hyperparameters.
## Learning Rate Scheduling

Conservative approach (use when training is unstable):

```bash
python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --learning_rate 0.0005 \
    --epochs 150 \
    --gpu 0
```
Standard approach:

```bash
python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --learning_rate 0.002 \
    --epochs 100 \
    --gpu 0
```
Aggressive approach (use when convergence is too slow):

```bash
python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --learning_rate 0.01 \
    --epochs 80 \
    --gpu 0
```
Monitor the training loss curves to determine whether the learning rate needs adjustment.
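If the pipeline logs per-epoch losses to a file, you can eyeball the curve straight from the shell. The log path and line format below are assumptions for illustration; adjust both to match what run_pipeline.py actually writes on your setup:

```bash
# Assumed log location and line format ("Epoch 12 | loss: 1.234");
# verify against your actual pipeline output before relying on this.
LOG=result/0.6B/my_dataset/zero_shot/train.log

grep -Eo 'Epoch [0-9]+ \| loss: [0-9.]+' "$LOG"
```

A loss that oscillates or grows points toward a lower --learning_rate; a loss still falling at the last epoch points toward more --epochs or a slightly higher rate.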
## Batch Size Optimization
| Batch Size | Advantages | Disadvantages |
|---|---|---|
| 32 | Lower memory, better exploration | Noisy updates, slower convergence |
| 64 | Balanced (default) | — |
| 128 | Stable updates, faster epochs | Higher memory, may overfit |
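To see where the memory/stability trade-off from the table lands on your hardware, a short sweep at the 0.6B scale is cheap. This sketch reuses only flags that appear elsewhere in this guide:

```bash
# Train the same configuration at each batch size, then compare the
# resulting loss curves and topic quality metrics.
for bs in 32 64 128; do
    python run_pipeline.py \
        --dataset my_dataset \
        --models theta \
        --model_size 0.6B \
        --mode zero_shot \
        --num_topics 20 \
        --batch_size $bs \
        --gpu 0
done
```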
## KL Annealing Strategies

No annealing (immediate full KL):

```bash
--kl_start 1.0 --kl_end 1.0 --kl_warmup 0
```

Risk: posterior collapse and poor topic quality.

Standard annealing (recommended):

```bash
--kl_start 0.0 --kl_end 1.0 --kl_warmup 50
```

Slow annealing (for complex data):

```bash
--kl_start 0.0 --kl_end 1.0 --kl_warmup 80
```

Partial annealing (for fine-tuning):

```bash
--kl_start 0.2 --kl_end 0.8 --kl_warmup 40
```
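Assuming the annealing flags are accepted by run_pipeline.py alongside the options shown earlier, the recommended standard schedule in a full command looks like this:

```bash
python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --kl_start 0.0 \
    --kl_end 1.0 \
    --kl_warmup 50 \
    --epochs 100 \
    --gpu 0
```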
## Hidden Dimension Tuning
| Hidden Dim | Use Case |
|---|---|
| 256 | Small datasets or memory constrained |
| 512 | Default choice for most applications |
| 1024 | Large complex datasets when VRAM permits |
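When it is unclear whether the extra capacity helps, comparing the default against a smaller dimension directly on your data is straightforward; both values below use the --hidden_dim flag from the grid-search section:

```bash
# Train once per hidden dimension and compare topic quality.
for hd in 256 512; do
    python run_pipeline.py \
        --dataset my_dataset \
        --models theta \
        --model_size 0.6B \
        --mode zero_shot \
        --num_topics 20 \
        --hidden_dim $hd \
        --gpu 0
done
```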
## Early Stopping Configuration

| Patience | Behavior |
|---|---|
| 5 | Stops quickly if validation loss plateaus |
| 10 | Default setting |
| 20 | Allows longer training before stopping |
| Disabled (--no_early_stopping) | Trains for all specified epochs |
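The patience value is presumably set via a dedicated flag (check python run_pipeline.py --help for its exact name); disabling early stopping uses the flag from the table:

```bash
# Train for all specified epochs regardless of validation loss.
python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --epochs 150 \
    --no_early_stopping \
    --gpu 0
```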
## Vocabulary Size Selection
| Corpus Size | Vocabulary Size | Coverage |
|---|---|---|
| < 1K docs | 2000-3000 | ~85% |
| 1K-10K docs | 5000 | ~90% |
| 10K-100K docs | 8000-10000 | ~92% |
| > 100K docs | 10000-15000 | ~95% |
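Vocabulary size is fixed at preprocessing time, so choose it before training. For a corpus in the 1K-10K range, following the table (this mirrors the prepare_data.py call from the batch-processing section below):

```bash
python prepare_data.py \
    --dataset my_dataset \
    --model theta \
    --model_size 0.6B \
    --mode zero_shot \
    --vocab_size 5000 \
    --gpu 0
```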
## Using Different Model Sizes

### Scaling Strategy

Development workflow:

1. Start with the 0.6B model
2. Optimize hyperparameters
3. Scale to 4B for production
4. Use 8B for final results if needed
Quick comparison:

```bash
for size in 0.6B 4B 8B; do
    python run_pipeline.py \
        --dataset my_dataset \
        --models theta \
        --model_size $size \
        --mode zero_shot \
        --num_topics 20 \
        --gpu 0
done
```
### Quality vs Cost Analysis

0.6B → 4B:

- Topic diversity: +3-5%
- Coherence (NPMI): +10-15%
- Training time: +60-80%

4B → 8B:

- Topic diversity: +1-2%
- Coherence (NPMI): +5-8%
- Training time: +80-100%
Diminishing returns suggest 4B is often the best choice for production.
## Grid Search

Systematic hyperparameter exploration:

```bash
#!/bin/bash
topics=(15 20 25 30)
learning_rates=(0.001 0.002 0.005)
hidden_dims=(256 512 768)

for K in "${topics[@]}"; do
    for lr in "${learning_rates[@]}"; do
        for hd in "${hidden_dims[@]}"; do
            echo "Training K=$K, lr=$lr, hd=$hd"
            python run_pipeline.py \
                --dataset my_dataset \
                --models theta \
                --model_size 0.6B \
                --mode zero_shot \
                --num_topics "$K" \
                --learning_rate "$lr" \
                --hidden_dim "$hd" \
                --epochs 100 \
                --batch_size 64 \
                --gpu 0

            # Archive this run before the next one overwrites it.
            mkdir -p "results_grid/K${K}_lr${lr}_hd${hd}"
            cp -r result/0.6B/my_dataset/zero_shot/* "results_grid/K${K}_lr${lr}_hd${hd}/"
        done
    done
done
```
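After the sweep, each configuration lives in its own results_grid/ folder. Ranking them depends on which metric files the pipeline emits, which this guide does not specify; the loop below only enumerates the archived runs so you can spot the metric file to parse:

```bash
# Enumerate archived grid-search runs and peek at their contents.
for run in results_grid/*/; do
    echo "== $run"
    ls "$run"
done
```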
## Batch Processing Multiple Datasets

```bash
#!/bin/bash
datasets=("news" "reviews" "papers" "social")

for dataset in "${datasets[@]}"; do
    echo "Processing $dataset..."

    python prepare_data.py \
        --dataset "$dataset" \
        --model theta \
        --model_size 0.6B \
        --mode zero_shot \
        --vocab_size 5000 \
        --gpu 0

    python run_pipeline.py \
        --dataset "$dataset" \
        --models theta \
        --model_size 0.6B \
        --mode zero_shot \
        --num_topics 20 \
        --gpu 0
done
```
## Parallel Processing on Multiple GPUs

```bash
# Terminal 1
CUDA_VISIBLE_DEVICES=0 python run_pipeline.py \
    --dataset dataset1 --models theta --gpu 0

# Terminal 2
CUDA_VISIBLE_DEVICES=1 python run_pipeline.py \
    --dataset dataset2 --models theta --gpu 0

# Terminal 3
CUDA_VISIBLE_DEVICES=2 python run_pipeline.py \
    --dataset dataset3 --models theta --gpu 0
```

CUDA_VISIBLE_DEVICES restricts each process to one physical GPU, which that process then sees as device 0, so --gpu 0 targets a different card in each terminal.
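The same fan-out also works from a single script by backgrounding each job and waiting for all of them to finish; a minimal sketch using only the flags shown above:

```bash
#!/bin/bash
# Launch one run per GPU in the background, then block until all finish.
datasets=(dataset1 dataset2 dataset3)

for i in "${!datasets[@]}"; do
    CUDA_VISIBLE_DEVICES=$i python run_pipeline.py \
        --dataset "${datasets[$i]}" --models theta --gpu 0 &
done
wait
echo "All runs complete."
```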