
Distributed & Memory-Efficient Training

This guide covers scaling THETA to larger datasets and running it in memory-constrained environments.


Memory-Efficient Training

For GPUs with limited VRAM, lower the batch size:

0.6B with reduced batch size:

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --batch_size 32 \
    --gpu 0

4B with a smaller batch size:

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 4B \
    --mode zero_shot \
    --num_topics 20 \
    --batch_size 16 \
    --gpu 0

8B with a minimal batch size (requires a high-end GPU):

python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 8B \
    --mode zero_shot \
    --num_topics 20 \
    --batch_size 8 \
    --gpu 0

If out-of-memory errors still occur, reduce the batch size further; a fallback loop is sketched below.
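
If you are unsure which batch size fits, you can step down until a run succeeds. This is a minimal sketch, assuming run_pipeline.py exits with a non-zero status on an out-of-memory error; verify that behavior in your setup before relying on it.

#!/usr/bin/env bash
# Try progressively smaller batch sizes until one fits in VRAM.
# Assumes run_pipeline.py returns a non-zero exit code on OOM.
for bs in 64 32 16 8; do
    if python run_pipeline.py \
        --dataset my_dataset \
        --models theta \
        --model_size 0.6B \
        --mode zero_shot \
        --num_topics 20 \
        --batch_size "$bs" \
        --gpu 0; then
        echo "Succeeded with batch_size=$bs"
        break
    fi
done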


Memory Requirements

Model Size    Batch Size    VRAM Required
0.6B          16            ~6 GB
0.6B          32            ~8 GB
0.6B          64            ~12 GB
4B            8             ~10 GB
4B            16            ~14 GB
4B            32            ~22 GB
8B            8             ~18 GB
8B            16            ~28 GB
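
To pick a row from the table, it can help to check free VRAM first. The nvidia-smi query below is standard; the 8192 MiB threshold is just the ~8 GB figure for 0.6B at batch size 32 taken from the table above.

# Report free VRAM (in MiB) on GPU 0, then compare against the table.
free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits -i 0)
echo "GPU 0 free VRAM: ${free_mib} MiB"
# 0.6B at batch size 32 needs ~8 GB (~8192 MiB) per the table.
if [ "$free_mib" -ge 8192 ]; then
    echo "Enough headroom for 0.6B with --batch_size 32"
else
    echo "Consider --batch_size 16 or smaller"
fi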

Multi-GPU Processing

Train different configurations in parallel using separate GPUs:

# Terminal 1
CUDA_VISIBLE_DEVICES=0 python run_pipeline.py \
    --dataset dataset1 --models theta --gpu 0

# Terminal 2
CUDA_VISIBLE_DEVICES=1 python run_pipeline.py \
    --dataset dataset2 --models theta --gpu 0

# Terminal 3
CUDA_VISIBLE_DEVICES=2 python run_pipeline.py \
    --dataset dataset3 --models theta --gpu 0

CUDA_VISIBLE_DEVICES remaps device indices, so each process sees only one GPU and --gpu 0 always refers to the first device visible to that process. The same pattern can also be scripted from a single shell, as sketched below.
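
A minimal launcher sketch, assuming three datasets and at least three visible GPUs; the log file names are illustrative:

#!/usr/bin/env bash
# Launch one run per GPU in the background, then wait for all of them.
datasets=(dataset1 dataset2 dataset3)
for i in "${!datasets[@]}"; do
    CUDA_VISIBLE_DEVICES=$i python run_pipeline.py \
        --dataset "${datasets[$i]}" \
        --models theta \
        --gpu 0 \
        > "log_${datasets[$i]}.txt" 2>&1 &
done
wait    # block until every background run finishes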


Environment Variables

CUDA Configuration

# Use specific GPU
CUDA_VISIBLE_DEVICES=0 python run_pipeline.py --dataset my_dataset --models theta

# Mitigate CUDA caching-allocator fragmentation
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 python run_pipeline.py ...

# Disable the CUDA caching allocator (slower, but useful when debugging memory usage)
PYTORCH_NO_CUDA_MEMORY_CACHING=1 python run_pipeline.py ...

Logging

# Disable progress bars
TQDM_DISABLE=1 python run_pipeline.py ...

# Suppress Python warnings
export PYTHONWARNINGS="ignore"
python run_pipeline.py ...
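
For a constrained, non-interactive run (for example, under a job scheduler), the variables above can be combined. A sketch using only the flags and variables shown on this page:

# Low-memory, quiet run on GPU 0: allocator fragmentation mitigated,
# progress bars off, Python warnings suppressed.
CUDA_VISIBLE_DEVICES=0 \
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 \
TQDM_DISABLE=1 \
PYTHONWARNINGS="ignore" \
python run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --batch_size 16 \
    --gpu 0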