Model Comparison¶

Comprehensive comparison of all models supported by THETA.

Performance Comparison¶

Typical performance on benchmark datasets:

Model	TD	NPMI	C_V	PPL	Speed	VRAM
LDA	0.75	0.25	0.45	180	Fast	0GB
ETM	0.82	0.32	0.52	165	Medium	4GB
CTM	0.85	0.38	0.58	155	Medium	6GB
THETA-0.6B	0.88	0.42	0.64	145	Medium	8GB
THETA-4B	0.91	0.48	0.69	138	Slow	16GB
THETA-8B	0.93	0.52	0.72	132	Slowest	28GB

Values are approximate and vary by dataset. TD is topic diversity, NPMI and C_V are topic coherence measures, and PPL is perplexity. Higher is better for TD, NPMI, and C_V; lower is better for PPL.
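For reference, topic diversity is commonly computed as the fraction of unique words among the top-k words of all topics. The sketch below assumes topics are available as ranked word lists; the function name `topic_diversity`, the default `top_k=25`, and the toy topics are illustrative and not taken from THETA's own evaluation code, which may differ.

```python
def topic_diversity(topics, top_k=25):
    """Fraction of unique words among the top_k words of every topic."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


# Toy example (hypothetical data): the two topics share the word "model",
# so 5 of the 6 top words are unique.
toy_topics = [
    ["model", "training", "gpu"],
    ["model", "topic", "corpus"],
]
td = topic_diversity(toy_topics, top_k=3)
print(f"TD = {td:.3f}")  # TD = 0.833
```

A value of 1.0 means no word is repeated across topics; values near 0 indicate heavily redundant topics.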

Selection Guidelines¶

Use LDA when:
- Need fast baseline results
- Interpretability is critical
- No GPU available
- Computing topic distributions for new documents frequently

Use ETM when:
- Want better performance than LDA
- Have GPU available
- Need moderate computational budget
- Comparing against original ETM papers

Use CTM when:
- Need contextualized understanding
- Want good balance of quality and speed
- Following recent topic modeling literature
- Working with standard-size corpora

Use DTM when:
- Analyzing temporal dynamics
- Have time-stamped documents
- Studying topic evolution
- Investigating emerging trends

Use THETA-0.6B when:
- Need better quality than CTM
- Have 8-12GB VRAM available
- Rapid experimentation required

Use THETA-4B when:
- Need high-quality results
- Have 16-20GB VRAM available
- Production deployment

Use THETA-8B when:
- Need highest possible quality
- Have 24-32GB VRAM available
- Critical applications
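The VRAM side of these guidelines can be expressed as a simple rule of thumb. The sketch below is only an illustration: `recommend_model` is a hypothetical helper, not part of THETA, and its thresholds are assumptions taken from the lower bounds of the VRAM ranges above and from the VRAM column of the performance table.

```python
def recommend_model(vram_gb, temporal=False):
    """Map available GPU memory (in GB) to a model name.

    Thresholds are assumptions derived from the guideline ranges and
    the performance table; adjust them for your own hardware.
    """
    if temporal:
        return "DTM"          # time-stamped documents, topic evolution
    if vram_gb >= 24:
        return "THETA-8B"     # guideline range: 24-32GB
    if vram_gb >= 16:
        return "THETA-4B"     # guideline range: 16-20GB
    if vram_gb >= 8:
        return "THETA-0.6B"   # guideline range: 8-12GB
    if vram_gb >= 6:
        return "CTM"          # table lists about 6GB
    if vram_gb >= 4:
        return "ETM"          # table lists about 4GB
    return "LDA"              # no GPU required


for vram in (0, 6, 10, 16, 32):
    print(f"{vram:>2} GB -> {recommend_model(vram)}")
```

This picks the largest model that fits; quality requirements, time budget, and interpretability needs from the lists above may still argue for a smaller one.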

Computational Requirements¶

Training time on a 10K-document corpus:

Model	CPU Time	GPU Time	VRAM	Storage
LDA	15 min	N/A	0GB	100MB
ETM	N/A	20 min	4GB	500MB
CTM	N/A	25 min	6GB	800MB
THETA-0.6B	N/A	30 min	8GB	2GB
THETA-4B	N/A	50 min	16GB	6GB
THETA-8B	N/A	90 min	28GB	12GB

GPU times assume a single GPU (V100 or A100).

Embedding Comparison¶

Model	Embedding	Dimension	Contextual	Pre-trained
LDA	None	N/A	No	N/A
ETM	Word2Vec	300	No	Yes
CTM	SBERT	768	Yes	Yes
THETA-0.6B	Qwen3	1024	Yes	Yes
THETA-4B	Qwen3	2560	Yes	Yes
THETA-8B	Qwen3	4096	Yes	Yes
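Embedding dimension directly affects how much memory a matrix of embeddings takes. As a rough illustration, the sketch below computes the size of a dense matrix of 10,000 vectors at each dimension from the table, assuming 4-byte float32 values. Whether a given model stores one vector per document or per vocabulary word, and in which precision, is not stated here, so treat the vector count and `bytes_per_value` as assumptions.

```python
# Embedding dimensions copied from the table above.
EMBEDDING_DIMS = {
    "ETM (Word2Vec)": 300,
    "CTM (SBERT)": 768,
    "THETA-0.6B (Qwen3)": 1024,
    "THETA-4B (Qwen3)": 2560,
    "THETA-8B (Qwen3)": 4096,
}


def embedding_matrix_mb(num_vectors, dim, bytes_per_value=4):
    """Size in MB (10**6 bytes) of a dense num_vectors x dim matrix."""
    return num_vectors * dim * bytes_per_value / 1e6


for name, dim in EMBEDDING_DIMS.items():
    size_mb = embedding_matrix_mb(10_000, dim)
    print(f"{name}: {size_mb:.1f} MB for 10,000 vectors")
```

The size grows linearly with dimension, so the 4096-dimensional vectors take four times the space of the 1024-dimensional ones for the same number of vectors.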

Model Selection Workflow¶

Step 1: Determine Requirements¶

Consider:
- Dataset size (number of documents)
- Available computational resources (GPU memory)
- Time constraints
- Quality requirements (research vs prototyping)

Step 2: Choose Initial Model¶

Default recommendations:
- Prototyping: THETA-0.6B or CTM
- Production: THETA-4B
- Research: THETA-8B
- Quick baseline: LDA
- Temporal analysis: DTM

Step 3: Evaluate and Compare¶

Train multiple models:

python run_pipeline.py \
    --dataset my_dataset \
    --models lda,etm,ctm,theta \
    --model_size 0.6B \
    --num_topics 20
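To repeat the comparison at several THETA sizes, the command above can be generated from a small script. This sketch only builds and prints the commands. It uses the flags shown above and nothing else; that `--model_size` accepts `4B` and `8B` is an assumption based on the model names in this document, and `build_pipeline_command` is a hypothetical helper.

```python
import shlex


def build_pipeline_command(dataset, model_size, num_topics=20,
                           models=("lda", "etm", "ctm", "theta")):
    """Return the run_pipeline.py argument list for one THETA model size."""
    return [
        "python", "run_pipeline.py",
        "--dataset", dataset,
        "--models", ",".join(models),
        "--model_size", model_size,
        "--num_topics", str(num_topics),
    ]


# "4B" and "8B" are assumed to be valid --model_size values.
commands = [
    build_pipeline_command("my_dataset", size)
    for size in ("0.6B", "4B", "8B")
]

for cmd in commands:
    # Printing only; pass cmd to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```

Keeping the arguments as a list, instead of one string, avoids shell quoting problems if a dataset name contains spaces.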

Step 4: Scale Up If Needed¶

THETA-0.6B → THETA-4B: Significant quality improvement
THETA-4B → THETA-8B: Marginal quality improvement
Consider collecting more data before scaling model size
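The difference between the two scaling steps can be made concrete with the approximate figures from the tables above. The sketch below compares the NPMI gain against the extra GPU training time on a 10K-document corpus. The numbers are copied from this page, and `scaling_step` is an illustrative helper; results on your own data will differ.

```python
# (NPMI, GPU training minutes on a 10K-document corpus), from the tables above.
THETA_STATS = {
    "THETA-0.6B": (0.42, 30),
    "THETA-4B": (0.48, 50),
    "THETA-8B": (0.52, 90),
}


def scaling_step(smaller, larger, stats=THETA_STATS):
    """Return (NPMI gain, extra minutes, gain per extra minute) for one step up."""
    npmi_small, minutes_small = stats[smaller]
    npmi_large, minutes_large = stats[larger]
    gain = npmi_large - npmi_small
    extra_minutes = minutes_large - minutes_small
    return gain, extra_minutes, gain / extra_minutes


for smaller, larger in [("THETA-0.6B", "THETA-4B"), ("THETA-4B", "THETA-8B")]:
    gain, extra, rate = scaling_step(smaller, larger)
    print(f"{smaller} -> {larger}: +{gain:.2f} NPMI for +{extra} min "
          f"({rate:.4f} per extra minute)")
```

With these figures, the first step gains about 0.003 NPMI per extra training minute and the second about 0.001, which is why the second step is described as marginal.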