
Appendix

Reference materials and supplementary information.


Complete Parameter Reference

prepare_data.py

| Parameter | Type | Default | Range/Options | Required | Description |
|---|---|---|---|---|---|
| --dataset | string | - | - | Yes | Dataset name |
| --model | string | - | theta/baseline/dtm | Yes | Model type |
| --model_size | string | 0.6B | 0.6B/4B/8B | No | Qwen model size |
| --mode | string | zero_shot | zero_shot/supervised/unsupervised | No | Training mode |
| --vocab_size | int | 5000 | 1000-20000 | No | Vocabulary size |
| --batch_size | int | 32 | 8-128 | No | Batch size |
| --max_length | int | 512 | 128-2048 | No | Max sequence length |
| --gpu | int | 0 | 0-7 | No | GPU device ID |
| --clean | flag | False | - | No | Clean data first |
| --raw-input | string | None | filepath | No | Raw CSV path |
| --language | string | english | english/chinese | No | Cleaning language |
| --bow-only | flag | False | - | No | BOW only |
| --check-only | flag | False | - | No | Check files only |
| --time_column | string | year | column name | No | Time column (DTM) |
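As a sketch, a typical preprocessing invocation combining the options above might look like the following. The dataset name `20news` and the raw CSV path are placeholders, not files shipped with the project.

```bash
# Preprocess a dataset for THETA with the 0.6B Qwen embedding model.
# "20news" and the --raw-input path are illustrative placeholders.
python prepare_data.py \
    --dataset 20news \
    --model theta \
    --model_size 0.6B \
    --mode zero_shot \
    --vocab_size 5000 \
    --clean \
    --raw-input /root/autodl-tmp/data/20news/raw.csv \
    --language english \
    --gpu 0
```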

run_pipeline.py

| Parameter | Type | Default | Range/Options | Required | Description |
|---|---|---|---|---|---|
| --dataset | string | - | - | Yes | Dataset name |
| --models | string | - | theta,lda,etm,ctm,dtm | Yes | Model list |
| --model_size | string | 0.6B | 0.6B/4B/8B | No | Qwen model size |
| --mode | string | zero_shot | zero_shot/supervised/unsupervised | No | Training mode |
| --num_topics | int | 20 | 5-100 | No | Number of topics |
| --epochs | int | 100 | 10-500 | No | Training epochs |
| --batch_size | int | 64 | 8-512 | No | Batch size |
| --hidden_dim | int | 512 | 128-1024 | No | Hidden dimension |
| --learning_rate | float | 0.002 | 0.00001-0.1 | No | Learning rate |
| --kl_start | float | 0.0 | 0.0-1.0 | No | KL start weight |
| --kl_end | float | 1.0 | 0.0-1.0 | No | KL end weight |
| --kl_warmup | int | 50 | 0-200 | No | KL warmup epochs |
| --patience | int | 10 | 1-50 | No | Early stopping patience |
| --no_early_stopping | flag | False | - | No | Disable early stopping |
| --gpu | int | 0 | 0-7 | No | GPU device ID |
| --language | string | en | en/zh | No | Visualization language |
| --skip-train | flag | False | - | No | Skip training |
| --skip-eval | flag | False | - | No | Skip evaluation |
| --skip-viz | flag | False | - | No | Skip visualization |
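For reference, a minimal pipeline run comparing THETA against two baselines might look like the sketch below; the dataset name is a placeholder and the remaining values mirror the defaults in the table above.

```bash
# Train, evaluate, and visualize THETA plus LDA and ETM baselines.
# "20news" is a placeholder dataset name.
python run_pipeline.py \
    --dataset 20news \
    --models theta,lda,etm \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --epochs 100 \
    --batch_size 64 \
    --learning_rate 0.002 \
    --gpu 0 \
    --language en
```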

visualization.run_visualization

| Parameter | Type | Default | Range/Options | Required | Description |
|---|---|---|---|---|---|
| --result_dir | string | - | directory | Yes | Results directory |
| --dataset | string | - | - | Yes | Dataset name |
| --mode | string | zero_shot | zero_shot/supervised/unsupervised | No | THETA mode |
| --model_size | string | 0.6B | 0.6B/4B/8B | No | Model size |
| --baseline | flag | False | - | No | Baseline flag |
| --model | string | None | lda/etm/ctm/dtm | No | Baseline model |
| --num_topics | int | 20 | 5-100 | No | Number of topics |
| --language | string | en | en/zh | No | Language |
| --dpi | int | 300 | 72-1200 | No | Image resolution |
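The sketch below shows one way to regenerate plots from an existing results directory. Invoking the module via `python -m` and the exact result path are assumptions based on the directory structure shown later; adjust them to your layout.

```bash
# Regenerate visualizations from a previously produced THETA result directory.
# The result path and dataset name are illustrative placeholders.
python -m visualization.run_visualization \
    --result_dir /root/autodl-tmp/result/0.6B/20news \
    --dataset 20news \
    --mode zero_shot \
    --model_size 0.6B \
    --num_topics 20 \
    --language en \
    --dpi 300
```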

Directory Structure

```
/root/autodl-tmp/
├── ETM/
│   ├── main.py
│   ├── run_pipeline.py
│   ├── prepare_data.py
│   └── src/
├── data/
│   └── {dataset}/
│       └── {dataset}_cleaned.csv
├── result/
│   ├── 0.6B/
│   ├── 4B/
│   ├── 8B/
│   └── baseline/
└── embedding_models/
```

Hardware Requirements

| Setup | CPU | RAM | GPU | CUDA | Storage |
|---|---|---|---|---|---|
| Minimum | 4 cores | 8GB | 4GB VRAM | 11.8+ | 20GB |
| Recommended | 8 cores | 16GB | 12GB VRAM | 12.1+ | 50GB SSD |
| High-Performance | 16+ cores | 32GB+ | A100 40GB | 12.1+ | 200GB NVMe |
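A quick way to confirm a machine meets the GPU and CUDA requirements is sketched below. It assumes PyTorch is installed in the environment; the commands only report the detected GPU and CUDA version.

```bash
# Report the installed GPU and its memory (requires NVIDIA drivers).
nvidia-smi --query-gpu=name,memory.total --format=csv

# Check that PyTorch can see CUDA (assumes PyTorch is installed).
python -c "import torch; print('CUDA available:', torch.cuda.is_available(), '| CUDA version:', torch.version.cuda)"
```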

FAQ

Q: What makes THETA different?
A: THETA uses Qwen embeddings and neural variational inference for better semantic understanding than LDA or ETM.

Q: Which model size to use?
A: 0.6B for prototyping, 4B for production, 8B for maximum quality.

Q: Minimum dataset size?
A: At least 500 documents, averaging 50+ words each, is recommended.

Q: Training time?
A: For roughly 5K documents on a V100, expect ~25 minutes with the 0.6B model and ~50 minutes with 4B.

Q: GPU required?
A: Yes. A CUDA-capable GPU is required for both preprocessing and training.


Citation

```bibtex
@misc{theta2024,
  title={THETA: Advanced Topic Modeling with Qwen Embeddings},
  author={CodeSoul Team},
  year={2024},
  url={https://github.com/CodeSoul-co/THETA}
}
```

Contact


Document Version: 1.0.0
Last Updated: February 6, 2026