Appendix A: FAQ & Supplementary Information¶

Reference materials and supplementary information.

Complete Parameter Reference¶

To avoid duplicated and drifting parameter definitions, the canonical parameter reference is maintained in:

advanced/hyperparameters.md (recommended)
api/run-pipeline.md (CLI-oriented reference)

Directory Structure¶

./
├── ETM/
│   ├── main.py
│   ├── run_pipeline.py
│   ├── prepare_data.py
│   └── src/
├── data/
│   └── {dataset}/
│       └── {dataset}_cleaned.csv
├── result/
│   ├── 0.6B/
│   ├── 4B/
│   ├── 8B/
│   └── baseline/
└── embedding_models/

Hardware Requirements¶

Setup	CPU	RAM	GPU	CUDA	Storage
Minimum	4 cores	8GB	4GB VRAM	11.8+	20GB
Recommended	8 cores	16GB	12GB VRAM	12.1+	50GB SSD
High-Performance	16+ cores	32GB+	A100 40GB	12.1+	200GB NVMe

FAQ¶

Q: What makes THETA different?
A: THETA uses Qwen embeddings and neural variational inference for better semantic understanding than LDA or ETM.

Q: Which model size to use?
A: 0.6B for prototyping, 4B for production, 8B for maximum quality.

Q: Minimum dataset size?
A: 500+ documents with 50+ words average recommended.

Q: Training time?
A: 5K docs with 0.6B on V100: ~25 min. 4B: ~50 min.

Q: GPU required?
A: Yes. GPU required for preprocessing and training.

Citation¶

@article{theta2024,
  title={THETA: Advanced Topic Modeling with Qwen Embeddings},
  author={CodeSoul Team},
  year={2024},
  url={https://github.com/CodeSoul-co/THETA}
}

Contact¶

Website: https://theta.code-soul.com
GitHub: https://github.com/CodeSoul-co/THETA
Email: support@theta.code-soul.com

Document Version: 1.0.0
Last Updated: February 6, 2026