
Appendix

Reference materials and supplementary information.


Complete Parameter Reference

prepare_data.py

| Parameter | Type | Default | Range/Options | Required | Description |
|---|---|---|---|---|---|
| --dataset | string | - | - | Yes | Dataset name |
| --model | string | - | theta/baseline/dtm | Yes | Model type |
| --model_size | string | 0.6B | 0.6B/4B/8B | No | Qwen model size |
| --mode | string | zero_shot | zero_shot/supervised/unsupervised | No | Training mode |
| --vocab_size | int | 5000 | 1000-20000 | No | Vocabulary size |
| --batch_size | int | 32 | 8-128 | No | Batch size |
| --max_length | int | 512 | 128-2048 | No | Max sequence length |
| --gpu | int | 0 | 0-7 | No | GPU device ID |
| --clean | flag | False | - | No | Clean data first |
| --raw-input | string | None | filepath | No | Raw CSV path |
| --language | string | english | english/chinese | No | Cleaning language |
| --bow-only | flag | False | - | No | BOW only |
| --check-only | flag | False | - | No | Check files only |
| --time_column | string | year | column name | No | Time column (DTM) |
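As a sketch, a typical preprocessing invocation combining the options above might look like the following. The dataset name `20news` and the raw CSV path are placeholders, not files shipped with the project.

```bash
# Preprocess a dataset for THETA with the 0.6B Qwen embedding model.
# "20news" and the --raw-input path are illustrative placeholders.
python prepare_data.py \
    --dataset 20news \
    --model theta \
    --model_size 0.6B \
    --mode zero_shot \
    --vocab_size 5000 \
    --clean \
    --raw-input /root/autodl-tmp/data/20news/raw.csv \
    --language english \
    --gpu 0
```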

run_pipeline.py

| Parameter | Type | Default | Range/Options | Required | Description |
|---|---|---|---|---|---|
| --dataset | string | - | - | Yes | Dataset name |
| --models | string | - | theta,lda,etm,ctm,dtm | Yes | Model list |
| --model_size | string | 0.6B | 0.6B/4B/8B | No | Qwen model size |
| --mode | string | zero_shot | zero_shot/supervised/unsupervised | No | Training mode |
| --num_topics | int | 20 | 5-100 | No | Number of topics |
| --epochs | int | 100 | 10-500 | No | Training epochs |
| --batch_size | int | 64 | 8-512 | No | Batch size |
| --hidden_dim | int | 512 | 128-1024 | No | Hidden dimension |
| --learning_rate | float | 0.002 | 0.00001-0.1 | No | Learning rate |
| --kl_start | float | 0.0 | 0.0-1.0 | No | KL start weight |
| --kl_end | float | 1.0 | 0.0-1.0 | No | KL end weight |
| --kl_warmup | int | 50 | 0-200 | No | KL warmup epochs |
| --patience | int | 10 | 1-50 | No | Early stopping patience |
| --no_early_stopping | flag | False | - | No | Disable early stopping |
| --gpu | int | 0 | 0-7 | No | GPU device ID |
| --language | string | en | en/zh | No | Visualization language |
| --skip-train | flag | False | - | No | Skip training |
| --skip-eval | flag | False | - | No | Skip evaluation |
| --skip-viz | flag | False | - | No | Skip visualization |
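For reference, a minimal pipeline run comparing THETA against two baselines might look like the sketch below; the dataset name is a placeholder and the remaining values mirror the defaults in the table above.

```bash
# Train, evaluate, and visualize THETA plus LDA and ETM baselines.
# "20news" is a placeholder dataset name.
python run_pipeline.py \
    --dataset 20news \
    --models theta,lda,etm \
    --model_size 0.6B \
    --mode zero_shot \
    --num_topics 20 \
    --epochs 100 \
    --batch_size 64 \
    --learning_rate 0.002 \
    --gpu 0 \
    --language en
```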

visualization.run_visualization

| Parameter | Type | Default | Range/Options | Required | Description |
|---|---|---|---|---|---|
| --result_dir | string | - | directory | Yes | Results directory |
| --dataset | string | - | - | Yes | Dataset name |
| --mode | string | zero_shot | zero_shot/supervised/unsupervised | No | THETA mode |
| --model_size | string | 0.6B | 0.6B/4B/8B | No | Model size |
| --baseline | flag | False | - | No | Baseline flag |
| --model | string | None | lda/etm/ctm/dtm | No | Baseline model |
| --num_topics | int | 20 | 5-100 | No | Number of topics |
| --language | string | en | en/zh | No | Language |
| --dpi | int | 300 | 72-1200 | No | Image resolution |
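The sketch below shows one way to regenerate plots from an existing results directory. Invoking the module via `python -m` and the exact result path are assumptions based on the directory structure shown later; adjust them to your layout.

```bash
# Regenerate visualizations from a previously produced THETA result directory.
# The result path and dataset name are illustrative placeholders.
python -m visualization.run_visualization \
    --result_dir /root/autodl-tmp/result/0.6B/20news \
    --dataset 20news \
    --mode zero_shot \
    --model_size 0.6B \
    --num_topics 20 \
    --language en \
    --dpi 300
```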

Directory Structure

```
/root/autodl-tmp/
├── ETM/
│   ├── main.py
│   ├── run_pipeline.py
│   ├── prepare_data.py
│   └── src/
├── data/
│   └── {dataset}/
│       └── {dataset}_cleaned.csv
├── result/
│   ├── 0.6B/
│   ├── 4B/
│   ├── 8B/
│   └── baseline/
└── embedding_models/
```

Hardware Requirements

| Setup | CPU | RAM | GPU | CUDA | Storage |
|---|---|---|---|---|---|
| Minimum | 4 cores | 8GB | 4GB VRAM | 11.8+ | 20GB |
| Recommended | 8 cores | 16GB | 12GB VRAM | 12.1+ | 50GB SSD |
| High-Performance | 16+ cores | 32GB+ | A100 40GB | 12.1+ | 200GB NVMe |
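A quick way to confirm a machine meets the GPU and CUDA requirements is sketched below. It assumes PyTorch is installed in the environment; the commands only report the detected GPU and CUDA version.

```bash
# Report the installed GPU and its memory (requires NVIDIA drivers).
nvidia-smi --query-gpu=name,memory.total --format=csv

# Check that PyTorch can see CUDA (assumes PyTorch is installed).
python -c "import torch; print('CUDA available:', torch.cuda.is_available(), '| CUDA version:', torch.version.cuda)"
```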

FAQ

Q: What makes THETA different?
A: THETA uses Qwen embeddings and neural variational inference for better semantic understanding than LDA or ETM.

Q: Which model size to use?
A: 0.6B for prototyping, 4B for production, 8B for maximum quality.

Q: Minimum dataset size?
A: At least 500 documents, averaging 50+ words each, is recommended.

Q: Training time?
A: For roughly 5K documents on a V100, expect ~25 minutes with the 0.6B model and ~50 minutes with 4B.

Q: GPU required?
A: Yes. A CUDA-capable GPU is required for both preprocessing and training.


Citation

```bibtex
@misc{theta2024,
  title={THETA: Advanced Topic Modeling with Qwen Embeddings},
  author={CodeSoul Team},
  year={2024},
  url={https://github.com/CodeSoul-co/THETA}
}
```

Contact


Document Version: 1.0.0
Last Updated: February 6, 2026