# Shell Scripts Reference

All scripts are non-interactive and driven entirely by command-line parameters, making them suitable for DLC/batch environments; no stdin input is required.
## Script Overview

| Script | Description |
|---|---|
| `01_setup.sh` | Install dependencies and download data from HuggingFace |
| `02_clean_data.sh` | Clean raw text data (tokenization, stopword removal, lemmatization) |
| `02_generate_embeddings.sh` | Generate Qwen embeddings (sub-script of 03, for failure recovery) |
| `03_prepare_data.sh` | One-stop data preparation: BOW + embeddings for all 12 models |
| `04_train_theta.sh` | Train THETA model (train + evaluate + visualize) |
| `05_train_baseline.sh` | Train 11 baseline models for comparison with THETA |
| `06_visualize.sh` | Generate visualizations for trained models |
| `07_evaluate.sh` | Standalone evaluation with 7 unified metrics |
| `08_compare_models.sh` | Cross-model metric comparison table |
| `09_download_from_hf.sh` | Download pre-trained data from HuggingFace |
| `10_quick_start_english.sh` | Quick start for English datasets |
| `11_quick_start_chinese.sh` | Quick start for Chinese datasets |
| `12_train_multi_gpu.sh` | Multi-GPU training with DistributedDataParallel |
| `13_test_agent.sh` | Test LLM Agent connection and functionality |
| `14_start_agent_api.sh` | Start the Agent API server (FastAPI) |
## A) Data Cleaning — 02_clean_data.sh

Row-by-row text cleaning with user-specified column selection. Two modes:

- CSV mode: user specifies `--text_column` (cleaned) and `--label_columns` (preserved as-is)
- Directory mode: convert docx/txt files into a single cleaned CSV

Supported languages: english, chinese, german, spanish
```bash
# 1. Preview columns (recommended first step for CSV)
bash scripts/02_clean_data.sh \
  --input data/FCPB/complaints_text_only.csv --preview

# 2. Clean text column only
bash scripts/02_clean_data.sh \
  --input data/FCPB/complaints_text_only.csv \
  --language english \
  --text_column 'Consumer complaint narrative'

# 3. Clean text + keep label column
bash scripts/02_clean_data.sh \
  --input data/hatespeech/hatespeech_text_only.csv \
  --language english \
  --text_column cleaned_content --label_columns Label

# 4. Keep ALL columns, only clean the text column
bash scripts/02_clean_data.sh \
  --input raw.csv --language english \
  --text_column text --keep_all

# 5. Directory mode (docx/txt → CSV)
bash scripts/02_clean_data.sh \
  --input data/edu_data/ --language chinese
```
| Parameter | Required | Description | Default |
|---|---|---|---|
| `--input` | Yes | Input CSV file or directory (docx/txt) | - |
| `--language` | Yes (not for preview) | Data language: english, chinese, german, spanish | - |
| `--text_column` | Yes (CSV mode) | Name of the text column to clean | - |
| `--label_columns` | No | Comma-separated label/metadata columns to keep as-is | - |
| `--keep_all` | No | Keep ALL original columns (only text column is cleaned) | false |
| `--preview` | No | Show CSV columns and sample rows, then exit | false |
| `--output` | No | Output CSV path | auto-generated |
| `--min_words` | No | Min words per document after cleaning | 3 |

Output: `data/{dataset}/{dataset}_cleaned.csv`
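
After cleaning, a quick sanity check is to count how many documents survived the `--min_words` filter by counting the data rows in the output CSV. A minimal sketch (the `count_rows` helper is ours, not part of the scripts; the path follows the Output pattern above):

```bash
# count_rows: number of data rows in a CSV, excluding the header line.
# (Illustrative helper, not part of the toolkit.)
count_rows() {
  tail -n +2 "$1" | wc -l | tr -d ' '
}

# e.g. count_rows data/edu_data/edu_data_cleaned.csv
```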
## B) Data Preparation — 03_prepare_data.sh

One-stop data preparation for all 12 models. Generates the BOW matrix and model-specific embeddings.

Data requirements by model:
| Model | Type | Data Needed |
|---|---|---|
| lda, hdp, btm | Traditional | BOW only |
| stm | Traditional | BOW + covariates (document metadata) |
| nvdm, gsm, prodlda | Neural | BOW only |
| etm | Neural | BOW + Word2Vec |
| ctm | Neural | BOW + SBERT |
| dtm | Neural | BOW + SBERT + time slices |
| bertopic | Neural | SBERT + raw text |
| theta | THETA | BOW + Qwen embeddings |
Note: the BOW-based models (lda, hdp, btm, stm, nvdm, gsm, prodlda) share the same data experiment. Prepare once, train all.
```bash
# ---- Baseline models ----

# BOW-only models (lda, hdp, btm, nvdm, gsm, prodlda share this)
bash scripts/03_prepare_data.sh \
  --dataset edu_data --model lda --vocab_size 3500 --language chinese

# CTM (BOW + SBERT embeddings)
bash scripts/03_prepare_data.sh \
  --dataset edu_data --model ctm --vocab_size 3500 --language chinese

# ETM (BOW + Word2Vec embeddings)
bash scripts/03_prepare_data.sh \
  --dataset edu_data --model etm --vocab_size 3500 --language chinese

# DTM (BOW + SBERT + time slices, requires time column)
bash scripts/03_prepare_data.sh \
  --dataset edu_data --model dtm --vocab_size 3500 --language chinese --time_column year

# BERTopic (SBERT + raw text)
bash scripts/03_prepare_data.sh \
  --dataset edu_data --model bertopic --vocab_size 3500 --language chinese

# ---- THETA model ----

# Zero-shot (fastest, no training needed)
bash scripts/03_prepare_data.sh \
  --dataset edu_data --model theta --model_size 0.6B --mode zero_shot \
  --vocab_size 3500 --language chinese

# Unsupervised (LoRA fine-tuned Qwen embeddings)
bash scripts/03_prepare_data.sh \
  --dataset edu_data --model theta --model_size 0.6B --mode unsupervised \
  --vocab_size 3500 --language chinese

# Supervised (requires label column)
bash scripts/03_prepare_data.sh \
  --dataset edu_data --model theta --model_size 0.6B --mode supervised \
  --vocab_size 3500 --language chinese

# ---- Advanced options ----

# BOW only (skip embedding generation)
bash scripts/03_prepare_data.sh --dataset mydata --model theta --bow-only --vocab_size 5000

# Check if data files already exist
bash scripts/03_prepare_data.sh --dataset mydata --model theta --check-only

# Custom vocabulary size, batch size, and GPU
bash scripts/03_prepare_data.sh --dataset mydata \
  --model theta --model_size 0.6B --mode zero_shot \
  --vocab_size 10000 --batch_size 64 --gpu 0
```
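
Several scripts below default `--data_exp` to "auto latest". If you want to pin that choice yourself (e.g. in a batch job), one way to resolve the most recent experiment directory is to sort the timestamped `exp_*` names. A minimal sketch, assuming the `result/<size>/<dataset>/data/` layout shown in the embedding-recovery example below:

```bash
# latest_exp: print the newest exp_* directory under a data directory.
# Works because exp_ names embed a sortable YYYYMMDD_HHMMSS timestamp.
# (Illustrative helper; the result/ layout is an assumption.)
latest_exp() {
  ls -d "$1"/exp_* 2>/dev/null | sort | tail -n 1
}

# e.g. data_exp=$(basename "$(latest_exp result/0.6B/edu_data/data)")
```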
| Parameter | Required | Description | Default |
|---|---|---|---|
| `--dataset` | Yes | Dataset name | - |
| `--model` | Yes | Target model: lda, hdp, stm (requires covariates), btm, nvdm, gsm, prodlda, ctm, etm, dtm, bertopic, theta | - |
| `--model_size` | No | Qwen model size (theta only): 0.6B, 4B, 8B | 0.6B |
| `--mode` | No | Embedding mode (theta only): zero_shot, unsupervised, supervised | zero_shot |
| `--vocab_size` | No | Vocabulary size | 5000 |
| `--batch_size` | No | Embedding generation batch size | 32 |
| `--gpu` | No | GPU device ID | 0 |
| `--language` | No | Data language: english, chinese (controls tokenization) | english |
| `--bow-only` | No | Only generate BOW, skip embeddings | false |
| `--check-only` | No | Only check if files exist | false |
| `--time_column` | No | Time column name (DTM only) | year |
| `--label_column` | No | Label column (theta supervised only) | - |
| `--emb_epochs` | No | Embedding fine-tuning epochs (theta only) | 10 |
| `--emb_batch_size` | No | Embedding fine-tuning batch size (theta only) | 8 |
| `--exp_name` | No | Experiment name tag | auto-generated |
**Embedding recovery** — If embedding generation fails (e.g., OOM), re-run only the embedding step:

```bash
bash scripts/02_generate_embeddings.sh \
  --dataset edu_data --mode zero_shot --model_size 0.6B \
  --batch_size 4 --exp_dir result/0.6B/edu_data/data/exp_xxx
```
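
When the failure is a CUDA OOM, the usual fix is a smaller `--batch_size`. A sketch of that retry loop (the `retry_with_smaller_batch` helper is ours, not part of the scripts):

```bash
# Retry a command with a progressively halved --batch_size until it
# succeeds or the batch size drops below 1. (Illustrative helper.)
retry_with_smaller_batch() {
  local batch=$1; shift
  while [ "$batch" -ge 1 ]; do
    if "$@" --batch_size "$batch"; then
      echo "succeeded with batch_size=$batch"
      return 0
    fi
    batch=$(( batch / 2 ))   # halve after each failure
  done
  return 1
}

# e.g. retry_with_smaller_batch 16 bash scripts/02_generate_embeddings.sh \
#        --dataset edu_data --mode zero_shot --model_size 0.6B
```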
## C) THETA Model Training — 04_train_theta.sh

Train the THETA model with integrated training + evaluation + visualization.
```bash
# ---- Basic usage ----

# Zero-shot mode (simplest command)
bash scripts/04_train_theta.sh \
  --dataset edu_data --model_size 0.6B --mode zero_shot --num_topics 20

# Unsupervised mode
bash scripts/04_train_theta.sh \
  --dataset edu_data --model_size 0.6B --mode unsupervised --num_topics 20

# Supervised mode (requires label column)
bash scripts/04_train_theta.sh \
  --dataset hatespeech --model_size 0.6B --mode supervised --num_topics 20

# Larger model for better quality
bash scripts/04_train_theta.sh \
  --dataset hatespeech --model_size 4B --mode zero_shot --num_topics 20

# ---- Full parameters ----
bash scripts/04_train_theta.sh \
  --dataset edu_data --model_size 0.6B --mode zero_shot \
  --num_topics 20 --epochs 100 --batch_size 64 \
  --hidden_dim 512 --learning_rate 0.002 \
  --kl_start 0.0 --kl_end 1.0 --kl_warmup 50 \
  --patience 10 --gpu 0 --language zh

# Custom KL annealing
bash scripts/04_train_theta.sh \
  --dataset hatespeech --model_size 0.6B --mode zero_shot \
  --num_topics 20 --epochs 200 \
  --kl_start 0.1 --kl_end 0.8 --kl_warmup 40
```
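
The `--kl_*` flags suggest a linear warmup of the KL weight from `kl_start` to `kl_end` over `kl_warmup` epochs, after which it is held constant. That schedule is our assumption, not confirmed by the source; a minimal sketch of it:

```bash
# kl_weight epoch kl_start kl_end kl_warmup
# Linear ramp over the warmup epochs, then held at kl_end.
# (Assumed schedule, for illustration only.)
kl_weight() {
  awk -v e="$1" -v s="$2" -v f="$3" -v w="$4" \
    'BEGIN { if (e > w) e = w; printf "%.3f\n", s + (f - s) * e / w }'
}

kl_weight 25 0.0 1.0 50   # halfway through warmup -> 0.500
kl_weight 60 0.0 1.0 50   # past warmup -> held at 1.000
```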
```bash
# ---- Specify data experiment ----
bash scripts/04_train_theta.sh \
  --dataset edu_data --model_size 0.6B --mode zero_shot \
  --data_exp exp_20260208_151906_vocab3500_theta_0.6B_zero_shot \
  --num_topics 20 --epochs 50 --language zh

# ---- Skip options ----

# Skip visualization (train + evaluate only, faster)
bash scripts/04_train_theta.sh \
  --dataset edu_data --model_size 0.6B --mode zero_shot \
  --num_topics 20 --skip-viz

# Skip training (evaluate + visualize existing model)
bash scripts/04_train_theta.sh \
  --dataset edu_data --model_size 0.6B --mode zero_shot \
  --skip-train --language zh
```
| Parameter | Required | Description | Default |
|---|---|---|---|
| `--dataset` | Yes | Dataset name | - |
| `--model_size` | No | Qwen model size: 0.6B, 4B, 8B | 0.6B |
| `--mode` | No | Embedding mode: zero_shot, unsupervised, supervised | zero_shot |
| `--num_topics` | No | Number of topics K | 20 |
| `--epochs` | No | Training epochs | 100 |
| `--batch_size` | No | Training batch size | 64 |
| `--hidden_dim` | No | Encoder hidden dimension | 512 |
| `--learning_rate` | No | Learning rate | 0.002 |
| `--kl_start` | No | KL annealing start weight | 0.0 |
| `--kl_end` | No | KL annealing end weight | 1.0 |
| `--kl_warmup` | No | KL warmup epochs | 50 |
| `--patience` | No | Early stopping patience | 10 |
| `--gpu` | No | GPU device ID | 0 |
| `--language` | No | Visualization language: en, zh | en |
| `--skip-train` | No | Skip training, only evaluate | false |
| `--skip-viz` | No | Skip visualization | false |
| `--data_exp` | No | Data experiment ID | auto latest |
| `--exp_name` | No | Experiment name tag | auto-generated |
## D) Baseline Model Training — 05_train_baseline.sh

Train 11 baseline topic models for comparison with THETA.

### Supported Models
| Model | Type | Description | Model-Specific Parameters |
|---|---|---|---|
| lda | Traditional | Latent Dirichlet Allocation | --max_iter |
| hdp | Traditional | Hierarchical Dirichlet Process (auto topic count) | --max_topics, --alpha |
| stm | Traditional | Structural Topic Model (requires covariates) | --max_iter |
| btm | Traditional | Biterm Topic Model (best for short texts) | --n_iter, --alpha, --beta |
| nvdm | Neural | Neural Variational Document Model | --epochs, --dropout |
| gsm | Neural | Gaussian Softmax Model | --epochs, --dropout |
| prodlda | Neural | Product of Experts LDA | --epochs, --dropout |
| ctm | Neural | Contextualized Topic Model (requires SBERT) | --epochs, --inference_type |
| etm | Neural | Embedded Topic Model (requires Word2Vec) | --epochs |
| dtm | Neural | Dynamic Topic Model (requires timestamps) | --epochs |
| bertopic | Neural | BERT-based Topic Model (auto topic count) | - |
### Complete Per-Model Examples
```bash
# ============================================================
# 1. LDA — Latent Dirichlet Allocation
#    Type: Traditional | Data: BOW only
# ============================================================
bash scripts/05_train_baseline.sh \
  --dataset edu_data --models lda --num_topics 20

bash scripts/05_train_baseline.sh \
  --dataset edu_data --models lda \
  --num_topics 20 --max_iter 200 \
  --gpu 0 --language zh --with-viz \
  --data_exp exp_20260208_153424_vocab3500_lda \
  --exp_name lda_full

# ============================================================
# 2. HDP — Hierarchical Dirichlet Process
#    Note: Auto-determines topic count, --num_topics is IGNORED
# ============================================================
bash scripts/05_train_baseline.sh \
  --dataset edu_data --models hdp

bash scripts/05_train_baseline.sh \
  --dataset edu_data --models hdp \
  --max_topics 150 --alpha 1.0 \
  --gpu 0 --language zh --with-viz \
  --data_exp exp_20260208_153424_vocab3500_lda \
  --exp_name hdp_full

# ============================================================
# 3. STM — Structural Topic Model
#    REQUIRES covariates — auto-skipped if dataset has no metadata
# ============================================================
#
# To use STM:
#   1. Ensure your cleaned CSV has metadata columns
#   2. Register covariates in ETM/config.py → DATASET_CONFIGS
#   3. Prepare data (same as other BOW models)
#   4. Train STM
bash scripts/05_train_baseline.sh \
  --dataset my_dataset_with_covariates --models stm --num_topics 20

bash scripts/05_train_baseline.sh \
  --dataset my_dataset_with_covariates --models stm \
  --num_topics 20 --max_iter 200 \
  --gpu 0 --language en --with-viz \
  --data_exp exp_20260208_153424_vocab3500_lda \
  --exp_name stm_full

# ============================================================
# 4. BTM — Biterm Topic Model
#    Best suited for short texts (tweets, comments)
# ============================================================
bash scripts/05_train_baseline.sh \
  --dataset edu_data --models btm --num_topics 20

bash scripts/05_train_baseline.sh \
  --dataset edu_data --models btm \
  --num_topics 20 --n_iter 100 --alpha 1.0 --beta 0.01 \
  --gpu 0 --language zh --with-viz \
  --data_exp exp_20260208_153424_vocab3500_lda \
  --exp_name btm_full

# ============================================================
# 5. NVDM / 6. GSM / 7. ProdLDA — BOW-only neural models
# ============================================================
bash scripts/05_train_baseline.sh \
  --dataset edu_data --models nvdm --num_topics 20

bash scripts/05_train_baseline.sh \
  --dataset edu_data --models nvdm \
  --num_topics 20 --epochs 200 --batch_size 128 \
  --hidden_dim 512 --learning_rate 0.002 --dropout 0.2 \
  --gpu 0 --language zh --with-viz \
  --data_exp exp_20260208_153424_vocab3500_lda \
  --exp_name nvdm_full
# (Replace nvdm with gsm or prodlda for those models)

# ============================================================
# 8. CTM — Contextualized Topic Model
#    Requires SBERT data_exp (prepared with --model ctm)
# ============================================================
bash scripts/05_train_baseline.sh \
  --dataset edu_data --models ctm --num_topics 20

bash scripts/05_train_baseline.sh \
  --dataset edu_data --models ctm \
  --num_topics 20 --epochs 100 --inference_type zeroshot \
  --batch_size 64 --hidden_dim 512 --learning_rate 0.002 \
  --gpu 0 --language zh --with-viz \
  --data_exp exp_20260208_154645_vocab3500_ctm \
  --exp_name ctm_zeroshot

# Combined inference (uses both BOW and SBERT)
bash scripts/05_train_baseline.sh \
  --dataset edu_data --models ctm \
  --num_topics 20 --epochs 100 --inference_type combined \
  --gpu 0 --language zh --with-viz

# ============================================================
# 9. ETM — Embedded Topic Model (BOW + Word2Vec)
# ============================================================
bash scripts/05_train_baseline.sh \
  --dataset edu_data --models etm --num_topics 20

bash scripts/05_train_baseline.sh \
  --dataset edu_data --models etm \
  --num_topics 20 --epochs 200 --batch_size 64 \
  --hidden_dim 512 --learning_rate 0.002 \
  --gpu 0 --language zh --with-viz \
  --data_exp exp_20260208_153424_vocab3500_lda \
  --exp_name etm_full

# ============================================================
# 10. DTM — Dynamic Topic Model (requires timestamps)
# ============================================================
bash scripts/05_train_baseline.sh \
  --dataset edu_data --models dtm --num_topics 20

bash scripts/05_train_baseline.sh \
  --dataset edu_data --models dtm \
  --num_topics 20 --epochs 200 --batch_size 64 \
  --hidden_dim 512 --learning_rate 0.002 \
  --gpu 0 --language zh --with-viz \
  --data_exp exp_20260208_171413_vocab3500_dtm \
  --exp_name dtm_full

# ============================================================
# 11. BERTopic — Auto-determines topic count
# ============================================================
bash scripts/05_train_baseline.sh \
  --dataset edu_data --models bertopic

bash scripts/05_train_baseline.sh \
  --dataset edu_data --models bertopic \
  --gpu 0 --language zh --with-viz \
  --data_exp exp_20260208_154645_vocab3500_ctm \
  --exp_name bertopic_full

# ============================================================
# Batch training (multiple models at once)
# ============================================================

# Train all BOW-only models (share the same data_exp)
bash scripts/05_train_baseline.sh \
  --dataset edu_data \
  --models lda,hdp,btm,nvdm,gsm,prodlda \
  --num_topics 20 --epochs 100 \
  --data_exp exp_20260208_153424_vocab3500_lda

# Train CTM + BERTopic (share SBERT data_exp)
bash scripts/05_train_baseline.sh \
  --dataset edu_data --models ctm,bertopic \
  --num_topics 20 --epochs 100 \
  --data_exp exp_20260208_154645_vocab3500_ctm

# Skip training, only evaluate and visualize existing model
bash scripts/05_train_baseline.sh \
  --dataset edu_data --models lda --num_topics 20 --skip-train

# Enable visualization (disabled by default)
bash scripts/05_train_baseline.sh \
  --dataset edu_data --models lda --num_topics 20 \
  --with-viz --language zh
```
Important notes:

- BTM uses Gibbs sampling and is very slow on long documents (it samples at most 50 words per document). Best for short texts.
- HDP and BERTopic auto-determine the topic count; `--num_topics` is ignored for these models.
- STM requires document-level covariates. If your dataset has no `covariate_columns` in `DATASET_CONFIGS`, STM is automatically skipped.
- DTM requires a data experiment containing `time_slices.json` (prepared with `--model dtm`).
- CTM and BERTopic require a data experiment containing SBERT embeddings.
### Parameter Reference

Common parameters:

| Parameter | Required | Description | Default |
|---|---|---|---|
| `--dataset` | Yes | Dataset name | - |
| `--models` | Yes | Model list (comma-separated) | - |
| `--num_topics` | No | Number of topics (ignored for hdp/bertopic) | 20 |
| `--vocab_size` | No | Vocabulary size | 5000 |
| `--epochs` | No | Training epochs (neural models) | 100 |
| `--batch_size` | No | Batch size | 64 |
| `--hidden_dim` | No | Hidden layer dimension | 512 |
| `--learning_rate` | No | Learning rate | 0.002 |
| `--gpu` | No | GPU device ID | 0 |
| `--language` | No | Visualization language: en, zh | en |
| `--skip-train` | No | Skip training | false |
| `--with-viz` | No | Enable visualization | false |
| `--data_exp` | No | Data experiment ID | auto latest |
| `--exp_name` | No | Experiment name tag | auto-generated |
Model-specific parameters:

| Parameter | Applicable Models | Description | Default |
|---|---|---|---|
| `--max_iter` | lda, stm | Max iterations (EM algorithm) | 100 |
| `--max_topics` | hdp | Max topic count | 150 |
| `--n_iter` | btm | Gibbs sampling iterations | 100 |
| `--alpha` | hdp, btm | Alpha prior | 1.0 |
| `--beta` | btm | Beta prior | 0.01 |
| `--inference_type` | ctm | Inference type: zeroshot, combined | zeroshot |
| `--dropout` | Neural models (nvdm, gsm, prodlda, ctm, etm, dtm) | Dropout rate | 0.2 |
## E) Visualization — 06_visualize.sh

Generate visualizations for trained models without re-training.
```bash
# THETA model visualization
bash scripts/06_visualize.sh \
  --dataset edu_data --model_size 0.6B --mode zero_shot --language zh

# English charts + high DPI (for papers)
bash scripts/06_visualize.sh \
  --dataset edu_data --model_size 0.6B --mode zero_shot --language en --dpi 600

# Baseline model visualization
bash scripts/06_visualize.sh \
  --baseline --dataset edu_data --model lda --num_topics 20 --language zh

# HDP (auto topic count, use actual K from training)
bash scripts/06_visualize.sh \
  --baseline --dataset edu_data --model hdp --num_topics 150 --language zh

# DTM (includes topic evolution charts)
bash scripts/06_visualize.sh \
  --baseline --dataset edu_data --model dtm --num_topics 20 --language zh

# Specify a model experiment explicitly
bash scripts/06_visualize.sh \
  --baseline --dataset edu_data --model ctm --model_exp exp_20260208_xxx --language zh
```
| Parameter | Description | Default |
|---|---|---|
| `--dataset` | Dataset name (required) | — |
| `--baseline` | Baseline model mode | false |
| `--model` | Baseline model name | — |
| `--model_exp` | Model experiment ID (auto-selects latest if not specified) | auto latest |
| `--model_size` | THETA model size | 0.6B |
| `--mode` | THETA mode | zero_shot |
| `--language` | Visualization language: en, zh | en |
| `--dpi` | Image DPI | 300 |
Generated charts (20+ types):
| Chart | Description | Filename |
|---|---|---|
| Topic Table | Top words per topic | topic_table.png |
| Topic Network | Inter-topic similarity network | topic_network.png |
| Document Clusters | UMAP document distribution | doc_topic_umap.png |
| Cluster Heatmap | Topic-document heatmap | cluster_heatmap.png |
| Topic Proportion | Document proportion per topic | topic_proportion.png |
| Training Loss | Loss curve | training_loss.png |
| Evaluation Metrics | 7-metric radar chart | metrics.png |
| Topic Coherence | Per-topic NPMI | topic_coherence.png |
| Topic Exclusivity | Per-topic exclusivity | topic_exclusivity.png |
| Word Clouds | All topic word clouds | topic_wordclouds.png |
| Topic Similarity | Inter-topic cosine similarity | topic_similarity.png |
| pyLDAvis | Interactive topic explorer | pyldavis_interactive.html |
| Per-topic Words | Per-topic word weights | topics/topic_N/word_importance.png |
## F) Evaluation — 07_evaluate.sh

Standalone evaluation with 7 unified metrics.
```bash
# Evaluate baseline models
bash scripts/07_evaluate.sh --dataset edu_data --model lda --num_topics 20
bash scripts/07_evaluate.sh --dataset edu_data --model hdp --num_topics 150
bash scripts/07_evaluate.sh --dataset edu_data --model ctm --num_topics 20
bash scripts/07_evaluate.sh --dataset edu_data --model etm --num_topics 20
bash scripts/07_evaluate.sh --dataset edu_data --model dtm --num_topics 20
bash scripts/07_evaluate.sh --dataset edu_data --model bertopic --num_topics 20

# With custom vocab size
bash scripts/07_evaluate.sh --dataset edu_data --model lda --num_topics 20 --vocab_size 3500

# Evaluate THETA models
bash scripts/07_evaluate.sh --dataset edu_data --model theta --model_size 0.6B --mode zero_shot
bash scripts/07_evaluate.sh --dataset edu_data --model theta --model_size 0.6B --mode unsupervised
bash scripts/07_evaluate.sh --dataset edu_data --model theta --model_size 4B --mode supervised
```
| Parameter | Description | Default |
|---|---|---|
| `--dataset` | Dataset name (required) | — |
| `--model` | Model name (required): lda, hdp, stm, btm, nvdm, gsm, prodlda, ctm, etm, dtm, bertopic, theta | — |
| `--num_topics` | Number of topics | 20 |
| `--vocab_size` | Vocabulary size | 5000 |
| `--model_size` | THETA model size: 0.6B, 4B, 8B | 0.6B |
| `--mode` | THETA mode: zero_shot, unsupervised, supervised | zero_shot |
Evaluation Metrics (7 metrics):
| Metric | Full Name | Direction | Description |
|---|---|---|---|
| TD | Topic Diversity | ↑ Higher is better | Proportion of unique words across topics |
| iRBO | Inverse Rank-Biased Overlap | ↑ Higher is better | Rank-based topic diversity |
| NPMI | Normalized PMI | ↑ Higher is better | Normalized pointwise mutual information coherence |
| C_V | C_V Coherence | ↑ Higher is better | Sliding-window based coherence |
| UMass | UMass Coherence | → Closer to 0 is better | Document co-occurrence based coherence |
| Exclusivity | Topic Exclusivity | ↑ Higher is better | How exclusive words are to their topics |
| PPL | Perplexity | ↓ Lower is better | Model fit (lower = better generalization) |
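
To make the first metric concrete: TD is simply the fraction of distinct words among all topics' top words. A small sketch of that computation (the input format, one topic per line of space-separated top words, is our assumption, not the toolkit's):

```bash
# topic_diversity: unique words / total words over all top-word lists
# read from stdin, one topic per line. (Illustrative, not toolkit code.)
topic_diversity() {
  tr ' ' '\n' | awk 'NF { total++; if (!seen[$0]++) uniq++ }
                     END { printf "%.4f\n", uniq / total }'
}

printf 'school teacher student\npolicy school reform\n' | topic_diversity   # 5 unique / 6 total
```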
## G) Model Comparison — 08_compare_models.sh

Cross-model metric comparison table.
```bash
# Compare all baseline models
bash scripts/08_compare_models.sh \
  --dataset edu_data \
  --models lda,hdp,btm,nvdm,gsm,prodlda,ctm,etm,dtm,bertopic \
  --num_topics 20

# Compare traditional models only
bash scripts/08_compare_models.sh \
  --dataset edu_data --models lda,hdp,btm --num_topics 20

# Compare neural models only
bash scripts/08_compare_models.sh \
  --dataset edu_data --models nvdm,gsm,prodlda,ctm,etm,dtm --num_topics 20

# Export to CSV
bash scripts/08_compare_models.sh \
  --dataset edu_data --models lda,hdp,nvdm,gsm,prodlda,ctm,etm,dtm \
  --num_topics 20 --output comparison.csv
```
Example output:

```text
================================================================================
Model Comparison: edu_data (K=20)
================================================================================
Model      TD      iRBO    NPMI    C_V     UMass     Exclusivity   PPL
--------------------------------------------------------------------------------
lda        0.8500  0.7200  0.0512  0.4231  -2.1234   0.6543        123.45
prodlda    0.9200  0.8100  0.0634  0.4567  -1.8765   0.7234         98.76
ctm        0.8800  0.7800  0.0589  0.4412  -1.9876   0.6987        105.32
--------------------------------------------------------------------------------
Best Models:
  - Best TD (Topic Diversity): prodlda (0.9200)
  - Best NPMI (Coherence):     prodlda (0.0634)
  - Best PPL (Perplexity):     prodlda (98.76)
```
| Parameter | Description | Default |
|---|---|---|
| `--dataset` | Dataset name (required) | — |
| `--models` | Comma-separated model list (required) | — |
| `--num_topics` | Number of topics | 20 |
| `--output` | Output CSV file path | terminal only |
## H) Multi-GPU Training — 12_train_multi_gpu.sh

THETA supports multi-GPU training using PyTorch DistributedDataParallel (DDP).
```bash
# Train with 2 GPUs
bash scripts/12_train_multi_gpu.sh --dataset hatespeech --num_gpus 2 --num_topics 20

# Full parameters
bash scripts/12_train_multi_gpu.sh --dataset hatespeech \
  --num_gpus 4 --model_size 0.6B --mode zero_shot \
  --num_topics 25 --epochs 150 --batch_size 64 \
  --hidden_dim 768 --learning_rate 0.001

# Custom master port (for multiple concurrent jobs)
bash scripts/12_train_multi_gpu.sh --dataset socialTwitter \
  --num_gpus 2 --master_port 29501

# Or use torchrun directly
torchrun --nproc_per_node=2 --master_port=29500 \
  ETM/main.py train \
  --dataset hatespeech --mode zero_shot --num_topics 20 --epochs 100
```
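
For batch jobs it can help to derive `--num_gpus` (or torchrun's `--nproc_per_node`) from the environment rather than hard-coding it. A bash sketch, assuming `CUDA_VISIBLE_DEVICES` holds a comma-separated device list (the `gpu_count` helper is ours):

```bash
# gpu_count: number of devices listed in CUDA_VISIBLE_DEVICES (default "0").
# (Illustrative helper; assumes a comma-separated list like "0,1,3".)
gpu_count() {
  local devs
  IFS=',' read -ra devs <<< "${CUDA_VISIBLE_DEVICES:-0}"
  echo "${#devs[@]}"
}

# e.g. bash scripts/12_train_multi_gpu.sh --dataset hatespeech \
#        --num_gpus "$(gpu_count)" --num_topics 20
```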
## I) Agent API — 14_start_agent_api.sh

Start the AI Agent API server for interactive analysis.
```bash
# Start the Agent API (default port 8000)
bash scripts/14_start_agent_api.sh --port 8000

# Test agent connection
bash scripts/13_test_agent.sh
```

API endpoints: `POST /chat`, `POST /api/chat/v2`, `POST /api/interpret/metrics`, `POST /api/interpret/topics`, `POST /api/vision/analyze`. See agent/docs/API_REFERENCE.md for full details.
## J) Batch Processing Examples
```bash
# Train THETA across multiple datasets
for dataset in hatespeech mental_health socialTwitter; do
  bash scripts/04_train_theta.sh --dataset $dataset \
    --model_size 0.6B --mode zero_shot --num_topics 20
done

# Compare different topic counts
for k in 10 15 20 25 30; do
  bash scripts/04_train_theta.sh --dataset hatespeech \
    --model_size 0.6B --mode zero_shot --num_topics $k
done

# Generate visualizations for all trained baseline models
for model in lda etm ctm prodlda; do
  bash scripts/06_visualize.sh --baseline --dataset hatespeech \
    --model $model --num_topics 20 --language en
done
```
## K) End-to-End Example: edu_data

Complete workflow using edu_data (823 Chinese education policy documents).
```bash
# 1. Setup
bash scripts/01_setup.sh

# 2. Data cleaning
bash scripts/02_clean_data.sh --input ./data/edu_data/ --language chinese

# 3. Data preparation — baselines
bash scripts/03_prepare_data.sh \
  --dataset edu_data --model lda --vocab_size 3500 --language chinese
bash scripts/03_prepare_data.sh \
  --dataset edu_data --model ctm --vocab_size 3500 --language chinese
bash scripts/03_prepare_data.sh \
  --dataset edu_data --model dtm --vocab_size 3500 --language chinese --time_column year

# 4. Data preparation — THETA
bash scripts/03_prepare_data.sh \
  --dataset edu_data --model theta --model_size 0.6B --mode zero_shot \
  --vocab_size 3500 --language chinese

# 5. Train baselines
bash scripts/05_train_baseline.sh \
  --dataset edu_data --models lda,hdp,btm,nvdm,gsm,prodlda \
  --num_topics 20 --epochs 100

# 6. Train THETA
bash scripts/04_train_theta.sh \
  --dataset edu_data --model_size 0.6B --mode zero_shot \
  --num_topics 20 --epochs 100 --language zh

# 7. Compare models
bash scripts/08_compare_models.sh \
  --dataset edu_data \
  --models lda,hdp,btm,nvdm,gsm,prodlda,ctm,etm \
  --num_topics 20
```