Skip to content

Shell Scripts Reference

All scripts are non-interactive (pure command-line parameters), suitable for DLC/batch environments. No stdin input required.

Script Overview

Script Description
01_setup.sh Install dependencies and download data from HuggingFace
02_clean_data.sh Clean raw text data (tokenization, stopword removal, lemmatization)
02_generate_embeddings.sh Generate Qwen embeddings (sub-script of 03, for failure recovery)
03_prepare_data.sh One-stop data preparation: BOW + embeddings for all 12 models
04_train_theta.sh Train THETA model (train + evaluate + visualize)
05_train_baseline.sh Train 11 baseline models for comparison with THETA
06_visualize.sh Generate visualizations for trained models
07_evaluate.sh Standalone evaluation with 7 unified metrics
08_compare_models.sh Cross-model metric comparison table
09_download_from_hf.sh Download pre-trained data from HuggingFace
10_quick_start_english.sh Quick start for English datasets
11_quick_start_chinese.sh Quick start for Chinese datasets
12_train_multi_gpu.sh Multi-GPU training with DistributedDataParallel
13_test_agent.sh Test LLM Agent connection and functionality
14_start_agent_api.sh Start the Agent API server (FastAPI)

A) Data Cleaning — 02_clean_data.sh

Row-by-row text cleaning with user-specified column selection. Two modes: - CSV mode: User specifies --text_column (cleaned) and --label_columns (preserved as-is) - Directory mode: Convert docx/txt files into a single cleaned CSV

Supported languages: english, chinese, german, spanish

# 1. Preview columns (recommended first step for CSV)
bash scripts/02_clean_data.sh \
    --input data/FCPB/complaints_text_only.csv --preview

# 2. Clean text column only
bash scripts/02_clean_data.sh \
    --input data/FCPB/complaints_text_only.csv \
    --language english \
    --text_column 'Consumer complaint narrative'

# 3. Clean text + keep label column
bash scripts/02_clean_data.sh \
    --input data/hatespeech/hatespeech_text_only.csv \
    --language english \
    --text_column cleaned_content --label_columns Label

# 4. Keep ALL columns, only clean the text column
bash scripts/02_clean_data.sh \
    --input raw.csv --language english \
    --text_column text --keep_all

# 5. Directory mode (docx/txt → CSV)
bash scripts/02_clean_data.sh \
    --input data/edu_data/ --language chinese
Parameter Required Description Default
--input Yes Input CSV file or directory (docx/txt) -
--language Yes (not for preview) Data language: english, chinese, german, spanish -
--text_column Yes (CSV mode) Name of the text column to clean -
--label_columns Comma-separated label/metadata columns to keep as-is -
--keep_all Keep ALL original columns (only text column is cleaned) false
--preview Show CSV columns and sample rows, then exit false
--output Output CSV path auto-generated
--min_words Min words per document after cleaning 3

Output: data/{dataset}/{dataset}_cleaned.csv


B) Data Preparation — 03_prepare_data.sh

One-stop data preparation for all 12 models. Generates BOW matrix and model-specific embeddings.

Data requirements by model:

Model Type Data Needed
lda, hdp, btm Traditional BOW only
stm Traditional BOW + covariates (document metadata)
nvdm, gsm, prodlda Neural BOW only
etm Neural BOW + Word2Vec
ctm Neural BOW + SBERT
dtm Neural BOW + SBERT + time slices
bertopic Neural SBERT + raw text
theta THETA BOW + Qwen embeddings

Note: Models 1-7 (BOW-only) share the same data experiment. Prepare once, train all.

# ---- Baseline models ----

# BOW-only models (lda, hdp, btm, nvdm, gsm, prodlda share this)
bash scripts/03_prepare_data.sh \
    --dataset edu_data --model lda --vocab_size 3500 --language chinese

# CTM (BOW + SBERT embeddings)
bash scripts/03_prepare_data.sh \
    --dataset edu_data --model ctm --vocab_size 3500 --language chinese

# ETM (BOW + Word2Vec embeddings)
bash scripts/03_prepare_data.sh \
    --dataset edu_data --model etm --vocab_size 3500 --language chinese

# DTM (BOW + SBERT + time slices, requires time column)
bash scripts/03_prepare_data.sh \
    --dataset edu_data --model dtm --vocab_size 3500 --language chinese --time_column year

# BERTopic (SBERT + raw text)
bash scripts/03_prepare_data.sh \
    --dataset edu_data --model bertopic --vocab_size 3500 --language chinese

# ---- THETA model ----

# Zero-shot (fastest, no training needed)
bash scripts/03_prepare_data.sh \
    --dataset edu_data --model theta --model_size 0.6B --mode zero_shot \
    --vocab_size 3500 --language chinese

# Unsupervised (LoRA fine-tuned Qwen embeddings)
bash scripts/03_prepare_data.sh \
    --dataset edu_data --model theta --model_size 0.6B --mode unsupervised \
    --vocab_size 3500 --language chinese

# Supervised (requires label column)
bash scripts/03_prepare_data.sh \
    --dataset edu_data --model theta --model_size 0.6B --mode supervised \
    --vocab_size 3500 --language chinese

# ---- Advanced options ----

# BOW only (skip embedding generation)
bash scripts/03_prepare_data.sh --dataset mydata --model theta --bow-only --vocab_size 5000

# Check if data files already exist
bash scripts/03_prepare_data.sh --dataset mydata --model theta --check-only

# Custom vocabulary size and max sequence length
bash scripts/03_prepare_data.sh --dataset mydata \
    --model theta --model_size 0.6B --mode zero_shot \
    --vocab_size 10000 --batch_size 64 --gpu 0
Parameter Required Description Default
--dataset Yes Dataset name -
--model Yes Target model: lda, hdp, stm (requires covariates), btm, nvdm, gsm, prodlda, ctm, etm, dtm, bertopic, theta -
--model_size Qwen model size (theta only): 0.6B, 4B, 8B 0.6B
--mode Embedding mode (theta only): zero_shot, unsupervised, supervised zero_shot
--vocab_size Vocabulary size 5000
--batch_size Embedding generation batch size 32
--gpu GPU device ID 0
--language Data language: english, chinese (controls tokenization) english
--bow-only Only generate BOW, skip embeddings false
--check-only Only check if files exist false
--time_column Time column name (DTM only) year
--label_column Label column (theta supervised only) -
--emb_epochs Embedding fine-tuning epochs (theta only) 10
--emb_batch_size Embedding fine-tuning batch size (theta only) 8
--exp_name Experiment name tag auto-generated

Embedding recovery — If embedding generation fails (e.g., OOM), re-run only the embedding step:

bash scripts/02_generate_embeddings.sh \
    --dataset edu_data --mode zero_shot --model_size 0.6B \
    --batch_size 4 --exp_dir result/0.6B/edu_data/data/exp_xxx

C) THETA Model Training — 04_train_theta.sh

Train THETA model with integrated training + evaluation + visualization.

# ---- Basic usage ----

# Zero-shot mode (simplest command)
bash scripts/04_train_theta.sh \
    --dataset edu_data --model_size 0.6B --mode zero_shot --num_topics 20

# Unsupervised mode
bash scripts/04_train_theta.sh \
    --dataset edu_data --model_size 0.6B --mode unsupervised --num_topics 20

# Supervised mode (requires label column)
bash scripts/04_train_theta.sh \
    --dataset hatespeech --model_size 0.6B --mode supervised --num_topics 20

# Larger model for better quality
bash scripts/04_train_theta.sh \
    --dataset hatespeech --model_size 4B --mode zero_shot --num_topics 20

# ---- Full parameters ----

bash scripts/04_train_theta.sh \
    --dataset edu_data --model_size 0.6B --mode zero_shot \
    --num_topics 20 --epochs 100 --batch_size 64 \
    --hidden_dim 512 --learning_rate 0.002 \
    --kl_start 0.0 --kl_end 1.0 --kl_warmup 50 \
    --patience 10 --gpu 0 --language zh

# Custom KL annealing
bash scripts/04_train_theta.sh \
    --dataset hatespeech --model_size 0.6B --mode zero_shot \
    --num_topics 20 --epochs 200 \
    --kl_start 0.1 --kl_end 0.8 --kl_warmup 40

# ---- Specify data experiment ----

bash scripts/04_train_theta.sh \
    --dataset edu_data --model_size 0.6B --mode zero_shot \
    --data_exp exp_20260208_151906_vocab3500_theta_0.6B_zero_shot \
    --num_topics 20 --epochs 50 --language zh

# ---- Skip options ----

# Skip visualization (train + evaluate only, faster)
bash scripts/04_train_theta.sh \
    --dataset edu_data --model_size 0.6B --mode zero_shot \
    --num_topics 20 --skip-viz

# Skip training (evaluate + visualize existing model)
bash scripts/04_train_theta.sh \
    --dataset edu_data --model_size 0.6B --mode zero_shot \
    --skip-train --language zh
Parameter Required Description Default
--dataset Yes Dataset name -
--model_size Qwen model size: 0.6B, 4B, 8B 0.6B
--mode Embedding mode: zero_shot, unsupervised, supervised zero_shot
--num_topics Number of topics K 20
--epochs Training epochs 100
--batch_size Training batch size 64
--hidden_dim Encoder hidden dimension 512
--learning_rate Learning rate 0.002
--kl_start KL annealing start weight 0.0
--kl_end KL annealing end weight 1.0
--kl_warmup KL warmup epochs 50
--patience Early stopping patience 10
--gpu GPU device ID 0
--language Visualization language: en, zh en
--skip-train Skip training, only evaluate false
--skip-viz Skip visualization false
--data_exp Data experiment ID auto latest
--exp_name Experiment name tag auto-generated

D) Baseline Model Training — 05_train_baseline.sh

Train 11 baseline topic models for comparison with THETA.

Supported Models

Model Type Description Model-Specific Parameters
lda Traditional Latent Dirichlet Allocation --max_iter
hdp Traditional Hierarchical Dirichlet Process (auto topic count) --max_topics, --alpha
stm Traditional Structural Topic Model (requires covariates) --max_iter
btm Traditional Biterm Topic Model (best for short texts) --n_iter, --alpha, --beta
nvdm Neural Neural Variational Document Model --epochs, --dropout
gsm Neural Gaussian Softmax Model --epochs, --dropout
prodlda Neural Product of Experts LDA --epochs, --dropout
ctm Neural Contextualized Topic Model (requires SBERT) --epochs, --inference_type
etm Neural Embedded Topic Model (requires Word2Vec) --epochs
dtm Neural Dynamic Topic Model (requires timestamps) --epochs
bertopic Neural BERT-based Topic Model (auto topic count) -

Complete Per-Model Examples

# ============================================================
# 1. LDA — Latent Dirichlet Allocation
#    Type: Traditional | Data: BOW only
# ============================================================

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models lda --num_topics 20

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models lda \
    --num_topics 20 --max_iter 200 \
    --gpu 0 --language zh --with-viz \
    --data_exp exp_20260208_153424_vocab3500_lda \
    --exp_name lda_full

# ============================================================
# 2. HDP — Hierarchical Dirichlet Process
#    Note: Auto-determines topic count, --num_topics is IGNORED
# ============================================================

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models hdp

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models hdp \
    --max_topics 150 --alpha 1.0 \
    --gpu 0 --language zh --with-viz \
    --data_exp exp_20260208_153424_vocab3500_lda \
    --exp_name hdp_full

# ============================================================
# 3. STM — Structural Topic Model
#    REQUIRES covariates — auto-skipped if dataset has no metadata
# ============================================================
#
# To use STM:
#   1. Ensure your cleaned CSV has metadata columns
#   2. Register covariates in ETM/config.py → DATASET_CONFIGS
#   3. Prepare data (same as other BOW models)
#   4. Train STM

bash scripts/05_train_baseline.sh \
    --dataset my_dataset_with_covariates --models stm --num_topics 20

bash scripts/05_train_baseline.sh \
    --dataset my_dataset_with_covariates --models stm \
    --num_topics 20 --max_iter 200 \
    --gpu 0 --language en --with-viz \
    --data_exp exp_20260208_153424_vocab3500_lda \
    --exp_name stm_full

# ============================================================
# 4. BTM — Biterm Topic Model
#    Best suited for short texts (tweets, comments)
# ============================================================

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models btm --num_topics 20

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models btm \
    --num_topics 20 --n_iter 100 --alpha 1.0 --beta 0.01 \
    --gpu 0 --language zh --with-viz \
    --data_exp exp_20260208_153424_vocab3500_lda \
    --exp_name btm_full

# ============================================================
# 5. NVDM / 6. GSM / 7. ProdLDA — BOW-only neural models
# ============================================================

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models nvdm --num_topics 20

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models nvdm \
    --num_topics 20 --epochs 200 --batch_size 128 \
    --hidden_dim 512 --learning_rate 0.002 --dropout 0.2 \
    --gpu 0 --language zh --with-viz \
    --data_exp exp_20260208_153424_vocab3500_lda \
    --exp_name nvdm_full

# (Replace nvdm with gsm or prodlda for those models)

# ============================================================
# 8. CTM — Contextualized Topic Model
#    Requires SBERT data_exp (prepared with --model ctm)
# ============================================================

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models ctm --num_topics 20

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models ctm \
    --num_topics 20 --epochs 100 --inference_type zeroshot \
    --batch_size 64 --hidden_dim 512 --learning_rate 0.002 \
    --gpu 0 --language zh --with-viz \
    --data_exp exp_20260208_154645_vocab3500_ctm \
    --exp_name ctm_zeroshot

# Combined inference (uses both BOW and SBERT)
bash scripts/05_train_baseline.sh \
    --dataset edu_data --models ctm \
    --num_topics 20 --epochs 100 --inference_type combined \
    --gpu 0 --language zh --with-viz

# ============================================================
# 9. ETM — Embedded Topic Model (BOW + Word2Vec)
# ============================================================

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models etm --num_topics 20

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models etm \
    --num_topics 20 --epochs 200 --batch_size 64 \
    --hidden_dim 512 --learning_rate 0.002 \
    --gpu 0 --language zh --with-viz \
    --data_exp exp_20260208_153424_vocab3500_lda \
    --exp_name etm_full

# ============================================================
# 10. DTM — Dynamic Topic Model (requires timestamps)
# ============================================================

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models dtm --num_topics 20

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models dtm \
    --num_topics 20 --epochs 200 --batch_size 64 \
    --hidden_dim 512 --learning_rate 0.002 \
    --gpu 0 --language zh --with-viz \
    --data_exp exp_20260208_171413_vocab3500_dtm \
    --exp_name dtm_full

# ============================================================
# 11. BERTopic — Auto-determines topic count
# ============================================================

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models bertopic

bash scripts/05_train_baseline.sh \
    --dataset edu_data --models bertopic \
    --gpu 0 --language zh --with-viz \
    --data_exp exp_20260208_154645_vocab3500_ctm \
    --exp_name bertopic_full

# ============================================================
# Batch training (multiple models at once)
# ============================================================

# Train all BOW-only models (share the same data_exp)
bash scripts/05_train_baseline.sh \
    --dataset edu_data \
    --models lda,hdp,btm,nvdm,gsm,prodlda \
    --num_topics 20 --epochs 100 \
    --data_exp exp_20260208_153424_vocab3500_lda

# Train CTM + BERTopic (share SBERT data_exp)
bash scripts/05_train_baseline.sh \
    --dataset edu_data --models ctm,bertopic \
    --num_topics 20 --epochs 100 \
    --data_exp exp_20260208_154645_vocab3500_ctm

# Skip training, only evaluate and visualize existing model
bash scripts/05_train_baseline.sh \
    --dataset edu_data --models lda --num_topics 20 --skip-train

# Enable visualization (disabled by default)
bash scripts/05_train_baseline.sh \
    --dataset edu_data --models lda --num_topics 20 \
    --with-viz --language zh

Important notes: - BTM uses Gibbs sampling and is very slow on long documents (samples max 50 words/doc). Best for short texts. - HDP and BERTopic auto-determine topic count; --num_topics is ignored for these models. - STM requires document-level covariates. If your dataset has no covariate_columns in DATASET_CONFIGS, STM will be automatically skipped. - DTM requires a data experiment containing time_slices.json (prepared with --model dtm). - CTM and BERTopic require a data experiment containing SBERT embeddings.

Parameter Reference

Common parameters:

Parameter Required Description Default
--dataset Yes Dataset name -
--models Yes Model list (comma-separated) -
--num_topics Number of topics (ignored for hdp/bertopic) 20
--vocab_size Vocabulary size 5000
--epochs Training epochs (neural models) 100
--batch_size Batch size 64
--hidden_dim Hidden layer dimension 512
--learning_rate Learning rate 0.002
--gpu GPU device ID 0
--language Visualization language: en, zh en
--skip-train Skip training false
--with-viz Enable visualization false
--data_exp Data experiment ID auto latest
--exp_name Experiment name tag auto-generated

Model-specific parameters:

Parameter Applicable Models Description Default
--max_iter lda, stm Max iterations (EM algorithm) 100
--max_topics hdp Max topic count 150
--n_iter btm Gibbs sampling iterations 100
--alpha hdp, btm Alpha prior 1.0
--beta btm Beta prior 0.01
--inference_type ctm Inference type: zeroshot, combined zeroshot
--dropout Neural models (nvdm, gsm, prodlda, ctm, etm, dtm) Dropout rate 0.2

E) Visualization — 06_visualize.sh

Generate visualizations for trained models without re-training.

# THETA model visualization
bash scripts/06_visualize.sh \
    --dataset edu_data --model_size 0.6B --mode zero_shot --language zh

# English charts + high DPI (for papers)
bash scripts/06_visualize.sh \
    --dataset edu_data --model_size 0.6B --mode zero_shot --language en --dpi 600

# Baseline model visualization
bash scripts/06_visualize.sh \
    --baseline --dataset edu_data --model lda --num_topics 20 --language zh

# HDP (auto topic count, use actual K from training)
bash scripts/06_visualize.sh \
    --baseline --dataset edu_data --model hdp --num_topics 150 --language zh

# DTM (includes topic evolution charts)
bash scripts/06_visualize.sh \
    --baseline --dataset edu_data --model dtm --num_topics 20 --language zh

# Specify a model experiment explicitly
bash scripts/06_visualize.sh \
    --baseline --dataset edu_data --model ctm --model_exp exp_20260208_xxx --language zh
Parameter Description Default
--dataset Dataset name (required)
--baseline Baseline model mode false
--model Baseline model name
--model_exp Model experiment ID (auto-selects latest if not specified) auto latest
--model_size THETA model size 0.6B
--mode THETA mode zero_shot
--language Visualization language: en, zh en
--dpi Image DPI 300

Generated charts (20+ types):

Chart Description Filename
Topic Table Top words per topic topic_table.png
Topic Network Inter-topic similarity network topic_network.png
Document Clusters UMAP document distribution doc_topic_umap.png
Cluster Heatmap Topic-document heatmap cluster_heatmap.png
Topic Proportion Document proportion per topic topic_proportion.png
Training Loss Loss curve training_loss.png
Evaluation Metrics 7-metric radar chart metrics.png
Topic Coherence Per-topic NPMI topic_coherence.png
Topic Exclusivity Per-topic exclusivity topic_exclusivity.png
Word Clouds All topic word clouds topic_wordclouds.png
Topic Similarity Inter-topic cosine similarity topic_similarity.png
pyLDAvis Interactive topic explorer pyldavis_interactive.html
Per-topic Words Per-topic word weights topics/topic_N/word_importance.png

F) Evaluation — 07_evaluate.sh

Standalone evaluation with 7 unified metrics.

# Evaluate baseline models
bash scripts/07_evaluate.sh --dataset edu_data --model lda --num_topics 20
bash scripts/07_evaluate.sh --dataset edu_data --model hdp --num_topics 150
bash scripts/07_evaluate.sh --dataset edu_data --model ctm --num_topics 20
bash scripts/07_evaluate.sh --dataset edu_data --model etm --num_topics 20
bash scripts/07_evaluate.sh --dataset edu_data --model dtm --num_topics 20
bash scripts/07_evaluate.sh --dataset edu_data --model bertopic --num_topics 20

# With custom vocab size
bash scripts/07_evaluate.sh --dataset edu_data --model lda --num_topics 20 --vocab_size 3500

# Evaluate THETA models
bash scripts/07_evaluate.sh --dataset edu_data --model theta --model_size 0.6B --mode zero_shot
bash scripts/07_evaluate.sh --dataset edu_data --model theta --model_size 0.6B --mode unsupervised
bash scripts/07_evaluate.sh --dataset edu_data --model theta --model_size 4B --mode supervised
Parameter Description Default
--dataset Dataset name (required)
--model Model name (required): lda, hdp, stm, btm, nvdm, gsm, prodlda, ctm, etm, dtm, bertopic, theta
--num_topics Number of topics 20
--vocab_size Vocabulary size 5000
--model_size THETA model size: 0.6B, 4B, 8B 0.6B
--mode THETA mode: zero_shot, unsupervised, supervised zero_shot

Evaluation Metrics (7 metrics):

Metric Full Name Direction Description
TD Topic Diversity ↑ Higher is better Proportion of unique words across topics
iRBO Inverse Rank-Biased Overlap ↑ Higher is better Rank-based topic diversity
NPMI Normalized PMI ↑ Higher is better Normalized pointwise mutual information coherence
C_V C_V Coherence ↑ Higher is better Sliding-window based coherence
UMass UMass Coherence → Closer to 0 is better Document co-occurrence based coherence
Exclusivity Topic Exclusivity ↑ Higher is better How exclusive words are to their topics
PPL Perplexity ↓ Lower is better Model fit (lower = better generalization)

G) Model Comparison — 08_compare_models.sh

Cross-model metric comparison table.

# Compare all baseline models
bash scripts/08_compare_models.sh \
    --dataset edu_data \
    --models lda,hdp,btm,nvdm,gsm,prodlda,ctm,etm,dtm,bertopic \
    --num_topics 20

# Compare traditional models only
bash scripts/08_compare_models.sh \
    --dataset edu_data --models lda,hdp,btm --num_topics 20

# Compare neural models only
bash scripts/08_compare_models.sh \
    --dataset edu_data --models nvdm,gsm,prodlda,ctm,etm,dtm --num_topics 20

# Export to CSV
bash scripts/08_compare_models.sh \
    --dataset edu_data --models lda,hdp,nvdm,gsm,prodlda,ctm,etm,dtm \
    --num_topics 20 --output comparison.csv

Example output:

================================================================================
Model Comparison: edu_data (K=20)
================================================================================

Model              TD     iRBO     NPMI      C_V    UMass  Exclusivity        PPL
--------------------------------------------------------------------------------
lda            0.8500   0.7200   0.0512   0.4231  -2.1234       0.6543     123.45
prodlda        0.9200   0.8100   0.0634   0.4567  -1.8765       0.7234      98.76
ctm            0.8800   0.7800   0.0589   0.4412  -1.9876       0.6987     105.32
--------------------------------------------------------------------------------

Best Models:
  - Best TD (Topic Diversity): prodlda (0.9200)
  - Best NPMI (Coherence):     prodlda (0.0634)
  - Best PPL (Perplexity):     prodlda (98.76)

Parameter Description Default
--dataset Dataset name (required)
--models Comma-separated model list (required)
--num_topics Number of topics 20
--output Output CSV file path terminal only

H) Multi-GPU Training — 12_train_multi_gpu.sh

THETA supports multi-GPU training using PyTorch DistributedDataParallel (DDP).

# Train with 2 GPUs
bash scripts/12_train_multi_gpu.sh --dataset hatespeech --num_gpus 2 --num_topics 20

# Full parameters
bash scripts/12_train_multi_gpu.sh --dataset hatespeech \
    --num_gpus 4 --model_size 0.6B --mode zero_shot \
    --num_topics 25 --epochs 150 --batch_size 64 \
    --hidden_dim 768 --learning_rate 0.001

# Custom master port (for multiple concurrent jobs)
bash scripts/12_train_multi_gpu.sh --dataset socialTwitter \
    --num_gpus 2 --master_port 29501

# Or use torchrun directly
torchrun --nproc_per_node=2 --master_port=29500 \
    ETM/main.py train \
    --dataset hatespeech --mode zero_shot --num_topics 20 --epochs 100

I) Agent API — 14_start_agent_api.sh

Start the AI Agent API server for interactive analysis.

# Start the Agent API (default port 8000)
bash scripts/14_start_agent_api.sh --port 8000

# Test agent connection
bash scripts/13_test_agent.sh

API endpoints: POST /chat, POST /api/chat/v2, POST /api/interpret/metrics, POST /api/interpret/topics, POST /api/vision/analyze. See agent/docs/API_REFERENCE.md for full details.


J) Batch Processing Examples

# Train THETA across multiple datasets
for dataset in hatespeech mental_health socialTwitter; do
    bash scripts/04_train_theta.sh --dataset $dataset \
        --model_size 0.6B --mode zero_shot --num_topics 20
done

# Compare different topic counts
for k in 10 15 20 25 30; do
    bash scripts/04_train_theta.sh --dataset hatespeech \
        --model_size 0.6B --mode zero_shot --num_topics $k
done

# Generate visualizations for all trained baseline models
for model in lda etm ctm prodlda; do
    bash scripts/06_visualize.sh --baseline --dataset hatespeech \
        --model $model --num_topics 20 --language en
done

K) End-to-End Example: edu_data

Complete workflow using edu_data (823 Chinese education policy documents).

# 1. Setup
bash scripts/01_setup.sh

# 2. Data cleaning
bash scripts/02_clean_data.sh --input ./data/edu_data/ --language chinese

# 3. Data preparation — baselines
bash scripts/03_prepare_data.sh \
    --dataset edu_data --model lda --vocab_size 3500 --language chinese
bash scripts/03_prepare_data.sh \
    --dataset edu_data --model ctm --vocab_size 3500 --language chinese
bash scripts/03_prepare_data.sh \
    --dataset edu_data --model dtm --vocab_size 3500 --language chinese --time_column year

# 4. Data preparation — THETA
bash scripts/03_prepare_data.sh \
    --dataset edu_data --model theta --model_size 0.6B --mode zero_shot \
    --vocab_size 3500 --language chinese

# 5. Train baselines
bash scripts/05_train_baseline.sh \
    --dataset edu_data --models lda,hdp,btm,nvdm,gsm,prodlda \
    --num_topics 20 --epochs 100

# 6. Train THETA
bash scripts/04_train_theta.sh \
    --dataset edu_data --model_size 0.6B --mode zero_shot \
    --num_topics 20 --epochs 100 --language zh

# 7. Compare models
bash scripts/08_compare_models.sh \
    --dataset edu_data \
    --models lda,hdp,btm,nvdm,gsm,prodlda,ctm,etm \
    --num_topics 20