Supervised Learning Example¶
This page walks through two worked examples: supervised topic classification with labeled data, and temporal analysis of topic evolution.
Example 1: Supervised Topic Classification¶
Dataset Description¶
- Domain: Customer reviews
- Size: 3000 documents
- Language: English
- Labels: 5 product categories
- Goal: Discover category-aligned topics
Step 1: Prepare Labeled Data¶
CSV format with labels:
text,label
"Great laptop with fast processor and long battery life",Electronics
"Comfortable running shoes with good arch support",Sports
"Delicious coffee beans with rich aroma",Food
Step 2: Preprocess in Supervised Mode¶
python prepare_data.py \
--dataset reviews \
--model theta \
--model_size 0.6B \
--mode supervised \
--vocab_size 5000 \
--batch_size 32 \
--gpu 0
Step 3: Train with Supervision¶
python run_pipeline.py \
--dataset reviews \
--models theta \
--model_size 0.6B \
--mode supervised \
--num_topics 15 \
--epochs 100 \
--batch_size 64 \
--gpu 0 \
--language en
Step 4: Compare Modes¶
Train both supervised and zero-shot for comparison:
# Zero-shot (ignores labels)
python run_pipeline.py \
--dataset reviews \
--models theta \
--model_size 0.6B \
--mode zero_shot \
--num_topics 15 \
--gpu 0
# Supervised (uses labels)
python run_pipeline.py \
--dataset reviews \
--models theta \
--model_size 0.6B \
--mode supervised \
--num_topics 15 \
--gpu 0
Results comparison (TD = topic diversity, NPMI = topic coherence):
| Mode | TD | NPMI | Label Alignment |
|---|---|---|---|
| Zero-shot | 0.85 | 0.41 | 0.62 |
| Supervised | 0.83 | 0.38 | 0.89 |
Supervised mode aligns topics far more closely with the gold labels (0.89 vs. 0.62), at the cost of a slight drop in topic diversity and coherence.
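Both runs write their metrics under result/0.6B/reviews/&lt;mode&gt;/metrics/. A minimal sketch for loading the two files side by side; the zero_shot path mirrors the supervised one shown in Step 5, and the metric key names depend on what the pipeline actually writes:

import json

paths = {
    "zero_shot": "result/0.6B/reviews/zero_shot/metrics/evaluation_results.json",
    "supervised": "result/0.6B/reviews/supervised/metrics/evaluation_results.json",
}

# Print each run's metrics side by side for comparison.
for mode, path in paths.items():
    with open(path) as f:
        print(mode, json.load(f))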
Step 5: Topic-Label Analysis¶
import json

# Load the supervised run's evaluation metrics
with open('result/0.6B/reviews/supervised/metrics/evaluation_results.json') as f:
    results = json.load(f)
print(results)

# Observed topic-label correspondence for this run:
# Topics 0-2:   Electronics
# Topics 3-5:   Sports
# Topics 6-8:   Food
# Topics 9-11:  Books
# Topics 12-14: Clothing
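The topic-to-label mapping above can be derived from the document-topic matrix. A sketch, assuming the pipeline saves a docs × topics array and an integer label per document (theta.npy and labels.npy are hypothetical file names):

import numpy as np

# Hypothetical file names; adapt to whatever your pipeline saves.
theta = np.load("result/0.6B/reviews/supervised/theta.npy")    # (docs, topics)
labels = np.load("result/0.6B/reviews/supervised/labels.npy")  # int label per doc

dominant = theta.argmax(axis=1)  # dominant topic per document

# Contingency table: how often each topic dominates each label's documents.
counts = np.zeros((theta.shape[1], labels.max() + 1), dtype=int)
np.add.at(counts, (dominant, labels), 1)

for k, row in enumerate(counts):
    print(f"Topic {k:2d} -> label {row.argmax()}")

# One common label-alignment score is purity: the fraction of documents
# whose dominant topic's majority label matches their own label.
print("purity:", counts.max(axis=1).sum() / counts.sum())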
Example 2: Temporal Topic Evolution¶
Dataset Description¶
- Domain: Academic papers
- Size: 10000 documents
- Language: English
- Time range: 2015-2023
- Field: Machine learning
Step 1: Prepare Temporal Data¶
CSV with year column:
text,year
"Deep learning approaches for image recognition...",2015
"Transformer architectures for natural language...",2019
"Large language models and emergent capabilities...",2023
Step 2: Preprocess with Time Information¶
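The exact preprocessing flags for temporal data are not shown here; a plausible invocation mirrors Example 1's Step 2 with a mode flag for dynamic modeling (--mode dynamic is an assumption; check prepare_data.py --help for the actual flag name):

python prepare_data.py \
--dataset ml_papers \
--mode dynamic \
--vocab_size 5000 \
--batch_size 32 \
--gpu 0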
Step 3: Train DTM Model¶
python run_pipeline.py \
--dataset ml_papers \
--models dtm \
--num_topics 30 \
--epochs 150 \
--batch_size 64 \
--hidden_dim 512 \
--learning_rate 0.002 \
--gpu 0 \
--language en
Step 4: Analyze Topic Evolution¶
DTM tracks topic changes over time:
Topic 5: Deep Learning (2015-2018)
- 2015: convolutional, neural, network, classification
- 2016: deep, learning, layers, training
- 2017: residual, connections, skip, depth
- 2018: architectures, design, efficient, mobile

Topic 12: Attention Mechanisms (2017-2020)
- 2017: attention, mechanism, sequence, encoder
- 2018: self-attention, multi-head, transformer
- 2019: bert, pre-training, fine-tuning, downstream
- 2020: scaling, models, parameters, performance

Topic 18: Large Language Models (2020-2023)
- 2020: gpt, generation, language, model
- 2021: prompting, few-shot, in-context, learning
- 2022: instruction, tuning, alignment, human
- 2023: emergent, capabilities, scaling, laws
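Per-year word lists like these can be read off the model's time-indexed topic-word distributions. A sketch, assuming beta is saved as a years × topics × vocab probability array (both file names below are hypothetical):

import numpy as np

# Hypothetical output files; adapt to the pipeline's actual artifacts.
beta = np.load("result/ml_papers/dtm/beta.npy")  # (years, topics, vocab)
vocab = open("result/ml_papers/dtm/vocab.txt").read().split()

topic, first_year = 5, 2015
for t in range(beta.shape[0]):
    top = beta[t, topic].argsort()[::-1][:4]  # 4 most probable words
    print(first_year + t, [vocab[v] for v in top])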
Step 5: Visualize Trends¶
python -m visualization.run_visualization \
--baseline \
--result_dir /root/autodl-tmp/result/baseline \
--dataset ml_papers \
--model dtm \
--num_topics 30 \
--language en \
--dpi 300
Visualizations show:
- Topic birth and death
- Word probability changes over time
- Topic intensity trends
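For a custom view of a single word's trajectory (the "word probability changes" panel), a minimal matplotlib sketch, reusing the assumed beta/vocab files from Step 4:

import matplotlib.pyplot as plt
import numpy as np

beta = np.load("result/ml_papers/dtm/beta.npy")  # (years, topics, vocab)
vocab = open("result/ml_papers/dtm/vocab.txt").read().split()

years = list(range(2015, 2015 + beta.shape[0]))
w = vocab.index("transformer")  # example word from Topic 12

plt.plot(years, beta[:, 12, w], marker="o")
plt.xlabel("Year")
plt.ylabel("P(transformer | topic 12)")
plt.savefig("transformer_trend.png", dpi=300)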