Data Preparation¶
This guide covers data format requirements and cleaning procedures.
Data Format Requirements¶
THETA accepts CSV files with specific column requirements. The preprocessing pipeline recognizes several standard column names for text content.
Accepted text column names:
- text
- content
- cleaned_content
- clean_text
Optional columns:
- label or category - Required for supervised mode
- year, timestamp, or date - Required for DTM (temporal analysis)
Example CSV structure:
text,label,year
"Document about renewable energy and solar panels.",Environment,2020
"Article discussing machine learning applications.",Technology,2021
"Policy paper on healthcare reform.",Healthcare,2022
Data Cleaning¶
Raw text often contains noise that degrades topic quality. The data cleaning module handles common issues in both English and Chinese text.
English Data Cleaning¶
cd /root/autodl-tmp/ETM
python -m dataclean.main \
--input /root/autodl-tmp/data/raw_data.csv \
--output /root/autodl-tmp/data/cleaned_data.csv \
--language english
The cleaning process removes: - HTML tags and markup - URLs and email addresses - Special characters and symbols - Extra whitespace - Non-printable characters
Chinese Data Cleaning¶
Chinese text requires specialized processing for proper segmentation and cleaning.
python -m dataclean.main \
--input /root/autodl-tmp/data/raw_data.csv \
--output /root/autodl-tmp/data/cleaned_data.csv \
--language chinese
Additional steps for Chinese: - Removes traditional punctuation marks - Handles full-width and half-width characters - Preserves Chinese word boundaries
Batch Cleaning¶
Process multiple files in a directory:
python -m dataclean.main \
--input /root/autodl-tmp/data/raw/ \
--output /root/autodl-tmp/data/cleaned/ \
--language english
All CSV files in the input directory will be processed and saved to the output directory with the same filenames.