Text Similarity Clustering

Upload a CSV/TXT or paste texts to find groups of similar items using SBERT embeddings.

CSV: uses the specified column, or a column named "text" if present, otherwise the first column. Up to 500,000 rows.
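The column fallback described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function names `pick_text_column` and `read_texts` are hypothetical, and falling back when the requested column is missing from the header is an assumption.

```python
import csv
import io


def pick_text_column(fieldnames, requested=None):
    """Choose the column to cluster on: requested, else "text", else the first."""
    # Assumption: a requested column that is not in the header falls through
    # to the defaults instead of raising an error.
    if requested and requested in fieldnames:
        return requested
    if "text" in fieldnames:
        return "text"
    return fieldnames[0]


def read_texts(csv_source, requested=None):
    """Read one column of texts from a file-like CSV source with a header row."""
    reader = csv.DictReader(csv_source)
    if not reader.fieldnames:
        raise ValueError("CSV has no header row")
    column = pick_text_column(reader.fieldnames, requested)
    return column, [row[column] for row in reader]


# Usage: no column specified, so the "text" column is picked.
sample = io.StringIO("id,text\n1,hello\n2,world\n")
column, texts = read_texts(sample)
print(column, texts)
```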
Finds texts that are nearly identical: reposts, copypasta, minor rewording. Good for deduplication. You set a similarity threshold; only texts whose similarity to each other is above it are grouped together.
CSV column name to cluster on.
Higher = stricter matches.
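To illustrate why a higher threshold gives stricter matches, here is a rough sketch of threshold-based grouping. It makes several assumptions: the small hand-written vectors stand in for SBERT embeddings, similarity is cosine similarity, pairs at or above the threshold are linked, and groups are the connected components of those links. The tool's actual grouping rule may differ.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_by_threshold(vectors, threshold):
    """Group indices whose pairwise similarity reaches the threshold.

    Assumption: groups are connected components (union-find), so two items
    can share a group through a chain of similar neighbours.
    """
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    # Singletons are not clusters, so they are dropped.
    return sorted(g for g in groups.values() if len(g) > 1)


# Toy stand-ins for embeddings; real SBERT vectors have hundreds of dimensions.
toy_vectors = [[1.0, 0.0], [0.99, 0.14], [0.8, 0.6], [0.0, 1.0]]
print(cluster_by_threshold(toy_vectors, 0.95))
print(cluster_by_threshold(toy_vectors, 0.85))
```

With the stricter threshold only the two closest vectors are grouped; lowering it pulls a third item into the same group.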
CSV column that distinguishes sources. Used to filter out clusters that are just one source repeating itself.
Clusters with fewer distinct source values than this minimum are hidden. Set to 2 or more to require cross-source overlap.
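The source filter can be pictured as a post-processing step over finished clusters. The sketch below assumes each cluster is a list of row indices and that `sources` holds the source-column value for each row; the function name `filter_by_distinct_sources` and its parameters are hypothetical.

```python
def filter_by_distinct_sources(clusters, sources, min_distinct=2):
    """Keep clusters whose members span at least `min_distinct` source values."""
    kept = []
    for cluster in clusters:
        distinct = {sources[i] for i in cluster}
        if len(distinct) >= min_distinct:
            kept.append(cluster)
    return kept


# Rows 0-1 come from one account only; rows 2-4 span two accounts.
sources = ["acct_a", "acct_a", "acct_a", "acct_b", "acct_b"]
clusters = [[0, 1], [2, 3, 4]]
print(filter_by_distinct_sources(clusters, sources, min_distinct=2))
```

Here the first cluster is hidden because it is a single source repeating itself, while the second survives because it crosses sources.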
Clustering runs independently within each value. Pair with NEAR_SIM_EVENTS / URL_AMP_EVENTS from the temporal analyzer to find textually similar posts inside each flagged window.
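Running clustering independently per value amounts to bucketing the rows first and clustering each bucket on its own, so identical texts in different buckets are never grouped together. The sketch below assumes a hypothetical `window_id` column that identifies each flagged window, and it uses exact matching on normalized text as a stand-in for the SBERT similarity step.

```python
from collections import defaultdict


def exact_match_clusters(bucket):
    """Stand-in clusterer: group row ids whose normalized text is identical."""
    by_text = defaultdict(list)
    for row in bucket:
        by_text[row["text"].strip().lower()].append(row["id"])
    return [ids for ids in by_text.values() if len(ids) > 1]


def cluster_within_groups(rows, group_key, cluster_fn):
    """Bucket rows by `group_key`, then cluster each bucket independently."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[group_key]].append(row)
    return {value: cluster_fn(bucket) for value, bucket in buckets.items()}


# "window_id" is a hypothetical column name for the flagged window.
rows = [
    {"id": 1, "window_id": "w1", "text": "Buy now"},
    {"id": 2, "window_id": "w1", "text": "buy now "},
    {"id": 3, "window_id": "w2", "text": "Buy now"},
    {"id": 4, "window_id": "w2", "text": "Something else"},
]
print(cluster_within_groups(rows, "window_id", exact_match_clusters))
```

Row 3 matches rows 1 and 2 textually but sits in a different window, so it is not grouped with them.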