Machine Learning and NLP group at Trento.

Clustering Questions into Intents

Modern automated dialog systems require complex dialog managers that can handle user intents triggered by high-level semantic questions. We developed a model that automatically clusters questions into user intents to support the design of such systems. Since questions are short texts, uncovering their semantics in order to group them can be very challenging. We approach the problem by combining powerful semantic classifiers from research on question duplicate detection and matching with a novel use of supervised clustering methods based on structured output.

Collaborators: Iryna Haponchyk, Antonio Uva, Seunghak Yu, Olga Uryupina, Alessandro Moschitti


We test our approach on two intent clustering corpora:

  • Question clusters from Quora

A corpus of question clusters that we derive from the data of the Quora competition for detecting semantically duplicate questions, complementing the available question pair annotation with the transitive closure of the semantic matching property.

  • FAQ: Hype intent corpus

A set of questions asked by users to a conversational agent, collected and manually processed for constructing a FAQ section for Hype — an online service offering a credit card, a bank account number, and an ibanking app. The questions are explicitly assigned to clusters by human annotators.

The corpora are available for research purposes and can be downloaded from here (67K).
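
The transitive closure used to derive the Quora question clusters can be illustrated with a small union-find routine: if question A duplicates B and B duplicates C, all three end up in one cluster. This is only a minimal sketch under stated assumptions, not the script used to build the corpus; the function name clusters_from_duplicate_pairs and the toy question ids are hypothetical.

```python
from collections import defaultdict


def clusters_from_duplicate_pairs(pairs):
    """Group question ids into clusters via the transitive closure of duplicate pairs."""
    parent = {}

    def find(x):
        # Return the representative of x's cluster, registering x on first sight.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = defaultdict(set)
    for question in list(parent):
        groups[find(question)].add(question)
    # Sort members and clusters so the output is deterministic.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical toy annotation: each tuple marks two questions as duplicates.
pairs = [("q1", "q2"), ("q2", "q3"), ("q4", "q5")]
clusters = clusters_from_duplicate_pairs(pairs)
```

Here q1 and q3 are never annotated as a pair, yet they fall into the same cluster because both are linked to q2.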

Models and Software

We investigate a global structured output-based supervised clustering approach to uncover domain-specific intents implicitly encoded in the training data. To this end, we combine two models:

(a) A pairwise similarity model for short texts helps us identify pairs of similar/duplicate questions that should correspond to the same intent. This model follows state-of-the-art research on semantic similarity for short texts (e.g. SemEval).

(b) The supervised clustering approach builds on top of the pairwise similarity, using structured output modelling (Yu and Joachims, 2009). We perform latent SVM and latent perceptron-based supervised clustering.
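
To make the interplay of (a) and (b) more concrete, the sketch below shows one way clusters could be read off pairwise scores at inference time: questions are linked greedily along the highest positive-scoring pairs, which yields a spanning forest whose connected components are the clusters. This is an illustrative assumption in the spirit of latent structured-output clustering, not the released or published implementation; the function name spanning_forest_clusters, the score values, and the zero threshold are all hypothetical, and the learning step (latent SVM or latent perceptron) is not shown.

```python
def spanning_forest_clusters(n, scores):
    """Cluster n questions from pairwise scores with a Kruskal-style greedy pass.

    scores maps a pair (i, j) to a similarity score; pairs with a score <= 0
    are never linked (assumed threshold). Returns (clusters, forest_edges).
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    # Visit candidate links from the most to the least similar pair.
    for (i, j), score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if score <= 0:
            break  # remaining pairs are judged dissimilar
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i
            forest.append((i, j))  # edge kept in the spanning forest

    components = {}
    for q in range(n):
        components.setdefault(find(q), []).append(q)
    return sorted(components.values()), forest


# Hypothetical scores, standing in for the output of the pairwise model in (a).
scores = {(0, 1): 2.0, (0, 2): 1.0, (1, 2): 0.5, (0, 3): -0.3, (2, 3): -1.5}
clusters, forest = spanning_forest_clusters(4, scores)
```

In this toy run, questions 0, 1 and 2 form one intent and question 3 stays alone; the pair (1, 2) is not added to the forest because both questions are already connected.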

We expect to release the software in the near future.


Haponchyk, I., Uva, A., Yu, S., Uryupina, O., and Moschitti, A. (2018) Supervised Clustering of Questions into Intents for Dialog System Applications. In EMNLP 2018.