Tensor-based Graph Modularity for Text Data Clustering

Rafika Boutalbi, Mira Ait-Saada, Anastasiia Iurshina, Steffen Staab, Mohamed Nadif

Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 1–5, 2022.

Abstract

Graphs are used in several applications to represent similarities between instances. For text data, we can represent texts by different features such as bag-of-words, static embeddings (Word2vec, GloVe, etc.), and contextual embeddings (BERT, RoBERTa, etc.), leading to multiple similarities (or graphs), one per representation. The proposal posits that incorporating the local invariance within every graph and the consistency across different graphs leads to a consensus clustering that improves document clustering. This problem is complex and is made harder by the sparsity and the noisy data in each graph. To this end, we rely on the modularity metric, which effectively evaluates graph clustering in such circumstances. We therefore present a novel approach for text clustering based on both a sparse tensor representation and graph modularity. This makes it possible to cluster texts (nodes) while capturing information arising from the different graphs. We iteratively maximize a Tensor-based Graph Modularity criterion. Extensive experiments on benchmark text clustering datasets show that the proposed algorithm, referred to as Tensor Graph Modularity (TGM), outperforms other baseline methods on the clustering task. The source code is available at https://github.com/TGMclustering/TGMclustering.
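
To make the idea of scoring one partition against several similarity graphs concrete, here is a minimal sketch. It is not the TGM algorithm or the criterion defined in the paper: it assumes the standard Newman modularity of an undirected weighted graph and simply sums it over the graphs, and it does not perform any optimization. The function names `modularity` and `multi_graph_modularity` and the toy graphs are hypothetical, chosen only for illustration.

```python
import numpy as np


def modularity(adjacency, labels):
    """Standard Newman modularity of a partition on one undirected graph.

    Assumption: `adjacency` is a symmetric, non-negative (n, n) matrix.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    total_weight = adjacency.sum()  # equals 2m for a symmetric matrix
    if total_weight == 0:
        return 0.0  # an empty graph carries no community structure
    degrees = adjacency.sum(axis=1)
    # Expected edge weight between i and j under the configuration null model.
    expected = np.outer(degrees, degrees) / total_weight
    # Boolean mask that is True when nodes i and j share a cluster.
    same_cluster = labels[:, None] == labels[None, :]
    return float(((adjacency - expected) * same_cluster).sum() / total_weight)


def multi_graph_modularity(graphs, labels):
    """Sum of per-graph modularities for one shared partition.

    Illustrative stand-in only: the paper's tensor criterion may weight or
    normalize the graphs differently.
    """
    return sum(modularity(graph, labels) for graph in graphs)


# Toy example: two similarity graphs over the same four documents.
graph_a = np.array([[0, 1, 0, 0],
                    [1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0]])
graph_b = np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]])
graphs = [graph_a, graph_b]

consistent = [0, 0, 1, 1]    # groups nodes that are linked in both graphs
inconsistent = [0, 1, 0, 1]  # splits every linked pair

score_consistent = multi_graph_modularity(graphs, consistent)
score_inconsistent = multi_graph_modularity(graphs, inconsistent)
print(score_consistent, score_inconsistent)
```

In this toy setting the partition that agrees with both graphs receives the higher combined score, which is the kind of consensus signal the abstract describes; an actual clustering method would search over partitions rather than compare two fixed ones.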

BibTeX

@inproceedings{boutalbi22_sigir,
  title     = {Tensor-based Graph Modularity for Text Data Clustering},
  author    = {Boutalbi, Rafika and Ait-Saada, Mira and Iurshina, Anastasiia and Staab, Steffen and Nadif, Mohamed},
  year      = {2022},
  booktitle = {Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)},
  pages     = {1--5}
}