A Turkish Dataset and BERTurk-Contrastive Model for Semantic Textual Similarity
Subject Areas: Natural Language Processing
Somaiyeh Dehghan 1,*, Mehmet Fatih Amasyali 2
1 - Yildiz Technical University, Istanbul, Turkey
2 - Yildiz Technical University, Istanbul, Turkey
Keywords: Semantic Textual Similarity, Contrastive Learning, Deep Learning, BERT, BERTurk, Turkish Language
Abstract:
Semantic Textual Similarity (STS) is an important NLP task that measures the degree of semantic equivalence between two texts, even when the two sentences are worded differently. STS plays a vital role in numerous NLP tasks, such as information retrieval, text summarization, text classification, sentiment analysis, question answering, machine translation, automatic essay scoring, named entity recognition, plagiarism detection, and paraphrase detection. Numerous studies have addressed this task, particularly for English; for Turkish, however, STS has so far received much less attention. In this study, we propose a new BERT-based model, BERTurk-contrastive, which uses contrastive learning for the STS task in Turkish. Through contrastive learning, the proposed model pulls similar sentences closer together in the embedding space and pushes dissimilar sentences further apart. In addition, we create a Turkish STS dataset, SICK-tr, by automatically translating the English SICK dataset into Turkish. We evaluate our model on STSb-tr and on our newly released SICK-tr. The results show that our model significantly outperforms previous models, with an improvement of 5.92 points, achieving higher accuracy and robustness in identifying semantic similarity across various contexts and setting a new benchmark for STS in the Turkish language.
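To make the training objective described above concrete, the PyTorch sketch below computes an in-batch contrastive (InfoNCE-style) loss over BERTurk sentence embeddings: each sentence is pulled toward its paired paraphrase while the other sentences in the batch act as negatives and are pushed away. This is only an illustrative sketch, not the exact BERTurk-contrastive implementation; the checkpoint name, mean pooling, and temperature value are assumptions.

```python
# Illustrative in-batch contrastive loss over BERTurk sentence embeddings.
# Checkpoint name, mean pooling, and temperature are assumptions for this sketch.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "dbmdz/bert-base-turkish-cased"  # a public BERTurk checkpoint (assumed)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(sentences):
    """Mean-pool token embeddings into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (B, H)

def contrastive_loss(anchors, positives, temperature=0.05):
    """In-batch InfoNCE: the matching row is the positive for each anchor;
    every other sentence in the batch serves as a negative."""
    a = F.normalize(embed(anchors), dim=-1)
    p = F.normalize(embed(positives), dim=-1)
    logits = a @ p.T / temperature                 # scaled cosine similarities
    labels = torch.arange(len(anchors))            # positive sits on the diagonal
    return F.cross_entropy(logits, labels)

# Toy Turkish paraphrase pairs (hypothetical examples).
loss = contrastive_loss(
    ["Bir adam gitar çalıyor.", "Çocuk parkta koşuyor."],
    ["Bir adam gitar çalmaktadır.", "Bir çocuk parkta koşmaktadır."],
)
loss.backward()  # gradients flow back into the BERTurk encoder
```

In a full training loop, the encoder would be updated over many such batches drawn from a paraphrase or NLI-style corpus; the temperature and pooling strategy shown here are common defaults rather than values reported in this paper.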