Search algorithms of verbal identity markers in modern scientific discourse

Bibliographic Details
Published in Aktualʹnye problemy filologii i pedagogičeskoj lingvistiki no. 2; pp. 18-29
Main Authors Goncharova, Oksana V., Zavrumov, Zaur A., Khaleeva, Svetlana
Format Journal Article
Language English, German
Published Publishing and Printing Center NOSU 25.06.2024
ISSN 2079-6021, 2619-029X
DOI 10.29025/2079-6021-2024-2-18-29

Summary: The article studies the specifics of identity verbalization by means of data mining. The research material consists of English-language texts on various concepts of youth identity, drawn from online scientific repositories and e-libraries. A methodology based on modern natural language processing and machine learning tools was developed and tested as part of the research. The analysis used the Natural Language Toolkit library for tokenization and POS tagging in order to calculate the frequency of tokens from the "identity" environment. Word embeddings, a pre-trained Word2Vec model, and the K-means algorithm were used for the subsequent analysis and clustering of words by semantic proximity. The Gensim and Scikit-learn libraries were used to work with the Word2Vec model. The results show that in English scientific discourse a young person's identity is verbalized within 9 semantic categories: behavior, communities, communication, education, identity, language, practice, complexity, and science. The most common of these are education (33%), language (21%), and communities (18%). N-gram analysis made it possible to identify semantic fields, establish their attributes, and evaluate the similarity of texts, which provided the most accurate vector space search for semantically close n-grams. Optimization made it possible to establish a similarity measure for ranking phrases against a query and to assign each n-gram a ranking weight. Further improvements can be achieved by adding statistical word weighting, such as TF-IDF. The proposed system can search a large text array for related phrases with a similar meaning.
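As a rough illustration of two steps named in the summary (counting tokens in the "identity" environment, and ranking phrases against a query with TF-IDF weighting), a minimal sketch follows. It is not the authors' implementation: it uses only the Python standard library in place of NLTK, Gensim, and Scikit-learn, it compares phrases by TF-IDF cosine similarity rather than Word2Vec embeddings, and all function names, the window size, and the sample phrases are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs (a simple stand-in for NLTK tokenization)."""
    return re.findall(r"[a-z]+", text.lower())


def context_counts(tokens, target="identity", window=2):
    """Count tokens found within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def ngrams(tokens, n=2):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs):
    """Turn token lists into sparse {term: weight} dicts using smoothed TF-IDF."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_phrases(query, phrases):
    """Return (score, phrase) pairs sorted by descending similarity to the query."""
    docs = [tokenize(p) for p in phrases] + [tokenize(query)]
    vectors = tfidf_vectors(docs)
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, vec), phrase) for vec, phrase in zip(vectors, phrases)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample text and phrases, for illustration only.
    sample = tokenize("Youth identity shapes language; identity and education")
    print(context_counts(sample, window=1).most_common())
    print(ngrams(sample, 2))
    for score, phrase in rank_phrases("youth identity", [
        "youth identity in education",
        "language and community practice",
        "identity and language learning",
    ]):
        print(round(score, 3), phrase)
```

Swapping the TF-IDF vectors for averaged Word2Vec embeddings would bring the sketch closer to the semantic-proximity search the summary describes, at the cost of requiring a pre-trained model.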