Search algorithms of verbal identity markers in modern scientific discourse

Bibliographic Details
Published in	Aktualʹnye problemy filologii i pedagogičeskoj lingvistiki, no. 2, pp. 18-29
Main Authors	Goncharova, Oksana V., Zavrumov, Zaur A., Khaleeva, Svetlana
Format	Journal Article
Language	English German
Published	Publishing and Printing Center NOSU 25.06.2024
Subjects	data mining; identity verbalization; internet scientific repositories; python; scientific discourse; semantic category; youth identity
ISSN	2079-6021, 2619-029X
DOI	10.29025/2079-6021-2024-2-18-29

Summary:	The article examines how identity is verbalized, using data mining. The research material consists of English-language texts on various concepts of youth identity, drawn from online scientific repositories and e-libraries. A methodology based on modern natural language processing and machine learning tools was developed and tested as part of the research. The Natural Language Toolkit (NLTK) library was used for tokenization and POS tagging in order to calculate the frequency of tokens occurring in the context of the word "identity". Word embeddings from a pre-trained Word2Vec model and the K-means algorithm were then used to analyze and cluster words by semantic proximity; the Gensim and Scikit-learn libraries were used to work with the Word2Vec model. The results establish that in English scientific discourse a young person's identity is verbalized within 9 semantic categories: behavior, communities, communication, education, identity, language, practice, complexity, and science. The most common are education (33%), language (21%), and communities (18%). N-gram analysis made it possible to identify semantic fields, establish their attributes, and evaluate the similarity of texts, which provided the most accurate vector-space search for semantically close n-grams. Optimization made it possible to establish a similarity measure for ranking phrases against a query and to assign each n-gram a ranking weight. Further improvements can be achieved by adding statistical word weighting, such as TF-IDF. The proposed system can search a large text corpus for related phrases with similar meaning.
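
The steps named in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, with a regular-expression tokenizer standing in for NLTK and TF-IDF-weighted bag-of-words vectors standing in for Word2Vec embeddings. The function names, the context window size, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude lowercase word tokenizer (assumption: a stand-in for NLTK tokenization).
    return re.findall(r"[a-z]+", text.lower())


def context_counts(tokens, target="identity", window=3):
    # Count tokens occurring within `window` positions of each `target` occurrence.
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            counts.update(tokens[lo:i] + tokens[i + 1:i + 1 + window])
    return counts


def ngrams(tokens, n=2):
    # Contiguous word n-grams, joined into space-separated strings.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs):
    # Sparse TF-IDF vectors (dicts) for a list of token lists.
    # Assumption: smoothed IDF, ln((1 + N) / (1 + df)) + 1.
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_phrases(query, phrases):
    # Rank candidate phrases by cosine similarity to the query, best first.
    docs = [tokenize(p) for p in phrases]
    vectors = tfidf_vectors(docs + [tokenize(query)])
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, vec), phrase) for vec, phrase in zip(vectors, phrases)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    text = "Youth identity is shaped by language. Language and identity interact in education."
    print(context_counts(tokenize(text), window=2).most_common(3))
    for score, phrase in rank_phrases(
        "youth identity",
        ["youth identity formation", "language practice", "identity in education"],
    ):
        print(f"{score:.3f}  {phrase}")
```

Because the vectors here are built from word overlap, phrases that share no words with the query score zero. Embedding-based vectors, as described in the summary, would also capture similarity between phrases that use different words.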