How to select samples for active learning? Document clustering with active learning methodology

Bibliographic Details
Published in Proceedings (International Conference on Engineering of Complex Computer Systems. Online), pp. 42-50
Main Authors Ropiak, Norbert; Gniewkowski, Mateusz; Swedrowski, Michal; Pogoda, Michal; Gawron, Karol; Bojanowski, Bartlomiej; Walkowiak, Tomasz
Format Conference Proceeding
Language English
Published IEEE 14.06.2023
ISSN 2770-8535
DOI 10.1109/ICECCS59891.2023.00015

Abstract In this paper, we investigate the applicability of Active Learning to text clustering and topic modeling. These tasks are often non-trivial because the notion of text similarity is ambiguous. In our experiments, we implemented the Active Learning approach using automatic annotation from datasets with prepared labels. In a simulated study conducted on Polish and English datasets, we show how labeling a relatively small, carefully selected set of examples can improve clustering quality relative to approaches based on a general notion of text similarity. We compare a number of sample-selection techniques, dimensionality reduction methods, and training approaches in order to obtain the best-quality clusters with a minimum number of annotations. The results show that a relatively simple approach can yield good-quality clusters and thus support the development of classification ontologies in a data-centric way.
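
The simulated setup described in the abstract (an oracle that answers from prepared labels, with samples chosen for annotation one at a time) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the 2-D toy vectors stand in for document embeddings, the nearest-centroid model and the margin-based selection rule are generic choices, and all function names are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def fit_centroids(X, labeled):
    """Build one centroid per label from the currently labeled samples."""
    by_label = {}
    for idx, y in labeled.items():
        by_label.setdefault(y, []).append(X[idx])
    return {y: centroid(pts) for y, pts in by_label.items()}

def margin(x, centroids):
    """Gap between the two nearest centroid distances; small means uncertain."""
    d = sorted(math.dist(x, c) for c in centroids.values())
    return d[1] - d[0] if len(d) > 1 else 0.0

def predict(x, centroids):
    """Assign x to the label of its nearest centroid."""
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))

def active_learning(X, oracle, seed_idx, budget):
    """Query the simulated oracle for the most uncertain sample, `budget` times."""
    labeled = {i: oracle[i] for i in seed_idx}
    for _ in range(budget):
        centroids = fit_centroids(X, labeled)
        pool = [i for i in range(len(X)) if i not in labeled]
        if not pool:
            break
        query = min(pool, key=lambda i: margin(X[i], centroids))
        labeled[query] = oracle[query]  # automatic annotation from prepared labels
    return labeled, fit_centroids(X, labeled)

# Toy data (assumed): two well-separated 3x3 grids standing in for embeddings.
X = [(0.5 * i, 0.5 * j) for i in range(3) for j in range(3)]
X += [(4 + 0.5 * i, 4 + 0.5 * j) for i in range(3) for j in range(3)]
oracle = [0] * 9 + [1] * 9  # prepared labels used as the simulated annotator

labeled, centroids = active_learning(X, oracle, seed_idx=[0, 9], budget=4)
accuracy = sum(predict(x, centroids) == y for x, y in zip(X, oracle)) / len(X)
print(len(labeled), accuracy)
```

Other selection rules (for example random or diversity-based sampling) could be swapped in by replacing the `margin` criterion, which is the kind of comparison the abstract refers to.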
Author Pogoda, Michal
Gawron, Karol
Swedrowski, Michal
Walkowiak, Tomasz
Ropiak, Norbert
Gniewkowski, Mateusz
Bojanowski, Bartlomiej
Author_xml – sequence: 1
  givenname: Norbert
  surname: Ropiak
  fullname: Ropiak, Norbert
  email: norbert.ropiak@pwr.edu.pl
  organization: Wrocław University of Science and Technology, CLARIN-PL
– sequence: 2
  givenname: Mateusz
  surname: Gniewkowski
  fullname: Gniewkowski, Mateusz
  email: mateusz.gniewkowski@pwr.edu.pl
  organization: Wrocław University of Science and Technology, CLARIN-PL
– sequence: 3
  givenname: Michal
  surname: Swedrowski
  fullname: Swedrowski, Michal
  email: michal.swedrowski@pwr.edu.pl
  organization: Wrocław University of Science and Technology, CLARIN-PL
– sequence: 4
  givenname: Michal
  surname: Pogoda
  fullname: Pogoda, Michal
  email: michal.pogoda@pwr.edu.pl
  organization: Wrocław University of Science and Technology, CLARIN-PL
– sequence: 5
  givenname: Karol
  surname: Gawron
  fullname: Gawron, Karol
  email: karol.gawron@pwr.edu.pl
  organization: Wrocław University of Science and Technology, CLARIN-PL
– sequence: 6
  givenname: Bartlomiej
  surname: Bojanowski
  fullname: Bojanowski, Bartlomiej
  email: bartlomiej.bojanowski@pwr.edu.pl
  organization: Wrocław University of Science and Technology, CLARIN-PL
– sequence: 7
  givenname: Tomasz
  surname: Walkowiak
  fullname: Walkowiak, Tomasz
  email: tomasz.walkowiak@pwr.edu.pl
  organization: Wrocław University of Science and Technology, CLARIN-PL
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/ICECCS59891.2023.00015
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 9798350340044
EISSN 2770-8535
EndPage 50
ExternalDocumentID 10321860
Genre orig-research
GroupedDBID 6IE
6IL
6IN
ABLEC
ADZIZ
ALMA_UNASSIGNED_HOLDINGS
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
CHZPO
IEGSK
OCL
RIE
RIL
ID FETCH-LOGICAL-i119t-d789da3a63e1eb4b473e1fcb5c08db8025caca9b73dd42c927377be5acd7b6493
IEDL.DBID RIE
IngestDate Wed Aug 27 02:37:21 EDT 2025
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i119t-d789da3a63e1eb4b473e1fcb5c08db8025caca9b73dd42c927377be5acd7b6493
PageCount 9
ParticipantIDs ieee_primary_10321860
PublicationCentury 2000
PublicationDate 2023-June-14
PublicationDateYYYYMMDD 2023-06-14
PublicationDate_xml – month: 06
  year: 2023
  text: 2023-June-14
  day: 14
PublicationDecade 2020
PublicationTitle Proceedings (International Conference on Engineering of Complex Computer Systems. Online)
PublicationTitleAbbrev ICECCS
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0003320732
Score 1.8378261
SourceID ieee
SourceType Publisher
StartPage 42
SubjectTerms active learning
Annotations
Dimensionality reduction
document clustering
Labeling
natural language processing
Ontologies
Task analysis
Training
Title How to select samples for active learning? Document clustering with active learning methodology
URI https://ieeexplore.ieee.org/document/10321860
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
link http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV1bS8MwGA26J5_mZeKdPPjarUnTXJ58qJMpOAQd7G3k8k1EWcW1CP56k7SbMhB8C6EhJWmSk6_fOQehS0rBQe5c4giHhPnllqiMk0TNPXp3ihsR7d7ux3w0YXfTfNqS1SMXBgBi8hn0QzH-y3elrUOobBDE34jk_oa-LSRvyFrrgEqWUf-50pYFTFI1uC2GRfGYK6nCRZAGKdM02N_-slGJp8hNF41X_TfJI6_9ujJ9-7UhzfjvF9xFvR_CHn5YH0V7aAsW-6i7cmzA7QI-QLNR-YmrEi-j-w1e6qANvMQeuGIdNz7cukg8X-Hrtjts3-qgpuArcYjabj6JGxPqGJ7vocnN8KkYJa3FQvJCiKoSJ6RyOtM8AwKGGSZ8YW5NblPpjPSAyGqrlRGZc4xa5cGOEAZybZ0wnKnsEHUW5QKOEFaSMu6bGW6BsXRunLRUeVQMuTVW62PUCwM2e29UNGarsTr5o_4U7YRJC2lZhJ2hTvVRw7kHAJW5iBP_DYwQsjs
linkProvider IEEE
linkToHtml http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV3NS8MwHA0yD3qaHxO_zcFrtyZN0ubkoW5U3YbgBruVfE1EWcW1CP71Jmk3ZSB4CyElIU2al19_7z0ArjE22lCtA42YCYjdbgGPGAr43KJ3zZmMvd3baMyyKbmf0VlDVvdcGGOMTz4zXVf0__J1oSoXKus58TeUMHtD36aEEFrTtdYhlSjCdsHihgeMQt67S_tp-kR5wt1VEDsx09AZ4P4yUvHnyKANxqsR1Okjr92qlF31tSHO-O8h7oHOD2UPPq4Po32wZRYHoL3ybIDNFj4EeVZ8wrKAS-9_A5fCqQMvoYWuUPhPH2x8JJ5v4G3THVRvldNTsJXQxW03W8LahtoH6DtgOuhP0ixoTBaCF4R4Geg44VpEgkUGGUkkiW1hriRVYaJlYiGREkpwGUdaE6y4hTtxLA0VSseSER4dgdaiWJhjAHmCCbOPSaYMIeFc6kRhbnGxoUoqIU5Ax01Y_l7raOSruTr9o_4K7GST0TAf3o0fzsCue4EuSQuRc9AqPypzYeFAKS_9IvgGC861iA
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+%28International+Conference+on+Engineering+of+Complex+Computer+Systems.+Online%29&rft.atitle=How+to+select+samples+for+active+learning%3F+Document+clustering+with+active+learning+methodology&rft.au=Ropiak%2C+Norbert&rft.au=Gniewkowski%2C+Mateusz&rft.au=Swedrowski%2C+Michal&rft.au=Pogoda%2C+Michal&rft.date=2023-06-14&rft.pub=IEEE&rft.eissn=2770-8535&rft.spage=42&rft.epage=50&rft_id=info:doi/10.1109%2FICECCS59891.2023.00015&rft.externalDocID=10321860