How to select samples for active learning? Document clustering with active learning methodology
| Published in | Proceedings (International Conference on Engineering of Complex Computer Systems. Online) pp. 42-50 |
|---|---|
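| Main Authors | Ropiak, Norbert; Gniewkowski, Mateusz; Swedrowski, Michal; Pogoda, Michal; Gawron, Karol; Bojanowski, Bartlomiej; Walkowiak, Tomasz |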
| Format | Conference Proceeding | 
| Language | English | 
| Published | IEEE, 14.06.2023 |
| Subjects | |
| Online Access | Get full text | 
| ISSN | 2770-8535 | 
| DOI | 10.1109/ICECCS59891.2023.00015 | 
| Abstract | In this paper, we investigate the applicability of Active Learning to text clustering and topic modeling. Both tasks are often non-trivial because the notion of text similarity is ambiguous. In our experiments, we implemented the Active Learning approach using automatic annotation from datasets with prepared labels. In a simulated study conducted on Polish and English datasets, we show how labeling a relatively small, carefully selected number of examples can improve the quality of clustering relative to approaches based on a general notion of text similarity. We compare a number of techniques for selecting samples for labeling, dimensionality reduction, and training approaches in order to obtain the best quality of the resulting clusters with a minimum number of annotations. The obtained results show that with a relatively simple approach it is possible to obtain good-quality clusters and thus develop classification ontologies in a data-centric approach. |
    
| Author | Pogoda, Michal; Gawron, Karol; Swedrowski, Michal; Walkowiak, Tomasz; Ropiak, Norbert; Gniewkowski, Mateusz; Bojanowski, Bartlomiej |
    
| Author_xml | – sequence: 1 givenname: Norbert surname: Ropiak fullname: Ropiak, Norbert email: norbert.ropiak@pwr.edu.pl organization: Wrocław University of Science and Technology,CLARIN-PL – sequence: 2 givenname: Mateusz surname: Gniewkowski fullname: Gniewkowski, Mateusz email: mateusz.gniewkowski@pwr.edu.pl organization: Wrocław University of Science and Technology,CLARIN-PL – sequence: 3 givenname: Michal surname: Swedrowski fullname: Swedrowski, Michal email: michal.swedrowski@pwr.edu.pl organization: Wrocław University of Science and Technology,CLARIN-PL – sequence: 4 givenname: Michal surname: Pogoda fullname: Pogoda, Michal email: michal.pogoda@pwr.edu.pl organization: Wrocław University of Science and Technology,CLARIN-PL – sequence: 5 givenname: Karol surname: Gawron fullname: Gawron, Karol email: karol.gawron@pwr.edu.pl organization: Wrocław University of Science and Technology,CLARIN-PL – sequence: 6 givenname: Bartlomiej surname: Bojanowski fullname: Bojanowski, Bartlomiej email: bartlomiej.bojanowski@pwr.edu.pl organization: Wrocław University of Science and Technology,CLARIN-PL – sequence: 7 givenname: Tomasz surname: Walkowiak fullname: Walkowiak, Tomasz email: tomasz.walkowiak@pwr.edu.pl organization: Wrocław University of Science and Technology,CLARIN-PL  | 
    
| CODEN | IEEPAD | 
    
| ContentType | Conference Proceeding | 
    
| DBID | 6IE 6IL CBEJK RIE RIL  | 
    
| DOI | 10.1109/ICECCS59891.2023.00015 | 
    
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Xplore POP ALL; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP All) 1998-Present |
    
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher  | 
    
| DeliveryMethod | fulltext_linktorsrc | 
    
| Discipline | Computer Science | 
    
| EISBN | 9798350340044 | 
    
| EISSN | 2770-8535 | 
    
| EndPage | 50 | 
    
| ExternalDocumentID | 10321860 | 
    
| Genre | orig-research | 
    
| GroupedDBID | 6IE 6IL 6IN ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK OCL RIE RIL  | 
    
| ID | FETCH-LOGICAL-i119t-d789da3a63e1eb4b473e1fcb5c08db8025caca9b73dd42c927377be5acd7b6493 | 
    
| IEDL.DBID | RIE | 
    
| IngestDate | Wed Aug 27 02:37:21 EDT 2025 | 
    
| IsPeerReviewed | false | 
    
| IsScholarly | false | 
    
| Language | English | 
    
| LinkModel | DirectLink | 
    
| MergedId | FETCHMERGED-LOGICAL-i119t-d789da3a63e1eb4b473e1fcb5c08db8025caca9b73dd42c927377be5acd7b6493 | 
    
| PageCount | 9 | 
    
| ParticipantIDs | ieee_primary_10321860 | 
    
| PublicationCentury | 2000 | 
    
| PublicationDate | 2023-June-14 | 
    
| PublicationDateYYYYMMDD | 2023-06-14 | 
    
| PublicationDate_xml | – month: 06 year: 2023 text: 2023-June-14 day: 14  | 
    
| PublicationDecade | 2020 | 
    
| PublicationTitle | Proceedings (International Conference on Engineering of Complex Computer Systems. Online) | 
    
| PublicationTitleAbbrev | ICECCS | 
    
| PublicationYear | 2023 | 
    
| Publisher | IEEE | 
    
| Publisher_xml | – name: IEEE | 
    
| SSID | ssj0003320732 | 
    
| Score | 1.8378261 | 
    
| SourceID | ieee | 
    
| SourceType | Publisher | 
    
| StartPage | 42 | 
    
| SubjectTerms | active learning; Annotations; Dimensionality reduction; document clustering; Labeling; natural language processing; Ontologies; Task analysis; Training |
    
| Title | How to select samples for active learning? Document clustering with active learning methodology | 
    
| URI | https://ieeexplore.ieee.org/document/10321860 | 
    
| hasFullText | 1 | 
    
| inHoldings | 1 | 
    
| isFullTextHit | |
| isPrint | |
| linkProvider | IEEE | 
    
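
The abstract describes a simulated study in which annotations come from datasets with prepared labels and only a small, carefully selected set of examples is labeled. The record does not say which selection strategies, embeddings, or models the authors used, so the following is only a minimal sketch under stated assumptions: a toy English corpus, bag-of-words vectors, a nearest-centroid assignment, and margin-based uncertainty sampling. All names here (`DOCS`, `run_active_learning`, `select_most_uncertain`) are hypothetical and chosen for illustration.

```python
import math
from collections import Counter

# Toy corpus with gold labels (an assumption for illustration).
# The gold labels play the role of the simulated annotator.
DOCS = [
    ("the match ended with a late goal", "sport"),
    ("the team won the league match", "sport"),
    ("the striker scored a goal for the team", "sport"),
    ("the bank raised interest rates", "finance"),
    ("interest rates hit the stock market", "finance"),
    ("the market rallied after the bank report", "finance"),
    ("the team owner sold stock to the bank", "finance"),
    ("the league signed a market sponsorship deal", "sport"),
]


def vectorize(text):
    """Bag-of-words term counts (a stand-in for any text embedding)."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroids(labeled, vectors):
    """Sum the vectors of the labeled documents per class.

    Cosine similarity ignores scale, so a sum behaves like a mean here.
    """
    cents = {}
    for idx, label in labeled.items():
        cents.setdefault(label, Counter()).update(vectors[idx])
    return cents


def margin(vec, cents):
    """Gap between the two most similar centroids; a small gap means uncertain.

    Requires at least two classes among the labeled documents.
    """
    sims = sorted((cosine(vec, c) for c in cents.values()), reverse=True)
    return sims[0] - sims[1]


def select_most_uncertain(pool, vectors, cents):
    """Pick the unlabeled document with the smallest margin (ties: lowest index)."""
    return min(pool, key=lambda i: (margin(vectors[i], cents), i))


def run_active_learning(docs, seed_indices, budget):
    """Label `budget` extra documents, then assign every document to a cluster."""
    vectors = [vectorize(text) for text, _ in docs]
    labeled = {i: docs[i][1] for i in seed_indices}
    for _ in range(budget):
        pool = [i for i in range(len(docs)) if i not in labeled]
        if not pool:
            break
        cents = centroids(labeled, vectors)
        pick = select_most_uncertain(pool, vectors, cents)
        labeled[pick] = docs[pick][1]  # simulated annotation from the gold label
    cents = centroids(labeled, vectors)
    clusters = [max(cents, key=lambda lab: cosine(v, cents[lab])) for v in vectors]
    return labeled, clusters


# Seed with one document per class, then query three more labels.
labeled, clusters = run_active_learning(DOCS, seed_indices=[0, 3], budget=3)
print(sorted(labeled), clusters)
```

Margin sampling is one common selection heuristic; swapping `select_most_uncertain` for another function over the same pool is enough to compare strategies on a fixed annotation budget, which is the kind of comparison the abstract mentions.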