Automatic Classification of Academic Articles Using BERT Model Based on Deep Learning (딥러닝 기반의 BERT 모델을 활용한 학술 문헌 자동분류)

Bibliographic Details
Published in	Chŏngbo Kwalli Hakhoechi Vol. 39; no. 3; pp. 293 - 310
Main Authors	Kim, In hu (김인후); Kim, Seong hee (김성희)
Format	Journal Article
Language	Korean
Published	Korean Society for Information Management (한국정보관리학회), 01.09.2022
Subjects	Library and information science (문헌정보학); BERT
ISSN	1013-0799 2586-2073
DOI	10.3743/KOSIM.2022.39.3.293

Abstract	This study analyzes the performance of a BERT-based document classification model by automatically classifying documents in library and information science with KoBERT, a BERT model pre-trained on Korean data. Abstracts of 5,357 papers from 7 journals in the field were used to evaluate whether automatic classification performance differs according to the size of the training data. Precision, recall, and the F-measure served as the evaluation metrics. Subject areas with large amounts of high-quality data performed well, with an F-measure of 90% or more. In contrast, where data quality was low, similarity to other subject areas was high, and few features clearly distinguished the subject, the model did not reach a meaningfully high level of performance. The study is expected to serve as basic data for assessing the feasibility of using pre-trained models to classify academic documents automatically.
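The abstract names precision, recall, and the F-measure as evaluation metrics, reported per subject area. The following is a minimal sketch of how such per-class scores can be computed. It assumes the balanced F1 form of the F-measure, which the record does not specify, and the labels "A", "B", and "C" are hypothetical toy data, not the study's subject areas or results.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen.

    Assumption: F is the balanced F1 score; the record only says "F scale".
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gained a false positive
            fn[truth] += 1   # true class gained a false negative

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical toy labels, for illustration only.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "A"]
scores = per_class_prf(y_true, y_pred)
for label, (p, r, f) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy data, class "C" has a single example that is misclassified, so all three of its scores are zero; a class with few examples can score poorly for that reason alone.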
KCI Citation Count	0
Author	Kim, In hu (김인후); Kim, Seong hee (김성희)
BackLink	https://www.kci.go.kr/kciportal/ci/sereArticleSearch/ciSereArtiView.kci?sereArticleSearchBean.artiId=ART002880492$$DAccess content in National Research Foundation of Korea (NRF)
ContentType	Journal Article
DBID	JDI ACYCR
DEWEY	003.54
DOI	10.3743/KOSIM.2022.39.3.293
DatabaseName	KoreaScience Korean Citation Index
DeliveryMethod	fulltext_linktorsrc
Discipline	Applied Sciences Library & Information Science Mathematics
DocumentTitleAlternate	Automatic Classification of Academic Articles Using BERT Model Based on Deep Learning
EISSN	2586-2073
EndPage	310
ExternalDocumentID	oai_kci_go_kr_ARTI_10050091 JAKO202228751828983
GroupedDBID	-~X 5GY 9ZL ALMA_UNASSIGNED_HOLDINGS CS3 JDI ACYCR
ISSN	1013-0799
IngestDate	Tue Nov 21 21:37:44 EST 2023 Fri Dec 22 12:01:14 EST 2023
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	false
IsScholarly	false
Issue	3
Keywords	BERT
Language	Korean
LinkModel	OpenURL
Notes	KISTI1.1003/JNL.JAKO202228751828983 http://dx.doi.org/10.3743/KOSIM.2022.39.3.293
OpenAccessLink	http://click.ndsl.kr/servlet/LinkingDetailView?cn=JAKO202228751828983&dbt=JAKO&org_code=O481&site_code=SS1481&service_code=01
PageCount	18
ParticipantIDs	nrf_kci_oai_kci_go_kr_ARTI_10050091 kisti_ndsl_JAKO202228751828983
PublicationCentury	2000
PublicationDate	2022-09
PublicationDateYYYYMMDD	2022-09-01
PublicationDate_xml	– month: 09 year: 2022 text: 2022-09
PublicationDecade	2020
PublicationTitle	Chŏngbo Kwalli Hakhoechi
PublicationTitleAlternate	Journal of the Korean society for information management
PublicationYear	2022
Publisher	Korean Society for Information Management (한국정보관리학회)
SourceID	nrf kisti
SourceType	Open Website Open Access Repository
StartPage	293
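SubjectTerms	Library and information science (문헌정보학)
Title	Automatic Classification of Academic Articles Using BERT Model Based on Deep Learning (딥러닝 기반의 BERT 모델을 활용한 학술 문헌 자동분류)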
URI	http://click.ndsl.kr/servlet/LinkingDetailView?cn=JAKO202228751828983&dbt=JAKO&org_code=O481&site_code=SS1481&service_code=01 https://www.kci.go.kr/kciportal/ci/sereArticleSearch/ciSereArtiView.kci?sereArticleSearchBean.artiId=ART002880492
Volume	39
hasFullText	1
inHoldings	1
ispartofPNX	Journal of the Korean Society for Information Management (정보관리학회지), 2022, 39(3), 125, pp.293-310
journalDatabaseRights	– providerCode: PRVHPJ databaseName: ROAD: Directory of Open Access Scholarly Resources customDbUrl: eissn: 2586-2073 dateEnd: 99991231 omitProxy: true ssIdentifier: ssib044742772 issn: 1013-0799 databaseCode: M~E dateStart: 20170101 isFulltext: true titleUrlDefault: https://road.issn.org providerName: ISSN International Centre
linkProvider	ISSN International Centre