Design and Implementation of a Search Engine based on Apache Spark (아파치 스파크 기반 검색엔진의 설계 및 구현)

Bibliographic Details
Published in	Journal of the Korea Institute of Information and Communication Engineering (한국정보통신학회논문지), Vol. 21, no. 1, pp. 17-28
Main Authors	박기성(Ki-Sung Park), 최재현(Jae-Hyun Choi), 김종배(Jong-Bae Kim), 박제원(Jae-Won Park)
Format	Journal Article
Language	Korean
Published	Korea Institute of Information and Communication Engineering (한국정보통신학회), 2017
Subjects	Electronics/Information and Communication Engineering; Crawler; Nutch; Search Engine; Solr; Spark
ISSN	2234-4772 2288-4165
DOI	10.6109/jkiice.2017.21.1.17

Abstract	As the practical value of data has grown, research on data has become increasingly active. Representative programs for collecting, storing, and using data include web crawlers, databases, and distributed processing systems; of these, web crawlers have recently drawn particular attention because they can be applied in many fields. A web crawler can be defined as a tool that traverses web servers in an automated manner, analyzes web pages, and collects URLs. To process the large volume of web pages generated every day as Internet usage grows, distributed web crawlers based on Hadoop MapReduce are widely used. MapReduce, however, is difficult to use and has performance constraints. Apache Spark, an in-memory computing platform proposed to address these limitations, is emerging as an alternative. A search engine, one of the main applications of a web crawler, returns the results that match a given search keyword from the information the crawler has collected. Implementing a search engine with a Spark-based web crawler instead of a conventional MapReduce-based one should enable faster data collection.
KCI Citation Count	3
BackLink	https://www.kci.go.kr/kciportal/ci/sereArticleSearch/ciSereArtiView.kci?sereArticleSearchBean.artiId=ART002194191	Access content in National Research Foundation of Korea (NRF)
DEWEY	003.5
DatabaseName	DBPIA - 디비피아 Nurimedia DBPIA Journals KoreaScience Korean Citation Index
Discipline	Applied Sciences Mathematics
DocumentTitleAlternate	Design and Implementation of a Search Engine based on Apache Spark
EISSN	2288-4165
EndPage	28
ExternalDocumentID	oai_kci_go_kr_ARTI_1321030 JAKO201710758142974 NODE07101671
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Notes	KISTI1.1003/JNL.JAKO201710758142974 http://jkiice.org G704-SER000003195.2017.21.1.016
OpenAccessLink	http://click.ndsl.kr/servlet/LinkingDetailView?cn=JAKO201710758142974&dbt=JAKO&org_code=O481&site_code=SS1481&service_code=01
PageCount	12
PublicationDate	2017 2017-01
PublicationTitleAlternate	Journal of the Korea Institute of Information and Communication Engineering
URI	https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE07101671 http://click.ndsl.kr/servlet/LinkingDetailView?cn=JAKO201710758142974&dbt=JAKO&org_code=O481&site_code=SS1481&service_code=01 https://www.kci.go.kr/kciportal/ci/sereArticleSearch/ciSereArtiView.kci?sereArticleSearchBean.artiId=ART002194191