Detecting CSV file dialects by table uniformity measurement and data type inference

Bibliographic Details
Published in	Data Science Vol. 7; no. 2; pp. 55 - 72
Main Author	García, Wilfredo
Format	Journal Article
Language	English
Published	IOS Press 25.11.2024
Subjects	Computer Science; Data Structures and Algorithms; Comma Separated Values; CSV dialect detection; Data Mining; Data Wrangling
Online Access	Get full text
ISSN	2451-8484, 2451-8492
DOI	10.3233/DS-240062

Abstract	The human-readable simplicity with which the CSV format was devised, together with the absence of a standard that strictly defines this format, has allowed the proliferation of several variants in the dialects with which these files are written. As a result, the exchange of information between data management systems, or between countries and regions, requires human intervention during the data mining and cleansing process. This has led to the development of various computational tools that aim to accurately determine the dialects of CSV files, in order to avoid data loss at the data loading stage in a given system. However, dialect detection is a complex problem, and current systems have limitations or make assumptions that need to be improved and/or extended. This paper proposes a method for determining CSV file dialects through table uniformity, a statistical approach based on measuring table consistency and record dispersion, along with detecting the data type of each field. The new method has a 93.38% average accuracy on a dataset of 548 CSV files composed of samples coming from a data load testing framework, the test suite provided by the CSV on the Web Working Group (CSVW), a curated experimental dataset from the development of a similar tool, and some other CSV files added to verify the parsing routines. In tests, the proposed solution outperforms the state-of-the-art tool by achieving an average improvement of 16.45%, resulting in a net increment of about 10% in the accuracy with which dialects are detected on truly messy data for this research dataset. Furthermore, the proposed method is accurate enough to determine dialects by reading only ten records, requiring more data only to disambiguate those cases where the first records do not contain the information necessary to determine a dialect.
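The general idea summarized in the abstract (score each candidate dialect by how consistent the resulting table is and by how many fields have a recognizable data type) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the candidate lists, the `TYPE_PATTERNS` regular expressions, the scoring formula, and the function names `parse`, `score`, and `detect_dialect` are all assumptions made for illustration. Only the ten-record sample size is taken from the abstract.

```python
import csv
import io
import re
from itertools import product
from statistics import mean, pstdev

# Assumed candidate dialect components (not taken from the paper).
DELIMITERS = [",", ";", "\t", "|"]
QUOTECHARS = ['"', "'"]

# Hypothetical, deliberately small set of data type patterns.
TYPE_PATTERNS = [
    re.compile(r"^[+-]?\d+([.,]\d+)?$"),           # integer or decimal number
    re.compile(r"^\d{4}-\d{2}-\d{2}$"),            # ISO date
    re.compile(r"^(true|false|yes|no)$", re.I),    # boolean-like value
    re.compile(r"^[A-Za-z][A-Za-z0-9 _.-]*$"),     # plain text token
]


def parse(text, delimiter, quotechar, limit=10):
    """Parse at most `limit` non-empty records with the given dialect."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quotechar=quotechar)
    rows = []
    for row in reader:
        if row:
            rows.append(row)
        if len(rows) == limit:
            break
    return rows


def score(rows):
    """Combine field-count consistency with the share of typed cells."""
    if not rows:
        return 0.0
    counts = [len(row) for row in rows]
    if mean(counts) < 2:
        return 0.0  # assumption: a real table has at least two columns
    # Low dispersion of the per-record field count means a uniform table.
    consistency = 1.0 / (1.0 + pstdev(counts))
    cells = [cell.strip() for row in rows for cell in row]
    typed = sum(any(p.match(c) for p in TYPE_PATTERNS) for c in cells)
    return consistency * typed / len(cells)


def detect_dialect(text, limit=10):
    """Return the (delimiter, quotechar) pair with the highest score."""
    candidates = product(DELIMITERS, QUOTECHARS)
    return max(candidates, key=lambda d: score(parse(text, d[0], d[1], limit)))


sample = "id;name;price\n1;Widget;3,50\n2;Gadget;12,00\n3;Gizmo;7,25\n"
delimiter, quotechar = detect_dialect(sample)
```

In this sketch, ties between candidates are resolved by list order, so the double quote is preferred when the sample contains no quoting at all. A real detector would need a richer type system and a way to request more records when the first ten are ambiguous, as the abstract notes.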
Author affiliation	CEO office, ECP Solutions, Santiago, Dominican Republic
BackLink	https://hal.science/hal-04663419 (View record in HAL)
Cites_doi	10.1007/s10618-019-00646-y 10.1109/TKDE.2022.3222538.url 10.1145/2830508 10.1145/3085504.3085520 10.1109/OBD.2016.18 10.14778/3407790.3407810 10.1145/3219819.3220057 10.14778/2732977.2732986 10.18420/BTW2023-20 10.14778/3594512.3594518 10.17713/ajs.v38i3.272
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
License	Attribution: http://creativecommons.org/licenses/by unspecified-oa
ORCID	0000-0002-9620-1119
PageCount	18
URI	https://hal.science/hal-04663419 https://doi.org/10.3233/ds-240062