Detecting CSV file dialects by table uniformity measurement and data type inference

Bibliographic Details
Published in Data Science Vol. 7; no. 2; pp. 55-72
Main Author García, Wilfredo
Format Journal Article
Language English
Published IOS Press 25.11.2024
ISSN 2451-8484
EISSN 2451-8492
DOI 10.3233/DS-240062

Abstract The human-readable simplicity with which the CSV format was devised, together with the absence of a standard that strictly defines this format, has allowed the proliferation of several variants in the dialects with which these files are written. As a result, the exchange of information between data management systems, or between countries and regions, requires human intervention during the data mining and cleansing process. This has led to the development of various computational tools that aim to accurately determine the dialects of CSV files, in order to avoid data loss at the data loading stage in a given system. However, dialect detection is a complex problem, and current systems have limitations or make assumptions that need to be improved and/or extended. This paper proposes a method for determining CSV file dialects through table uniformity, a statistical approach based on measuring table consistency and record dispersion, along with detecting the data type of each field. The new method has a 93.38% average accuracy on a dataset of 548 CSV files composed of samples from a data load testing framework, the test suite provided by the CSV on the Web Working Group (CSVW), a curated experimental dataset from similar tool development, and some other CSV files added to verify the parsing routines. In tests, the proposed solution outperforms the state-of-the-art tool, achieving an average improvement of 16.45%, which results in a net increase of about 10% in the accuracy with which dialects are detected on truly messy data for this research dataset. Furthermore, the proposed method is accurate enough to determine dialects by reading only ten records, requiring more data only to disambiguate those cases where the first records do not contain the information necessary to reach a dialect determination.
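This record does not reproduce the paper's formulas for table uniformity, so the following Python sketch only illustrates the general idea the abstract names: score each candidate delimiter by how consistent the resulting record widths are and by how many fields look like typed values, using only the first ten records. The names score_dialect, guess_delimiter and TYPE_PATTERNS, the candidate list, and the scoring formula itself are assumptions made for illustration and are not taken from the article.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical type patterns (assumption): the article's real type
# detection rules are not given in this record.
TYPE_PATTERNS = [
    re.compile(r"^[+-]?\d+$"),           # integer
    re.compile(r"^[+-]?\d*[.,]\d+$"),    # decimal with "." or "," separator
    re.compile(r"^\d{4}-\d{2}-\d{2}$"),  # ISO 8601 date
]


def score_dialect(text, delimiter, quotechar='"', max_records=10):
    """Score one candidate dialect on the first `max_records` records."""
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, quotechar=quotechar)
    rows = []
    for row in reader:
        if row:  # skip blank lines
            rows.append(row)
        if len(rows) >= max_records:
            break
    if not rows:
        return 0.0

    # Consistency: share of records that have the most common field count.
    widths = Counter(len(row) for row in rows)
    modal_width, modal_freq = widths.most_common(1)[0]
    if modal_width < 2:
        return 0.0  # a single column means the delimiter never split anything
    consistency = modal_freq / len(rows)

    # Type score: share of fields matching one of the known patterns.
    fields = [field.strip() for row in rows for field in row]
    typed = sum(1 for f in fields if any(p.match(f) for p in TYPE_PATTERNS))
    type_score = typed / len(fields)

    # Assumed combination rule: consistency weighted up by typed fields.
    return consistency * (1.0 + type_score)


def guess_delimiter(text, candidates=(",", ";", "\t", "|")):
    """Return the candidate delimiter with the highest score."""
    return max(candidates, key=lambda d: score_dialect(text, d))


sample = "name;price;date\nwidget;1,50;2024-01-02\ngadget;2,75;2024-02-03\n"
print(repr(guess_delimiter(sample)))
```

In the sample, the comma also appears in every data record (as a decimal separator), so field counts alone are a weak signal. The semicolon wins because it yields uniform three-field records whose values match the type patterns, which is the kind of ambiguity the abstract says data type inference is meant to resolve.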
Author García, Wilfredo
Author_xml – sequence: 1
  givenname: Wilfredo
  orcidid: 0000-0002-9620-1119
  surname: García
  fullname: García, Wilfredo
  organization: CEO office, ECP Solutions, Santiago, República Dominicana
BackLink https://hal.science/hal-04663419 (record in HAL)
ContentType Journal Article
Copyright Attribution
Copyright_xml – notice: Attribution
Discipline Computer Science
EISSN 2451-8492
EndPage 72
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 2
Keywords Comma Separated Values
CSV dialect detection
Data Mining
Data Wrangling
Language English
License Attribution: http://creativecommons.org/licenses/by
ORCID 0000-0002-9620-1119
OpenAccessLink https://doi.org/10.3233/ds-240062
PageCount 18
PublicationDate 2024-11-25
PublicationTitle Data Science
PublicationYear 2024
Publisher IOS Press
Publisher_xml – name: IOS Press
StartPage 55
SubjectTerms Computer Science
Data Structures and Algorithms
Title Detecting CSV file dialects by table uniformity measurement and data type inference
URI https://hal.science/hal-04663419
https://doi.org/10.3233/ds-240062
UnpaywallVersion publishedVersion
Volume 7