Detecting CSV file dialects by table uniformity measurement and data type inference
| Published in | Data Science Vol. 7; no. 2; pp. 55 - 72 |
|---|---|
| Main Author | García, Wilfredo |
| Format | Journal Article |
| Language | English |
| Published | IOS Press, 25.11.2024 |
| Subjects | Computer Science; Data Structures and Algorithms |
| ISSN | 2451-8484, 2451-8492 |
| DOI | 10.3233/DS-240062 |
Cover
| Abstract | The human-readable simplicity with which the CSV format was devised, together with the absence of a standard that strictly defines the format, has allowed several dialect variants to proliferate. As a result, exchanging information between data management systems, or between countries and regions, requires human intervention during the data mining and cleansing process. This has led to the development of various computational tools that aim to accurately determine the dialects of CSV files, in order to avoid data loss at the data loading stage in a given system. However, dialect detection is a complex problem, and current systems have limitations or make assumptions that need to be improved and/or extended. This paper proposes a method for determining CSV file dialects through table uniformity, a statistical approach based on measuring table consistency and record dispersion, along with detecting the data type of each field. The new method achieves 93.38% average accuracy on a dataset of 548 CSV files composed of samples from a data load testing framework, the test suite provided by the CSV on the Web Working Group (CSVW), a curated experimental data set from the development of a similar tool, and other CSV files added to verify the parsing routines. In tests, the proposed solution outperforms the state-of-the-art tool with an average improvement of 16.45%, resulting in a net increment of about 10% in the accuracy with which dialects are detected on truly messy data for this research dataset. Furthermore, the proposed method is accurate enough to determine dialects by reading only ten records; it requires more data only to disambiguate those cases where the first records do not contain the information needed to determine a dialect. |
|---|---|
| Author | García, Wilfredo |
| Author_xml | – sequence: 1 givenname: Wilfredo orcidid: 0000-0002-9620-1119 surname: García fullname: García, Wilfredo organization: CEO office, ECP Solutions, Santiago, República Dominicana |
| BackLink | https://hal.science/hal-04663419$$DView record in HAL |
| Cites_doi | 10.1007/s10618-019-00646-y 10.1109/TKDE.2022.3222538.url 10.1145/2830508 10.1145/3085504.3085520 10.1109/OBD.2016.18 10.14778/3407790.3407810 10.1145/3219819.3220057 10.14778/2732977.2732986 10.18420/BTW2023-20 10.14778/3594512.3594518 10.17713/ajs.v38i3.272 |
| ContentType | Journal Article |
| Copyright | Attribution |
| Copyright_xml | – notice: Attribution |
| DBID | AAYXX CITATION 1XC VOOES ADTOC UNPAY |
| DOI | 10.3233/DS-240062 |
| DatabaseName | CrossRef Hyper Article en Ligne (HAL) Hyper Article en Ligne (HAL) (Open Access) Unpaywall for CDI: Periodical Content Unpaywall |
| DatabaseTitle | CrossRef |
| DatabaseTitleList | CrossRef |
| Database_xml | – sequence: 1 dbid: UNPAY name: Unpaywall url: https://proxy.k.utb.cz/login?url=https://unpaywall.org/ sourceTypes: Open Access Repository |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Computer Science |
| EISSN | 2451-8492 |
| EndPage | 72 |
| ExternalDocumentID | 10.3233/ds-240062 oai:HAL:hal-04663419v1 10_3233_DS_240062 |
| GroupedDBID | AAYXX ACGFS ACHEB ACPQW ADZMO ALMA_UNASSIGNED_HOLDINGS CITATION EBS GROUPED_DOAJ H13 J8X SAUOL SCNPE SFC 1XC VOOES 0R~ ADTOC AFYTF ARCSS EJD UNPAY |
| ID | FETCH-LOGICAL-c1732-191ddda797c4708944f65433d0db515f3c28d7f535b0cced4b6ae92f6c3f13f13 |
| IEDL.DBID | UNPAY |
| ISSN | 2451-8484 2451-8492 |
| IngestDate | Sun Oct 26 04:10:50 EDT 2025 Tue Oct 14 20:46:59 EDT 2025 Tue Jul 01 05:21:28 EDT 2025 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 2 |
| Keywords | Comma Separated Values; CSV dialect detection; Data Mining; Data Wrangling |
| Language | English |
| License | Attribution: http://creativecommons.org/licenses/by unspecified-oa |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-c1732-191ddda797c4708944f65433d0db515f3c28d7f535b0cced4b6ae92f6c3f13f13 |
| ORCID | 0000-0002-9620-1119 |
| OpenAccessLink | https://proxy.k.utb.cz/login?url=https://doi.org/10.3233/ds-240062 |
| PageCount | 18 |
| ParticipantIDs | unpaywall_primary_10_3233_ds_240062 hal_primary_oai_HAL_hal_04663419v1 crossref_primary_10_3233_DS_240062 |
| ProviderPackageCode | CITATION AAYXX |
| PublicationCentury | 2000 |
| PublicationDate | 2024-11-25 |
| PublicationDateYYYYMMDD | 2024-11-25 |
| PublicationDate_xml | – month: 11 year: 2024 text: 2024-11-25 day: 25 |
| PublicationDecade | 2020 |
| PublicationTitle | Data Science |
| PublicationYear | 2024 |
| Publisher | IOS Press |
| Publisher_xml | – name: IOS Press |
| SSID | ssj0002891350 |
| Score | 2.2776527 |
| SourceID | unpaywall hal crossref |
| SourceType | Open Access Repository Index Database |
| StartPage | 55 |
| SubjectTerms | Computer Science Data Structures and Algorithms |
| Title | Detecting CSV file dialects by table uniformity measurement and data type inference |
| URI | https://hal.science/hal-04663419 https://doi.org/10.3233/ds-240062 |
| UnpaywallVersion | publishedVersion |
| Volume | 7 |
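
The abstract describes choosing a dialect by measuring how consistent the parsed table is and how dispersed its records are, using only the first ten records. The sketch below illustrates that general idea with a deliberately simplified heuristic; it is not the paper's table uniformity measure, and it omits data type inference entirely. The function names, the candidate delimiter and quote sets, and the scoring formula (mean field count divided by one plus the standard deviation of field counts) are all assumptions made for illustration.

```python
import csv
import io
import statistics
from itertools import product


def score_dialect(text, delimiter, quotechar, max_records=10):
    """Hypothetical score: favour many fields per record, penalise uneven field counts."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    counts = []
    for row in reader:
        if len(counts) >= max_records:
            break
        if row:  # skip blank lines
            counts.append(len(row))
    if not counts:
        return 0.0
    mean_fields = statistics.fmean(counts)
    if mean_fields <= 1:
        # A single column everywhere suggests the delimiter never occurs.
        return 0.0
    dispersion = statistics.pstdev(counts)  # 0.0 when every record has the same width
    return mean_fields / (1.0 + dispersion)


def guess_dialect(text, delimiters=",;\t|", quotechars="\"'"):
    """Return the (delimiter, quotechar) pair with the highest hypothetical score."""
    candidates = product(delimiters, quotechars)
    return max(candidates, key=lambda pair: score_dialect(text, *pair))


if __name__ == "__main__":
    sample = 'name;price;note\n"Widget; large";1,50;ok\nBolt;0,25;"a, b"\n'
    print(guess_dialect(sample))
```

On the small sample in the `__main__` block, the semicolon and double-quote pair yields three fields in every record, so it scores highest; the comma candidates produce records of uneven width and are penalised. A heuristic this crude would fail on many real files, which is the kind of limitation the abstract says the proposed method addresses.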