Validating UTF‐8 in less than one instruction per byte
The majority of text is stored in UTF‐8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF‐8 validation routines used in many libraries and languages by more than 10 times using commonly available single‐instruction‐multiple‐data instructions. To ensure reproducibility, our work is freely available as open source software.
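To make concrete what "validating UTF‐8" involves, here is a minimal scalar sketch in Python. It is not the article's lookup algorithm and uses no SIMD instructions; it is the kind of byte-at-a-time baseline that such work is compared against. The function name `is_valid_utf8` is hypothetical, and the byte ranges follow the standard definition of well-formed UTF‐8 (no overlong encodings, no surrogates, nothing above U+10FFFF).

```python
def is_valid_utf8(data: bytes) -> bool:
    """Scalar UTF-8 validation sketch (illustrative baseline, not the paper's method)."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:
            # Single-byte (ASCII) code point.
            i += 1
            continue
        # Choose the sequence length and the allowed range of the SECOND byte.
        # The narrowed ranges reject overlong forms, surrogates, and > U+10FFFF.
        if 0xC2 <= b0 <= 0xDF:
            length, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xE0:
            length, lo, hi = 3, 0xA0, 0xBF  # below 0xA0 would be overlong
        elif b0 == 0xED:
            length, lo, hi = 3, 0x80, 0x9F  # above 0x9F would be a surrogate
        elif 0xE1 <= b0 <= 0xEF:
            length, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF0:
            length, lo, hi = 4, 0x90, 0xBF  # below 0x90 would be overlong
        elif 0xF1 <= b0 <= 0xF3:
            length, lo, hi = 4, 0x80, 0xBF
        elif b0 == 0xF4:
            length, lo, hi = 4, 0x80, 0x8F  # above 0x8F would exceed U+10FFFF
        else:
            # Stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF.
            return False
        if i + length > n:
            return False  # sequence truncated at end of input
        if not (lo <= data[i + 1] <= hi):
            return False
        # Any remaining bytes must be plain continuation bytes.
        for j in range(i + 2, i + length):
            if not (0x80 <= data[j] <= 0xBF):
                return False
        i += length
    return True
```

A validator like this branches on every non-ASCII byte, which is the per-byte cost that a SIMD approach aims to avoid by checking many bytes per instruction.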
| Published in | Software, practice & experience, Vol. 51, no. 5, pp. 950-964 |
|---|---|
| Main Authors | Keiser, John; Lemire, Daniel |
| Format | Journal Article |
| Language | English |
| Published | Bognor Regis: Wiley Subscription Services, Inc, 01.05.2021 |
| ISSN | 0038-0644 (print); 1097-024X (electronic) |
| DOI | 10.1002/spe.2920 |
| Abstract | The majority of text is stored in UTF‐8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF‐8 validation routines used in many libraries and languages by more than 10 times using commonly available single‐instruction‐multiple‐data instructions. To ensure reproducibility, our work is freely available as open source software. |
|---|---|
| Author | Keiser, John (Microsoft); Lemire, Daniel (Université du Québec (TELUQ); ORCID 0000-0003-3306-6922; lemire@gmail.com) |
| Copyright | 2020 John Wiley & Sons Ltd. 2021 John Wiley & Sons, Ltd. |
| Discipline | Computer Science |
| Funding | National Research Council Canada, RGPIN‐2017‐03910 |
| Subjects | character encoding; ingestion; text processing; Unicode; vectorization |