Validating UTF‐8 in less than one instruction per byte

Bibliographic Details
Published in Software, practice & experience Vol. 51; no. 5; pp. 950-964
Main Authors Keiser, John; Lemire, Daniel
Format Journal Article
Language English
Published Bognor Regis: Wiley Subscription Services, Inc, 01.05.2021
ISSN 0038-0644
EISSN 1097-024X
DOI 10.1002/spe.2920

Abstract The majority of text is stored in UTF‐8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF‐8 validation routines used in many libraries and languages by more than 10 times using commonly available single‐instruction‐multiple‐data instructions. To ensure reproducibility, our work is freely available as open source software.
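For orientation, the kind of validation the abstract refers to can be illustrated with a plain byte-at-a-time check. The sketch below is not the lookup algorithm from the article and uses no SIMD instructions; it is only a conventional scalar baseline, and the function name `is_valid_utf8` is hypothetical. The byte ranges follow the well-formed UTF‐8 byte sequence table in the Unicode Standard.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Scalar UTF-8 check (illustrative baseline, not the paper's method)."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:  # ASCII: one byte, always valid
            i += 1
            continue
        # Pick the number of continuation bytes and the allowed range
        # of the FIRST continuation byte for this lead byte.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF  # rejects overlong 3-byte forms
        elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # rejects surrogates U+D800..U+DFFF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF  # rejects overlong 4-byte forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # rejects code points above U+10FFFF
        else:
            return False  # stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF
        if i + need >= n:
            return False  # sequence is truncated at the end of the input
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

A check like this inspects every byte with several branches, which is the per-byte cost that a vectorized validator aims to avoid.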
Authors
– Keiser, John (Microsoft)
– Lemire, Daniel (Université du Québec (TELUQ); ORCID 0000-0003-3306-6922; lemire@gmail.com)
ContentType Journal Article
Copyright 2020 John Wiley & Sons Ltd.
2021 John Wiley & Sons, Ltd.
Discipline Computer Science
Funding National Research Council Canada, RGPIN‐2017‐03910
IsPeerReviewed true
IsScholarly true
Subjects character encoding; ingestion; text processing; Unicode; vectorization
URI https://onlinelibrary.wiley.com/doi/abs/10.1002%2Fspe.2920
https://www.proquest.com/docview/2509214292