The languages of health in general practice electronic patient records: a Zipf’s law analysis


Bibliographic Details
Published in Journal of biomedical semantics, Vol. 5, no. 1, p. 2
Main Authors Kalankesh, Leila R, New, John P, Baker, Patricia G, Brass, Andy
Format Journal Article
Language English
Published London BioMed Central 10.01.2014
BioMed Central Ltd
Springer Nature B.V
ISSN 2041-1480
DOI 10.1186/2041-1480-5-2

Summary: Background: Natural human languages show a power law behaviour in which word frequency (in any large enough corpus) is inversely proportional to word rank, a relationship known as Zipf’s law. We have therefore asked whether similar power law behaviours could be seen in data from electronic patient records. Results: In order to examine this question, anonymised data were obtained from all general practices in Salford covering a seven-year period and captured in the form of Read codes. It was found that data for patient diagnoses and procedures followed Zipf’s law. However, the medication data behaved very differently, looking much more like a referential index. We also observed differences in the statistical behaviour of the language used to describe patient diagnosis as a function of an anonymised GP practice identifier. Conclusions: This work demonstrates that data from electronic patient records do follow Zipf’s law. We also found significant differences in Zipf’s law behaviour in data from different GP practices. This suggests that computational linguistic techniques could become a useful additional tool to help understand and monitor the data quality of health records.
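Illustration: the summary does not say how the authors tested for Zipf’s law, so the following is only a minimal sketch of one common check, assuming a list of recorded codes is available. It ranks codes by frequency and fits a least-squares line to log(frequency) against log(rank); a slope near -1 is consistent with Zipf’s law. The function name `zipf_slope` and the synthetic `code1`, `code2`, ... labels are hypothetical and are not real Read codes or data from the study.

```python
from collections import Counter
import math


def zipf_slope(codes):
    """Return the least-squares slope of log(frequency) against log(rank).

    A slope close to -1 is what Zipf's law predicts. This is an
    illustrative check, not the method used in the article.
    """
    # Frequencies of each distinct code, most frequent first (rank 1).
    freqs = sorted(Counter(codes).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct codes to fit a slope")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]

    # Ordinary least squares for a straight line y = a + slope * x.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic corpus (assumption): the code at rank r occurs about 1000 / r times.
codes = []
for rank in range(1, 51):
    codes.extend([f"code{rank}"] * round(1000 / rank))

slope = zipf_slope(codes)
print(f"fitted log-log slope: {slope:.3f}")
```

Applied separately to diagnosis, procedure and medication codes, or per practice identifier, a comparison of such slopes would be one rough way to look for the kind of differences the summary describes. A simple log-log regression is a coarse test, and more careful power law fitting may be needed for real data.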