FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data [version 2; peer review: 2 approved]

Bibliographic Details
Published in F1000Research Vol. 7; p. 1667
Main Authors Buranosky, Matt; Stellnberger, Elmar; Pfaff, Emily; Diaz-Sanchez, David; Ward-Caviness, Cavin
Format Journal Article
Language English
Published England F1000 Research Limited 2018
ISSN 2046-1402
DOI 10.12688/f1000research.16483.2

Summary: Functional dependencies (FDs) and candidate keys are essential for table decomposition, database normalization, and data cleansing. In this paper, we present FDTool, a command line Python application to discover minimal FDs in tabular datasets and infer equivalent attribute sets and candidate keys from them. The runtime and memory costs associated with seven published FD discovery algorithms are given with an overview of their theoretical foundations. Previous research establishes that FD_Mine is the most efficient FD discovery algorithm when applied to datasets with many rows (> 100,000 rows) and few columns (< 14 columns). This makes it well suited to mining rules from clinical and demographic datasets, which often consist of long and narrow sets of participant records. The structure of FD_Mine is described and supplemented with a formal proof of the equivalence pruning method used. FDTool is a re-implementation of FD_Mine with additional features that improve performance and automate typical processes in database architecture. The experimental results of applying FDTool to 13 datasets of different dimensions are summarized in terms of the number of FDs checked, the number of FDs found, and the time the code takes to terminate. We find that the number of attributes in a dataset has a much greater effect on the runtime and memory costs of FDTool than does row count. The last section explains in detail how the FDTool application can be accessed, executed, and further developed.
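To make the summary's two central notions concrete, the following is a minimal brute-force sketch of what it means for a functional dependency to hold in a table and for an attribute set to be a candidate key. It is not FDTool's code and does not implement FD_Mine or its equivalence pruning; the function names `fd_holds` and `candidate_keys` and the sample rows are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def fd_holds(rows, lhs, rhs):
    """Return True if each combination of lhs values maps to exactly one
    combination of rhs values across all rows (i.e. lhs -> rhs holds)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores val the first time key is seen and returns the
        # stored value afterwards; a mismatch violates the dependency.
        if seen.setdefault(key, val) != val:
            return False
    return True


def candidate_keys(rows, attributes):
    """Brute-force search for minimal attribute sets that functionally
    determine every attribute (illustrative; exponential in column count)."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            if any(set(k) <= set(combo) for k in keys):
                continue  # a smaller key is contained in combo: not minimal
            if fd_holds(rows, combo, attributes):
                keys.append(combo)
    return keys


# Hypothetical sample data, invented for this sketch.
rows = [
    {"id": 1, "zip": "27514", "city": "Chapel Hill"},
    {"id": 2, "zip": "27514", "city": "Chapel Hill"},
    {"id": 3, "zip": "27701", "city": "Durham"},
]
attributes = ["id", "zip", "city"]

zip_determines_city = fd_holds(rows, ["zip"], ["city"])
city_determines_id = fd_holds(rows, ["city"], ["id"])
keys = candidate_keys(rows, attributes)
```

In this sample, `zip` determines `city`, `city` does not determine `id`, and `("id",)` is the only candidate key. The search enumerates every attribute combination, which illustrates why the number of columns tends to dominate cost in FD discovery, consistent with the summary's observation about attribute count versus row count.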
MB and ES designed and implemented the software. MB wrote the manuscript. CWC supervised MB and reviewed the manuscript. EP maintained the research data. DDS coordinated the funding for the project. All authors agreed to the final content of the manuscript.
No competing interests were disclosed.