Using MDL for Grammar Induction

Bibliographic Details
Published in	Grammatical Inference: Algorithms and Applications, pp. 293-306
Main Authors	Adriaans, Pieter; Jacobs, Ceriel
Format	Book Chapter
Language	English
Published	Berlin, Heidelberg: Springer Berlin Heidelberg, 2006
Series	Lecture Notes in Computer Science
Subjects	Binary String; Kolmogorov Complexity; Minimum Description Length Principle; Regular Language; Short Code
ISBN	3540452648 9783540452645
ISSN	0302-9743 1611-3349
DOI	10.1007/11872436_24

Summary:	In this paper we study the application of the Minimum Description Length principle (or two-part-code optimization) to grammar induction in the light of recent developments in Kolmogorov complexity theory. We focus on issues that are important for the construction of effective compression algorithms. We define an independent measure for the quality of a theory given a data set: the randomness deficiency. This is a measure of how typical the data set is for the theory. It cannot be computed, but in many relevant cases it can be approximated. An optimal theory has minimal randomness deficiency. Using results from [4] and [2] we show that:
- Shorter code does not necessarily lead to better theories. We prove that in DFA induction, randomness deficiency and MDL code length can diverge after even a single deterministic merge of two nodes.
- Contrary to what is suggested by the results of [6], there is no fundamental difference between positive and negative data from an MDL perspective.
- MDL is extremely sensitive to the correct calculation of code length, both the model code and the data-to-model code.
These results show why the applications of MDL to grammar induction have so far been disappointing. We show how the theoretical results can be deployed to create an effective algorithm for DFA induction. However, we believe that, since MDL is a global optimization criterion, MDL-based solutions will in many cases be less effective in problem domains where local optimization criteria can be easily calculated. The algorithms were tested on the Abbadingo problems ([10]). The code was written in Java, using the Satin ([17]) divide-and-conquer system that runs on top of the Ibis ([18]) Grid programming environment.
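
The two-part code named in the summary (model code plus data-to-model code) can be illustrated with a minimal Python sketch. The encoding below is an assumption chosen for illustration and is not the coding scheme of the chapter: the model is charged for a plain transition table, and the data is charged for the index of the sample among all equally sized subsets of the strings the DFA accepts up to a length bound. All function names are hypothetical.

```python
from math import comb, log2

def count_accepted(transitions, accepting, start, alphabet, max_len):
    """Count strings of length <= max_len accepted by a (possibly partial) DFA."""
    # counts[state] = number of strings of the current length ending in that state
    counts = {start: 1}
    total = 1 if start in accepting else 0  # the empty string
    for _ in range(max_len):
        nxt = {}
        for state, n in counts.items():
            for sym in alphabet:
                target = transitions.get((state, sym))
                if target is not None:
                    nxt[target] = nxt.get(target, 0) + n
        counts = nxt
        total += sum(n for s, n in counts.items() if s in accepting)
    return total

def model_bits(num_states, alphabet_size):
    # Assumed encoding: each table entry names a target state or "undefined",
    # plus one accepting/rejecting bit per state.
    return num_states * alphabet_size * log2(num_states + 1) + num_states

def data_to_model_bits(num_accepted, sample_size):
    # Assumed encoding: index of the sample among all subsets of that size.
    return log2(comb(num_accepted, sample_size))

def two_part_bits(transitions, accepting, start, alphabet, num_states,
                  sample_size, max_len):
    accepted = count_accepted(transitions, accepting, start, alphabet, max_len)
    return model_bits(num_states, len(alphabet)) + data_to_model_bits(accepted, sample_size)

# Two candidate theories for a sample of 4 positive strings of length <= 3.
alphabet = ["a", "b"]
even_a = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}  # even number of a's
universal = {(0, "a"): 0, (0, "b"): 0}                          # accepts everything

print(two_part_bits(even_a, {0}, 0, alphabet, 2, sample_size=4, max_len=3))
print(two_part_bits(universal, {0}, 0, alphabet, 1, sample_size=4, max_len=3))
```

Changing either assumed encoding, for example the cost per table entry, changes which candidate receives the shorter total. This is one way to see why the summary stresses the correct calculation of both code lengths.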