Using MDL for Grammar Induction

Bibliographic Details
Published in: Grammatical Inference: Algorithms and Applications, pp. 293-306
Main Authors: Adriaans, Pieter; Jacobs, Ceriel
Format: Book Chapter
Language: English
Published: Berlin, Heidelberg: Springer Berlin Heidelberg, 2006
Series: Lecture Notes in Computer Science
ISBN: 3540452648, 9783540452645
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/11872436_24

Summary: In this paper we study the application of the Minimum Description Length principle (or two-part-code optimization) to grammar induction in the light of recent developments in Kolmogorov complexity theory. We focus on issues that are important for the construction of effective compression algorithms. We define an independent measure for the quality of a theory given a data set: the randomness deficiency. This is a measure of how typical the data set is for the theory. It cannot be computed, but in many relevant cases it can be approximated. An optimal theory has minimal randomness deficiency. Using results from [4] and [2] we show that:
– Shorter code does not necessarily lead to better theories. We prove that, in DFA induction, divergence of randomness deficiency and MDL code can occur already as a result of a single deterministic merge of two nodes.
– Contrary to what is suggested by the results of [6], there is no fundamental difference between positive and negative data from an MDL perspective.
– MDL is extremely sensitive to the correct calculation of code length: model code and data-to-model code.
These results show why the applications of MDL to grammar induction so far have been disappointing. We show how the theoretical results can be deployed to create an effective algorithm for DFA induction. However, we believe that, since MDL is a global optimization criterion, MDL-based solutions will in many cases be less effective in problem domains where local optimization criteria can be easily calculated. The algorithms were tested on the Abbadingo problems ([10]). The code was written in Java, using the Satin ([17]) divide-and-conquer system that runs on top of the Ibis ([18]) Grid programming environment.
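To make the two parts of the code mentioned in the summary concrete, the following is a minimal sketch of a two-part code length for a DFA. It is not the chapter's algorithm or its exact coding scheme. It assumes, for illustration only, that the data-to-model code is the number of bits needed to pick the sample out of all strings up to a given length that the DFA accepts (the log of a binomial coefficient), and that the model code is a crude transition-table encoding. All names (`count_accepted`, `data_to_model_bits`, `model_bits`, `ALL_STRINGS`, `EVEN_A`) are hypothetical.

```python
from math import comb, log2

# A DFA is a dict: state -> {symbol: next_state}. Both automata below are
# illustrative assumptions over the alphabet {a, b}.
ALL_STRINGS = {0: {"a": 0, "b": 0}}                    # accepts every string
EVEN_A = {0: {"a": 1, "b": 0}, 1: {"a": 0, "b": 1}}    # even number of a's


def count_accepted(transitions, start, accepting, max_len):
    """Count the strings of length <= max_len that the DFA accepts."""
    counts = {start: 1}  # state -> number of strings of current length ending there
    total = 1 if start in accepting else 0  # the empty string
    for _ in range(max_len):
        nxt = {}
        for state, n in counts.items():
            for target in transitions.get(state, {}).values():
                nxt[target] = nxt.get(target, 0) + n
        counts = nxt
        total += sum(n for s, n in counts.items() if s in accepting)
    return total


def data_to_model_bits(transitions, start, accepting, sample_size, max_len):
    """Bits to select a sample of distinct accepted strings (assumed coding)."""
    n_accepted = count_accepted(transitions, start, accepting, max_len)
    if sample_size > n_accepted:
        raise ValueError("sample is larger than the set of accepted strings")
    return log2(comb(n_accepted, sample_size))


def model_bits(transitions, alphabet_size):
    """Crude model code (assumption): one table entry per state and symbol,
    each naming a target state or 'missing', plus one accepting bit per state."""
    n_states = len(transitions)
    return n_states * alphabet_size * log2(n_states + 1) + n_states


if __name__ == "__main__":
    for name, dfa in (("ALL_STRINGS", ALL_STRINGS), ("EVEN_A", EVEN_A)):
        model = model_bits(dfa, alphabet_size=2)
        data = data_to_model_bits(dfa, 0, {0}, sample_size=4, max_len=3)
        print(f"{name}: model={model:.2f} data={data:.2f} total={model + data:.2f}")
```

Under these assumed codes, the more specific automaton gets a shorter data-to-model code but a longer model code, so the total depends on how each part is calculated. This is the kind of sensitivity to code-length calculation that the summary points to.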