Grammar Compression: Grammatical Inference by Compression and Its Application to Real Data

Hiroshi Sakamoto
The 12th International Conference on Grammatical Inference, PMLR 34:3-20, 2014.

Abstract

A grammatical inference algorithm tries to find as a small grammar as possible representing a potentially infinite sequence of strings. Here, let us consider a simple restriction: the input is a finite sequence or it might be a singleton set. Then the restricted problem is called the \em grammar compression to find the smallest CFG generating just the input. In the last decade many researchers have tackled this problem because of its scalable applications, e.g., expansion of data storage capacity, speeding-up information retrieval, DNA sequencing, frequent pattern mining, and similarity search. We would review the history of grammar compression and its wide applications together with an important future work. The study of grammar compression has begun with the bad news: the smallest CFG problem is NP-hard. Hence, the first question is: Can we get a near-optimal solution in a polynomial time? (Is there a reasonable approximation algorithm?) And the next question is: Can we minimize the costs of time and space? (Does a linear time algorithm exist within an optimal working space?) The recent results produced by the research community answer affirmatively the questions. We introduce several important results and typical applications to a huge text collection. On the other hand, the shrinkage of the advantage of grammar compression is caused by the data explosion, since there is no working space for storing the whole data supplied from data stream. The last question is: How can we handle the stream data? For this question, we propose the framework of \em stream grammar compression for the next generation and its attractive application to fast data transmission.

Cite this Paper


BibTeX
@InProceedings{pmlr-v34-sakamoto14a, title = {Grammar Compression: Grammatical Inference by Compression and Its Application to Real Data }, author = {Sakamoto, Hiroshi}, booktitle = {The 12th International Conference on Grammatical Inference}, pages = {3--20}, year = {2014}, editor = {Clark, Alexander and Kanazawa, Makoto and Yoshinaka, Ryo}, volume = {34}, series = {Proceedings of Machine Learning Research}, address = {Kyoto, Japan}, month = {17--19 Sep}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v34/sakamoto14a.pdf}, url = {https://proceedings.mlr.press/v34/sakamoto14a.html}, abstract = {A grammatical inference algorithm tries to find as a small grammar as possible representing a potentially infinite sequence of strings. Here, let us consider a simple restriction: the input is a finite sequence or it might be a singleton set. Then the restricted problem is called the \em grammar compression to find the smallest CFG generating just the input. In the last decade many researchers have tackled this problem because of its scalable applications, e.g., expansion of data storage capacity, speeding-up information retrieval, DNA sequencing, frequent pattern mining, and similarity search. We would review the history of grammar compression and its wide applications together with an important future work. The study of grammar compression has begun with the bad news: the smallest CFG problem is NP-hard. Hence, the first question is: Can we get a near-optimal solution in a polynomial time? (Is there a reasonable approximation algorithm?) And the next question is: Can we minimize the costs of time and space? (Does a linear time algorithm exist within an optimal working space?) The recent results produced by the research community answer affirmatively the questions. We introduce several important results and typical applications to a huge text collection. On the other hand, the shrinkage of the advantage of grammar compression is caused by the data explosion, since there is no working space for storing the whole data supplied from data stream. The last question is: How can we handle the stream data? For this question, we propose the framework of \em stream grammar compression for the next generation and its attractive application to fast data transmission.} }
Endnote
%0 Conference Paper %T Grammar Compression: Grammatical Inference by Compression and Its Application to Real Data %A Hiroshi Sakamoto %B The 12th International Conference on Grammatical Inference %C Proceedings of Machine Learning Research %D 2014 %E Alexander Clark %E Makoto Kanazawa %E Ryo Yoshinaka %F pmlr-v34-sakamoto14a %I PMLR %P 3--20 %U https://proceedings.mlr.press/v34/sakamoto14a.html %V 34 %X A grammatical inference algorithm tries to find as a small grammar as possible representing a potentially infinite sequence of strings. Here, let us consider a simple restriction: the input is a finite sequence or it might be a singleton set. Then the restricted problem is called the \em grammar compression to find the smallest CFG generating just the input. In the last decade many researchers have tackled this problem because of its scalable applications, e.g., expansion of data storage capacity, speeding-up information retrieval, DNA sequencing, frequent pattern mining, and similarity search. We would review the history of grammar compression and its wide applications together with an important future work. The study of grammar compression has begun with the bad news: the smallest CFG problem is NP-hard. Hence, the first question is: Can we get a near-optimal solution in a polynomial time? (Is there a reasonable approximation algorithm?) And the next question is: Can we minimize the costs of time and space? (Does a linear time algorithm exist within an optimal working space?) The recent results produced by the research community answer affirmatively the questions. We introduce several important results and typical applications to a huge text collection. On the other hand, the shrinkage of the advantage of grammar compression is caused by the data explosion, since there is no working space for storing the whole data supplied from data stream. The last question is: How can we handle the stream data? For this question, we propose the framework of \em stream grammar compression for the next generation and its attractive application to fast data transmission.
RIS
TY - CPAPER TI - Grammar Compression: Grammatical Inference by Compression and Its Application to Real Data AU - Hiroshi Sakamoto BT - The 12th International Conference on Grammatical Inference DA - 2014/08/30 ED - Alexander Clark ED - Makoto Kanazawa ED - Ryo Yoshinaka ID - pmlr-v34-sakamoto14a PB - PMLR DP - Proceedings of Machine Learning Research VL - 34 SP - 3 EP - 20 L1 - http://proceedings.mlr.press/v34/sakamoto14a.pdf UR - https://proceedings.mlr.press/v34/sakamoto14a.html AB - A grammatical inference algorithm tries to find as a small grammar as possible representing a potentially infinite sequence of strings. Here, let us consider a simple restriction: the input is a finite sequence or it might be a singleton set. Then the restricted problem is called the \em grammar compression to find the smallest CFG generating just the input. In the last decade many researchers have tackled this problem because of its scalable applications, e.g., expansion of data storage capacity, speeding-up information retrieval, DNA sequencing, frequent pattern mining, and similarity search. We would review the history of grammar compression and its wide applications together with an important future work. The study of grammar compression has begun with the bad news: the smallest CFG problem is NP-hard. Hence, the first question is: Can we get a near-optimal solution in a polynomial time? (Is there a reasonable approximation algorithm?) And the next question is: Can we minimize the costs of time and space? (Does a linear time algorithm exist within an optimal working space?) The recent results produced by the research community answer affirmatively the questions. We introduce several important results and typical applications to a huge text collection. On the other hand, the shrinkage of the advantage of grammar compression is caused by the data explosion, since there is no working space for storing the whole data supplied from data stream. The last question is: How can we handle the stream data? For this question, we propose the framework of \em stream grammar compression for the next generation and its attractive application to fast data transmission. ER -
APA
Sakamoto, H.. (2014). Grammar Compression: Grammatical Inference by Compression and Its Application to Real Data . The 12th International Conference on Grammatical Inference, in Proceedings of Machine Learning Research 34:3-20 Available from https://proceedings.mlr.press/v34/sakamoto14a.html.

Related Material