Space-Time Optimization in Natural Language Text Compression

  • Anatoly Anisimov Doctor of Physical and Mathematical Sciences, professor, Taras Shevchenko National University of Kyiv, 4d Glushkov Ave., 03022, Kyiv, Ukraine
  • Igor Zavadskyi Doctor of Physical and Mathematical Sciences, associate professor, Taras Shevchenko National University of Kyiv, 4d Glushkov Ave., 03022, Kyiv, Ukraine

Abstract

In this paper, we discuss various problems arising in space and time optimization of natural language text compression methods. We define a new class of variable-length universal data compression codes with multiple delimiters — the Reverse Multi-Delimiter (RMD) codes. They are synchronizable, allow us to perform fast Boyer-Moore-style search in a compressed file, and at the same time provide the best compression ratio among all codes of a discussed class. In combination with a special technique of preprocessing a natural language text and its dictionary, they improve the performance of modern powerful achievers. Also, we construct a very fast decoding algorithm for RMD-codes operating almost at the same speed as (s,c)-dense codes and times faster than Fi-bonacci codes decoding. The provided experiments show that RMD-codes occupy a very attractive position by the means of space/decoding time tradeoffs in natural language text compression.

References

A. Apostolico and A. S. Fraenkel. Robust transmission of unbounded strings using Fibonacci representations, IEEE Trans. Inf. Theory, vol. 33, 1987, pp. 238–245.

N. Brisaboa, A. Farina, G. Navarro, and M. Esteller. (s,c)-dense coding: an optimized compression code for natural language text databases, in: Proc. Symposium on String Processing and Information Retrieval, ser. LNCS, no. 2857. SVB, 2003, pp. 122–136.

S. T. Klein and M. Ben-Nissan. On the usefulness of fibonacci compression codes, Computer Journal, vol. 53, no. 6, pp. 701–716, 2010.

A. Anisimov and I. Zavadskyi. Variable-length prefix codes with multiple delimiters, IEEE Transactions Information Theory, vol. 63, no. 5, 2017, pp. 2885–2895.

I. Zavadskyi and A. Anisimov. Reverse multi-delimiter compression codes, in: 2020 Data Compression Conference, 2020, pp. 173–182.

Published
2023-06-13
How to Cite
Anisimov, A., & Zavadskyi, I. (2023). Space-Time Optimization in Natural Language Text Compression. PHYSICO-MATHEMATICAL MODELLING AND INFORMATIONAL TECHNOLOGIES, (36), 7-11. Retrieved from http://fmmit.lviv.ua/index.php/fmmit/article/view/294