Analysis of the possibilities of applying the Vision Grid Transformer model for document structure analysis of Ukrainian accounting documents
Abstract
In the context of ongoing digital transformation, the need for automation of accounting document processing is steadily increasing, particularly in Ukraine, where a significant portion of primary documentation still exists in paper form or as scanned images. Effective information extraction from such documents requires the application of advanced artificial intelligence techniques, especially deep learning and multimodal data analysis. This paper explores the applicability of the Vision Grid Transformer (VGT) architecture for analyzing the layout structure of Ukrainian accounting documents.
The VGT model combines two information streams: a visual stream (based on the Vision Transformer, ViT) and a text-spatial stream (based on the Grid Transformer, GiT), enabling a comprehensive document representation in terms of both visual layout and semantic content. The model's flexibility is further enhanced by pretraining objectives such as Masked Grid Language Modeling (MGLM) and Segment Language Modeling (SLM), which help capture both local and global contextual dependencies among textual elements.
The study emphasizes the specific challenges of adapting VGT to the Ukrainian language and the accounting domain: the lack of high-quality, publicly available annotated Ukrainian datasets; the need for accurate optical character recognition (OCR) of Cyrillic scripts; difficulties in handwritten text recognition (HTR); and the complexity of domain-specific terminology and abbreviations used in accounting. Two illustrative sketches of the resulting pipeline are given below.
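Because the grid stream consumes OCR output directly, Cyrillic OCR quality is the entry point of any such pipeline. The following minimal sketch shows one plausible way to rasterize Ukrainian OCR output onto the coarse token grid a GiT-style stream expects. It assumes Tesseract with its 'ukr' language pack accessed through pytesseract, and the word-hashing stands in for a real subword tokenizer; every name here is illustrative rather than part of the published VGT pipeline.

import pytesseract
import torch
from PIL import Image

def ocr_to_grid(image_path, grid_hw=(32, 32), vocab_size=30000):
    # Illustrative only: rasterize (token, box) pairs from Tesseract onto a coarse grid.
    img = Image.open(image_path)
    w, h = img.size
    data = pytesseract.image_to_data(img, lang="ukr",
                                     output_type=pytesseract.Output.DICT)
    grid = torch.zeros(grid_hw, dtype=torch.long)  # 0 marks an empty cell
    for word, x, y, bw, bh in zip(data["text"], data["left"], data["top"],
                                  data["width"], data["height"]):
        if not word.strip():
            continue
        # Assign each token to the grid cell containing its bounding-box centre.
        row = min(int((y + bh / 2) / h * grid_hw[0]), grid_hw[0] - 1)
        col = min(int((x + bw / 2) / w * grid_hw[1]), grid_hw[1] - 1)
        grid[row, col] = hash(word) % (vocab_size - 1) + 1  # toy id; use a real tokenizer
    return grid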
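A rough PyTorch sketch of the two-stream design itself is given next. It pairs a ViT-style encoder over image patches with a GiT-style encoder over the token grid and fuses the pooled features for region classification; the module names, dimensions, layer counts, and five-class head are assumptions made for this example, not the published VGT implementation.

import torch
import torch.nn as nn

class TwoStreamLayoutSketch(nn.Module):
    # Illustrative two-stream (visual + text-grid) model, not the official VGT code.
    def __init__(self, vocab_size=30000, dim=256, patch=16, num_classes=5):
        super().__init__()
        # Visual stream: patchify the page image, then a small ViT-style encoder.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        vis_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.visual_encoder = nn.TransformerEncoder(vis_layer, num_layers=2)
        # Text-spatial stream: embeddings of token ids rasterized onto a 2D grid
        # (0 = empty cell), then a GiT-style encoder.
        self.token_embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        grid_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.grid_encoder = nn.TransformerEncoder(grid_layer, num_layers=2)
        # Fused head over the concatenated pooled features, e.g. scoring region
        # types such as text / title / list / table / figure.
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, image, grid_ids):
        # image: (B, 3, H, W); grid_ids: (B, Gh, Gw), one token id per grid cell.
        v = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N_patches, dim)
        v = self.visual_encoder(v).mean(dim=1)                  # pooled visual feature
        g = self.token_embed(grid_ids).flatten(1, 2)            # (B, Gh*Gw, dim)
        g = self.grid_encoder(g).mean(dim=1)                    # pooled text-spatial feature
        return self.head(torch.cat([v, g], dim=-1))             # fused region scores

model = TwoStreamLayoutSketch()
image = torch.randn(1, 3, 512, 512)               # dummy scanned page
grid_ids = torch.randint(0, 30000, (1, 32, 32))   # stands in for ocr_to_grid() output
scores = model(image, grid_ids)                   # shape (1, 5)

Pretraining objectives such as MGLM (masking individual grid tokens) and SLM (masking whole segments) would then be applied to the grid stream before fine-tuning such a model on labelled Ukrainian layouts.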
References
Shehzadi, T., Stricker, D., & Afzal, M. Z. (2024). A hybrid approach for document layout analysis in document images. arXiv:2404.17888. https://doi.org/10.48550/arXiv.2404.17888 (published version: https://doi.org/10.1007/978-3-031-70546-5_2)
Da, C., Luo, C., Zheng, Q., & Yao, C. (2023). Vision Grid Transformer for document layout analysis. arXiv:2308.14978. https://doi.org/10.48550/arXiv.2308.14978 (published version: https://doi.org/10.1109/ICCV51070.2023.01783)
Li, J., Xu, Y., Lv, T., Cui, L., Zhang, C., & Wei, F. (2022). DiT: Self-supervised pre-training for document image transformer. arXiv:2203.02378. https://doi.org/10.48550/arXiv.2203.02378 (published version: https://doi.org/10.1145/3503161.3547911)
Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2019). LayoutLM: Pre-training of text and layout for document image understanding. arXiv:1912.13318. https://doi.org/10.48550/arXiv.1912.13318 (published version: https://doi.org/10.1145/3394486.3403172)
Shen, Z., Zhang, R., Dell, M., Lee, B. C. G., Carlson, J., & Li, W. (2021). LayoutParser: A unified toolkit for deep learning-based document image analysis. arXiv:2103.15348. https://doi.org/10.48550/arXiv.2103.15348 (published version: https://doi.org/10.1007/978-3-030-86549-8_9)
Zhong, X., Tang, J., & Yepes, A. J. (2019). PubLayNet: Largest dataset ever for document layout analysis. arXiv:1908.07836. https://doi.org/10.48550/arXiv.1908.07836 (published version: https://doi.org/10.1109/ICDAR.2019.00166)
Li, M., Xu, Y., Cui, L., Huang, S., Wei, F., Li, Z., & Zhou, M. (2020). DocBank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics (COLING) (pp. 949-960). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.82
Tikhonov, A., & Rabus, A. (2024). Handwritten text recognition of Ukrainian manuscripts in the 21st century: Possibilities, challenges, and the future of the first generic AI-based model. Kyiv-Mohyla Humanities Journal, 11, 226-247. https://doi.org/10.18523/2313-4895.11.2024.226-247
Gruber, I., Picek, L., Hlaváč, M., Neduchal, P., & Hrúz, M. (2024). Improving handwritten Cyrillic OCR by font-based synthetic text generator. In H. Moosaei, M. Hladík, & P. M. Pardalos (Eds.), Dynamics of Information Systems (DIS 2023), Lecture Notes in Computer Science, vol. 14321. Springer. https://doi.org/10.1007/978-3-031-50320-7_8
Weber, M., Siebenschuh, C., Butler, R. M., Alexandrov, A., Thanner, V. R., Tsolakis, G., Jabbar, H., Foster, I., Li, B., Stevens, R., & Zhang, C. (2023). WordScape: A pipeline to extract multilingual, visually rich documents with layout annotations from web crawl data. In Advances in Neural Information Processing Systems, 36 (Datasets and Benchmarks Track). https://doi.org/10.48550/arXiv.2312.10188
Copyright (c) 2025 Максим Коростіль, Ілона Лагун (Authors)

This work is licensed under a Creative Commons Attribution 4.0 International License.