Automated extraction and structuring of menus from PDF files using machine learning and NLP techniques

A.S. Mashkanov; Zh. Akhayeva; A. Zakirova

doi:10.32523/2616-7263-2025-153-4-257-267

Authors

A.S. Mashkanov, L.N. Gumilyov Eurasian National University, https://orcid.org/0009-0001-9992-5542
Zh. Akhayeva, L.N. Gumilyov Eurasian National University, https://orcid.org/0000-0003-4905-2111
A. Zakirova, L.N. Gumilyov Eurasian National University, https://orcid.org/0000-0001-8772-1414

DOI:

https://doi.org/10.32523/2616-7263-2025-153-4-257-267

Keywords:

PDF document processing, text analysis automation, weakly structured data, restaurant menus, Natural Language Processing (NLP), machine learning, data extraction, semantic analysis, food service digitalization

Abstract

This study explores state-of-the-art approaches for processing PDF documents, with a focus on analyzing poorly structured restaurant menus. Successful automated processing typically requires well-structured documents, meaning that aesthetic design must often be sacrificed for machine readability. For restaurants, however, the design of a menu is valued more than its structure, which makes menus harder to process. Once poorly structured PDF documents can be processed successfully, processing documents from other business domains should become much easier. A comparative analysis is conducted of structural features in different types of PDF documents, including legislative acts and academic publications.

The research aims to use machine learning methods to overcome the challenges of automating data extraction, analysis, and structuring. The solution described in the study was developed to address the problems posed by poorly structured PDF documents.
