Automated extraction and structuring of menus from PDF files using machine learning and NLP techniques
DOI:
https://doi.org/10.32523/2616-7263-2025-153-4-257-267
Keywords:
PDF document processing, text analysis automation, weakly structured data, restaurant menus, Natural Language Processing (NLP), machine learning, data extraction, semantic analysis, food service digitalization
Abstract
This study explores state-of-the-art approaches to processing PDF documents, with a focus on poorly structured restaurant menus. Successful automated processing typically requires well-structured documents, which often means sacrificing aesthetic design for machine readability. For restaurants, however, the design of a menu matters more than its structure, so menus are harder to process. Being able to process such poorly structured PDF documents should make it much easier to process documents from other business domains. A comparative analysis is conducted of structural features in different types of PDF documents, including legislative acts and academic publications.
The research aims to use machine learning methods to overcome the challenges of automating data extraction, analysis, and structuring.
The solution described in the study was developed to address the problems posed by poorly structured PDF documents.






