ABSTRACT
This research addresses the challenge of extracting financial data from unstructured sources, a persistent issue for accounting researchers, investors, and regulators. Leveraging large language models (LLMs), this study introduces a novel framework for automated financial data extraction from files in Portable Document Format (PDF). Following a design science methodology, the framework is developed by combining text mining and prompt engineering techniques. It is then applied to governmental annual reports and corporate environmental, social, and governance reports, both of which are published as PDF files. Test results indicate that, when extracting key financial indicators, the framework achieves an average accuracy rate of 99.5 percent in a notably short time span. A subsequent large out-of-sample test reveals an overall accuracy rate converging around 96 percent. This study contributes to the evolving literature on applying LLMs in accounting and offers a valuable tool for both academic and industrial applications.
Data Availability: Data are available upon request.
JEL Classifications: M41; O31; C81.