This paper investigates the use of machine learning to generate “bags of words” from text corpora in accounting applications. We use Word2Vec to generate word lists (dictionaries) that are “similar” to a seed word capturing a concept. As part of our analysis, we perform several experiments using text from Form 10-Ks and earnings calls. We investigate several activities, including the choice of seed word(s), the choice of word sources (corpora), and the analysis of the resulting word lists, among other concerns. We also examine the notion of “human-in-the-loop” and the roles that a person would need to perform while generating a dictionary. Further, we investigate the impact of using accounting and financial corpora, in contrast to Wikipedia, on the different semantic and syntactic relationships. We then extend the analysis to compare those findings with ChatGPT, another source of words, and investigate some of the advantages and disadvantages of that approach.

Data Availability: Data are available from the authors.

JEL Classifications: M4.
