The error correction method in the problem of automatic authorship identification of literary text

Yurii Orlov, Voronina Maria
The paper describes an algorithm for correcting possible errors in identifying the authors of literary texts. Identification uses the nearest-pattern method, where each pattern corresponds to a given class of texts. Here the pattern is the empirical frequency distribution of letter combinations (bigrams), computed from works reliably known to be by the author. The distance between texts is measured as the L1 norm of the difference between their bigram frequencies. An unknown text is attributed to the author whose pattern is closest to it. The analyzed corpus contained 1783 texts by 100 authors, and the recognition error was 0.12. Importantly, after the incorrectly recognized texts were excluded, a library of 88 authors and 1450 texts remained, each of which was identified correctly. The problem studied is estimating the probability that the library contains no pattern for the author of the tested text. If the correct pattern is excluded from consideration, the second-closest pattern is assigned in its place, but this assignment turns out to be unstable: the identification of the author of the fragments already becomes ambiguous when the text is cut into 4 fragments. Thus, the stability of author identification across text fragments can be proposed as an independent criterion of the method's correctness, one that makes it possible to single out texts written in atypical styles.
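The procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: all function names are hypothetical, and several details are assumptions, namely that only alphabetic characters are kept (so bigrams may span word boundaries), that an author's pattern pools bigram counts over all of that author's known texts, and that a text counts as "stable" when every equal-length fragment is attributed to the same author as the whole text.

```python
from collections import Counter


def bigram_counts(text):
    """Count adjacent letter pairs; non-letters are dropped (assumption)."""
    letters = [c for c in text.lower() if c.isalpha()]
    return Counter(zip(letters, letters[1:]))


def normalize(counts):
    """Turn raw counts into an empirical frequency distribution."""
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def l1_distance(p, q):
    """L1 norm of the difference between two bigram distributions."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def build_pattern(texts):
    """Author pattern: bigram frequencies pooled over known texts (assumption)."""
    pooled = Counter()
    for t in texts:
        pooled.update(bigram_counts(t))
    return normalize(pooled)


def identify(text, library):
    """Return the author whose pattern is closest to the text in L1."""
    freq = normalize(bigram_counts(text))
    return min(library, key=lambda author: l1_distance(freq, library[author]))


def fragment_votes(text, library, parts=4):
    """Cut the text into `parts` consecutive fragments and identify each."""
    size = len(text) // parts
    fragments = [text[i * size:(i + 1) * size] for i in range(parts - 1)]
    fragments.append(text[(parts - 1) * size:])  # last fragment takes the remainder
    return [identify(f, library) for f in fragments]


def is_stable(text, library, parts=4):
    """True if every fragment gets the same author as the whole text."""
    whole = identify(text, library)
    return all(a == whole for a in fragment_votes(text, library, parts))


# Toy library with two hypothetical authors.
library = {
    "A": build_pattern(["abababab"]),
    "B": build_pattern(["xyxyxyxy"]),
}
```

Under these assumptions, a text whose fragments disagree with the whole-text attribution would be flagged as a candidate for an atypical style or for an author missing from the library; the value `parts=4` mirrors the fragment count mentioned in the abstract.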