The error correction method in the problem of automatic authorship identification of literary text
Yurii Orlov, Voronina Maria
15m
The paper describes an algorithm for correcting possible errors in identifying
the authors of literary texts. Identification is based on finding the nearest
pattern, where each pattern corresponds to a given class of texts. The pattern in this
case is the empirical frequency distribution of letter combinations based on the
analysis of works reliably attributed to the author. The proximity between texts
is measured as the distance between their bigram frequency distributions in the
L1 norm. An unknown text is attributed to the author whose pattern is closest
to it. The analyzed corpus contained 1783 texts by 100 authors, and the
recognition error was 0.12. It is important
that after the exclusion of incorrectly recognized texts, a library of 88 authors
and 1450 texts remained, each of which was identified correctly. The problem
under study is estimating the probability that the library patterns do not
include a pattern for the author of the tested text. If the correct pattern is
excluded from consideration, the second-closest pattern is assigned instead, but
this assignment turns out to be unstable: the fragments are already attributed
ambiguously when the text is cut into 4 fragments. Thus, the stability of author
identification across text fragments can be proposed as an independent criterion
for the correctness of the method, one that makes it possible to single out
texts written in atypical styles.
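
The procedure described above can be illustrated with a minimal Python sketch.
It is not the authors' implementation, and it rests on several assumptions that
the abstract does not state: bigrams are taken as pairs of consecutive letters
after all non-letter characters are dropped, an author's pattern is built by
pooling bigram counts over that author's texts, fragments are cut to equal
character length, and an identification is called stable when all fragments are
attributed to the same author. All function names are hypothetical.

```python
from collections import Counter


def bigram_counts(text):
    # Assumption: keep letters only, lowercase, and count adjacent letter pairs.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    return Counter(a + b for a, b in zip(letters, letters[1:]))


def normalize(counts):
    # Turn raw counts into an empirical frequency distribution.
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def build_pattern(texts):
    # Assumption: the author's pattern pools bigram counts over all known texts.
    pooled = Counter()
    for t in texts:
        pooled.update(bigram_counts(t))
    return normalize(pooled)


def l1_distance(p, q):
    # L1 norm of the difference between two bigram frequency distributions.
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def nearest_author(text, library):
    # Attribute the text to the author whose pattern is closest in the L1 norm.
    freqs = normalize(bigram_counts(text))
    return min(library, key=lambda author: l1_distance(freqs, library[author]))


def fragment_votes(text, library, n_fragments=4):
    # Cut the text into n_fragments pieces and identify each piece separately.
    size = len(text) // n_fragments
    pieces = [text[i * size:(i + 1) * size] for i in range(n_fragments - 1)]
    pieces.append(text[(n_fragments - 1) * size:])
    return [nearest_author(piece, library) for piece in pieces]


def is_stable(text, library, n_fragments=4):
    # Assumption: "stable" means every fragment gets the same author.
    return len(set(fragment_votes(text, library, n_fragments))) == 1
```

With a library given as a dictionary mapping author names to patterns, a text
for which `is_stable` returns False at 4 fragments would be a candidate for an
atypical style or for an author missing from the library, in the spirit of the
criterion proposed in the abstract.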