Annotation methodology in a supracorpora database of connectives

The project aims to substantially extend a supracorpora database (DB) of connectives. The DB was built as a result of the joint Swiss-Russian project IZLRZ1_164059/1 («Corpus-based contrastive study of connectors in Russian»), co-financed by the FNS and the RFBR. It stores parallel texts from the French and Italian subcorpora of the Russian National Corpus, totalling about 9.2 million tokens, which makes it a unique information resource for science, education and the stability analysis of deep learning technologies. In terms of volume and functionality, the DB has no counterparts worldwide. Within several research projects, it has yielded new fundamental knowledge about logical-semantic relations and the means of expressing them in French, Italian and Russian. At the University of Geneva and Moscow State University, the DB is used in teaching and in the preparation of theses. At the RAS, it is used to determine how stable or unstable neural machine translation systems are. A representative fragment of the DB is available at http://a179.frccsc.ru/RFH41002/main.aspx.

To date, the DB contains about 15,000 bilingual annotations (equivalent to 30,000 monolingual ones) of contexts in which connectives occur. The contexts are labeled with tags relevant to the study of connectives. By comparison, the largest annotated corpus of discourse relations, the Penn Discourse Treebank (PDTB), amounts to only slightly more than 1 million tokens, almost 10 times fewer than the DB. Moreover, the PDTB, which is built on Wall Street Journal texts, is monolingual, and it contains 24,240 monolingual annotations with an explicit indicator of the logical-semantic relation.

To broaden the use of the DB in science, education and the implementation of new technologies, the project plans to develop a methodology for marking up and annotating the boundaries of text fragments connected by a logical-semantic relation, whether or not that relation is expressed by a connective. At present, the DB marks up only the syntactic structure of these fragments and the connective itself. Implementing such a markup scheme requires, on the one hand, devising theoretical linguistic criteria and annotation methods and, on the other, developing software that applies these methods while giving geographically distributed users simultaneous access to the DB over the Internet.

As for the linguistic criteria, it is well known that connectives can link clauses, independent sentences, or sequences of independent sentences. Parallel texts tend to be sentence-aligned, so part of the material relevant for analysing a connective may end up in a different pair of sentences; a pair-merging function has already been developed in the DB to solve this problem. There is also the converse problem: not all of the context preceding or following the connective is necessarily relevant for its analysis, and the irrelevant part should therefore be excluded. Tackling this problem requires the proposed methodology for marking up and annotating boundaries, and the methodology itself needs linguistic substantiation.
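
The two operations described above, merging adjacent aligned pairs and then delimiting only the relevant fragments, can be illustrated with a minimal Python sketch. It does not reproduce the DB's actual software: the names SentencePair, Span, merge_pairs and find_span are hypothetical, the example sentences are invented, and character offsets are only one assumed way of recording boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One aligned pair (hypothetical structure, not the DB schema)."""
    source: str  # sentence(s) in the source language
    target: str  # aligned translation


@dataclass(frozen=True)
class Span:
    """Character offsets into a text: start inclusive, end exclusive."""
    start: int
    end: int

    def text(self, s: str) -> str:
        return s[self.start:self.end]


def merge_pairs(pairs, first, last):
    """Merge aligned pairs first..last (inclusive) into a single pair."""
    if not 0 <= first <= last < len(pairs):
        raise IndexError("pair range out of bounds")
    chunk = pairs[first:last + 1]
    return SentencePair(
        source=" ".join(p.source for p in chunk),
        target=" ".join(p.target for p in chunk),
    )


def find_span(text, fragment):
    """Locate the first occurrence of a fragment and return its boundaries."""
    start = text.find(fragment)
    if start < 0:
        raise ValueError(f"fragment not found: {fragment!r}")
    return Span(start, start + len(fragment))


# Invented example: the connective sits in the second pair,
# while the first fragment it links lies in the first pair.
pairs = [
    SentencePair("It was raining all morning.",
                 "Il pleuvait toute la matinée."),
    SentencePair("Therefore we stayed at home. The children were bored.",
                 "Nous sommes donc restés à la maison. Les enfants s'ennuyaient."),
]

merged = merge_pairs(pairs, 0, 1)

# Boundary annotation on the merged source text: only the two linked
# fragments and the connective are marked; the trailing sentence is left out.
connective = find_span(merged.source, "Therefore")
arg1 = find_span(merged.source, "It was raining all morning.")
arg2 = find_span(merged.source, "we stayed at home.")
```

In this sketch the sentence "The children were bored." stays in the merged context but falls outside every annotated span, which is the kind of exclusion the boundary markup is meant to make explicit.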

This methodology is even more necessary for studying hierarchical text structures built on an enumeration that either opens or closes with a textual fragment organizing the enumeration into a single semantic block. In some cases the enumeration is marked by explicit indicators, whereas in others no such indicator is present.

Developing the methodology and the new DB software will thus serve as a preparatory stage for a project focused on marking up, annotating and studying hierarchical text structures, a problem that has so far received little attention from a contrastive perspective. A joint Swiss-Russian team will be able to carry it out.

Participants:

University of Geneva:

Institute of Informatics Problems of the Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences: