Objectives: to develop a "rich" lexical and syntactic resource for linguists, usable in natural language processing (NLP).
The French Treebank is distributed for research purposes, provided you fill in and return the following licence (tex file, doc file). It can also be purchased for commercial purposes; please contact Anne Abeillé.
For the DTD files of the corpus: DTD file in RelaxNG format
Users of the treebank
Hale, John T. – Surprisal and Chunking. In Automaton Theories of Human Sentence Comprehension. – Stanford : Center for the Study of Language and Information, 2014. – p. 91-99
Data are divided into two main directories:
For annotation choices, please read the documentation found in the annotation guides:
We define a complete morphosyntactic tag as follows:
For parts of speech, we made traditional choices, with two exceptions: weak pronouns are given a POS of their own (clitic), following the generative tradition, and foreign words (in quotations) receive a special ET tag. Punctuation marks are divided into strong (clause markers) and weak (all the others). Most typographical signs (including %, numbers and abbreviations) are assigned a traditional POS (usually common noun).
We distinguish 15 lexical categories, used for simple words as well as for compounds:
We have chosen surface and shallow annotations, compatible with various syntactic frameworks.
Our phrasal tagset is as follows:
We chose to only annotate major phrases with little internal structure. For the sake of simplicity, we make parsimonious use of unary phrases. For rigid sequences of categories, such as dates or addresses, it is difficult to determine the head, and we have one global NP with no internal constituents.
We annotate certain phrases with a subcategory, which is important for functional annotation, for example relative or subordinate for embedded clauses.
We do not have discontinuous constituents.
In order to be as theory-neutral as possible, we use neither empty categories nor functional phrases (no DP or CP). We allow for headless phrases (elliptical NPs lacking a head noun, or sentential clauses lacking a verbal nucleus).
Unexpressed subjects (in infinitives or participials) will be marked at the functional level.
For verbal phrases, we only annotate the minimal verbal nucleus (clitics, auxiliaries, negation and verb), because the traditional VP (with complements) is subject to much linguistic debate and is often discontinuous in French.
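The minimal verbal nucleus can be illustrated with a small tree sketch. The `Node` class, the example sentence, and the bracketing shown here are assumptions made for illustration only; they are not the treebank's own file format. The point is that the complement (here a PP) is a sister of the VN rather than part of a VP.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical constituent node: a label plus child nodes or word strings."""
    label: str
    children: list = field(default_factory=list)

    def words(self):
        """Return the words dominated by this node, left to right."""
        out = []
        for child in self.children:
            if isinstance(child, Node):
                out.extend(child.words())
            else:
                out.append(child)
        return out


# Assumed example: "Paul ne l'a pas donné à Marie".
# The VN groups clitic, auxiliary, negation and verb; the PP stays outside it.
sent = Node("SENT", [
    Node("NP", ["Paul"]),
    Node("VN", ["ne", "l'", "a", "pas", "donné"]),
    Node("PP", ["à", Node("NP", ["Marie"])]),
])

vn = next(c for c in sent.children if isinstance(c, Node) and c.label == "VN")
top_level_labels = [c.label for c in sent.children]

print(vn.words())
print(top_level_labels)
```

Keeping the complements as sisters of the VN avoids having to decide where a VP would begin and end when its parts are not adjacent.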
For coordination, we only mark a coordinating phrase after a coordinating conjunction. We do not necessarily embed conjuncts inside a phrase, since there are cases where there is none, and there are cases where the category of the phrase would be unspecified.
We have chosen to annotate the grammatical functions of major constituents that depend on a verb (or VN).
Our functional tagset is as follows:
No more than one function can be tagged on a constituent, except for verbal nuclei, which bear all the functions of their pronominal clitics.
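This constraint can be expressed as a simple check. The sketch below is only an illustration: the helper name `functions_allowed` is hypothetical, and the function names SUJ and OBJ are assumed placeholders rather than tags confirmed by this page.

```python
def functions_allowed(label, functions):
    """Check the 'at most one function per constituent' rule.

    A verbal nucleus (VN) is exempt, because it carries every function
    borne by its pronominal clitics.
    """
    if label == "VN":
        return True
    return len(functions) <= 1


# SUJ and OBJ are assumed placeholder names for this sketch.
np_single = functions_allowed("NP", ["SUJ"])         # ordinary constituent, one function
vn_double = functions_allowed("VN", ["SUJ", "OBJ"])  # e.g. a VN with two clitics, "il le voit"
np_double = functions_allowed("NP", ["SUJ", "OBJ"])  # two functions on a non-VN constituent

print(np_single, vn_double, np_double)
```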
Only surface functions are encoded: we code the subject of the passive as a subject, and the postverbal NP in an impersonal construction (Il est venu 3 hommes) as an object.
We do not code the fact that a subject or an object of a given verb can also be the subject of an embedded Vinf, for example.
Parentheticals usually have the function MOD.
Embedded phrases such as Srel do not have a function, except if they are extraposed or clefted (with the function MOD). COORD phrases do not have a function, except in the case of multiple coordinations, where each COORD has the same function (Ni Paul ni Marie ne viendra).
Within the same clause, several constituents can have the same function.
We do not code the link between a dependent and its head, so long-distance dependencies are not taken into account.
Tagged corpus (simplified)
Parsed corpus (simplified, without morphosyntactic annotations)
For more on the French Treebank, see Abeillé, A., L. Clément, and F. Toussenel. 2003. "Building a treebank for French", in A. Abeillé (ed) Treebanks, Kluwer, Dordrecht.