Corpus arboré pour le français / French Treebank

Objectives: develop a "rich" lexical and syntactic resource for linguists that is also usable in natural language processing (NLP).

Characteristics:

  • Project initiated in 1997, with the support of the IUF, the CNRS and the CNRTL
  • 1 million words from the newspaper Le Monde (1989-1995)
  • Development of an online query interface

The French Treebank is distributed for research purposes, provided you fill in and return the licence (tex file, doc file); it can also be purchased for commercial purposes: please contact Anne Abeillé.

For the DTD files of the corpus: DTD file in RelaxNG format
Users of the treebank

Works based on the Treebank

Hale, John T. – Surprisal and Chunking. In Automaton Theories of Human Sentence Comprehension. – Stanford : Center for the Study of Language and Information, 2014. – p. 91-99

Overall organization of the Treebank

Data are divided into two main directories:

  • functions directory: grammatical functions + constituent + morphosyntactic annotations
  • constit directory: constituent + morphosyntactic annotations

For annotation choices, please read the documentation found in the annotation guides:

  • Guide des mots simples et composés PDF
  • Guide des annotations en constituants PDF
  • Guide des fonctions PDF

Morphosyntactic annotation

We define a complete morphosyntactic tag as follows:

  • Part of speech (POS)
  • Subcategorization
  • Inflection
  • Lemma (canonical form)
  • Parts (with similar morphosyntactic tags) for compounds.
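
As a rough illustration of how the five components of a tag fit together, here is a minimal Python sketch. The class name MorphTag, its field names, and the inflection code "fs" (feminine singular) are assumptions made for this example only; they are not the treebank's actual encoding.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MorphTag:
    """Hypothetical container for one complete morphosyntactic tag."""
    pos: str                          # part of speech, e.g. "NC"
    subcat: Optional[str] = None      # subcategorization, if any
    inflection: Optional[str] = None  # inflection code (format assumed here)
    lemma: Optional[str] = None       # canonical form
    # Parts are only filled in for compounds; each part is itself a tag.
    parts: List["MorphTag"] = field(default_factory=list)

    def is_compound(self) -> bool:
        """A tag describes a compound when it has at least one part."""
        return bool(self.parts)


# Illustrative compound: "pomme de terre" (common noun made of NC + P + NC).
pomme_de_terre = MorphTag(
    pos="NC",
    inflection="fs",
    lemma="pomme de terre",
    parts=[
        MorphTag(pos="NC", inflection="fs", lemma="pomme"),
        MorphTag(pos="P", lemma="de"),
        MorphTag(pos="NC", inflection="fs", lemma="terre"),
    ],
)
```

The recursive parts field mirrors the statement that the parts of a compound carry similar morphosyntactic tags.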

For part of speech, we made traditional choices, except for weak pronouns, which were given a POS of their own (clitic) following the generative tradition, and foreign words (in quotations), which receive a special ET tag. Punctuation marks are divided into strong (clause markers) and weak (all others). Most typographical signs (including %, numbers and abbreviations) are assigned a traditional POS (usually common noun).

We distinguish 15 lexical categories, used for simple words as well as for compounds:

  • A (adjective)
  • Adv (adverb)
  • CC (coordinating conjunction)
  • Cl (weak clitic pronoun)
  • CS (subordinating conjunction)
  • D (determiner)
  • ET (foreign word)
  • I (interjection)
  • NC (common noun)
  • NP (proper noun)
  • P (preposition)
  • PREF (prefix)
  • PRO (strong pronoun)
  • V (verb)
  • PONCT (punctuation mark)
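
For readers who want to check tags programmatically, the list above can be written down as a small lookup table. This is only a sketch: the names LEXICAL_CATEGORIES and describe_pos are hypothetical and are not part of any tool distributed with the treebank.

```python
# The 15 lexical categories listed above, keyed by tag.
LEXICAL_CATEGORIES = {
    "A": "adjective",
    "Adv": "adverb",
    "CC": "coordinating conjunction",
    "Cl": "weak clitic pronoun",
    "CS": "subordinating conjunction",
    "D": "determiner",
    "ET": "foreign word",
    "I": "interjection",
    "NC": "common noun",
    "NP": "proper noun",
    "P": "preposition",
    "PREF": "prefix",
    "PRO": "strong pronoun",
    "V": "verb",
    "PONCT": "punctuation mark",
}


def describe_pos(tag: str) -> str:
    """Return the gloss for a lexical category, or raise on an unknown tag."""
    if tag not in LEXICAL_CATEGORIES:
        raise ValueError(f"unknown lexical category: {tag!r}")
    return LEXICAL_CATEGORIES[tag]
```

The lookup is case-sensitive, so "Adv" is accepted while "ADV" is rejected.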

Constituent annotation

We have chosen surface and shallow annotations, compatible with various syntactic frameworks.

Our phrasal tagset is as follows:

  • AP (adjectival phrases)
  • AdP (adverbial phrases)
  • COORD (coordinated phrases)
  • NP (noun phrases)
  • PP (prepositional phrases)
  • VN (verbal nucleus)
  • VPinf (infinitive clauses)
  • VPpart (nonfinite clauses)
  • SENT (sentences)
  • Sint, Srel, Ssub (finite clauses)

We chose to only annotate major phrases with little internal structure. For the sake of simplicity, we make parsimonious use of unary phrases. For rigid sequences of categories, such as dates or addresses, it is difficult to determine the head, and we have one global NP with no internal constituents.

We annotate certain phrases with a subcategory, which is important for functional annotation, for example relative or subordinate for embedded clauses.

We do not have discontinuous constituents.

In order to be as theory-neutral as possible, we use neither empty categories nor functional phrases (no DP or CP). We allow for headless phrases (elliptical NPs lacking a head noun, or sentential clauses lacking a verbal nucleus).

Unexpressed subjects (in infinitival or participial clauses) are marked at the functional level.

For verbal phrases, we only annotate the minimal verbal nucleus (clitics, auxiliaries, negation and verb), because the traditional VP (with complements) is subject to much linguistic debate and is often discontinuous in French.

For coordination, we only mark a coordinating phrase after a coordinating conjunction. We do not necessarily embed conjuncts inside a phrase, since there are cases where there is none, and there are cases where the category of the phrase would be unspecified.
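
The shallow constituent structure can be pictured as an ordered tree whose internal nodes carry labels from the phrasal tagset. The sketch below is an illustration under stated assumptions: the Phrase class, its methods, and the example analysis of "Paul ne l'a pas vu" are hypothetical and do not reproduce the treebank's file format.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Phrasal tagset as listed in this section.
PHRASE_LABELS = {
    "AP", "AdP", "COORD", "NP", "PP", "VN",
    "VPinf", "VPpart", "SENT", "Sint", "Srel", "Ssub",
}


@dataclass
class Phrase:
    """A constituent: a label plus an ordered list of sub-phrases and words."""
    label: str
    children: List[Union["Phrase", str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.label not in PHRASE_LABELS:
            raise ValueError(f"unknown phrase label: {self.label!r}")

    def words(self) -> List[str]:
        """Left-to-right yield of the constituent."""
        out: List[str] = []
        for child in self.children:
            out.extend(child.words() if isinstance(child, Phrase) else [child])
        return out

    def bracketed(self) -> str:
        """Render the constituent in bracketed notation."""
        inner = " ".join(
            c.bracketed() if isinstance(c, Phrase) else c for c in self.children
        )
        return f"({self.label} {inner})"


# The VN groups negation, clitic, auxiliary and verb; there is no VP node.
sent = Phrase("SENT", [
    Phrase("NP", ["Paul"]),
    Phrase("VN", ["ne", "l'", "a", "pas", "vu"]),
])
print(sent.bracketed())  # (SENT (NP Paul) (VN ne l' a pas vu))
```

Because every word sits under exactly one parent and children are ordered, each constituent covers a contiguous stretch of the sentence, which matches the absence of discontinuous constituents.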

Function annotation

We have chosen to annotate the grammatical functions of major constituents that depend on a verb (or VN).

Our functional tagset is as follows:

  • SUJ (subject)
  • OBJ (direct object)
  • ATS (predicative complement of a subject)
  • ATO (predicative complement of a direct object)
  • MOD (modifier or adjunct)
  • A-OBJ (indirect complement introduced by à)
  • DE-OBJ (indirect complement introduced by de)
  • P-OBJ (indirect complement introduced by another preposition)

No more than one function can be tagged on a constituent, except for verbal nuclei, which bear all the functions of their pronominal clitics.

Only surface functions are encoded: we code the subject of a passive as a subject, and the postverbal NP in an impersonal construction (Il est venu 3 hommes) as an object.

We do not code the fact that a subject or an object of a given verb can also be the subject of an embedded Vinf, for example.

Parentheticals usually have the function MOD.

Embedded phrases such as Srel do not have a function, except if they are extraposed or clefted (in which case they have the function MOD). COORD phrases do not have a function, except in the case of multiple coordinations, where each COORD has the same function (Ni Paul ni Marie ne viendra).

In the same clause, several constituents can have the same function.
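
The one-function-per-constituent rule, with its exception for the verbal nucleus, can be expressed as a small check. This is a sketch only: the names FUNCTIONS and check_functions are hypothetical, and the example with "Il le lui donne" is an illustration rather than a sentence taken from the corpus.

```python
from typing import List

# Functional tagset as listed above.
FUNCTIONS = {
    "SUJ": "subject",
    "OBJ": "direct object",
    "ATS": "predicative complement of a subject",
    "ATO": "predicative complement of a direct object",
    "MOD": "modifier or adjunct",
    "A-OBJ": "indirect complement introduced by à",
    "DE-OBJ": "indirect complement introduced by de",
    "P-OBJ": "indirect complement introduced by another preposition",
}


def check_functions(phrase_label: str, functions: List[str]) -> None:
    """Raise ValueError if the function tags break the rule stated above."""
    unknown = [f for f in functions if f not in FUNCTIONS]
    if unknown:
        raise ValueError(f"unknown function tag(s): {unknown}")
    # Only a verbal nucleus may carry several functions (those of its clitics).
    if len(functions) > 1 and phrase_label != "VN":
        raise ValueError(
            f"{phrase_label} carries {len(functions)} functions; only VN may"
        )


# "Il le lui donne": the VN hosts three clitics, hence three functions.
check_functions("VN", ["SUJ", "OBJ", "A-OBJ"])
# An ordinary constituent carries at most one function.
check_functions("NP", ["SUJ"])
```

A call such as check_functions("NP", ["SUJ", "OBJ"]) raises ValueError under this sketch.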

We do not code the link between the dependent and the head, so long distance dependencies are not taken into account.

Examples

Tagged corpus (simplified)

Parsed corpus (simplified, without morphosyntactic annotations)

For more on the French Treebank, see Abeillé, A., L. Clément, and F. Toussenel. 2003. "Building a treebank for French", in A. Abeillé (ed) Treebanks, Kluwer, Dordrecht.