Marion Weller, Anita Gojun, Ulrich Heid, Béatrice Daille, Emmanuel Morin
Universität Stuttgart / Université de Nantes
Compiling terminological data using comparable corpora: from term extraction to dictionaries
For scientific domains, terminological resources such as dictionaries are often unavailable or out of date. Additionally, term variation (Daille 2005) is often not documented. As a result, translators working in technical domains usually spend considerable time building terminological resources.
The TTC project aims to compile domain-specific lexical resources for integration into CAT tools and SMT systems. Since parallel data is often unavailable, comparable corpora are used instead: they are available for a wide range of domains in many languages.
The TTC tool suite consists of the following steps:
corpus collection using a focused crawler (de Groc 2011),
pattern-based extraction of terminologically relevant noun phrases from tagged and lemmatized text (Schmid 1994),
identification of term variants: (DE) Korrosionsschutz ↔ Schutz gegen Korrosion (corrosion protection ↔ protection against corrosion),
term alignment: for a given source language term, equivalents in the target language are searched for and aligned. The term lists of both the source and the target language, together with a general language dictionary, serve as input to this step.
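The pattern-based extraction step can be illustrated with a minimal sketch. It assumes the text has already been tagged and lemmatized into (lemma, part-of-speech) pairs; the simplified tag names and the three patterns in `PATTERNS` are hypothetical and not the TTC pattern inventory. Any filtering for terminological relevance is left out.

```python
# Hypothetical part-of-speech patterns for term candidates (not the TTC inventory).
PATTERNS = [
    ("NOUN", "PREP", "NOUN"),
    ("ADJ", "NOUN"),
    ("NOUN",),
]


def extract_terms(tagged):
    """Return candidate terms from a list of (lemma, pos) pairs.

    Every position is tried against every pattern; a match yields the
    space-joined lemmas of the matching token span.
    """
    terms = []
    for i in range(len(tagged)):
        for pattern in PATTERNS:
            window = tagged[i:i + len(pattern)]
            if tuple(pos for _, pos in window) == pattern:
                terms.append(" ".join(lemma for lemma, _ in window))
    return terms


tagged_phrase = [("Schutz", "NOUN"), ("gegen", "PREP"), ("Korrosion", "NOUN")]
candidates = extract_terms(tagged_phrase)
```

For the tagged phrase above, the sketch yields the complex term "Schutz gegen Korrosion" along with the single nouns "Schutz" and "Korrosion".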
In our poster presentation, we focus on term alignment, presenting two approaches: (1) lexical strategies and (2) the use of context vectors.
Terms do not necessarily have an equivalent of the same syntactic structure in other languages; this holds particularly for German compounds. By applying term variation patterns, compounds can be reformulated, resulting in term variants of different syntactic structures (Morin & Daille 2009). This makes it possible to look up the components of a compound individually in the dictionary and to identify matching target language terms: Stromspeicherung → Speicherung von Strom → storage of power / storage of electricity.
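The lexical strategy can be sketched as follows, using the Stromspeicherung example. The compound split table, the toy dictionary, the target term list and the single variation pattern (N1+N2 → "N2 of N1") are all assumptions made for illustration; an actual system would need a compound splitter and a larger pattern inventory.

```python
from itertools import product

# Hypothetical toy resources, for illustration only.
COMPOUND_SPLITS = {"Stromspeicherung": ("Strom", "Speicherung")}
DICTIONARY = {
    "Strom": ["power", "electricity"],
    "Speicherung": ["storage"],
}
TARGET_TERMS = {"storage of power", "storage of electricity", "power plant"}


def translate_compound(compound):
    """Return target language terms that match the reformulated compound."""
    if compound not in COMPOUND_SPLITS:
        return []
    modifier, head = COMPOUND_SPLITS[compound]
    # Variation pattern: modifier+head -> head "von" modifier,
    # rendered in English as "head of modifier".
    candidates = [
        f"{h} of {m}"
        for h, m in product(DICTIONARY.get(head, []), DICTIONARY.get(modifier, []))
    ]
    # Keep only candidates attested in the extracted target language term list.
    return [c for c in candidates if c in TARGET_TERMS]
```

Checking the generated candidates against the target term list keeps only translations that were actually extracted from the target language corpus.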
Terms and their translations tend to appear in comparable lexical contexts. For each source language term, a context vector is computed and translated into the target language. The translated vector is then compared with target language context vectors (using the cosine measure): terms with similar context vectors are likely to be translations. Since both approaches depend on the coverage of the dictionary, we consider the lexical strategies as an input for the context vector method.
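The context vector method can be sketched in a few functions. This is a minimal illustration under stated assumptions: raw co-occurrence counts within a fixed token window stand in for whatever association measure is actually used, and all function names and the window size are hypothetical.

```python
import math
from collections import Counter


def context_vector(term, sentences, window=2):
    """Count the words occurring within `window` tokens of `term`."""
    vec = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return vec


def translate_vector(vec, dictionary):
    """Map each context word to its dictionary translations; unknown words are dropped."""
    out = Counter()
    for word, count in vec.items():
        for translation in dictionary.get(word, []):
            out[translation] += count
    return out


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(source_vec, target_vecs, dictionary):
    """Rank target terms by similarity to the translated source vector."""
    translated = translate_vector(source_vec, dictionary)
    scores = {term: cosine(translated, vec) for term, vec in target_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Context words missing from the dictionary are dropped during translation, which is why the quality of the ranking depends on dictionary coverage.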
The TTC project is presented by Marion Weller (IMS) in the poster session of the Computational Linguistics section at the annual meeting of the DGfS on March 8, 2012. The primary purpose of the section is to foster scientific exchange between theoretical linguistics and computational linguistics.
TTC Poster: Marion Weller, Anita Gojun, Ulrich Heid (IMS), Béatrice Daille, and Emmanuel Morin (UN). Compiling terminological data using comparable corpora: from term extraction to dictionaries (PDF).