Apertium 2 cent tip: how to add analysis and generation of unknown words, and *why you shouldn't*
Jimmy O'Regan [joregan at gmail.com]
Thu, 1 Jan 2009 15:27:30 +0000
In my article about Apertium, I promised to follow it up with another article of a more 'HOWTO' nature. And I've been writing it. And constantly rewriting it, every time somebody asks how to do something that I think is moronic, to explain why they shouldn't do that... and I need to accept that people will always want to do stupid things, and I should just write a HOWTO.
Anyway... recently, someone asked how to implement generation of unknown words. There are only two reasons I can think of, why someone would want this: either they have words in the bilingual dictionary that they don't have in the monolingual dictionary, or they want to use it in conjunction with morphological guessing.
In general, the usual method used in Apertium's translators is, if we don't know the word, we don't try to translate it -- we're honest about it, essentially. Apertium has an option to mark unknown words, which we generally recommend that people use. It doesn't cover 'hidden' unknown words, where the same word can be two different parts of speech and we only know one of them -- we're looking into how to attempt that. One result of this is that, before a release, we specifically remove some words from the monolingual dictionary if we can't add a translation.
Anyway, in the first case, we generally write scripts to automate adding those words to the bidix. One plus of this is that it can be manually checked afterwards, and fixed. Another is that, by adding the word to the monolingual dictionary, we can also analyse it: we generally try to make bidirectional translators, but sometimes we can only make a single direction translator -- but we still have the option of adding the other direction later. And, as our translators are open source, it increases the amount of freely available linguistic data to do so, so it's a win all round.
The latter case, of also using a morphological guesser, is one source of some of the worst translations out there. For example, at the moment, I'm translating a short story by Adam Mickiewicz, which contains the phrase 'tu i owdzie', which is either a misspelling of 'tu i ówdzie' ('here and there'), an old form, or a typesetting error, but in any case, the word 'owdzie' does not exist in the modern Polish language.
Translatica, the leading Polish-English translator, gave: "here and he is owdzying"
Now, if I knew nothing of Polish, that would send me scrambling to the English dictionary, to search for the non-existent verb 'to owdzy'.
(Google gave: "here said". SMT is a great idea, in theory, but in practice has the potential to give translations that bear no resemblance to the original meaning of the source text. Google's own method of 'augmenting' SMT by extracting correlating phrase pairs based on a pivot language also leads to extra ambiguities)
Anyway. The tip, for anyone who still wants to try it:
Apertium's dictionaries can have a limited subset of regular expressions; these can be used by someone who wishes to have both analysis and generation of unknown words. The <re> tag can be placed before the <par> tag, so the entry:
[ ... ]
2-cent Tip - Stringizing a C statement
Oscar Laycock [oscar_laycock at yahoo.co.uk]
Mon, 5 Jan 2009 14:06:12 +0000 (GMT)
I recently discovered you could "stringize" a whole C++ or C statement with the pre-processor. For example:
#define TRACE(s) cerr << #s << endl; s
or:
#define TRACE(s) printf("%s\n", #s); s
....
TRACE(*p = '\0');
p--;
(I found this in "Thinking in C++, 2nd ed. Volume 1" by Bruce Eckel, available for free at http://www.mindview.net. By the way, it seems a good introduction to C++ for C programmers with lots of useful exercises. There is also a free, but slightly old, version of the official Qt book (the C++ framework used in KDE), at http://www.qtrac.eu/C++-GUI-Programming-with-Qt-4-1st-ed.zip. It is a bit difficult for a C++ beginner, and somewhat incomplete without the accompanying CD, but rewarding none the less.)
Bruce Eckel adds: "of course this kind of thing can cause problems, especially in one-line for loops:
for(int i = 0; i < 100; i++) TRACE(f(i));
Because there are actually two statements in the TRACE( ) macro, the one-line for loop executes only the first one. The solution is to replace the semicolon with a comma in the macro."
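A minimal sketch of the comma version described above, assuming C++ and iostream. The surrounding parentheses are my own addition for safety, and `calls`, `f` and `run_loop` are hypothetical names used only for illustration. Because the trace output and the traced expression are joined by the comma operator, the macro expands to a single statement, so it behaves as expected as the body of a one-line for loop:

```cpp
#include <iostream>

// Comma version: the trace output and the traced expression form one
// expression, so TRACE(...) followed by ';' is a single statement.
// (The outer parentheses are an addition, not part of the quoted macro.)
#define TRACE(s) (std::cerr << #s << std::endl, s)

static int calls = 0;        // hypothetical counter, for illustration only
void f(int) { ++calls; }     // hypothetical stand-in for the f(i) in the loop

// Runs the one-line loop and reports how many times f() was called.
int run_loop() {
    calls = 0;
    for (int i = 0; i < 3; i++) TRACE(f(i));
    return calls;
}
```

Calling run_loop() writes the text f(i) to standard error on each pass and calls f() on each pass as well, since both halves of the macro now sit inside the loop body.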
However, when I try this with a declaration, I get a compiler error:
TRACE(char c = *p);
s.cpp:17: error: expected primary-expression before 'char'
s.cpp:17: error: expected `;' before 'char'
I'm not sure exactly why!?
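One likely explanation, assuming the comma version of the macro is the one in use: the operands of the comma operator must be expressions, and a declaration such as char c = *p is not an expression. The sketch below shows the expansion in a comment and one possible workaround: declare the variable first, then trace the assignment. The parenthesized macro and the name `first_char` are my own, for illustration only.

```cpp
#include <iostream>

// Comma version of the macro (outer parentheses added for safety).
#define TRACE(s) (std::cerr << #s << std::endl, s)

// TRACE(char c = *p); would expand to
//   (std::cerr << "char c = *p" << std::endl, char c = *p);
// A declaration cannot be an operand of the comma operator, so the
// compiler reports that it expected an expression before 'char'.

// Hypothetical helper: declare outside the macro, trace only the assignment.
char first_char(const char* p) {
    char c;            // the declaration stays outside the macro
    TRACE(c = *p);     // an assignment is an expression, so this compiles
    return c;
}
```

With the semicolon version of the macro the declaration would simply become a second statement, so the error only appears once the semicolon has been replaced by a comma.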