Locale::Maketext::TPJ13

Locale::Maketext::TPJ13Perl Programmers Reference GuLocale::Maketext::TPJ13(3)



NAME
       Locale::Maketext::TPJ13 -- article about software localization

SYNOPSIS
         # This an article, not a module.

DESCRIPTION
       The following article by Sean M. Burke and Jordan Lachler first
       appeared in The Perl Journal #13 and is copyright 1999 The Perl Jour-
       nal. It appears courtesy of Jon Orwant and The Perl Journal.  This doc-
       ument may be distributed under the same terms as Perl itself.

Localization and Perl: gettext breaks, Maketext fixes
       by Sean M. Burke and Jordan Lachler

       This article points out cases where gettext (a common system for local-
       izing software interfaces -- i.e., making them work in the user's lan-
       guage of choice) fails because of basic differences between human lan-
       guages.  This article then describes Maketext, a new system capable of
       correctly treating these differences.

       A Localization Horror Story: It Could Happen To You

           "There are a number of languages spoken by human beings in this
           world."

           -- Harald Tveit Alvestrand, in RFC 1766, "Tags for the Identifica-
           tion of Languages"

       Imagine that your task for the day is to localize a piece of software
       -- and luckily for you, the only output the program emits is two mes-
       sages, like this:

         I scanned 12 directories.

         Your query matched 10 files in 4 directories.

       So how hard could that be?  You look at the code that produces the
       first item, and it reads:

         printf("I scanned %g directories.",
                $directory_count);

       You think about that, and realize that it doesn't even work right for
       English, as it can produce this output:

         I scanned 1 directories.

       So you rewrite it to read:

         printf("I scanned %g %s.",
                $directory_count,
                $directory_count == 1 ?
                  "directory" : "directories",
         );

       ...which does the Right Thing.  (In case you don't recall, "%g" is for
       locale-specific number interpolation, and "%s" is for string interpola-
       tion.)

       But you still have to localize it for all the languages you're produc-
       ing this software for, so you pull Locale::gettext off of CPAN so you
       can access the "gettext" C functions you've heard are standard for
       localization tasks.

       And you write:

         printf(gettext("I scanned %g %s."),
                $dir_scan_count,
                $dir_scan_count == 1 ?
                  gettext("directory") : gettext("directories"),
         );

       But you then read in the gettext manual (Drepper, Miller, and Pinard
       1995) that this is not a good idea, since how a single word like
       "directory" or "directories" is translated may depend on context -- and
       this is true, since in a case language like German or Russian, you'd
       may need these words with a different case ending in the first instance
       (where the word is the object of a verb) than in the second instance,
       which you haven't even gotten to yet (where the word is the object of a
       preposition, "in %g directories") -- assuming these keep the same syn-
       tax when translated into those languages.

       So, on the advice of the gettext manual, you rewrite:

         printf( $dir_scan_count == 1 ?
                  gettext("I scanned %g directory.") :
                  gettext("I scanned %g directories."),
                $dir_scan_count );

       So, you email your various translators (the boss decides that the lan-
       guages du jour are Chinese, Arabic, Russian, and Italian, so you have
       one translator for each), asking for translations for "I scanned %g
       directory." and "I scanned %g directories.".  When they reply, you'll
       put that in the lexicons for gettext to use when it localizes your
       software, so that when the user is running under the "zh" (Chinese)
       locale, gettext("I scanned %g directory.") will return the appropriate
       Chinese text, with a "%g" in there where printf can then interpolate
       $dir_scan.

       Your Chinese translator emails right back -- he says both of these
       phrases translate to the same thing in Chinese, because, in linguistic
       jargon, Chinese "doesn't have number as a grammatical category" --
       whereas English does.  That is, English has grammatical rules that
       refer to "number", i.e., whether something is grammatically singular or
       plural; and one of these rules is the one that forces nouns to take a
       plural suffix (generally "s") when in a plural context, as they are
       when they follow a number other than "one" (including, oddly enough,
       "zero").  Chinese has no such rules, and so has just the one phrase
       where English has two.  But, no problem, you can have this one Chinese
       phrase appear as the translation for the two English phrases in the
       "zh" gettext lexicon for your program.

       Emboldened by this, you dive into the second phrase that your software
       needs to output: "Your query matched 10 files in 4 directories.".  You
       notice that if you want to treat phrases as indivisible, as the gettext
       manual wisely advises, you need four cases now, instead of two, to
       cover the permutations of singular and plural on the two items,
       $dir_count and $file_count.  So you try this:

         printf( $file_count == 1 ?
           ( $directory_count == 1 ?
            gettext("Your query matched %g file in %g directory.") :
            gettext("Your query matched %g file in %g directories.") ) :
           ( $directory_count == 1 ?
            gettext("Your query matched %g files in %g directory.") :
            gettext("Your query matched %g files in %g directories.") ),
          $file_count, $directory_count,
         );

       (The case of "1 file in 2 [or more] directories" could, I suppose,
       occur in the case of symlinking or something of the sort.)

       It occurs to you that this is not the prettiest code you've ever writ-
       ten, but this seems the way to go.  You mail off to the translators
       asking for translations for these four cases.  The Chinese guy replies
       with the one phrase that these all translate to in Chinese, and that
       phrase has two "%g"s in it, as it should -- but there's a problem.  He
       translates it word-for-word back: "In %g directories contains %g files
       match your query."  The %g slots are in an order reverse to what they
       are in English.  You wonder how you'll get gettext to handle that.

       But you put it aside for the moment, and optimistically hope that the
       other translators won't have this problem, and that their languages
       will be better behaved -- i.e., that they will be just like English.

       But the Arabic translator is the next to write back.  First off, your
       code for "I scanned %g directory." or "I scanned %g directories."
       assumes there's only singular or plural.  But, to use linguistic jargon
       again, Arabic has grammatical number, like English (but unlike Chi-
       nese), but it's a three-term category: singular, dual, and plural.  In
       other words, the way you say "directory" depends on whether there's one
       directory, or two of them, or more than two of them.  Your test of
       "($directory == 1)" no longer does the job.  And it means that where
       English's grammatical category of number necessitates only the two per-
       mutations of the first sentence based on "directory [singular]" and
       "directories [plural]", Arabic has three -- and, worse, in the second
       sentence ("Your query matched %g file in %g directory."), where English
       has four, Arabic has nine.  You sense an unwelcome, exponential trend
       taking shape.

       Your Italian translator emails you back and says that "I searched 0
       directories" (a possible English output of your program) is stilted,
       and if you think that's fine English, that's your problem, but that
       just will not do in the language of Dante.  He insists that where
       $directory_count is 0, your program should produce the Italian text for
       "I didn't scan any directories.".  And ditto for "I didn't match any
       files in any directories", although he says the last part about "in any
       directories" should probably just be left off.

       You wonder how you'll get gettext to handle this; to accomodate the
       ways Arabic, Chinese, and Italian deal with numbers in just these few
       very simple phrases, you need to write code that will ask gettext for
       different queries depending on whether the numerical values in question
       are 1, 2, more than 2, or in some cases 0, and you still haven't fig-
       ured out the problem with the different word order in Chinese.

       Then your Russian translator calls on the phone, to personally tell you
       the bad news about how really unpleasant your life is about to become:

       Russian, like German or Latin, is an inflectional language; that is,
       nouns and adjectives have to take endings that depend on their case
       (i.e., nominative, accusative, genitive, etc...) -- which is roughly a
       matter of what role they have in syntax of the sentence -- as well as
       on the grammatical gender (i.e., masculine, feminine, neuter) and num-
       ber (i.e., singular or plural) of the noun, as well as on the declen-
       sion class of the noun.  But unlike with most other inflected lan-
       guages, putting a number-phrase (like "ten" or "forty-three", or their
       Arabic numeral equivalents) in front of noun in Russian can change the
       case and number that noun is, and therefore the endings you have to put
       on it.

       He elaborates:  In "I scanned %g directories", you'd expect "directo-
       ries" to be in the accusative case (since it is the direct object in
       the sentnce) and the plural number, except where $directory_count is 1,
       then you'd expect the singular, of course.  Just like Latin or German.
       But!  Where $directory_count % 10 is 1 ("%" for modulo, remember),
       assuming $directory count is an integer, and except where $direc-
       tory_count % 100 is 11, "directories" is forced to become grammatically
       singular, which means it gets the ending for the accusative singular...
       You begin to visualize the code it'd take to test for the problem so
       far, and still work for Chinese and Arabic and Italian, and how many
       gettext items that'd take, but he keeps going...  But where $direc-
       tory_count % 10 is 2, 3, or 4 (except where $directory_count % 100 is
       12, 13, or 14), the word for "directories" is forced to be genitive
       singular -- which means another ending... The room begins to spin
       around you, slowly at first...  But with all other integer values,
       since "directory" is an inanimate noun, when preceded by a number and
       in the nominative or accusative cases (as it is here, just your luck!),
       it does stay plural, but it is forced into the genitive case -- yet
       another ending...  And you never hear him get to the part about how
       you're going to run into similar (but maybe subtly different) problems
       with other Slavic languages like Polish, because the floor comes up to
       meet you, and you fade into unconsciousness.

       The above cautionary tale relates how an attempt at localization can
       lead from programmer consternation, to program obfuscation, to a need
       for sedation.  But careful evaluation shows that your choice of tools
       merely needed further consideration.

       The Linguistic View

           "It is more complicated than you think."

           -- The Eighth Networking Truth, from RFC 1925

       The field of Linguistics has expended a great deal of effort over the
       past century trying to find grammatical patterns which hold across lan-
       guages; it's been a constant process of people making generalizations
       that should apply to all languages, only to find out that, all too
       often, these generalizations fail -- sometimes failing for just a few
       languages, sometimes whole classes of languages, and sometimes nearly
       every language in the world except English.  Broad statistical trends
       are evident in what the "average language" is like as far as what its
       rules can look like, must look like, and cannot look like.  But the
       "average language" is just as unreal a concept as the "average person"
       -- it runs up against the fact no language (or person) is, in fact,
       average.  The wisdom of past experience leads us to believe that any
       given language can do whatever it wants, in any order, with appeal to
       any kind of grammatical categories wants -- case, number, tense, real
       or metaphoric characteristics of the things that words refer to, arbi-
       trary or predictable classifications of words based on what endings or
       prefixes they can take, degree or means of certainty about the truth of
       statements expressed, and so on, ad infinitum.

       Mercifully, most localization tasks are a matter of finding ways to
       translate whole phrases, generally sentences, where the context is rel-
       atively set, and where the only variation in content is usually in a
       number being expressed -- as in the example sentences above.  Translat-
       ing specific, fully-formed sentences is, in practice, fairly foolproof
       -- which is good, because that's what's in the phrasebooks that so many
       tourists rely on.  Now, a given phrase (whether in a phrasebook or in a
       gettext lexicon) in one language might have a greater or lesser appli-
       cability than that phrase's translation into another language -- for
       example, strictly speaking, in Arabic, the "your" in "Your query
       matched..." would take a different form depending on whether the user
       is male or female; so the Arabic translation "your[feminine] query" is
       applicable in fewer cases than the corresponding English phrase, which
       doesn't distinguish the user's gender.  (In practice, it's not feasable
       to have a program know the user's gender, so the masculine "you" in
       Arabic is usually used, by default.)

       But in general, such surprises are rare when entire sentences are being
       translated, especially when the functional context is restricted to
       that of a computer interacting with a user either to convey a fact or
       to prompt for a piece of information.  So, for purposes of localiza-
       tion, translation by phrase (generally by sentence) is both the sim-
       plest and the least problematic.

       Breaking gettext

           "It Has To Work."

           -- First Networking Truth, RFC 1925

       Consider that sentences in a tourist phrasebook are of two types: ones
       like "How do I get to the marketplace?" that don't have any blanks to
       fill in, and ones like "How much do these ___ cost?", where there's one
       or more blanks to fill in (and these are usually linked to a list of
       words that you can put in that blank: "fish", "potatoes", "tomatoes",
       etc.)  The ones with no blanks are no problem, but the fill-in-the-
       blank ones may not be really straightforward. If it's a Swahili phrase-
       book, for example, the authors probably didn't bother to tell you the
       complicated ways that the verb "cost" changes its inflectional prefix
       depending on the noun you're putting in the blank.  The trader in the
       marketplace will still understand what you're saying if you say "how
       much do these potatoes cost?" with the wrong inflectional prefix on
       "cost".  After all, you can't speak proper Swahili, you're just a
       tourist.  But while tourists can be stupid, computers are supposed to
       be smart; the computer should be able to fill in the blank, and still
       have the results be grammatical.

       In other words, a phrasebook entry takes some values as parameters (the
       things that you fill in the blank or blanks), and provides a value
       based on these parameters, where the way you get that final value from
       the given values can, properly speaking, involve an arbitrarily complex
       series of operations.  (In the case of Chinese, it'd be not at all com-
       plex, at least in cases like the examples at the beginning of this
       article; whereas in the case of Russian it'd be a rather complex series
       of operations.  And in some languages, the complexity could be spread
       around differently: while the act of putting a number-expression in
       front of a noun phrase might not be complex by itself, it may change
       how you have to, for example, inflect a verb elsewhere in the sentence.
       This is what in syntax is called "long-distance dependencies".)

       This talk of parameters and arbitrary complexity is just another way to
       say that an entry in a phrasebook is what in a programming language
       would be called a "function".  Just so you don't miss it, this is the
       crux of this article: A phrase is a function; a phrasebook is a bunch
       of functions.

       The reason that using gettext runs into walls (as in the above second-
       person horror story) is that you're trying to use a string (or worse, a
       choice among a bunch of strings) to do what you really need a function
       for -- which is futile.  Preforming (s)printf interpolation on the
       strings which you get back from gettext does allow you to do some com-
       mon things passably well... sometimes... sort of; but, to paraphrase
       what some people say about "csh" script programming, "it fools you into
       thinking you can use it for real things, but you can't, and you don't
       discover this until you've already spent too much time trying, and by
       then it's too late."

       Replacing gettext

       So, what needs to replace gettext is a system that supports lexicons of
       functions instead of lexicons of strings.  An entry in a lexicon from
       such a system should not look like this:

         "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires"

       [\xE9 is e-acute in Latin-1.  Some pod renderers would scream if I used
       the actual character here. -- SB]

       but instead like this, bearing in mind that this is just a first stab:

         sub I_found_X1_files_in_X2_directories {
           my( $files, $dirs ) = @_[0,1];
           $files = sprintf("%g %s", $files,
             $files == 1 ? 'fichier' : 'fichiers');
           $dirs = sprintf("%g %s", $dirs,
             $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
           return "J'ai trouv\xE9 $files dans $dirs.";
         }

       Now, there's no particularly obvious way to store anything but strings
       in a gettext lexicon; so it looks like we just have to start over and
       make something better, from scratch.  I call my shot at a gettext-
       replacement system "Maketext", or, in CPAN terms, Locale::Maketext.

       When designing Maketext, I chose to plan its main features in terms of
       "buzzword compliance".  And here are the buzzwords:

       Buzzwords: Abstraction and Encapsulation

       The complexity of the language you're trying to output a phrase in is
       entirely abstracted inside (and encapsulated within) the Maketext mod-
       ule for that interface.  When you call:

         print $lang->maketext("You have [quant,_1,piece] of new mail.",
                              scalar(@messages));

       you don't know (and in fact can't easily find out) whether this will
       involve lots of figuring, as in Russian (if $lang is a handle to the
       Russian module), or relatively little, as in Chinese.  That kind of
       abstraction and encapsulation may encourage other pleasant buzzwords
       like modularization and stratification, depending on what design deci-
       sions you make.

       Buzzword: Isomorphism

       "Isomorphism" means "having the same structure or form"; in discussions
       of program design, the word takes on the special, specific meaning that
       your implementation of a solution to a problem has the same structure
       as, say, an informal verbal description of the solution, or maybe of
       the problem itself.  Isomorphism is, all things considered, a good
       thing -- it's what problem-solving (and solution-implementing) should
       look like.

       What's wrong the with gettext-using code like this...

         printf( $file_count == 1 ?
           ( $directory_count == 1 ?
            "Your query matched %g file in %g directory." :
            "Your query matched %g file in %g directories." ) :
           ( $directory_count == 1 ?
            "Your query matched %g files in %g directory." :
            "Your query matched %g files in %g directories." ),
          $file_count, $directory_count,
         );

       is first off that it's not well abstracted -- these ways of testing for
       grammatical number (as in the expressions like "foo == 1 ?
       singular_form : plural_form") should be abstracted to each language
       module, since how you get grammatical number is language-specific.

       But second off, it's not isomorphic -- the "solution" (i.e., the
       phrasebook entries) for Chinese maps from these four English phrases to
       the one Chinese phrase that fits for all of them.  In other words, the
       informal solution would be "The way to say what you want in Chinese is
       with the one phrase 'For your question, in Y directories you would find
       X files'" -- and so the implemented solution should be, isomorphically,
       just a straightforward way to spit out that one phrase, with numerals
       properly interpolated.  It shouldn't have to map from the complexity of
       other languages to the simplicity of this one.

       Buzzword: Inheritance

       There's a great deal of reuse possible for sharing of phrases between
       modules for related dialects, or for sharing of auxiliary functions
       between related languages.  (By "auxiliary functions", I mean functions
       that don't produce phrase-text, but which, say, return an answer to
       "does this number require a plural noun after it?".  Such auxiliary
       functions would be used in the internal logic of functions that actu-
       ally do produce phrase-text.)

       In the case of sharing phrases, consider that you have an interface
       already localized for American English (probably by having been written
       with that as the native locale, but that's incidental).  Localizing it
       for UK English should, in practical terms, be just a matter of running
       it past a British person with the instructions to indicate what few
       phrases would benefit from a change in spelling or possibly minor
       rewording.  In that case, you should be able to put in the UK English
       localization module only those phrases that are UK-specific, and for
       all the rest, inherit from the American English module.  (And I expect
       this same situation would apply with Brazilian and Continental Por-
       tugese, possbily with some very closely related languages like Czech
       and Slovak, and possibly with the slightly different "versions" of
       written Mandarin Chinese, as I hear exist in Taiwan and mainland
       China.)

       As to sharing of auxiliary functions, consider the problem of Russian
       numbers from the beginning of this article; obviously, you'd want to
       write only once the hairy code that, given a numeric value, would
       return some specification of which case and number a given quanitified
       noun should use.  But suppose that you discover, while localizing an
       interface for, say, Ukranian (a Slavic language related to Russian,
       spoken by several million people, many of whom would be relieved to
       find that your Web site's or software's interface is available in their
       language), that the rules in Ukranian are the same as in Russian for
       quantification, and probably for many other grammatical functions.
       While there may well be no phrases in common between Russian and Ukra-
       nian, you could still choose to have the Ukranian module inherit from
       the Russian module, just for the sake of inheriting all the various
       grammatical methods.  Or, probably better organizationally, you could
       move those functions to a module called "_E_Slavic" or something, which
       Russian and Ukranian could inherit useful functions from, but which
       would (presumably) provide no lexicon.

       Buzzword: Concision

       Okay, concision isn't a buzzword.  But it should be, so I decree that
       as a new buzzword, "concision" means that simple common things should
       be expressible in very few lines (or maybe even just a few characters)
       of code -- call it a special case of "making simple things easy and
       hard things possible", and see also the role it played in the
       MIDI::Simple language, discussed elsewhere in this issue [TPJ#13].

       Consider our first stab at an entry in our "phrasebook of functions":

         sub I_found_X1_files_in_X2_directories {
           my( $files, $dirs ) = @_[0,1];
           $files = sprintf("%g %s", $files,
             $files == 1 ? 'fichier' : 'fichiers');
           $dirs = sprintf("%g %s", $dirs,
             $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
           return "J'ai trouv\xE9 $files dans $dirs.";
         }

       You may sense that a lexicon (to use a non-committal catch-all term for
       a collection of things you know how to say, regardless of whether
       they're phrases or words) consisting of functions expressed as above
       would make for rather long-winded and repetitive code -- even if you
       wisely rewrote this to have quantification (as we call adding a number
       expression to a noun phrase) be a function called like:

         sub I_found_X1_files_in_X2_directories {
           my( $files, $dirs ) = @_[0,1];
           $files = quant($files, "fichier");
           $dirs =  quant($dirs,  "r\xE9pertoire");
           return "J'ai trouv\xE9 $files dans $dirs.";
         }

       And you may also sense that you do not want to bother your translators
       with having to write Perl code -- you'd much rather that they spend
       their very costly time on just translation.  And this is to say nothing
       of the near impossibility of finding a commercial translator who would
       know even simple Perl.

       In a first-hack implementation of Maketext, each language-module's lex-
       icon looked like this:

        %Lexicon = (
          "I found %g files in %g directories"
          => sub {
             my( $files, $dirs ) = @_[0,1];
             $files = quant($files, "fichier");
             $dirs =  quant($dirs,  "r\xE9pertoire");
             return "J'ai trouv\xE9 $files dans $dirs.";
           },
         ... and so on with other phrase => sub mappings ...
        );

       but I immediately went looking for some more concise way to basically
       denote the same phrase-function -- a way that would also serve to con-
       cisely denote most phrase-functions in the lexicon for most languages.
       After much time and even some actual thought, I decided on this system:

       * Where a value in a %Lexicon hash is a contentful string instead of an
       anonymous sub (or, conceivably, a coderef), it would be interpreted as
       a sort of shorthand expression of what the sub does.  When accessed for
       the first time in a session, it is parsed, turned into Perl code, and
       then eval'd into an anonymous sub; then that sub replaces the original
       string in that lexicon.  (That way, the work of parsing and evaling the
       shorthand form for a given phrase is done no more than once per ses-
       sion.)

       * Calls to "maketext" (as Maketext's main function is called) happen
       thru a "language session handle", notionally very much like an IO han-
       dle, in that you open one at the start of the session, and use it for
       "sending signals" to an object in order to have it return the text you
       want.

       So, this:

         $lang->maketext("You have [quant,_1,piece] of new mail.",
                        scalar(@messages));

       basically means this: look in the lexicon for $lang (which may inherit
       from any number of other lexicons), and find the function that we hap-
       pen to associate with the string "You have [quant,_1,piece] of new
       mail" (which is, and should be, a functioning "shorthand" for this
       function in the native locale -- English in this case).  If you find
       such a function, call it with $lang as its first parameter (as if it
       were a method), and then a copy of scalar(@messages) as its second, and
       then return that value.  If that function was found, but was in string
       shorthand instead of being a fully specified function, parse it and
       make it into a function before calling it the first time.

       * The shorthand uses code in brackets to indicate method calls that
       should be performed.  A full explanation is not in order here, but a
       few examples will suffice:

         "You have [quant,_1,piece] of new mail."

       The above code is shorthand for, and will be interpreted as, this:

         sub {
           my $handle = $_[0];
           my(@params) = @_;
           return join '',
             "You have ",
             $handle->quant($params[1], 'piece'),
             "of new mail.";
         }

       where "quant" is the name of a method you're using to quantify the noun
       "piece" with the number $params[0].

       A string with no brackety calls, like this:

         "Your search expression was malformed."

       is somewhat of a degerate case, and just gets turned into:

         sub { return "Your search expression was malformed." }

       However, not everything you can write in Perl code can be written in
       the above shorthand system -- not by a long shot.  For example, con-
       sider the Italian translator from the beginning of this article, who
       wanted the Italian for "I didn't find any files" as a special case,
       instead of "I found 0 files".  That couldn't be specified (at least not
       easily or simply) in our shorthand system, and it would have to be
       written out in full, like this:

         sub {  # pretend the English strings are in Italian
           my($handle, $files, $dirs) = @_[0,1,2];
           return "I didn't find any files" unless $files;
           return join '',
             "I found ",
             $handle->quant($files, 'file'),
             " in ",
             $handle->quant($dirs,  'directory'),
             ".";
         }

       Next to a lexicon full of shorthand code, that sort of sticks out like
       a sore thumb -- but this is a special case, after all; and at least
       it's possible, if not as concise as usual.

       As to how you'd implement the Russian example from the beginning of the
       article, well, There's More Than One Way To Do It, but it could be
       something like this (using English words for Russian, just so you know
       what's going on):

         "I [quant,_1,directory,accusative] scanned."

       This shifts the burden of complexity off to the quant method.  That
       method's parameters are: the numeric value it's going to use to quan-
       tify something; the Russian word it's going to quantify; and the param-
       eter "accusative", which you're using to mean that this sentence's syn-
       tax wants a noun in the accusative case there, although that quantifi-
       cation method may have to overrule, for grammatical reasons you may
       recall from the beginning of this article.

       Now, the Russian quant method here is responsible not only for imple-
       menting the strange logic necessary for figuring out how Russian num-
       ber-phrases impose case and number on their noun-phrases, but also for
       inflecting the Russian word for "directory".  How that inflection is to
       be carried out is no small issue, and among the solutions I've seen,
       some (like variations on a simple lookup in a hash where all possible
       forms are provided for all necessary words) are straightforward but can
       become cumbersome when you need to inflect more than a few dozen words;
       and other solutions (like using algorithms to model the inflections,
       storing only root forms and irregularities) can involve more overhead
       than is justifiable for all but the largest lexicons.

       Mercifully, this design decision becomes crucial only in the hairiest
       of inflected languages, of which Russian is by no means the worst case
       scenario, but is worse than most.  Most languages have simpler inflec-
       tion systems; for example, in English or Swahili, there are generally
       no more than two possible inflected forms for a given noun
       ("error/errors"; "kosa/makosa"), and the rules for producing these
       forms are fairly simple -- or at least, simple rules can be formulated
       that work for most words, and you can then treat the exceptions as just
       "irregular", at least relative to your ad hoc rules.  A simpler inflec-
       tion system (simpler rules, fewer forms) means that design decisions
       are less crucial to maintaining sanity, whereas the same decisions
       could incur overhead-versus-scalability problems in languages like Rus-
       sian.  It may also be likely that code (possibly in Perl, as with Lin-
       gua::EN::Inflect, for English nouns) has already been written for the
       language in question, whether simple or complex.

       Moreover, a third possibility may even be simpler than anything dis-
       cussed above: "Just require that all possible (or at least applicable)
       forms be provided in the call to the given language's quant method, as
       in:"

         "I found [quant,_1,file,files]."

       That way, quant just has to chose which form it needs, without having
       to look up or generate anything.  While possibly not optimal for Rus-
       sian, this should work well for most other languages, where quantifica-
       tion is not as complicated an operation.

       The Devil in the Details

       There's plenty more to Maketext than described above -- for example,
       there's the details of how language tags ("en-US", "i-pwn", "fi", etc.)
       or locale IDs ("en_US") interact with actual module naming ("Bogo-
       Query/Locale/en_us.pm"), and what magic can ensue; there's the details
       of how to record (and possibly negotiate) what character encoding Make-
       text will return text in (UTF8? Latin-1? KOI8?).  There's the interest-
       ing fact that Maketext is for localization, but nowhere actually has a
       ""use locale;"" anywhere in it.  For the curious, there's the somewhat
       frightening details of how I actually implement something like data
       inheritance so that searches across modules' %Lexicon hashes can
       parallel how Perl implements method inheritance.

       And, most importantly, there's all the practical details of how to
       actually go about deriving from Maketext so you can use it for your
       interfaces, and the various tools and conventions for starting out and
       maintaining individual language modules.

       That is all covered in the documentation for Locale::Maketext and the
       modules that come with it, available in CPAN.  After having read this
       article, which covers the why's of Maketext, the documentation, which
       covers the how's of it, should be quite straightfoward.

       The Proof in the Pudding: Localizing Web Sites

       Maketext and gettext have a notable difference: gettext is in C, acces-
       sible thru C library calls, whereas Maketext is in Perl, and really
       can't work without a Perl interpreter (although I suppose something
       like it could be written for C).  Accidents of history (and not neces-
       sarily lucky ones) have made C++ the most common language for the
       implementation of applications like word processors, Web browsers, and
       even many in-house applications like custom query systems.  Current
       conditions make it somewhat unlikely that the next one of any of these
       kinds of applications will be written in Perl, albeit clearly more for
       reasons of custom and inertia than out of consideration of what is the
       right tool for the job.

       However, other accidents of history have made Perl a well-accepted lan-
       guage for design of server-side programs (generally in CGI form) for
       Web site interfaces.  Localization of static pages in Web sites is
       trivial, feasable either with simple language-negotiation features in
       servers like Apache, or with some kind of server-side inclusions of
       language-appropriate text into layout templates.  However, I think that
       the localization of Perl-based search systems (or other kinds of
       dynamic content) in Web sites, be they public or access-restricted, is
       where Maketext will see the greatest use.

       I presume that it would be only the exceptional Web site that gets
       localized for English and Chinese and Italian and Arabic and Russian,
       to recall the languages from the beginning of this article -- to say
       nothing of German, Spanish, French, Japanese, Finnish, and Hindi, to
       name a few languages that benefit from large numbers of programmers or
       Web viewers or both.

       However, the ever-increasing internationalization of the Web (whether
       measured in terms of amount of content, of numbers of content writers
       or programmers, or of size of content audiences) makes it increasingly
       likely that the interface to the average Web-based dynamic content ser-
       vice will be localized for two or maybe three languages.  It is my hope
       that Maketext will make that task as simple as possible, and will
       remove previous barriers to localization for languages dissimilar to
       English.

        __END__

       Sean M. Burke (sburke@cpan.org) has a Master's in linguistics from
       Northwestern University; he specializes in language technology.  Jordan
       Lachler (lachler@unm.edu) is a PhD student in the Department of Lin-
       guistics at the University of New Mexico; he specializes in morphology
       and pedagogy of North American native languages.

       References

       Alvestrand, Harald Tveit.  1995.  RFC 1766: Tags for the Identification
       of Languages.  "ftp://ftp.isi.edu/in-notes/rfc1766.txt" [Now see RFC
       3066.]

       Callon, Ross, editor.  1996.  RFC 1925: The Twelve Networking Truths.
       "ftp://ftp.isi.edu/in-notes/rfc1925.txt"

       Drepper, Ulrich, Peter Miller, and Francois Pinard.  1995-2001.  GNU
       "gettext".  Available in "ftp://prep.ai.mit.edu/pub/gnu/", with exten-
       sive docs in the distribution tarball.  [Since I wrote this article in
       1998, I now see that the gettext docs are now trying more to come to
       terms with plurality.  Whether useful conclusions have come from it is
       another question altogether. -- SMB, May 2001]

       Forbes, Nevill.  1964.  Russian Grammar.  Third Edition, revised by J.
       C. Dumbreck.  Oxford University Press.



perl v5.8.6                       2001-09-21        Locale::Maketext::TPJ13(3)