A brief summary of conclusions

There is no disputing the fact that few people speak Icelandic. Nevertheless, Icelandic is a living national language that is used in all areas of the nation’s communications and commerce. Its significance is therefore greater than that of languages that are used by more people, but are only secondary languages within their nations, or the languages of tribes or communities that constitute minorities within larger national units. Furthermore, Icelandic data technology is well advanced compared with the position in many other language areas. Iceland has one of the world’s highest per capita rates of computer ownership, and also one of the highest rates of Internet access and use. The need for Icelandic to cope with the demands of information technology is therefore greater than might be thought in terms of the population size alone.

The Icelandic Government is aware of these considerations. The booklet Í krafti upplýsinga, published by the Ministry of Education and Culture in 1996, contained the following statement: “It is necessary to promote the use of Icelandic in information technology and to encourage the production of material at a suitable rate so as to ensure access to the widest possible range of material in Icelandic. Producers in Iceland must be able to use new technology and contribute towards a good supply of Icelandic material on CDs and the Internet in the years ahead.” In accordance with this policy, an agreement was recently concluded on the translation of the Microsoft Windows system into Icelandic.

It is the group’s opinion that the next step on this route should be to press for the development of various types of language engineering tools that will work with Icelandic and so facilitate the use of Icelandic in the information society. This vision includes the writing of software to correct spelling and grammar, break words between lines, etc.; the compilation of an electronic dictionary and thesaurus of Icelandic accessible to everybody; guidance on the inflection of words in all computers; and so on. If Iceland does not take this step, there is a danger that it will prove difficult to use Icelandic in the information society.

Some of the problems involved in language engineering will probably be solved automatically thanks to the emergence of high-powered technology and changes in manufacturers’ policy towards their foreign markets, but others will have to be solved by the Icelanders themselves. What is vital here is to try to ensure that allowance is made in all areas for the use of Icelandic and for the special properties of the language right from the production stage of the equipment. It is also essential to fight for the inclusion of Icelandic in international standards. In general, universal, rather than specific, solutions must be used. This is the only policy that will guarantee that it will be possible to use Icelandic in information technology in the future. Specific solutions are expensive, are valid for short periods and are very difficult and labour-intensive to maintain, and they should only be used in cases of absolute necessity.

At present, the market for language engineering in Iceland is not sufficiently large to be able to support the developmental work needed to ensure the position of Icelandic in the information society. This is explained in further detail in the report. It is not certain that this will necessarily be the case in the future. Hitherto, Icelanders have “paid for Icelandic”, so to speak: they have a high level of book and newspaper publication, and pay a higher price for these items precisely because they are in Icelandic and the market for them is therefore limited. Gradually, Icelanders would presumably pay the cost involved in translating and adapting information technology for use in Icelandic.

Nevertheless, the committee is of the opinion that a campaign needs to be fought to establish language engineering, and that this will not be done without state support. The committee’s view is that such a campaign will pay for itself in the long run. The aim of the campaign should be to consolidate the common foundation for language engineering and the gathering of data for language engineering tools, and to encourage companies to develop the tools by making use of the database.

In this way, a new language engineering industry could be created, and that which already exists would be strengthened. This refers to various industries connected with publishing and language use, such as the publication of dictionaries and glossaries, programs to correct spelling and grammar, various programs to assist with text production, speech simulators and speech recognition devices. Such an industry in Iceland could be expected to make an entry on foreign markets, and should be encouraged to do so: there can be no doubt that various opportunities will arise on those markets in the years ahead.

It is proposed that the campaign be waged on four fronts:

1. The development of common databases (linguistic databases) that can be used by companies as sources of raw material for their products.
2. Investment in applied research in the field of language engineering.
3. Financial support for companies for the development of language engineering products.
4. Development and upgrading of education and training in the field of language engineering and linguistics.

This campaign would involve the establishment of a language engineering development centre which would have the task of working with publishers and others on the development of the necessary basic databases of the language. The involvement of interested parties such as computer firms, publishers, translators and others, in addition to the state, would be essential.

Secondly, the committee proposes that funds be channelled into a research fund to support research and development in the field of language engineering. This could either take the form of a special fund, or else the funds of the Icelandic Research Council (IRC) could be boosted with additional capital earmarked for this industry. The fund would operate in two divisions, as the IRC’s Research Fund does at present, making grants on the one hand for primary applied research, which would be of benefit to the industry in the long term, on the other for development projects within companies, in particular for the manufacture of language engineering devices.

It would be essential for this financial support to be counted as complementary contributions in connection with grants from the European Union, since Europe is the main source of expertise in this area and projects within the EU’s Fifth Framework Programme could be of great significance for development in Iceland, both as regards contacts and financing.

Next, the committee considers it essential to develop and upgrade education and training in this area, and proposes that short applied training courses in speech technology and a master’s programme in language engineering be established.

The total annual cost of the proposed campaign would therefore be (in ISK millions):

Development centre: 25 - 50
Research and development fund: 150
Special funding for large international projects: 30
Short applied courses in speech technology: 10
Master’s programme in computational linguistics: 10

Total ISK 225 - 250 million per year
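The total can be checked by summing the low and high ends of each item. The following is only a minimal sketch of that arithmetic; the variable names are illustrative and the figures are copied from the items listed above.

```python
# Annual cost items in ISK millions as (low, high) ranges, copied from the figures above.
costs = {
    "Development centre": (25, 50),
    "Research and development fund": (150, 150),
    "Special funding for large international projects": (30, 30),
    "Short applied courses in speech technology": (10, 10),
    "Master's programme in computational linguistics": (10, 10),
}

# Sum the low ends and the high ends separately to obtain the overall range.
total_low = sum(low for low, _ in costs.values())
total_high = sum(high for _, high in costs.values())
print(f"Total: ISK {total_low} - {total_high} million per year")
```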

This may be thought a large sum, which indeed it is, but the committee believes it is a very modest and realistic assessment, and that any substantial reduction would mean that the desired results would not be achieved. In other words, no allowance has been made in this estimate for reductions for which no urgent reasons have been advanced.

It is vital that a start be made on these activities shortly. The aim should be that the campaign will last for a limited time, and that the activities should be self-supporting in five to ten years.

Appendices

1. Tasks to be Tackled in Icelandic Language Engineering

Priority tasks

For Icelanders, the main aim must be that it should be possible to use Icelandic, written with the proper characters, in as many contexts as possible in the sphere of computer and communication technology. Naturally, however, they will have to adjust their expectations to practical considerations. To make it possible to use Icelandic in all areas, under all circumstances, would be an immense task. Therefore, the main emphasis must be put on those areas that touch on the daily life and work of the general public, or are likely to do so in the near future. The committee proposes that over the next five years, efforts be concentrated on the following tasks:

1. The main computer programs on the general market (Windows, Word, Excel, Netscape, Internet Explorer, Eudora,...) should be available in Icelandic.
2. It should be possible to use the Icelandic letters (áéíóúýðþæöÁÉÍÓÚÝÐÞÆÖ) in all circumstances: in computers, mobile telephones, teletext and other applications used by the public.
3. Work should proceed on the parsing of Icelandic, with the aim that it should be possible to use computer technology to analyse Icelandic texts into parts of speech and syntactic units. For this to be possible, it will be necessary to:
   3.1 Establish a large computerised text corpus including Icelandic texts of the widest possible variety of types as a basis for continuing work.
   3.2 Establish a fully parsed linguistic database (grammatically and semantically analysed) for use in further work.
4. Good auxiliary programs should be developed for textual work in Icelandic, i.e. for hyphenation, spelling-checks, grammar correction, etc.
5. A good Icelandic speech synthesiser should be developed. It should be capable of reading Icelandic texts with clear and comprehensible pronunciation and natural intonation that is understandable without special training.
6. Work should be done on speech recognition for Icelandic, the aim being to develop programs that can understand normal Icelandic speech.
7. Work should be done on the development of translation programs between Icelandic and other languages, one of the aims being to simplify searches in databases.
8. Certain parties (institutions or companies) should be given responsibility for individual projects.

Further explanation of the above points

1. Recent years have seen a deterioration of the situation in this area. Ten or fifteen years ago, the leading word-processing programs were available in Icelandic. It is necessary to press for the translation into Icelandic of the Windows operating system and the other leading programs from Microsoft that the general public uses on a daily basis, e.g. web browsers, e-mail programs, etc. It is to be hoped that the agreement with Microsoft will mark a turning-point in this area.

2. A lot of progress has been made in this area in recent years, and good work has been done by those responsible for designing Icelandic standards and those who work for the preservation of the Icelandic language. Nevertheless, there are some clouds on the horizon. For example, Icelandic letters are not included in the character sets of the GSM mobile telephones, which is a serious matter, not least in view of the fact that computer technology and mobile communication technology are becoming more and more closely integrated.

3. “Parsing” here refers to the computerised analysis of texts and identification of parts of speech, phrases and clauses. Such analysis is essential for the design of grammar correction programs, translation programs, etc. Sophisticated parsing programs exist for many major languages, but little has been done in this field in Iceland. The elements of parsing analysis are to be found in the Púki program by Friðrik Skúlason and the programs designed some years ago by Stefán Briem for the University Dictionary, but no computerised syntactical analysis tools exist for Icelandic.

3.1 In order to be able to design programs to work with a language, it is essential to have available a great deal of accurate information about the language and the way it is used. One of the main elements in assembling such information involves establishing the largest possible corpus containing Icelandic texts in computerised form. This would have to span texts from a variety of sources: newspapers, academic texts in a variety of subjects, literary works, the spoken language, etc. It is also vital to ensure that the corpus contains language as used by both men and women, people of various age groups, of various class backgrounds and from different parts of the country, etc. Data of many types must then be produced by analysing these texts in order to write programs to carry out work of different types with the language. No such text corpus currently exists, even though a great deal of the raw material necessary does exist, not least in the University Dictionary’s databases. The creation of a text corpus of this type, and the processing of data from it, is the prerequisite for effective work in the field of Icelandic language engineering.

3.2 In general, what is said in the preceding sub-section also applies here. A linguistic database containing the basic vocabulary of Icelandic (a few tens of thousands of words) is the prerequisite for language engineering work of various types in Icelandic. It would have to contain the most accurate information possible on each word: its pronunciation, part of speech, inflection, position within a sentence or clause, meaning, stylistic value, etc. Information of this type is useful for the design of grammar correction programs, computer translation, database searches, etc. The basis of such a database exists at the University Dictionary, but much of the necessary data is lacking, e.g. all semantic classifications.

4. Reasonably good hyphenation programs exist for Icelandic, and also spelling-checker programs (e.g. Friðrik Skúlason’s Púki). However, the drawback to most of these programs is that they are not built into the most widely-used word-processing programs on the market (e.g. Word) and do not function satisfactorily with those programs. This reduces their practical application and limits their use. It is necessary to have good Icelandic programs included in Word. Grammar-check and style-check programs help the user to eradicate errors in inflection, wrong word order and clumsy syntax. However, such programs do not exist for Icelandic, though there is a great need for them.

5. An Icelandic speech simulator has been on the market for the past 6-7 years. It is produced by the Swedish company Infovox, and was developed in collaboration with the University of Iceland’s Institute of Linguistics, the Faculty of Engineering and the Icelandic Federation of the Handicapped. The technology on which it is based is now regarded as outmoded, and views about its quality are divided. It is obvious that its pronunciation leaves much to be desired, though it has proved of great value to some people. Another speech synthesiser has also been produced, based on a different technology, but it has not been released on the open market. Work must be continued towards developing a sophisticated Icelandic speech synthesiser.

6. “Speech recognition” (or “voice recognition”) is the term used to refer to the recognition of speech by computers. A lot of progress has been made in this field recently. It is likely that speech recognition will become of great importance in various contexts in future, e.g. in connection with searches for information and the operation of devices of various types. It is therefore very important to begin deliberate work towards developing speech recognition technology for Icelandic. This is an area in which nothing has yet been done.

7. Machine translations have a long history, though the results have been of varying quality. In the past few years, however, translation programs have been produced that work quite well, at least in restricted fields. It is likely that machine translations will assume far greater importance in the future, e.g. in connection with searches in databases, etc. Stefán Briem has done a considerable amount of experimental work on translations between Icelandic and Esperanto, but the outcome has been very limited.

8. It can be argued that all development in this area has been severely hampered by the fact that no party has been responsible for ensuring that Iceland keeps up to date in this field. Amongst those bodies that it would be natural to see involved in the projects described here may be mentioned, in particular, the University of Iceland’s Linguistics Institute, the Icelandic Language Centre, the University Dictionary, the Icelandic Standards Council and the Post and Telecommunications Administration, though it is also natural and necessary that private companies should be involved in the projects. Perhaps it would be desirable to set up some sort of formal collaborative forum for these parties. The Ministry of Education and Culture would then have to entrust particular parties, or their association, with responsibility for particular areas.

2. The Status of Icelandic Letters

Standards, character sets and fonts

Though computers were originally built to perform calculations, their use for handling texts of various types has assumed ever greater importance. To make this possible, computer manufacturers have built a variety of systems that are based on linguistic knowledge. Standards have come into being in this field; some have become public standards (European or international), while others are the private property of individual manufacturers.

By their nature, standards are agreements between interested parties about how something is to be done, and there is no need to establish a standard unless there is more than one interested party. For example, when data exchange takes place between different types of computer over the Internet, it is convenient to follow a standard, but within the operating system of a single computer there is less reason to have the data in any form other than that which suits the manufacturer best.

Types of standards bearing on language engineering

1. Character sets. In character sets, each character is given a name and assigned a position or number. For example, the SMALL LETTER THORN is number 254 (FE) in the character set Latin-1 (ISO/IEC 8859-1). Character sets do not determine the appearance of letters; appearance is determined by the font.
2. Fonts. The appearance of letters is defined in each font. For example, the letter g appears as [g] in Helvetica. Most fonts are private property, but there is an ISO standard for machine-readable fonts (OCR-B).
3. Keyboards. The position of the letters on the keyboard depends on the needs and customs of each nation. There are national standards regarding keyboards, e.g. ÍST 125 on the Icelandic keyboard.
4. Alphabetical order, the way sums of money and dates are written, and many other things. Manufacturers gather data on the needs of each language community, together with information about keyboards, and this forms a “locale”. The main standards applying in this area are owned by individual manufacturers. Iceland has produced an Icelandic preliminary standard, FS130, on these matters.
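The distinction between a character's name, its position in a character set and its appearance can be illustrated with a short sketch. It uses Python's standard `unicodedata` module and the built-in `latin-1` codec; the helper name `describe` is only illustrative. The output gives names and positions and says nothing about how a font draws the letters.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the character's standard name and its one-byte position in Latin-1."""
    position = ch.encode("latin-1")[0]  # Latin-1 stores each character in a single byte
    return f"{ch}  {unicodedata.name(ch)}  position {position} (hex {position:02X})"

# The four specifically Icelandic letters thorn and eth, in both cases.
for ch in "þÞðÐ":
    print(describe(ch))
```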

Icelandic participation in work on standards

Iceland is a member of the International Standards Organisation, ISO, and the European Standards Association CEN. The Icelandic Standards Council sees to the introduction of Icelandic standards, while the country’s membership of the European Standards Association has meant that Iceland has been obliged to adopt the European CEN standards as Icelandic standards. Iceland has undertaken the operation of one technical committee, CEN/TC304, which is responsible for preparing European standards in connection with “globalisation”. The activities of the committee have been financed to a large extent by grants from the European Union, but Iceland has been in a position to keep abreast of the work and has had an influence on the production of standards that affect Iceland’s interests.

A recording standard has been prepared for TC304 on national needs in information technology. At present, work in the committee is being directed towards the creation of standards on searches in databases. It can be expected that more areas of language engineering will become the subject of work on standards, including methods of searching for texts on the Internet and speech recognition.

Iceland has created a standard on the Icelandic keyboard, ÍST 125, which for the most part has been observed by manufacturers, who have also complied with the standard FS 130 on the writing of monetary sums, alphabetical order, etc. FS130 needs to be revised, since it contains, amongst other things, obsolete letters that are used in editions of ancient texts; these do not exist in any character sets and no one is interested in incorporating them in the large character sets.
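Alphabetical order is one of the locale matters that software must get right. The sketch below is an illustration only: it assumes the conventional 32-letter Icelandic alphabet and is not taken from FS 130, and the name `icelandic_key` is hypothetical. It shows how sorting by raw character number misplaces the accented and special letters.

```python
# Assumption: the conventional 32-letter Icelandic alphabet, not copied from FS 130.
ICELANDIC_ALPHABET = "aábdðeéfghiíjklmnoóprstuúvxyýþæö"
RANK = {letter: index for index, letter in enumerate(ICELANDIC_ALPHABET)}

def icelandic_key(word: str) -> list[int]:
    """Sort key: the rank of each letter; letters outside the alphabet sort last."""
    return [RANK.get(ch, len(ICELANDIC_ALPHABET)) for ch in word.lower()]

words = ["ör", "bók", "ár", "dagur", "þak", "ær", "að", "ey"]

print(sorted(words))                     # order by character number
print(sorted(words, key=icelandic_key))  # order by the Icelandic alphabet
```

A production system would normally rely on the collation rules supplied with its locale data rather than a hand-written table of this kind.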

Character sets and letters

Most people have experience of the difficulties that frequently arise when documents are moved from one computer to another, particularly when different operating systems are involved. This is frequently due to the fact that the operating systems do not use the same character sets. Although a lot has been done on standardising character sets, there is still a long way to go before character sets will cease to be a problem, particularly where languages like Icelandic are involved, which contain letters that are used in few, if any, other languages.
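The effect of mismatched character sets can be reproduced in a few lines. The sketch below assumes that a text is written as Latin-1 bytes on one system and read on another that presumes a different 8-bit character set; the sample sentence is invented for the illustration, and the two alternative sets (ISO 8859-9 and IBM code page 437) are chosen only as examples.

```python
text = "Það er þoka á heiðinni"

# Bytes as stored by the first system, using Latin-1 (ISO 8859-1).
data = text.encode("latin-1")

# The second system wrongly assumes another 8-bit character set for the same bytes.
as_latin5 = data.decode("iso8859_9")  # ISO 8859-9, the Turkish variant of Latin-1
as_cp437 = data.decode("cp437")       # the original IBM PC code page

print(as_latin5)
print(as_cp437)

# Reading the bytes with the character set they were written in restores the text.
print(data.decode("latin-1"))
```

The bytes are unchanged throughout; only the table used to interpret them differs, which is why the ordinary letters survive while þ and ð do not.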

To explain the reasons for these problems and forecast the prospects for future development, it is probably best to give a short account of the history of the various character sets and the contexts in which they were, or are, used.

What is a character set?

The simplest definition of a character set is that it is a set of entries that each consist of the number (position) and name of a symbol. On the other hand, character sets do not define the appearance of the symbol, and the makers of fonts have a free hand in deciding, for example, what the letter “g” is to look like in various fonts. Frequently, however, the name indicates how the letter is to be written, e.g. the Danish “Ø” is called “LATIN CAPITAL LETTER O WITH STROKE” and the letter “Á” is called “LATIN CAPITAL LETTER A WITH ACUTE”. The letters “Д and “Þ” are called by their Icelandic names: “LATIN CAPITAL LETTER ETH” and “LATIN CAPITAL LETTER THORN”. The letter which Croatians call “Djet” has been named “LATIN LETTER D WITH STROKE”.

In the standardisation of character sets for computers, work has concentrated on harmonisation of these names, and not of the appearance of the characters. Thus, the letter “GREEK CAPITAL LETTER UPSILON” in the 8-bit Greek character set is called by the same name in the 16-bit character set 10646, so it will not be confused with “LATIN CAPITAL LETTER Y”, which is known in Icelandic as “Upsilon”.
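The standardised names can be looked up programmatically. The following sketch uses Python's standard `unicodedata` module; the labels in the dictionary are informal descriptions added for this illustration. It shows that letters which may look alike in a font, such as the Icelandic eth and the Croatian D with stroke, or the Greek upsilon and the Latin Y, are kept apart by name and number.

```python
import unicodedata

# Characters are written as escapes so that look-alike letters cannot be confused.
characters = {
    "Danish O with stroke": "\u00D8",
    "A with acute": "\u00C1",
    "Icelandic eth": "\u00D0",
    "Icelandic thorn": "\u00DE",
    "Croatian D with stroke": "\u0110",
    "Greek upsilon": "\u03A5",
    "Latin Y": "Y",
}

# Look up the standard name of each character.
names = {label: unicodedata.name(ch) for label, ch in characters.items()}

for label, ch in characters.items():
    print(f"U+{ord(ch):04X}  {ch}  {names[label]}")
```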

...

Work to be done regarding character sets

Iceland’s main interests no longer lie in having Icelandic letters included in the principal character sets, where they already exist; it is now a matter of urgency to concentrate on various matters that are based on the character sets and must be standardised. For example, attention must be given to the definitions of the sub-sets of and the design of fonts and their possible standardisation. There is reason for concern, and even to take action, because, under pressure from Turkey, font designers have prepared fonts in which the Icelandic letters þ and ð have been given the appearance of Turkish letters. These fonts are designed for 8-bit character sets, but because their names are the same in Unicode, they will be imported unchanged into the 16-bit environment.

It has not generally been recognised in Iceland that keeping special symbols in the world’s character sets and fonts involves high costs, and that these costs must be borne by Iceland. Few nations in Western Europe have as many special characters as Iceland in their alphabets. The letters þ and ð are not variants of other recognisable letters of the traditional alphabet, and are the only letters of which this can be said, with the exception of the German ß.

This causes Icelanders themselves the most problems, for example in writing on a keyboard. It is rare for any language-users not to have all the letters of their alphabet on their keyboards: the Danes and Germans have keys for all their special letters, as do the French, though they have resorted to having their accented letters in the position occupied by the numerals 1, 2, 3, ... etc. on the ordinary QWERTY keyboard and using the Shift key to produce the numerals. This has also been done for Czech.

The most important issues that lie ahead regarding standards and character sets are as follows:

1. First and foremost, Icelanders must realise that it will cost money to continue to have the Icelandic alphabet included in international standards and fonts. It must also be understood that this is an issue that concerns only Iceland, and consequently no one other than the Icelanders will bear this cost.
2. Special measures are needed to define the appearance of the Icelandic letters (þ and ð) in fonts, and as these letters are not used elsewhere, this is exclusively a matter of concern to Iceland.
3. Special measures are called for regarding GSM mobile phones. Rapid progress has been made towards the combination of mobile phones and computers into a single device. The Icelandic letters are not included in the character sets most commonly used in GSM mobile phones. Work on standards in this area has not been done by the same parties as have been active in the computer field, and much remains to be done.
4. Iceland has not been quick in having the letters used in Old Icelandic manuscripts included in Unicode, and it is probably too late to have them included in the 16-bit part of Unicode. Before taking steps to do this, Icelanders must define their needs regarding the preservation and processing of data in manuscripts from all periods of Iceland’s history.

Iceland was very fortunate in being entrusted with the operation of EU character committee TC304, which has been financed by the EU. This has guaranteed Iceland a lot of influence without a great deal of expense. An example of a very different situation is how unsuccessful Iceland has been in the telecommunications field, where Iceland has neither exerted any influence nor had the necessary knowledge of the problems that have been addressed.

It cannot be expected that the European Union will pay the bills for keeping the special Icelandic characters in computers in future. Iceland will have to be prepared to pay the costs it incurs in connection with its requirements regarding the alphabets used in computers. Confusion in character sets and fonts results in both direct and indirect costs, e.g. in correcting entries made on a keyboard, and these costs would be saved by standardisation.

3. The Written Language

Linguistic databases

All language engineering applications for the handling of the written language depend heavily on the existence of two types of database that must be developed systematically for every language. It is necessary to establish, on the one hand, a corpus for the language, and on the other a lexicon.

“Corpus” here refers to a set of texts of various types, assembled according to certain rules and for a certain purpose. The rules could specify, e.g. the length of the texts, their type (literary texts, academic texts, etc), their authors (age, gender, origin), etc. These rules are set in order to be able to maintain that the corpus is representative of whatever it is supposed to be representative of, and not a chance agglomeration of texts from which no reliable conclusions can be drawn.

Nowadays, corpuses are almost without exception composed of texts in digital form, and in the past few years large corpuses have been built up for various languages. Use has been made of these corpuses in various areas, not least in lexicography. Corpuses are also necessary for designing parsing programs. They make it easier to see the possible contexts in which words can occur, and also what is common and what is rare in the language. Information of this type can be used to increase the likelihood of correct parsing of ambiguous words or phrases.
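How a corpus shows what is common and what is rare can be sketched with a simple frequency count. The example below is a toy illustration only: the “corpus” consists of three example sentences quoted elsewhere in this report, whereas a real corpus would contain millions of running words, and the names `tokenize` and `frequencies` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a three-sentence toy corpus; a real corpus would be vastly larger.
corpus = [
    "Ég bý á Íslandi",
    "Stelpan á þessa bók",
    "Þessi á er straumhörð",
]

def tokenize(sentence: str) -> list[str]:
    """Lower-case the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())

# Count how often each word form occurs across the whole corpus.
frequencies = Counter(token for sentence in corpus for token in tokenize(sentence))
print(frequencies.most_common(3))
```

Counts of word forms alone do not distinguish the different uses of an ambiguous form such as á; for that, the corpus must also be analysed grammatically.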

The University Dictionary’s word frequency list Íslensk orðtíðnibók, published in 1991, was based on a corpus assembled for this purpose. It did not, however, include various types of text; for example, it contains no samples of the spoken language. In addition, the texts it used are now a decade or more old; such corpuses have to be updated constantly if they are to contain the vocabulary that is in use at any given time. The main criticism, however, is that this corpus is very small, consisting of only 500,000 words, while the corpuses that are now being developed for various languages in Iceland’s neighbouring countries contain tens or even hundreds of millions of words.

Computerised lexicons are also necessary in order to be able to make most language engineering tools. In many ways, such databases are similar to traditional dictionaries: they contain information on the pronunciation of words, parts of speech, meaning and syntactic position. What particularly distinguishes between traditional dictionaries and lexicons for use in language engineering is that the data contained in the latter must be far more exhaustive and varied, and it must also be set out much more systematically.
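The kind of systematic record such a lexicon might hold can be sketched as a small data structure. Everything in the example is an assumption made for illustration: the field names, the class `LexiconEntry` and the index `by_form` are hypothetical, pronunciation and syntactic information are omitted, and the four entries are taken from the word form á, which is discussed in the section on parsing.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class LexiconEntry:
    lemma: str           # dictionary headword
    part_of_speech: str  # e.g. "noun", "verb", "preposition"
    gloss: str           # short English meaning
    # Maps each word form of the lemma to a grammatical description of that form.
    forms: dict[str, str] = field(default_factory=dict)

entries = [
    LexiconEntry("á", "preposition", "on, in", {"á": "uninflected"}),
    LexiconEntry("eiga", "verb", "to own", {"á": "present tense"}),
    LexiconEntry("á", "noun", "river", {"á": "nominative, accusative or dative"}),
    LexiconEntry("ær", "noun", "ewe", {"ær": "nominative", "á": "accusative or dative"}),
]

# Index the entries by word form so that a program can look up every possible analysis.
by_form = defaultdict(list)
for entry in entries:
    for form in entry.forms:
        by_form[form].append(entry)

for entry in by_form["á"]:
    print(entry.lemma, entry.part_of_speech, entry.gloss, entry.forms["á"])
```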

No computerised lexicon of Icelandic of this type exists, though the Nordic dictionary base (norræn orðabókastofn) that the University Dictionary is working on with Nordic financial support could become the basis for such a lexicon. It contains a lot of information of various types about the basic vocabulary of the Icelandic language, registered in a systematic way. But this database contains far from all the information that would be necessary for it to be of use in language engineering.

It should also be pointed out in this context that digital lexicons of this type are not only of value in the field of language engineering. Their existence could also be of crucial importance for the future of lexicography (dictionary production) in Iceland. There is a great shortage of good dictionaries between Icelandic and other languages. Various examples could be given of publishing companies that have taken on more than they could handle when publishing dictionaries, as it has nearly always been necessary to begin this work from scratch. If a high-quality digital dictionary basis existed, to which companies could have access, the premises for compiling dictionaries in Iceland would be completely transformed.

Parsing

Parsing is generally done by computer, and involves various types of linguistic analysis of texts. Such analysis is the foundation for various types of language engineering, e.g. machine translation, grammar-checking programs, etc.

Parsing may involve varying degrees of accuracy. Sometimes the text is only analysed into parts of speech, but the process may also extend to phrases and an analysis of the relationships between them, i.e. sentence structure. Such an analysis can seldom be completely correct, though very good results can be obtained if the analysis is based on a carefully-prepared database. This involves, on the one hand, a text corpus and a lexicon in which the program searches for information on individual word forms, and on the other hand a set of rules containing information on permissible sentence structures and phrase structures in the language.

One of the main problems in parsing lies in the identification of ambiguous word forms. For example, the word á in Icelandic has many meanings. It may be a preposition, as in Ég bý á Íslandi (“I live in Iceland”); the present of the verb eiga (“to own”), as in Stelpan á þessa bók (“The girl owns this book”); the nominative, accusative or dative of the feminine noun á (“river”), as in Þessi á er straumhörð (“This river is fast-flowing”), and the accusative or dative of the feminine noun ær (“ewe”), as in Hún gaf mér flekkótta á (“She gave me a speckled ewe”). In addition, it could be the name of the letter á, the exclamation á! (“ow!”, indicating pain) and could perhaps have other meanings.

A sense of the context is necessary in order to parse á correctly. If á stands immediately after a verb, or before a noun, as in Ég bý á Íslandi, then it is highly likely that it is a preposition, since verbs and nouns will hardly be found in this position. On the other hand, if á stands immediately after a noun and before a demonstrative pronoun, as in Stelpan á þessa bók, then it is most likely to be a verb, since it is unusual to find prepositions, not to mention nouns, in this position.
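The contextual reasoning above can be sketched as a tiny rule-based tagger. This is only an illustration under stated assumptions: the lexicon is a hypothetical toy covering the two example sentences, the function names are invented here, and the two rules are simply the ones described in the text, not a real parser.

```python
# Toy part-of-speech lexicon (hypothetical; covers only the example sentences).
LEXICON = {
    "ég": "pronoun",
    "bý": "verb",
    "íslandi": "noun",
    "stelpan": "noun",
    "þessa": "demonstrative",  # the pronoun þessa ("this")
    "bók": "noun",
}

def tag_a(tokens, i):
    """Guess the part of speech of 'á' at position i from its neighbours."""
    prev_tag = LEXICON.get(tokens[i - 1].lower()) if i > 0 else None
    next_tag = LEXICON.get(tokens[i + 1].lower()) if i + 1 < len(tokens) else None
    # After a noun and before the pronoun: most likely the verb (from 'eiga').
    if prev_tag == "noun" and next_tag == "demonstrative":
        return "verb"
    # After a verb, or before a noun: most likely a preposition.
    if prev_tag == "verb" or next_tag == "noun":
        return "preposition"
    return "unknown"

def tag_sentence(sentence):
    """Return one guess for every occurrence of 'á' in the sentence."""
    tokens = sentence.split()
    return [tag_a(tokens, i) for i, t in enumerate(tokens) if t.lower() == "á"]
```

Calling tag_sentence on the two example sentences yields a preposition for the first and a verb for the second; any context not covered by the two rules is reported as unknown rather than guessed.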

Syntactic analysis is generally sufficient in the above examples, because different parts of speech are involved. In order to distinguish between lexical forms that are the same parts of speech, semantic analysis could be needed. In the examples Ég sá straumharða á (“I saw a fast-flowing river”) and Ég sá flekkótta á (“I saw a speckled ewe”), the word á stands in the same syntactical environment, but the adjective shows that in the first case the noun referred to is (in the nominative) á (“river”) and in the second ær (“ewe”). In the sentence Ég sá þessa á í gær (“I saw this ewe/river yesterday”) on the other hand, it would be impossible to say which noun was meant. Admittedly this could be clarified by the preceding or following sentence (e.g. if the next sentence were “It was brown and muddy with melt-water” or “It had lost its lamb”), but it is not certain that a parsing program could make use of this type of information.

Correction (checker) programs

Various programs exist which read digital text and point out errors, or possible errors, in it. The simplest of these, spell checkers, search for spelling errors. Such programs have an in-built list of correctly written lexical forms. They read the text, word by word, and compare the words with those in the list. If they find a word form that is not in the list, they stop and draw the user’s attention to the lexical form. Sometimes what is involved is a correctly written word that is not in the list, in which case the user is given the opportunity of adding it to the list, but if the word is incorrectly written it can be corrected. Some such programs, e.g. Púki, contain data on the sorts of words that can exist in the language.

Programs of this type are generally not based on parsing and therefore do not detect errors where a permissible word form is used in the wrong position. In the sentence Ég hitti Þórarinn (“I met Thórarinn”), Þórarinn is in the accusative case and should contain only one n; however, because Þórarinn with two ns is a permissible form (the correct form of the name in the nominative case), the program would not notice this. The second word in the sentence Vatnið síður (“The water boils”) is incorrectly written; it should be spelt sýður (the correct inflection of the verb sjóða, “to boil”). However, because síður is a permissible lexical form (as an adverb meaning “less”), the program would not notice this. For it to do so, further analysis of the text would be necessary.
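The word-list approach and its blind spot can be sketched in a few lines. The word list below is a hypothetical handful of forms chosen for the examples in the text, not the data of Púki or any other real program.

```python
import re

# Hypothetical miniature list of permitted word forms.
WORD_LIST = {"ég", "hitti", "þórarin", "þórarinn", "vatnið", "sýður", "síður"}

def unknown_words(text, word_list=WORD_LIST):
    """Return the words of the text that are not in the word list, in order."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in word_list]
```

Because every word is checked in isolation, the sentences Ég hitti Þórarinn and Vatnið síður pass without comment: each word is a permissible form somewhere in the language, so only a misspelling that produces a non-existent form is reported.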

Hyphenation programs, which break polysyllabic words at the end of a line in accordance with the rules of the relevant language, are of a similar type. They are generally based on rules about the sequences of letters where words may be hyphenated. In Icelandic, for example, there must be at least one vowel in each part, so that the word skrjóðs may not be written skrj-óðs or skrjó-ðs. Nor may only one vowel stand in a different line from the rest of the word, and the second part of a hyphenated word should generally begin with a vowel. There are many exceptions to this last rule, however, particularly in compound words. Hyphenation programs are therefore unable to work exclusively according to general rules, but must also be based on a database listing the main exceptions.
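The combination of general rules and an exception list might be sketched as follows. This is a simplified illustration: the vowel set, the function names and the single-entry exception list are assumptions made for the example, the words hestur and borðstofa are illustrative choices not taken from the report, and the sketch ignores complications such as diphthongs.

```python
VOWELS = set("aáeéiíoóuúyýæö")

# Hypothetical exception list: compounds mapped to their permitted breaks.
EXCEPTIONS = {"borðstofa": ["borð-stofa"]}

def has_vowel(part):
    return any(ch in VOWELS for ch in part)

def break_points(word):
    """Return the permitted hyphenations of a word under the three rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    result = []
    for i in range(1, len(word)):
        first, second = word[:i], word[i:]
        if not (has_vowel(first) and has_vowel(second)):
            continue  # each part needs at least one vowel
        if len(first) == 1 or len(second) == 1:
            continue  # a lone vowel may not stand on a line of its own
        if second[0] not in VOWELS:
            continue  # the second part should generally begin with a vowel
        result.append(first + "-" + second)
    return result
```

Under these rules skrjóðs has no permitted break at all, hestur can be broken only as hest-ur, and the compound is handled by the exception list because its natural break falls before a consonant.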

Icelandic spell-checker programs do exist, e.g. Friðrik Skúlason’s Púki, and a few Icelandic hyphenation programs have been made, e.g. Skipta, made by the Icelandic Language Centre. However, these programs seem never to have gained significant currency, at least not recently, because they are not compatible with the main word-processing programs on the market.

Various grammar and style checker programs have been developed overseas. These can draw attention to word order, word usage, etc., depending on how sophisticated they are. Many computer users in Iceland have experience of how developed such programs are for English texts. However, parsing is the prerequisite for such programs, and none such exist for Icelandic.

Dictionaries

Dictionaries of various types are necessary aids to the writing of texts. Chief among these are spelling dictionaries, such as the Réttritunarorðabók (orthographic dictionary) produced by the Icelandic Language Council and the National Centre for Educational Materials, synonym dictionaries, such as Íslenska samheitaorðabók, and, of course, traditional dictionaries containing information on inflections and explanations of meanings, such as the Íslenska orðabók handa skólum og almenningi (“Icelandic Dictionary for Schools and the General Public”) produced by the Icelandic Cultural Fund and the publishing house Mál og menning. In addition to these, mention can of course be made of various two-language dictionaries, e.g. Icelandic-English and English-Icelandic dictionaries.

In many places abroad, there has been rapid progress in recent years towards making such dictionaries digital and connecting them directly with word-processing programs. The user is then able to call up information from the dictionary at any time and use it in connection with whatever he is doing. Obviously, this is much quicker and more convenient than having to keep many dictionaries at hand and looking up words in the traditional way. It can also be argued that the existence of dictionaries that run on a computer makes it more likely that people will use the information they contain, so producing texts of higher quality.

No computerised Icelandic dictionaries exist, with the exception of the English-Icelandic dictionary Aldamóta which is now being published by Mál og menning. Mál og menning is in fact also planning to publish its Íslensk orðabók (Icelandic Dictionary) in a computerised form later this year, but it is not yet clear what form the version will take and how it will function together with word-processing programs.

Both the orthographic dictionary Réttritunarorðabók and the thesaurus Íslensk samheitaorðabók exist in computerised form, and it ought therefore to be a simple matter to publish them in computerised form and use them in the way described above. However, in this context it must be borne in mind that they were compiled as traditional books, and their use as computerised databanks would be based on completely different premises. Thus, a lot of work would have to be put into changing their form and layout if they were to be used in a computerised form, and it is conceivable that it would be just as simple to start again from scratch.

4. The Spoken Language

Speech simulators

Speech simulators (voice simulators) are programs that pronounce text that is written on a computer. Formerly, they generally consisted of special devices in which the sounds were produced in special electronic circuits, but the most common speech simulators nowadays are programs that are run on an ordinary PC computer and make use of its sound system to produce the sounds. Special commands are then included in the word-processing programs in order to have text that is being examined or written in the program converted into speech. The user needs only the program and an ordinary computer equipped for multi-media effects.

How do speech simulators work?

There are various types of speech simulators on the market that work in different ways in some respects. The common denominator is that they take text and make a phonetic transcription of it in the same way as is done in dictionaries. Special programs are used for this, each of which is valid for a separate language. They contain rules on how certain elements and words are to be transcribed phonetically. A good knowledge of the phonology of the language involved is needed for the construction of such programs.

Programs of this type may vary as to their thoroughness. They may, for example, ignore intonation or the difference between accented and unaccented syllables, in which case the quality of the speech is reduced. In Swedish, for example, it is necessary to make a distinction between the words anden (meaning “the spirit”) and anden (meaning “the duck”): they are not pronounced the same way in Swedish, even though a simple phonetic transcription might indicate that they are; in fact, a distinction is made between them by means of the intonation of the vowels.

After the text has been transcribed, different methods come into play:

One method is to imitate the processes that take place in human speech organs. Two approaches are used. For a long time, special devices were used to generate speech sounds. They contained logic circuits (sound filters) which imitated the workings of the speech organs. As computers became more powerful, digital models of the speech organs were designed, and these could be run on ordinary computers, so rendering the special devices unnecessary. Instead, the models make use of the sound card of the computer to pronounce the words.

The advantage of speech simulators that imitate the speech organs in one way or another is that, like the phonetic symbols, they are international. This is because people’s speech organs are the same, irrespective of what language they speak. By using a great number of phonetic symbols and setting precise phonetic values for each language, it is easy to adapt such speech simulators to a new language. The drawback to this method is that it has not proved possible to achieve high-quality sound using simple models of the organs of speech.

It should be stated here that these models are known under various names, but for the most part these all refer to the same phenomenon. In English they are referred to under the terms vocal-tract model, formant synthesis, Linear Predictive Coding, etc. To improve quality it has proved necessary to make the models so complex that the advantages of simplicity are lost, and consequently it has been found easier to achieve high quality by using another method.

The other method is to use fragments of speech to produce the sounds of speech. It is easy to produce sophisticated speech simulators using this method if the vocabulary range is restricted. Simple applications of this technique that everybody is familiar with are telephone answering machines, the speaking clock and telephone bank statements. These use few words, so the task is much simpler than when ordinary speech is involved. Nevertheless, this method has become dominant in the design of all the newer types of speech simulators that are able to pronounce any word in a given language. In this method, a person’s speech is recorded and then broken down into units that are re-assembled according to the phonetic transcription of the text.

How this is done depends on the power of the computer. The smallest units are not traditional phonetic symbols but pairs of sounds (diphones). Although this calls for the storage of several tens of thousands of units, this is now well within the capacity of an ordinary personal computer, which is able to pronounce any text using this method. The drawback is that to improve the quality of the sound, the phonetic units must be made longer and their number increased, and the number can become too great.
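The idea of re-assembling speech from pairs of sounds can be illustrated with a short sketch. The phone labels, the silence marker "_" and the unit store are placeholders invented for the example; a real simulator would use a proper phonetic transcription and recorded audio.

```python
def to_diphones(phones):
    """Turn a phone sequence into overlapping pairs, padded with silence '_'."""
    padded = ["_"] + list(phones) + ["_"]
    return [(padded[i], padded[i + 1]) for i in range(len(padded) - 1)]

def synthesise(phones, unit_store):
    """Concatenate the stored samples for each diphone, in order.

    unit_store is a hypothetical mapping from a diphone to a list of
    audio samples cut from a recording.
    """
    samples = []
    for unit in to_diphones(phones):
        samples.extend(unit_store[unit])
    return samples
```

A sequence of four phones gives five diphones, each running from the middle of one sound to the middle of the next, which is why the splices fall in the comparatively stable part of each sound.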

In primitive speech simulators of this type, the phonetic units are often badly spliced together, and the listener notices an annoying crackling. Various filters and programs are used in newer versions which make the splicing almost perfect, in addition to which it is possible to modulate the speech, controlling the speed, intonation and accentuation.

A large number of problems concerning speech simulators remain to be solved before they will be really pleasant to listen to. These are of both a technical and a phonetic nature. Their intonation and accentuation are not yet natural, but a great deal of work is being done abroad to tackle these problems.

Speech simulators for English are common. They are easy to understand, and the latest and best models are quite pleasant to listen to. They are easily available for ordinary PC computers. There is one Icelandic speech simulator currently in use. Based on a simulator from the Swedish company Infovox, it was adapted for Icelandic by the University of Iceland’s Institute of Linguistics, the Faculty of Engineering and the Icelandic Federation of the Handicapped in the period 1989-93. Most of this work was done by a linguist, Pétur Helgason, and was supported by Nordiska nämden for handicappfrågor.

This simulator has been used by the blind, and has produced good results. All the same, the quality is not sufficient for it to be used by the general public. Modifications have recently been made, and a new version is expected to be released in a few weeks’ time. This speech simulator is of the vocal-tract type.

In Iceland, the raw material for speech simulators would consist of phonetically transcribed word lists and recorded databases of texts that could be used to derive statistical information on features such as length, accentuation and intonation. It would also be useful to have automatic semantic analysis facilities, which would make it easier to put in the correct intonation.

Samples of speech simulators from Infovox, which is now owned by Telia, can be found on the website www.promotor.telia.se/infovox/. Samples of speech using both the older and the newer (diphone) versions of simulators can be heard on the website. Telia is not planning to develop the new version for Icelandic: the cost involved is considered too high, and the Icelandic market too small.

Speech recognition

In speech recognition, a microphone is connected to the sound card of a computer. A high-quality microphone must be used to minimise background noises, as such noises cause problems when they are detected in addition to the speaker’s voice and distort what the computer detects. Speech recognition programs then break the sounds down, identify the frequency of each unit and compare it with a large database of phonetic units stored in the computer’s memory. These programs do not work with individual phonemes but with larger linguistic units. Statistical methods are used, with the words most likely to be correct being chosen.

In order to achieve this, the program should preferably know how the user speaks; consequently, the user has to say a certain number of specific known sentences before starting to use the program for the first time. The program contains an in-built knowledge of the language so that information on the language and the possible positions of words in sentences facilitates recognition to some extent. Word position in the sentence is also used to distinguish between homophones or words that sound fairly similar, e.g. the English words to, too and two. The computer guesses which word is involved on the basis of the context, e.g. in the examples “walk to London”, “walk two miles” and “walk too far”.
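The way context settles a choice between homophones can be sketched with simple word-pair statistics. The counts below are invented for the illustration and the function name is hypothetical; a real program would derive such figures from a large text collection.

```python
# Hypothetical counts of how often each spelling is followed by a given word.
BIGRAM_COUNTS = {
    ("to", "london"): 50, ("two", "london"): 0, ("too", "london"): 0,
    ("to", "miles"): 1, ("two", "miles"): 40, ("too", "miles"): 0,
    ("to", "far"): 2, ("two", "far"): 0, ("too", "far"): 30,
}
HOMOPHONES = ("to", "too", "two")

def choose_homophone(next_word):
    """Pick the spelling most often seen before next_word."""
    key = next_word.lower()
    return max(HOMOPHONES, key=lambda w: BIGRAM_COUNTS.get((w, key), 0))
```

With these counts the program writes "to" before "London", "two" before "miles" and "too" before "far", mirroring the three examples in the text.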

A great deal of computer capacity is needed for this processing. The most common PC computers, with a speed of 200-300 MHz, are scarcely able to do more than cope with the task, and for example are only able to investigate three candidate words. A lot of progress has been made recently on speech recognition, and even though the methods used rely first and foremost on large databases and computers with a great deal of processing capacity, it has in fact proved easier to achieve sophisticated speech recognition than sophisticated speech simulation.

Large databases of sound samples, spoken by large numbers of people and covering all phonetic elements in the language, are needed in order to design speech recognition programs. Then, from these samples, it is necessary to process data on the sounds of the language, statistical data on the position of words in sentences, etc. More specialised information on these points can be found in the language engineering websites listed in Appendix 7, and the reader is also referred to Appendix 3. Appendix 7 contains references to the main manufacturers and evaluations of the speech recognition programs currently on the market.

5. Machine Translations and Web Searches

Machine translations

The following survey of machine translations was prepared for the working group by Stefán Briem.

What are machine translations?

Machine translations of written material from one language to another involve the use of specially designed computer programs which accept text in a particular language, the source language, and deliver its contents in a corresponding text in another language, the target language.

If the program works independently without human involvement from the time that it starts running to the time that it delivers the text, this is referred to as automatic translation. However, greater quality of translation can generally be obtained by using a combination of the computer and the human brain. This can involve, e.g., the computer referring problems to the user as it encounters them, and then continuing to translate automatically after receiving the answer, in which case the process is semi-automatic. Another method is to have a computer make a rough translation, after which a human translator takes over, correcting errors and making changes to the style. Yet another method is to prepare the text before it is translated by machine, so making it easier for the translation program to deal with.

Machine translation is one of the main topics in language engineering and is connected with most of its other aspects. Research and development work on machine translation began fifty years ago, but progress has been far slower than was initially expected. The most difficult part of machine translation is semantic analysis, i.e. making programs analyse the meaning of the text to be translated, and the quality of translation very much depends on how well this is done. Computers will never replace human translators, but they are already able to speed up their work and lighten their load considerably, at least in restricted fields, if they are used properly. Little attention has been given to machine translation in Iceland, with the exception of spare-time activities by amateurs.

Speech and sign language

The same methods as are used for machine translation of written language are used for machine translation of speech, but two further components are needed in addition: a speech recognition (speech-to-text) device, which converts speech to text, and a speech simulator, which converts text to speech. Icelandic speech simulators exist, but need to be improved. On the other hand, no Icelandic speech-to-text device exists, and it is a far more difficult task to make a computer understand speech than convert text to speech. In the last two or three years much progress has been made on developing speech-to-text technology abroad. For example, a speech-to-text program for English for use on PC computers came on the market late in 1998. It is expected that machine translation of speech will make it possible for people of different nationalities to talk to each other, e.g. on the phone, each speaking their own language. The same technology will then be used to have computers take over the role of interpreters up to a certain point.

Language engineering may also be of assistance to people who are unable to express themselves by means of ordinary language, using instead sign language based on movements of the head and hands. It is possible that with special equipment, these movements could be converted into signals recorded in a computer which could then be converted by language technology into ordinary language to some extent, and vice versa. Little research and development in this area has been done abroad, and even less in Iceland.

The aim

The aim of machine translations of written language may take various forms according to the application intended and the quality level desired.

To make a search on the World Wide Web, for example, it is often enough for an Icelandic computer user to have a rough-and-ready translation into Icelandic to see the contents of a text, without having it all perfectly translated. Users of the AltaVista search engine on the web are probably familiar with the translation service provided there to translate texts and web pages automatically and mechanically between English, on the one hand, and French, Italian, Portuguese, Spanish and German, on the other. The translations done by AltaVista are not reliable, but they generally give a rough idea of the substance of the text in the target language. But Icelanders, like most of the world’s nations, cannot use their own language there.

What is needed for machine translations from and into Icelandic?

What is needed is, firstly, a program to translate from at least one foreign language into Icelandic and a program to translate from Icelandic into at least one foreign language. Other languages could then be accessed by connection to other translation programs. The guiding principle regarding such connections is to use one common intermediate language for translations between all other languages. The international language Esperanto has been put forward as such an intermediate language because of its simplicity, its logical structure and the fact that it contains much less ambiguity than traditional languages. Another possibility is to use an intermediate language that is not a natural language, but some sort of computer processing language which would store linguistic information in another form, and would not be possible to read or speak in the same way as normal languages.
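The attraction of a common intermediate language can be seen from a simple count. The sketch assumes one translation program per direction and an intermediate language that is not itself one of the languages being served; the function names are invented for the illustration.

```python
def direct_programs(n):
    """Programs needed to translate directly between every pair of n languages."""
    return n * (n - 1)

def pivot_programs(n):
    """Programs needed when every language is translated to and from one pivot."""
    return 2 * n
```

For ten languages, direct translation would call for 90 programs, whereas ten programs into the intermediate language and ten out of it suffice. Adding Icelandic to such a scheme requires only the two programs mentioned at the start of this section.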

Secondly, the translation programs would have to have access to extensive computerised data about the languages involved in the translation. For Icelandic, a computerised lexicon would be needed to connect it with at least one other language, a computerised inflectional system, covering all inflectional forms of all Icelandic words, and information of many types on the use of individual words and phrases. This last part would also have to include detailed computerised information on context, which is necessary in order to decide the issue when a word or phrase is capable of having more than one meaning. The correct meaning depends largely on the context and an understanding of the linguistic environment. Human speakers generally find out the correct meaning unconsciously, drawing on their experience and knowledge, but a translation program needs to have a built-in method in order to choose the likeliest meaning, a method that is carefully thought out when the program is written and designed to run on the computer’s knowledge, and also perhaps on its experience.

Where should work begin?

The most urgent task for Iceland in the field of machine translation, in the short term, is in my opinion to compile basic software for the Icelandic language which would also be of use in other areas of language engineering and perhaps in other contexts too, and to make it accessible. Two main projects would be involved:

1. To establish and maintain a large bi-directional lexicon between Icelandic and another language, with no restrictions on access and use, including putting it on the Internet. This lexicon should be set out in such a way as to be usable for various applications, not only for machine translation but also other projects that institutions, companies and individuals might undertake in the field of language engineering, e.g. hyphenation programs, spelling checkers and auxiliary programs for translators; these could be built into user programs such as the Word word-processing program.

2. To establish and maintain a computerised inflectional system for Icelandic, with the same sort of access as to the lexicon described above, with exhaustive information on the inflection of Icelandic words. A computerised inflectional system of this type would be of use not only in various areas of language engineering, but also as a reference tool for schoolchildren and the general public.

These two projects are of such a nature as to make it natural that they should be financed largely or entirely from public funds. Otherwise there would be a danger that they would not be carried out properly and that access to the results would not be as free as would be necessary to ensure that good use were made of them. These projects are also the prerequisites for good results in subsequent projects.

With regard to the long term, it is important not to wait with the next stages, but to start as soon as possible, and if funding is limited then efforts should be concentrated mainly on projects that will produce financial advantages for Iceland in the foreseeable future. The following are among the attractive projects:

1. The design of limited systems for machine translation. On the one hand, these would be in areas where it is not important that the translations should be flawless or of a high quality, and on the other in areas where the text is either very simple or characterised by a great deal of repetition. One of the most useful projects of this type is machine translation of the type offered by AltaVista (see above).

2. The development of a speech recognition program for Icelandic and improvements to the best of the existing speech synthesisers. An Icelandic speech recognition program would pay for itself in terms of massive savings on the labour involved in entering text on a keyboard.

Some of the projects named here would call for collaboration with other nations, both on research and the design of equipment and on making the fruits of the work accessible in the form of common user software on the international market. But it is clear that where Icelandic is directly involved, the work would have to be done by Icelanders themselves.

Icelandic on the World Wide Web – search engines

It is important that there should be accessible search engines on the Internet capable of searching for material in Icelandic.

Icelandic characters

In this context, the most important thing is, of course, that the search engines should accept Icelandic characters and search for them correctly. They must be able to relate upper and lower case letters to each other, recognising, e.g., that Þ corresponds to þ. Some common search engines seem to offer this facility already, e.g. AltaVista (www.altavista.com), which is one of the most popular search engines. The main requirement for ensuring that Icelandic letters will not cause problems is to have them included in the most commonly-used character sets, as has now been done with ISO 8859-1. The most popular search engines should be monitored and their sponsors should be encouraged to ensure that it is possible to use Icelandic characters in them.
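Both requirements, matching regardless of case and coverage by the character set, can be checked in a few lines. The function names are invented for this sketch; the calls used are standard Python string methods.

```python
def same_ignoring_case(a, b):
    """Compare two words without regard to upper and lower case."""
    return a.casefold() == b.casefold()

def fits_latin1(text):
    """Report whether every character of the text exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False
```

The first function treats Þórarinn and þórarinn as the same word, and the second confirms that all the special Icelandic letters (á, ð, é, í, ó, ú, ý, þ, æ, ö and their capitals) can be stored in ISO 8859-1.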

Inflections

Some search engines can cope with searches for various inflectional forms of the same word, which makes for far more efficient searches. The degree of complexity involved depends on the inflectional rules of the language. For some languages, it is enough to identify the stem of the word and search for all the forms beginning with that stem. More complex procedures have to be developed for languages such as Icelandic, however, due to the occurrence of internal inflectional changes, i.e. the stems of the words, and not only the endings, can be affected by inflections (e.g. the vowel changes in saga - sögu; fara - fór).
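The difference between searching by stem and searching by inflectional forms can be sketched as follows. The inflection table is a small hand-made sample and the sentences are invented for the illustration; a real system would draw on a full computerised description of Icelandic inflection.

```python
# Hypothetical inflection table: each headword maps to some of its forms.
INFLECTIONS = {
    "saga": {"saga", "sögu", "sögur", "sögum", "sagna"},
    "fara": {"fara", "fer", "fór", "fóru", "farið"},
}
FORM_TO_HEADWORD = {
    form: head for head, forms in INFLECTIONS.items() for form in forms
}

def stem_search(stem, documents):
    """Find documents containing a word that begins with the stem."""
    return [d for d in documents
            if any(w.startswith(stem) for w in d.lower().split())]

def inflection_search(word, documents):
    """Find documents containing any listed form of the word."""
    head = FORM_TO_HEADWORD.get(word.lower(), word.lower())
    forms = INFLECTIONS.get(head, {head})
    return [d for d in documents if forms & set(d.lower().split())]

DOCUMENTS = ["Þetta er löng saga", "Ég las sögu", "Hún fór heim"]
```

A search on the stem sag finds only the first sentence, because the vowel change hides sögu from it, while a search through the inflection table finds both the first and the second.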

However, it may not be necessary for Icelanders to embark on special measures in this area. Attention will have to be given to similar problems in connection with languages that have a far greater currency (e.g. German, Danish and Arabic), and it is likely that the solutions found for those languages can be adapted to the needs of Icelandic when designing powerful search engines. But it could be worth investigating whether analytical tools such as Friðrik Skúlason’s spell-checker program for Icelandic, which identifies inflected forms, could be of use.

Logical operators

Logical operators can be used in many search engines to define a search more closely. For example “and” is used to limit the search to documents that contain two specified words, “or” is used to request all documents that contain either of two words and “not” to exclude documents containing particular words. Thus, a search for “(grammar or parsing) and Icelandic and not English” would produce documents mentioning grammar or parsing in connection with Icelandic, but without reference to English.
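The effect of the three operators corresponds to set operations on the documents containing each word. The sketch below evaluates the example query over four invented documents; the index structure and function names are assumptions made for the illustration, not a description of any particular search engine.

```python
def build_index(documents):
    """Map each lower-cased word to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def example_query(index):
    """Evaluate: (grammar or parsing) and Icelandic and not English."""
    def has(word):
        return index.get(word, set())
    # "or" is set union, "and" is intersection, "and not" is set difference.
    return ((has("grammar") | has("parsing")) & has("icelandic")) - has("english")

DOCUMENTS = {
    1: "Icelandic grammar notes",
    2: "parsing Icelandic text",
    3: "English and Icelandic grammar",
    4: "parsing English text",
}
```

Here the query returns documents 1 and 2: document 3 is excluded because it mentions English, and document 4 because it does not mention Icelandic.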

Most if not all search engines use exclusively English words as logical operators. While it is easily possible to imagine a search engine that recognised the Icelandic words og, eða and ekki (“and”, “or” and “not”), this can hardly be regarded as a priority task. The English words are used for this purpose in many contexts, e.g. in many of the most common programming languages, and many people are used to using them. Most people look on these words in these contexts as symbols of a sort, not as real words. Finally, if people are against using them, it should be pointed out that in many search engines it is possible to use special symbols instead: & for and, | for or and ! for not.

Artificial intelligence

In the AltaVista search engine, instead of entering search words, it is now possible to write in a question. (An example from AltaVista’s own advertisement: “What is the latest news coverage on the Monica Lewinsky scandal?”) The search engine then uses artificial intelligence to identify the types of document sought and retrieve them. It would certainly be convenient if it were possible to do the same in Icelandic. On the other hand, this technology is in its infancy, and so far this facility is not available in any language other than English. It is probably not appropriate at this stage to make it a priority to develop such a service for Icelandic, but rather to keep abreast of developments and try to join in when other languages are included.

6. Institutions Involved in Language Engineering

The committee considers it clear that the basic work in Icelandic language engineering cannot be done without state support. This applies to the development of extensive databases of the written and spoken language, which all agree are the prerequisite for all further work, i.e. the construction of individual programs and marketable products. Such databases have been established in most of Iceland’s neighbouring countries with state aid, both from individual governments and from the European Union. In view of the smallness of the Icelandic language community, and consequently of the market, it is out of the question to expect private companies to shoulder the costs involved in this venture.

University institutions

The committee considers it natural to utilise as far as possible the state institutions that already exist in related fields. In particular, there are three institutions connected with the University of Iceland: the University Dictionary, the Icelandic Language Centre and the University’s Institute of Linguistics. These institutions have three important things in common. Firstly, their roles are defined in such a way in laws and regulations as to make their involvement in this area natural. Secondly, they all have important databases that would be of use in developing Icelandic language engineering. Thirdly, their employees have specialised skills that are important in this work.

According to the regulations governing the University Dictionary, it is “a scientific institution for the study of words”, one of the roles of which is “to engage in research of all types into the vocabulary of Icelandic and its development. The institution also plays an important role in providing services to specialists and the general public.” Amongst other things, the Dictionary has a very extensive database of computerised Icelandic texts from the past 15 years. In addition, it has worked on the compilation of the basis of an Icelandic dictionary containing the basic vocabulary of the language. These two factors are important in developing the necessary databases.

The Icelandic Language Centre is run by the Icelandic Language Council in collaboration with the University of Iceland. By law, the main role of the council is “to work for the development of the Icelandic language and its preservation in speech and writing”, and it is to act “in an advisory capacity to the government on questions concerning the Icelandic language.” If it is accepted that the involvement of the government in developing Icelandic language engineering is an important contribution towards the protection of the Icelandic language, then it is natural that the Icelandic Language Council and the Icelandic Language Centre should also be involved. Furthermore, the Icelandic Language Centre possesses various materials that would be of use in developing databases, e.g. lexical lists, collections of questions on linguistic points, etc.

The Linguistics Institute is one of the institutes of the Faculty of Philosophy, and is defined in the regulations as “a scientific research and educational institution” that is “to engage in fundamental research in Icelandic and general linguistics”. It is also “to collect and preserve data (tape recordings, etc) on modern Icelandic”, and also “to engage in projects in applied linguistics”. This fits in well with the projects under discussion here, and the participation of the Linguistics Institute would secure the necessary contact with the teaching and research environment within the University of Iceland, in addition to which the institute possesses a great deal of data on spoken Icelandic.

The Icelandic Language Centre has just moved to Neshagi 16, where the Dictionary had already established its premises. This has opened the way to increased collaboration between these two institutions. It is desirable that state-backed operations in the sphere of language engineering should be located in the same place, so taking advantage of proximity to these institutions and utilising the materials stored on the site and the expertise of the institutions’ staff. No specific proposals are made at this stage regarding the formal status of work in language engineering or the organisational connections between such work and the institutions named above. Such matters will have to be given further attention in collaboration with the directors of the institutions involved.

The Ministry for Foreign Affairs’ Translation Centre

The Ministry for Foreign Affairs’ Translation Centre was established in 1990, the original aim being to translate the EEA Agreement and its constituent deeds (directives, regulations, etc) into Icelandic. From the outset, the intention was to make as much use as possible of computer technology to co-ordinate and facilitate the process of translation. For example, a program was developed to calculate the resemblance factor between documents, so making it easy to find texts that could be used as models. This program has not been updated, however, and does not work properly with the new, altered format of document originals. Other tools include a spell-checker and a program that finds duplicated words.
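The report does not say how the centre's program computed its resemblance factor or detected duplicated words. As a rough illustration only, the sketch below assumes a Jaccard similarity over three-word sequences for the resemblance factor and a simple regular expression for repeated words. All function names (`shingles`, `resemblance`, `best_models`, `duplicated_words`) are hypothetical and are not taken from the centre's software.

```python
import re


def shingles(text, n=3):
    """Return the set of n-word sequences in text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Very short texts: treat the whole text as one sequence.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(a, b, n=3):
    """Assumed resemblance factor: Jaccard similarity of the two
    shingle sets, from 0.0 (nothing shared) to 1.0 (identical)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def best_models(new_doc, archive, top=3):
    """Rank archived texts (a dict of name -> text) by resemblance
    to new_doc and return the top (score, name) pairs."""
    scored = [(resemblance(new_doc, text), name)
              for name, text in archive.items()]
    return sorted(scored, reverse=True)[:top]


def duplicated_words(text):
    """Return words that appear twice in a row, e.g. 'er er'."""
    pattern = r"\b(\w+)\s+\1\b"
    return [m.group(1) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]
```

A translator could call `best_models(new_text, archive)` to list the previously translated documents most likely to serve as models, and run `duplicated_words` over a finished draft as a final check. In Python 3, `\w` matches Unicode letters by default, so the Icelandic characters þ, ð and the accented vowels are handled without extra configuration.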

The translation centre has a computerised lexicon, which is invaluable as an aid for co-ordinating vocabulary usage. The lexicon contains a large amount of specialised terminology, running to about 20,000 entries, in various fields. Each entry contains an Icelandic word or phrase and the foreign (generally English) equivalent. Examples of usage are generally cited as well. On the other hand, no grammatical information is normally recorded.
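The structure of a lexicon entry as described above can be sketched as a small data type. This is an illustration under assumptions, not the centre's actual format: the class and field names are hypothetical, and the sample entry is made up for the example rather than taken from the centre's lexicon.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LexiconEntry:
    icelandic: str                # Icelandic word or phrase
    foreign: str                  # foreign equivalent (generally English)
    language: str = "en"          # language of the foreign term (assumed field)
    examples: List[str] = field(default_factory=list)  # examples of usage
    # No grammatical information is stored, matching the description above.


class Lexicon:
    """Hypothetical lookup table keyed on the foreign term."""

    def __init__(self) -> None:
        self._by_foreign: Dict[str, List[LexiconEntry]] = {}

    def add(self, entry: LexiconEntry) -> None:
        key = entry.foreign.lower()
        self._by_foreign.setdefault(key, []).append(entry)

    def translations(self, foreign_term: str) -> List[str]:
        """Return every recorded Icelandic equivalent of foreign_term."""
        entries = self._by_foreign.get(foreign_term.lower(), [])
        return [e.icelandic for e in entries]


lexicon = Lexicon()
# Illustrative entry only.
lexicon.add(LexiconEntry("tilskipun", "directive", examples=["tilskipun ráðsins"]))
```

Keying the table on the lowercased foreign term lets several translators look up the same source term and arrive at the same Icelandic wording, which is the co-ordinating role the lexicon is said to play.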

At first, considerable effort was put into automating as much of the translation work as possible, particularly translations of various names, fixed phrases and the standard wordings found in the deeds. This attempt never extended to the more complex translations, however, and some of the automatic functions are no longer used. Although no further work has been done in the translation centre in the past few years on automatic translation and the production of other aids for translation, there is a great deal of interest in doing so. The staff of the centre realise that the harmonisation and standardisation required in these translations cannot be achieved without the aid of computer technology.

The Icelandic Standards Council

The Icelandic Standards Council has operated since the turn of the year 1992-1993 under the legislation on standards in Iceland. Its role is to act in collaboration with those who are interested in the construction and use of standards in Iceland. The council’s guiding principles in its work are to contribute to expansion and new ventures in Iceland’s business and professional sectors, to improve the business environment, and to improve consumer protection and safety. The council’s largest projects are connected with its membership of the European standards associations, CEN and CENELEC, since membership involves the undertaking to introduce all standards originating from those bodies as Icelandic standards.

The Icelandic Standards Council publishes specifically Icelandic standards and supervises their production. Specifically Icelandic standards are compiled when interested parties consider there is a need due to particular circumstances applying in Iceland or because there are no European or international standards on individual matters. In individual cases, standards are translated into Icelandic, and the Icelandic Standards Council supervises both this work and the publication.

The Standards Council provides information on all matters concerning standards and standardisation, and handles the sale of standards from a large number of standards institutions. In addition, one of its roles is to promote standardisation work in Iceland through publicity work.

7. Websites of Interest

Icelandic Government policy

- The Government of Iceland’s future policy in Icelandic URL: www.stjr.is/framt/syn00.htm

- The Government of Iceland’s future policy in English URL: www.stjr.is/framt/vision00.htm

- Website of the RUT committee. Contains information on Icelandic state institutions’ policy on information technology. No mention is made of Icelandic. URL: www.stjr.is/fr/rut/hugb97/gsigurd/index.htm

- Website on work on the Government’s strategic planning. It contains a section on the language, and also a reference to the preliminary standard FS130þ URL: eldur.stjr.is/fr/rut/uppsam95.htm#n1u

- Website of the project committee on the information society. It contains references to various reports. URL: brunnur.stjr.is/interpro/for/for.nsf/pages/verk

Language engineering, general

- Survey of the State of the Art in Human Language Technology. A collection of articles on language engineering. Technical and very thorough. URL: www.cse.ogi.edu/CSLU/HLTsurvey

- Swedish course in language engineering (språkteknologi). URL: www.nada.kth.se/~viggo/sprakteknologi/kursplan.html

- Professor Eiríkur Rögnvaldsson’s course on computers and language. URL: www.hi.is/~eirikur/ttt.html

Language engineering and linguistics

- Speech on the Web. A list of pages related to phonetics and speech sciences. URL: fonsg3.let.uva.nl/Other_pages.html

- WWW Indices related to Computational Linguistics. URL: www.ims.uni-stuttgart.de.info/Indices.html

- Conference Schedule for Linguists, Translators, Interpreters and Teachers of Languages. URL: www.clark.net/pub/royfc/confer.html

State Institutions and committees dealing with language and linguistic databases

- Orðabók Háskólans (Dictionary of the University of Iceland). URL: www.lexis.hi.is/

- Íslensk málstöð (Icelandic Language Centre). URL: www.ismal.hi.is/

- Lög um íslenska málnefnd (Icelandic Language Council Act). URL: www.althingi.is/dba-bin/unds.pl?txti=/ wwwtext/htdocs/lagas/122b/1990002.html&leito= Fyrirspurnir%5C0Sv%F6r#word1

Overseas research institutions

Websites of some computational linguistics departments in the Nordic countries:

- Universitetet i Stockholm. URL: www.ling.su.se/dali/dali.htm

- Universitetet i Göteborg. URL: svenska.gu.se/sprakdata and URL: www.cling.gu.se/

- Uppsala Universitet. URL: stp.ling.uu.se/educa/fudl.html

- Universitetet i Oslo. URL: www.hf.uio.no/ilf/fou/fag/sli.html

- Universitetet i Bergen. URL: www.hf.uib.no/i/LiLi/SLF/undervisning/aktuell-undervisning.html

- University of Helsinki. URL: www.ling.helsinki.fi/research/rumlat.html

- Handelshøjskolen i København. URL: www.cbs.dk/departments/dl.index.html

- Language engineering course in the University of Stockholm. Utbildningsplan för Språkkonsultlinje. URL: www.nordiska.su.se/konsult.htm

University research projects in various other universities:

- Speech at Carnegie Mellon University. URL: drum.speech.cs.cmu.edu/speech/

- Massachusetts Institute of Technology, MIT. URL: www.sls.lcs.mit.edu/sls/

- University of California, Berkeley. URL: www.icsi.berkeley.edu/real/speech.html

- Mississippi State University. URL:

- Circuit Theory and Signal Processing Lab of the Faculté Polytechnique de Mons, Belgium. URL: tcts.fpms.ac.be/

Organisations and societies active in language engineering and linguistics

- Collection of websites on speech technology; manufacturers, societies and periodicals. URL: www.speechtechmag.com/hotlinks.htm

- ELRA. European Language Resources Association. URL: www.icp.grenet.fr/ELRA/home.html

European Union. Language engineering and languages.

The following websites contain references to the European Union’s technical framework programmes. The European Union supports speech technology, machine translation and various other projects in the sphere of language engineering and linguistics. A new EU programme, the Fifth Framework Programme, begins in the spring of 1999.

- Telematics Application Programme, official homepage. This is part of the Fourth Framework Programme dealing with Language Engineering. URL: www2.echo.lu/telematics/

- Concord Directory to the Telematics Applications Programme. List of the projects within the programme. URL: www.concord.dcbru.be/

- Multilingual Information Society (MLIS) Programme. One of the EU programmes aimed at supporting multilingualism. URL: www2.echo.lu/mlis/mlishome.html

Production company active in Language Engineering

- Lingsoft ab in Finland. The company develops and manufactures language engineering software of various types for many languages, including Finnish, German, Russian and Norwegian. URL: www.lingsoft.fi/sv/

Standards

- Staðlaráð Íslands (Icelandic Standards Council) URL: www.stri.is/

- Innkaupahandbók um upplýsingatækni. (Purchasing manual for information technology.) Ministry of Finance (Iceland). Rit 1998-3. Ed. Jóhann Gunnarsson. URL: www.stjr.is/rut/

- On the history of the Icelandic letters and character sets, see the English web pages under www.stri.is/STRI/enska/Iceletters.shtml

- NIST: National Institute of Standards and Technology (USA) interacts with organisations in implementing Speech Processing Evaluations and Benchmark Texts. URL: www.nist.gov/speech/online.htm

Letters (characters), fonts, character sets

- Unicode. Universal character set. URL: www.unicode.org/

- Homepage of Gunnlaugur S.E. Briem, font designer. URL: rvik.ismennt.is/~briem

Selected articles and essays about Icelandic and language engineering

- Íslensk stafasaga. (History of Icelandic letters.). Þorgeir Sigurðsson. Icelandic Standards Council. URL: www.stri.is/STRI/enska/Iceletters.shtml

- Rúnastaðall. (Rune standard.) Standard on the writing of Icelandic using 16 runic letters. Þorgeir Sigurðsson. URL: www.stri.is/traoh2.html

- Stefán Briem’s homepage. URL: www.ismal.hi.is/stefan/

Languages

- Ethnologue, Languages of the World. A website with information on about 6,700 languages spoken in 228 countries, numbers of speakers, etc. URL: www.sil.org/ethnologue/

Machine translation

- Systran translation software. URL: www.systransoft.com/personal.html

Speech simulators

- The Telia homepage contains information on speech simulators, with audio samples. These include the Icelandic simulator that was developed with the involvement of the Icelandic Association of the Blind. The difference between the two types of speech simulator in use can be heard. URL: www.promotor.telia.se/infovox/index/htm

- Speech simulators for various languages can be sampled on a number of pages on the Internet. One such is: URL: www.elan.fr/speech/

Speech systems

The following are websites for speech systems currently offered for sale for PC computers, covering manufacturers and reviews of their products.

Manufacturers

- Lernout and Hauspie. The company’s products include Voice Xpress. URL: www.lhs.com/

- IBM Voice Products. The company’s products include ViaVoice. URL: www.software.ibm.com/speech/

- Dragon. The company’s products include NaturallySpeaking. URL: www.dragonsys.com/

- Philips. The company’s products include FreeSpeech98. URL: www.speech.be.philips.com/

Reviews/criticism

- What to Look for in a Speech Recognition Program. PC Magazine Online. URL: www.zdnet.com/pcmag/features/speech98/lookfor.html

- LET’S TALK! Speech technology is the next big thing in computing. Will it put a PC in every home? Business Week, 23.2.1998. URL: www.sls.lcs.mit.edu/sls/news/article01.html

- References to reviews of speech tools in computer magazines. These include reviews of the main programs currently available. URL: www.voicerecognition.com/1998/information/in_the_news.html

International and multilingual software

When producing multilingual software for the international market it is important to go about the job in the right way from the outset. The following sites discuss what is involved.

- Globalisation of Software and User Interfaces. Richard ISHIDA. URL: ourworld.compuserve.com/homepages/Richard_Ishida/

- Multilingual Software Digest. URL: www.gy.com/homehtml#MultilingualSoftware

Linguistic databases

- Orðabanki Íslenskrar málstöðvar. (Database of the Icelandic Language Centre.) URL: www.ismal.hi.is/ob/

- Links to 400 dictionaries. URL: www.bucknell.edu/~rbeard/diction.html

- Examples of individual initiatives in putting technical vocabulary on the Internet. The person who built the site has been publishing user manuals for Word, amongst other things. URL: www.prim.is/Ordasafn/

- The Linguistic Data Consortium...

- European Language Resources Association...

- LDC related sites. A great deal of information on websites related to speech and text databases. URL: www.ldc.upenn.edu/ldc/sites/index.html

- The following is a list...

- The European Association for Machine Translation...

- Electronic Publishing, R&D news and resources. A European Union webpage about this technology. URL: inf2.pira.co./

- The International Committee for the Co-ordination...

I doubt that the problem of the Icelandic characters will solve itself.

More than a decade ago, the Icelandic characters Þþ, Ðð and Ýý were often included with the rest. I was told that this was owing to pressure from NATO, though I never managed to have this confirmed. At any rate, this pressure is no longer effective. Fewer and fewer font manufacturers now sell Icelandic characters. Competition is tough, and the Icelandic market simply isn’t worth the effort.

Many font designers misunderstand the letters þ and ð. They often become unusable. (I have put directions on the Internet, but no one is obliged to follow them.)

It would not surprise me if Turkey were to win the struggle for a place in character standards. They are now doing exactly what Iceland should be doing: getting the manufacturers on their side.

What is to be done?

Iceland can put its case across to the font manufacturers. They would probably be prepared to make the Icelandic characters available if they received them ready made and free of charge. (I should like to make it clear that I am not looking for work.) In the long term, thumping the table will not work. It would be more productive to set about securing some allies.

Iceland should extend a friendly hand to Turkey and support it in demanding a standard number, which would not affect us in Iceland.

These two points are quite obvious. It is also clear that most people in Iceland couldn’t care less. For a long time I tried to stick up for our interests, particularly among my colleagues in Association Typographique Internationale. I encountered less interest in Iceland than among people abroad.

I trust that this answers the main questions.

Yours sincerely, Gunnlaugur.

______

Gunnlaugur SE Briem http://rvik.ismennt.is/~briem

Phone 44 (UK) 181 749 2919 Phone 1 (US) 415 771 3212 Phone 354 (Iceland) 565 7198