NORWEGIAN UNIVERSITY OF SCIENCE AND TECHNOLOGY

DESIGN, IMPLEMENTATION AND EVALUATION OF A VOICE CONTROLLED INFORMATION PLATFORM APPLIED IN SHIPS INSPECTION

by

TOR-ØYVIND BJØRKLI

DEPARTMENT OF ENGINEERING CYBERNETICS

FACULTY OF ELECTRICAL ENGINEERING AND TELECOMMUNICATION

THESIS IN NAUTICAL ENGINEERING

ABSTRACT

This thesis describes the set-up of a platform in connection with ship inspection. Ship inspections involve recording of damage, which is traditionally done by taking notes on a piece of paper. It is assumed that considerable double work, i.e. manually re-entering the data into the ship database, can be avoided by introducing a speech recogniser. The thesis explains the challenges and requirements such a system must meet when used on board. Its individual system components are described in detail and discussed with respect to their performance. Various backup solutions in case the speech recogniser fails are presented and considered. A list of selected relevant commercially available products (microphones, speech recognisers and backup solutions) is given, including an evaluation of their suitability for their intended use. Based on published literature and own experiences gained from a speech demonstrator having essentially the same interface as the corresponding part of the DNV ship database, it can be concluded that considerable improvements in speech recognition technology are needed before such systems are applicable in challenging environments. The thesis ends with a future outlook and some general recommendations about promising solutions.


PREFACE

The Norwegian University of Science and Technology (NTNU) offers a Nautical Engineering Studies programme; the required entrance qualifications are graduation from a Naval Academy or Maritime College, along with practical maritime experience as an officer. The studies at NTNU are concluded with a post-graduate thesis. I embarked on a career at sea when I commenced my education at Vestfold Maritime College and later joined the Navy's Officers' training school. With my background from the RNoCG (Royal Norwegian Coastguard), the RNoN (Royal Norwegian Navy) as well as the merchant fleet, I am naturally interested in shipping and safety at sea. The thesis is a result of my maritime experience along with the theoretical basis acquired at NTNU; results from the semester project "The surveyor in the information age", written in the spring of 1999, have also contributed to the outcome. When forming the structure of this project, the NTNU project template has been used as the main guideline. The thesis is the result of a literature study as well as information given by former and present surveyors within DNV. This thesis in nautical engineering has been accomplished under the supervision of Professor Tor Onshus at the Department of Engineering Cybernetics (ITK1), NTNU.

ACKNOWLEDGEMENTS

This thesis in nautical engineering has been carried out during the autumn of 1999 at the Department for Strategic Research at DNV. During this time, I was privileged to work in DNV's Mobile Worker project, which provided an inspiring environment at the junction between research and product testing. I would like to sincerely thank the project members for fruitful collaboration during my stay. The assistance and guidance given by a number of individuals has been essential for the accomplishment of the thesis. I would like to thank my head supervisor, Professor Tor Onshus, for his guidance during this project. I wish to thank all the researchers at DTP 343, Department for Strategic Research at DNV's head office at Høvik, and especially Dr. Scient., Dipl.-Ing. (FH) Thomas Mestl, for allowing me to carry out this project as a part of the Mobile Worker research programme and for his support during the thesis. Finally, I would also like to thank Thomas Jacobsen, a fellow student, for his assistance in checking and commenting on my work.

Oslo Wednesday, 15 December 1999

Tor-Øyvind Bjørkli

1 Department of Engineering Cybernetics


Table of contents page

ABSTRACT ...... i

PREFACE...... ii

1 INTRODUCTION...... 1

2 SYSTEM DESIGN...... 4
2.1 Constraints, requirements and potentials 4
2.2 System set up 7
2.3 Back up solutions 8

3 EVALUATION OF COMMERCIALLY AVAILABLE SYSTEM COMPONENTS...... 15
3.1 Microphones 15
3.1.1 Physical principles 15
3.1.2 Noise reducing measures 18
3.1.3 Body placement 23
3.1.4 Commercially available microphones 24
3.2 Speech recognition software 28
3.2.1 Principles 31
3.2.2 Recognition enhancing measures 31
3.2.3 Commercially available products 33
3.2.4 General conclusions 38

4 SPEECH DEMONSTRATOR...... 40
4.1 Set-up 41
4.2 Experiences gained from the speech demonstrator 44

5 RECOMMENDATIONS AND FUTURE OUTLOOK ...... 46

6 REFERENCES...... 48
Appendices 50



1 INTRODUCTION

Speech recognition itself is nothing new; in fact, everybody does it every day. However, a machine that recognises the spoken word is a technological challenge, and only recently have such machines become available. Dictation systems for specific professions, such as radiology, have been around for years and carry five-figure price tags. Less expensive general-purpose systems require discrete speech, which is a tedious method of dictation with a pause after each word. Two years ago, Dragon Systems achieved a new milestone with the release of NaturallySpeaking, the first general-purpose speech recognition system that allows dictating in a conversational manner. IBM quickly followed with ViaVoice, costing hundreds of dollars less than the first version of NaturallySpeaking. A major factor driving the development of these speech-enabled applications is the steady increase in computing power, since speech recognition systems demand a lot of processing power and disk space. The timeline below gives the history of speech recognition systems (PC Magazine, 10 March 1998):

Speech Technology Timeline
Late 1950s: Speech recognition research begins.
1964: IBM demonstrates Shoebox for spoken digits at the New York World's Fair.
1968: The HAL-9000 computer in the movie 2001: A Space Odyssey introduces the world to speech recognition.
1978: Texas Instruments introduces the first single-chip speech synthesiser and the Speak and Spell toy.
1993: IBM launches the first packaged speech recognition product, the IBM Personal Dictation System for OS/2.
1993: Apple ships PlainTalk, a series of speech recognition and speech synthesis extensions for the Macintosh.
1994: Dragon Systems' DragonDictate for Windows 1.0 is the first software-only PC-based dictation product.
1996: IBM introduces MedSpeak/Radiology, the first real-time continuous-speech recognition product.
1996: OS/2 Warp 4 becomes the first operating system to include built-in speech navigation and recognition.
June 1997: Dragon ships NaturallySpeaking, the first general-purpose continuous-speech recognition product.
August 1997: IBM ships ViaVoice.
Fall 1997: Microsoft CEO Bill Gates identifies speech recognition as a key technological advance.
Future: The next generation of speech-based interfaces will enable people to communicate with computers in the same way they communicate with other people (Scientific American, August 1999).

In fact, the dream that machines could understand human speech has existed for centuries, as Leonhard Euler expressed already in 1761: "It would be a considerable invention indeed, that of a machine able to mimic our speech, with its sounds and articulations. I think it is not impossible." As functioning speech recognition systems appear on the market, they are being tried out in a large variety of everyday applications, ranging from cars, toys, personal computers and mobile phones to telephone call centres. It has taken more than four decades for speech recognition technology to become mature enough for these practical applications. Moreover, some computer industry visionaries have predicted that speech will be the main input modality in future user interfaces. It is nevertheless important to note that the current speech recognition and application boom is not only due to advanced speech recognition algorithms developed during the last few years, but may be mainly due to huge processing power improvements in current


microprocessors. In fact, the core speech recognition technology on which current applications mainly rely was already developed in the late 1980s and early 1990s. The trend in IT development goes towards miniaturisation of components. As the components become smaller and smaller, they will reach a size that easily fits into e.g. clothing, jewellery, and helmets. The ideal situation would be that the individual components are so small that the user would not notice wearing them. Already today, there exists various equipment that could allow digital information to be entered into the reporting system. These devices are, however, not designed for the environments found on board ships; they are mainly usable in office surroundings. Assume an inspector is about to inspect a tank section in a ship: it is hot, noisy, dirty, and humid. To get access to the areas of importance, a ladder has been arranged. The traditional inspection tools for tank coating and structure (rust, cracks, etc.) are a hammer and a flashlight. As you climb the ladder, both hands are needed, one for securing yourself and the other for the inspection tool. A crack in the tank structure must of course be reported. Today this is done by scribbling down a note on a piece of paper that is stowed away in a pocket. Vision (potential applications of the technology): Imagine having access to all the information, recordings, and equipment needed through a not yet invented device. The device consists of a miniature digital camera integrated in your helmet, a very small PC unit, and a microphone. Assume further that this device is fully voice controlled, allowing the user to navigate within an information database and take notes even when both hands are occupied. A written report, including pictures, could then be completed "on the job" with the help of a speech recognition system, securing the information needed as well as saving time.

Figure 1: Ideally, any equipment shall support the surveyor in his work in such a way that he can fully concentrate on his primary task, the detection of defects; a speech recognition system would be a desired "secretary". This thesis addresses the design, implementation, and evaluation of a voice recognition system in connection with DNV's ship inspection. The thesis will mainly reflect on problems and tasks concerning voice recognition. The background for this theme is to make surveys more effective (save time and money), increase the quality of work, enable immediate updating of class status upon completion of a survey, and contribute to improving the surveyor's working conditions. Speech entry for entering comments and findings would therefore be of great benefit for today's surveyors, since it would eliminate double work and sources of error, and speed up the necessary "paper work".



Because of the tremendous potential that new IT technology offers, DNV has initiated a project called Mobile Worker, which focuses on the utilisation of new technology and cordless communication to help the surveyors in their everyday work. What can be done to make the best of these developments? And how can the tasks and processes be arranged in order to work faster and better? The traditional inspection tasks are varied, and the surveyors solve them in many different ways using various conventional aids. Inspectors look at the condition of what is being inspected, make notes, and fill in standard forms. They have rulebooks, instructions, and other necessary documentation to supplement their own knowledge and memory. Many DNV inspectors have also used mobile phones and portable PCs for a long time to gain access to information and to communicate with customers and colleagues. However, at some point the mobility will cease and the equipment will have to remain in a cabin, a suitcase or at the office. The inspectors therefore often need information in situations where it is not available, such as when they want to compare observed conditions and damage with the regulations and standard examples. What is the point of having stored all kinds of information in a ship or loss database if you cannot use it when you have to make a decision? Alternatives that are even more mobile are now starting to appear on the scene, and the possible mobile tasks and specifications of mobile solutions will be described. The human aspect may be an even greater challenge. Unless the users accept the new opportunities as a natural part of their job, the new equipment may lower their job satisfaction, with the result that they may quickly stop using it. Portable technology can easily create the feeling that the tool is taking control over the work situation instead of supporting it. An important part of every new solution will be to adapt tasks such that the technology will motivate and make work simpler (Andersen, 1999). Surveyors may have to leave the work site to find a required manual or a list of approved suppliers and equipment, e.g. the right pumps from the right vendor. If they fail to refer to the manual, they may attempt lengthy or complex procedures from memory and produce errors. Once the manual is retrieved, the surveyor may have to find room within a cramped workspace to put a large manual or drawing. Also, any attempt to climb onto equipment while holding a large manual might jeopardise safety. Speech recognition systems are error-prone and not very robust to real-world disturbances, such as ambient background noise (including speech from other surrounding speakers), communication channel distortions, pronunciation variations, speaker stress, or the effects of spontaneous speech. Current speech recognition systems can be categorised in two groups according to their robustness level. Applications falling into the first category typically have large vocabularies and have been designed to recognise continuous speech. Successful use of these systems requires efficient minimisation of all possible interference sources. High-quality audio equipment, including a close-talk microphone, a noise-free operating environment and, particularly, a co-operative and motivated user are needed in order to achieve high recognition performance. Dictation systems for continuous speech are typical examples of speech recognition systems belonging to this application category.
Truly robust speech recognition applications form the second group. Robust systems can cope with distorted speech input (to a certain extent) and still provide high recognition accuracy even with inexperienced novice users. These systems can usually recognise only discrete words, and their vocabulary size is limited to some tens of words. A good example of a robust system is a speaker-dependent name recognition application in voice dialling. In the name dialling system, the user has trained voice-tags, i.e. names that have phone numbers attached to them. By speaking a certain voice-tag, a phone call to the attached number is then made. Because of this apparent simplicity, name dialling is very useful for example in a car environment where the user's hands and eyes are busy. It is important to note


that speech recognition alone does not have any particular value. To use speech as an input modality, there must always be some practical advantage. Furthermore, one cannot overestimate the importance of a good user interface; it is essential that speech recognition applications are extensively tested with real users in realistic operating environments.

Organisation of the Thesis In the next chapter, a possible design of a speech recognition system is presented. The subsequent chapters discuss the individual components in more detail. Chapter 4 describes a speech recognition demonstrator and chapter 5 ends the thesis with some discussion and future outlook.

2 SYSTEM DESIGN

This chapter describes a possible system design. Constraints and requirements, the surveyor's work process, and potential backup systems are addressed.

2.1 Constraints, requirements and potentials

DNV's constraints can be divided into two groups: DNV's database (NAUTICUS), and the surveyor's work process, including the effect of a speech-based reporting system on it.

Det Norske Veritas has developed the NAUTICUS database system, which contains all information about a ship over its entire life (Lyng, 1999). NAUTICUS is based on a product model, allowing in principle unlimited information to be attached to each element of the ship. For example, a geometrical representation of the hull, as well as mathematical analyses of structure and machinery, are contained in it. The system also permits analysis of the ship's structural strength and behaviour under any sea state and loading condition, with visual feedback. Several analysis options are available, including fully integrated finite-element capabilities and direct wave-load calculations. The same model also allows a full set of machinery calculations to be performed. With only one model serving all functions, repetitive data input is eliminated. The accumulated information regarding a specific NAUTICUS class vessel is available to DNV at any time; a surveyor is able to retrieve updated data on matters relating to ship status and condition, survey feedback, newbuilding and certificate status, and component and system information.



Figure 2: A surveyor may retrieve information by accessing the Product Model on his laptop PC.

Figure 3: The NAUTICUS product model is intended to be a mirror of the real world, and any information such as user guides, hints, warnings or restrictions is attached to the model.

Ambient conditions: a worker in a challenging environment (temperature, water, shielded and confined spaces, noise, etc.) with no support available (power supply, Internet or phone connection).



NAUTICUS provides a 3D representation of the ship's structure and allows information to be recorded, stored, and retrieved. The product model will, at any instant of the ship's lifetime, hold its original description as well as its present and historical condition, both for the structure and the equipment. Throughout a ship's life, reports, drawings, sketches and engineering calculations will be stored in the NAUTICUS product model. The information can be accessed simply by "clicking" on a part, system, or compartment in a 3D view or a tree structure. The Classification and Statutory Certificates are the main "deliverables" of a classification society. The main deliverable of a surveyor is, however, the inspection report. Introduction of NAUTICUS to DNV surveyors in the field may enable them to issue full-term certificates while still onboard the vessel.

Figure 4: The survey work process: planning of the survey at the office (from NAUTICUS, resulting in a checklist), execution of the survey onboard the ship (resulting in a survey report), and reporting of the survey at the office (back into NAUTICUS). Through NAUTICUS, the surveyors will have ready access to the ship information needed for planning of surveys. When onboard, the surveyor may wish to retrieve information about e.g. the minimum allowable steel thickness; this information can only be retrieved from NAUTICUS, so the surveyor must either be connected to the headquarters or to a local copy of NAUTICUS. At the completion of a survey, and after verification of the survey data, the results are recorded straight into the ship database and are then immediately accessible to other surveyors.



Figure 5: Screen dump of NAUTICUS in operation and some of the options available (renewed plates, a fully integrated sketcher tool, crack description). The integrated sketcher tool, for drawings, pictures and annotation of the inspected item, is scheduled for future implementation in NAUTICUS.

2.2 System set up

The speech recognition system can be compared to a chain, usually consisting of four independent units (see Figure 6):
• User.
• Microphone.
• Speech recognition software.
• Computer hardware.
It may be possible to insert more components into the chain to improve the recognition accuracy. Examples of such components are:
• Noise reducing software.
• Noise reducing hardware.

Depending on the degree of integration, these "extra" components may increase the total weight and size of the equipment and may decrease the user's ability to move around. Extra equipment may also consume power, which means that the operation time is reduced.



Figure 6: The speech recognition system can be compared with a chain consisting of a user, microphone, noise reducing hardware, noise reducing software, speech recognition software, and computer hardware.

Each component of the speech recognition system has its individual strengths; the chain consists of a number of independent links. As indicated in Figure 6, the user represents the system's first link. In today's systems, the recognition rate depends principally on the user's skill in handling speech recognition systems (dictating, etc.). If the user is not familiar with speech as an input option, recognition may fail totally; the user is therefore defined as a weak link. The microphone as an input device is discussed in chapter 3.1. The microphone is the second component in any speech recording or transmission system; its function is to convert acoustic sound waves into an equivalent electrical signal. The commercially available microphones of today are not constructed to operate in high ambient noise, and backup solutions for input may therefore be necessary, as discussed in chapter 2.3. The microphone is identified as a weak link. Noise-reducing hardware may increase the quality of the speech signal and is considered a strong link. Likewise, noise-reducing software is a strong link, since powerful algorithms are achievable. Chapter 3.2 discusses speech recognition software, which is considered a weak link; when it comes to dealing with non-native speakers, the level of recognition is unsatisfactory. Today's computer hardware is not the limiting factor in a speech recognition system.
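As a loose numerical illustration of the chain picture (the probabilities below are assumed for illustration only, not measured values): if each link preserves a spoken word with a certain probability, the overall recognition rate is roughly the product of the individual rates. For example, user 0.95 x microphone 0.90 x noise reduction 0.99 x recognition software 0.90 x hardware 1.00 ≈ 0.76, i.e. even fairly good individual components can combine into an overall recognition rate that is too low for practical use, and the weakest links dominate the result.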

2.3 Back up solutions

If something fails, a backup system is needed. An example of such a backup system is a hand-held keyboard for text input or a trackball for navigation, as shown in Figure 7. Trackballs and touchpads do not allow immediate textual input, which reduces their range of use. Furthermore, all the backup solutions presented here forfeit the ability to give input without the use of the hands. A backup device may also be space consuming and thus interfere with the user's ability to move, e.g. in confined spaces. However, some backup systems may be preferable when navigating in a two-dimensional space; this is very difficult to achieve with voice. The error frequency when using a keyboard is also small compared to the error frequency when using voice.



Figure 7: A keyboard or a trackball may function as a backup system if the speech recognition fails. Furthermore, a keyboard may be used for text editing in idle periods of the inspection. One may consider different backup systems such as keyboards, hand scanners, mice, etc. These backup systems should primarily serve as a supplement to the voice input.

Figure 8: Potential backup systems, from left to right: a trackball, which provides mouse functions; a data glove, which in addition can type letters or numbers into the computer; the Twiddler2, which gives the user full keyboard access using one hand; and, in the right picture, a hand-held Dictaphone for audio input.

If the speech recognition system is used on a mobile platform (palmtop, wearable computer), standard desktop computer input devices are inadequate. A conventional keyboard, for example, is not a practical input device since it was designed to be used while sitting down. A major factor in the development of input devices concerns the placement of the devices. Keyboards require the user to have the fingers free for typing; thus, the keyboard must be held in place by a means other than the user having to grasp it. The advantages and disadvantages of various backup solutions are summarised in Table 1 and Table 2. As mentioned, full-size keyboards are cumbersome. Chorded keyboards use fewer keys to input text; combinations of keys are used to indicate particular letters, and some can be strapped to one hand or a wrist. A keyboard allows a full range of textual input, but in mobile work the keyboard has to be worn and positioned for input. This conflict has given rise to alternative keyboard devices.

2 http://www.handykey.com/



(a) Full size keyboards: A normal keyboard is unattractive in a wearable context; because of this, a variety of small text-input devices have been developed.

(b) Miniaturised keyboards: Miniaturised keyboards offering the options of a full-size keyboard are available on the commercial market. If used only occasionally, the backup keyboard could be stored in a jacket and pulled out when needed. If used more frequently, it could be strapped to the wrist, the belt or elsewhere on the body. The advantages are that the user has all the features of a full-size keyboard, that it demands a minimum of space, and that little or no training is required; it is inexpensive, requires low power and low bandwidth, and is compatible with existing software. The shortcomings are that it is cumbersome to use for navigational purposes and may annoy the user because of its placement (on the arm); using a miniaturised keyboard in confined spaces may also require back lighting. Furthermore, there is no feedback (click or beep) on whether the pressing of a button was successful, and there is no pointing capability inherent in the device.

Commercially available miniaturised keyboards → The QWERTY3 keyboard from L3 Systems4 is designed for wrist mounting. This keyboard is totally sealed and has optional adjustable back lighting with a choice of PS/2 or USB5 interface. Features: • Optional wrist strap providing the capability of attaching it to your wrist. • Back lighting. • PS/2 or USB interface. → The PGI micro keyboard from Phoenix Group, Inc6 is rugged and sealed to protect it from the elements. Weighing less than 160 grams, it is designed to be arm mounted and offers PC compatibility with 59 keys and 99 functions in a package about the size of a dollar bill. It is supplied with a PS/2-compatible mini-DIN connector. Features: • Optional wrist strap providing the capability of attaching it to your wrist. • PS/2 or USB interface. • Back lighting.

(c) Virtual keyboards → The AUDIT7 is a text editor and computer control program particularly prepared for acoustic communication and remote computer operation8 9 (a keyboard interface devised for remote computer control). The keyboard interface may be used as an ordinary text editor at any office or personal computer terminal with a graphic screen and a normal keyboard, or as

3The name derives from the first six characters on the top alphabetic line of the keyboard 4 www.l3sys.com 5 Universal Serial Bus. 6 www.ivpgi.com/ 7 Audible Editor 8 www.dnv.com/ocean/nbt/audit/docs/report.htm 9 www.dnv.com/ocean/nbt/audit/zframe.htm


a remote text editor with a 12-button keypad, for instance a push-button telephone, a cellular phone or a calculator. A virtual keyboard could also be based on a modified Braille alphabet (Gran, 1992).

Figure 9: Screen dump of the AUDIT virtual keyboard.

(d) Chording keyboards: These are smaller, have fewer keys, and can be strapped to one hand or a wrist. Such input devices typically provide one button for each finger and/or the thumb, and each button controls multiple key combinations. Instead of the usual one-at-a-time key presses, chording requires simultaneous key presses for each character typed, similar to playing a chord on a piano. The advantages are that it requires fewer keys than a conventional keyboard; with fewer keys and the fingers never leaving the keys, finger strain is minimised, and the user can place the keyboard wherever it is convenient, which helps alleviate unnatural typing positions. The disadvantages are that the one-handed requirement for input means it cannot be used for applications where the user must have both hands totally free at all times; it requires at least 10 to 15 hours of training to operate, is only suitable for textual input, and usually slows data entry considerably. There is no pointing capability inherent in the device. Commercially available chording keyboards → The BAT personal keyboard from Infogrip, Inc10 is a one-handed, compact input device that replicates all the functions of a full-size keyboard, but with greater efficiency and convenience. Letters, numbers, commands and macros are simple key combinations, "chords", that can be mastered after some training. The BAT's ergonomic design reduces hand strain and fatigue. The BAT is also a typing solution for persons with physical or visual impairments and can increase productivity when used with graphic or desktop publishing software. Features: • Left or right hand configuration. • Dual keyboard option includes both left and right-hand units. → The Twiddler from Handykey Corporation11 is a pocket-sized mouse pointer plus a full-function keyboard in one unit that fits in

10 www.infogrip.com/ 11 www.handykey.com/


either the right or the left hand. It plugs into both the keyboard and serial ports on IBM-compatible PCs and works with DOS, Microsoft Windows 3.x/95/NT, Unix, and Palm Pilot operating systems. The Twiddler's mouse pointer is based on a sensor sealed inside the unit and is immune to dust and dirt. The Twiddler incorporates an ergonomic keypad designed for "chord" keying, i.e. pressing one or more keys at a time; each key combination generates a character or command. With 12 finger keys and 6 thumb keys, the Twiddler can emulate the 101 keys of a standard keyboard.
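A rough illustration of why so few keys suffice (the arithmetic is illustrative, not taken from the vendor): 12 finger keys alone allow 2^12 - 1 = 4095 different non-empty chords, so even a small subset of comfortable one- and two-finger chords, combined with the 6 thumb keys used as modifiers, is more than enough to cover the 101 keys of a standard keyboard.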

Table 1: Evaluation of commercially available keyboards as backup systems.
Advantages:
• Enables textual input
• Reasonable speed (50 wpm) is achievable
• Inexpensive, low power and low bandwidth requirements
• Can be made waterproof
Disadvantages:
• Cannot be used if the task requires two hands
• Training required for proficiency
• No pointing capability inherent in the device

(e) POINTING DEVICES: Pointing devices may be necessary even if the speech recognition works perfectly. Joysticks, touchpads and trackballs are defined as pointing devices by virtue of their ability to move the cursor on the screen. A pointing device does not enable hands-free navigation and requires training and a free surface. The ability to point to a position on a screen is important for all direct-manipulation interfaces and for all applications where there is a figure or a map to annotate. The advantages are that pointing devices are intuitive, allow random access and positional input, and are compatible with desktop interfaces (Krauss & Zuhlke, 1998). They are widely available and could provide a virtual keyboard by showing a representation of a keyboard on the screen and pointing to the desired keys. The disadvantages are that the interfaces that currently utilise pointing devices are resource intensive, that they are inexact for precise co-ordinate specification, and that they are slow when used to provide a virtual keyboard.

(f) TRACKBALLS: This stationary device, lodged in the keyboard or found as a stand-alone product, lets users control the cursor with a rotating ball (rather than a conventional mouse). Trackballs have been around for years and have been continually refined for better performance. The advantages are that a trackball requires little space and gives the ability to point at and "enter" a pushbutton, scroll menu or text field in a program. The disadvantages are that there is no textual input option, the placement of the trackball is cumbersome, training is required, and hands-free operation is not possible.

(g) TOUCHPADS: Touchpads, most commonly seen on notebook computers, are made of a flexible material similar to a laptop screen. Users control cursor movement by running their fingertips or a stylus along the touch-sensitive surface. The advantage of a touchpad is that it allows continuous positioning of the cursor, and many users find that it offers a more natural motion than a Track-Point button. The disadvantages of touchpads are that they are extremely sensitive to moisture contamination. They are also energy consuming, demanding a constant power supply and offering no battery-saving "sleep mode" capability. In today's



sub-notebooks, palmtops, cordless keyboards, and handheld remote controls, battery life is a major concern. Capacitance-based touchpads have another operational downside. They are insensitive to pressure directed downward on the pad, and will not operate using a common stylus. They, therefore, have no potential for 3D input options such as pressure-sensitive scrolling, signature capture, character recognition, etc.

Commercially available pointing devices

TRACKBALL: The RAT-TRAK™ trackball from Industrial Computer Source (ICS)12 features user-defined keys, instant speed control and an ergonomic design. The product comes with a PS/2 connection and is Microsoft mouse compatible. Dimensions (L x W x H): 201.4 mm x 102.7 mm x 57.2 mm. The price is approx. $40.

JOYSTICK: The MicroPoint™ from Varatouch Technology, Inc. (VTI)13 is a small, fully functional joystick with a base diameter of 10 mm. MicroPoint is a variable-resistance electronic analogue device that uses resistive rubber. It can be used with a variety of analogue-to-digital converters. The price is approx. $60.

TOUCHPAD: The Smart Cat from Cirque14 measures about 10 cm square and has a touch surface of 3 by 7.62 cm. The Smart Cat allows single- or double-clicking by tapping on the surface: tap in the left corner for a left-button click, and tap on the right side for the right button. The device also comes with standard left and right buttons. It also scrolls both horizontally and vertically through applications that support scrolling. The price is approx. $49.

Table 2: Evaluation of commercially available pointing devices as backup systems.
Advantages:
• Can indicate a point in two-dimensional space (map)
• Are intuitive
• Faster than typing
• Can be made waterproof
Disadvantages:
• Current devices are resource intensive and usually require a surface for positioning
• Inexact for precise co-ordinate specification
• Slow when used to provide a virtual keyboard

(h) Dictaphone: Findings could also be described and recorded via a Dictaphone. In an earlier DNV project, analogue Dictaphones were tried out by surveyors with unsatisfactory results. The reason could be found in the handling of the Dictaphones

12 www.labyrinth.net.au/~ieci/products/input/html/rat-trak.html 13 www.varatouch.com 14 www.cirque.com



and discipline of the user. However, in some situations and for some persons this may be a suitable backup solution.

Commercially available Dictaphones → The Olympus D1000 Dictaphone15. According to the producer, it is possible to dictate 140 words/min. The machine further allows: • Indexing: automatic date and time recording, automatic dictation numbering. • Two recording modes, Standard (15 min) and Long (33 min), which determine the recording time available in the 2 MB of flash memory. A special feature of the Olympus D1000 is that voice recorded in standard mode can be transcribed into editable text by the IBM voice recognition software ViaVoice. For this task, the recorded voice must be transferred to the PC either by cable or by a flash card. ViaVoice Transcription uses approx. 30,000 words in its basic vocabulary, extendable up to 64,000 words by the user. In addition, ViaVoice Transcription has a dictionary with approx. 320,000 words and a back-up vocabulary that includes spell checking and pronunciation.
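To put the storage figures into perspective (a rough estimate assuming the full 2 MB is used for audio): 2 MB over 15 minutes of standard-mode recording corresponds to about 2,000,000 bytes / 900 s ≈ 2.2 kB/s, or roughly 18 kbit/s, while long mode at 33 minutes corresponds to about 1 kB/s (8 kbit/s), i.e. heavily compressed speech. This is consistent with only the higher-quality standard mode being suitable for transcription by ViaVoice.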

Table 3: Evaluation of the Dictaphone as a backup system.
Advantages:
• Intuitive
• Faster than typing
• Can be made waterproof
Disadvantages:
• Only possible to use in quiet surroundings
• Requires correct handling for a good result
• Too many and too small buttons

Summary and recommendations on input devices: The choice of backup input device will depend on the application, the work surroundings, and the user's experience with these devices; the user should be allowed to choose his preferred input device.

RECOMMENDATIONS As seen from the discussion in this chapter, the skull- and throat-mounted microphones seem to be the most attractive solutions when it comes to hands-free speech input in a noisy environment. However, these microphones are not yet available for speech recognition purposes. Another positive feature is the non-obstructing nature of a skull-mounted microphone, since it can simply be hidden in the helmet. Real user friendliness requires a wireless connection between the system components; it would allow the user to move freely and would also take the aspect of safety into consideration.

15 www.olympus-europa.com/voice_processing/index.htm



3 EVALUATION OF COMMERCIALLY AVAILABLE SYSTEM COMPONENTS

3.1 Microphones

The use of voice applications (speech-based technologies) is becoming increasingly common on personal computers. Audio applications like Internet telephony, computer telephony, videoconferencing and speech recognition are transforming the PC into the desired communications appliance. High-quality microphones are required to enable these voice applications. However, many applications simply presuppose an ideal microphone, with the result that a sub-optimal microphone is selected, leading to poor acoustic input to the voice recognition software. Severe performance degradation can result when the microphone is not viewed as a critical performance element in the speech recognition chain. By selecting the proper microphone element (skull-mounted, noise cancelling, etc.) and implementing it correctly, the performance of the voice application can be dramatically improved. The primary barrier to a successful introduction and user acceptance of voice recognition software has been noise that contaminates the speech signal and degrades the performance and quality of speech recognition. The current commercial remedies, such as noise cancellation software and noise cancelling hardware, prove to be inadequate for dealing with real-world situations. Certain unwanted signals, e.g. background talk, are very similar to the actual voice signal of interest and thus indistinguishable from it, very often degrading the recognition.

3.1.1 Physical principles

The voice is the user's "keyboard", and the recognition result of the voice input depends on the sound characteristics of the microphone. Although there are different models of microphones, they all do the same job: they transform acoustical movements (the vibrations of air created by the sound waves) into electrical signals. This conversion is relatively direct, and the electrical signal can then be amplified, recorded, or transmitted.



Definition16: a microphone is a generic term for any element that transforms acoustic energy (sound) into electrical energy (an audio signal).

BASIC MICROPHONE THEORY
Microphones can be classified with respect to several operating principles:
• Current induction in a coil
• Voltage change in a capacitor
• Accelerometer
• Change in resistance
The most common types, however, are the dynamic and the condenser microphone. Dynamic microphones are dependable, rugged, and reliable (compared to condenser microphones) and are used where physical durability is important. They are also reasonably insensitive to environmental factors, and thus find extensive use in outdoor applications. Figure 10 shows the construction of a dynamic (coil) microphone and Figure 11 that of a condenser microphone.

Dynamic microphone: The characteristic of a dynamic microphone is that a flexibly mounted diaphragm is coupled to a coil of fine wire. The coil is placed in the air gap of a magnet such that it is free to move back and forth within the gap. When sound strikes the diaphragm, the diaphragm surface vibrates in response. The motion of the diaphragm couples directly to the coil, which moves back and forth in the field of the magnet. As the coil moves in the magnetic field, a small electrical current is induced in the wire. The magnitude and direction of that current are directly related to the motion of the coil, and the current is thus an electrical representation of the sound wave. The main characteristics are:

• Moving coil with magnet (like a speaker)
• Requires no electrical power
• Generally more rugged than condenser microphones
• Generally not as sensitive as condenser microphones

Figure 10: In a dynamic microphone, a coil is mounted in the field of a permanent magnet. Sound waves move the coil back and forth, thereby inducing an electrical current that follows the sound wave.
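As a hedged illustration of the underlying physics (the numbers are assumed for illustration, not taken from any datasheet): the induced voltage follows Faraday's law for a conductor moving in a magnetic field, e = B * l * v, where B is the flux density in the air gap, l the length of wire in the coil and v the coil velocity. With, say, B = 1 T, l = 5 m of fine wire and a peak coil velocity of 0.01 m/s, the peak output voltage is e = 1 * 5 * 0.01 = 0.05 V, i.e. about 50 mV, which is why dynamic microphones need no external power but deliver relatively low signal levels.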

Condenser (electret) microphones: This type of microphone transforms sound waves into electrical signals by changing the distance between capacitor plates. The electret condenser microphone is the dominant choice for microphones used with computers because of its superior price/performance ratio and its small size. Sound waves cause the top plate to vibrate, which in turn alters the capacitance, resulting in a varying voltage. The electrical signal varies correspondingly with the frequency and amplitude of the sound waves. An external power supply is needed to measure the capacitance changes and to pre-amplify the signal. Some condenser microphones have a battery attachment that is either part of the microphone housing or on the end of the cable, as part of the connector. Condenser

16 www.acronymfinder.com


microphones are, as mentioned, also less durable than dynamic microphones. The main characteristics are:

• Moving diaphragm only
• More sensitive than dynamic microphones
• Requires power: traditional condenser microphones require a high-voltage power supply, while modern electret microphones require a battery
• Not as rugged as dynamic microphones

Figure 11: A condenser microphone measures a change in capacitance caused by varying the distance between two thin metal plates. This type of microphone requires a power source, e.g. a battery.
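A hedged sketch of the principle (an illustrative relationship, not from a specific datasheet): the capacitance of the plate pair is C = ε0 * A / d, where A is the plate area and d the plate distance. With a fixed charge Q on the plates, the output voltage is V = Q / C = Q * d / (ε0 * A), i.e. directly proportional to d; a sound wave that moves the diaphragm by 1 % of the gap therefore changes the output voltage by roughly 1 %.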

Microphone response: The construction of a microphone determines its behaviour, or response, with respect to the physical properties of a sound wave. A pressure microphone has a response that is proportional to the pressure in a sound wave, whereas a gradient microphone has a response that corresponds to the difference in pressure across some distance in a sound wave. The pressure microphone is a fine reproducer of sound, while the response of a gradient microphone is typically greatest in a certain direction, thereby rejecting undesired background sounds; gradient microphones are therefore direction sensitive.

Figure 12: The basic microphone design, independent of the physical measurement principle (condenser or dynamic). Most voice-based applications require that background noise is cancelled or attenuated and that the microphone captures the voice input clearly and with high fidelity. The main task of the microphone is to transform the sound wave into an electrical signal that ideally contains only the desired signal. The microphone must therefore deliver high quality


signals to the computer even in noisy surroundings (up to 100 dB), humid areas, and high-temperature zones, and should have EX17 features. In addition, it should be easy to use. Some commercially available products fulfil these requirements, but they are usually not made for use with a PC. These products are mainly developed in connection with VHF/UHF radios, to suit professions like smoke divers, police officers, etc. The conversion of such a VHF/UHF microphone to a PC microphone is in principle straightforward. However, its perceived performance will be considerably lower, because in VHF/UHF usage the human brain is able to detect spoken words buried in noise that is even much higher than the signal level. This means that the brain is capable of reconstructing a sentence based on just a few fragments. A computer does not have this capability yet; its performance depends strongly on the signal-to-noise ratio (defined below). Other equally important characteristics relevant to speech recognition are:
• Frequency bandwidth.
• Distortion.
• Echo and echo delay.
• Noise type (interference, reverberation, stationary background noise).
• Signal-to-noise ratio.
• Accelerations and movements.
• Positioning of the microphone on the body.
• Other characteristics, such as the mechanical effects that may occur when using a press-to-talk microphone.
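For reference (the standard definition, with illustrative numbers): the signal-to-noise ratio in decibels is SNR = 10 * log10(P_signal / P_noise) = 20 * log10(V_signal / V_noise). A speech signal of 1 V RMS over a noise floor of 0.1 V RMS thus gives 20 * log10(10) = 20 dB, while the 30-40 dB later quoted for a quiet office corresponds to a voltage ratio of roughly 30 to 100.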

3.1.2 Noise reducing measures

Noise or other undesired signals can be reduced in a variety of different ways.

Figure 13: Different approaches to noise reduction: (A) shielding, (B) microphone construction, (C) hardware signal processing, (D) software signal processing.

A. Shielding: Old socks, etc., i.e. material that absorbs certain frequencies of a sound. Shielding usually works better for high-frequency noise than for low-frequency noise.
B. Microphone construction: There exist two different types of noise cancelling microphones, namely:
• Acoustic Noise Cancelling Microphone (ANCM) (passive)
• Electronic Noise Cancelling Microphone (ENCM) (active)

→ Acoustic Noise Cancelling Microphone (ANCM): Both sides of an ANCM diaphragm are equally open to arriving sound waves, see Figure 14. The two port openings are a distance "D" apart. Because of this distance, the magnitude of the sound pressure from a nearby source is greater at the front than at the rear of the diaphragm, and the rear signal is slightly delayed in time. These two effects create a net pressure difference (Pnet = Pfront - Prear) across the diaphragm that

17 The component is certified for explosive areas


causes it to move. This signal is less contaminated with noise than that of a microphone with only one opening; noise cancelling is thus achieved by the design of the microphone itself.

Figure 14: Acoustic noise cancelling microphone.
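A rough, illustrative calculation of why this design favours the user's own voice (the distances are assumed for illustration): the sound pressure from a small source falls off roughly as 1/r. With the lips 2 cm from the front port and a port spacing D = 1 cm, the front-to-rear pressure ratio is about 3/2 = 1.5, giving a large Pnet; for a noise source 2 m away the ratio is only 201/200 ≈ 1.005, so Pnet is close to zero and the distant noise largely cancels.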

→ Electronic Noise Cancelling Microphone (ENCM, active): In active noise cancellation, a secondary noise source is introduced that destructively interferes with the unwanted noise. In general, active noise cancellation microphones rely on multiple sensors to measure the unwanted noise field and the effect of the cancellation. The noise field is modelled as a stochastic process, and an adaptive algorithm is used to estimate the parameters of the process. Based on these parameter estimates, a cancelling signal is generated. The challenge of this approach is that future values of the noise field must be predicted. The electronic (active) noise-cancelling microphone is built in the same way as the acoustic noise-cancelling microphone in that it measures the net pressure difference in a sound wave between two points in space. The characteristic of the active electronic noise-cancelling microphone is that it utilises an array of two "pressure" microphones arranged in opposing directions, with a spacing between the microphones that equals the port distance "D", as illustrated in Figure 15. A typical pressure microphone in such an array has the rear diaphragm port sealed to the acoustic wave front while the front is open. The result is that the diaphragm movement represents the absolute magnitude of the compression and rarefaction of the incoming sound wave, and not a pressure difference between two points. An array of two pressure microphones achieves noise-cancelling characteristics because the output signal of each microphone is electrically subtracted from the other by an operational amplifier. The output signal of the operational amplifier gives the "cleaned" sound signal (Oppenheim, Weinstein, Zangi, Feder & Gauger, 1994).


Figure 15: Electronic (active) noise-cancelling microphone.
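The adaptive-estimation idea described above can be illustrated with a minimal least-mean-squares (LMS) noise canceller. The sketch below is a generic textbook scheme written in Python, not the algorithm used in any of the commercial products discussed in this thesis; the filter length and step size are assumed values and would have to be tuned to the actual noise field and signal scale.

    import numpy as np

    def lms_noise_cancel(primary, reference, n_taps=32, mu=0.005):
        # primary:   microphone signal containing speech + noise
        # reference: second sensor measuring (mostly) the noise field
        # returns an estimate of the speech, i.e. primary minus predicted noise
        w = np.zeros(n_taps)                     # adaptive filter weights
        cleaned = np.zeros(len(primary))
        for n in range(n_taps, len(primary)):
            x = reference[n - n_taps:n][::-1]    # latest reference samples, newest first
            noise_estimate = np.dot(w, x)        # predicted noise at time n
            e = primary[n] - noise_estimate      # error = cleaned speech sample
            w += 2.0 * mu * e * x                # LMS weight update
            cleaned[n] = e
        return cleaned

With a well-placed reference sensor, the error signal converges towards the speech alone; if the reference also picks up the speech, the algorithm will start cancelling the voice as well, which is one reason such systems are sensitive to sensor placement.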



Comparison of noise cancelling microphone technologies18: Table 4 compares a passive and an active noise cancellation microphone (the ANC-100 from Andrea Electronics19 and the Nomad from Telex20). The active noise-cancelling microphone is more susceptible to electronic interference and fluctuating temperatures; it also costs more because of the "extra" electronics packed inside (two opposing omni-directional microphones).
Table 4: Comparison of boom microphone technologies.

Characteristic | Passive noise cancellation | Active noise cancellation
Number of microphones | One | Two
Microphone element type | Bi-directional | Omni-directional
Frequency response pattern | Same | Same
Product facts | Single element has inherent balance over all frequencies, temperature and time | Dual elements and electronics susceptible to system imbalances over changing frequencies, temperature and time
Noise cancellation approach | Acoustic | Electronic
Noise cancellation performance | Good in office surroundings | Bad in office surroundings
Susceptibility to electronic noise | None | Moderate
Voice recognition performance | Good in office surroundings | Bad in office surroundings
Cost | Approx. $30 | Approx. $60

C. Noise filtering hardware: Hardware filters are often used against stationary noise, e.g. for removal of 50 Hz hum or constant engine noise. Hardware filters can be designed to pass only frequencies below (low-pass), above (high-pass), or within a band around (band-pass) some specified frequency, thereby filtering out the noise outside that range. They are usually built up from operational amplifiers. The time-continuous signal is smoothed by the filter, resulting in a (still) time-continuous signal (Kuo, 1966).

Time-continuous microphone signal → Hardware filter → Time-continuous filtered signal
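As an illustrative example (the component values are assumed, not taken from any of the products below): a first-order RC high-pass filter has the cut-off frequency f_c = 1 / (2 * pi * R * C). Choosing R = 5.3 kΩ and C = 0.1 µF gives f_c ≈ 1 / (2 * pi * 5300 * 0.0000001) ≈ 300 Hz, which passes the speech band while attenuating 50 Hz mains hum and low-frequency engine rumble.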

COMMERCIALLY AVAILABLE PRODUCTS

18 Test from Speech technology Magazine January/February 1998 19 www.andreaelectronics.com. 20 www.computeraudio.telex.com



→ ClearSpeech-Microphone21: The ClearSpeech-Mic is digital noise reduction hardware that significantly reduces background noise.
Noise cancellation characteristics:
• 300 Hz to 3,400 Hz voice bandwidth
• Single tone noise reduction > 70 dB
• White noise reduction > 12 dB
Power supply:
• 9 to 24 V DC
• 0.5 W power consumption
Price: $129.00.

D. Noise filtering software: Powerful software algorithms can be written that perform tasks similar to those of hardware filters (low-pass, high-pass and band-pass); these filters require, however, that the time-continuous sound signal is first digitised, e.g. by an A/D converter. Directional microphones reduce both continuous and discrete noise events from "off-axis" locations. Processing the digitised microphone signal in software reduces continuous noise from all sources and directions, including any internal noise from the microphone and sound card circuitry. Noise reducing software and hardware usually assume a stationary noise source; if the noise characteristics change over time (non-stationary noise), more adaptive software is needed (Oppenheim & Schafer, 1975).
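As a minimal sketch of such a digital filter (parameter values assumed for illustration; this is not the algorithm of any product mentioned below), a fourth-order Butterworth band-pass covering the 300-3,400 Hz speech band could be applied to the digitised signal as follows:

    import numpy as np
    from scipy import signal

    def bandpass_speech(samples, fs=16000, low=300.0, high=3400.0, order=4):
        # samples: digitised microphone signal (1-D numpy array)
        # fs:      sampling rate in Hz
        nyquist = fs / 2.0
        b, a = signal.butter(order, [low / nyquist, high / nyquist], btype='bandpass')
        return signal.filtfilt(b, a, samples)    # zero-phase band-pass filtering

Such a fixed band-pass only removes noise outside the speech band; noise inside the band, e.g. background talk, requires adaptive or spectral methods of the kind described for the commercial products below.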

COMMERCIALLY AVAILABLE PRODUCTS

→ ClearSpeech technology22: ClearSpeech is an algorithm designed to remove background noise from speech and other transmitted digital signals. ClearSpeech improves communication through devices such as telephones and radios, and can be used to increase the performance of speech recognition programs. ClearSpeech can be implemented in real time on embedded chips or can be run on a PC under Windows. The algorithm is designed to remove stationary or near-stationary noise, i.e. noise with constant signal statistics, from an input signal containing both noise and speech, and it adapts as the background noise changes. Speech picked up by a PC's microphone can be impacted by background noise; NCT's ClearSpeech-PC/COM removes ambient noise from speech while it is being recorded, thereby dramatically improving receive-side intelligibility. ClearSpeech-PC/COM can be used in a variety of PC-based voice applications, including voice recognition, voice mail, Internet voice communications, real-time processing of voice from noisy environments and post-processing of previously recorded voice. The characteristics are:
• Continuous and adaptive removal of background noise from speech
• Up to 20 dB signal-to-noise improvement
• Programmable noise reduction parameters
• Includes application software to invoke ClearSpeech-PC/COM while recording and to process previously recorded audio files
• Integrated with Windows® 95 or NT Audio Compression Manager

Noise cancellation characteristics:
• 300 Hz to 3.85 kHz voice bandwidth
• Single tone reduction > 70 dB

21 www.nct-active.com/csmicr.htm 22 www.nct-active.com/



• White noise reduction > 12 dB
In the following, a test performed by Defence Group Incorporated (DGI) is quoted (Grover & Makovoz, 1999).

Figure 1623: Noisy speech recorded simultaneously with noise cancelling microphones24 from different vendors, before (red) and after (black) software noise reduction. In both cases the software was able to remove the noise recorded by the microphones satisfactorily. The two microphones had quite different outputs, with distorted speech input signals, but the software caused no added distortion in either case.

Speech-to-text dictation requires a very low error rate in order to be accepted. Even with a headset microphone and in a "quiet" office, the background noise limits the achievable error rates for large-vocabulary dictation systems. This is shown by testing the VoiceType dictation system from IBM, both with and without software noise processing. Tests were done in an office, with no background speakers, using an Andrea ANC-600 headset microphone. Two different recordings were made: one for enrolment, another for testing. The speech-to-noise ratios (SNRs) were 30-40 dB. One copy of VoiceType was trained and tested with no software noise reduction, and another was trained and tested with software noise processing. Training and testing without the software noise reduction gave 76 errors in 1009 words of spoken text. Training and testing with added software filtering gave 22 errors out of the same 1009 words of spoken text. The results are summarised in Table 5 below.

Table 5: Test results of voice filtering with IBM VoiceType.
IBM VoiceType | No filtering | Filtering
Error rate in quiet office | 7.5 % | 2.2 %
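These percentages follow directly from the counts above: error rate = errors / words, i.e. 76 / 1009 ≈ 7.5 % without filtering and 22 / 1009 ≈ 2.2 % with filtering, a reduction of the error rate by roughly a factor of 3.5.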

By using more restricted vocabularies and grammars, voice command systems tend to be more noise resistant than large-vocabulary dictation systems. Yet they often must tolerate very high noise levels in automotive, industrial, military and other applications. The gains from combining noise reduction software with a commercial voice command system are evident. Data was recorded for 100 speech commands in a "quiet" environment25. An extended test set was then prepared by mixing in a range of noise levels and types from various vehicle and industrial environments. Enrolment using only the "quiet" data was then done both with and without software noise

23 pictures taken from http://www.ca.defgrp.com/n_test.html#microphones 24 (A) The Andrea ANC-600, and (B) the Shure 10A. 25 Test performed by Defence Group Incorporated (DGI) Signal and Image Processing Group: http://www.ca.defgrp.com/.


Testing was then done, again both with and without filtering included, now using the extended test data set with added noise mixing. For voice command applications a false positive response is a critical failure, actually worse than no response at all. Table 6 summarises both correct responses and false positives versus the input speech-to-noise ratio, with and without noise reduction by software processing.

Table 6: Results with and without noise filtering.
Voice command       No noise filtering             With noise filtering
SNR (dB)            Correct    False positives     Correct    False positives
35                  100 %      0 %                 100 %      0 %
20                  88 %       4 %                 99 %       1 %
10                  6 %        4 %                 77 %       0 %

Table 7 shows the results when the processing was used only for testing, not for enrolment. Enrolment here (as above) used the "quiet" enrolment data, but no filtering. Testing was again done on the noisy test data, both with and without noise removal filtering. The performance gains in this case, from using noise removal filtering only during testing, are not as good, since there was inevitably some noise in the basic training data, where the processing was not used, while the corresponding (and larger) noise perturbations were removed only during testing. Even so, the processing still provides appreciable benefits in a noisy environment, even if it is not used in the initial enrolment.

Table 7: Results with and without noise filtering in testing.
Voice command       No noise filtering             With noise filtering
                    (training and testing)         (final testing only)
SNR (dB)            Correct    False positives     Correct    False positives
35                  100 %      0 %                 100 %      0 %
20                  88 %       4 %                 99 %       1 %
10                  6 %        4 %                 58 %       1 %
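For illustration, the following sketch shows how a noisy test set at a given speech-to-noise ratio could be prepared by mixing recorded noise into the "quiet" command recordings. It is a generic construction under assumed array inputs, not DGI's actual test procedure.

```python
# Minimal sketch of preparing a noisy test set at a target speech-to-noise ratio (SNR).
# This is a generic illustration, not DGI's actual test procedure.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals the target SNR, then mix."""
    noise = np.resize(noise, speech.shape)          # repeat/trim the noise to the speech length
    p_speech = np.mean(speech.astype(float) ** 2)
    p_noise = np.mean(noise.astype(float) ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

# Example: build test material at the SNRs used in Tables 6 and 7.
# speech, noise = ...  # 1-D numpy arrays at the same sampling rate
# noisy_10db = mix_at_snr(speech, noise, 10)
```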

3.1.3 Body placement
The location of the microphone during recording plays an important role. For instance, a microphone placed directly under the nose or mouth will capture a lot of breathing sounds and thus contaminate the signal unnecessarily. There are several options for placing the microphone on the body, see Figure 17.


Figure 17: Different options for placing the microphone on or near the body. The best placement of the microphone is one that does not obstruct the user.

A skull-mounted microphone appears very appealing for various reasons: it does not obstruct the user, it is small, it weighs as little as 40-50 g, and it has a high signal-to-noise ratio. The skull microphone can be placed inside the helmet, so that when the user puts on his helmet, he "puts on" his microphone as well. The headphone is automatically connected with the microphone and is also positioned inside the helmet. This communication system is based on the principle of bone transmission and does therefore not obstruct the face. Due to its placement inside the helmet, the influence of outside noise is considerably decreased. Clear and easy communication is possible even while wearing a mask or a breathing apparatus, and its use is simple and comfortable. Further, some users might not be comfortable talking to a machine and may not accept a headset microphone easily; a microphone attached to the helmet may therefore be more readily accepted.

3.1.4 Commercially available microphones
(a) Active noise cancelling boom microphones

EMKAY26: offers an RF wireless headset, a single-channel, full-duplex system with a transmit range greater than 10 m. The headset can operate for up to 10 hours between recharges. Its lightweight construction and ear-loop design ensure a comfortable fit. The headset has been designed for use in PC voice recognition, computer telephony, and Internet telephony.

ANDREA27: Andrea Electronics Corporation offers an Active Noise Reduction (ANR) earphone, an Active Noise Cancellation (ANC) near-field microphone, and the patented Digital Super Directional Array (DSDA™) and Directional Finding and Tracking Array (DFTA™) far-field microphones.

26 www.emkayproducts.com. 27 www.andreaelectronics.com.

Page 24 DNV NORWEGIAN UNIVERSITY OF SCIENCE AND TECHNOLOGY

EVALUATION OF COMMERCIAL AVAILABLE SYSTEM COMPONENTS

TELEX28: offers a USB29 digital microphone for speech dictation applications. This microphone delivers a pure digital signal to the speech recogniser and eliminates the performance variations inherent in analogue sound cards. The headset also includes acoustic noise cancellation technology to cancel background noise that can degrade speech recognition performance. Speech recognition software performance depends greatly on the quality of the audio signal, and software developers have had a difficult challenge dealing with the wide variations in performance and quality of analogue sound cards. With a USB interface, the voice signal bypasses the sound card and is fed directly in digital form to the USB bus. Table 8 below is an independent judgement of an active noise cancelling microphone (ANCM).

Table 8: Pros and cons of active noise cancelling boom microphones.
Pros:
• Commercially available for connection with a PC
Cons:
• Not certified for Ex environments
• Does not function in extreme areas
• Does not stand rough use
• Not user friendly
• Consists of many components

(b) Skull-mounted microphone
Two manufacturers of skull microphones are CGF Gallet30, a French producer, and Ceotronics31, a German producer. Their microphone characteristics are similar:
• Measuring principle: accelerometer with a sensitivity of 1 mV/mG; bandwidth from 20 Hz to 20 kHz.
• Amplifier: bandwidth from 300 Hz to 3 kHz at -3 dB.
• Weight of head equipment: 55±2 g.
• Electrical tightness: IP 54 cover.
• Has Ex (explosion safety) features.

Table 9: Pros and cons of the skull-mounted microphone.
Pros:
• Certified for Ex environments
• Light weight
• Functions in extreme32 areas
• Stands rough use
• User friendly
Cons:
• Requires use of a helmet
• Not commercially available for connection with a PC
• Rather expensive (approx. 4000 Nkr)
• The microphone from Ceotronics needs an amplifier

28 www.computeraudio.telex.com 29 Universal Serial Bus 30 http://www.gallet.fr 31 http://www.ceotronics.de

(c) Throat microphone
The throat microphone uses an "indirect air-borne vibration" technique, picking up the vibration energy generated at the skin near the vocal cords. It features high isolation not only from environmental noise but also from the frictional vibration sound generated by the microphone head. These microphones can even be made water, dust, and corrosion resistant, and they provide clear communication when wearing breathing apparatus or in very high noise environments. A dual-slope band-pass filter circuit rejects the unwanted low-frequency body resonance.
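As an illustration of the kind of voice-band filtering described above, the sketch below applies a digital band-pass filter that attenuates low-frequency body resonance and out-of-band noise. The actual filter in the throat microphone is an analogue circuit whose exact characteristics are not specified here; the 300 Hz and 3 kHz cut-offs are assumptions borrowed from the amplifier bandwidth quoted for the skull microphone.

```python
# Minimal sketch of a digital voice-band band-pass filter. The cut-off frequencies
# and sampling rate are illustrative assumptions, not the specification of any
# particular microphone product.
from scipy.signal import butter, sosfilt

def voice_bandpass(signal, fs, low_hz=300.0, high_hz=3000.0, order=4):
    """Attenuate low-frequency body resonance and high-frequency noise outside the voice band."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, signal)

# Example usage with an assumed 16 kHz sampling rate:
# filtered = voice_bandpass(raw_samples, fs=16000)
```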

Table 10: Pros and cons of the throat microphone.
Pros:
• Light weight
• Noise cancelling
• Functions in extreme33 areas
• Stands rough use
• Ideal for wearing under protective or hazardous-material clothing
• Total hands-free operation
• VOX or PTT activation
• Provides clear audio in high noise environments
Cons:
• Not certified for Ex environments
• Not "user friendly", since the throat microphone must be taken on and off
• Not commercially available for connection with a PC
• Consists of many components

32 Areas with noise, humidity and high temperatures 33 Areas with noise, humidity and high temperatures


Table 11: The different microphone products compared with each other.
Consideration   Noise cancelling boom microphone        Skull-mounted microphone               Throat microphone
Performance     Bad                                     Good                                   Good
Price           Cheap                                   Medium                                 Medium
Easy use        No                                      Yes                                    No
Other           Available for use with speech           N.A. for use with speech               N.A. for use with speech
                recognition applications                recognition applications               recognition applications

Table 12: Evaluation of commercially available microphones as input devices for speech recognition.
Advantages:
• Faster than a keyboard and mouse.
• Does not require use of the hands.
• Can improve performance in hands-busy (maintenance) and eyes-busy (inspection) tasks.
Disadvantages:
• When the user is working with co-workers, a cue system is needed to let the computer know when an utterance is intended for the computer rather than for the co-worker.
• May need a press-to-talk switch, requiring a hand.
• Use of "bracket words" requires some training.
• High background noise levels can cause inaccurate word recognition and false inputs, so a backup input device may be required.
• Behavioural states (e.g. anxiety, stress) and task loading can affect human voice characteristics and degrade interactive speech system performance.
• Prompts are required when assistance is needed to recall the appropriate procedure in a given situation; interfaces must be developed to prompt users when the vocabulary used by the system is beyond the user's recall.
• Feedback must be presented to the user when spoken words are not understood (this takes up valuable display space).
• Recognition rates of 95 % mean one error in 20 words; easy, quick ways to correct errors must be developed.
• Specifying a position in a two-dimensional space is difficult; a pointing device is needed for such tasks.
• When using a speech synthesis system, it can be difficult (or impossible) to interrupt the system while it is "speaking" during the recognition process.
• Annoyance when the recognition is not correct.

3.2 Speech recognition software

Speech recognition systems are error-prone and not very robust to real-world disturbances, such as ambient background noise (including speech from other surrounding speakers), communication channel distortions, pronunciation variations, speaker stress, or the effects of spontaneous speech. Current speech recognition systems can be categorised in two groups according to their robustness level.

Applications falling into the first category typically have large vocabularies and have been designed to recognise continuous speech. Successful use of these systems requires efficient minimisation of all possible interference sources. High quality audio equipment, including a close-talk microphone, a noise-free operating environment and, particularly, a co-operative and motivated user are needed in order to achieve high recognition performance. Dictation systems for continuous speech are typical examples of this category.

Truly robust speech recognition applications form the second group. Robust systems can cope with distorted speech input (to a certain extent) and still provide high recognition accuracy, even with inexperienced novice users. These systems can usually recognise only discrete words, and their vocabulary size is limited to some tens of words. A good example of a robust system is a speaker-dependent name recognition application for voice dialling. In such a system the user has trained voice-tags, i.e. names that have phone numbers attached to them; by speaking a certain voice-tag, a phone call to the attached number is made. Because of this apparent simplicity, name dialling is very useful for example in a car environment, where the user's hands and eyes are busy.

It is important to note that speech recognition alone does not have any particular value: to use speech as an input modality, there must always be some practical advantage. Furthermore, one cannot overestimate the importance of a good user interface; it is essential that speech recognition applications are extensively tested with real users under realistic operating conditions.

All modern speech recognition systems follow roughly the same basic architecture, as shown in Figure 18. The task of a speech recognition system is to transform the digital speech signal into a discrete, editable word or sequence of words. This transformation process consists of several steps (Viiki, 1999):
1. First, a time-continuous, digital microphone signal is converted into a sequence of discrete acoustic observations.
2. Then the actual recognition process makes use of three different knowledge sources: the acoustic models, the lexicon, and the recognition algorithm. The algorithm extracts individual blocks that with high probability represent single words.
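Step 1 above, the conversion of the digital microphone signal into a sequence of discrete acoustic observations, can be sketched as follows. The feature choice (log power spectra per frame) is a simplification; commercial recognisers typically use more elaborate features, and the frame sizes are assumptions.

```python
# Minimal sketch of step 1 above: turning a digital microphone signal into a sequence
# of discrete acoustic observations (here simple log power spectra per frame).
import numpy as np

def acoustic_observations(samples, frame_len=400, hop=160):
    """Split the signal into overlapping frames and compute one feature vector per frame."""
    window = np.hamming(frame_len)
    observations = []
    for start in range(0, len(samples) - frame_len, hop):
        frame = samples[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        observations.append(np.log(power + 1e-10))   # log compression, avoid log(0)
    return np.array(observations)                     # shape: (n_frames, n_bins)

# At a 16 kHz sampling rate, frame_len=400 and hop=160 correspond to 25 ms frames every 10 ms.
```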

[Figure 18 block diagram blocks: Speech Database, Training, Acoustic Models, Recognition]

Figure 18: A block diagram of a speech recognition system.

The actual "speech recognition algorithm" is the module dealing with individual words. Designers usually focus on three major elements: the vocabulary (complexity, syntax, and size), the environment (bandwidth, noise level, and distortion type), and the speakers (stressed/relaxed, trained/untrained). Language-specific models (typical sentence constructions, phrases, grammar, etc.) are used to identify different word types (substantives, adjectives, verbs, etc.), and each recorded word is compared with the word signatures in the corresponding vocabulary databases.

A number of voice recognition systems are available on the market. The most powerful can recognise thousands of words. However, they generally require an extended training session during which the computer system becomes accustomed to a particular voice and accent; such systems are said to be speaker dependent. Many systems also require that the speaker speaks slowly and distinctly and separates each word with a short pause; these systems are called discrete speech systems. Recently, great strides have been made in continuous speech systems, i.e. voice recognition systems that allow the user to speak naturally, and there are now several continuous-speech systems available for personal computers.

Because of their limitations and high cost, voice recognition systems have traditionally been used only in a few specialised situations. For example, such systems are useful when the user is unable to use a keyboard to enter data because his or her hands are occupied or disabled; instead of typing commands, the user can simply speak into a headset. Increasingly, however, as cost decreases and performance improves, speech recognition systems are entering the mainstream.

Important characteristics of a speech recognition program


The bottom line for speech recognition software is speed and accuracy. If the software cannot decipher what was said correctly, it is not usable; likewise, if the deciphering process takes too long, nobody will use it. Speech recognition programs can, however, do more than take basic dictation, and there are a number of other features that can make these packages productivity-enhancing tools. Further important characteristics are:
→ Set-up and training: Wizards that assist in setting up the system and help the user get started are considered important. The figure to the right shows the Dragon NaturallySpeaking 3.0 user wizard, used to adjust the volume, measure sound quality, and train the program. To improve accuracy further, documents containing common words can be imported.
→ Editing and formatting: Getting the words onto the screen is only half the job; how well speech recognition software handles editing and formatting is also critical. The way modeless operation and natural-language support are achieved influences the user friendliness. For example, when NaturallySpeaking stumbles on a homonym34, one should simply be able to repeat the word and have the program select the alternative.
→ Application integration: In the future, speech recognition will take place in the background and speech will become simply another way of interacting with the PC. As a precursor to that, vendors have been developing tight links between their speech recognition programs and the applications commonly used every day, especially word processors. In this example, we dictated directly into Word using L&H Voice Xpress Plus; the Command Browser shows all the variations that can be used to insert a table into Word.
→ Command-and-control: Although continuous speech dictation is a relatively recent development, command-and-control applications have been around for years. A "What Can I Say" command should list the commands that are available at any point in e.g. Windows.

34 Homonym (two words are homonyms if they are pronounced or spelled the same way but have different meanings)


3.2.1 Principles
Speech recognition and Natural Language Processing (NLP) systems are complex pieces of software (Raghavan, 1998), and a variety of algorithms are used in their implementation. Speech recognition works by disassembling sound into small units and then piecing them back together, while NLP translates words into ideas by examining context, patterns, phrases, etc.

Speech recognition works by breaking down the sounds the hardware "hears" into smaller, non-divisible sounds called phonemes. Phonemes are distinct units of sound. For example, the word "those" is made up of three phonemes: the first is the "th" sound, the second the hard "o" sound, and the final phoneme the "s" sound. A series of phonemes makes up syllables, syllables make up words, and words make up sentences, which in turn represent ideas and commands. Generally, phonemes can be thought of as the sound made by one or more letters in sequence with other letters. When the speech recognition software has broken sounds into phonemes and syllables, a "best guess" algorithm is used to map the phonemes and syllables to actual words.

Once the speech recognition software has translated sound into words, the natural language processing software takes over. The NLP software parses strings of words into logical units based on context, speech patterns, and more "best guess" algorithms. These logical units of speech are then parsed and analysed, and finally translated into actual commands the computer can understand, based on the same principles used to generate the logical units. Optimally, speech recognition and NLP software work with each other non-linearly in order to facilitate better comprehension of what the user says and means. For example, a speech recognition package could ask an NLP package whether it thinks the "tue" sound means "to", "two", "too", or whether it is part of a larger word such as "particularly". The NLP system could make a suggestion to the speech recognition system by analysing what seems to make the most sense given the context of what the user has previously said. Speech recognition systems may determine which sounds or words were emphasised by analysing the volume, tone, and speed of the phonemes spoken by the user and report that information back to the NLP system.

3.2.2 Recognition enhancing measures
Using speech recognition systems in real working surroundings reveals that they do not perform as stated by the sales agent. A number of factors determine the recognition rate (Allen, 1992). The major requirements relate to:
• Vocabulary, speech and language modelling.
• Training material (if needed), the data collection platform, and pre-processing procedures.
• Speaker dependency and speaking modes.
• Environment conditions (ambient conditions).
In the following, I will discuss the main factors influencing the recognition process. A speech recogniser is based on some form of speech modelling using various paradigms; the best known are Dynamic Time Warping (DTW), Hidden Markov Modelling (HMM), and Artificial Neural Networks (ANNs). A minimal sketch of the DTW idea is given at the end of this subsection. Most of the approaches distinguish two phases: a training phase and an exploitation phase. The first phase is devoted to learning speech characteristics from data:
1) Acoustic wave forms.


2) Phonetic/linguistic descriptions.
3) Specific features, etc.

Speaker dependency
The system may either be tailor-made for a particular speaker (speaker dependent) or designed to tolerate a large variety of speaker variability (Stern et al., 1996). Other systems may be tuned to the voice of a particular single speaker or to a set of speakers (multi-speaker systems). They may also be tuned to trained rather than general, untrained speakers. In order to achieve a higher recognition rate, a training phase on a pre-specified set of words/sentences is usually needed. This required training set is described in terms of:
1. Type of data.
2. Speech acoustic wave forms.
3. Acoustic data with phonetic labelling.
4. Acoustic data with the corresponding orthographic forms.
5. Acoustic data with the corresponding phonetic transcription.
6. Acoustic data with the corresponding recognition-unit transcription, etc.
7. Size of data (how many hours/minutes of speech).
8. Number of speakers and how they are selected.
9. Other characteristics (sex, age, physical and psychological state, experience, attitude, accent, etc.).
10. Acquisition channels (single microphone, set of microphones, a single telephone handset or as many handsets as possible).
11. Environment conditions (noisy, quiet, all conditions, etc.) and constraints derived from the operating conditions.

A speaker-adapted system learns the characteristics of the current speaker and thereby continuously improves its performance. At the beginning the system may be used in a degraded mode (either speaker-independent or speaker-dependent), ending up as an optimised speaker-adapted system. The adaptation should be done by the person who is going to use the system, in order to tune in on his specific voice signature. Usually two approaches are used:
→ Static adaptation: The static adaptation process starts with off-line learning from pre-recorded data and a training phase before the system is used. The system references are adapted to the new speaker once and for all. The duration of this process matters: it can be real-time or last for hours. The speech data needed can be acoustic data without any manual labelling or pre-processing, or it may have to be labelled (orthographically plus phonetically). The speech corpus may range from a few minutes of speech to a few hours.
→ Dynamic adaptation: The system learns the current speaker's characteristics while the speaker is using the system. This may be done by manually correcting errors during the adaptation, or the system may automatically take into consideration the speech data uttered by the present speaker.

Three speaking modes can be distinguished, each characterised by a different recognition performance:
• Isolated: The words are pronounced in isolation with pauses between successive words; this gives the highest recognition rate.
• Connected: Usually used when spelling names or giving phone numbers digit by digit; a lower recognition rate is achieved.
• Continuous: Fluent speech, the mode with the lowest recognition rate.
The speaking rate varies from one speaker to another and depends on various factors such as stress, culture and emotion; it can be slow, normal, or fast, and a measure of it is the average number of speech frames within a given set of sentences.
Non-speech sounds, such as coughing, sneezing or clearing one's throat, may represent a challenge for the software. These non-linguistic utterances must be considered as part of the speech modelling.
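Of the modelling paradigms mentioned above, Dynamic Time Warping is the simplest to illustrate. The sketch below computes a DTW matching cost between an utterance and a stored word template, both given as sequences of acoustic feature vectors; it is a textbook formulation, not the algorithm of any particular commercial product.

```python
# Minimal sketch of the Dynamic Time Warping (DTW) paradigm mentioned above:
# compute an alignment cost between two sequences of acoustic feature vectors
# (e.g. a spoken word and a stored template), tolerating differences in speaking rate.
import numpy as np

def dtw_distance(template, utterance):
    """Classic DTW with Euclidean frame distances; a lower cost means a better match."""
    n, m = len(template), len(utterance)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - utterance[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# In an isolated-word recogniser the utterance is compared against one template per
# vocabulary word, and the word with the smallest DTW distance is chosen.
```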


The lexicon size
The size of the vocabulary is one of the main characteristics of automated speech recognition (ASR). The vocabulary size has a dual effect: if it is large, the user has greater freedom (quality of free speech), but at the same time both the recognition speed and the recognition rate are degraded. The vocabulary may consist of a small set of words (about 10 words), a medium-size set (from 10 to 100), a large set (from 100 to 1000) or a very large set (over 1000 words). The vocabulary may be seen as a single dictionary or divided into several sub-lexicons downloaded to the application depending on the dialogue phase. Almost all speech recognition systems, such as IBM ViaVoice, have a predefined lexicon that is, however, extendable. A speech recognition system applied in the shipping industry must handle maritime terms and abbreviations.

Lexicon generation
The recognition process can either use a set of words (global approach) or a set of sub-word units to identify the user's utterances. If the system uses whole-word models (global approach), these have to be learned in advance by the system; this additional vocabulary has to be recorded and used for training for each different application (fixed vocabulary). In the case of sub-word units, the speech units are learned once and for all, and the vocabulary lexicon is generated as a concatenation of such units (flexible vocabulary). During the thesis I scanned a number of existing survey reports and translated them with OCR software into editable text. Ship-specific vocabulary was then filtered out manually, see Appendix C.

Grammar
The most widely used grammar model in speech recognition systems is the so-called trigram model. In this model, a word is predicted based solely upon the two words immediately preceding it (a minimal counting sketch of the trigram idea is given after the list below). The simplicity of the trigram model is its greatest strength, since trigram statistics can be estimated by counting over millions of words of data. The implementation of the model involves only table lookups and is thereby computationally efficient and usable in real-time systems. This model captures the relations between words by the sheer force of numbers. It ignores, however, the rich syntactic and semantic structure that constrains natural language but also allows it to be easily processed and understood by humans. Further advances in speech recognition will be based on computational methods for predicting and analysing natural language data to a greater extent than the methods used in today's systems. A new statistical model for language modelling has been proposed which preserves the strengths and computational advantages of trigrams, but also incorporates long-range dependencies and more complex information. This approach is based upon the ideas of probabilistic link grammar. These techniques may improve the predictive power by naturally incorporating trigrams into a unified framework for modelling long-distance grammatical dependencies. Moreover, the methods are computationally efficient, which will allow them to be used in actual natural language systems on today's computers. The result of this approach may be significant in three different ways:
• First, the work allows the construction of language models that have greater predictive power than those constructed by current methods.
• Secondly, it is expected that it will deepen the understanding of the technical foundations of this area of computer science.
• And finally, when the methods are incorporated into the speech recognition, translation, and understanding systems at both Carnegie Mellon and IBM, the approach can be quantitatively measured, not only in terms of entropy but also in terms of the improvement it brings to speech recognition applications.
The information above is a summary of the article "A robust parsing algorithm for link grammars" (Grinberg et al., 1995).
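The trigram idea referred to above can be illustrated with a minimal counting sketch. It only gathers raw trigram counts and predicts the most frequent continuation; real systems add smoothing for unseen word pairs, and the example sentences below are invented.

```python
# Minimal counting sketch of the trigram model described above: the next word is
# predicted from the two words immediately preceding it, using counts from training text.
from collections import Counter, defaultdict

def train_trigrams(sentences):
    """Count how often each word follows each pair of preceding words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            counts[(w1, w2)][w3] += 1
    return counts

def predict_next(counts, w1, w2):
    """Return the most frequent next word after the pair (w1, w2), if the pair was seen."""
    following = counts.get((w1.lower(), w2.lower()))
    return following.most_common(1)[0][0] if following else None

# Example with made-up survey phrases:
model = train_trigrams(["rudder found in order", "steering gear found not in order"])
print(predict_next(model, "found", "in"))   # -> "order"
```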

3.2.3 Commercially available products


In 1989 the first dictation system from Dragon came on the market (DragonDictate SST), with a recognition rate of 12 words per minute spoken discretely, at a cost of about $10,000. Since then, tremendous progress has been made in the development of automatic speech recognition systems. In 1999, Dragon Systems and several other vendors, including IBM and Lernout & Hauspie, offer continuous automatic speech recognition software packages for under $50. In the following, the three most commonly used speech recognisers are presented: IBM's ViaVoice, Dragon NaturallySpeaking and L&H VoiceXpress.

Figure 19: The requirements for voice recognition in DNV follow the same chain as the basic speech recognition chain, but include a final reporting step into NAUTICUS.

DNV's organisation is scattered throughout the world, covering over 100 nations and 80 different languages. Any voice recognition software has to cope with many different pronunciations and ways of applying grammar. DNV thus has several difficult constraints and requirements on voice recognition software. Each surveyor has his individual way of working and of speaking. Furthermore, new technology is often met with resistance due to perhaps irrational fears. The following list states the major evaluation criteria for relevant speech recognition software:
• System requirements.
• User friendliness.
• Vocabulary, speech and language modelling.
• Recognition rate, speaker dependency, environment conditions (ambient conditions).
• Speaking modes.
• Price.
System requirements: hardware requirements, memory, sound card, etc.
User friendliness: Determining user friendliness is usually a rather subjective task; often it refers to the general impression of the software and its performance. More objective evaluation criteria could be: measuring the time to learn specific tasks with or without documentation, the number of steps needed to make one's own shortcuts or to calibrate the microphone, etc.
Vocabulary, speech and language modelling: Evaluation criteria could be the ability to define one's own vocabulary and grammar, and whether this can be achieved dynamically or not. Words typically used in the shipping industry should be implementable in the software's main lexicon.
Recognition rate, speaker dependency, and environment conditions: The recognition rate and speaker dependency are perhaps the major evaluation criteria. The recognition must be satisfactory also in difficult surroundings such as an engine room or a wind-blown deck. The evaluation is based on the recognition rate in per cent (the number of words that are recognised by the system), the lingual restrictions (e.g. "Indian English" vs Oxford English) and the recognition rate in difficult surroundings (background noise, humid areas, etc.).


Speaking mode: The speaking mode could be evaluated in terms of a limited-vocabulary mode and a dictation mode.

IBM VIAVOICE35
System requirements: Pentium/166 PC or better with MMX and 256K L2 cache, 32MB RAM (48MB for Win NT or for natural-language integration with MS Word 97), 180MB hard disk space, 16-bit sound card, Windows 95, 98, or NT 4.0.
User friendliness: ViaVoice is intelligently designed and easy to learn. Once the set-up is completed, a wizard provides a short tour to help configure the microphone and speakers. A 30-minute Quick Training module gives an introduction. The system learning consists of reading several texts, followed by processing of the voice information. The system supports multiple users and multiple enrolments per user, and the working vocabulary can be transferred between machines. Modeless operation allows switching between dictation, correction, navigation, and formatting mode. To correct an unrecognised word, you select it and say "Correct this"; a list of alternatives appears in the correction window, and you can select the appropriate one by voice. If the right word is not there, you can type it in a text box or begin spelling it. You can also change the format of a word from the correction window. Words that you change during correction are automatically added to your active vocabulary if needed, and you can add words en masse by importing files into ViaVoice's Vocabulary Expander.
Vocabulary, speech and language modelling: The base active vocabulary is 64,000 words, with a maximum active vocabulary of 128,000 words and a backup dictionary of 260,000 words. The program automatically prompts you to add unknown words, and you can import documents. The system supports multiple users and dictionaries. The software supports macros for commonly used phrases, and the macros can be more than one line. Modeless operation is supported, provided you pause before and after a system command, and the user can correct words by spelling. ViaVoice dictates into any application that accepts text, and natural-language commands work in both ViaVoice's SpeakPad and Word 97.
Recognition rate, speaker dependency, and environment conditions: The main disappointment with ViaVoice is that it did not yield a high level of recognition accuracy. An average accuracy score of approximately 70 per cent was achieved even after extensive system learning. Another concern is that the accuracy varied a lot. Despite its modeless operation and intuitive handling of ambiguous words, ViaVoice's throughput ranked near the bottom of the tests.
Price: $150, including the Andrea Electronics NC-80 microphone headset.
Conclusions: IBM ViaVoice 98 offers some compelling features, but until its accuracy improves it will remain a program that is better at giving orders to your PC than at taking down what you say. (Test results from PC Magazine, 12 November 1999.)

35 www-4.ibm.com/software/speech/


L & H VOICE XPRESS36
System requirements: Pentium/166 PC with MMX, 40MB RAM (48MB for Win NT), 130MB hard disk space, 16-bit SoundBlaster-compatible sound card (L&H has a list of compatible cards), Windows 95, 98, or NT 4.0.
User friendliness: The program has several positive features, such as modeless operation (switching between dictation and commands), which allows both dictation and commands at the same time. A wizard configures the hardware. During enrolment, you read a chapter of a book; the entire process took over an hour in the test. It has to be noted that one cannot return to the enrolment phase later to enhance accuracy. The program also includes a speech synthesiser for text-to-speech conversion (in a female computer voice); this feature could be used, for example, to read e-mail aloud. A voice-activated "command browser" allows one to see which commands are available at any given point. Another feature of the software is its ability to turn on a small control/task bar at the top of the screen, which allows one to dictate into every Windows application, and even into a DOS window. The correction window available in Windows applications can be turned on whenever text is selected, a potentially useful feature for correction. The correct word can be spelled using the standard alphabet, with the letters dictated into the correction box. Alternative word choices are listed in the correction dialogue box and may be chosen by saying "take one", "take two", etc. The software also supports multiple users/dictionaries.
Vocabulary, speech and language modelling: The base active vocabulary is 30,000 words, with a maximum active vocabulary of 60,000 words and a backup dictionary of 230,000 words. New words can be added with the Vocabulary Tool, and documents can be imported. The software supports macros for commonly used phrases, and the macros can be more than one line.
Recognition rate, speaker dependency, environment conditions: The recognition accuracy depends on the lexicon size. L&H VoiceXpress has a good recognition rate (99 %) when using a specialised lexicon. Using the standard lexicon, the recognition rate is dramatically reduced: a fluent Oxford English speaker had a recognition rate of approx. 75 %, and for speakers with foreign or heavy regional accents it dropped further, to about 40 %. There is an optional setting that allows the recognition to be faster and less accurate, or slower and more accurate. Slower and more accurate is often preferable, since one can dictate at length and then go back and make corrections after the recognition process has taken place. Faster recognition presumably requires a fast computer processor.
Price: The program is available in "Standard", "Advanced" and "Professional" editions, in a price range from about $50 to $130. The Standard version is minimal. The Advanced version allows one to work within Microsoft Word, contains additional vocabulary, and allows the use of macros and transcription from recorded material. The Professional edition includes all the features of the Standard and Advanced editions.

36 www.lhs.com


Conclusions: The program comes with an audio-visual teaching module and online help, and I expect that a computer user could master it with the help available in the program. The advantage of this product is that some of its user interfaces are kept simple, but in our opinion it is too complex because of its multiple options and depth. L&H can be recommended for professional use, but for unskilled users the program will not be satisfying. Like all the other programs, one needs to see how it performs with continued learning over time.

DRAGON NATURALLYSPEAKING37
System requirements: Pentium/133 PC or better, 32MB RAM (48MB for Win NT), 180MB hard disk space, 16-bit SoundBlaster-compatible sound card, Windows 95, 98, or NT 4.0.
User friendliness: Dragon NaturallySpeaking has several positive features, such as modeless operation (switching between dictation and commands), dictation into most Windows applications, integration with Microsoft Word (basic command-and-control, but without full control over the Word 97 menus) and Corel WordPerfect, improved natural-language support, and a host of hands-free editing improvements. Like the other speech recognition programs mentioned above, a wizard configures the hardware and trains the system to recognise your voice in about 4 minutes. To improve accuracy, you can read from one of four popular works for about 30 minutes. The software also supports multiple users/dictionaries. NaturallySpeaking's New User Wizard guides you through the process of creating a new speech file, configuring the microphone headset, and training NaturallySpeaking to recognise your voice.
Vocabulary, speech and language modelling: The base active vocabulary is 30,000 words, with a maximum active vocabulary of 62,000 words and a backup dictionary of 230,000 words. New words can be added with the Vocabulary Editor, and documents (text files) can be imported using the Vocabulary Builder. The software supports macros for commonly used phrases; the macros can, however, only be one line long.
Recognition rate, speaker dependency, environment conditions: In the tests, NaturallySpeaking was the most accurate product. On average it translated 80 per cent of the words correctly, and most users should attain accuracy levels of 87 to 95 per cent.
Price: $150, including the VXI Parrott 10-3 microphone headset.
Conclusions: NaturallySpeaking supports dictation into most Windows applications, and in Word or WordPerfect you can use natural-language commands. One of the biggest shortcomings of NaturallySpeaking, however, is that unlike ViaVoice and Voice Xpress, it does not allow you to control Word 97's pull-down menus. (Test performed by PC Magazine, 12 November 1999.)

37 www.naturallyspeaking.com


3.2.4 General conclusions
The ultimate test of any speech recognition system is its performance in use. Other aspects may, however, also be considered: how well it learns over time, how well the recognition accuracy can be boosted with continued tweaking, and how fully it can be customised to the individual user's needs. Although each product initially recognises about 80 % of spoken words, each user must spend a tedious half hour to several hours reading passages of text to the computer to train the software. NaturallySpeaking relieves some of the tedium by offering interesting read-back text from books by Arthur C. Clarke and Dave Barry. During my initial use, I made corrections to almost every sentence. As the products remember corrections and additions, the need for corrections declines rapidly until, after about 40 hours of use (an estimated month's worth of dictation), the recognition rate plateaus and corrections are minimal.

The major difference between the products lies in correcting mistakes. NaturallySpeaking lets users select replacement words from a list or mark the words to correct using the keyboard or voice; ViaVoice requires that the user select the words via the keyboard. If typing skills are adequate, either program works well; if you are keyboard-averse, NaturallySpeaking has a pronounced advantage. However, IBM can dictate text directly into its notepad, and the text can then be pasted into any program, or you can dictate directly into Microsoft Word; NaturallySpeaking offers only the notepad approach. ViaVoice also offers a text-to-speech option and better support for multiple users. But you cannot fully command the computer by speech with either product. Additionally, both products have substantial system requirements: 133-MHz (NaturallySpeaking) or 150-MHz (ViaVoice) Pentium computers with 32M bytes of RAM for Windows 95 (or 48M bytes for Windows NT 4.0), plus a 16-bit or greater SoundBlaster-compatible soundboard and about 50M bytes of disk space. Improved versions are scheduled for release later this year.

Examples of what the different software did with a news story38: You might scratch your head in amazement when you see what ViaVoice (centre) and NaturallySpeaking (bottom) can do to a news story (top).

38 This example is found online at: http://www2.computerworld.com/home/online9697.nsf/All/970915screen_shot


The future of speech recognition
From time to time the press (popular science magazines, periodicals, and newspapers) publishes articles about revolutionary inventions, like the Berger-Liaw Neural Network Speaker-Independent Speech Recognition System (BLNNSISRS)39. However, to the author's knowledge the presented information has not yet been verified by other research institutions or independent reliable sources, and it must therefore be treated with caution. If the information turns out to be true, the BLNNSISRS has a great potential for enhancing speech recognition systems.

Researchers at the University of Southern California (USC) claim to have created the world's first machine system that can recognise spoken words better than humans can, a stated achievement attributed to a fundamental rethinking of long underperforming computer architecture. According to USC, in benchmark testing using just a few spoken words, the neural network approach bested all existing computer speech recognition systems and outperformed human ears. Neural networks mimic the way brains process information, and speaker-independent systems can recognise a word no matter who or what pronounces it. Theodore W. Berger, Ph.D., at USC claims that the system can distinguish words in vast amounts of random "white" noise, with noise amplitudes 1,000 times the strength of the target auditory signal; human listeners can deal with only a fraction as much. Furthermore, he claims that the system can filter out words from the background clutter of other voices, the hubbub heard in bus stations, theatre lobbies and cocktail parties, for example.

39 http://www.usc.edu/ext-relations/news_service/real/real_video.html


Even the best existing systems fail completely when as little as 10 per cent hubbub masks a speaker's voice, and at slightly higher noise levels the likelihood that a human listener can identify spoken test words is mere chance. By contrast, Berger and Liaw's system is claimed to give 60 per cent recognition with a hubbub level 560 times the strength of the target stimulus. USC also claims that, with just a minor adjustment, the system can identify different speakers of the same word with superhuman acuity.

First proposed in the 1940s and the subject of intensive research in the '80s and early '90s, neural nets are software configured to imitate the brain's system of information processing, wherein data are structured not by a central processing unit but by an interlinked network of simple units called neurones. Rather than being programmed, neural nets learn to do tasks through a training regimen in which desired responses to stimuli are reinforced and unwanted ones are not. "Even large nets with more than 1,000 neurones and 10,000 interconnections have shown lacklustre results compared with theoretical capabilities. Deficiencies were often laid to the fact that even 1,000-neuron networks are tiny, compared with the millions or billions of neurones in biological systems" (Nelson & Wan, 1998). Remarkably, USC's neural net system uses an architecture consisting of just 11 neurones connected by a mere 30 links. Berger and Liaw's computer neurones were combined into a small neural network using standard architecture. While all the neurones shared the same hippocampus-mimicking general characteristics, each was randomly given slightly different individual characteristics, in much the same way that individual hippocampus neurones have slightly different individual characteristics.
More information on neural speech enhancement: http://ee.ogi.edu/~ericwan/NSEL/
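As a toy illustration of the general neural-network principle described above (weights adjusted from examples rather than explicit programming), the sketch below trains a tiny two-layer network on the XOR problem. It has no relation to the Berger-Liaw system or to speech data; it only shows the training-regimen idea.

```python
# Toy illustration of the neural-network training principle described above: the network
# learns a task by having its connection weights nudged towards desired responses.
# This is NOT the Berger-Liaw system; just a tiny two-layer network trained on XOR.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)       # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)       # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(20000):                               # training regimen: reinforce desired responses
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    err = out - y                                    # how far the response is from the desired one
    grad_out = err * out * (1 - out)
    grad_h = (grad_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ grad_out
    b2 -= 0.5 * grad_out.sum(axis=0)
    W1 -= 0.5 * X.T @ grad_h
    b1 -= 0.5 * grad_h.sum(axis=0)

print(np.round(out.ravel(), 2))                      # should approach [0, 1, 1, 0]
```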

4 SPEECH DEMONSTRATOR

Since NAUTICUS' API40 was not known to us and due to resource shortage, a NAUTICUS imitation was made in Visual Basic. This speech demonstrator mimics the same screen interface as the most relevant parts of NAUTICUS SIO. The purpose was to get an indication of the performance of a speech-controlled application. Figure 22 compares the existing NAUTICUS data entry structure with the demonstrator interface design. The demonstrator has the same layout as NAUTICUS, and results obtained from it may therefore be comparable. The characteristics of the demonstrator are that it:
1. Is easy to realise.
2. Makes it possible to demonstrate dialogue principles.
3. Gives an indication of the feasibility.
The demonstrator will, however, give no experience with integration with NAUTICUS. The actual programming work was done in co-operation with SINTEF41.

40 Application Program Interface 41 The Foundation of Scientific and Industrial Research at the Norwegian Institute of Technology


A fully integrated speech recogniser in NAUTICUS will not have as its primary goal to speed up the actual survey, but to remove double work, to allow data entry on site and even to permit hands-free work.

4.1 Set-up
The demonstrator was built from:
• IBM ViaVoice SDK (Software Development Kit) for Windows.
• Visual Basic 5.0 (for the programming).
The ViaVoice recognition software was chosen since it is freely downloadable for trial purposes, and Visual Basic 5.0 is a well-suited tool for development and demonstration purposes. The demonstrator mimics only the part of NAUTICUS SIO that deals with data entry. A copy of the speech demonstrator, along with a short user's manual, is enclosed in Appendix A. The reader may experiment with the software to test its capabilities and quality, and how well it is suited for speech-based data input. The demonstrator was designed such that it imitates only the relevant fields of the NAUTICUS program currently used for data entry, see Figure 21. Data entry is typically done via check-boxes and radio buttons giving the status of the inspected item, and via free-text areas. The demonstrator allows:
• Navigation.
• Conclusion.
• Free text.
It is hoped that the demonstrator gives an indication of how a voice-based inspection software might act in a hands-free "NAUTICUS mode". Furthermore, the software was not trained to recognise the speaker individually, i.e. better performance can be expected when the software is trained.

Design of the demonstrator
The idea was to use IBM's ViaVoice directly to enter information and, if possible, for navigation in NAUTICUS SIO. However, as described in Figure 20, the NAUTICUS APIs were not known to us and, due to resource shortage, this could not be achieved. The speech demonstrator has three modes; the main reason for having modes is to reduce the number of available commands in each situation, thus making the speech recognition faster and more accurate. The modes correspond well with the task execution as usually performed by surveyors. The three modes are:
1. Selection mode.
2. Conclusion mode.
3. Dictation mode.
A minimal sketch of this mode-switching logic is given after the mode descriptions below.

Changing modes
To switch from selection mode to conclusions mode, the command "Conclusion" must be given; as a response, the tab folder labelled "Conclusions" is chosen. To switch from conclusions mode back to selection mode, the command "Select" must be given; there is no visual feedback from this command. To switch from conclusions mode to dictation mode, the command "Add memo" must be given; as a response, keyboard focus is moved to the Memo edit control. To switch from dictation mode back to conclusions mode, the command "Cancel memo" or "Save memo" must be given; there is no visual feedback from this command.


Selection mode
In selection mode, the user may select the different parts of the tree representing the chosen survey scope. It is also possible to expand and collapse the chosen branch of the tree. The commands used relate to the following concepts: a survey scope consists of a set of surveys; each survey concerns a set of systems that are inspected; connected to each system there are a number of items that are controlled. In NAUTICUS there may also be sub-items (or sub-systems), but this is not implemented in the speech demonstrator. In selection mode, the user may give voice commands that select different surveys, systems, and items. In conclusions mode, the user may give voice commands that set the status of the chosen survey, system, or item. In dictation mode, the user may dictate any free text. At start-up, selection mode is active, with the survey scope node selected and collapsed. Navigation inside the demonstrator consists of a few predefined commands; the user may issue the following:
• Next survey.
• Previous survey.
• This survey.
• Next system.
• Previous system.
• This system.
• Next item.
• Previous item.
• Expand.
• Collapse.
• Conclusions.

Conclusion mode
In conclusions mode, the user may give commands for choosing among the four radio button values:
1. Found in order.
2. Found not in order.
3. Not applicable.
4. Not inspected.

Dictation mode
In dictation mode, the user may dictate text in plain English, or give the commands "Cancel memo" or "Save memo" to go back to conclusions mode. In addition, the dictation mechanism in ViaVoice includes some generic commands, like "New line", "New paragraph", "Comma", and "Full stop".
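The mode-switching logic described above can be sketched as follows. The real demonstrator was implemented in Visual Basic 5.0 on top of the IBM ViaVoice SDK; the sketch below is only a language-neutral illustration of how restricting the active command set per mode works, with the command names taken from the mode descriptions above.

```python
# Minimal sketch of the demonstrator's mode-switching logic (illustrative only; the
# actual demonstrator was written in Visual Basic 5.0 with the IBM ViaVoice SDK).

# Commands accepted in each mode (taken from the mode descriptions above).
COMMANDS = {
    "selection":  {"next survey", "previous survey", "this survey", "next system",
                   "previous system", "this system", "next item", "previous item",
                   "expand", "collapse", "conclusion"},
    "conclusion": {"found in order", "found not in order", "not applicable",
                   "not inspected", "select", "add memo"},
    "dictation":  {"cancel memo", "save memo"},       # anything else is treated as free text
}

# Mode transitions triggered by specific commands.
TRANSITIONS = {
    ("selection", "conclusion"): "conclusion",
    ("conclusion", "select"): "selection",
    ("conclusion", "add memo"): "dictation",
    ("dictation", "cancel memo"): "conclusion",
    ("dictation", "save memo"): "conclusion",
}

def handle_utterance(mode, utterance):
    """Return the new mode and a description of the action taken."""
    text = utterance.strip().lower()
    if mode == "dictation" and text not in COMMANDS["dictation"]:
        return mode, f"append to memo: {utterance!r}"
    if text not in COMMANDS[mode]:
        return mode, "ignored (command not active in this mode)"
    new_mode = TRANSITIONS.get((mode, text), mode)
    return new_mode, f"execute {text!r}"

mode = "selection"                                    # at start-up, selection mode is active
for spoken in ["next system", "conclusion", "found in order", "add memo", "rudder ok", "save memo"]:
    mode, action = handle_utterance(mode, spoken)
    print(mode, "-", action)
```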


Figure 20: Communication between NAUTICUS SIO and ViaVoice requires that the APIs are known. ViaVoice can communicate with standard MS Windows applications; non-standard applications like NAUTICUS can be interacted with based on their APIs, and this interaction has to be tailor-made using the SDK software. Unfortunately, the APIs of NAUTICUS were not known to us, nor were the necessary resources available to deal with it. Therefore, an imitator with roughly the same screen interface as NAUTICUS, and with known APIs, was programmed in Visual Basic.

[Figure 21 annotations: mouse activated; Selection mode: expand/collapse, next/this/previous survey, next/previous item; Conclusion mode: ok (in order), not ok (not in order); Memo mode: free text input]

Figure 21: The current NAUTICUS SIO interface; it is mainly based on mouse interaction.


Figure 22: Screen dump of the demonstrator interface mimicking the real NAUTICUS SIO layout. Experiences gained from it may therefore be transferable to the real NAUTICUS.

4.2 Experiences gained from the speech demonstrator
The demonstrator itself is easy to use; it has only a few control commands, and the interface is a good imitation of the actual screen in NAUTICUS SIO. Results obtained from the demonstrator might therefore be transferable to NAUTICUS SIO. Experience shows that successful speech recognition requires a disciplined user and a training phase. The way of talking affects the recognition rate: if the user is unskilled in dictation, the recognition accuracy degrades. It became obvious that ending a sentence with "full stop" improves the recognition, because the recognition processor's internal language model deals better with complete sentences than with "open" sentences. The recognition speed and rate when navigating in the tree structure (within the three modes) are as desired, but the speed and rate in free-text mode are not satisfactory, see Figure 23. By reducing the vocabulary size, the recognition speed could be increased to a satisfactory level. The demonstrator (and ViaVoice in general) was very sensitive to noise and to the speaking mode of the user. It became obvious that speech recognition in challenging environments will not be feasible for a long time, and the tests showed clearly that a back-up system is absolutely necessary when using speech as the primary input.

Recognition enhancing measures
During the thesis, a number of existing survey reports were analysed with respect to shipping terminology, see Appendix B. Such words are usually not found in the vocabulary of standard speech recognition software. By adding these words to the system vocabulary, the recognition rate can be improved.


Difficulties will arise when implementing a voice controlled ship inspection system, simply because DNV today has employees speaking 80 different languages. Even though the working language is English, there will be great variation in pronunciation. This will further decrease the already poor recognition accuracy in dictation mode, where a recognition rate of approximately 90 % is desired; the user should be able to recognise what he or she has dictated. To achieve higher recognition rates, the software depends entirely on a training session for each individual user. The training sessions found in today's speech recognition software are time consuming, and a faster session to distinguish between users should be implemented. To reduce speaker dependencies, the vocabulary should contain as few words as possible. The implementation of a speech-based reporting system must be 100 % functional, otherwise it will quickly be rejected. We may therefore conclude that today's speech recognition systems do not perform satisfactorily in challenging environments. Enhancing the recognition refers to the basic approaches described in section 3.2.2.

Bottom survey in dock Some minor dents noted in different locations on side shell, memo for owner’s issued, see 3. Some bolts for the steering gear were found loose and the steering gear rotor was found at an angle (tilted)

Figure 23: Comparison of a text that is dictated into the demonstrator (left) and what the demonstrator recognised (right).


5 RECOMMENDATIONS AND FUTURE OUTLOOK

The most common microphone solution used for speech purposes today consists mainly of noise cancelling boom microphones (chapter 3.1). These microphones are, however, not even close to operating satisfactorily in really noisy surroundings. Much more focus should be placed on the development of new microphones and the enhancement of existing ones. In my opinion, a very promising approach is the skull-mounted microphone (chapter 3.1), since it has already proven to work satisfactorily in difficult ambient conditions in connection with VHF (in fire fighting and police work). One might therefore expect that such skull microphones would be applicable also in surveyor environments. The advantages of this type of microphone are obvious:
• It does not obstruct the user.
• It has full functionality in harsh ambient conditions.
• It is light weight.
• It has low power consumption.
• It has Ex features.
The main advantage might, however, lie in the fact that it is easily integrated in the helmet. The helmet would provide protection, some degree of noise shielding, and room to house additional hardware such as a pre-amplifier or a hardware-based noise filter. Another argument in favour of the skull microphone is the human factor: surveyors might not be pleased to look like an "alien astronaut", and this might increase the distance between a surveyor and the ship's crew. The development of PC-compatible skull-mounted microphones should therefore be a focus area.

Future speech-based software in ship inspection must support mobile work. Taking notes and comments must often be done while both hands are occupied, so speech-based input is an interesting option. If the spoken word is to be recognised on site, some kind of feedback of successful recognition may be necessary, either visual or audible. Alternatively, the speech could be recorded, e.g. as a wav file, for later machine processing or as a backup for later cross-checking of the recognised text. The system set-up described in chapter 2 does not specify where the speech recognition software is actually located. Small handheld or wearable devices are starting to be equipped with speech recognition systems. Alternatively, standard cellular phones could be used to dial into some powerful computer that does the number crunching. The phone would then act as microphone and feedback device (automated reading or displaying of recognised text). The recognised text could be saved on the remote machine, on the phone or on some connected (cabled or wireless) remote device.

The large number of current speech recognition products may nevertheless give a somewhat too optimistic view of the performance of speech recognition systems. Current speech recognition applications work reasonably well as long as the usage conditions match the data used to train the recognition system. Unfortunately, this does not always happen in real-world operating environments. Practical speech recognition systems never operate in stationary conditions, but


Practical speech recognition systems never operate in stationary conditions; they are usually surrounded by various interference sources, e.g. environment and speaker variability, that continuously change their characteristics. Speech recognition systems that can cope with all these variations typically have a restricted vocabulary size and a restricted number of speakers that the system can recognise.
It seems that surprisingly little attention has been paid to the hardware. In my opinion, the microphone is one of the weakest links in the speech recognition chain, and its potential does not appear to be fully exploited. If hardware could be developed that delivers a speech signal largely unaffected by noise or other disturbing influences, the requirements on signal-enhancing software, exceedingly long training sessions or highly developed speech recognition systems could be lowered. Ideally, the speech-delivering hardware (microphone or hardware filter), in combination with software-based signal-enhancement algorithms, should give the speech recogniser a signal that corresponds to office quality (low noise). In this way, the speech recognition system could concentrate on its primary task, the recognition of speech. The signal processing software should adapt to time-varying disturbances, whereas the hardware filter could deal with stationary noise sources.
Speech recognition should be kept casual, and should not require special user effort or intrusive equipment (Oberteuffer, 1999). Despite the advances in desktop dictation systems, the required training represents a psychological barrier. Disfluencies in the speech of non-native speakers are another barrier and are not yet handled well enough to give the user a meaningful outcome. Compared with the human ear, current speech recognition systems perform only a preliminary analysis of the speech signal. Most systems are insensitive to the phase of the speech signal, even though this is important information for human beings, allowing them to pick out individual speakers and to suppress high levels of background noise in e.g. a crowded room (the so-called cocktail party effect). A more detailed analysis of the speech wave might even eliminate the need for sophisticated microphones. As mentioned in chapter 3, one approach has been based on neural net technology for both phoneme determination and natural language processing; neural nets are closer to the structures and methods humans use for processing speech signals.
Presently, the traditional mouse-based and stationary desktop programs are simply "audified", ignoring the fact that speech-based input might require a different interface. No system based on speech alone can offer a natural, intuitive way of creating text documents. It is apparent that speech not only has great potential but also many severe limitations (chapter 2.3, backup solutions). A new interface should be designed that combines speech input with either visual or audible feedback mechanisms. Applications requiring well-formatted documents with accurate spelling, punctuation and capitalisation need speech-to-text systems that provide easy formatting and editing capability; in ship inspection there are many standard sentences and phrases, and some shorthand might be developed. In current speech recognition systems, editing usually involves keyboard and mouse; although in some applications it is possible to correct words, capitalise or move the cursor by speech, doing so is impractical and tedious.
An effective speech recognition system therefore needs an additional input device, and any computer with a graphical user interface needs display navigation capability. A successful introduction of speech recognition systems will also depend strongly on the "maturity" of the user. Converting everyday speech into poetry, technical reports or minutes of meetings requires disciplined use of the language, with complete sentences, correct grammar and without the all too common "eeh", "hmm", "shit" contaminations of the language. Educating the user in this way of speaking may be one of the greatest challenges. Introducing the above-mentioned shorthands for pre-defined sentences could lighten this task (a small example is sketched below), but then much of the freedom of speech-to-text will disappear, and the user might feel uncomfortable or forced to use expressions that are unnatural to him.
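The shorthand idea can be illustrated with a few lines of Python. The mapping below is purely hypothetical; the shorthand phrases and the expanded standard sentences are invented examples, not actual DNV survey wording.

    # Illustrative sketch: expanding spoken shorthand into pre-defined standard
    # survey sentences. Phrases and expansions are invented examples only.
    SHORTHAND = {
        "memo owner dents": "Some minor dents noted on side shell; memo for owner issued.",
        "steering bolts loose": "Some bolts for the steering gear were found loose.",
        "anchor ok": "Anchor and chain cable examined and found in order.",
    }

    def expand(recognised_phrase):
        """Return the standard sentence for a shorthand, or the phrase unchanged."""
        return SHORTHAND.get(recognised_phrase.lower().strip(), recognised_phrase)

    print(expand("Steering bolts loose"))   # prints the full standard sentence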


The success of speech recognition depends strongly on the mathematical methods used. These are, however, far from optimal, and considerable improvement may therefore be expected in the future. This improvement will come from more complex models benefiting from increased processor speed, but also from applying more noise-resistant recognition methods. In principle there are two complementary approaches: either adapting the model parameters to the noisy environment, or cleaning the speech data so that they can be recognised by a model trained on clean speech. As usual, the best strategy will be a middle way between the two. The following three methods are examples of these approaches (Bateman, 1992):
Using noise-resistant features and similarity measurements: This technique focuses on the effects of noise on the speech signal rather than on the removal of noise, and derives noise-resistant speech parameters. Although not the most successful approach, it has the significant advantage of being applicable to a wide range of noise, since it does not assume special noise characteristics.
Speech model compensation for noisy environments: The models (Hidden Markov Models, Artificial Neural Networks etc.) are trained with clean speech and are then transformed to a specific noisy environment. The shortcoming of this approach is that such a model is only operational in that specific noise environment, making the system too inflexible for general use.
Speech enhancement: This method represents the second way of obtaining noise resistance. Noisy speech is transformed to be as close to the training environment as possible. The pre-processing is supposed to recover the waveform or specific parameters of the clean speech, in the hope that the cleaned speech is then easier to recognise (a simple sketch of this idea is given below).
The future potential of speech recognition is promising.
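As a concrete illustration of the speech enhancement approach, the following Python/numpy sketch implements a very simple magnitude spectral subtraction. It assumes that the first few frames of the recording contain noise only; the frame length, overlap and spectral floor are illustrative choices rather than values taken from the cited literature.

    # Minimal sketch of speech enhancement by magnitude spectral subtraction.
    import numpy as np

    def spectral_subtraction(signal, frame_len=512, hop=256, noise_frames=5, floor=0.02):
        """Enhance a 1-D speech signal (assumed longer than frame_len samples)."""
        window = np.hanning(frame_len)
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                           for i in range(n_frames)])
        spectra = np.fft.rfft(frames, axis=1)
        magnitude, phase = np.abs(spectra), np.angle(spectra)

        # Estimate the stationary noise magnitude from the leading, speech-free frames.
        noise_mag = magnitude[:noise_frames].mean(axis=0)

        # Subtract the noise estimate, keeping a small spectral floor to avoid negatives.
        cleaned_mag = np.maximum(magnitude - noise_mag, floor * noise_mag)
        cleaned = np.fft.irfft(cleaned_mag * np.exp(1j * phase), n=frame_len, axis=1)

        # Overlap-add the enhanced frames back into one waveform.
        out = np.zeros(len(signal))
        for i in range(n_frames):
            out[i * hop:i * hop + frame_len] += cleaned[i]
        return out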

6 REFERENCES.

Allen, J. 1992: Overview of text-to-speech systems. In Furui, S. and Sondhi, M.M., editors, Advances in Speech Signal Processing, chapter 23. Marcel Dekker, Inc., New York.

Andersen, K.A. 1999: News in brief. Internal DNV magazine, Vol. 3, 1999.

Bateman, D.C., Bye, D.K. and Hunt, M.J. 1992: Spectral contrast normalisation and other techniques for speech recognition in noise. Proc. IEEE Internat. Conf. Acoust. Speech Signal Process., Vol. I, pp. 241-244, 1992.

Gran, S. 1992: Audible Editor-Audit keyboard interface. DNV report no: 92-2034.

Gran, S. 1994: Ocean Engineering Auto-Course. Det Norske Veritas research report no: 94-2022. Internet version: http://www.dnv.com/ocean/.

Grinberg, D., Lafferty, J. and Sleator, D. 1995: A robust parsing algorithm for link grammars. Technical report CMU-CS-95-125, Department of Computer Science, Carnegie Mellon University, 1995.

Grover, M. and Makovoz, D. 1999: Speech Technology Magazine, August edition. Internet address: http://www.speechtechmag.com.


Krauss, L. and Zuhlke, D. 1998: Pointing devices for machine controllers - investigation and result of cursor positioning with different pointing devices. University of Kaiserslautern internal report.

Kuo, F. 1966: Network Analysis and Synthesis, 2nd edition. John Wiley and Sons, N.Y.

Lyng, J. 1999: DNV. Internet address: http://www.dnv.com.

Madden, J. and Baldwin, T. 1997: Technology overview of mobile computers. Naval Education and Training Professional Development and Technology Center (NETPDTC), Orlando.

Mestl, T. and Lindgren, R. 1999: The future of ship inspection and reporting. Internal report, DNV.

Nelson, A. and Wan, E. 1998: Handbook of Neural Networks for Speech Processing (Shigeru Katagiri, editor). Artech House, Boston, USA, 1998 (in press).

Nilsson, E.G. 1999: Nauticus by voice. SINTEF Telecom and Informatics.

Oberteuffer, J.A. 1999: Development of effective speech-to-text systems. Speech Technology Magazine, pp. 23-28, October/November edition, 1999.

Oppenheim, A. and Schafer, R. 1975: Digital Signal Processing. Prentice Hall.

Oppenheim, A.V., Weinstein, E., Zangi, K., Feder, M. and Gauger, D. 1994: Single-Sensor Active Noise Cancellation. IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 4, April 1994.

Raghavan, P. 1998: Speaker and environment adaptation in continuous speech recognition. Computer Aids for Industrial Productivity (CAIP) Technical report CAIP-TR-227.

Stern, R.M., Acero, L. Oshima. 1996: Signal processing for robust speech recognition. In C.-H. Lee and F. Soong, editors. Kluwer Academic Publishers, Boston, pp. 351-378.

Stern, R.M., Raj, B. and Moreno, P. 1997: Compensation for environmental degradation in automatic speech recognition. Kluwer Academic Publishers, Boston, MA, 1997.

Viikki, O. 1999: Adaptive methods for robust speech recognition. Tampere University of Technology (TEK) publication 257.


APPENDICES

Appendix A Introduction to Det Norske Veritas (DNV).
Appendix B Brief user guide of the speech demonstrator.
Appendix C List of words found in shipping.
Appendix D Glossary of terms.


A INTRODUCTION TO DET NORSKE VERITAS (DNV)
Established in 1864, Det Norske Veritas is an independent foundation working with the objective of 'safeguarding life, property and the environment'. DNV has a total of 5,500 employees and comprises a network of 300 exclusive offices in 100 countries. DNV's head office is in Oslo, Norway. DNV is a leading provider of safety and reliability services, where classification, certification, verification and advisory services are key activities. The staff consists mainly of highly qualified engineers and technical personnel. The Society is authorised to act on behalf of some 110 national maritime authorities.
As of 1 June 1999, DNV has a 22.2 percent market share of the oil tankers world-wide and 19.6 percent of the newbuildings. Of the world's bulk carriers, 10.6 percent are classed with DNV, and 7.7 percent of the newbuildings are built to DNV class. With regard to container carriers, DNV has 2 percent of the ships in operation and 1.8 percent of the newbuildings. In addition, some 120 drilling and service rigs are classed with DNV.
DNV establishes rules and guidelines for the classification of ships, mobile offshore platforms and other floating marine structures. It also issues rules and standards for the classification, certification and verification of fixed offshore structures. DNV is the leading classification society in certification of safety and quality management systems for shipping companies, based on DNV's SEP rules, the ISM code and ISO 9000. DNV is also assigned work on more than 500 process plants world-wide. DNV is accredited by 14 countries to certify quality assurance systems according to the ISO 9000 standards, and has so far certified close to 22,000 companies. DNV's Safety Rating System is used at over 6,000 industrial sites throughout the world.
DNV provides certification of management systems, products and personnel to the land-based and offshore industries. Within accredited quality system certification (ISO 9000/BS 5750), DNV is among the world-wide market leaders with a market share of some six percent. DNV is also a leading provider of advisory services within safety, environmental and quality management, as well as a specialist in technical services and software. Det Norske Veritas provides safety, quality and reliability services to the world's offshore and process industries, with major markets in the United States, Europe and Asia. DNV is also active in the aerospace and aviation industries. It has extensive research and development facilities, with laboratories in Norway, the Netherlands, Singapore, Fujairah and the US.
The US Coast Guard and Tokyo MOU statistics, issued most recently in 1998, rated DNV as the best of all classification societies when it comes to Port State detentions. DNV has highly qualified surveyors working from an extensive world-wide network of offices strategically placed at shipping centres, enabling quick and efficient service. Local managers are technically competent and authorised to make most decisions on site, avoiding unnecessary delays or confusion.


B BRIEF USER GUIDE OF THE SPEECH DEMONSTRATOR
Installation
In order to run the NAUTICUS speech demonstrator, the ViaVoice run-time (RT) and the ViaVoice SDK tools must be installed. This requires approximately 200 MB of free disk space on the PC. To install the ViaVoice RT, the installation file "rtduk.exe" (on the installation CD42) must be run (follow the instructions given by the installation program). When the RT is installed, the ViaVoice SDK must be installed by running the file "vvsdk15.exe" (on the installation CD; again, follow the instructions given by the installation program). Note that the ViaVoice RT must be installed before the ViaVoice SDK. Lastly, copy the files "ScopeTree.exe" and "scope.txt" to an arbitrary directory (or run the demonstrator from the CD). The file scope.txt may be altered (at own risk) to change the data in the Survey Scope tree, but make sure not to change the numbering scheme.
Starting the demonstrator
To run the demonstrator, execute the ScopeTree.exe file, either from the installation CD or from the directory to which it was copied.
Modes
The system has three modes43:
1. Selection mode.
2. Conclusions mode.
3. Dictation mode.
In selection mode, the user may give voice commands that select different surveys, systems and items. In conclusions mode, the user may give voice commands that set the status of the chosen survey, system or item. In dictation mode, the user may dictate any free text. At start-up, selection mode is active, with the survey scope node selected and collapsed.
Changing modes
To switch from selection mode to conclusions mode, the command "Conclusion" must be given. In response, the tab folder labelled "Conclusions" is chosen. To switch from conclusions mode back to selection mode, the command "Select" must be given. There is no visual feedback from this command. To switch from conclusions mode to dictation mode, the command "Add memo" must be given. In response, keyboard focus is moved to the Memo edit control. To switch from dictation mode back to conclusions mode, the command "Cancel memo" or "Save memo" must be given. There is no visual feedback from this command.
Commands in Selection mode
In selection mode, the user may select the different parts of the tree representing the chosen survey scope. It is also possible to expand and collapse the chosen branch of the tree.

42 If running the installation file for the ViaVoice RT from the CD does not work (aborts with an error message), the installation file must be copied to a hard disk and run from there (an error message may still occur, but the installation is performed). To save disk space, the installation file ("rtduk.exe") should be removed from the hard disk after the RT has been successfully installed.
43 The main reason for having modes is to reduce the number of available commands in different situations, thus making the speech recognition faster and more accurate. The modes correspond well with the task execution as usually performed by surveyors.


The commands relate to the following concepts: a survey scope consists of a set of surveys; each survey concerns a set of systems that are inspected; connected to each system, there are a number of items that are controlled. In Nauticus, there may also be sub-items (or sub-systems?), but this is not implemented in the speech demonstrator.

The user may issue the following predefined commands:
• Next survey.
• Previous survey.
• This survey.
• Next system.
• Previous system.
• This system.
• Next item.
• Previous item.
• Expand.
• Collapse.
• Conclusions.
All these commands are orthogonal to the chosen tree node; e.g. the command "next survey" will select the next survey (if any) regardless of whether the survey scope (the root node), a system or an item is selected. Note that there is no command for selecting the survey scope node. In addition, the user may choose an arbitrary node in the survey tree by saying the text connected to it. The node is selected as soon as a unique phrase is uttered. If the survey tree is small, or contains varied text, the first word may well be enough to identify a node. As a visual aid in cases of ambiguous texts, the available "next words" are shown in a list box when an ambiguous word or phrase is given.
Commands in Conclusions mode
In conclusions mode, the user may give commands for choosing among the four radio button values:
1. Found in order
2. Found not in order
3. Not applicable
4. Not inspected
If radio button 2 is selected, a check box (repaired/rectified) is enabled. Such sets are available for all nodes in the survey scope tree except for the root. In Nauticus, this is slightly different: if the user sets these values for a system in Nauticus, the value is propagated to all items in that system (the same for surveys?). This mechanism is not implemented in the speech demonstrator. Each of the radio buttons and the check box (if active) may be chosen by a set of voice commands (in addition to the texts themselves):
• Found in order: "In order", "OK".
• Found not in order: "Not in order", "Failed to pass".
• Repaired/rectified (either of the words may be said): "Not so important".
• Not applicable: "N.A.", "Not relevant".
• Not inspected: "Inspect later", "Do it tomorrow".
In addition, the command "Select" must be given to change back to selection mode. This is easy to forget44!

44 There are good reasons for an automatic mode change to selection mode as soon as a legal conclusion mode command is recognised.


Functionality in Dictation mode
In dictation mode, the user may dictate a text in plain English, or give the commands "Cancel memo" or "Save memo" to go back to conclusions mode. In addition, the dictation mechanism in ViaVoice includes some generic commands, like "New line", "New paragraph", "Comma" and "Full stop". While dictating, full sentences ending with "Full stop" give the best recognition results. In all modes, mouse and keyboard may to some extent be used in combination with voice commands, but care should be taken, especially in dictation mode, where keyboard focus should not be moved outside the memo edit control. Furthermore, it is not recommended to use mouse/keyboard navigation in the survey tree while in modes other than selection mode; this may even cause the program to crash. (Nilsson, 1999)
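The mode logic described above can be summarised in a few lines of code. The sketch below is not taken from the ViaVoice-based demonstrator itself; it only illustrates, with hypothetical Python names, how the three modes, the mode-switching commands and the conclusion command synonyms relate to each other.

    # Illustrative sketch of the demonstrator's three modes and voice commands;
    # not the actual ViaVoice/ScopeTree implementation.
    MODE_SWITCHES = {
        "selection":   {"conclusions": "conclusions"},
        "conclusions": {"select": "selection", "add memo": "dictation"},
        "dictation":   {"cancel memo": "conclusions", "save memo": "conclusions"},
    }

    CONCLUSION_COMMANDS = {                  # synonyms accepted in conclusions mode
        "in order": "Found in order", "ok": "Found in order",
        "not in order": "Found not in order", "failed to pass": "Found not in order",
        "n.a.": "Not applicable", "not relevant": "Not applicable",
        "inspect later": "Not inspected", "do it tomorrow": "Not inspected",
    }

    def next_mode(mode, command):
        """Return the new mode after a recognised command (unchanged if not a switch)."""
        return MODE_SWITCHES[mode].get(command.lower(), mode)

    mode = "selection"
    mode = next_mode(mode, "Conclusions")    # -> "conclusions"
    print(CONCLUSION_COMMANDS["ok"])         # -> "Found in order"
    mode = next_mode(mode, "Add memo")       # -> "dictation"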


C LIST OF WORDS FOUND IN SHIPPING

Acceptable Access Accommodation Accumulator Acetylene Action Add checklist Add occasional survey Additional Additional report Adjacent Afloat Aft Agent Aggregate Ahead Air-pipe Alarm Align Amendment Amended Analysis Anchor And Angle Annex Annual Anodes Apparatus Applicable Applicator Applied Approve Area Arm Arrangement Assembled Assignment Astern Atmosphere Attached Audible Authorisation Automatic Automatic survey Auxiliary Ballast Band report Bars Battery Beacon Bearing Bed-plate Belt Between Biennial Bilge Black Blades Blank Blend Blending Blower Boatswain Boiler Boiler-casing Bolt Bolthole Book Booster Bottle Bottom Boundary Buoy Bow Bow-port Box Bracket Breaker Bridge Brine Bronze Browsers Buckled Build Bulbous Bulk Bulkhead Bulwark Burner Cabin Cable Calibrate Camshaft Cancel Captain Cargo Carrier Carry Casing Castle CC CC Ceiling Cellulose-nitrate Centre Centrifugal Certificate Chaffing Chain Chainlocker Channel Charger Charts Checked Chemical Chief Chief-engineer Chock Cinematograph Circulation Clamp Clamping Class Class status Classification Clean Clearance Cleating Closable Close Close Closed Close-up Closure Clutch Coated Coating Cocks Code Cofferdam Coil Combination Combustible Combustion Commenced Commencement Communication Companionway Compass Complete Completion Comply Component Compressor Computer Concept Conclusions Concrete Condensate Condenser Condition Condition of Class Connection Considered Construction Content Continuos


Control Convention Converter Cooler Cooling Corridor Corrosion Country Coupling Cover Cover letter Crack Crank Cranckpin Crankshaft Crew Crew Cropped Cylinder Damage Damages Damp Damper Dangerous Date Dat Davit Daylight Dead Dead-end Deadlight Deck Deckhouse Declaration Deepwell Deficiencies Delete Deletion Derated Det Norske Veritas Detector Detention Deterioration Deviation Device Diesel Direction Dirty Discharge Disclosed Disconnected Dismantled Displacement Distribution Division Dock Document Dome Donkey-boiler Door properties Double Down Downcomers Drain Drainage Draught Drawing Drill Drip Drive Dry Dry-dock DTP Duct Due Due date Dump Echo Effect Ejector Electric Electronic Embark Emergency Employed Empty Empty quick report Emulsifiable Enclosures Endorsement Engine Engineer Enhanced EO EPIRB Equipment Escape Escapeways ESP Evaporator Event Examined Exception Exhaust Existing Expansion Expiry Exposed Expansion Extended Extinguish Failure Fair Fairlead Fall Fan Favourites Feed Fill Filter Final recording Findings Fired Fireman Firepump Fitted Fitting Fittings Fixed Flag Flame Flammable Flap Flare Flashlight Flexible Flooding Floors Fluid Foam Following Fore Form Forward Found Foundation Frame Free Freeboard Freeing Frequency Fresh Fuel Full Funnel Fuse Galley Gangway Gases Gasket Gauge Gear General Generator Girder Gland Glass Glycol GMDSS Golten Good Gouged Governing Gravity Grease Grid Gross Guard-rail Guide Gyro Halon Harbour Hard Hatch Hatchcoaming Hatchcover Head Header


Heading 40.9a Heater Heavy High Historical Hoist Hold Housing Homogenizer Hose House Hydrocarbon Hull Hull survey report Hydraulic IMO Hydrostatic Hyperbaric Immersion Incidents Importance Including Inboard Indication Incinerator Inergen Indentation Inert-gas Induced Information Inert Inner Inflatable Installation Inlet Instrument Install Instruction Insulate Insulation Intake Interface Interior Interlock Intermediate Internal memos Internals In-water IOPP Isolation Isotope Issue Issue Issue place Item Job status Job templates Jobs Journal Keel Kit Laboratory Ladder Laying Leak Leakage Length Level Liability Lifeboats Lifebouys Lifejacket Lifeline Lift Light Lighting Limit Line Liner Lip Liquid Load Local Location Locker Lockers Log Log Longitudinal Loose Loss Low Lower Low-low Lubrication Machined Machinery Main Maintained Maintenance Maiwheel Major Make-up Management Manager Manifoil Manning Manometer Manoeuvring Manual Mapping Mark Marker Mast Master Mate Material Measurements Mechanic Medium Megger Members Memo Memo to Owner Memo to Surveyor Mess Middle Minor Mixture MO Mobile Modified Monitoring Mooring More Moveable Mover Musterlist Name Nautical NAUTICUS Navigation Necessary Non Newbuildings NIS No Notice Non-explosion Non-return Notation Occasional surveys Nozzle Nut Obstruction Official Offshore Officer Officer Oily water separator OK Oil Oilfired Ongoing jobs Open Onboard Ongoing Order confirmation Order number Operable form Order Out Outboard Original Other Overall Overboard Outlet Outside Own unit Owner Overhaul Overrule Page Paint Owner


Oxygen Panels Part Palm Pan Passenger Peak Partial Particular Penetration Pennant Pedestal Pellet Permanent Petrol Periodical Periodical survey Pilot Pinion Photograph Pickup Piston Pitch Pintles Pipe Planned Plant Place Place PMS Point Platform Plating Poor Poorly Pollution Poop Position Post survey Port documents Portable Powder Power Postpone Postponed Preheater Preparations Practice Pre survey Primary Prime Pressure documents Prevention Procedures Process Program Propeller Protect Protection Provided Pulley Pump Pumproom Purifier Quick Quick recording Quick report Radar Radio Radiotelephone Raft Rails RCH Readily Realigned Receptacles Recharging Receiver Record Reduction Reefer Refilled Refrigeration Registration Regulator Release Remote Remove checklist Removed Renew Renewal Repair Report Representative Requested Requirement Rescue Rescueboat Resisting Restricted Result Retaining Return Reverse Reversing Rigged Ro/ro Rocker Rocket Rod Roller Room Rope Rope Rotate Rotor Round Rudder Running Safety Salinometer Sandblasted Sanitary Satisfaction Scavenger Scope Screen Scrubber Scupper Scuppers Sea Seal Sealing Search Secured See Segregation Select scope Select vessel Selected Self Selfclosing Semiportable Sensor Sent jobs Separated Separator Serious Servo Set completed Set survey scope Severit Shaft Shaft Shaftline Shape Shell Shelter Shield Shielding Ship Shipyard Short Shut Shutdown Shutter Side Sidescuttles Signal Signature Signboard Sills Single SIO SIO survey job Sketch Skimmed Skylight Slide Slipway Slop Slow


Sludge Smokesignal Soft SOLAS Soot Sounder Sounding Source Spaces Spanner Spare Sparks Special Spell check Spills Spot Spray Spraying Spring Sprinkler Stability Staging Stairway Stamp Standard Standard features Starboard Start Station Station code Statutory Steam Steam-steam Steel Steering Stem Stern Sterntube Stiffeners Stock Stop Stopper Stopping Store Stowage Stringers Stripping Strong Strongpoint Structural Submitted Substantial Suction Suit Supercharger Superstructure Supply Support Surface Survey Survey checklist Survey checklist with Survey job completed Survey job recording information Survey job started Survey report Survey report 40.9a Survey report owners Surveyor signature Switch Switchboard Switchover System Systematic Systematic Tailshaft Tank Tanker Technical Teeth Telegraph Temperature Temporary Term Test Text Thermal Thermometer Thickness Thordon Thrust Tie Tiebolt Tier Tightness Tilted Timber Time Tonnage Tools Top Torch Tow Towing Track Transducer Transfer Transverse Trap Traps Trunk Trustshaft Tube Tunnel Turbo Type UHF Ullage Ultrasonic Unacceptable Unapproved Underneath Undersigned Underwater Undo Unit Upper UTM Vacuum Valid Valve Vapours Varnishes Vehicle Ventilation Ventilator Verified Vessel ID Vessel name VHF Vibrate View View details View flag info View scope View survey plan Visual Void Waiving Walk Wall Warranty Watertight Waste Wasted Water Weld Way Weather Web Where Well Wheel Wheelhouse Wing Widely Wind Window Workshop Wiring With Work Yard Yard name Yard vessel number


D GLOSSARY OF TERMS

Active Matrix Displays: A type of flat-panel display in which the screen is refreshed more frequently than in conventional passive-matrix displays. The most common type of active-matrix display is based on a technology known as TFT (thin film transistor). The two terms, active matrix and TFT, are often used interchangeably.

API (Application Program Interface): The interface (calling conventions) by which an application program accesses operating system and other services. An API is defined at source code level and provides a level of abstraction between the application and the kernel (or other privileged utilities) to ensure the portability of the code. An API can also provide an interface between a high level language and lower level utilities and services which were written without consideration for the calling conventions supported by compiled languages. In this case, the API's main task may be the translation of parameter lists from one format to another and the interpretation of call-by-value and call-by-reference arguments in one or both directions.

Ambient Noise: The prevailing sound field in a room in the absence of an applied signal from a loudspeaker, musical instrument, or other sound source.

Bandwidth: The amount of data that can be transmitted in a fixed amount of time. For digital devices, the bandwidth is usually expressed in bits or Bytes Per Second (BPS). For analogue devices, the bandwidth is expressed in cycles per second, or Hertz (Hz).

Bi-directional microphone: A microphone that is equally sensitive to sounds arriving from the front and back, and insensitive to sounds arriving from the sides.

BIOS: Pronounced "bye-ose", an acronym for basic input/output system. The BIOS is built-in software usually placed on a ROM chip that determines what a computer can do without accessing programs from a disk. On PCs, the BIOS contains all the codes required to control the keyboard, display screen, disk drives, serial communications, and a number of miscellaneous functions.

Cardioid Microphone: A unidirectional microphone with 6 dB of attenuation at the sides (±90 degrees) and a null at 180 degrees. So called due to the cardioid-like shape of its polar pattern. In a few words, it picks up more sound from the front than from anywhere else.

Central Processing Unit (CPU): The CPU is the "brains" of the computer. Sometimes referred to simply as the processor or central processor, the CPU is where most calculations take place. In terms of computing power, the CPU is the most important element of a computer system.

Continuous speech recognition: A vocal-to-digital translation system with heightened capabilities; unlike standard speech recognition systems, it can interpret words spoken in a natural cadence and within several contexts.

dB (Decibel): One dB is the smallest change in loudness that the average human ear can detect. 0 dB is the threshold of human hearing. The threshold of pain is between 120 and 130 dB. The decibel is a ratio, not an absolute number, and is used to identify the relationship between true power, voltage, and sound pressure levels. Decibels alone have no specific meaning. For example, dBV is a voltage ratio; 0 dB = 0.775 V root-mean square (RMS). dBSPL is the sound-pressure level ratio; it measures acoustic pressure. dBM is a power ratio. dBA takes into account the unequal sensitivity of the ear, and sound-pressure level is measured through a circuit that compensates for this (equal loudness).

Diaphragm: The moving element of a microphone that converts sound-wave energy into mechanical energy.

Digital Signal Processing (DSP): Refers to manipulating analogue information, such as sound or photographs, that has been converted into a digital form. DSP also implies the use of a data compression technique.

Dynamic Microphone: Any microphone whose output is a function of magnetic induction in a voice coil, ribbon, or other conductor moving within a permanent magnetic field.

Dynamic Time Warping (DTW): Dynamic time warping is a framework for algorithms that can resolve a large number of optimisation problems, such as character (OCR) and speech recognition, or picture modification software.

Ethernet: A local-area network (LAN) protocol developed by Xerox Corporation in co-operation with DEC and Intel in 1976. Ethernet uses a bus topology and supports data transfer rates of 10 MBPS. The Ethernet specification served as the basis for the IEEE 802.3 standard, which specifies the physical and lower software layers. Ethernet employs a media access control mechanism called CSMA/CD (Carrier Sense Multiple Access with Collision Detection) and employs a "bus" topology using coaxial cable operating at 10 MBPS. There is no central controller and all devices have equal status. With CSMA/CD, each device, which is sending or receiving, has the entire line capacity to itself. Ethernet allows a device to transmit at any time, first having listened to ensure that the line is not already busy. It is still possible that two devices can start transmitting simultaneously, so Ethernet devices are equipped with "Collision Detection". If a device detects a collision, it simply stops, listens again, and re-transmits when the line is free. The CSMA/CD access method is one of the most widely implemented LAN standards. A new version of Ethernet, called 100BaseX (or Fast Ethernet), supports data transfer rates of 100 MBPS, and a proposed standard, called Gigabit Ethernet, will support data rates of 1 gigabit (1,000 megabits) per second.

Flash Memory: A non-volatile memory that "remembers" data stored in it, even without power (or backup batteries). A special type of EPROM (Erasable Programmable Read Only Memory) that can be erased and reprogrammed in blocks (blocks are groups of bytes, usually in multiples like 4k, 16k, etc.) instead of one byte at a time. Some modern PCs have their BIOS stored on a flash memory chip so that it can easily be updated if necessary. Such a BIOS is sometimes called a flash BIOS.

Frequency Response: A measure of the effectiveness with which a circuit, device or system transmits the different frequencies applied to it. The way in which an electronic device (microphone, amplifier, or speaker) responds to signals having a varying frequency. This is a measurement of how well an amplifier reproduces and amplifies a specified audible range with equal amplitude or intensity, for example, 30 to 16,000 Hz.

GB: Gigabyte.

Geographic Information System (GIS): A very simple way of defining GIS is that they are electronic maps. GIS allows the user to define layers, or levels, that generally contain related types of information.
For example: a GIS could have a level that had all of the hydrology for a circumscribed part of the world, or a level that had contour information of the same part of the world, or streets, vegetation types, geology, archaeology, biology and so on. If you envision each of these “thematic” levels on mylar, and overlaid the hydrology, archaeology, and contours, you would then see the relationship between water resources, physiography, and archaeological sites. By entering basemap data (contours, political features, Township, Range, Section lines, etc.), and then importing data from GPS (Global Positioning System) into the GIS, resource management researchers have a powerful tool for analysis.


GlidePoint Touchpads: A touchpad designed to replace the computer mouse and related pointing devices. Without requiring contact pressure, the finger manipulates the screen cursor by simply gliding across the surface. Tapping the surface serves the same purpose as clicking the button on a mouse. A touchpad provides all the functions of a mouse, and more. A mouse reports relative motion, i.e., its change in position, while touchpads can also report the absolute position of a finger or a handheld pen-like stylus. This supports the signing of electronic documents, simple sketching, and a host of exciting future applications. Additionally, a "Z" factor relating to finger contact pressure is available. These characteristics give the glidepoint touchpad an intelligence potential and capability beyond that of the mouse.

Global Positioning System (GPS): A satellite-based device that records x, y, z co-ordinates and other data using global positioning. GPS devices can be taken into the field to identify an individual's global position while driving, flying, or hiking. Ground locations are calculated by signals from satellites orbiting the Earth. GPS devices play a significant role in geographic data collection.

Head Mounted Display (HMD): State-of-the-art system worn on the head consisting of optics that present visual information to the user. Miniature display technologies generate images, either monocular or binocular. High-resolution VGA displays are the standard.

Hidden Markov Modelling (HMM): Based upon a statistical state sequence known as a Markov chain, consisting of a set of states with transitions between the states. Each state corresponds to a symbol, and to each transition is associated a probability. Symbols are produced as the output of the Markov model by probabilistic transitioning from one state to another.

Infrared Transmission: Wireless computing is one step closer to broad industry acceptance thanks to a new infrared data-transfer standard. The Infrared Data Association (IRDA) has expanded the existing protocol to accommodate transfer speeds of 1.152 and 4 megabits per second (MBPS), a significant improvement over the 115-Kbps rate of IRDA 1.0. At these faster speeds, infrared (IR) data transfer is a viable option for PC-to-PC transfer, printing, and, most important, network access. Mobile computers with an IR port can access a network via a beam of light. Windows 95 ships with drivers that preserve data from loss or corruption if the connection is interrupted. All infrared transfers share a few key benefits, such as bi-directional data exchange. And unlike radio-frequency devices, which broadcast data and are subject to interference, IR devices are secure in that they require a direct line of sight. The drawbacks: IR transfers are effective at distances of up to only one meter, and the one-to-one connection means that only one notebook can be connected to a network access device at a given time. Multiple connections are possible sequentially, meaning that any number of notebooks can access the same IRDA device in succession.

Local-Area Network (LAN): A computer network that spans a relatively small area. Most LANs are confined to a single building or group of buildings. However, one LAN can be connected to other LANs over any distance via telephone lines and radio waves. A system of LANs connected in this way is called a wide-area network (WAN). Most LANs connect workstations and personal computers.
Each node (individual computer) in a LAN has its own CPU with which it executes programs, but it is also able to access data and devices anywhere on the LAN. This means that many users can share expensive devices, such as laser printers, as well as data. Users can also use the LAN to communicate with each other, by sending e-mail or engaging in chat sessions. There are many different types of LANs, token-ring networks, Ethernet, and ARCnets being the most common for PCs. Most Apple Macintosh networks are based on Apple's AppleTalk network system, which is built into Macintosh computers. The following characteristics differentiate one LAN from another: Topology: The geometric arrangement of devices on the network. For example, devices can be arranged in a ring or in a straight line.


Protocols: The rules and encoding specifications for sending data. The protocols also determine whether the network uses peer-to-peer or client/server architecture.

Media: Twisted-pair wire, coaxial cables, or fibre optic cables can connect devices. Some networks do without connecting media altogether, communicating instead via radio waves. LANs are capable of transmitting data at very fast rates, much faster than data can be transmitted over a telephone line; but the distances are limited, and there is also a limit on the number of computers that can be attached to a single LAN.

Liquid Crystal Display (LCD): A type of display used in digital watches and many portable computers. LCD displays utilise two sheets of polarising material with a liquid crystal solution between them. An electric current passed through the liquid causes the crystals to align so that light cannot pass through them. Each crystal, therefore, is like a shutter, either allowing light to pass through or blocking the light.

LINUX: A freely distributable implementation of UNIX that runs on a number of hardware platforms, including Intel and Motorola microprocessors. It was developed mainly by Linus Torvalds. Because it is free, and because it runs on a number of platforms, Linux has become extremely popular.

MB: Megabyte.

Microphone: A transducer for converting acoustic energy to electrical energy.

Microwave Signals: An electromagnetic wave used to transmit data with a wavelength ranging from approximately one millimetre to one meter. This region is between infrared and short wave radio wavelengths.

Neural network: A system that uses the human inference concept of an expert system but widens the scope to include many subjects. Several processors, each with its own "speciality", form a problem-solving network.

Omnidirectional Microphone: A microphone that is equally sensitive in all directions.

Personal Computer Memory Card International Association (PCMCIA): PCMCIA is an organisation consisting of some 500 companies that has developed a standard for small, credit card-sized devices, called PC Cards. Originally designed for adding memory to portable computers, the PCMCIA standard has been expanded several times and is now suitable for many types of devices. There are three types of PCMCIA cards. All three have the same rectangular size (85.6 by 54 millimetres), but different widths. Type I cards can be up to 3.3 mm thick and are used primarily for adding additional ROM or RAM to a computer. Type II cards can be up to 5.5 mm thick. These cards are often used for modem and fax modem cards. Type III cards can be up to 10.5 mm thick, which is sufficiently large for serving as portable disk drives.

Peripheral Devices: Any external device attached to a computer. Examples of peripherals include printers, disk drives, display monitors, keyboards, and mice.

QWERTY Keyboard: Pronounced kwer-tee, refers to the arrangement of keys on a standard English computer keyboard or typewriter. The name derives from the first six characters on the top alphabetic line of the keyboard.

RAM: An acronym for random access memory, a type of computer memory that can be accessed randomly; that is, any byte of memory can be accessed without touching the preceding bytes. RAM is the most common type of memory found in computers and other devices, such as printers.

Resolution: Refers to the sharpness and clarity of an image. The term is most often used to describe monitors, printers, and bit-mapped graphic images. For graphics monitors, the resolution signifies the number of dots (pixels) on the entire screen. For example, a 640-by-480-pixel screen is capable of displaying 640 distinct dots on each of 480 lines, or about 300,000 pixels. This translates into different dpi (dots per inch) measurements depending on the size of the screen. For example, a 15-inch VGA monitor (640x480) displays about 50 dots per inch. Printers, monitors, scanners, and other I/O devices are often classified as high resolution, medium resolution, or low resolution. The actual resolution ranges for each of these grades is constantly shifting as the technology improves.

Radio Frequency (RF): A transmission frequency used by radio stations, in the range in which radio waves may be transmitted, 10 kHz to 300,000 MHz.

ROM: An acronym for read-only memory. This is computer memory on which data has been pre-recorded. Once data has been written onto a ROM chip, it cannot be removed and can only be read.

RS-232: IEEE standard that defines three types of connections: electrical, functional, and mechanical. It is the most commonly used interface for the data-transmission range of 0-20 Kbps/50 ft (15.2 m). It employs unbalanced signalling and is usually used with 25-pin D-shaped connectors (DB25) to interconnect Data Terminal Equipment (DTE) such as computers and controllers, and Data Communications Equipment (DCE) such as modems and converters. Serial data exits through an RS-232 port via the Transmit Data (TD) lead and arrives at the destination device's RS-232 port through its Receive Data (RD) lead. Serial data transmission is the most common method of sending data from one DTE to another. Data is sent out in a stream, one bit at a time over one channel.

SCSI: Abbreviation of small computer system interface. Pronounced "scuzzy", SCSI is a parallel interface standard used by Apple Macintosh computers, some PCs, and many UNIX systems for attaching peripheral devices such as disk drives and printers to computers. SCSI interfaces provide for faster data transmission rates (up to 40 megabytes per second) than standard serial and parallel ports. In addition, many devices can be attached to a single SCSI port, so that a SCSI can be an I/O bus rather than simply a parallel interface. Although SCSI is an ANSI standard, there are many variations of it.

Signal-to-Noise Ratio: The ratio, usually expressed in decibels, of the average signal (recorded or processed) to the background noise (caused by the electronic circuits).

Speech recognition: A computer's ability, through software, to accept spoken words as dictation or to follow voice commands. Vocabulary limitations and recognition abilities can vary greatly from system to system. Also called voice recognition or speech understanding.

Speaker dependent: This technology requires users to participate in extensive training exercises that can last several hours. Once the user is done "drilling the machine", the computer performs several calculations on the data it has received from these exercises. After these calculations, the computer builds a voice profile that attempts to match the user's way of speaking.

Speaker-Independent: Speech recognition systems that will respond to any user voice regardless of dialect or tone are said to be speaker-independent. Older generations of speech recognition systems were generally speaker-dependent. These systems had to be trained to each individual's voice characteristics for useful speech recognition.

Streaming Video/Audio: A new class of intelligent, integrated media servers that deliver interactive, real-time, high quality MPEG-1, MPEG-2, and H.263/G.723 video and audio streams to clients via networks. This system will ultimately replace the method of "downloading video clips", which uses large amounts of local storage (not required by streaming video solutions).
SVGA: Short for Super Video Graphics Array (VGA), a set of graphics standards designed to offer greater resolution than VGA. There are several varieties of SVGA, each providing a different resolution, 800 by 600 pixels, 1024 by 768 pixels, 1280 by 1024 pixels, 1600 by 1200 pixels. All SVGA standards support a palette of 16 million colours, but the number of colours that can be displayed simultaneously is limited by the amount of video memory installed in a system. One SVGA system might display only 16 simultaneous colours while another displays the entire palette of 16 million colours. A consortium of monitor and graphics manufacturers develops the SVGA standards.


UNIX: A popular multi-user, multitasking operating system developed at Bell Labs in the early 1970s. Created by just a handful of programmers, UNIX was designed to be a small, flexible system used exclusively by programmers. Although it has matured considerably over the years, UNIX still betrays its origins by its cryptic command names and its general lack of user-friendliness.

VGA: Abbreviation of video graphics array, a graphics display system for PCs developed by IBM. VGA has become one of the de facto standards for PCs. In text mode, VGA systems provide a resolution of 720 by 400 pixels. In graphics mode, the resolution is either 640 by 480 (with 16 colours) or 320 by 200 (with 256 colours). The total palette of colours is 262,144.

VHS: Short for video home system, a trademark for an electronic system for recording video and audio information on videocassettes.

White Noise: A full audio spectrum signal with the same energy level at all frequencies.

Windows NT: An advanced version of the Windows operating system. Windows NT is a 32-bit operating system that supports pre-emptive multitasking. There are actually two versions of Windows NT: Windows NT Server, designed to act as a server in networks, and Windows NT Workstation for stand-alone or client workstations.

Wireless Local Area Network: A LAN utilising radio transmission of voice, data, and images to interconnect mobile users. These systems rely on transportable high-speed reliable radio frequency communications in an area typically a few kilometres in radius.

- o0o -
