2005:084 CIV MASTER'S THESIS

Transcoding SIP gateway

David Åberg

Luleå University of Technology MSc Programmes in Engineering

Department of Computer Science and Electrical Engineering Division of Computer Communication

2005:084 CIV - ISSN: 1402-1617 - ISRN: LTU-EX--05/084--SE

Master Thesis Performed at Omnitor AB Luleå David Åberg (Computer Science, Luleå University of Technology)

Supervisor at Omnitor: Andreas Piirimets

Examiner: Pierre Fransson

Abstract

In today's multimedia society, more and more communication is becoming digitized. During a transitional period, however, digital and analog techniques must coexist so that everyone can communicate with everyone else. For the deaf community this means, among other things, connecting analog text telephones to new Internet-based technology. To enable this coexistence, the Telecommunication Access Rehabilitation Engineering Research Center of the University of Wisconsin Trace Center, jointly with Gallaudet University, and Omnitor AB have presented an idea of how this can be achieved.

This report describes a project that shows how such a gateway can be designed and implemented: a gateway that enables communication between analog text telephones and IP-based clients, allowing the users to communicate with both voice and text regardless of whether they are using analog or digital media.

The implementation in this project is done in Java, so the resulting gateway is as operating system independent as it can be.

The result of this work is a gateway that can separate voice and analog text telephone signals in an incoming sound stream. The separation is good enough not to influence a conversation through the gateway.

Preface

This work was partially funded by the National Institute on Disability and Rehabilitation Research, US Dept of Education, under Grant H133E990006 as part of a co-operation between the Telecommunication Access Rehabilitation Engineering Research Center of the University of Wisconsin Trace Center, jointly with Gallaudet University, and Omnitor. The goal was to promote mainstreaming and functional enhancements toward telecommunications access for all. The opinions herein are those of the authors and not necessarily those of the funding agencies.

Table of Contents

1 Introduction
1.1 About the project
1.2 Purpose
1.3 Objectives
1.3.1 Minimum requirements
1.3.2 Desired requirements
1.3.3 Extra requirements
1.4 Demarcation
1.5 Omnitor
1.6 Design environment
2 Background
2.1 Theory
2.1.1 SIP – Session Initiation Protocol
2.1.1.1 SIP entities
2.1.2 RTP and SDP
2.1.3 JMF – Java Media Framework
2.1.4 V.18 Text telephones
2.1.5 FIR and IIR filter
2.1.5.1 FIR filter
2.1.5.2 IIR filter
2.2 Project tools
3 Project work
3.1 Design
3.1.1 Java design
3.1.2 The controllers
3.1.3 Logging
3.1.4 The different session blocks
3.1.5 Incoming session blocks
3.1.6 Outgoing session blocks
3.1.7 Serial session blocks
3.2 User cases
3.2.1 Case 1
3.2.2 Case 2
3.2.3 Scenarios
3.3 Implementation
3.3.1 Step 1: PSTN modules
3.3.2 Step 2: Modem signal detector
3.3.2.1 Detector usage
3.3.3 Step 3: The gateway
3.4 Testing
3.5 Evaluation
4 Project specific problems
5 Results
6 Conclusions
7 Further work
I References
II Abbreviations

1 Introduction

1.1 About the project

Today more and more communication is IP [9] based. Instant messaging is increasing in popularity, and even phones are now being connected to IP networks through voice over IP (VoIP). The deaf community is also starting to use the Internet, moving from text telephones to new solutions using VoIP. However, the old text telephony system should not be abandoned at the same time, since that would mean losing contact with people who are still using it.

A more serious problem with text phones not being compatible with VoIP is that text phone users become more isolated from the rest of the world if more and more services, such as translation services for the deaf that translate between text, speech and video, start to use software-based systems. In the extreme case, text phone users would be unable to communicate with, for example, emergency rescue centers. To address this issue, a project was started by the University of Wisconsin and Omnitor AB. The goal of the project is to enable “old” text telephones to communicate with the new Internet-based clients.

As the telephone networks embrace digital techniques, the conditions for text telephone traffic become worse. While the transmission quality is sufficient for voice, it is insufficient for modem signals and results in an unacceptable level of errors. For example, when transmitting audio over IP, packet loss is a concern. With voice transmission, lost packets can be smoothed over by error correction techniques that provide a transition from the last good packet to the next good packet. When transmitting text telephony tones, this type of smoothing can actually cause bit errors, resulting in incorrectly decoded characters or worse. In order to address the problem, a number of standards groups are proposing competing solutions.
Some propose to take the analogue text telephone tones, convert them into ITU-T T.140 [2] (an IP text standard), and transmit them digitally with built-in error correction. At the receiver's end, the T.140 characters can then be displayed on an IP telephone or be converted into the text telephone tones of the native country. This approach would allow computer users to send and receive in the T.140 [2] digital format and have it translated into tones if the call is transferred over the public switched telephone network (PSTN). This requires each user to call a provider that has a gateway that connects the PSTN and the IP network and simultaneously translates T.140 into modem signals.

The Telecom RERC (Rehabilitation Engineering Research Center) has proposed a hybrid approach, which has been dubbed the T-hybrid strategy. This approach makes use of special servers that provide translation services between different protocols. The servers should be accessible on the Internet and not be tied to a specific physical location such as the PSTN/IP gateway or the user client. This makes it possible for users to use any of the available services, and users are not bound to a particular provider. This approach has a number of advantages, among others:

1. Allows different components in the communication system to implement different strategies, and allows them all to work together

2. Allows individuals with computers or other digital text communication devices to interact with individuals using analogue text telephone technologies

3. Allows individuals with different incompatible text telephone technologies from different countries to interact successfully

4. Allows individuals using a wide range of text chat technologies to use these technologies transparently to communicate with individuals using other text chat technologies as well as traditional analogue text telephones

The T-hybrid strategy provides a natural means for connecting analogue with digital. It would allow anyone to continue using text telephones, while still being able to communicate freely with friends who have switched to digital technologies. Finally, T-hybrid would allow different companies or industry segments freedom in implementing different strategies as long as they provide equivalent translation capabilities. The components and functionality of T-hybrid can be described as:

• A transparent transcoder between two IP elements on the internet (see figure 1 below)

• Text conversion between digital and analogue formats

• Support for alternating voice and text communication

• Support for connections directly to the PSTN using a text telephone modem connected to T-hybrid

[Figure: a text telephone on the Public Switched Telephone Network (PSTN) is connected through a PSTN/Internet gateway to an IP text device on the Internet, with the T-hybrid server between the gateway and the IP text device.]

Figure 1 The call between the gateway and the IP text device is routed via T-hybrid, which is transparent to the users.

1.2 Purpose

The purpose of this master thesis project is to investigate if and how communication between text telephones and software clients is possible over a transcoding gateway, and to implement a prototype that shows how a transcoding gateway incorporating the T-hybrid strategy can be used.

1.3 Objectives

The main goal is to design a gateway that makes it possible for a text telephone, connected to the Internet through a PSTN/VoIP gateway, to communicate with an Internet-based VoIP client, using SIP [4] for session setup and teardown. A SIP communication library was provided by Omnitor. The hardware for communicating on a PSTN network was also provided, together with a partially implemented interface to that hardware. An ATA box, hardware that translates between PSTN and SIP, was also provided by Omnitor. The ATA box can set up a SIP session and compress the sound so that it can be sent over an IP network with SIP.

1.3.1 Minimum requirements

These requirements were considered critical to the project and must be completed in order for the gateway to work.

• Enabling communication between text telephones using Baudot [6] and VoIP clients, with switching between voice and text telephony. Switching is necessary since the text telephony user cannot write or receive text and speak simultaneously.

• Design of a transcoding gateway.

• The gateway should be able to set up a connection using SIP [4] with a client supporting SIP.

• The gateway should be able to connect to the PSTN through a text telephone modem.

• The gateway should be able to connect a PSTN client with an Internet-based client using SIP [4], or connect two VoIP clients.

• The text transmitted to the SIP clients shall be coded according to RFC 2793 [1].

• The audio transmitted to the SIP clients shall be coded with G.723.

• Configuration options shall be collected in a single file to simplify the creation of an administration interface.

1.3.2 Desired requirements

These requirements are items that should be implemented when the minimum requirements have been fulfilled.

• To be able to handle not only Baudot but the whole V.18 [5] standard.

• Add a web interface for administration of the gateway.

1.3.3 Extra requirements

If all the above requirements have been fulfilled and there still remains time to spend on this project, the following things should be added.

• Add extra intelligence to the switching function to make the conversation flow: make the gateway understand when a user can receive audio or text, if the user cannot receive both simultaneously, so that the communication is switched in a way that neither interrupts the current conversation nor overflows any buffer.

• Access control of which users are allowed to use the gateway, limiting the users based on telephone numbers.

1.4 Demarcation

The following issues were considered to be out of the scope of the thesis.

• No security issues are taken into consideration. This means that possible hijacking of conversations or eavesdropping by unknown listeners is neither considered nor prevented.

• No optimizations are made to the gateway. The detection algorithms are not proven to be the fastest or best in any way; they are tested to work but nothing more. No shortcuts in the SIP protocol, or anything else that could speed up some part of the session, are used.

• No evaluation of any protocol is made. Known libraries that implement the protocols are used; no protocols are implemented in this project.

1.5 Omnitor

Omnitor AB is a Swedish company working in the area of accessible information technology. They develop solutions for disabled people in Sweden and abroad. A main area of work is communication for deaf people. Founded in 1995, Omnitor has contributed to computer communication through participation in development and projects, such as:

• Establishing a quality assessment method for evaluating the performance of sign language through video communication

• Establishing an international standard for text telephony, ITU-T V.18 [5], compatible with all old national methods

• Establishing international standards for Total Conversation including video, text and voice. Total Conversation is standardized in the mobile world's standards organization 3GPP [E7] and in the telecom world's standards organization ITU [E8].

• Developing a PC based Total Conversation Terminal for video, text and voice conversation. It was the first terminal to include harmonising standards for communication in video, text, and voice.

1.6 Design environment

In order to make the code as readable as possible, the Code Conventions for the Java Programming Language, as specified by Sun Microsystems, have been used. The classes are documented with JavaDoc and low-level comments written in the code. The JavaDoc in the code follows the Requirements for Writing Java API Specifications, as specified by Sun.

2 Background

This section describes the theory used in this project. It includes a short introduction to the Session Initiation Protocol (SIP) [4], the Session Description Protocol (SDP) [7] and the Real Time Transport Protocol (RTP) [3]. A description of the Java Media Framework is also provided. Furthermore, the different standards included in V.18, and why V.18 was created, are touched upon. Finite Impulse Response (FIR) [E4] filters and Infinite Impulse Response (IIR) [E5] filters are described, along with how they work in a Digital Signal Processing (DSP) context. The different tools used in this project are also described.

2.1 Theory

2.1.1 SIP – Session Initiation Protocol

The Session Initiation Protocol (SIP) [4] is a signaling protocol used for establishing, modifying and terminating sessions in human-to-human communication on the Internet. A SIP session could be a video conference or a voice call. SIP supports name mapping and redirection services, making the SIP user portable in much the same way as an e-mail user. SIP resembles the HTTP and SMTP protocols in many aspects, for example in its URLs and status codes. Besides session establishment and termination, SIP has many other features, for example [4]:

• Call forwarding

• Call transfers

• Call hold

• Multiparty conferencing

• Call park

• Call waiting

• Call queuing

• Presence information

The typical message flow for setting up SIP user-to-user communication is that the caller initiates the call by sending an INVITE message to the target user. The target answers with status code 180, which means that the target user's SIP application is ringing. When the target user accepts the call, his/her application sends a message with status code 200 (OK), which means that it has accepted the call. The caller's application then confirms that the call has been established by sending an acknowledgment (ACK). As long as the call is established, the exchange of media is taken care of by other protocols. Once a participating user terminates the call, his/her application sends a BYE message to the other participant. The other participant's application then confirms that it has noted the termination by sending a message with status code 200 (OK).
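The message flow described above can be viewed as a small state machine on the caller's side. The following is a minimal sketch under stated assumptions: the names `SipCallSketch`, `State`, `sendInvite`, `onResponse` and `sendBye` are hypothetical, no real SIP messages are parsed, and the SIP library provided by Omnitor is not used.

```java
/** Hypothetical caller-side model of the INVITE / 180 / 200 / ACK / BYE / 200 flow. */
class SipCallSketch {
    enum State { IDLE, INVITE_SENT, RINGING, ESTABLISHED, BYE_SENT, TERMINATED }

    private State state = State.IDLE;

    State state() {
        return state;
    }

    /** The caller initiates the call by sending INVITE. */
    void sendInvite() {
        if (state != State.IDLE) {
            throw new IllegalStateException("INVITE not allowed in state " + state);
        }
        state = State.INVITE_SENT;
    }

    /**
     * Handles a status code from the other party.
     * Returns the name of the request to send in reply, or null if none is needed.
     */
    String onResponse(int statusCode) {
        if (statusCode == 180 && state == State.INVITE_SENT) {
            state = State.RINGING;          // the target application is ringing
            return null;
        }
        if (statusCode == 200 && (state == State.INVITE_SENT || state == State.RINGING)) {
            state = State.ESTABLISHED;      // the target accepted the call
            return "ACK";                   // the caller acknowledges the 200 (OK)
        }
        if (statusCode == 200 && state == State.BYE_SENT) {
            state = State.TERMINATED;       // the other side confirmed the BYE
            return null;
        }
        throw new IllegalStateException("Unexpected " + statusCode + " in state " + state);
    }

    /** A participant terminates an established call by sending BYE. */
    void sendBye() {
        if (state != State.ESTABLISHED) {
            throw new IllegalStateException("BYE not allowed in state " + state);
        }
        state = State.BYE_SENT;
    }
}
```

While the sketch is in the ESTABLISHED state, media would be exchanged by other protocols; this is deliberately left out of the model.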

2.1.1.1 SIP entities

There are two major categories of SIP entities, the SIP client and the SIP server. The SIP client is a device communicating with other SIP clients. These are the applications that the users see and interact with. The server provides extra functionality to the SIP network by providing a known address that a user can connect to. It can also give the user a way through a firewall that would otherwise have been impossible. The SIP client handles SIP calls on a user level and negotiates with other SIP clients about what type of media and what characteristics to use.

The SIP server can be of two types: a redirect server or a proxy server. The redirect server manages the location of the SIP users known to that server. Upon an INVITE request from another SIP entity, the redirect server answers where that user can be found if the user is not located on the redirect server. The calling client then makes a new call to the address provided by the redirect server. A proxy server works similarly to the redirect server. However, the user client does not have to send another INVITE request; this is handled by the proxy.

A SIP address may point to many different destinations, since a single person can have several SIP clients, for example several stationary computers, an IP telephone, a cellular phone, etc. When someone sends an invitation to the proxy address, the proxy signals all the units currently registered in the proxy simultaneously. This is called proxy forking. When the user answers on any of the clients, the rest of the invitations are canceled.

2.1.2 RTP and SDP

While SIP (Session Initiation Protocol) [4] sets up the connection, SDP (Session Description Protocol) [7] descriptions carried in the SIP messages are used to negotiate the media types (video, audio, text) to be used in the session. When this negotiation is done, RTP (Real Time Transport Protocol) [3] takes over the transport of the media between the users. All these protocols are separate and can be used in conjunction with other protocols. RTP was initially made to carry audio and video only but has since been extended to carry text as well.

2.1.3 JMF – Java Media Framework

JMF [E2] is the main way Java handles streaming video and audio. JMF is implemented as a set of native classes for playing and capturing sound and video files in many formats, including Microsoft's WAV and Sun's AU. JMF supports MIDI, MP1, MP2, RTP [3] streaming JPEG [10], RTP H.261 [11] and RTP H.263 [12], among other formats. Both JavaSoft- and Intel-developed packages exist for JMF.

Different audio and visual formats give different amounts of compression and take different amounts of computation to encode (compress in preparation for transmission or storage) and decode (decompress in preparation for playback at the receiving end). There is special-purpose Digital Signal Processing (DSP) hardware that can help, if the JMF codec software is smart enough to use it, such as the MMX DSP-like features of Pentium-class chips, which can accelerate the encoding/decoding process.

Encoding consumes more resources than decoding, and video consumes more resources than audio. The more the data can be compressed without losing any of the original data, the better the quality that can be sent over a connection with a given bandwidth. Some compression formats remove some of the data to make the size even smaller. This means that more images can be sent over a connection with a given bandwidth, but the quality will not be as good as in the first case. Generally, the more the data is compressed, the more computing power it takes to compress and decompress.

JMF uses logical abstractions to represent the different components of the flow of time-based media. Data is captured by external devices, for example a camera. The data is then collected in data source classes, which JMF uses to represent any type of collected multimedia data. Data sources and streams may be created from URLs, files or capture devices. The actual sockets and lower layer details are built into JMF and are hidden from the developer. The captured media is presented to the user using player classes. These classes can play sound or display video for the user. If manipulation of the data is required, a processor can be created from a data source. The abstraction makes sending the data over the network as simple as writing to the local hard disk.

JMF is also extensible and allows the developer to add plug-ins in order to support new media formats. A developer who wants to create a plug-in simply extends one of the plug-in classes. Extending a codec allows the developer to create a custom packetizer or depacketizer. A packetizer is used by JMF to divide the data into suitably sized buffers, and a depacketizer combines the buffers into the original data. An Effect object is used to perform changes on captured or depacketized data, such as raising the volume of sound. Multiplexers and demultiplexers allow the developer to merge different data types into one single stream, for example a single RTP [3] stream that can be sent over the network.
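The packetizer and depacketizer roles described above can be illustrated without JMF. The following is a minimal sketch in plain Java, not the JMF plug-in API: the class `PacketizerSketch` and its methods are hypothetical, and it assumes that no buffers are lost or reordered, which a real RTP depacketizer cannot assume.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of packetizing and depacketizing; not the JMF API. */
class PacketizerSketch {

    /** Divides the data into buffers of at most maxPayload bytes each. */
    static List<byte[]> packetize(byte[] data, int maxPayload) {
        if (maxPayload <= 0) {
            throw new IllegalArgumentException("maxPayload must be positive");
        }
        List<byte[]> packets = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += maxPayload) {
            int end = Math.min(offset + maxPayload, data.length);
            packets.add(Arrays.copyOfRange(data, offset, end));
        }
        return packets;
    }

    /** Combines the buffers, in the order given, back into the original data. */
    static byte[] depacketize(List<byte[]> packets) {
        int total = 0;
        for (byte[] packet : packets) {
            total += packet.length;
        }
        byte[] data = new byte[total];
        int offset = 0;
        for (byte[] packet : packets) {
            System.arraycopy(packet, 0, data, offset, packet.length);
            offset += packet.length;
        }
        return data;
    }
}
```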

2.1.4 V.18 Text telephones

Most countries have created their own text telephone standards. Therefore, text telephone users from different countries have difficulties in calling each other. This problem of communicating across borders was one of the reasons for developing a global standard, and so the V.18 [5] standard was created. When communicating with V.18, only one side has to have a V.18 capable device. V.18 detects which standards the other side supports. If the other side cannot use V.18, V.18 uses one of the old standards. The different standards that make up V.18 are EDT, Baudot [6], V.21, V.23, Bell 103 and DTMF. While EDT, V.21 and V.23 are used in different countries in Europe, Bell is used in the US and Baudot is used in the rest of the world. DTMF is used in Denmark. V.21, V.23 and V.18 use a carrier signal throughout the whole conversation, while the others simply transmit each character directly with only a short carrier signal before and after the character.

2.1.5 FIR and IIR filter

The Finite Impulse Response (FIR) filter [E4] is one of two types of filter used in Digital Signal Processing (DSP) [E3]. The other is the Infinite Impulse Response (IIR) filter [E5]. The impulse response of the FIR filter is said to be “finite” because there is no feedback in the filter: given an impulse, the output will eventually be zero again. IIR filters, on the other hand, use feedback, so when an impulse is applied the output theoretically “rings” indefinitely. The impulse response is “infinite” because the feedback means that an infinite number of non-zero values can (theoretically) come out.

Compared to IIR filters, FIR filters offer the following advantages:

• They can easily be designed to be “linear phase” (and usually are). Put simply, linear-phase filters delay the input signal but do not distort its phase.

• They are simple to implement. On most DSP microprocessors, the FIR calculation can be done by looping a single instruction.

• They are suited to multi-rate applications. Multi-rate refers to either “decimation” (reducing the sampling rate), “interpolation” (increasing the sampling rate), or both. Whether decimating or interpolating, the use of FIR filters allows some of the calculations to be omitted, thus providing an important computational efficiency. In contrast, if IIR filters are used, each output must be individually calculated, even if that output will be discarded (so that the feedback is incorporated into the filter).

• They have desirable numeric properties. In practice, all DSP filters must be implemented using “finite-precision” arithmetic, that is, a limited number of bits. The use of finite-precision arithmetic in IIR filters can cause significant problems due to the use of feedback, but FIR filters have no feedback, so they can usually be implemented using fewer bits, and the designer has fewer practical problems to solve related to non-ideal arithmetic.

• They can be implemented using fractional arithmetic. Unlike IIR filters, it is always possible to implement a FIR filter using coefficients with magnitude of less than 1.0. (The overall gain of the FIR filter can be adjusted at its output, if desired, by multiplying the output by a constant.) This is an important consideration when using fixed-point DSPs, because it makes the implementation much simpler.

Compared to FIR filters, IIR filters offer the following advantages:

• They are usually faster.

• They need fewer coefficients than a FIR filter to achieve the same filter characteristics.

The disadvantages of FIR filters compared to IIR filters are that FIR filters sometimes require more memory and/or calculation to achieve a given filter response characteristic. Also, certain responses are not practical to implement with FIR filters.

In this project FIR filters of order 116 have been used, as well as IIR filters of order 5, which are equivalent. Each filter isolates only one frequency; to test all frequencies, several filters have been used separately.

2.1.5.1 FIR filter

The output from a finite impulse response (FIR) [E4] filter depends only on the present and previous inputs. The filter is described by the finite difference equation

$$ y[n] = \sum_{k=0}^{N-1} b_k \, x[n-k] \tag{2.1} $$

The z-transformed function is then

$$ H(z) = \sum_{k=0}^{N-1} b_k \, z^{-k} \tag{2.2} $$

A way to illustrate how FIR filters work is to look at a first-order FIR [E4] filter. The z-transform function is then

$$ H(z) = b_0 + b_1 z^{-1} \tag{2.3} $$

This can be illustrated as follows:

[Block diagram: the input x[n] is multiplied by b_0 and fed to an adder; x[n] also passes through a delay element z^{-1} and is multiplied by b_1 before entering the same adder, whose output is y[n].]

The input to the delay element ($z^{-1}$) at time n is x[n], and the corresponding output value is x[n−1], which is multiplied by the constant $b_1$. The input signal x[n] is also multiplied by the constant $b_0$. The output y[n] of the filter is found by adding these two signals together:

$$ y[n] = b_0 \, x[n] + b_1 \, x[n-1] \tag{2.4} $$
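Equation 2.1 translates almost directly into code. The following is a minimal sketch of a direct-form FIR filter, assuming a hypothetical class name `FirFilterSketch` and double-precision samples; it is not the filter class used in the gateway.

```java
/** Hypothetical direct-form FIR filter implementing y[n] = sum of b_k * x[n-k]. */
class FirFilterSketch {
    private final double[] b;        // coefficients b_0 .. b_{N-1}
    private final double[] history;  // history[k] holds x[n-k]

    FirFilterSketch(double[] coefficients) {
        if (coefficients.length == 0) {
            throw new IllegalArgumentException("at least one coefficient is required");
        }
        b = coefficients.clone();
        history = new double[b.length];  // starts at zero: x[n] = 0 for n < 0
    }

    /** Feeds one input sample x[n] and returns the output sample y[n]. */
    double filter(double x) {
        // Shift the delay line one step and store the newest sample first.
        System.arraycopy(history, 0, history, 1, history.length - 1);
        history[0] = x;

        double y = 0.0;
        for (int k = 0; k < b.length; k++) {
            y += b[k] * history[k];
        }
        return y;
    }
}
```

Feeding a unit impulse into the sketch returns the coefficients themselves, one per sample, followed by zeros, which is the finite impulse response that gives the filter its name.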

In this project FIR filters of order 116 were used. To obtain the values of $b_n$, the Kaiser windowing method was used:

$$ b[n] = w[n] \, h_D[n] \tag{2.5} $$

where $h_D[n]$ is the ideal impulse response and w[n] is given by

$$ w[n] = \frac{I_0\!\left( \beta \sqrt{1 - \left( \frac{|2n - N + 1|}{N - 1} \right)^{2}} \right)}{I_0(\beta)} \tag{2.6} $$

where $I_0$ is the zero-order modified Bessel function of the first kind, described by the formula

$$ I_0(x) = \sum_{k=0}^{\infty} \left( \frac{x^{k}}{2^{k} \, k!} \right)^{2} \tag{2.7} $$

N is the number of filter coefficients. $\beta$ affects the side lobe attenuation: increasing $\beta$ widens the main lobe and decreases the amplitude of the side lobes.
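Equations 2.5 to 2.7 can be evaluated numerically as sketched below. This is a minimal sketch, not the design code of the project: the class `KaiserWindowSketch` and its method names are hypothetical, the series in Equation 2.7 is truncated once the terms become negligible, and the value of β is left to the caller since it is not stated here.

```java
/** Hypothetical evaluation of the Kaiser window of Equations 2.5 to 2.7. */
class KaiserWindowSketch {

    /** Zero-order modified Bessel function of the first kind, by the series (2.7). */
    static double besselI0(double x) {
        double sum = 1.0;    // the k = 0 term
        double term = 1.0;   // holds (x/2)^k / k!
        for (int k = 1; k < 50; k++) {
            term *= (x / 2.0) / k;
            double squared = term * term;
            sum += squared;
            if (squared < 1e-12 * sum) {
                break;       // remaining terms are negligible
            }
        }
        return sum;
    }

    /** Window values w[0] .. w[N-1] according to Equation 2.6. */
    static double[] window(int n, double beta) {
        if (n < 2) {
            throw new IllegalArgumentException("at least two coefficients are required");
        }
        double[] w = new double[n];
        double denominator = besselI0(beta);
        for (int i = 0; i < n; i++) {
            double r = (2.0 * i - n + 1) / (n - 1);            // runs from -1 to 1
            double root = Math.sqrt(Math.max(0.0, 1.0 - r * r));
            w[i] = besselI0(beta * root) / denominator;
        }
        return w;
    }

    /** Coefficients b[n] = w[n] * hD[n] according to Equation 2.5. */
    static double[] apply(double[] idealResponse, double beta) {
        double[] w = window(idealResponse.length, beta);
        double[] b = new double[idealResponse.length];
        for (int i = 0; i < b.length; i++) {
            b[i] = w[i] * idealResponse[i];
        }
        return b;
    }
}
```

A filter of order 116 has 117 coefficients, so `window(117, beta)` gives a window that is symmetric around its middle value w[58] = 1.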

2.1.5.2 IIR filter

A recursive filter involves feedback. In other words, the output values are calculated using one or more of the previous outputs, as well as inputs. In most cases a recursive filter has an impulse response which theoretically continues forever. It is therefore referred to as an infinite impulse response (IIR) [E5] filter. Assuming the filter is causal, so that the impulse response is h[n] = 0 for n < 0, it follows that h[n] cannot be symmetrical in form. Therefore, an IIR filter cannot display pure linear-phase characteristics like its counterpart, the FIR [E4] filter. The finite difference equation and transfer function of an IIR filter are described by the equation

$$ y[n] = \sum_{k=0}^{N} b_k \, x[n-k] - \sum_{k=1}^{M} a_k \, y[n-k] \tag{2.8} $$

and the equation

$$ H(z) = \frac{\sum_{k=0}^{N} b_k \, z^{-k}}{1 + \sum_{k=1}^{M} a_k \, z^{-k}} \tag{2.9} $$

respectively. In general, the design of an IIR filter involves one or more strategically placed poles and zeros in the z-plane, to approximate a desired frequency response. An analogue filter can always be described by a frequency domain transfer function of the general form

$$ H(s) = K \, \frac{(s - z_1)(s - z_2)(s - z_3) \cdots}{(s - p_1)(s - p_2)(s - p_3) \cdots} \tag{2.10} $$

where s is the Laplace variable and K is a constant, or gain factor. The filter is characterised by its poles $p_1, p_2, p_3, \ldots$ and its zeros $z_1, z_2, z_3, \ldots$, which can be plotted in the complex s-plane. The frequency response of the filter, $H(\omega)$, can be obtained by substituting $s = j\omega$ into Equation 2.10. The complete response of the filter is then generated by varying $\omega$ between 0 and $\infty$ in the equation

$$ H(\omega) = K \, \frac{(j\omega - z_1)(j\omega - z_2)(j\omega - z_3) \cdots}{(j\omega - p_1)(j\omega - p_2)(j\omega - p_3) \cdots} \tag{2.11} $$

To illustrate an IIR filter, a first-order filter can be used:

$$ H(z) = \frac{1}{1 + a_1 z^{-1}} \tag{2.12} $$

[Block diagram: the input x[n] is fed to an adder whose output is y[n]; y[n] passes through a delay element z^{-1}, is multiplied by −a_1 and is fed back into the adder.]

The input to the delay element ($z^{-1}$) at time n is y[n], and the corresponding output value is y[n−1], which is multiplied by the constant $-a_1$ and then added to the input value x[n]. The filter can also be written as

$$ y[n] = -a_1 \, y[n-1] + x[n] \tag{2.13} $$
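The feedback in Equation 2.13 is easy to see in code. The following is a minimal sketch of the first-order filter only, with a hypothetical class name `FirstOrderIirSketch`; the order 5 filters used in the project are not shown.

```java
/** Hypothetical first-order IIR filter implementing y[n] = -a1 * y[n-1] + x[n]. */
class FirstOrderIirSketch {
    private final double a1;
    private double previousOutput = 0.0;  // y[n-1], zero before the first sample

    FirstOrderIirSketch(double a1) {
        this.a1 = a1;
    }

    /** Feeds one input sample x[n] and returns the output sample y[n]. */
    double filter(double x) {
        double y = -a1 * previousOutput + x;
        previousOutput = y;  // the output is fed back through the delay element
        return y;
    }
}
```

With a1 = −0.5, a unit impulse gives the outputs 1, 0.5, 0.25, 0.125 and so on: the response is halved for every sample but never becomes exactly zero in exact arithmetic, which is the “infinite” impulse response. The sketch is only stable when the magnitude of a1 is below 1.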

2.2 Project tools

The development environment for this project was Fedora Core 2 Linux with Java 2 SDK 1.4.2 [E1] and JMF 2.1.1e [E2]. RXTX [E6] was used to communicate with the COM port, since the Java Comm API only works in Windows (RXTX works in both Linux and Windows). The software tools used were NetBeans 3.6 as the programming environment, Ethereal for sniffing packets in order to verify that the IP communication worked well, Dia for UML design, and Concurrent Versions System (CVS) for version handling. OpenOffice 1.1.1 and Microsoft Word 2002 were used for all text documents in this project.

3 Project work

This section describes the design and implementation of the transcoding gateway created in this project. The design section describes the design of the gateway. Since not all of the gateway functionality was implemented, only parts of the design are described in the implementation section.

3.1 Design

The use of Java was specified by Omnitor, as was the choice of Linux as the operating system. The T-hybrid gateway was designed to be as modular as possible, and it should be easy to extend the functionality to other modem standards or networks. Each part of the system should be easy to use in other systems, independently of the other parts. From the start of the project there was a desire from Omnitor AB that MEGACO [8], Media Gateway Control, should be used as the communication protocol between the different modules in the gateway to add modularity. This was not implemented, since implementing MEGACO was too extensive and did not fit into the time frame of this master thesis project.

[Figure: incoming RTP session, V18 translation, outgoing RTP session.]

Figure 2 A schematic view of the stream through the gateway.

In SIP each client has one incoming session and one outgoing session. In T-hybrid one incoming session is linked through some session blocks to one outgoing session to connect two users. In some special cases, when the incoming audio consists of text telephone signals, the audio needs to be converted to text. In this case the incoming session is connected to a V18 translation module and not directly to the outgoing session (see Figure 2). There are special controller classes that decide how the different sessions are created and how they are connected to each other. To help the controllers there are several defined session blocks that, when combined, can connect two clients.

3.1.1 Java design

The BufferRenderer, AudioReciver, Detector and Demixer are the only classes that were fully implemented during this project and are therefore the only parts that are designed in detail. They were designed with the intention of being modular, making them easy to use in other projects and simple to expand with additional functions.

The BufferRenderer is used to get the raw audio packets from JMF.

The AudioReciver is a renderer whose purpose is to deliver the audio as raw audio packets. It uses the BufferRenderer to get the buffer packets from the audio track of the incoming RTP [3] stream.

The Detector is fed with raw audio packets and reports the result to its listeners. It uses a simple set of filters to see if the power at a specific frequency is larger than a given threshold. If so, the Detector reports that it has found a signal.

The Demixer is a listener to the Detector class. It is fed with raw audio packets from the BufferRenderer. The Demixer can have several outgoing channels. Based on the information from the Detector class, the incoming packets are sent to different output channels. The Demixer also makes sure that all modem sounds are sent to the modem channel. If, for example, a packet is only partly filled with a modem signal, the signal can be too short to be detected by the Detector class, resulting in a clicking noise. To prevent this, the packet prior to the packet in which the modem signal was detected is also treated as containing a modem signal. If a single packet in a series of packets with detected modem signals is not detected as a modem signal, the whole series of packets is sent to the modem channel as a precaution.
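The interplay between the Detector threshold and the Demixer routing rules described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implemented classes: the class `DemixerSketch` and its methods are hypothetical, the samples are assumed to have already passed through one of the frequency filters, and the routing works on a complete list of detection results, whereas a streaming Demixer would have to hold back one packet before deciding.

```java
/** Hypothetical sketch of the Detector threshold test and the Demixer routing rules. */
class DemixerSketch {

    /** Mean power of one packet of samples that have already been filtered. */
    static double meanPower(double[] filteredSamples) {
        if (filteredSamples.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double sample : filteredSamples) {
            sum += sample * sample;
        }
        return sum / filteredSamples.length;
    }

    /** Detector decision: a signal is reported if the power exceeds the threshold. */
    static boolean isModemSignal(double[] filteredSamples, double threshold) {
        return meanPower(filteredSamples) > threshold;
    }

    /**
     * Demixer decision for a sequence of packets.
     * Returns true at index i if packet i goes to the modem channel.
     * A packet goes to the modem channel if a signal was detected in it, or if a
     * signal was detected in the packet directly after it. The second rule covers
     * both the partly filled packet before a detection and a single undetected
     * packet inside a series of detected packets.
     */
    static boolean[] routeToModem(boolean[] detected) {
        boolean[] modem = new boolean[detected.length];
        for (int i = 0; i < detected.length; i++) {
            boolean next = i + 1 < detected.length && detected[i + 1];
            modem[i] = detected[i] || next;
        }
        return modem;
    }
}
```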

3.1.2 The controllers

The design model of T-Hybrid should consist of only a small number of smart components that control several dumb components. Smart components decide what to do depending on the data sent through the gateway, and they can decide which components are needed to complete a certain task. Dumb components are told what to do by the user or by another component. The controllers are arranged in a hierarchy (see Figure 3).

[Figure 3: diagram with the boxes Controller, V18 modem, Admin, RTP Controller and SIP module]

Figure 3 An overview of the controller hierarchy used in the gateway.

The Controller is the overall controller; it collects system settings and tells the other controllers which tasks to perform. Admin is a dumb component which provides the Controller with the settings specified by the administrator. SIP handles the connection sequence. When the SIP component has connected both users, it signals the Controller and then waits for the next session to start. The session information is sent to the RTP Controller, which then decides what kind of session blocks to create. The RTP Controller is also the component that decides which frequencies should be checked for modem signals.

The V18 modem component handles all the text telephony translation. For the moment this means that the component controls the hardware modem that is used to translate the modem signals.

3.1.3 Logging

Logging is done through Java's logger object and the log is saved in XML format. The log is used both for debugging and for keeping gateway usage statistics.
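A minimal sketch of XML logging with the standard java.util.logging package is shown below. It assumes that "Java's logger object" refers to this package; the logger name, the file name and the class GatewayLogging are made up for the example and do not come from the thesis.

```java
import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.XMLFormatter;

/** Hypothetical helper (not from the thesis) that sets up a logger writing XML records. */
final class GatewayLogging {

    /** Creates a logger that appends XML-formatted records to the given file. */
    static Logger createXmlLogger(String name, String fileName) throws IOException {
        Logger logger = Logger.getLogger(name);
        FileHandler handler = new FileHandler(fileName);
        handler.setFormatter(new XMLFormatter()); // one <record> element per log call
        logger.addHandler(handler);
        logger.setLevel(Level.ALL);               // keep debug output as well as statistics
        return logger;
    }

    public static void main(String[] args) throws IOException {
        Logger log = createXmlLogger("thybrid.demo", "thybrid-log.xml");
        log.info("session started");        // could feed usage statistics
        log.fine("modem signal detected");  // debugging detail
    }
}
```

Because every record carries a timestamp, level and message, the same file can be read by a debugging tool and by a script that counts sessions.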

3.1.4 The different session blocks

All the following session blocks are created and controlled by the RTP Controller. The first four blocks (1 to 4) are attached to the incoming RTP streams and the last four (5 to 8) are attached to the outgoing RTP streams. Each incoming block is connected to one or several outgoing blocks.

3.1.5 Incoming session blocks

The incoming session block is a group of classes started by the RTP Controller to handle the incoming RTP stream. Depending on the type of the incoming data, text, ..., different blocks are active. The incoming session can be one of the following four blocks:

1.) The calling user may be using a text telephone, in which case the incoming stream is audio only. The Detector then checks whether the incoming sounds are modem signals or speech. If a modem signal is detected, the Demixer is notified and the audio stream is sent to the V18 translation, where it is converted into text. If no modem signal is detected, the audio stream is sent "as is" to the receiving user. The output from the Demixer is in raw audio buffers. Since the Detector destroys the audio it processes, the buffers are cloned and one copy is sent to the Detector and the other to the Demixer (see Figure 4).

[Figure 4: diagram in which RTP enters the Audio Receiver, whose raw audio output is fed to both the Detector and the Demixer; the Demixer outputs raw audio]

Figure 4 Session block used to separate modem signals from speech.

2.) If one of the incoming streams from the user is T.140, a Text Receiver receives the text and sends it to a Buffer. The Buffer is there in case the other user is using a text telephone, so that the text can be held until the text telephone user can receive text. The Buffer is controlled by the RTP Controller so that the Mixer only has one active incoming stream at a time (see Figure 5).

[Figure 5: diagram of the chain RTP, Text Receiver, Buffer, with text passed between the blocks]

Figure 5 Session block to receive text coded according to T.140.

3.) If the received audio must be converted to raw audio buffers (so that JMF can handle and modify the audio), an Audio Receiver is used. Raw audio buffers are the simplest form of audio in JMF; these buffers are uncompressed (see Figure 6). This block is used to deliver the correct format to the Mixer (see Figure 13).

[Figure 6: diagram of the chain RTP, Audio Receiver, raw audio]

Figure 6 Session block to receive audio.

4.) In any other case the RTP stream is received and sent to the other side unchanged. (See Figure 7).

[Figure 7: diagram of the chain RTP, RTP Receiver, RTP]

Figure 7 Session block to receive an unspecified RTP stream.
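The choice between the four incoming blocks can be summarised in code. The sketch below is only one reading of the list above: the enum names, the select method and its two boolean parameters are hypothetical and are not taken from the MediaController implementation.

```java
/** Hypothetical selector (not from the thesis) for the incoming session blocks 1 to 4. */
final class IncomingBlockSelector {

    enum MediaType { AUDIO, TEXT_T140, OTHER }

    enum IncomingBlock { DETECT_AND_DEMIX, TEXT_RECEIVER, AUDIO_RECEIVER, RTP_PASS_THROUGH }

    /**
     * @param media           type of the incoming RTP stream
     * @param senderAudioOnly true if the sender offers audio only (possibly a text telephone)
     * @param needsRawAudio   true if the outgoing side needs raw audio buffers, e.g. for the Mixer
     */
    static IncomingBlock select(MediaType media, boolean senderAudioOnly, boolean needsRawAudio) {
        if (media == MediaType.AUDIO && senderAudioOnly) {
            return IncomingBlock.DETECT_AND_DEMIX;   // block 1: Detector plus Demixer
        }
        if (media == MediaType.TEXT_T140) {
            return IncomingBlock.TEXT_RECEIVER;      // block 2: Text Receiver plus Buffer
        }
        if (media == MediaType.AUDIO && needsRawAudio) {
            return IncomingBlock.AUDIO_RECEIVER;     // block 3: decode to raw audio
        }
        return IncomingBlock.RTP_PASS_THROUGH;       // block 4: forward unchanged
    }

    public static void main(String[] args) {
        System.out.println(select(MediaType.AUDIO, true, false));
        System.out.println(select(MediaType.OTHER, false, false));
    }
}
```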

3.1.6 Outgoing session blocks

The outgoing session block is a group of classes started by the RTP Controller to handle the outgoing RTP stream. The outgoing session can be composed of one of the following four blocks:

5.) If the receiving user is using a text telephone, a Mixer is created. The Mixer selects one of the incoming channels (one channel from the V18 translator and one audio channel from the other user). Only one channel should be sending at any time; the other channel is ignored and any data arriving on it is lost. The currently active channel is controlled by the RTP Controller (see Figure 8).

[Figure 8: diagram in which two raw audio inputs enter the Mixer, whose raw audio output goes to the Audio Transmitter and out as RTP]

Figure 8 Session block to combine two audio streams and send them over RTP.

6.) If one of the outgoing channels is a text channel, a Buffer and a Text Transmitter are created. The Buffer is needed when the incoming text is received faster than the configured maximum outgoing speed (see Figure 9).

[Figure 9: diagram of the chain text, Buffer, Text Transmitter, RTP]

Figure 9 Session block to send text.

7.) If the audio is in raw audio buffers, the Audio Transmitter packetizes the audio (see Figure 10).

[Figure 10: diagram of the chain raw audio, Audio Transmitter, RTP]

Figure 10 Session block to send audio.

8.) In any other case the RTP stream is just transmitted "as is" (see Figure 11).

[Figure 11: diagram of the chain RTP, RTP Transmitter, RTP]

Figure 11 Session block to send RTP packets.
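The Buffer used in blocks 2 and 6 has to hold text on request and release it no faster than a given speed. A minimal sketch of such a buffer is given below. The class HoldableTextBuffer and its methods are hypothetical; pacing is left to the caller, who is assumed to call poll at fixed intervals with the number of characters allowed per interval.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical FIFO text buffer (not from the thesis) with hold and rate-limited output. */
final class HoldableTextBuffer {
    private final Deque<Character> queue = new ArrayDeque<>();
    private boolean held;

    /** Appends text to the end of the queue. */
    synchronized void write(String text) {
        for (char c : text.toCharArray()) {
            queue.addLast(c);
        }
    }

    /** While held, poll returns nothing and the text stays queued. */
    synchronized void setHold(boolean hold) {
        held = hold;
    }

    /** Returns at most maxChars characters in arrival order, or "" while held. */
    synchronized String poll(int maxChars) {
        StringBuilder out = new StringBuilder();
        while (!held && out.length() < maxChars && !queue.isEmpty()) {
            char c = queue.removeFirst();
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        HoldableTextBuffer buffer = new HoldableTextBuffer();
        buffer.write("hello");
        buffer.setHold(true);                // text telephone user cannot receive yet
        System.out.println("[" + buffer.poll(3) + "]");
        buffer.setHold(false);               // user can receive: release at most 3 characters per poll
        System.out.println("[" + buffer.poll(3) + "]");
        System.out.println("[" + buffer.poll(3) + "]");
    }
}
```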

3.1.7 Serial session blocks

To connect directly to the PSTN, a V.18 modem is used. The modem translates the V.18 signal to ASCII and sends it to the computer through the serial interface. This text can then be fed to the text buffer and sent to a SIP client. In the opposite direction, the text from a text buffer can be sent to the modem through the serial interface, and the modem translates the text to V.18 signals (see Figure 12).

[Figure 12: two chains. From PSTN to RTP: PSTN, V.18 modem, Buffer, Text Transmitter, RTP. From RTP to PSTN: RTP, Text Receiver, Buffer, V.18 modem, PSTN]

Figure 12 The chain of modules used when connecting a PSTN client with a client using RTP.

3.2 User cases

This section presents a number of user cases. They demonstrate how the different session blocks can be combined in various situations. In all cases the receiving end, the callee, can receive all types of RTP streams (video, audio, text).

3.2.1 Case 1

The calling user, the caller, has an audio channel only and is probably using a text telephone. When the SIP connection is established, the RTP Controller knows which media are in use and can therefore decide to use incoming session block 1 for the audio-only user and outgoing session blocks 6 and 7 for the other user. The video channel will not be used in this case. For the other user, the callee, incoming session blocks 2 and 3 are used, together with outgoing session block 5 towards the audio-only user, the caller. The RTP Controller controls the Buffer so that the text is buffered until the audio-only user can accept modem signals (see Figure 13).

[Figure 13: diagram connecting the blocks Audio Receiver, Detection, Demixer, V18 translation, Buffer, Text Transmitter and Audio Transmitter in one direction, and Text Receiver, Buffer, V18 translation, Audio Receiver and Mixer in the other]

Figure 13 The session blocks used when one user can only receive audio and the other can use both text and sound, and how the blocks are connected in the gateway when connecting the two users.

3.2.2 Case 2

The calling user, the caller, can receive and send both text and audio. The RTP Controller then creates two RTP transmitters, one for audio and one for text. The same is done in the other direction. Video is not used.

[Figure 14: two chains, one for audio and one for text, each of the form RTP, RTP Receiver, RTP Transmitter, RTP]

Figure 14 The session blocks used to connect the users when both users are capable of sending and receiving both voice and text.

3.2.3 Scenarios

The following table shows the different combinations of capabilities (Text, Audio and Video) the two users may have and how the system handles each of these cases. The numbers in parentheses refer to the session blocks described in sections 3.1.5 and 3.1.6. "(Video)" means that the user may have video.

User1 (caller) | User2 (callee) | Controller
Text, Audio, (Video) | Text | Links the text channel with an RTP receiver (4) and an RTP transmitter (8); other channels will be closed.
Text, Audio, (Video) | Text, Audio | See user case 2.
Text, Audio, (Video) | Text, Video | Links the text and video channels with RTP receiver (4) and RTP transmitter (8).
Text, Audio, (Video) | Text, Audio, Video | Links the text, audio and (if User1 has video) video channels with RTP receiver (4) and RTP transmitter (8).
Text, Audio, (Video) | Audio | See user case 1.
Text, (Video) | Audio | All sounds from the audio channel will be translated into text and all text in the text channel will be translated to modem signals. Video will not be used.
Text, Audio, (Video) | Video | If User1 has video, the video channel will be linked with RTP receiver (4) and RTP transmitter (8). If User1 does not have video, the call will be disconnected.

3.3 Implementation

The implementation was divided into three steps. In the implementation phase the Controller and the RTP Controller were implemented under the names ServerController and MediaController.

Step 1: Create a module that connects a PSTN call from a text telephone modem to a software-based text SIP client and enables communication between the hosts.

Step 2: Create a module that receives an RTP stream and detects whether it contains a modem signal.

Step 3: Create a gateway that connects two RTP sessions, detects whether the audio stream contains any modem sounds and, if that is the case, translates the analog signals to text for the SIP client. In this step no new classes were created; only modifications to existing classes were made, mostly in the ServerController and MediaController classes.

3.3.1 Step 1: PSTN modules

The PSTN module uses an external modem that connects to the computer through the serial interface. All communication with the modem is done in ASCII using AT commands. In this step the interface that enables communication with the hardware modem was completed. In the following tables the created classes are listed with a short description.

Modem/Serialport connection classes:
SerialConnection.java: Represents the connection to the modem. It sets up and initializes the modem and prepares it to receive and create connections.
ConnectionState.java: Defines the connection states (CONNECTING, CONNECTED, DISCONNECTING,...).
ModemCommand.java: Represents a command that is sent to the modem.
ModemCommandHandler.java: Sends the different commands and waits for their responses, or times out if a command takes too long.
ModemResponse.java: Represents the modem's response to a sent command.
ConnectionMode.java: Holds the connection mode the modem is currently in (V.18, BAUDOT,...).
SerialConnecitonReciever.java: Server thread that accepts and handles data received from the modem.
SerialConnecitonTransmitter.java: Handles the data sent to the modem.
SerialPortParamerers.java: Sets and handles the parameters sent to the modem during initialization.
SerialReceiver.java: Class used to send data to remote user.
SerialTansmitter.java: Class used to receive data from remote user.

Events:
ConnectionStateChangedEvent.java: Class that holds the new status of the modem.

Interfaces:
ModemCommandHandlerListener.java: Interface used for callbacks when a modem command is finished.
SerialConnectionListener.java: Interface used for callbacks when the modem state changes or a new incoming call arrives.

Exceptions:
SerialConnectionException.java: Used to report errors in the serial connection.

T-Hybrid server classes:
Thybrid.java: The main class.
ServerController.java: Overall controller that starts up the MediaController, which handles the different call media (SIP, PSTN), and also the Admin module. This class implements the functionality of the Controller in the design.
MediaController.java: Handles and sets up the different calls. It decides whether detection is used and how it is used. This class implements the functionality of the RTP Controller in the design.
SessionInfo.java: Holds the information about a certain session, such as user information and the media used.
SerialHost.java: A representation of the user using the PSTN. It is an implementation of the SessionInterface.

Interfaces:
SessionInfoListener.java: Interface used by the session to report that a new connection has been received.
SessionListener.java: Class used to get feedback from a session.
SessionInterface.java: Interface used by the MediaController to control the different sessions.
SerialHostListener.java: Class used to get feedback from a session connected to the serial port.
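The command and response exchange that ModemCommandHandler is described as performing could look roughly like the sketch below. It is not the thesis code: the class AtCommandSketch and its send method are hypothetical, the serial port is replaced by plain Java streams so that the example runs without a modem, and the timeout handling mentioned in the table is left out. It relies only on the common AT convention that a command ends with a carriage return and the modem answers with a result line such as OK or ERROR.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch (not from the thesis) of one AT command exchange over two streams. */
final class AtCommandSketch {

    /** Sends one command and returns true on "OK", false on "ERROR" or end of stream. */
    static boolean send(String command, OutputStream toModem, InputStream fromModem) throws IOException {
        toModem.write((command + "\r").getBytes(StandardCharsets.US_ASCII));
        toModem.flush();
        // A real handler would keep one reader per connection and add a timeout.
        BufferedReader reader = new BufferedReader(new InputStreamReader(fromModem, StandardCharsets.US_ASCII));
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.equals("OK")) {
                return true;
            }
            if (trimmed.equals("ERROR")) {
                return false;
            }
            // Anything else (command echo, blank lines, informational text) is skipped.
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Simulated modem output: command echo followed by the result code.
        InputStream fromModem = new ByteArrayInputStream("AT\r\r\nOK\r\n".getBytes(StandardCharsets.US_ASCII));
        ByteArrayOutputStream toModem = new ByteArrayOutputStream();
        System.out.println(send("AT", toModem, fromModem));
    }
}
```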

3.3.2 Step 2: Modem signal detector

The detector uses a set of filters to filter out the modem signals. When the energy at any of the scanned frequencies exceeds a predefined threshold, the detector marks that buffer as containing a wanted frequency. To make the signal detector less sensitive to the longer pauses between characters in some modem standards, some delays in switching have been implemented. If a sequence of buffers containing V18 modem signals has been detected, the next incoming buffer is also handled as if it had been detected as a modem signal. This prevents buffers that are only partly filled with a modem signal, that is, buffers containing silence and only a short segment of modem signal that is too short to trigger the detector, from accidentally being sent to the voice channel. For the same reason, the buffer before a detected modem signal buffer is also handled as if it contained a modem signal. Buffers that are only partly filled with modem signals are otherwise heard by the user as "clicks".

The filters can be of any sort; the ones in this prototype are FIR and IIR filters. The FIR filters were created with a Java applet using the Kaiser window method. The IIR filters were created with MATLAB.

RTP connection classes:
Codec.java: Representation of a codec used in a SIP session.
CodecList.java: Representation of all codecs used in the current session.
SipCaller.java: Class used when calling another SIP client.
SipHost.java: Class that receives incoming SIP calls.
SipHostListener.java: Used to get feedback from a SIP session.
SipSession.java: Representation of the user using SIP. It is an implementation of the SessionInterface.

T-Hybrid server classes:
TextBuffer.java: A text FIFO buffer class with functionality for blocking input and inserting information text. This class can be told to hold the text it is fed with. It can also be specified how fast the buffer should output characters.
Bufferrenderer.java: Used to get raw audio from an RTP stream.
Channel.java: Representation of the output from the Demixer.
Demixer.java: This class has one or more Channels and can be controlled to redirect its output to any of the output Channels.

Interfaces:
BufferListener.java: Interface used by the BufferRenderer to write its output data.

Classes for the detection:
SignalDetector.java: Used to check a raw audio buffer for modem signals.

Interfaces:
Filter.java: Interface implemented by all filters used by the SignalDetector.

In addition, all the filter classes were created, each implementing the filter interface specified in Filter.java.
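To illustrate the idea of a filter followed by an energy threshold, the sketch below uses a second-order IIR band-pass section (a biquad in the widely used audio EQ cookbook form). It is an assumption-laden example and not the filters generated for the prototype: the interface ToneFilter stands in for Filter.java, whose real signature is not given here, and the 1400 Hz centre frequency, the 8 kHz sample rate, the 480-sample buffer and the 0.1 threshold are illustrative values. Only the 63 Hz width is taken from the evaluation in section 3.5.

```java
/** Hypothetical stand-in for the Filter interface; the real signature is not given in the thesis. */
interface ToneFilter {
    double[] apply(double[] samples);
}

/** Second-order IIR band-pass with unity gain at the centre frequency. */
final class BiquadBandPass implements ToneFilter {
    private final double b0, b2, a1, a2; // b1 is zero for this filter type

    BiquadBandPass(double centerHz, double bandwidthHz, double sampleRateHz) {
        double w0 = 2.0 * Math.PI * centerHz / sampleRateHz;
        double q = centerHz / bandwidthHz;
        double alpha = Math.sin(w0) / (2.0 * q);
        double a0 = 1.0 + alpha;            // used to normalise all coefficients
        b0 = alpha / a0;
        b2 = -alpha / a0;
        a1 = -2.0 * Math.cos(w0) / a0;
        a2 = (1.0 - alpha) / a0;
    }

    /** Filters one buffer; the filter state is reset for every buffer to keep the sketch simple. */
    public double[] apply(double[] x) {
        double[] y = new double[x.length];
        double x1 = 0, x2 = 0, y1 = 0, y2 = 0;
        for (int n = 0; n < x.length; n++) {
            y[n] = b0 * x[n] + b2 * x2 - a1 * y1 - a2 * y2;
            x2 = x1; x1 = x[n];
            y2 = y1; y1 = y[n];
        }
        return y;
    }
}

final class ThresholdDetector {

    /** True if the mean power of the filtered buffer exceeds the threshold. */
    static boolean containsTone(double[] buffer, ToneFilter filter, double threshold) {
        double energy = 0;
        for (double v : filter.apply(buffer)) {
            energy += v * v;
        }
        return energy / buffer.length > threshold;
    }

    /** Generates n samples of a unit-amplitude sine wave. */
    static double[] sine(double hz, double sampleRateHz, int n) {
        double[] s = new double[n];
        for (int i = 0; i < n; i++) {
            s[i] = Math.sin(2.0 * Math.PI * hz * i / sampleRateHz);
        }
        return s;
    }

    public static void main(String[] args) {
        ToneFilter filter = new BiquadBandPass(1400.0, 63.0, 8000.0); // assumed example values
        System.out.println(containsTone(sine(1400.0, 8000.0, 480), filter, 0.1));
        System.out.println(containsTone(sine(400.0, 8000.0, 480), filter, 0.1));
    }
}
```

A detector that scans several frequencies would hold one such filter per frequency and report a detection as soon as any of them exceeds its threshold.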

3.3.2.1 Detector usage

To use the detector, the buffers in JMF's data source must be extracted. This can be done by using the BufferRenderer class and adding the Detector as a buffer listener. For the detector to work with the Demixer, two channels must be created, "Voice" and "V18". The Demixer is then driven by the Detector so that all detected modem signals are directed to the "V18" channel and all other sounds end up in the "Voice" channel. To create a channel, the channel interface must be implemented and the new channel added to the Demixer with the addChannel function.
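The wiring of the two channels can be sketched as follows. None of the signatures below are taken from the project: OutputChannel, DemixerSketch, route and CollectingChannel are hypothetical, and the form of addChannel (a name plus a channel object) is an assumption. The JMF side is left out, so the detection result is passed in directly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical channel interface: receives the audio routed to one output. */
interface OutputChannel {
    void write(byte[] audio);
}

/** Hypothetical Demixer: routes each buffer to the "V18" or the "Voice" channel. */
final class DemixerSketch {
    private final Map<String, OutputChannel> channels = new HashMap<>();

    void addChannel(String name, OutputChannel channel) {
        channels.put(name, channel);
    }

    /** modemDetected is the result reported by the detector for this buffer. */
    void route(byte[] audio, boolean modemDetected) {
        OutputChannel target = channels.get(modemDetected ? "V18" : "Voice");
        if (target != null) {
            target.write(audio);
        }
    }
}

/** Simple channel that stores what it receives, used here instead of a real sink. */
final class CollectingChannel implements OutputChannel {
    final List<byte[]> received = new ArrayList<>();

    public void write(byte[] audio) {
        received.add(audio.clone());
    }
}

final class DetectorUsageDemo {
    public static void main(String[] args) {
        DemixerSketch demixer = new DemixerSketch();
        CollectingChannel voice = new CollectingChannel();
        CollectingChannel v18 = new CollectingChannel();
        demixer.addChannel("Voice", voice);
        demixer.addChannel("V18", v18);

        demixer.route(new byte[] {1, 2, 3}, true);   // detected modem signal
        demixer.route(new byte[] {4, 5, 6}, false);  // speech
        System.out.println("V18 buffers: " + v18.received.size()
                + ", Voice buffers: " + voice.received.size());
    }
}
```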

3.3.3 Step 3: The gateway

Only parts of this step were finished. The translation between modem signals and text is not finished and ended up being a separate master's thesis project. Initially the idea was to use an external modem as the translating part: the gateway would call itself through the PSTN and then send the translated characters to the other client. This was not possible, since the hardware needed to have two separate PSTN calls converted to SIP was not provided.

3.4 Testing

The functionality of the detection was tested in two ways. The first test made sure that the voice channel contained only voice and no modem signals. This was done by listening to the audio output from the Demixer. The second test checked whether the detection module managed to detect the modem signals. This was done by sending different signals to the detector and examining the debug output to see whether the detector had detected the signal. The tests were made with the help of "The ultimate HI-FI test pack CD2", a text telephone and an ordinary phone. The phone and the text telephone were used to mimic real use of the detector: if the signals had been fed to the detector directly, they would not have been compressed, packetized with RTP, depacketized and finally decompressed. The time it took to process the audio buffers was also measured. All tests were made both with IIR filters and with FIR filters. The signals on the CD were repeated for 30 seconds; depending on how long the sounds were, this meant that each signal was sent 5 to 10 times during a test. All tests were performed 3 separate times.

Tests to fool the detector with voice produced favorable results. When talking, it was not possible to fool the detector; no false detections of modem signals were made. When singing or whistling, it was possible to fool the detector for a short while, but only for one or two buffers, resulting in gaps of 60 to 120 milliseconds that were difficult to notice by ear. Screaming loudly produced the same results. This should not have any impact on the gateway's performance or on conversations. The easiest way to fool the detector was to blow hard into the telephone. In this way it was possible to fool the detector for as long as a steady blow was sustained. Depending on how the V.18 translation is done, this can result either in a missed character or in the transmission simply being ignored as noise.

When testing the detector with a frequency sweep, the detector still reported some detections after the sweep had passed a desired frequency. This was expected and desired. When pink noise was sent to the detector, no false signals were detected. With white noise, the volume of the noise determined whether a false detection was made. If noise was added to speech, some errors were made in the detection; the more noise, the more errors. Several known frequencies, all of them desired, were also sent to the detector and were detected correctly. This was done with the help of "The ultimate HI-FI test pack CD2". Some frequencies close to the desired frequencies were also sent to the detector, with the result that frequencies roughly within 50 Hz of the wanted frequencies were detected.

When testing with real V18 signals, no signals were missed. When noise was added, the detector still detected the desired signals. The tested signals included DTMF tones, with Baudot [6] as the most tested, since this scheme does not need to set up a connection. With V21 and V18, only the carrier signal was tested for detection, since this signal is always present during a session.

Different delays in the detector were also measured. Here the worst cases were tested for both the IIR filter and the FIR filter. It was clearly shown that the IIR filter is much faster than the FIR filter when testing all frequencies (in this project 25 frequencies). The IIR filter took 3 milliseconds and the FIR filter took 18 milliseconds when all frequencies were tested for a signal. Since the buffer size is 60 milliseconds, the voice channel was delayed by 60 + 3 = 63 milliseconds when IIR filters were used and by 60 + 18 = 78 milliseconds when FIR filters were used. The delay in the V18 channel is only 3 milliseconds and 18 milliseconds respectively, since no extra buffering is performed in that case.

3.5 Evaluation

The tests show that the detector can be fooled quite simply. The voice channel is nevertheless free from disturbing sounds, and a normal conversation is not interrupted by missing fragments. Some frequencies that are not part of any text telephone variant but are similar to existing signals are indicated as modem signals. This is expected and is caused by the fact that the filter has a width of 63 Hz and therefore interprets these signals as the adjacent modem signals.

4 Project specific problems

JMF's documentation is in some cases unclear, and it takes some time to understand how JMF was intended to be used. Missing error messages sometimes also make it difficult to debug. Sometimes classes are not documented, or are documented but not implemented. For example, it was impossible to change the size of the sound buffers: the JMF documentation describes a function that is said to change the buffer size, but the call is ignored. The buffer always has a length of 480, which corresponds to 60 milliseconds.

A second ATA box would also have been desirable. It would have been used to translate PSTN calls to SIP [4] calls. The lack of a second ATA box made it impossible to translate the modem signals to text.

There were some initial problems with the communication to the external modem when translating Swedish characters.

5 Results

• There is now a design that shows how a gateway can be built.
• There is now a prototype gateway implemented that detects and separates modem signals from a SIP audio stream.
• The gateway has a demo version that connects a PSTN network user with a SIP client user.

6 Conclusions

The purpose of this project has been fulfilled, since there is now a prototype gateway that connects two SIP [4] hosts and detects whether the incoming sound contains any modem signals. It also separates sound buffers that contain modem signals from those that do not. There are some basic algorithms to minimize the interruptions in the communication.

7 Further work

The following items can be part of further projects.

• Add an admin interface. This was already done while this document was being written.
• Add a V.18 translation module. This is under construction in another master's thesis project.
• Improve the detection filters and the detection algorithms. They are quite rough at the moment.
• Optimize the detection. For example, at the moment all frequencies are checked all the time. This is not necessary when the modem standard in use is known.

I References

1. Hellström, G. (2000). RTP Payload for Text Conversation. RFC 2793.
2. Hellström, G. (1998). ITU-T Recommendation T.140 - Text Conversation Protocol for Multimedia Application. International Telecommunication Union. COM-16-48E corr.
3. Schulzrinne, H; Casner, S; Frederick; Jacobson, V. (1996). RTP: A Transport Protocol for Real-Time Applications. RFC 1889.
4. Sparks, R. (2003). The Session Initiation Protocol (SIP) Refer Method. RFC 3515.
5. ITU-T Recommendation V.18 - Operational and interworking requirements for DCEs operating in the text telephone mode. (2000). ITU E 20233.
6. TIA/EIA 825a. A Frequency Shift Keyed Modem for use on the Public Switched Telephone Network.
7. Handley, M; Jacobson, V. (1998). SDP: Session Description Protocol. RFC 2327.
8. C. Groves; M. Pantaleo; LM Ericsson; T. Anderso (2003). Gateway Control Protocol Version 1. RFC 3525.
9. Marina del Rey (1981). RFC 791.
10. ITU-T Recommendation T.81, Information technology - Digital compression and coding of continuous-tone still images - Requirements and guidelines. (1992). ISO/IEC IS 10918-1.
11. T. Turletti; C. Huitema (1996). RTP Payload Format for H.261 Video Streams. RFC 2032.
12. C. Zhu (1997). RTP Payload Format for H.263 Video Streams. RFC 2190.

Electronic documents

E1. Code Conventions for the Java Programming Language. http://java.sun.com/docs/codeconv/html/CodeConvTOC.doc.html
E2. Sun Microsystems. (1999). Java Media Framework API Guide. http://java.sun.com/products/java-media/jmf/2.1.1/guide/
E3. Hinton, O. Introduction to Digital Filters. http://www.staff.ncl.ac.uk/oliver.hinton/eee305/Chapter3.pdf
E4. Hinton, O. Design of FIR Filters. http://www.staff.ncl.ac.uk/oliver.hinton/eee305/Chapter4.pdf
E5. Hinton, O. Design of IIR Filters. http://www.staff.ncl.ac.uk/oliver.hinton/eee305/Chapter5.pdf
E6. The Prescription for Transmission. http://users.frii.com/jarvi/rxtx/
E7. 3GPP. http://www.3gpp.org
E8. ITU. http://www.itu.int

II Abbreviations

API: Application Programming Interface
AT commands: A set of commands used to control a modem
DSP: Digital Signal Processing
FIR: Finite Impulse Response
IP: Internet Protocol
IIR: Infinite Impulse Response
ITU: International Telecommunication Union
JMF: Java Media Framework
VoIP: Voice over IP
RTP: Real-time Transport Protocol
RXTX: Program library for Java used when accessing a COM or serial port
SDP: Session Description Protocol
SIP: Session Initiation Protocol
PSTN: Public Switched Telephone Network
MEGACO: Media Gateway Control
MMX: Multimedia Extensions
