A PROCESSOR FOR PULSED ULTRA-WIDEBAND SIGNALS

Raul Blazquez, Puneet P. Newaskar, Fred S. Lee, Anantha P. Chandrakasan

Microsystems Technology Laboratory Massachusetts Institute of Technology Cambridge, MA 02139, USA Emails: rbf,fslee,[email protected] and [email protected]

Abstract Signal from Samples RF Front-end This paper presents a baseband processor for pulsed ultra- Coarse ADC 4x 40x Acquisition Subsystem wideband signals. It consists of an analog to digital converter Samples Four Phases (ADC), a clock generation system and a digital back-end. 30MHz Clock

300MHz Clock Retiming Block Fine Correlators Parallelization x10 Tracking The FLASH interleaved ADC provides four bit samples Clock Control Subsystem Generation Control at 1.2GSPS. The back-end uses parallelization to process 300MHz Timing Subsystem Clock Corrections Fast clocking Slow clocking these samples and to reduce the signal acquisition time to domain domain 70µs. The baseband processor was implemented in the same 0.18µm CMOS chip as a part of a complete transceiver. A complete 193kbps wireless link is demonstrated. Fig. 1. Baseband processor block diagram. Keywords: Ultra-wideband (UWB), wireless transceiver, FLASH ADC, system-on-a-chip and correlation receiver Introduction of magnitude as the acquisition time required for lower bandwidth systems, such as IEEE 802.11. Ultra-wideband (UWB) radio communication, though a widely used technology in the mm-wave radar community, is System Trade-offs recently being re-visited by the integrated circuits community as a viable high-speed, last-meter wireless link technology. An important consideration in the receiver architecture With the recent FCC ruling on UWB emissions specifications is determining the analog/digital partition. In this paper, a [1] and the current IEEE 802.15.3a standards committee digital architecture that also implements all the synchroniza- activity for high-speed wireless local area networks, UWB tion in the digital domain is chosen. The advantages are the transceivers have become an active area of research. simplification of the analog elements in the transceiver, its scalability, and the possibility of exploring digital channel UWB signals adaptability and recovery. Performing the synchronization in Contrary to traditional radio signals where data is embed- the digital domain eliminates the need to feed a signal from ded in the modulation of a sinusoidal carrier, the digital domain to the clock generation subsystem. UWB signals occupy larger bandwidths on the order of A digital architecture depends on the feasibility of the hundreds of MHz. There are two competing UWB signaling analog to digital converter (ADC) required to digitize the schemes for the IEEE 802.15.3a standard: pulsed UWB signal. To allow for an all-digital timing recovery, the ADC and multiband OFDM signals. Pulsed UWB represents a must sample at 1.2GSPS, oversampling at twice the Nyquist departure from traditional frequency-based data-encoding rate. A FLASH ADC architecture is well suited for such methods. Multiband OFDM signals are an extension of the a high sampling rate. Since the power consumption in signal used in the IEEE 802.11a standard, with a larger FLASH ADCs scales exponentially with the number of bits bandwidth. This work focuses on a receiver for pulsed of resolution, minimizing ADC resolution is critical to reduce UWB signals, using 0-300MHz baseband pulses. Each bit the power consumed in the receiver. It is found that four of information is represented with a sequence of 31 pulses bits of resolution are sufficient to be closer than 1dB to the with a width equal to Tp=3.3ns. Since every two consecutive infinite resolution ADC curve for bit error rate [2]. pulses are separated by 50·Tp, the duration of a bit is The baseband UWB receiver uses a RF front-end that Dbit=1550·Tp. The information is encoded on the sign of amplifies the signal but does not down-convert its frequency the pulses, that also depends on the corresponding bit of a [3]. After it, the baseband processor shown in Figure 1 Gold code sequence of length 31. demodulates the signal. The following sections describe the The receiver is specified to achieve signal acquisition in blocks of the baseband processor: clock generation subsys- an average time of 70µs. This time is on the same order tem, the ADC, and the digital back-end. Φ Pipelined Decoder x4 x4 x4 x4 1 T15 Preamp Comp 1 Comp 2 T15 x4 x4 x4 x4 Φ Clock Buffers 2 Vinp B0 PCC 15 T/H B1 B2 Φ Vinm Xtal x4 x4 x4 x4 3 B3 T0 Charge Preamp Comp 1 Comp 2 Pump x4 x4 x4 x4 Φ T0 Phase 4 Freq PCC 0 Detect Charge Bias + - + - + - + - Pump Gen -+ -+ -+ -+ Fig. 3. A Single FLASH ADC channel. Four way interleaving is ECL Delay Cells used.

Div by 2 Div by 2 Div by 2 Div by 2 Div by 2 Div by 2

CLK Voutp

Fig. 2. Clock generation and distribution subsystem. Voutp Voutm Voutm

Vinp Vinm Vinp Vinm Voutp Voutp Vrefp Vrefm

CLKi CLK V V Vbias inp inm CLK Clock Generation Vbias Since the system implements a fully digital synchroniza- Preamplifier Comparator 1 Comparator 2 tion algorithm, the only input to the clocking system is the reference crystal clock, and its sole function is to track it. Fig. 4. Preamplifier and comparators. This is done with a phase-locked loop (PLL) based upon a design described in [4]. Figure 2 shows the block diagram of the PLL. The jitter requirements are mostly constrained by the digital back-end of the receiver. Given that the probability is comprised of only a switch and sampling capacitor due of losing synchronization during a 1024-bits data packet to the low resolution required by the system. The ADC with a clock jitter of 100ps is smaller than 0.01, and the core is fully-differential for improved noise immunity and degradation in the SNR introduced in the ADC by the same supports 500mV peak input swings. These differential blocks jitter is smaller than 1dB, a ring-oscillator-based VCO can be are shown in Figure 4. The preamplifier reduces kickback used to generate the 300MHz clocks required for the ADC. noise to the input and reference nodes and provides a very The ring oscillator consists of four inverter buffers connected small amplitude gain (only 6dB). No further gain is needed in a positive feedback loop, producing the four 90 degree here as most of the gain will be provided more efficiently phase-shifted clocks that drive the time-interleaved FLASH by positive feedback in the two comparators. Comparator 1 ADC. The inverter buffers are emitter-coupled-logic (ECL) is a track and latch comparator. Its speed advantage is due with symmetric loads. The loop filter capacitor is 400pF, to its reduced swing. Comparator 2 is a StrongARM latch which can be integrated on-chip. [5]. The use of two latching comparators enhances the meta- A second charge pump is used to provide a feed-forward stability resolution of the ADC. The pipelined decoder uses zero in the type III loop filter. The bias generator is a replica an intermediate Gray code to reduce the impact of bubble of the ECL delay cells that is self-biased, and generates errors in the performance of the ADC. the control voltages for the ECL loads and tail current so The use of a FLASH interleaved ADC has two main that they are biased at the maximum gain and swing points advantages. First, overall performance is determined by the for the given current density and frequency of oscillation. average of all four channels rather than being limited by the The common circuit structures of the charge pump, bias worst case. This occurs because the digital back-end adds generator, and delay cells lends this PLL to be process- groups of four consecutive samples, coming from the four independent and low-jitter. different channels, and treats the result as a single sample. This reduces the need for calibration across the channels as Analog to Digital Converter required in most time-interleaved ADCs. The 4-bit ADC in the UWB receiver is comprised of The second advantage is that data is supplied to the back- four FLASH time interleaved channels running on 300MHz end at a reduced data rate. The outputs of the different phase-offset clocks supplied by the PLL. A sampling rate interleaved channels of the ADC are presented in parallel to of fs=1.2GSPS is achieved. Figure 3 shows the block di- the digital back-end. The samples from the four channels are agram of a single FLASH channel. Each channel contains aligned to the same 300MHz clock edge instead of creating a track-and-hold (T/H) circuit, followed by a bank of 15 a sample data rate clocked at 1.2GHz. The clock frequency preamplifiers, comparators and a pipelined decoder. The T/H of the digital back-end is then reduced to 30MHz. Digital Back-end Correlation 1 The digital back-end recovers the information contained Sample 1 Shift Registers Sample 2 R1 Sample 3 R2 R3 R4 R5 in the data packets from the samples given by the ADC Sample 4 using an approximation to the matched filter [6]. A matched filter is the optimum demodulator of a signal in AWGN, and Correlator 1 implies the correlation of the received signal with a local Correlation 2 template synchronized to it and comprising a sequence of Sample 5 Shift Registers perfect replicas of the received pulses. This receiver uses a Sample 6 R1 Sample 7 R2 R3 R4 R5 sequence of rectangular pulses instead of the perfect replicas. Sample 8 The correlation of the incoming signal with a rectangular pulse of width Tp is equivalent to adding together four Correlator 2 consecutive samples taken at 1.2GSPS. This simplification of the architecture comes at a cost of 1.7dB of SNR. The synchronization of the local templates with the incom- Correlation 10 ing signal is achieved in two stages: coarse acquisition and Sample 37 Shift Registers Sample 38 R1 R2 R3 R4 R5 fine tracking. To reduce the coarse acquisition time below Sample 39 70µs, it is necessary to calculate at least 50 correlations Sample 40 in parallel during this stage [7]. For fine tracking, only Correlator 10 two correlations using contiguous templates are needed in Code Reset 30MHz order to implement the delay estimator needed for a delay Gold Code Clock locked loop (DLL) and to demodulate the data bits. Since Generator timing recovery is performed in the digital domain, the delay corrections are obtained by advancing or delaying an integer Fig. 5. Implementation of one of the correlator channels. number of samples. A. Architecture The digital back-end is divided into a fast and a slow coefficients is one single bit. Each correlator performs five clocking domain, as shown in Figure 1. The fast domain uses correlations at the same time, equivalent to the output of an D ·f a custom layout, a 300MHz clock coming from the PLL, and FIR of bit s=6200 coefficients with values equal to 1, -1 or is composed of a retiming block, a block that performs a 10x 0. The outputs from the ten correlators are used by the coarse parallelization of the incoming ADC samples, and the main acquisition subsystem, but only the first two are active during control of the digital back-end. The parallelization allows the fine tracking. The Gold code generator is implemented with slow domain to work with a 30MHz clock. The correlations two shift registers, each of them generating a linear recursive and other mathematical operations needed to implement the sequence of which both the coefficients of the generating synchronization and demodulation are performed using 2’s- polynomial and the seed values are programmable. complement arithmetic in the slow domain, and their design The fine tracking subsystem shown in Figure 6 provides is carried out using synthesis tools. the functionality required to close the DLL. The division The retiming block provides the one-sample delay granu- needed in the delay estimation is avoided by multiplying by larity required for fine tracking. The groups of four samples an approximation to the inverse stored in a ROM. The ROM that are the inputs to the correlators may start with any stores 32 seven-bit numbers and the five more significant arbitrary sample and may include samples belonging to two bits of the numerator are used to choose the output. The different ADC vectors. These groups are obtained by selec- coefficients of the filter are programmable, and it uses Baugh- tively delaying the outputs of one or more ADC interleaved Wooley multipliers. The delay decoder transforms the output channels in the retiming block. of the filter into signals relevant to the fast clocking domain: After the parallelization, the outputs of the fast clocking the new state of the retiming block and indication of the need domain are processed by 10 correlators as shown in Figure to start correlations a 300MHz clock cycle before (signal 5. In each 30MHz clock cycle, the four samples at its Advance) or later (signal Delay). The fine tracking subsystem input are added together, implementing the correlation with also provides a flag to restart coarse acquisition when the a rectangular pulse of width Tp. The result of this addition signal is lost. All the outputs of the fine tracking system are is either added or subtracted, depending on the value of the ready in six 30MHz cycles. Gold Code, to the 11-bit value stored in the shift registers As the 50 correlations are completed, they are read into five cycles before (50·Tp, equal to the time between two con- the coarse acquisition subsystem, whose block diagram is secutive pulses). All multipliers in Figure 5 are implemented shown in Figure 7. The memory block provides not only the using 2-to-1 multiplexers because in each of them one of the value of the maximum but also the values in the two adjacent Delay Estimator Loop Filter Advance Correlator 1 Delay Coder Delay b0 ROM Control of Unit Correlator 2 (inverses) Retiming Delay Block a1 b1 Unit Signals Delay to Fast a2 b2 Clocking Domain Absolute If not, signal lost Value >= Threshold 2

Fig. 6. Fine tracking subsystem block diagram.

Advance Simplified Fig. 8. Single chip UWB transceiver photograph. Delay Fine Tracking Retiming Correlation 1 Subsystem Position + 1 Advance AUXILIARY Control Delay TABLE I Position >= MEMORY Retiming Correlation 10 Position - 1 CHIP MEASUREMENTS Advance To Fast Simplified Delay Clocking Fine Tracking Domain Write Read Subsystem Retiming

2-to-1 MULTIPLEXERS Chip specifications MAXIMUM Position of the maximum Process Technology 0.18µm non-epi Die Size 4.3mm × 2.9mm THRESHOLD Signal detected To Fast Clocking Domain 193kbps ADC Performance Abs. Accuracy Rel. Accuracy Fig. 7. Coarse acquisition block diagram. Static Performance 3.1bits 3.9bits Dynamic Performance 3.0bits 3.4bits Power Consumption ADC + PLL 135mW positions. The two simplified fine tracking subsystems lack CLK Buffers 65mW the loop filter shown in Figure 6 except for its direct path Back-End 75mW (b0). Only the simplified fine tracking subsystem using the Total 275mW two positions with the most energy will be used to initialize the DLL. Detection of the signal is given in six 30MHz cycles, and the rest of the outputs are ready in seven cycles Acknowledgments more. This research is sponsored by Hewlett-Packard under the All thresholds, coefficients and other parameters used in HP/MIT Alliance and the NSF under contract ANI-0335256. the digital back-end must be configured before its use using Any view expressed in this paper are those of the authors a serial port. and do not necessarily reflect the views of the NSF. Performance results References Figure 8 shows a photograph of the 0.18µm ASIC. The PLL was verified at 300MHz and can provide much higher [1] Federal Communications Commission, Ultra-Wideband (UWB) First Report and Order, February 2002. clock frequencies (up to 2GHz). The ADC is verified using [2] Newaskar, P., Blazquez, R. and Chandrakasan, A., A/D Preci- the testing method presented in [8]. Its effective number of sion Requirements for an Ultra-wideband Radio Receiver,in bits is greater than 3. The digital back-end is completely Proc. of the 2002 IEEE SIPS, San Diego, CA, pp. 270-275. functional at a clock frequency of 300MHz. The frequency [3] Lee, F., Wentzloff, D. and Chandrakasan, A., An Ultra- range for the coarse acquisition algorithm between a pair of Wideband Baseband Front-End, in Proc. of the 2004 IEEE ± % Radio Frequency Integrated Circuits Symp., Fort Worth, TX. transceivers is shown to be 3 . At 300MHz a wireless link [4] Maneatis, J., Low-Jitter Process-Independent DLL and PLL was demonstrated at a data rate of 193kbps. Table I contains Based on Self-Biased Techniques, in IEEE J. Solid-State Cir- a summary of overall chip measurements. cuits, vol. 31, no. 11, pp. 1722-1732, Nov. 1996. [5] Chandrakasan, A., Bowhill, W. and Fox, F., editor, Design of Conclusions High Performance Microprocessor Circuits, IEEE Press, 2001. The baseband processor for a UWB system-on-a-chip [6] Proakis, J.G., Digital Communications, McGraw Hill Inc, 1983. [7] Blazquez, R., Newaskar, P. and Chandrakasan, A., Coarse transceiver with a mainly digital architecture was presented Acquisition for Ultra Wideband Digital Receivers, in Proc. the in this paper. The chip was implemented in 0.18µm CMOS. 2003 IEEE ICASSP, vol. 4, pp. 137-140, Hong Kong. The total power consumed by the baseband processor was [8] Dornberg, J., Lee, H.S., Hodges, D.A., Full-Speed Testing of 275mW. A wireless link with a data rate of 193kbps was A/D Converters, in IEEE J. Solid-State Circuits, vol. 19, no. 6, demonstrated with this system. pp. 22-26, Dec. 1984.