
FORMAT INDEPENDENCE PROVISION OF AUDIO AND VIDEO DATA IN MULTIMEDIA DATABASE MANAGEMENT SYSTEMS

Der Technischen Fakultät der Universität Erlangen-Nürnberg

zur Erlangung des Grades

D O K T O R – I N G E N I E U R

vorgelegt von

Maciej Suchomski

Erlangen – 2008

Als Dissertation genehmigt von der Technischen Fakultät der Universität Erlangen-Nürnberg

Tag der Einreichung: 13.05.2008
Tag der Promotion: 31.10.2008

Dekan: Prof. Dr.-Ing. habil. Johannes Huber

Berichterstatter: Prof. Dr.-Ing. Klaus Meyer-Wegener, Vizepräsident der FAU
Prof. Dr. Andreas Henrich

BEREITSTELLUNG DER FORMATUNABHÄNGIGKEIT VON AUDIO- UND VIDEODATEN IN MULTIMEDIALEN DATENBANKVERWALTUNGSSYSTEMEN

Der Technischen Fakultät der Universität Erlangen-Nürnberg

zur Erlangung des Grades

D O K T O R – I N G E N I E U R

vorgelegt von

Maciej Suchomski

Erlangen – 2008

Als Dissertation genehmigt von der Technischen Fakultät der Universität Erlangen-Nürnberg

Tag der Einreichung: 13.05.2008
Tag der Promotion: 31.10.2008

Dekan: Prof. Dr.-Ing. habil. Johannes Huber

Berichterstatter: Prof. Dr.-Ing. Klaus Meyer-Wegener, Vizepräsident der FAU
Prof. Dr. Andreas Henrich

To My Beloved Parents
Dla Moich Kochanych Rodziców


ABSTRACT

Since the late 1990s there has been a noticeable revolution in the consumption of multimedia data, analogous to the electronic data processing revolution of the 1980s and 1990s. The multimedia revolution covers different aspects such as multimedia production, storage, and delivery, but it also triggers completely new solutions on the consumer market such as multifunction handheld devices, digital and Internet TV, or home cinemas. However, it brings new problems as well. The variety of multimedia formats is on the one hand an advantage, but on the other hand one of the problems, since every consumer has to understand the data in a specific format in order to consume them.

On the other hand, database management systems have always been responsible for providing data to consumers and applications regardless of format and storage characteristics. In the case of multimedia data, however, MMDBMSs have failed to provide data independence due to the complexity of the “translation process”, especially when continuous data such as audio and video are considered. There are many reasons for this situation: the time characteristics of continuous data (processing must satisfy not only functional correctness but also time correctness), the complexity of the conversion algorithms (especially compression), and the demand for processing resources that varies over time (due to its dependence on the content) and thus requires sophisticated resource-allocation algorithms.

This work proposes the conceptual model of the real-time audio-video conversion (RETAVIC) architecture in order to diminish the existing problems in the multimedia format translation process and thus to allow the provision of format independence for audio and video data. The data processing within the RETAVIC architecture has been divided into four phases: capturing, analysis, storage and delivery. The key assumption is meta-data-based real-time transcoding in the delivery phase, where quality-adaptive decoding and encoding employing the Hard-Real-Time Adaptive model takes place. In addition, the Layered Lossless Video format (LLV1) has been implemented within this project, and an analysis of format-independence approaches and of their support in current multimedia management systems has been conducted. The prototypical real-time implementation of the critical parts of the transcoding chain for video provides the functional, quantitative and qualitative evaluation.


KURZFASSUNG

Seit den späten 1990er Jahren gibt es eine wahrnehmbare Revolution im Konsumverhalten von Multimediadaten, welche analog der Revolution der elektronischen Datenverarbeitung in 1980er und 1990er Jahren ist. Diese Multimediarevolution umfasst verschiedene Aspekte wie Multimediaproduktion, -speicherung und -verteilung, sie bedingt außerdem vollständig neue Lösungen auf dem Absatzmarkt für Konsumgüter wie mobile Endgeräte, digitales und Internet-Fernsehen oder Heimkinosysteme. Sie ist jedoch ebenfalls Auslöser bis dato unbekannter Probleme. Die Multimediaformatvielzahl ist einerseits ein Vorteil, auf der anderen Seite aber eines dieser Probleme, da jeder Verbraucher die Daten in einem spezifischen Format „verstehen“ muss, um sie konsumieren zu können.

Andererseits sind die Datenbankverwaltungssysteme aber auch dafür verantwortlich, dass die Daten unabhängig von Format- und Speichereigenschaften für die Verbraucher und für die Anwendungen zur Verfügung stehen. Im Falle der Multimediadaten jedoch haben die MMDBVSe die Datenunabhängigkeit wegen der Komplexität „im Übersetzungsprozess“ nicht zur Verfügung stellen können, insbesondere wenn es sich um kontinuierliche Datenströme wie Audiodaten und Videodaten handelt. Es gibt viele Gründe für solche Phänomene: die Zeiteigenschaften der kontinuierlichen Daten (die Verarbeitung entsprechend der Funktionskorrektheit aber auch entsprechend der Zeitkorrektheit), die Komplexität der Umwandlungsalgorithmen (insbesondere jene der Kompression) und die Anforderungen an die Verarbeitungsressourcen, die in der Zeit schwanken (wegen der Inhaltsabhängigkeit) und die daher anspruchsvolle Ressourcenzuweisungsalgorithmen erforderlich machen.

Die vorliegende Arbeit konzentriert sich auf einen Vorschlag des Begriffsmodells der Echtzeitumwandlungsarchitektur der Audio- und Videodaten (RETAVIC), um vorhandene Probleme im Multimediaformat-Übersetzungsprozess zu mindern und somit die Bereitstellung der Formatunabhängigkeit von Audio- und Videodaten zu erlauben. Die Datenverarbeitung innerhalb der RETAVIC-Architektur ist in vier Phasen unterteilt worden: Erfassung, Analyse, Speicherung und Anlieferung. Die Haupthypothese ist die metadaten-bezogene Echtzeittranskodierung in der Anlieferungsphase, in der die qualitätsanpassungsfähige Decodierung und Enkodierung mit dem Einsatz des „Hard-Real-Time Adaptive (Hart-Echtzeit-Adaptiv)-Modells“ auftritt. Außerdem ist das „Layered Lossless Video Format“ (Geschichtetes Verlustfreies Videoformat) innerhalb dieses Projektes implementiert worden; eine Analyse der Formatunabhängigkeitsansätze sowie der -unterstützung in den gegenwärtigen Multimedia-Managementsystemen wurde durchgeführt. Die prototypische Echtzeit-Implementierung der kritischen Teile der Transkodierungskette für Video ermöglicht die funktionale, quantitative und qualitative Auswertung.


ACKNOWLEDGEMENTS

First and foremost, I would like to thank my supervisor, Prof. Klaus Meyer-Wegener. It was a great pleasure to work under his supervision. He was always able to find time for me and to conduct stimulating discussions. His advice and suggestions at the crossroads allowed me to choose the correct path and to bring my research forward, keeping me on track right to the end of the road. His great patience, tolerance, and understanding helped in conducting the research and testing new ideas. His great wisdom and active support are beyond doubt. Prof. Meyer-Wegener spent not only days but also nights co-authoring the papers published during the research for this work. Without him, beginning and completing this thesis would never have been possible.

Next, I would like to express my gratitude to Prof. Hartmut Wedekind for his great advice, for the spiritual experiences shared during our stay in the Sudety Mountains, and for accepting the position of chairman during the viva voce examination. I also want to thank Prof. Andreas Henrich for many fruitful discussions during the workshops of the MMIS Group. Moreover, I am very happy that Prof. Henrich committed himself to being the reviewer of my dissertation, and I will never forget these efforts. Subsequently, I give my great thanks to Prof. André Kaup for his participation in the PhD examination procedure.

The expression of enjoyment derived from the cooperative and personal work, from the meetings “in the kitchen”, and from the funny everyday situations goes to all my colleagues at the chair. In particular, I would like to thank a few of them. First, I want to give my thanks to our ladies: to Mrs. Knechtel for her organizational help and warm welcome at the university, and to Mrs. Braun and Mrs. Stoyan for keeping the hardware infrastructure up and running, allowing me to work without additional worries. My appreciation is directed to Prof. Jablonski for much scholarly advice and for the smooth and unproblematic collaboration during the preparation of the exercises. I give my great thanks to Dr. Marcus Meyerhöfer – I have really enjoyed the time with you, not only in the shared office but also outside the university, and none of the time spent together will be forgotten. Finally, I would like to thank my other colleagues: Michael Daum, Dr. Ilia Petrov, Dr. Rainer Lay, Dr. Sascha Müller, Dr. Udo Mayer, Florian Irmert and Robert Nagy. They willingly spent time with me outside the office as well and brought me closer not only to the German culture but also to the nightlife fun.


I am also grateful to all the students I supervised, who did their study projects and master theses in support of the RETAVIC project. Their contributions, including among other things discussions on architectural issues, writing code, benchmarking and evaluation, allowed the concepts to be refined and the doubts to be clarified. Especially the best-effort converter prototypes and their subsequent real-time implementations made proving the assumed hypotheses possible – great thanks to my developers and active discussion partners.

I must express my gratitude to my dear brother, Paweł. We are two people but one heart, blood, soul and mind. His helpfulness and kindness in supporting my requests will never be forgotten. He helped me a lot with the organizational aspects of everyday life and was the reliable representative of my person in many important matters in Wroclaw, Poland, during my stay in Erlangen. I am not able to say how much I owe to him.

I would like to thank my beloved wife, Magda, for her spiritual support and love, for the big and small things day by day, for always being with me in good and bad times, for tolerating my bad moods, and for her patience and understanding. Her cheerfulness and good humour made me very happy even during the times under pressure. I have managed to finish this work thanks to her.

Finally, I am sincerely thankful to my dear parents for their advice, continuous support and love. It is a great honor, and at the same time my good fortune, to have such loving parents and to have the privilege of drawing on their wisdom and experience. To them I dedicate this work entirely.


CONTENTS

ABSTRACT ...... I
KURZFASSUNG ...... III
ACKNOWLEDGEMENTS ...... V
CONTENTS ...... VII
LIST OF FIGURES ...... XIII
LIST OF TABLES ...... XIX
INDEX OF EQUATIONS ...... XXI

CHAPTER 1 – INTRODUCTION...... 1

I. INTRODUCTION ...... 1
I.1. EDP Revolution ...... 2
I.2. Digital Multimedia – New Hype and New Problems ...... 3
I.3. Data Independence ...... 6
I.4. Format Independence ...... 9
I.5. AV Conversion Problems ...... 11
I.6. Assumptions and Limitations ...... 11
I.7. Contribution of the RETAVIC Project ...... 12
I.8. Thesis Outline ...... 13

CHAPTER 2 – RELATED WORK ...... 15

II. FUNDAMENTALS AND FRAMEWORKS ...... 15
II.1. Terms and Definitions ...... 16
II.2. Multimedia data delivery ...... 17
II.3. Approaches to Format Independence ...... 18
II.3.1. Redundancy Approach ...... 19
II.3.2. Adaptation Approach ...... 20
II.3.3. Transcoding Approach ...... 22
II.3.3.1 Cascaded transcoding ...... 24
II.3.3.2 MD-based transcoding ...... 25
II.4. Video and Audio Transformation Frameworks ...... 27
II.4.1. Converters and Converter Graphs ...... 28
II.4.1.1 Well-known implementations ...... 29
II.4.2. End-to-End Adaptation and Transcoding Systems ...... 31
II.4.3. Summary of the Related Transformation Frameworks ...... 34
II.5. Format Independence in Multimedia Management Systems ...... 34
II.5.1. MMDBMS ...... 35
II.5.2. Media (Streaming) Servers ...... 39
III. FORMATS AND RT/OS ISSUES ...... 40
III.1. Storage Formats ...... 40


III.1.1. Video Data ...... 40
III.1.1.1 Scalable codecs ...... 41
III.1.1.2 Lossless codecs ...... 42
III.1.1.3 Lossless and scalable codecs ...... 42
III.1.2. Audio Data ...... 43
III.2. Real-Time Issues in Operating Systems ...... 45
III.2.1. OS Kernel – Execution Modes, Architectures and IPC ...... 46
III.2.2. Real-time Processing Models ...... 47
III.2.2.1 Best-effort, soft- and hard-real-time ...... 47
III.2.2.2 Imprecise computations ...... 48
III.2.3. Scheduling Algorithms and QAS ...... 48

CHAPTER 3 – DESIGN ...... 51

IV. SYSTEM DESIGN ...... 51
IV.1. Architecture Requirements ...... 52
IV.2. General Idea ...... 53
IV.2.1. Real-time Capturing ...... 56
IV.2.1.1 Grabbing techniques ...... 56
IV.2.1.2 Throughput and storage requirements of uncompressed media data ...... 58
IV.2.1.3 Fast and simple lossless encoding ...... 61
IV.2.1.4 Media buffer as temporal storage ...... 64
Different hardware storage solutions and offered throughput ...... 64
Evaluation of storage solutions in context of RETAVIC ...... 68
IV.2.2. Non-real-time Preparation ...... 69
IV.2.2.1 Archiving origin source ...... 70
IV.2.2.2 Conversion to internal format ...... 71
IV.2.2.3 Content analysis ...... 72
IV.2.3. Storage ...... 72
IV.2.3.1 Lossless scalable binary stream ...... 73
IV.2.3.2 Meta data ...... 75
IV.2.4. Real-time Delivery ...... 76
IV.2.4.1 Real-time transcoding ...... 76
Real-time decoding ...... 77
Real-time processing ...... 77
Real-time encoding ...... 78
IV.2.4.2 Direct delivery ...... 80
IV.3. Evaluation of the General Idea ...... 81
IV.3.1. Insert and Update Delay ...... 82
IV.3.2. Architecture Independence of the Internal Storage Format ...... 83
IV.3.2.1 Correlation between modules in different phases related to internal format ...... 84
IV.3.2.2 Procedure of internal format replacement ...... 85
IV.3.2.3 Evaluation of storage format independence ...... 86
V. VIDEO PROCESSING MODEL ...... 87
V.1. Analysis of the Video Representatives ...... 87
V.2. Assumptions for the Processing Model ...... 92
V.3. Video-Related Static MD ...... 94
V.4. LLV1 as Video Internal Format ...... 99
V.4.1. LLV1 Algorithm – Encoding and Decoding ...... 100
V.4.2. Mathematical Refinement of Formulas ...... 102
V.4.3. Compression efficiency ...... 104
V.5. Video Processing Supporting Real-time Execution ...... 105


V.5.1. MD-based Decoding ...... 105
V.5.2. MD-based Encoding ...... 106
V.5.2.1 MPEG-4 as representative ...... 106
V.5.2.2 Continuous MD set for video encoding ...... 108
V.5.2.3 MD-XVID as proof of concept ...... 111
V.6. Evaluation of the Video Processing Model through Best-Effort Prototypes ...... 114
V.6.1. MD Overheads ...... 114
V.6.2. Data Layering and Scalability of the Quality ...... 116
V.6.3. Processing Scalability in the Decoder ...... 122
V.6.4. Influence of Continuous MD on the Data Quality in Encoding ...... 126
V.6.5. Influence of Continuous MD on the Processing Complexity ...... 129
V.6.6. Evaluation of Complete MD-Based Video Transcoding Chain ...... 131
VI. AUDIO PROCESSING MODEL ...... 134
VI.1. Analysis of the Audio Encoders Representatives ...... 134
VI.2. Assumptions for the Processing Model ...... 138
VI.3. Audio-Related Static MD ...... 140
VI.4. MPEG-4 SLS as Audio Internal Format ...... 142
VI.4.1. MPEG-4 SLS Algorithm – Encoding and Decoding ...... 143
VI.4.2. AAC Core ...... 144
VI.4.3. HD-AAC / SLS Extension ...... 145
VI.4.3.1 Integer Modified Discrete Cosine Transform ...... 147
VI.4.3.2 Error mapping ...... 151
VI.5. Audio Processing Supporting Real-time Execution ...... 151
VI.5.1. MD-based Decoding ...... 151
VI.5.2. MD-based Encoding ...... 153
VI.5.2.1 MPEG-4 standard as representative ...... 153
Generalization of Perceptual Audio Coding Algorithms ...... 153
MPEG-1 Layer 3 and MPEG-2/4 AAC ...... 154
VI.5.2.2 Continuous MD set for audio coding ...... 156
VI.6. Evaluation of the Audio Processing Model through Best-Effort Prototypes ...... 158
VI.6.1. MPEG-4 SLS Scalability in Data Quality ...... 158
VI.6.2. MPEG-4 SLS Processing Scalability ...... 159
VII. REAL-TIME PROCESSING MODEL ...... 161
VII.1. Modeling of Continuous Multimedia Transformation ...... 161
VII.1.1. Converter, Conversion Chains and Conversion Graphs ...... 161
VII.1.2. Buffers in the Multimedia Conversion Process ...... 164
Jitter-constrained periodic stream ...... 164
Leading time and buffer size calculations ...... 165
M:N data stream conversion ...... 165
VII.1.3. Data Dependency in the Converter ...... 166
VII.1.4. Data Processing Dependency in the Conversion Graph ...... 167
VII.1.5. Problem with JCPS in Graph Scheduling ...... 168
VII.1.6. Operations on Media Streams ...... 173
Media integration (multiplexing) ...... 173
Media demuxing ...... 173
Media replication ...... 173
VII.1.7. Media data synchronization ...... 173
VII.2. Real-time Issues in Context of Multimedia Processing ...... 175
VII.2.1. Remarks on JCPS – Data Influence on the Converter Description ...... 175
VII.2.2. Hard Real-time Adaptive Model of Media Converters ...... 176
VII.2.3. Dresden Real-time Operating System as RTE for RETAVIC ...... 178


Architecture ...... 178
Scheduling ...... 180
Real-time thread model ...... 180
VII.2.4. DROPS Streaming Interface ...... 182
VII.2.5. Controlling the Multimedia Conversion ...... 184
VII.2.5.1 Generalized control flow in the converter ...... 184
VII.2.5.2 Scheduling of the conversion graph ...... 185
Construct the conversion graph ...... 185
Predict quant data volume ...... 187
Calculate bandwidth ...... 187
Check and allocate the resources ...... 187
VII.2.5.3 Converter’s time prediction through division on function blocks ...... 189
VII.2.5.4 Adaptation in processing ...... 190
VII.2.6. The Component Streaming Interface ...... 190
VII.3. Design of Real-Time Converters ...... 192
VII.3.1. Platform-Specific Factors ...... 193
VII.3.1.1 Hardware architecture influence ...... 193
VII.3.1.2 Compiler effects on the processing time ...... 195
VII.3.1.3 Thread models – priorities, multitasking and caching ...... 197
VII.3.2. Timeslices in HRTA Converter Model ...... 199
VII.3.3. Precise time prediction ...... 200
VII.3.3.1 Frame-based prediction ...... 201
VII.3.3.2 MB-based prediction ...... 204
VII.3.3.3 MV-based prediction ...... 210
VII.3.3.4 The compiler-related time correction ...... 215
VII.3.3.5 Conclusions to precise time prediction ...... 216
VII.3.4. Mapping of MD-LLV1 Decoder to HRTA Converter Model ...... 216
VII.3.5. Mapping of MD-XVID Encoder to HRTA Converter Model ...... 218
VII.3.5.1 Simplification in time prediction ...... 218
VII.3.5.2 Division of encoding time according to HRTA ...... 219

CHAPTER 4 – IMPLEMENTATION...... 223

VIII. CORE OF THE RETAVIC ARCHITECTURE ...... 223
VIII.1. Implemented Best-effort Prototypes ...... 223
VIII.2. Implemented Real-time Prototypes ...... 224
IX. REAL-TIME PROCESSING IN DROPS ...... 226
IX.1. Issues of Porting to DROPS ...... 226
IX.2. Process Flow in the Real-Time Converter ...... 229
IX.3. RT-MD-LLV1 Decoder ...... 231
IX.3.1. Setting-up Real-Time Mode ...... 231
IX.3.2. Preempter Definition ...... 232
IX.3.3. MB-based Adaptive Processing ...... 233
IX.3.4. Decoder’s Real-Time Loop ...... 234
IX.4. RT-MD-XVID Encoder ...... 235
IX.4.1. Setting-up Real-Time Mode ...... 236
IX.4.2. Preempter Definition ...... 237
IX.4.3. MB-based Adaptive Processing ...... 238
IX.4.4. Encoder’s Real-Time Loop ...... 239

CHAPTER 5 – EVALUATION AND APPLICATION...... 241


X. EXPERIMENTAL MEASUREMENTS ...... 241
X.1. The Evaluation Process ...... 242
X.2. Measurement Accuracy – Low-level Test Bed Assumptions ...... 243
X.2.1. Impact Factors ...... 243
X.2.2. Measuring Disruptions Caused by Impact Factors ...... 245
X.2.2.1 Deviations in CPU Cycles Frequency ...... 245
X.2.2.2 Deviations in the Transcoding Time ...... 246
X.2.3. Accuracy and Errors – Summary ...... 248
X.3. Evaluation of RT-MD-LLV1 ...... 249
X.3.1. Checking Functional Consistency with MD-LLV1 ...... 249
X.3.2. Learning Phase for RT Mode ...... 250
X.3.3. Real-time Working Mode ...... 253
X.4. Evaluation of RT-MD-XVID ...... 255
X.4.1. Learning Phase for RT-Mode ...... 255
X.4.2. Real-time Working Mode ...... 258
XI. COROLLARIES AND CONSEQUENCES ...... 264
XI.1. Objective Selection of Application Approach based on Transcoding Costs ...... 264
XI.2. Application Fields ...... 265
XI.3. Variations of the RETAVIC Architecture ...... 267

CHAPTER 6 – SUMMARY...... 269

XII. CONCLUSIONS ...... 269
XIII. FURTHER WORK ...... 271

APPENDIX A ...... 275

XIV. GLOSSARY OF DEFINITIONS ...... 275
XIV.1. Data-related Terms ...... 275
XIV.2. Processing-related Terms ...... 277
XIV.3. Quality-related Terms ...... 280

APPENDIX B ...... 281

XV. DETAILED ALGORITHM FOR LLV1 FORMAT ...... 281
XV.1. The LLV1 decoding algorithm ...... 281

APPENDIX C ...... 285

XVI. COMPARISON OF MPEG-4 AND H.263 STANDARDS ...... 285
XVI.1. Algorithmic differences and similarities ...... 285
XVI.2. Application-oriented comparison ...... 288
XVI.3. Implementation analysis ...... 291
XVI.3.1. MPEG-4 ...... 292
XVI.3.2. H.263 ...... 292


APPENDIX D...... 295

XVII. LOADING CONTINUOUS METADATA INTO ENCODER ...... 295

APPENDIX E ...... 297

XVIII. TEST BED ...... 297
XVIII.1. Non-real-time processing of high load ...... 297
XVIII.2. Imprecise measurements in non-real-time ...... 300
XVIII.3. Precise measurements in DROPS ...... 301

APPENDIX F ...... 303

XIX. STATIC META-DATA FOR FEW VIDEO SEQUENCES ...... 303
XIX.1. Frame-based static MD ...... 303
XIX.2. MB-based static MD ...... 304
XIX.3. MV-based static MD ...... 306
XIX.3.1. Graphs with absolute values ...... 306
XIX.3.2. Distribution graphs ...... 307

APPENDIX G ...... 311

XX. FULL LISTING OF IMPORTANT REAL-TIME FUNCTIONS IN RT-MD-LLV1 ...... 311
XX.1. Function preempter_thread() ...... 311
XX.2. Function load_allocation_params() ...... 312
XXI. FULL LISTING OF IMPORTANT REAL-TIME FUNCTIONS IN RT-MD-XVID ...... 313
XXI.1. Function preempter_thread() ...... 313

APPENDIX H...... 315

XXII. MPEG-4 AUDIO TOOLS AND PROFILES ...... 315
XXIII. MPEG-4 SLS ENHANCEMENTS ...... 317
XXIII.1. Investigated versions – origin and enhancements ...... 317
XXIII.2. Measurements ...... 317
XXIII.3. Overall Final Improvement ...... 318

BIBLIOGRAPHY ...... 321


LIST OF FIGURES

Number Description Page

Figure 1. Digital Item Adaptation architecture [Vetro, 2004] ...... 23
Figure 2. Bitstream syntax description adaptation architecture [Vetro, 2004] ...... 24
Figure 3. Adaptive transcoding system using meta-data [Vetro, 2001] ...... 31
Figure 4. Generic real-time media transformation framework supporting format independence in multimedia servers and database management systems. Remark: dotted lines refer to optional parts that may be skipped within a phase ...... 54
Figure 5. Comparison of compression size and decoding speed of lossless audio codecs [Suchomski et al., 2006] ...... 63
Figure 6. Location of the network determines the storage model [Auspex, 2000] ...... 66
Figure 7. Correlation between real-time decoding and conversion to internal format ...... 84
Figure 8. Encoding time per frame for various codecs ...... 88
Figure 9. Average encoding time per frame for various codecs ...... 89
Figure 10. Example distribution of time spent on different parts in the XVID encoding process for Clip no 2 ...... 90
Figure 11. Example distribution of time spent on different parts in the FFMPEG MPEG-4 encoding process for Clip no 2 ...... 91
Figure 12. Initial static MD set focusing on the video data ...... 98
Figure 13. Simplified LLV1 algorithm: a) encoding and b) decoding ...... 101
Figure 14. Quantization layering in the frequency domain in the LLV1 format ...... 104
Figure 15. Compressed file-size comparison normalized to LLV1 ([Militzer et al., 2005]) ...... 105
Figure 16. DCT-based video coding of: a) intra-frames, and b) inter-frames ...... 108
Figure 17. XVID Encoder: a) standard implementation and b) meta-data based implementation ...... 113
Figure 18. Continuous Meta-Data: a) overhead cost related to LLV1 Base Layer for tested sequences and b) distribution of given costs ...... 115
Figure 19. Size of LLV1 compressed output: a) cumulated by layers and b) percentage of each layer ...... 118
Figure 20. Relation of LLV1 layers to origin uncompressed video in YUV format ...... 119
Figure 21. Distribution of layers in LLV1-coded sequences showing average with deviation ...... 120
Figure 22. Picture quality for different quality layers for Paris (CIF) [Militzer et al., 2005] ...... 122


Figure 23. Picture quality for different quality layers for Mobile (CIF) [Militzer et al., 2005] ...... 122
Figure 24. LLV1 decoding time per frame of the Mobile (CIF) considering various data layers [Militzer et al., 2005] ...... 123
Figure 25. LLV1 vs. Kakadu – the decoding time measured multiple times and the average ...... 126
Figure 26. Quality value (PSNR in dB) vs. output size of compressed Container (QCIF) sequence [Militzer, 2004] ...... 127
Figure 27. R-D curves for Tempete (CIF) and Salesman (QCIF) sequences [Suchomski et al., 2005] ...... 128
Figure 28. Speed-up effect of applying continuous MD for various bit rates [Suchomski et al., 2005] ...... 129
Figure 29. Smoothing effect on the processing time by exploiting continuous MD [Suchomski et al., 2005] ...... 130
Figure 30. Video transcoding scenario from internal LLV1 format to MPEG-4 SP compatible (using MD-XVID): a) usual real-world case and b) simplified investigated case ...... 131
Figure 31. Execution time of the various data quality requirements according to the simplified scenario ...... 132
Figure 32. OggEnc and FAAC encoding behavior of the silence (based on [Penzkofer, 2006]) ...... 135
Figure 33. Behavior of the Lame encoder for all three test audio sequences (based on [Penzkofer, 2006]) ...... 136
Figure 34. OggEnc and FAAC encoding behavior of the male_speech.wav (based on [Penzkofer, 2006]) ...... 136
Figure 35. OggEnc and FAAC encoding behavior of the go4_30.wav (based on [Penzkofer, 2006]) ...... 137
Figure 36. Comparison of the complete encoding time of the analyzed codecs ...... 138
Figure 37. Initial static MD set focusing on the audio data ...... 141
Figure 38. Overview of simplified SLS encoding algorithm: a) with AAC-based core [Geiger et al., 2006] and b) without core ...... 143
Figure 39. Structure of HD-AAC coder [Geiger et al., 2006]: a) encoder and b) decoder ...... 146
Figure 40. Decomposition of MDCT ...... 148
Figure 41. Overlapping of blocks ...... 148
Figure 42. Decomposition of MDCT by Windowing, TDAC and DCT-IV [Geiger et al., 2001] ...... 150


Figure 43. Givens rotation and its decomposition into three lifting steps [Geiger et al., 2001] ...... 150
Figure 44. General perceptual coding algorithm [Kahrs and Brandenburg, 1998]: a) encoder and b) decoder ...... 153
Figure 45. MPEG Layer 3 encoding algorithm [Kahrs and Brandenburg, 1998] ...... 155
Figure 46. AAC encoding algorithm [Kahrs and Brandenburg, 1998] ...... 155
Figure 47. Gain of ODG with scalability [Suchomski et al., 2006] ...... 159
Figure 48. Decoding speed of SLS version of SQAM with truncated enhancement stream [Suchomski et al., 2006] ...... 160
Figure 49. Converter model – a black-box representation of the converter (based on [Schmidt et al., 2003; Suchomski et al., 2004]) ...... 162
Figure 50. Converter graph: a) meta-model, b) model and c) instance examples ...... 163
Figure 51. Simple transcoding used for measuring times and data amounts on best-effort OS with exclusive execution mode ...... 170
Figure 52. Execution time of simple transcoding: a) per frame for each chain element and b) per chain element for total transcoding time ...... 170
Figure 53. Cumulated time of source period and real processing time ...... 175
Figure 54. Cumulated size of source and encoded data ...... 176
Figure 55. DROPS Architecture [Reuther et al., 2006] ...... 179
Figure 56. Reserved context and real events for one periodic thread ...... 181
Figure 57. Communication in DROPS between kernel and application threads (incl. scheduling context) ...... 182
Figure 58. DSI application model (based on [Reuther et al., 2006]) ...... 183
Figure 59. Generalized control flow of the converter [Schmidt et al., 2003] ...... 185
Figure 60. Scheduling of the conversion graph [Märcz et al., 2003] ...... 186
Figure 61. Simplified OO-model of the CSI [Schmidt et al., 2003] ...... 191
Figure 62. Application model using CSI: a) chain of CSI converters and b) the details of control application and converter [Schmidt et al., 2003] ...... 191
Figure 63. Encoding time of the simple benchmark on different platforms ...... 194
Figure 64. Proposed machine index based on simple multimedia benchmark ...... 195
Figure 65. Compiler effects on execution time for media encoding using MD-XVID ...... 196
Figure 66. Preemptive task switching effect (based on [Mielimonka, 2006]) ...... 198
Figure 67. Timeslice allocation scheme in the proposed HRTA thread model of the converter ...... 199
Figure 68. Normalized average LLV1 decoding time counted per frame type for each sequence ...... 201


Figure 69. MD-XVID encoding time of different frame types for representative number of frames in Container QCIF ...... 202
Figure 70. Difference between measured and predicted time ...... 203
Figure 71. Average of measured time and predicted time per frame type ...... 204
Figure 72. MB-specific encoding time using MD-XVID for Carphone QCIF ...... 205
Figure 73. Distribution of different types of MBs per frame in the sequences: a) Carphone QCIF and b) Coastguard CIF (no B-frames) ...... 206
Figure 74. Cumulated processing time along the execution progress for the MD-XVID encoding (based on [Mielimonka, 2006]) ...... 207
Figure 75. Average coding time partitioning in respect to the given frame type (based on [Mielimonka, 2006]) ...... 208
Figure 76. Measured and predicted values for MD-XVID encoding of Carphone QCIF ...... 209
Figure 77. Prediction error of MB-based estimation function in comparison to measured values ...... 210
Figure 78. Distribution of MV-types per frame in the Carphone QCIF sequence ...... 211
Figure 79. Sum of motion vectors per frame and MV type in the static MD for Carphone QCIF sequence: a) with no B-frames and b) with B-frames ...... 211
Figure 80. MD-XVID encoding time of MV-type-specific functional blocks per frame for Carphone QCIF (96) ...... 212
Figure 81. Average encoding time measured per MB using the given MV-type ...... 213
Figure 82. MV-based predicted and measured encoding time for Carphone QCIF (no B-frames) ...... 214
Figure 83. Prediction error of MB-based estimation function in comparison to measured values ...... 215
Figure 84. Mapping of MD-XVID to HRTA converter model ...... 221
Figure 85. Process flow in the real-time converter ...... 230
Figure 86. Setting up real-time mode (based on [Mielimonka, 2006; Wittmann, 2005]) ...... 232
Figure 87. Decoder’s preempter thread accompanying the processing main thread (based on [Wittmann, 2005]) ...... 233
Figure 88. Timeslice overrun handling during the processing of enhancement layer (based on [Wittmann, 2005]) ...... 234
Figure 89. Real-time periodic loop in the RT-MD-LLV1 decoder ...... 235
Figure 90. Encoder’s preempter thread accompanying the processing main thread ...... 237
Figure 91. Controlling the MB-loop in real-time mode during the processing of enhancement layer ...... 238
Figure 92. Logarithmic time scale of computer events [Bryant and O'Hallaron, 2003] ...... 244


Figure 93. CPU frequency measurements in kHz for: a) AMD Athlon 1800+ and b) Pentium Mobile 2GHz ...... 245
Figure 94. Frame processing time per timeslice type depending on the quality level for Container CIF (based on [Wittmann, 2005]) ...... 249
Figure 95. Normalized average time per MB for each frame consumed in the base timeslice (based on [Wittmann, 2005]) ...... 251
Figure 96. Normalized average time per MB for each frame consumed in the enhancement timeslice (based on [Wittmann, 2005]) ...... 251
Figure 97. Normalized average time per MB for each frame consumed in the cleanup timeslice (based on [Wittmann, 2005]) ...... 252
Figure 98. Percentage of decoded MBs for enhancement layers for Mobile CIF with increasing framerate (based on [Wittmann, 2005]) ...... 253
Figure 99. Percentage of decoded MBs for enhancement layers for Container QCIF with increasing framerate (based on [Wittmann, 2005]) ...... 254
Figure 100. Percentage of decoded MBs for enhancement layers for Parkrun ITU601 with increasing framerate (based on [Wittmann, 2005]) ...... 254
Figure 101. Encoding time per frame of various videos for RT-MD-XVID: a) average and b) deviation ...... 256
Figure 102. Worst-case encoding time per frame and achieved average quality vs. requested Lowest Quality Acceptable (LQA) for Carphone QCIF ...... 257
Figure 103. Time-constrained RT-MD-XVID encoding for Mobile QCIF and Carphone QCIF ...... 260
Figure 104. Time-constrained RT-MD-XVID encoding for Mobile CIFN and Coastguard CIF ...... 261
Figure 105. Time-constrained RT-MD-XVID encoding for Carphone QCIF with B-frames ...... 262
Figure 106. Newly proposed temporal layering in the LLV1 format ...... 272
Figure 107. LLV1 decoding algorithm ...... 282
Figure 108. Graph of ranges – quality vs. bandwidth requirements ...... 291
Figure 109. Distribution of frame types within the used set of video sequences ...... 304
Figure 110. Percentage of the total gained time between the original code version and the final version [Wendelska, 2007] ...... 319


LIST OF TABLES

Number Description Page

Table 1. Throughput and storage requirement for few digital cameras from different classes ...... 59
Table 2. Throughput and storage requirements for few video standards ...... 60
Table 3. Throughput and storage requirements for audio data ...... 61
Table 4. Hardware solutions for the media buffer ...... 65
Table 5. Processing time consumed and amount of data produced by the example transcoding chain for Mobile (CIF) video sequence ...... 172
Table 6. The JCPS calculated for the respective elements of the conversion graph from the Table 5 ...... 172
Table 7. JCPS time and size for the LLV1 encoder ...... 175
Table 8. Command line arguments for setting up timing parameters of the real-time thread (based on [Mielimonka, 2006]) ...... 236
Table 9. Deviations in the frame encoding time of the MD-XVID in DROPS caused by microscopic factors (based on [Mielimonka, 2006]) ...... 247
Table 10. Time per MB for each sequence: average for all frames, maximum for all frames, and the difference (max-avg) in relation to the average (based on [Wittmann, 2005]) ...... 252
Table 11. Configuration of the MultiMonster cluster ...... 298
Table 12. The hardware configuration for the queen-bee server ...... 298
Table 13. The hardware configuration for the bee-machines ...... 299
Table 14. The configuration of PC_RT ...... 300
Table 15. The configuration of PC ...... 300
Table 16. MPEG Audio Object Type Definition based on Tools/Modules [MPEG-4 Part III, 2005] ...... 315
Table 17. Use of few selected Audio Object Types in MPEG Audio Profiles [MPEG-4 Part III, 2005] ...... 316


INDEX OF EQUATIONS

Equation Page Equation Page Equation Page

(1) 94     (27) 165    (53) 207
(2) 94     (28) 165    (54) 208
(3) 95     (29) 165    (55) 209
(4) 95     (30) 165    (56) 209
(5) 95     (31) 166    (57) 209
(6) 95     (32) 166    (58) 212
(7) 96     (33) 166    (59) 213
(8) 96     (34) 166    (60) 213
(9) 96     (35) 168    (61) 213
(10) 96    (36) 169    (62) 215
(11) 96    (37) 169    (63) 217
(12) 103   (38) 169    (64) 217
(13) 103   (39) 169    (65) 217
(14) 103   (40) 171    (66) 217
(15) 107   (41) 177    (67) 219
(16) 140   (42) 188    (68) 219
(17) 140   (43) 188    (69) 219
(18) 140   (44) 188    (70) 219
(19) 140   (45) 188    (71) 220
(20) 148   (46) 202    (72) 220
(21) 149   (47) 202    (73) 220
(22) 149   (48) 202    (74) 220
(23) 149   (49) 203    (75) 264
(24) 150   (50) 206
(25) 150   (51) 206
(26) 151   (52) 206


Chapter 1 – Introduction

The problems treated here are those of data independence –the independence of application programs and terminal activities from growth in data types and changes in data representation– and certain kinds of data inconsistency which are expected to become troublesome even in nondeductive systems. Edgar F. Codd (1970, A Relational Model of Data for Large Shared Data Banks, Comm. of ACM 13/6)

I. INTRODUCTION

The wisdom of humanity and the accumulated experience of generations derive from the human ability to apply intelligently the knowledge gained through the senses. However, before a human being acquires knowledge, he must learn to understand the meaning of information. The common sense (or meaning) of natural languages has been molded through the ages and taught implicitly to the young generation by the old one; in the case of artificial languages like programming languages, however, the sense of terms must be explained explicitly to the users. The sense of the language terms used in communication between people allows them to understand the information that is carried by the data written or spoken in the specific language. The data are located at the base level of the hierarchy of information science. Data in different forms (i.e. represented by various languages) may carry the same information, and data in the same form may carry different information; in the second case the interpretation of the terms used differs, usually depending on the context.


Based on the above discussion, it may be deduced that people rely on data and their meaning in the context of information, knowledge and wisdom; thus, everything is about data and their understanding. However, the data themselves are useless until they are processed in order to obtain the information out of them. In other words, data without processing may be just a collection of garbage, and the processing is possible only if the format of the data is known.

I.1. EDP Revolution

The term data processing covers all actions dealing with data, including data understanding, data exchange and distribution, data modification, data translation (or transformation) and data storage. People have been using data as carriers of information for ages, and they have likewise been processing these data in manifold, but sometimes inefficient, ways.

The revolution in data processing finds its roots in the late sixties of the twentieth century [Fry and Sibley, 1976], when the digital world came into play and two data models were developed: the network model by Charles Bachman, named Integrated Data Store (IDS) but officially known as the Codasyl Data Model (which defined DDL and DML) [CODASYL Systems Committee, 1969], and the hierarchical model implemented by IBM under the supervision of Vern Watts, called Information Management System (IMS) [IBM Corp., 1968]. In both models, access to the data was provided through operations using pointers and paths linked to the data records. As a result, restructuring the data required rewriting the access methods, and thus the physical structure had to be known in order to query the data.

Edgar F. Codd, who worked at IBM at that time, did not like the idea of the physically dependent navigational models of Codasyl and IMS, in which data access depended on the storage structures. Therefore he proposed a relational model of data for large data banks [Codd, 1970]. The relational model separated the logical organization of the database, called the schema, from the physical storage methods. Based on Codd's model, System R, a relational database system, was proposed [Astrahan et al., 1976; Chamberlin et al., 1981]. Moreover, System R was the first implementation of SQL, the structured query language, supporting transactions.


After the official birth of System R, an active movement of data and their processing from the analog into the electronic world began. The first commercial systems, such as IBM DB2 (the production successor of System R) and Oracle, fostered and accelerated electronic data processing. The development of these systems focused on implementing the objectives of database management systems, such as data independence, centralization of data and system services, a declarative query language (SQL) and application neutrality, multi-user operation with concurrent access, error treatment (recovery), and the concept of transactions. However, these systems supported only textual and numerical data; other, more complex types of data like images, audio or video were not even mentioned.

I.2. Digital Multimedia – New Hype and New Problems

Nowadays, a multimedia revolution analogous to the electronic data processing (EDP) revolution can be observed. It has been triggered by scientific achievements in information systems, followed by a wide range of available multimedia software and the continuous reduction of computer equipment prices, which was especially noticeable in the storage device sector in the 1980s and 1990s. On the other hand, there has been a rising demand for multimedia data consumption since the late eighties.

However, the demand could not be fully satisfied due to missing standards, i.e. due to deficiencies in the common definition of norms and terms. Thus standardization bodies were established: WG-11 (known as MPEG) within ISO/IEC and SG-16 within ITU-T. MPEG is a working group of JTC1/SC29 within the ISO/IEC organization. It was set up in 1988 to answer these demands, at first by standardizing the coded representation of digital audio and video, enabling many new technologies, e.g. VideoCD and MP3 [MPEG-1 Part III, 1993]. SG-16, a study group of ITU-T, finds its roots in the ITU CCITT Study Group VIII (SG-7), which was founded in 1986. SG-16's primary goal was to develop a new, more efficient compression scheme for continuous-tone still images (known as JPEG [ITU-T Rec. T.81, 1992]). Currently, the activities of MPEG and SG-16 cover the standardization of all technologies required for interoperable multimedia, including media-specific data coding, compositions, descriptions and meta-data definitions, transport systems and architectures.


In parallel to the standardization of coding technologies, the transmission and storage of digital video and audio have become very important for almost all kinds of applications that process audio-visual data. For example, analog archiving and broadcasting was the dominant way of handling audio and video data even less than 10 years ago, but now the situation has changed completely, and services such as DVB-T, DTV or ITV are available. Moreover, the standardization process has triggered research activities delivering new ideas, which proposed an even more extensive usage of digital storage and transmission of multimedia data. Digital pay-TV (DTV) is transmitted through cable networks to households, and the terrestrial broadcasting of digital video and audio (DVB-T and DAB-T in Europe, ISDB-T in Japan and Brazil, ATSC in the USA and some other countries) is already available in some high-tech regions, delivering digital standard-definition television (SDTV) and allowing for high-definition TV (HDTV) in the future. The rising Internet bandwidths enable high-quality digital video-on-demand (VoD) applications, and the sharing of home-made videos is possible through Google Video, YouTube, Vimeo, Videoegg, Jumpcut, blip.tv, and many other providers. Music and video stores are capable of selling digital media content through the Internet (e.g. iTunes). The recent advances in mobile networking (e.g. 3GPP, 3GPP2) permit audio and video streaming to handheld devices. Home entertainment devices like DVD players or digital camcorders with digital video (DV) have hit the masses. Almost all modern PCs by default offer hardware and software support for digital video playback and editing. The national TV and radio archives of analog media go online through government-sponsored digitizing companies (e.g. INA in France). Digital cinemas allow for new experiences with respect to high audio and video quality. The democratically created and low-cost Internet TV (e.g. Dein-TV), which has its analogy in the open-source development community, seems to be approaching our doors.

The increasing popularity of the mentioned applications causes an ever increasing amount of audio and video data to be processed, stored and transmitted, and at the same time it closes the self-feeding circle of providing better know-how and then developing new applications, which, once applied, indicate new requirements for the existing technologies. The common and continuous production of audio and visual information triggered by new standards requires a huge amount of hard disk space and network bandwidth, which in turn highly stimulates the development of more efficient and thus even more complex multimedia compression algorithms.

In the past, the development of MPEG-1 [LeGall, 1991] and MPEG-2 (H.262 [ITU-T Rec. H.262, 2000] is a common text with MPEG-2 Part 2 Video [MPEG-2 Part II, 2001]) was driven by the commercial storage, distribution and transmission of digital video at different resolutions, accompanied by audio in formats such as MPEG-1 Layer 3 (commonly known as MP3) [MPEG-1 Part III, 1993] or Advanced Audio Coding (AAC) [MPEG-2 Part VII, 2006]. Next, H.263 [ITU-T Rec. H.263+, 1998; ITU-T Rec. H.263++, 2000; ITU-T Rec. H.263+++, 2005] was motivated by the demand for a low-bitrate encoding solution for videoconferencing applications. The newer MPEG-4 Video [MPEG-4 Part II, 2004] with High Efficiency AAC [MPEG-4 Part IV Amd 8, 2005] was required to support Internet applications in manifold ways, and H.264 [ITU-T Rec. H.264, 2005] and MPEG-4 Part 10, which were developed by the JVT, the joint video team of MPEG and SG-16, and are technically aligned to each other [MPEG-4 Part X, 2007], were meant as next-generation AV compression algorithms supporting high-quality digital audio and video distribution [Schäfer et al., 2003]. However, after the standardization process, the applicability of a standard usually crosses the borders of the imagination present at the time of its definition: e.g. MPEG-2 found application in DTV and DVB/DAB through satellite, cable, and terrestrial channels as planned, but also as the standard for SVCD and DVD production.

Considering the described multimedia revolution delivering more and more information, people have moved from the poor and hard-to-read world of textual and numerical data to the rich and easily understandable information carried by multimedia data. For the human perception system, the consumption of audio-visual information provided by applications rich in media data like images, music, animation or video is simpler, more pleasant and more comprehensible; hence the popularity of any multimedia-oriented solution keeps rising. The trend towards rich-media applications, supported by hardware development and combined with the advances in computing performance over the past few years, has made media data the most important factor in digital storage and transmission today.

On the other hand, the multimedia revolution also causes new problems. Today's large variety of multimedia formats, finding application in many different fields, confuses the usual consumers, because the software and hardware used is not always capable of understanding the formats and is not able to present the information in the expected way (if at all). Moreover, among the different application fields the quality requirements imposed on the multimedia data vary to a high degree: e.g. a video played back on the small display of a handheld device can hardly have the same dimensions and the same framerate as a movie displayed on a large digital cinema screen; a picture presented on a computer screen will differ from one downloaded to a cellular phone; a song downloaded from an Internet audio shop may be decoded and played back by a home set-top multimedia box, but it might not be consumable on a portable device.

I.3. Data Independence

The amount and popularity of multimedia data with its variety of applications, and on the other hand the complexity and diversity of coding techniques, have been consistently motivating the researchers and developers of database management systems. Complex data types such as image, audio and video data need to be stored and accessed. Most of the research on multimedia databases has considered only timeless media objects, i.e. the multimedia database management systems have been extended by services supporting the storage and retrieval of time-independent media data like still images or vector graphics. However, managing time-dependent objects like audio or video data requires sophisticated delivery techniques (including buffering, synchronization and consideration of time constraints) and efficient storage and access techniques (gradation of physical storage, caching with preloading, indexing, layering of data), which imposes completely new challenges on database management systems in comparison to those known from the EDP revolution of the 80s and 90s. Furthermore, new searching techniques for audio and video have to be proposed due to the enormously large amounts of data; e.g. context-based retrieval needs a new index methodology, because present indexing facilities are not able to process the huge amounts of media data stored.

While managing typical timeless data within one single database management system is possible, the handling of audio-video data is usually distributed over two distinct systems: a standard relational database and a specialized media server. The first is responsible for managing meta-data and searching, and the second provides data delivery, but together they should provide the objectives of a database management system mentioned in the previous section. The centralization of data and system services and multi-user operation with concurrent access are provided by many multimedia management systems. The concept of transactions is not really applicable to multimedia data (unless one considers the upload of a media stream as a transaction), and thus error treatment is solved by simple reload or restart mechanisms. Work on a declarative query language (SQL) has resulted in the multimedia extension to SQL:1999 (known as SQL/MM [Eisenberg and Melton, 2001]), but it still falls short of the possibilities offered by abstract data types for multimedia and is not implemented in any well-known commercial system. Finally, application neutrality and data independence have somehow been left behind, and there is a lack of solutions supporting them.

The data independence of multimedia data has been a research topic of Professor Meyer-Wegener since the early years of his research at the beginning of the 1990s. He started with the design of the media object storage system (MOSS) [Käckenhoff et al., 1994] as a component of a multimedia database management system. Next, he continued with research on a kernel architecture for next generation archives and realtime-operable objects (KANGAROO), allowing for media data encapsulation and media-specific operations [Marder and Robbert, 1997]. In the meantime, he worked on the integration of media servers with DBMSs and support of quality-of-service (IMOS) [Berthold and Meyer-Wegener, 2001], which was continued by Ulrich Marder in the VirtualMedia project [Marder, 2002]. Next, he supervised the Memo.real project, focused on media-object encoding by multiple operations in realtime [Lindner et al., 2000]. Finally, this work, started in 2002 and named the RETAVIC project, focuses on the provision of format independence by real-time audio-video conversion for multimedia database management systems. It was meant to be a mutual complement to the memo.REAL project (the continuation of Memo.real) [Märcz et al., 2003; Suchomski et al., 2004], which was started in 2001 but unfortunately had to be discontinued midway.

After these years of research, the data independence of multimedia data is still a very complex issue because of the variety of media data types [Marder, 2001]. The media data types are classified into two general groups: non-temporal and temporal (also known as timed, time-dependent, or continuous). Text, images, and 2D and 3D graphics belong to the non-temporal group, because time has no influence on the data. In contrast, audio (e.g. wave), music (e.g. MIDI), video and animation are temporal media data, because they are time-constrained. For example, an audio stream consists of discrete values (samples) usually obtained by sampling audio with a certain frequency, and a video stream consists of images (frames) that have also been captured with a certain frequency (called the framerate). The characteristic distinguishing these streams from non-temporal media data is the time constraint (the continuous property of a data stream), i.e. the occurrence of the data events (samples or frames) is ordered and the periods between them are usually constant [Suchomski et al., 2004]. Thus, providing data-independent access to various media types requires solutions specific to each media type or at least to each group (non-temporal vs. temporal).
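
To make the notion of a continuous stream more tangible, the following minimal sketch (illustrative only; the types, names and rates are assumptions and not part of the RETAVIC prototypes) derives the nominal period between data events and the presentation time of the n-th event from the stream rate, which is exactly the time constraint distinguishing temporal from non-temporal media:

    /* Illustrative sketch: a continuous media stream is an ordered sequence of
     * data events (audio samples or video frames) whose nominal inter-event
     * period follows directly from the stream rate.                          */
    #include <stdio.h>
    #include <stdint.h>

    typedef struct {
        const char *name;     /* e.g. "video" or "audio"                      */
        double      rate_hz;  /* framerate or sampling frequency              */
    } continuous_stream_t;

    /* Nominal period between two consecutive events, in microseconds. */
    static double period_us(const continuous_stream_t *s)
    {
        return 1e6 / s->rate_hz;
    }

    /* Presentation time of the n-th event relative to the stream start. */
    static double presentation_time_us(const continuous_stream_t *s, uint64_t n)
    {
        return n * period_us(s);
    }

    int main(void)
    {
        continuous_stream_t video = { "video", 25.0 };     /* 25 fps (PAL)     */
        continuous_stream_t audio = { "audio", 44100.0 };  /* 44.1 kHz audio   */

        printf("%s: period %.1f us, frame 100 at %.1f us\n",
               video.name, period_us(&video), presentation_time_us(&video, 100));
        printf("%s: period %.2f us, sample 100 at %.2f us\n",
               audio.name, period_us(&audio), presentation_time_us(&audio, 100));
        return 0;
    }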

Secondly, data independence in database management systems covers both format independence (logical) and storage independence (physical) [Connolly and Begg, 2005; Elmasri and Navathe, 2000]. Format independence defines format neutrality for the user application, i.e. the internal storage format may differ from the output format delivered to the application, but the application is not bothered by any internal conversion steps and simply consumes the understandable data as designed in the external schema. Storage independence defines the neutrality of physical storage and access paths by hiding the internal access and caching methodology, i.e. the application does not have to know how the data are stored physically (on disc, tape, CD/DVD), how they are accessed (index structures, hashing algorithms) or in which file system they are located (local access paths, network shares); it only knows the logical localization (usually represented by a URL retrieved from the multimedia database) from the external schema used by the application (a short illustrative sketch is given after the following list). Hence, the provision of data independence and application neutrality for multimedia data relies on many fields of research in computer science:

• databases and information systems (e.g. lossless storage and hierarchical data access, physical storage structures, format transformations),
• coding and compression techniques (e.g. domain transformations (DCT, MDCT) including lifting schemes (binDCT, IntMDCT), entropy repetitive-sequence coding (RLE), entropy statistical variable-length coding, CABAC, bit-plane coding (BPGC)),
• transcoding techniques (cascade transcoder, transform-domain transcoder, bit-rate transcoder, quality-adaptation transcoder, frequency-domain transcoder, spatial-resolution transcoder, temporal transcoder, logo insertion and watermarking),
• audio and video analysis (motion detection and estimation, scene detection, analysis of important macroblocks),
• audio- and video-specific algorithms (zig-zag and progressive scanning, intra- and inter-frame modeling, quantization with constant-step, matrix-based or function-dependent schemes, perceptual modeling, noise shaping),
• digital networks and communication with streaming technologies (time-constrained protocols such as MMS and RTP/RTSP, bandwidth limitations, buffering strategies, quality-of-service issues, AV gateways),
• and operating systems with real-time aspects (memory allocation, IPC, scheduling algorithms, caching, timing models, OS boot-loading).
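
Returning to the distinction drawn above, the following purely hypothetical sketch (none of these types or functions belong to an existing MMDBMS API; they are illustrative assumptions) makes both notions concrete: the application names only a logical object and a delivery format, while the physical path and the internal storage format stay hidden, and a conversion step is inserted transparently when needed.

    /* Hypothetical sketch of data-independent media access.                  */
    #include <stdio.h>

    typedef enum { FMT_LLV1, FMT_MPEG4_SP, FMT_H263 } format_t;

    typedef struct {
        const char *logical_id;   /* e.g. a URL handed out by the MMDBMS      */
        const char *phys_path;    /* hidden from the application (storage)    */
        format_t    internal_fmt; /* hidden from the application (format)     */
    } media_object_t;

    /* Stand-in for the conversion step; in RETAVIC this role is played by the
     * real-time transcoding chain.                                           */
    static void transcode(const media_object_t *o, format_t target)
    {
        printf("transcoding %s: internal format %d -> delivery format %d\n",
               o->logical_id, o->internal_fmt, target);
    }

    /* The only call the application sees: logical identifier + desired format. */
    static void deliver(const media_object_t *o, format_t target)
    {
        if (o->internal_fmt == target)
            printf("direct delivery of %s\n", o->logical_id); /* no conversion */
        else
            transcode(o, target);                             /* hidden step   */
    }

    int main(void)
    {
        media_object_t clip = { "mmdb://videos/42", "/raid0/seg/000042.llv1", FMT_LLV1 };
        deliver(&clip, FMT_MPEG4_SP);  /* the application never learns about LLV1 */
        return 0;
    }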

None of the solutions created for research purposes so far can be considered complete with respect to the provision of data independence for continuous data in an MMDBMS. The RETAVIC project [Suchomski et al., 2005] and the co-related memo.REAL project [Lindner et al., 2000; Suchomski et al., 2004] were founded with the aim of developing a solution for multimedia database management systems or media servers that provides data independence in the context of capturing, storage of and access to potentially large media objects, and real-time quality-aware delivery of continuous media data, comprising a modeling of converters, transcoding and processing with graph-based models. However, considering all the aspects from the mentioned fields of computer science, the complexity of providing data independence and application neutrality based on real-time processing and QoS control is enormous; e.g. it requires the design of a real-time capable file system, network adapters and infrastructure, and the development of a real-time transformation framework. Therefore, the problem considered in this work has been limited to a subset of the mentioned aspects on the server side, namely to a solution for the provision of format independence for audio and video data by transformations on the MMDBMS side. Many issues of operating systems (e.g. real-time file access, storage design) and of network and communication systems have been left out.

I.4. Format Independence

Format independence can be compared to the Universal Multimedia Access (UMA) idea [Mohan et al., 1999]. In UMA, it is assumed that some amount of audiovisual (AV) signal is transmitted over different types of networks (optical, wireless, wired) to a variety of AV terminals that support different formats. The core assumption of UMA is to provide the best QoS or QoE by either selecting the appropriate content formats, adapting the content format directly to meet the playback environment, or adapting the content playback environment to accommodate the content [Mohan et al., 1999]. The key problem is to fix the mismatch between rich multimedia contents, networks and terminals [Mohan et al., 1999]; however, it has not been specified how to do this. On the other hand, UMA goes beyond the borders of multimedia database management systems and proposes to do format transformations within the network infrastructure and with dedicated transcoding hardware. This, however, makes the problem of format independence even more complex due to too many factors deriving from the variety of networks, hardware solutions, and operating systems. Moreover, introducing real-time QoS control within the scope of a global distribution area is hardly possible, because the networks have their constraints (bandwidth) and the terminals their own hardware (processing power and memory capacity) and software capabilities (supported formats, running OS). Including all these aspects and supporting all applications in one global framework is hardly possible, so this work focuses only on the part connected with the MMDBMS, i.e. format independence provision and application neutrality within the MMDBMS, where the heterogeneity of the problem is kept at a reasonably low level.

There are three perspectives in the research on the format independence of continuous media data and on application neutrality (a detailed discussion is provided in section II.3). The first one uses multiple copies of the same media object in many formats and various qualities. The second covers adaptation with scalable formats, where the quality is adapted during transmission. The third one, presented in this work, considers on-demand conversion from internal format(s) to miscellaneous delivery formats, analogous to the UMA idea of transparent transformation of media data, but only within the MMDBMS [Marder, 2000].

Storing videos in different formats and dimensions seems as reasonable as transmitting them in a unique format and with fixed dimensions (and then adapting the quality and format on the receiver's side). However, the waste of storage and network resources is huge: in the first case the replicas occupy extra space on the storage, and in the second case the bandwidth of the transmission channel, regardless of whether it is wireless or wired, is wasted. Moreover, none of these two solutions provides full format independence. Only an audio-video conversion allowing for quality adaptation during processing and for transformation to the required coding scheme would allow for full format independence.

I.5. AV Conversion Problems

However, there are many problems when considering audio-video conversion. First, the time characteristic of the continuous data requires that the processing is controlled not only according to functional correctness but also according to time correctness. For example, a video frame converted and delivered after its presentation time is useless, just as listening to audio with samples ordered in a different time order than in the original sequence is senseless. Secondly, the conversion algorithms (especially compression) for audio and video data are very complex and computationally demanding (they require a lot of CPU processing power). Therefore, an optimization of the media-specific transformation algorithms is required. Thirdly, the processing demand of the conversion varies in time and is heavily dependent on the media content, i.e. one part of the media data may be easily convertible with small effort, while another contains high energy of the audio or visual signal and requires much more computation during compression. Thus the resource allocation must be able to cope with a varying load, or the adaptation of the processing and quality-of-service (QoS) support must be included in the conversion process, i.e. the processing elements of the transcoding architecture, such as decoders, filters and encoders, should provide a mechanism to check requests and to guarantee a certain QoS for such requests once their feasibility has been tested positively.
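
To illustrate the last point, the following minimal sketch (in C, with purely hypothetical structures and numbers, not the RETAVIC interface) shows the basic idea of such a feasibility check: a converter admits a transcoding request only if the predicted worst-case demand fits into the remaining CPU budget, and only then gives a QoS guarantee.

/* Hypothetical admission test for a converter: reserve CPU budget only if
 * the predicted worst-case demand of the request is feasible. */
#include <stdbool.h>

struct transcode_request {
    double frame_rate;          /* frames per second to be delivered           */
    double cost_per_frame_ms;   /* predicted worst-case CPU cost per frame     */
};

struct node_budget {
    double capacity_ms_per_s;   /* CPU time available per second of real time  */
    double reserved_ms_per_s;   /* already promised to admitted requests       */
};

static bool admit(struct node_budget *b, const struct transcode_request *r)
{
    double demand = r->frame_rate * r->cost_per_frame_ms;   /* ms of CPU per second */
    if (b->reserved_ms_per_s + demand > b->capacity_ms_per_s)
        return false;                     /* infeasible: reject, no QoS guarantee   */
    b->reserved_ms_per_s += demand;       /* feasible: reserve and guarantee QoS    */
    return true;
}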

I.6. Assumptions and Limitations

The MMDBMS should support not only format independence but also application neutrality. This means that different aspects and requirements of applications should already be considered in the process of multimedia database design. One such key aspect is long-term storage involving no loss of information, i.e. the data should keep the original information for generations. As such, the MMDBMS must be capable of supporting a lossless internal format and lossless transformations between the internal and the external (delivery) format. It is assumed that a lossy frame/window of samples is a subset of a lossless frame/window of samples, so being able to provide lossless properties also means being able to deliver lossy characteristics.

Moreover, it is assumed that huge sets of data are stored in the MMDBMS. In such collections, the access frequency of a media object is low and only a few clients access the same data at a time (contrary to VoD, where many clients access the same data very often). Examples of such media data sets are scientific collections of video and images, digital libraries with scanned documents, police archives of fingerprints and criminal photography, and videos from surgery and medical operations (e.g. brain surgery or coronary artery (heart) bypassing).

Furthermore, the clients accessing media data differ from each other and have different requirements with respect to format and quality. The quality requirements may range from lossless quality with full information to very low quality with limited but still understandable information. An ideal system would deliver all possible formats; in reality, however, only formats implemented in the transformation process can be supported by the MMDBMS.

Embedded systems with dedicated hardware would eventually provide a fast environment for audio-video conversion supporting various formats, but they are neither flexible nor inexpensive, so this work aims at a software-only solution for the real-time transcoding architecture. Moreover, if conversion on the client side were assumed, bandwidth of the transmission channels would be wasted; besides, it would hardly be possible to do the conversion on power-sensitive and insufficiently powerful mobile devices, which are usually optimized for decoding one specific format (the manufacturer's implementation of the decoder).

I.7. Contribution of the RETAVIC Project

The central contribution of this dissertation is a proposal of the conceptual model of the real-time audio-video conversion architecture, which includes: a real-time capturing phase with fast, simple lossless encoding and a media buffer; a non-real-time preparation phase with conversion to the internal format and content analysis; a storage phase with lossless, scalable binary formats and a meta-data set definition; and a real-time transcoding phase with real-time capable quality-adaptive decoding and encoding. The key assumption in the real-time transcoding phase is the exploitation of the meta-data set describing a given media object for feasibility analysis, scheduling and controlling the process. Moreover, the thesis proposes media-specific processing models based on the conceptual model and defines hard-real-time adaptive processing. The need for a lossless, scalable binary format led to the Layered Lossless Video format (LLV1) [Militzer et al., 2005] being designed and implemented within this project. The work is backed by an analysis of the requirements for the internal media storage format for audio and video and by a review of the format independence support in current multimedia management systems, i.e. multimedia database management systems and media servers (streaming servers). The proof of concept is given by the prototypical real-time implementation of the critical parts of the transcoding chain for video, which was evaluated with respect to functional, quantitative and qualitative properties.

I.8. Thesis Outline

In Chapter 2, the related work is discussed. It is divided into two big sections: the fundamentals and frameworks, being the core related work (Section II), and the data format and real-time operating system issues, being loosely coupled related work (Section III).

In Chapter 3, the design is presented. At first, the conceptual model of format independence provision is proposed in Section IV. It includes real-time capturing, non-real-time preparation, storage, and real-time transcoding. In Section V, the video processing model is described: LLV1 is introduced, the video-specific meta-data set is defined in detail, and real-time decoding and encoding are presented. Analogously, in Section VI, the audio processing model with MPEG-4 SLS, the audio-specific meta-data and audio transcoding are described. Finally, in Section VII, the real-time issues in the context of continuous media transcoding are explained. Prediction, scheduling, execution and adaptation are the subjects considered there.

In Chapter 4, the details about the best-effort prototypes and the real-time implementation are given. Section VIII points out the core elements of the RETAVIC architecture and states what the target of the implementation phase was. Section IX describes the real-time implementations of two representative converters, respectively RT-MD-LLV1 and RT-MD-XVID. The control of the real-time processes is also described in this section.

In Chapter 5, the proof of concept is given. The evaluation and measurements are presented in Section X: the evaluation process, a discussion of the measurement accuracy, and the evaluation of the real-time converters are covered. The applicability of the RETAVIC architecture is discussed in Section XI. Moreover, a few variations and extensions of the architecture are mentioned.

Finally, the summary and an outlook on further work are included in Chapter 6. The conclusions are listed in Section XII and the further work is covered by Section 0. Additionally, there are a few appendices detailing some of the related issues.

Chapter 2 – Related Work

One thing I have learned in a long life: that all our science, measured against reality, is primitive and childlike—and yet it is the most precious thing we have. Albert Einstein (1955, Speech for Israel in Correspondence on a New Anti-War Project)

Even though many fields of research have been named as related areas in the introduction, this chapter covers only the most relevant issues, which are referred to later in this dissertation. The chapter is divided into two sections: an essential one, being the most important related work, and a surrounding one, describing adjacent but still important related work.

II. FUNDAMENTALS AND FRAMEWORKS

The essential related work covers terms and definitions, multimedia delivery, format independence methods and applications, and transformation frameworks. First, there are definitions provided directly from other sources as well as keywords with a meaning refined for the RETAVIC context – both are given in the Terms and Definitions section. They are grouped into data-related, processing-related, and quality-related terms. Next, the issues of delivering multimedia data such as streaming, size and time constraints, and buffering are described. The three possible methods of providing format independence (already shortly mentioned in the introduction), which are later referred to as approaches, are discussed subsequently. After that, comments on some video and audio transformation frameworks are given. And last but not least, the related research on format independence with respect to multimedia management systems is considered.

II.1. Terms and Definitions

Most of the following definitions used within this work are collected in [Suchomski et al., 2004], and some of them are adopted or refined for the purposes of this work. There are also some terms added from other works or newly created to clarify the understanding. All terms are listed and explained in detail in Appendix A.

The terms and definitions fall into three groups: data-related, processing-related and quality-related. They are grouped as follows:

a) data-related: media data, media object (MO), multimedia data, multimedia object (MMO), meta data (MD), quant, timed data (time-constrained data, time-dependent data), continuous data (periodic or sporadic), data stream (shortly stream), continuous MO, audio stream, video stream, multimedia stream, continuous MMO, container format, compression/coding scheme;

b) processing-related: transformation (lossy, lossless), multimedia conversion (media-type, format and content changers), coding, decoding, converter, coder/decoder, codec, (data) compression, decompression, compression efficiency (coding efficiency), compression ratio, compression size, transcoding, heterogeneous transcoding, transcoding efficiency, transcoder, cascade transcoder, adaptation (homogeneous or unary-format transcoding), chain of converters, graph of converters, conversion model of (continuous) MMO, (error) drift;

c) quality-related: quality, objective quality, subjective quality, Quality-of-Service (QoS), Quality-of-Data (QoD), transformed QoD (T(QoD)), Quality-of-Experience (QoE).

This work uses the above terms extensively; thus the reader is referred to Appendix A in case of doubts or problems with understanding.

II.2. Multimedia Data Delivery

The delivery of multimedia data is different from conventional data delivery, because a multimedia stream is, due to its size, not transferred as a whole from the data storage server (such as an MMDBMS or a media server) to the client. Usually, the consuming application on the client side starts displaying the data almost immediately after their arrival. Hence, the data do not have to be complete, i.e. not all parts of the multimedia data have to be present on the consumer side, but just those parts required for displaying at a given time. This delivery method is known as streaming [Gemmel et al., 1995].

Streaming multimedia data to clients differs in two ways from transferring conventional data [Gemmel et al., 1995]: 1) the amount of data and 2) the real-time constraints. For example, a two-hour MPEG-4 video compressed with an average throughput of 3.96 Mbps1 needs more than 3.56 GB of disk space, and analogously, the accompanying two-hour stereo audio signal in the MPEG-1 Layer III format with an average throughput of 128 kbps requires 115 MB. And even though modern compression schemes allow for compression ratios of 1:50 [MPEG-4 Part V, 2001] or even more, storing and delivering many media files is not a trivial task, especially if higher resolutions and frame rates are considered (e.g. HDTV 1080p). Please note that even for highly compressed, high-quality AV streams the data rates vary between 2 and 10 Mbps.
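
The figures above can be verified with simple arithmetic; the small C program below recomputes them. It assumes the convention of footnote 1, where "M" denotes 2^20 bits – with decimal prefixes the raw PAL rate would be about 124.4 Mbps instead of 118.5 Mbps, and the two-hour video would occupy about 3.56 GB instead of 3.48 GiB.

/* Back-of-the-envelope recalculation of the storage and bit-rate figures. */
#include <stdio.h>

int main(void)
{
    const double MBIT = 1024.0 * 1024.0;             /* binary mega-bit (footnote 1) */
    double raw_bps  = 720.0 * 576.0 * 12.0 * 25.0;   /* PAL 4CIF, YV12, 25 fps       */
    double comp_bps = raw_bps / 30.0;                /* 1:30 compression (MPEG-2)    */
    printf("raw PAL rate: %.1f Mbps\n", raw_bps / MBIT);    /* ~118.7 */
    printf("compressed:   %.3f Mbps\n", comp_bps / MBIT);   /* ~3.955 */

    double secs  = 2.0 * 3600.0;                     /* two hours                    */
    double video = 3.96 * MBIT * secs / 8.0;         /* bytes of video               */
    double audio = 128.0 * 1024.0 * secs / 8.0;      /* bytes of MP3 audio           */
    printf("video: %.2f GiB, audio: %.1f MiB\n",
           video / (1024.0 * 1024.0 * 1024.0),       /* ~3.48 GiB                    */
           audio / (1024.0 * 1024.0));               /* ~112.5 MiB                   */
    return 0;
}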

Secondly, the delivery becomes even more difficult because of the real-time constraints. For conventional files, variations of the transfer rate are negligible as long as all bits are transferred correctly; for streaming, in contrast, the most important factor is that a certain minimum transfer rate is sustained [Gemmel et al., 1995]. If we take the above example of a compressed audio-video stream, the transfer rate must at least equal the combined average throughput of 4.09 Mbps in order to allow the client to consume the multimedia stream. However, even this constraint is not sufficient. Variations of the transfer rate, which may occur due to network congestion or the server's inability to deliver the data on time, usually lead to a stuttering effect during play-out on the client side (i.e. the data required to continue playing have not arrived yet).

1 For example, a PAL video (4CIF) in the YV12 color scheme (12 bits per pixel) with a resolution of 720 x 576 pixels and 25 fps results in a requirement of about 118.5 Mbps. A compression ratio of 1:30, which is reasonable for MPEG-2 compression, gives for this video a value of 3.955 Mbps.

Buffering on the client side is used to reduce the network-throughput fluctuation problem by fetching multimedia data into a local cache and starting the play-out only when the buffer cache is filled up [Gemmel et al., 1995]. If the transfer rate suddenly decreases, the player can still use the data from the buffer, which is filled again when the transfer rate has recovered. On the other hand, the larger the buffer on the client side, the bigger the latency before the play-out starts, so the buffer size should be as low as required. Buffering overcomes short bottlenecks deriving from network overloads, but if the server cannot deliver data fast enough for a longer period of time, the media playback is affected anyway [Gemmel et al., 1995]. Therefore, enough resources have to be allocated and guaranteed on the server, and the resource allocation mechanism has to detect when the transfer reaches its limits and in such a case disallow further connections.
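
The effect of client-side buffering can be illustrated with the following minimal simulation (hypothetical rates and a one-second time step, not a model of any real player): a short drop of the arrival rate is absorbed by the pre-filled buffer, whereas a longer drop still leads to an underrun and thus to stuttering.

/* Toy play-out buffer: pre-fill, then consume at the stream rate. */
#include <stdio.h>

int main(void)
{
    const double consume = 4.09;   /* Mbit consumed per second of play-out       */
    const double prefill = 16.0;   /* ~4 s of data buffered before play-out      */
    /* arrival rate per second: normal, a 6-second congestion, then normal again */
    double arrival[20] = {5,5,5,5, 1,1,1,1,1,1, 5,5,5,5,5,5,5,5,5,5};
    double buffer = 0.0;
    int playing = 0;

    for (int t = 0; t < 20; t++) {
        buffer += arrival[t];
        if (!playing && buffer >= prefill)
            playing = 1;                      /* start play-out once pre-filled   */
        if (playing) {
            if (buffer >= consume)
                buffer -= consume;            /* smooth play-out from the buffer  */
            else {
                printf("t=%2d: underrun -> stuttering\n", t);
                buffer = 0.0;
            }
        }
    }
    return 0;
}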

Another approach is buffering within the network infrastructure, e.g. on caching and/or transcoding proxies in a tree-based distribution network [Li and Shen, 2005]. Various caching policies for the placement and replacement of media objects exist; one promising2 policy is caching in transcoding proxies for tree networks (TCTN), an optimized strategy using a dynamic-programming-based solution that considers transcoding costs through weighted transcoding graphs [Li and Shen, 2005]. In general, the problems of server unavailability mentioned before could be solved by buffering on the proxies; however, the complexity of an application coordinating the media transcoding grows enormously, because all network aspects and different proxy platforms must be considered.

2 At least, it outperforms the least recently used (LRU), least normalized cost replacement (LNC-R), aggregate effect (AE) and web caching in transcoding proxies for linear topology (TCLT) strategies.

II.3. Approaches to Format Independence

As already mentioned in the introduction, there are three research approaches in different fields using multimedia data which could be applied to the provision of format independence. These three approaches are defined within this work as: redundancy, adaptation and transcoding. They have been developed without application neutrality in mind and usually focus on specific requirements. One exception is the Digital Item Adaptation (DIA) based on the Universal Media Access (UMA) idea, which is discussed later.

II.3.1. Redundancy Approach

The redundancy approach provides format independence through multiple copies of the multimedia data, which are kept in many formats and/or various qualities. In other words, the MO is kept in a few preprocessed instances, which may have the same coding scheme but different predefined quality characteristics (e.g. a lower resolution), may have a different coding scheme but the same quality characteristics, or may differ in both the coding scheme and the quality characteristics. The disadvantages of this method are as follows:

• waste of storage – due to multiple copies representing the same multimedia information,
• only a partial format independence solution – i.e. it covers only a limited set of quality characteristics or coding schemes, because it is impossible to prepare copies in all potential qualities and coding schemes supporting yet undefined applications,
• very high start-up cost – in order to deliver different qualities and coding schemes, the multimedia data must be preprocessed with a cost of O(m·n·o) complexity (see the small example after this list), i.e. the number of MO instances, and thus their preparation, depends directly on:

  - m – the number of multimedia objects,
  - n – the number of provided qualities,
  - o – the number of provided coding schemes.
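
A quick calculation (with hypothetical collection sizes) makes the start-up cost tangible:

/* O(m*n*o): every object is pre-converted into every quality/coding-scheme
 * combination before it can be served. */
#include <stdio.h>

int main(void)
{
    long m = 10000;   /* media objects in the collection */
    long n = 3;       /* provided quality levels         */
    long o = 4;       /* provided coding schemes         */
    printf("instances to prepare and store: %ld\n", m * n * o);   /* 120000 */
    return 0;
}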

The biggest advantage is the relatively small cost of delivery and the possibility of using distribution techniques such as multicasting or broadcasting. The redundancy approach could be used as an imitation of format independence, but only in a limited set of applications where just a few classes of devices exist (e.g. Video-on-Demand). Therefore, this approach is not sufficient for format independence provision by an MMDBMS and is applicable only as an additional extension for optimization purposes at the cost of storage.

II.3.2. Adaptation Approach

Another approach, which could partly serve as a format independence solution, is the adaptation approach. Its goal is not to provide data independence (i.e. any format requested by the application), but rather to adapt an existing (scalable) format to the network environment and the end-device capabilities. Here, adaptation points at the borders of networks are defined (usually called media proxies or media gateways), which are responsible for adapting the transmitted multimedia stream to the end-point consumer devices, or rather to a given class of consumer devices. This brings the disadvantage of wasting network resources due to the bandwidth over-allocation of the delivery channel from the central point of distribution to the media proxies at the network borders. A second important disadvantage is that data are only ever dropped: the more proxies adapt the data between server and client, the lower the delivered data quality. A third drawback is the dependency on a static solution, because the scalable format cannot easily be changed once it has been chosen.

Three adaptation architectures of increasing complexity have been distinguished [Vetro, 2001]3; they are mainly useful for non-scalable formats (such as MPEG-2 [MPEG-2 Part II, 2001]):

1) Simple open-loop system – simply cutting the variable-length codes corresponding to the high-frequency coefficients (i.e. the higher-level data carrying less important information), used for bit-rate reduction; this involves only variable-length parsing (and no decoding); however, it produces the biggest error drift;

2) Open-loop system with partial decoding to un-quantized frequency domain – where the operations (e.g. re-quantization) are conducted on the decoded transformed coefficients in the frequency domain;

3) Closed-loop systems – where decoding down to the spatial/time domain (with pixel or sample values) is conducted in order to compensate the drift for the newly re-quantized data; however, only one coding scheme is involved due to the design (see the definition of the adaptation term); this is called the simplified spatial domain transcoder (SSDT).

3 The research work of Vetro focuses only on video data and these architectures are called "classical transcoding" architectures. However, within this work, Vetro's "classical transcoding" reflects the adaptation term.

These three architectures have been refined by [Sun et al., 2005]. Vetro's first architecture is called Architecture 1 – Truncation of the High-Frequency Coefficients and the second one Architecture 2 – Requantizing the DCT Frequency Coefficients; both are subject to drift due to their simplicity and operation in the frequency domain. They are useful for applications such as trick modes and extended-play recording in digital video recorders. There are also some optimizations proposed through constrained dynamic rate shaping (CDRS) or general (unconstrained) dynamic rate shaping (GDRS) [Eleftheriadis and Anastassiou, 1995], which operate directly on the coefficients in the frequency domain. The Architecture 3 – Re-Encoding with Old MVs and Mode Decisions and the Architecture 4 – Re-Encoding with Old MVs (and New MD) are resistant to drift error and reflect the closed-loop system from [Vetro, 2001]. They are useful for VoD and statistical multiplexing of multiple channels. Here an optimization is also proposed by simplifying the SSDT and creating a drift-free Frequency Domain Transcoder at reduced computational complexity, which performs the motion-compensation step in the frequency domain through approximate matrices computing the MC-DCT residuals [Assunncao and Ghanbari, 1998].
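
As an illustration of the open-loop idea (Architecture 2 above), the following sketch re-quantizes the DCT coefficients of one block with a coarser step directly in the frequency domain; it is a generic illustration, not the inverse-quantization formula of any particular codec, and it shows why drift appears: the prediction loop is never closed.

/* Open-loop re-quantization of one 8x8 block of quantized DCT levels. */
static void requantize_block(int level[64], int q_old, int q_new)
{
    for (int i = 0; i < 64; i++) {
        int coeff = level[i] * q_old;                            /* de-quantize */
        int sign  = (coeff < 0) ? -1 : 1;
        level[i]  = sign * ((sign * coeff + q_new / 2) / q_new); /* re-quantize */
    }
}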

Another adaptation approach is to use existing scalable formats and to operate directly on the coded bit stream by dropping the unnecessary parts. This significantly reduces the complexity of the adaptation process: the scalable-format adaptation is much simpler than the adaptations mentioned before because it does not require any decoding of the bit stream, i.e. it operates in the coded domain. A few scalable formats have been proposed recently, for example MPEG-4 Scalable Video Coding [MPEG-4 Part X, 2007] or MPEG-4 Scalable to Lossless Coding [Geiger et al., 2006; MPEG-4 Part III FDAM5, 2006], and a few scalable extensions to existing formats have been defined as well, e.g. the Fine Granularity Scalability (FGS) profile of the MPEG-4 standard [MPEG-2 Part II, 2001]. However, there is the unquestionable disadvantage of additional costs in coding efficiency caused by the additional scalability information included within the bit stream.
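
For scalable formats the adaptation can stay entirely in the coded domain, as the following hedged sketch shows (hypothetical per-layer bit rates and interface): the sender simply keeps the base layer and as many enhancement layers as fit the target rate, and drops the rest without decoding anything.

/* Select how many layers of a scalable stream fit into a target bit rate. */
static int select_layers(const double layer_bps[], int num_layers, double target_bps)
{
    double used = 0.0;
    int keep = 0;
    for (int i = 0; i < num_layers; i++) {
        if (used + layer_bps[i] > target_bps)
            break;                     /* this and all higher layers are dropped    */
        used += layer_bps[i];
        keep++;                        /* base layer first, then enhancement layers */
    }
    return keep;
}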

The biggest disadvantage of all adaptation approaches, whether using scalable or non-scalable formats, is still the assumption that the storage format has to be defined or standardized with respect to the end-user applications. Or even worse, if the problem is treated from the distributor's perspective, only consumers compliant with the standard storage format can be supported by the adaptation architecture. As such, there remains the drawback of having just a partial format independence solution, similar to the one present in the redundancy approach. And even though one could compare transcoding to scalable coding with adaptivity [Vetro, 2003], the adaptation of one universal format to different qualities is not considered a fully-fledged data independence solution usable by an MMDBMS.

II.3.3. Transcoding Approach

The last, but not least, approach is the transcoding approach. It is based on multimedia conversion from the internal (source) format to the external (requested) format. In general, there are two methodologies: on-demand (on-line) and off-line. The off-line approach is a best-effort solution where no delivery time is guaranteed and the MO is converted to the requested format completely before delivery starts. Obviously, this introduces huge latencies in response time. The on-demand method is meant for delivering multimedia data right after the client request is received, i.e. the transcoding starts as soon as the request for data appears. In this case, there are two types of on-line transformations: real-time and best-effort. While in the first one a sophisticated mechanism for QoS control may be implemented, in the second case execution and delivery guarantees cannot be given.

On the other hand, [Dogan, 2001] discusses video transcoding in two aspects: homogeneous and heterogeneous. Homogeneous video transcoders only change the bit rate, frame rate, or resolution, while heterogeneous video transcoding allows for transformations between different formats, coding schemes and network topologies, e.g. between different video standards like H.263 [ITU-T Rec. H.263+++, 2005] and MPEG-4 [MPEG-4 Part II, 2004]. In analogy to Dogan's definitions, homogeneous transcoding is treated as the adaptation of the previous section, heterogeneous transcoding is discussed subsequently within this section, and both aspects are considered with respect to audio as well as video.

The transcoding approach is exploited within the MPEG-21 Digital Item Adaptation (DIA) standard [MPEG-21 Part VII, 2004], which is based on the Universal Media Access (UMA) idea and tries to cope with the "mismatch" problem. As stated in the overview of DIA [Vetro, 2004], the DIA is "a work to help alleviate some of burdens confronting us in connecting a wide range of multimedia content with different terminals, networks, and users. Ultimately, this work will enable (…) UMA.". The DIA defines an architecture for Digital Item4 [MPEG-21 Part I, 2004] transformation delivering format-independent mechanisms that provide support in terms of resource adaptation, description adaptation, and/or QoS management, which are collectively referred to as DIA Tools. The DIA architecture is depicted in Figure 1. However, the DIA describes only the tools assisting the adaptation process but not the adaptation engines themselves.

Figure 1. Digital Item Adaptation architecture [Vetro, 2004].

The DIA also covers other aspects related to the adaptation process. The usage environment tools describe four environmental system conditions: the terminal capabilities (incl. codec capabilities, I/O capabilities, and device properties), the network characteristics (covering static network capabilities and network conditions), the user characteristics (e.g. the user's information, preferences and usage history, presentation preferences, accessibility characteristics and location characteristics such as mobility and destination), and the natural environment characteristics (current location and time or the audio-visual environment). Moreover, the DIA architecture proposes not only multimedia data adaptation, but also bitstream syntax description adaptations, which could easily be called meta-data adaptations, being on a higher logical level and allowing the adaptation process to work in the coded bitstream domain in a codec-independent manner. An overview of the adaptation on the bitstream level is depicted in Figure 2.

4 A Digital Item is understood as the fundamental unit of distribution and transaction within the MPEG-21 multimedia framework, representing the "what" part. As such, it reflects the definition of the media object within this work, i.e. DI is equal to MO. The detailed definition of what the DI exactly is can be found in Part 2 of MPEG-21 containing the Digital Item Declaration (DID) [MPEG-21 Part II, 2005], which is divided into three normative sections: model, representation, and schema.

Figure 2. Bitstream syntax description adaptation architecture [Vetro, 2004].

To summarize, if the DIA really supported a format-independent mechanism for adaptation, i.e. allowed for transformation from one coding scheme to a different one (which seems to be the case, at least from the standard description), it should not be called adaptation anymore but rather Digital Item Transformation, and then it is to be treated without any doubt as a transcoding approach and definitely not as an adaptation approach.

II.3.3.1 Cascaded transcoding

The cascaded transcoding approach is understood in the context of this work as a straightforward transformation using a cascade transcoder, with complete decoding from one coding scheme and complete encoding to another one, and it has the maximum complexity of the described solutions [Sun et al., 2005]. Due to this complexity, conversion-based format independence built on the cascaded approach has not been widely accepted as usable in real applications. For example, when transforming one video format into another, a full recompression of the video data demanding expensive decoding and encoding steps5 is required. So, in order to achieve reasonable processing speed, modern video encoders (e.g. the popular XviD MPEG-4 encoder) employ sophisticated block-matching algorithms instead of a straightforward full search in order to reduce the complexity of motion estimation (ME). Often, predictive algorithms like EPZS [Tourapis, 2002] are used, which offer a 100-5000 times speed-up over a full search while achieving similar picture quality. The performance of predictive search algorithms, however, highly depends on the characteristics of the input video (and is especially low for sequences with irregular motion). This content-dependent and unpredictable characteristic of the ME step makes it very difficult to predict the behaviour of a video encoder, and thus interferes with making video encoders part of a real-time transcoding process within the cascaded transcoder.

5 In multimedia storage and distribution, asymmetric compression techniques are usually used. This means that the effort spent on encoding is much higher than on decoding – the ratio may reach 10 times and more.
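
The cost of a straightforward full search can be sketched as follows (illustrative C code, not the XviD or EPZS implementation; frame border handling is omitted): for a single 16x16 macroblock and a +/-16 pixel search window, (2*16+1)^2 = 1089 SAD evaluations are needed, which is exactly the work that predictive algorithms avoid – at the price of content-dependent, hard-to-predict execution times.

/* Full-search block matching with the sum of absolute differences (SAD). */
#include <stdlib.h>

static int sad16(const unsigned char *cur, const unsigned char *ref, int stride)
{
    int sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

static void full_search(const unsigned char *cur, const unsigned char *ref,
                        int stride, int *best_dx, int *best_dy)
{
    int best = 1 << 30;
    for (int dy = -16; dy <= 16; dy++)
        for (int dx = -16; dx <= 16; dx++) {
            int sad = sad16(cur, ref + dy * stride + dx, stride);
            if (sad < best) { best = sad; *best_dx = dx; *best_dy = dy; }
        }
}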

II.3.3.2 MD-based transcoding

The unpredictability of classical transcoding can be eliminated without compromising compression efficiency by using meta-data (MD) [Suchomski et al., 2005; Vetro, 2001]. There are plenty of proposals for simplifying the transcoding process by exploiting MD-based multimedia conversion. For example, the meta-data guiding the transcoder specifically for video content are divided into low-level and high-level features [Vetro et al., 2001], where the low-level features refer to color, motion, texture and shape, and the high-level features may include storyboard information or the semantics of the various objects in the scene. Other research in the video-processing and networking fields has discussed specific transcoding problems in more detail [Sun et al., 2005; Vetro et al., 2003], like object-based transcoding, MPEG-4 FGS-to-SP, or MPEG-2-to-MPEG-4, and MD-based approaches in particular have been proposed by [Suzuki and Kuhn, 2000] and [Vetro et al., 2000].

[Suzuki and Kuhn, 2000] proposed the difficulty hint, which assists in bit allocation during transcoding by providing information about the complexity of one segment with respect to the other segments. It is represented as a weight in the range [0,1] for each segment (a frame or a sequence of frames), obtained by normalizing the bits spent on this segment by the total bits spent on all segments. Thus, the rate-control algorithm may use this hint for controlling the transcoding process and optimizing the temporal allocation of bits (if variable bit rate (VBR) is allowed). One constraint should be considered during the calculation of the hint, namely that it should be calculated at a fine QP; still, the results may vary [Sun et al., 2005]. The specific application of the difficulty hint and some other issues such as motion uncompensability and search-range parameters are further discussed in [Kuhn and Suzuki, 2001].
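
The difficulty hint itself reduces to a simple normalization, sketched below (generic C, not the reference implementation): the per-segment bit counts are turned into weights in [0,1] that sum to one and can then drive the bit allocation of the rate control.

/* Difficulty hints: bits spent per segment normalized by the total bits. */
static void difficulty_hints(const double bits[], int n, double hint[])
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += bits[i];
    for (int i = 0; i < n; i++)
        hint[i] = (total > 0.0) ? bits[i] / total : 0.0;   /* weight in [0,1] */
}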

[Vetro et al., 2000] proposes the shape change and motion intensity hints, which are especially usable for supporting data dependency in object-based transcoding. The well-known problem of variable temporal resolution of the objects within a visual sequence has already been investigated in the literature and two shape-distortion measures have been proposed: the Hamming distance (the number of differing pixels between two shapes) and the Hausdorff distance (a maximum function between two sets of pixels based on the Euclidean distance between the points). [Vetro, 2001] proposes to derive the shape change hint from one of these measures after normalization to the range [0,1]: by dividing by the number of pixels within the object for the Hamming distance, or by the maximum width or height of the rectangle bounding the object or the frame for the Hausdorff distance. The motion intensity hint is defined as a measure of the significance of the object and is based on the intensity of motion activity6 [MPEG-7 Part III, 2002], the number of bits, a normalizing factor reflecting the object size, and a constant (greater than zero) usable for zero-motion objects [Vetro et al., 2000]. Larger values of the motion intensity hint indicate a higher importance of the object and may be used for decisions on the quantization parameter or on skipping with respect to each object.
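
A minimal sketch of the shape change hint derived from the Hamming distance follows; the binary-mask representation and the clipping to [0,1] are assumptions of this illustration, not taken from [Vetro, 2001].

/* Shape change hint: differing mask pixels normalized by the object size. */
static double shape_change_hint(const unsigned char *mask_a,
                                const unsigned char *mask_b,
                                int num_pixels, int object_pixels)
{
    int differing = 0;
    for (int i = 0; i < num_pixels; i++)
        if (mask_a[i] != mask_b[i])
            differing++;                                    /* Hamming distance        */
    double hint = (object_pixels > 0) ? (double)differing / object_pixels : 0.0;
    return (hint > 1.0) ? 1.0 : hint;                       /* clip to [0,1] (assumed) */
}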

A collective solution has been proposed by MPEG-7, where the MediaTranscodingHints descriptor in the MediaProfile descriptor of the Media Description Tools [MPEG-7 Part V, 2003] has been defined. The transcoding properties proposed there fit the existing implementations of encoders, as required for MMDBMS format independence provision, only partly. Among others, the MediaTranscodingHints define useful general motion, shape and difficulty coding hints7. On the other hand, a property like "intraFrameDistance" [MPEG-7 Part V, 2003] is not a good hint for encoding, since scene changes may appear unpredictably (and usually intra frames are used then). Thus, the intraFrameDistance should be treated rather as a constraint than as a hint. Regardless, in comparison to the MD set defined later within this work, a few parameters of MPEG's MDT are similar (e.g. importance is somehow related to the MB priority) and some very important ones are still missing (e.g. frame type, or MB mode suggestions).

6 The intensity of motion activity is defined by MPEG-7 as the standard deviation of the motion vector magnitudes.

7 Suzuki, Kuhn, Vetro, and Sun have taken an active part in the MPEG-7 standard development, especially in the part responsible for defining meta-data for transcoding (MediaTranscodingHints) [MPEG-7 Part V, 2003]. As such, the transcoding hints derive from their earlier works (discussed in the paragraphs above).

In general, the focus of the previous research is on the transcoding itself, including functional compatibility, compression efficiency and network issues, because the goal was to simplify the execution in general, but not to predict the required resources or to adapt the process to a real-time environment (RTE). Concerns such as limiting the motion search range, or improving the bit allocation between different objects by identifying key objects or by regulating the temporal resolution of objects, are at the center of interest. The subjects connected with real-time processing, scheduling of transcoders and QoS control in the context of the MMDBMS (i.e. format independence and application neutrality) are not investigated, and so no meta-data set has been proposed to support coping with these topics.

II.4. Video and Audio Transformation Frameworks

The transformation of continuous multimedia data has been a well-researched topic for a fairly long time. Many applications for converting or transforming audio and video can be found. Some built-in support for multimedia data is also already available in a few operating systems (OSes), which are called multimedia OSes for that reason, but usually this support can neither meet all requirements nor handle all possible kinds of continuous media data. In order to better solve the problems associated with the large diversity of formats, frameworks have been proposed; but to my knowledge, there is a lack of collective work presenting a theory of multimedia transformations in a broad range and in various applications, i.e. considering the other solutions, various media types, real-time processing, and QoS control at the same time and for different applications. One very good book aspiring to cover almost all of these topics, but only with respect to video data, is [Sun et al., 2005], and this work refers to it many times.

There are many audio and/or video transformation frameworks. They are discussed within this section, and every effort has been made to keep the description sound and complete. However, the author cannot guarantee that no other related framework exists.

II.4.1. Converters and Converter Graphs

The well-accepted and most general research approach comes from the signal processing field and is based on converters (filters) and converter graphs. It has been covered extensively in multimedia networking and mobile communication, so the idea is not new; it is rooted in [Pasquale et al., 1993], supported by [Candan et al., 1996; Dingeldein, 1995] and extended by Yeadon in his doctoral dissertation [Yeadon, 1996]. A more recent approach is described by Marder [Marder, 2002]. All of them, however, restrict the discussion to the function and consider neither execution time nor real-time processing nor QoS control. Typical implemented representatives are Microsoft DirectShow, Sun's Java Media Framework (JMF), CORBA A/V Streams, and the Multi Media Extension (MME) Toolkit [Dingeldein, 1995].

The pioneers have introduced some generalizations of video transformations in [Candan et al., 1996; Pasquale et al., 1993; Yeadon, 1996]. A filter as a transformer of one or more input streams of a multi-stream into an output (multi-)stream has been introduced by [Pasquale et al., 1993]; in other words, after the transformation the output (multi-)stream replaces the input (multi-)stream. The filters have been classified into three functional groups: selective, transforming, and mixing. These classes have been extended by [Yeadon, 1996] into five generic filter mechanisms: hierarchical, frame-dropping, codec, splitting/mixing, and parsing. Yeadon also proposed the QoS-Filtering Model, which uses a few key objects to constitute the overall architecture: sources, sinks, filtering entities, streams, and agents. [Candan et al., 1996] proposed collaborators capable of displaying, editing and conversion within the collaborative multimedia system called c-COMS, which is defined as an undirected weighted graph consisting of a set of collaborators (V), connections (E) and the cost of information transmission over a connection (ρ). Moreover, it defines collaboration formally, and discusses quality constraints and a few object synthesis algorithms (OSAs). [Dingeldein, 1995] proposes a GUI-based framework for interactive editing of continuous media supporting synchronization and mixing. It supports media objects divided into complex media (Synchronizer, TimeLineController) as composition and simple objects (audio, video data) as media control, and defines ports (source and sink) for processing. An adaptive framework for developing multimedia software components, called the Presentation Processing Engine (PPE) framework, is proposed in [Posnak et al., 1997]. PPE relies on a library of reusable modules implementing primitive transformations [Posnak et al., 1996] and proposes a mechanism for composing processing pipelines from these modules. There are other published works, e.g. [Margaritidis and Polyzos, 2000] or [Wittmann and Zitterbart, 1997], but they follow or somehow adapt the above-mentioned classifications and approaches of the earlier research and do not introduce breakthrough ideas. However, all of the mentioned works consider only aspects of the communication layer (networking) or the presentation layer, which is not sufficient when talking about multimedia transformations in the context of an MMDBMS.
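
The common denominator of these approaches can be captured in a few lines of C (a simplified sketch with hypothetical format identifiers, not one of the cited APIs): each converter declares an input and an output format, and a chain – the simplest form of a converter graph – is valid if the formats match pairwise from the source to the sink.

/* Minimal converter-chain model: formats must match along the chain. */
#include <stddef.h>
#include <stdbool.h>

typedef enum { FMT_LLV1, FMT_RAW_YV12, FMT_MPEG4_ASP } format_t;

struct converter {
    const char *name;
    format_t in_fmt;
    format_t out_fmt;
    /* a real converter would also expose a process() function per quant */
};

static bool chain_is_valid(const struct converter *chain, size_t len,
                           format_t source, format_t sink)
{
    format_t current = source;
    for (size_t i = 0; i < len; i++) {
        if (chain[i].in_fmt != current)
            return false;               /* mismatch: the graph cannot be built */
        current = chain[i].out_fmt;
    }
    return current == sink;
}

/* Example chain: LLV1 decoder -> raw video -> MPEG-4 encoder. */
static const struct converter demo_chain[] = {
    { "llv1-decoder",  FMT_LLV1,     FMT_RAW_YV12  },
    { "mpeg4-encoder", FMT_RAW_YV12, FMT_MPEG4_ASP },
};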

VirtualMedia is another work of very high importance [Marder, 2000]; it defines a theory of multimedia metacomputing as a new approach to the management and processing of multimedia data in web-based information systems. A solution for application independence of multimedia data (called transformation independence) through an advanced abstraction concept has been introduced by [Marder, 2001]. The work discusses theoretically several ideas like device independence, location transparency, execution transparency, and data independence. Moreover, an approach to constructing a set of connected filters, a description of the conversion process, and an algorithm to set up the conversion graph have been proposed in subsequent work [Marder, 2002], where individual media and filter signatures are used for creating transformation graphs. However, no implementation exists as a proof of concept.

II.4.1.1 Well-known implementations

Microsoft's DirectX platform is an example of a media framework in an OS-specific environment. DirectShow [Microsoft Corp., 2002a] is the most interesting part of the DirectX platform with respect to audio-video transformations. It deals with multimedia files and uses a filter-graph manager and a set of filters working with different audio and/or video stream formats and coding schemes. The filters (also called media codecs) are specially designed converters supporting the DirectShow internal communication interface. Filter graphs are built either manually, by the programmer creating a known execution path consisting of defined filters, or automatically, by comparing the output formats provided by a previously selected filter with the acceptable input formats of a potential following media codec [Microsoft Corp., 2002b]. It provides mechanisms for stream synchronization according to the OS time; however, the transformation processes cannot get execution and time guarantees from the best-effort OS. Moreover, DirectX is only available under one OS family, so its use at the client side is limited.

Sun Microsystems, a competing company, has also provided a media framework analogous to MS DirectShow. The Java Media Framework (JMF) [Sun Microsystems Inc., 1999] uses the Manager object to cope with Players, Data Sources, Data Sinks and Processors. The processors are equivalents of converter graphs and are built from processing components called plug-ins (i.e. converters) ordered in transformation tracks within the processor. Processors can be configured using suitable controls by hand, i.e. constructed by the programmer on the fly (TrackControl), or on the basis of a predefined processor model, which specifies input and output requirements, or the processor can be left to auto-configure itself by specifying only the output format [Sun Microsystems Inc., 1999]. In contrast to DirectShow filter graphs, processors can be combined with each other, which introduces one additional logical level of complexity in the hierarchy between simple converters and very complicated graphs. Moreover, JMF is not limited to just one OS due to the Java platform-independence properties provided through Java Virtual Machines (JVMs), but still it does not support QoS control and no execution guarantees in real time can be given8. One disadvantage may derive from the taken-for-granted inefficiency of Java applications in comparison to low-level languages and platform-specific implementations, although a detailed efficiency investigation and benchmarking would be advisable.

8 Time line, timing and synchronization in JMF are provided through an internal time model defining objects such as Time, Clock, TimeBase and Duration (all with nanosecond precision).

Transcode [Östreich, 2003] is the third related implementation that has to be mentioned. It is an open-source program for audio and video transcoding and is still under development; even so, reliable versions are available and can be very useful for experienced users. Transcode's goal was to be a popular utility for audio and video data processing that runs under an operating system's text console, allowing shell scripting and parametric execution for the purpose of automation. In contrast, the previously mentioned frameworks require the development of an application before their transcoding capabilities can be used. The approach is analogous to cascaded transcoding, which uses raw (uncompressed) data between coded inputs and outputs, i.e. transcoding is done by loading modules that are responsible either for decoding and feeding Transcode with raw video or audio streams (import modules), or for encoding the stream from the raw to the encoded representation (export modules). Up to now, the tool supports many popular formats (AVI, MOV, ES, PES, VOB, etc.) and compression methods (video: MPEG-1, MPEG-2, MPEG-4/DivX/XviD, DV, M-JPEG; sound: AC3, MP3, PCM, ADPCM), but it does not support QoS control and is developed as a best-effort application without real-time processing support.

II.4.2. End-to-End Adaptation and Transcoding Systems

Other work in the field of audio and video transformation relates directly to the concept of audio and video adaptation and transcoding as a method allowing for interoperability in heterogeneous networks by changing the container format, the structure (resolution, frame rate), the transmission rate, and/or the coding scheme, e.g. the MPEG transcoder [Keesman et al., 1996], the MPEG-2 transcoder [Kan and Fan, 1998], or the low-delay transcoder [Morrison, 1997]. Here the converter is referred to as a transcoder. [Sun et al., 2005], [Dogan, 2001] and [Vetro, 2001] give overviews of video transcoding and propose solutions. However, [Dogan, 2001] covers only H.263 and MPEG-4, and does not address the problem of transformation between different standards, which is a crucial assumption for format independence.

Figure 3. Adaptive transcoding system using meta-data [Vetro, 2001].

[Vetro, 2001] proposes an object-based transcoding framework (Figure 3) that is, among all the referenced related works, the solution most similar to the transformation framework of the RETAVIC architecture; therefore it is described in more detail. The author defines a feature-extraction part, executed only in non-real-time scenarios, that generates descriptors and meta-data describing the characteristics of the content. However, he does not specify the set of produced descriptors or meta-data elements – he just proposes to use the shape change and motion intensity transcoding hints (discussed earlier in section II.3.3.2 MD-based transcoding). Moreover, he proposes these hints solely for the purpose of functional decisions made by the transcoding control, and more precisely by the analysis units responsible for the recognition of shape importance, the temporal decision (such as frame skip) and the quantization parameter selection. The author has also mentioned two additional units for resize analysis and texture shape analysis (for reduction of the shape's resolution) in other work [Vetro et al., 2000]. Further, [Vetro, 2001] names two major differences to other research [Assunncao and Ghanbari, 1998; Sun et al., 1996], namely the inclusion of the shape hint within the bit stream and some new tools adopted for DC/AC prediction with regard to texture coding; no other descriptors or meta-data are mentioned. The transcoding control is used only for controlling the functional properties of the transcoders. The core of the object transcoders is analogous to the multi-program transcoding framework [Sorial et al., 1999] and the only difference is the input stream – the object-based MPEG-4 video streams do not correspond to frame-based MPEG-2 video program streams9. The issues of real-time processing and QoS control are not considered – neither in the meta-data set nor in the transcoder design. Thus, this work extends Vetro's research partially in the functional aspects and entirely in the quantitative aspects of transcoding.

9 The issue of the impossibility of frame skipping in MPEG-2 is a well-known problem, bypassed by spending one bit per macro block to mark each MB in the frame as skipped; thus it is not cited here.

Many examples of end-to-end video streaming and transcoding systems are discussed in [Sun et al., 2005]. For example, the MPEG-4 Fine Granular Scalability (FGS) to MPEG-4 Simple Profile (SP) transcoder is mentioned in the 3rd chapter, and the spatial and temporal resolution reduction is discussed on the functional level in the 4th chapter, accompanied by a discussion of motion vector refinement and requantization. The "syntactical adaptation" being one-to-one (binary-format) transcoding is discussed for JPEG-to-MPEG-1, MPEG-2-to-MPEG-1, DV-to-MPEG-2, and MPEG-2-to-MPEG-4. Some more issues such as error-resilient transcoding, logo insertion, watermarking, picture switching and statistical multiplexing are also discussed in the 4th chapter of [Sun et al., 2005]. Finally, the novel picture-in-picture (PIP) transcoding for H.264/AVC [ITU-T Rec. H.264, 2005] discusses two cases: the PIP Cascade Transcoder (PIPCT) and the optimized Partial Re-Encoding Transcoder Architecture (PRETA). However, all the transcoder examples mentioned in this paragraph are either adaptations (homogeneous or unary-format transcoding) or binary-format (one-to-one) transcoding, and no transformation framework providing one-to-many or many-to-many coding-scheme transcoding, i.e. a "real" heterogeneous solution, is proposed. Moreover, they do not consider real-time processing and QoS control.

Yet another example is discussed in Chapter 11 of [Sun et al., 2005] – the real-time server containing a transcoder operating on precompressed content is given in Figure 11.2 on p. 329 of that book10. A few elements of the server side are common with our architecture, but a few critical ones are still missing (e.g. content analysis, MD-based encoding). Moreover, the discussed test-bed architecture is MPEG-4 FGS-based, which is also one of the adaptation solutions. An extension of the test bed to MPEG-4 transcoding is proposed, but it assumes that only several requested resolutions exist, all being delivered in the MPEG-4 format (which is not the RETAVIC goal), and the application of an MD-based encoding algorithm is still not considered.

10 The complete multimedia test-bed architecture is also depicted in Figure 11.5 on p. 399 of [Sun et al., 2005].

Finally, there are also other, less related proposals enhancing media data transformation during delivery to the end client. [Knutsson et al., 2003] proposes an extension to the HTTP protocol to support server-directed transcoding on proxies; even though it states that any kind of data could be managed in this way, only static image data and no other media types, especially continuous ones, are investigated. [Curran and Annesley, 2005] discusses the transcoding of audio and video for mobile devices constrained by bandwidth, and additionally discusses properties of media files and their applicability in streaming with respect to the device type. The framework is not presented in detail, but it is based on JMF and considers neither QoS nor real-time processing – at least, neither is mentioned anywhere, and within the transcoding algorithm there is no step referring to any kind of time or quality control (it is just a best-effort solution). The perceived quality evaluation is done by means of mean opinion scores (MOS) in off-line mode, and total execution times are measured. An audio streaming framework is proposed in [Wang et al., 2004], but only an algorithm for multi-stage interleaving of audio data and a layered, unequal-sized packetization useful in error-prone networks are discussed, and no format transformations are mentioned.

II.4.3. Summary of the Related Transformation Frameworks

Summarizing, there are interesting solutions for media transformations that are ready to be applied in certain fields, but there is still no solution that supports QoS, real-time processing, and format independence in a single architecture or framework directly applicable in the MMDBMS, where the specific properties of multimedia databases, such as the presence of the data and of a meta-data repository describing the data characteristics, could be exploited.

Thus the RETAVIC media transformation framework [Suchomski et al., 2005] is proposed, where each media object is processed at least twice before being delivered to a client and media transformations are assisted by meta-data. This is analogous to two-pass encoding techniques in video compression [Westerink et al., 1999], so the optimization techniques deriving from the two-pass idea can also be applied to the RETAVIC approach. However, the RETAVIC framework goes beyond that – it heavily extends the idea of MD-assisted processing, employs meta-data to reduce the complexity and enhance the predictability of the transformations in order to meet real-time requirements, and proposes an adaptive converter model to provide QoS control.
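
The two-phase idea can be summarized by the following conceptual sketch (hypothetical fields and numbers, not the actual RETAVIC interfaces): the non-real-time preparation pass analyzes each frame once and stores per-frame meta-data; the real-time delivery pass then uses the stored predicted cost for feasibility checks and the stored hints instead of repeating the content-dependent analysis.

/* Toy illustration of MD-assisted two-phase processing. */
#include <stdio.h>

struct frame_md {
    char frame_type;   /* suggested 'I' or 'P' decision                    */
    int  cost_ms;      /* predicted worst-case encoding cost of the frame  */
};

int main(void)
{
    /* Preparation phase (non-real-time): meta-data stored next to the MO. */
    struct frame_md md[5] = { {'I', 18}, {'P', 7}, {'P', 8}, {'P', 6}, {'I', 17} };

    /* Delivery phase (real-time): 40 ms budget per frame at 25 fps.       */
    for (int i = 0; i < 5; i++) {
        if (md[i].cost_ms > 40) {
            printf("frame %d: infeasible, adapt quality\n", i);
            continue;
        }
        printf("frame %d: encode as %c (predicted %d ms)\n",
               i, md[i].frame_type, md[i].cost_ms);
    }
    return 0;
}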

II.5. Format Independence in Multimedia Management Systems

Even though plenty of research has been done in the direction of audio and video transcoding, as mentioned in the previous sections, the currently available media servers and multimedia database management systems have huge deficiencies with respect to format independence for media objects, especially when considering audio and video data. They offer either data independence or real-time support, but not both. The media servers that have been analyzed and compared support only a small number of formats [Bente, 2004], not to mention the possibility to transform one format into another when the output format is defined by the end user. Some attempts towards format independence are made with respect to quality scalability, but only within one specific format, i.e. the adaptation approach. The only server allowing a limited transformation (from RealMedia Format to Advanced Streaming Format) supports neither QoS control nor real-time processing. Besides, the user cannot freely specify quality attributes such as a different resolution or a higher compression, and at most can only choose between predefined qualities of a given format (redundancy approach).

II.5.1. MMDBMS

The success of using database management systems for organizing large amounts of character-based data (e.g. structured or unstructured text) has led to the idea of extending them to support multimedia data as well [Khoshafian and Baker, 1996]. In contrast to multimedia servers, which are designed to simply store the multimedia data and manage access to them in analogy to file servers [Campbell and Chung, 1996], multimedia database systems handle the multimedia data in a more flexible way, without redundancy, allowing them to be queried through the standardized Structured Query Language (SQL) [Rakow et al., 1995], which may deliver data independence and application neutrality. Moreover, an MMDBMS is responsible for coping with a multi-user environment and for controlling parallel access to the same data.

Before going into the details of each system, a short summary of the demands on a state-of-the-art multimedia database management system (MMDBMS), based on [Meyer-Wegener, 2003], is given. The multimedia data are stored and retrieved as media objects (MOs); however, storing means much more than just storing files (as in media servers). Other algorithms, e.g. manipulation of the content for video production or broadcasting, should not be implemented on the MMDBMS side. The majority of DBMSes already have built-in support for binary large objects (BLOBs) for integrating arbitrary binary data, but BLOBs are not really suited to the various media data, because there is no differentiation between media types, and thus neither an interpretation nor the specific characteristics of the data can be exploited. An MO should be kept as a whole, without division into separate parts (e.g. an image and not separate pixels with positions; a video and not each frame as an image; an audio stream and not each sample individually). The exceptions are scalable media formats and coding schemes, but these require further media-specific MO refinement. The data independence (including storage device and format) of MOs shall be provided due to its crucial importance for any DBMS. Device independence is more critical for multimedia data than format independence, because of hierarchical access driven by the frequency of use of the data, where the access times of the storage devices differ. However, format independence may not be neglected if data independence is to be fully supported, especially for avoiding data redundancy, supporting long-lived applications and neutrality, allowing internal format updates without influencing the outside world, etc. Additionally, an MMDBMS must prevent data inconsistencies if multiple copies are required in any circumstances (e.g. due to optimization of delivery at the cost of storage, in analogy to materialized views). More sophisticated search capabilities, not only by name or creation date, have to be provided – for example using indexes or meta-data delivered by media-specific recognition or analysis algorithms. The indexes or meta-data can be produced by hand or automatically in advance (see footnote 11). Finally, the time constraints of continuous data have to be considered, which means implementing: a) a best-effort system with no quality control that simply works fast enough to deliver in real time, b) a soft-RTOS-based system with just statistical QoS, or c) a hard-RTOS-based system providing scheduling algorithms and admission control, thus allowing exact QoS control and precise timing. Summarizing, an MMDBMS combines the advantages of a conventional DBMS with the specific characteristics of multimedia data.

Researchers have already delivered many solutions considering various perspectives on the evolution of MMDBMSes. In general, they can be classified into two functional extensions: 1) a focus on the “multi” part of multimedia, or 2) coping with device independence, i.e. access transparency. The goal of the first group is to provide integrated support for many different media types in one complete architecture. Here METIS [King et al., 2004] and MediaLand [Wen et al., 2003] can be named as representatives. METIS is a Java-based unified multimedia management solution for arbitrary media types including a query processor, persistence abstraction, web-based visualization and semantic packs (containers for semantically related objects). In contrast, MediaLand, coming from Microsoft, is a prototypical framework for uniform modeling, managing and retrieving of various types of multimedia data by exploiting different querying methodologies such as standard database queries, information-retrieval and content-based retrieval techniques, and hypermedia (graph) modeling.

The second group, coping with device independence, proposes solutions such as SIRSALE [Mostefaoui et al., 2002] or the Mirror DBMS [van Doorn and de Vries, 2000]. These are especially useful for distributed multimedia data. The first one, SIRSALE, proposes a modular indexing model for searching in various domains of interest and independently of the device, while the other, the Mirror DBMS, solves only the problem of physical data independence with respect to storage access and its querying mechanism. However, neither discusses data independence with respect to the format of the multimedia data.

11 On-demand analysis at query time is too costly (not to say unfeasible) and has to be avoided.

From another perspective, there are commercial DBMSs available, such as Oracle, IBM’s DB2, or Informix. These are complex and reliable relational systems with object-oriented extensions; however, they lack direct support for multimedia data, which can be handled only through the mentioned object-relational extensions, e.g. by additional implementation of special data types and related methods for indexing and query optimization. Thus, Informix offers DataBlades as extensions, DB2 offers Extenders, and Oracle proposes interMedia Cartridges. All of them provide some limited level of format independence through data conversion. The most sophisticated solution is delivered by Oracle interMedia [Oracle Corp., 2003]. There, ORDImage.Process() allows converting a stored picture to different image formats. Moreover, the functions processAudioCommand() included in the ORDAudio interface and processVideoCommand() present in ORDVideo are used for media processing calls (i.e. also transcoding), but these are just interfaces for passing the processing commands to plug-ins that still have to be implemented for each user-defined audio and video format. Such a format-specific implementation would lead to the M-to-N conversion problem, because there has to be an implemented solution for each format handing over the data in the user-requested format. Moreover, Oracle interMedia allows for storing, processing and automatic extraction of meta-data covering the format (e.g. MIME type, file container, coding scheme) and the content (e.g. author, title) [Jankiewicz and Wojciechowski, 2004]. The other system, DB2, proposes Extenders for a few types of media, i.e. Image, Audio and Video. However, it provides conversion only for the DB2Image type and for none of the continuous media types [IBM Corp., 2003].

There has also been some work done on extensions of the declarative Structured Query Language (SQL) to support multimedia data. As a result, the multimedia extension to SQL:1999 under the name SQL Multimedia and Application Packages (known in short as SQL/MM) has been proposed [Eisenberg and Melton, 2001]. The standard [JTC1/SC32, 2007] defines a number of packages of generic data types common to various kinds of data used in multimedia and application areas, in order to enable these data to be stored and manipulated in an SQL database. It currently covers five parts:

1) Part 1: Framework – defines the concepts, notations and conventions common to two or more other parts of the standard, and in particular explains the user-defined types and their behavior;

2) Part 2: Full-Text – defines the full-text user-defined types and their associated routines;

3) Part 3: Spatial – defines the spatial user-defined types, routines and schemas for generic spatial data handling, covering aspects such as geometry, location and topology, usable by geographic information systems (GIS), decision support, data mining and data warehousing systems;

4) Part 5: Still Image – defines the still-image user-defined types and their associated routines for generic image handling, covering characteristics such as height, width, format, coding scheme, color scheme, and image features (average color, histograms, texture, etc.), but also operations or methods (rotation, scaling, similarity assessment);

5) Part 6: Data Mining – defines the data-mining user-defined types and their associated routines covering data-mining models, settings and test results.

However, SQL/MM still falls short of the possibilities offered by abstract data types for timed multimedia data and is not fully implemented in any well-known commercial system. Oracle 10g implements only Part 5 of the standard through the SQL/MM StillImage types (e.g. SI_StillImage), which is an alternative, standard-compliant solution to the more powerful Oracle-specific ORDImage type [Jankiewicz and Wojciechowski, 2004]. Oracle 10g also partially supports the full-text and spatial parts of SQL/MM (i.e. Part 2 and Part 3).

To the best knowledge of the author of this thesis, neither the prototypes nor the commercial MMDBMSs have considered format independence of continuous multimedia data. The end-user perspective is likewise neglected, because the systems do not address the requirement of delivering a variety of formats on request. None of the systems provides format and quality conversion through transcoding, besides the adaptation possibility deriving solely from the use of a scalable format. What is more, serving different clients with respect to their hardware properties and limitations (e.g. mobile devices vs. multimedia set-top boxes) is usually solved by limiting the search result to the subset restricted by the constraints of the user’s context, which means that only the data suiting the given platform are considered during the search. Besides, none of them is designed to be consistent with the MPEG-7 MDS model [MPEG-7 Part V, 2003].

II.5.2. Media (Streaming) Servers

As pointed out previously, the creation of a fully featured MMDBMS supporting continuous data has failed so far, probably due to the enormous complexity of the task. On the other hand, the demand for delivery of audio-visual data, imposing needs for effective storage and transmission, has been rising incessantly. As a result, the development of simple and pragmatic solutions was started, which has delivered today’s audio and video streaming servers, especially useful for video-on-demand applications.

The RealNetworks Media Server [RealNetworks, 2002], the Apple Darwin Streaming Server [Apple, 2003] and the Windows Media Services [Microsoft, 2003] are definitely the best-known commercial products currently available. All of them deliver continuous data with low latency and control the clients’ access to the stored audio/video data on the server side. The drawback is that the data are stored on the server only in a proprietary format, and they are streamed to the client over the network in that same format. To provide at least some degree of scalability, including various qualities of the data accompanied by a range of bandwidth characteristics, the redundancy approach has to be applied: usually several instances of the same video, pre-coded at various bit rates and resolutions, are stored on the server.


III. FORMATS AND RT/OS ISSUES

III.1. Storage Formats

Only the lossless and/or scalable coding schemes are discussed within this section. The non-scalable and lossy ones are of no interest, as can be derived from the later part of the work. But before going into details, a few general remarks have to be stated. It is a fact that multimedia applications have to cope with huge amounts of data, and thus compression techniques are exploited for storage and transmission. It is also clear that the trade-off between compression efficiency and data quality has to be considered when choosing the compression method. So the first decision is to choose between lossless and lossy solutions. Secondly, the number of provided SNR quality levels has to be selected: either discrete with no scalability (one quality level) or just a few levels, or continuous (see footnote 12) with a smooth quality transition over the quality range (often represented by fine-granular scalability). It is taken for granted that higher compression leads to lower data quality for lossy algorithms, and that the introduction of scalability lowers the coding efficiency for the same quality in comparison to non-scalable coding schemes. Moreover, lossless coding is much worse than lossy coding when considering only coding efficiency. All these general remarks apply to both audio and video data and their processing.

III.1.1. Video Data

When this work was started, there was no scalable and lossless video compression available, to the author’s knowledge. Therefore, only coding schemes that are either scalable or lossless (but not both) are discussed at the beginning. Next, as an exception, MJPEG-2000 – a possible candidate for a lossless, scalable image coding format applied to video encoding – is shortly described, and the 3D-DWT-based solution is mentioned subsequently. With respect to the newest, still ongoing and unfinished standardization process of MPEG-4 SVC [MPEG-4 Part X, 2007], some remarks are given in the Further Work.

12 The term continuous is an overstatement in this case, because there is no real continuity in the digital world. Thus the border between discrete and continuous is very vague, i.e. one could ask how many levels are required to be considered continuous and not discrete anymore. It is suggested that anything below 5 levels is discrete, but the decision is left to the reader.

III.1.1.1 Scalable codecs

The MPEG-4 FGS profile [Li, 2001] defines two layers with identical spatial resolution, where one is the base layer (BL) and the other is the enhancement layer (EL). The base layer is compressed according to the Advanced Simple Profile (ASP) including temporal prediction, while the EL stores the difference between the originally calculated DCT coefficients and the coarsely quantized DCT coefficients stored in the BL using a bit-plane method, such that the enhancement bitstream can be truncated at any position, and so fine-granularity scalability is provided. There is no temporal prediction within the EL, so the decoder is resistant to error drift and recovers robustly from errors in the enhancement stream.
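
To make the bit-plane idea more tangible, the following minimal C sketch (illustrative only; it does not follow the MPEG-4 FGS bitstream syntax, and the residual values are invented) shows how sending the residual magnitudes most-significant-plane first allows the enhancement information to be cut after any plane while still yielding a coarser but usable approximation:

#include <stdio.h>
#include <stdlib.h>

#define N 8          /* residual coefficients of one (tiny) block         */
#define PLANES 6     /* magnitudes are assumed to fit into 6 bit-planes   */

/* Reconstruct the residual using only the first 'kept_planes' planes,
 * i.e. as if the enhancement stream had been truncated there. */
static void decode_planes(const int residual[N], int kept_planes, int out[N])
{
    for (int i = 0; i < N; i++) {
        int mag = abs(residual[i]);
        int kept = 0;
        for (int p = PLANES - 1; p >= PLANES - kept_planes; p--)
            kept |= mag & (1 << p);
        out[i] = (residual[i] < 0) ? -kept : kept;
    }
}

int main(void)
{
    /* invented difference between original and coarsely quantized DCT
     * coefficients (what the EL would carry) */
    int residual[N] = { 37, -21, 14, -9, 6, -3, 2, -1 };
    int approx[N];

    for (int k = 1; k <= PLANES; k++) {       /* possible truncation points */
        decode_planes(residual, k, approx);
        printf("planes kept = %d:", k);
        for (int i = 0; i < N; i++)
            printf(" %4d", approx[i]);
        printf("\n");
    }
    return 0;
}

Every additional plane roughly halves the remaining reconstruction error, which is the fine-granular refinement behavior that FGS-based adaptation exploits.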

Two extensions to FGS have been proposed in order to lower the FGS coding deficiencies caused by the missing temporal redundancy removal in the enhancement layer. Motion-Compensated FGS (MC-FGS) [Schaar and Radha, 2000] is one of them; it proposes that the higher-quality frames from the EL be used as reference for motion compensation, which leads to smaller residuals in the EL and thus better compression, but at the same time suffers from higher error drift due to error propagation within the EL. Progressive FGS (PGFS) [Wu et al., 2001] is the other proposal; it adopts a special prediction method for the EL in a separate loop, using only a partial temporal dependency on the higher-quality frame. In this way, it closes the gap between the relatively inefficient FGS with no drift in the EL and the efficient MC-FGS with its susceptibility to error drift in the EL. Moreover, an enhancement to PGFS called Improved PGFS [Ding and Guo, 2003] has already been proposed, where the coding efficiency is even higher (for the same compressed size the PSNR [Bovik, 2005] gain reaches 0.5 dB) by using the higher-quality frame as reference for all frames; additionally, error accumulation is prevented by attenuation factors.

Even though very high quality can be achieved by MPEG-4 FGS-based solutions, it is impossible to obtain losslessly coded information due to the lossy properties of the DCT and quantization steps in the implementation of the MPEG-4 standard.


III.1.1.2 Lossless codecs

Many different lossless video codecs providing high compression performance and efficient storage are available. They may be divided into two groups: 1) those using general-purpose compression methods, and 2) those using transform-based coding.

The first group exploits lossless general-purpose compression methods like Huffman coding or the Lempel-Ziv algorithm and its derivatives, e.g. LZ77, LZ78, LZW or LZY [Effelsberg and Steinmetz, 1998; Gibson et al., 1998]. A more efficient solution is to combine these methods into one – known as DEFLATE (defined by Phil Katz in PKZIP and used, for example, in the popular gzip) – and examples of such video codecs are: the Lossless Codec Library (known as LCL AVIzlib/mszh), the LZO Lossless Codec, and CSCD (RenderSoft CamStudio). These methods are still relatively inefficient due to their generality and the unexploited spatial and temporal redundancy, so a more advanced method employing spatial or temporal prediction to exploit the redundancies in video data would improve the compression performance. Examples of such enhanced methods are HuffYUV [Roudiak-Gould, 2006] and Motion CorePNG, and it is also assumed that the proprietary AlparySoft Lossless Video Codec [WWW_AlparySoft, 2004] belongs to this group as well.
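
The benefit of putting even a trivial temporal prediction in front of a general-purpose coder can be sketched as follows (a hedged illustration using the standard zlib compress() call on two synthetic frames; the frame contents and sizes are made up, and real codecs of this group use more elaborate predictors):

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define W 320
#define H 240

/* Return the DEFLATE-compressed size of a buffer (0 on failure). */
static unsigned long deflated_size(const unsigned char *buf, unsigned long len)
{
    uLongf out_len = compressBound(len);
    unsigned char *out = malloc(out_len);
    if (out == NULL || compress(out, &out_len, buf, len) != Z_OK)
        out_len = 0;
    free(out);
    return out_len;
}

int main(void)
{
    static unsigned char prev[W * H], cur[W * H], diff[W * H];

    /* two synthetic luminance frames: a gradient shifted by one pixel */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            prev[y * W + x] = (unsigned char)((x + y) & 0xFF);
            cur[y * W + x]  = (unsigned char)((x + 1 + y) & 0xFF);
        }

    for (int i = 0; i < W * H; i++)          /* trivial temporal prediction */
        diff[i] = (unsigned char)(cur[i] - prev[i]);

    printf("DEFLATE of raw frame        : %lu bytes\n", deflated_size(cur, W * H));
    printf("DEFLATE of frame difference : %lu bytes\n", deflated_size(diff, W * H));
    return 0;
}

On this synthetic content the difference signal compresses far better than the raw frame, which is exactly the redundancy that the enhanced codecs of this group exploit.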

The second group of codecs uses compression techniques derived from transform-based algorithms, which were originally developed for still images. If such a technique is applied to video data on a frame-by-frame basis, it produces a simple sequence of pictures which are independently coded without loss of information. Typical examples are Lossless Motion JPEG [Wallace, 1991] or LOCO-I [Weinberger et al., 2000]. There are also many implementations available, both in software (e.g. PICVideo Lossless JPEG, Lead Lossless MJPEG) and in hardware (e.g. Pinnacle DC10, Matrox DigiSuite).

III.1.1.3 Lossless and scalable codecs

The combination of lossless coding and scalability in video coding has been proposed in recent research; however, no ready, implemented solutions have been provided. One important solution focused on video is MPEG-4 SVC [MPEG-4 Part X, 2007], discussed in the Further Work (due to its work-in-progress status).


Another possible lossless and scalable solution for video data could be the use of a wavelet-transform-based technology, e.g. Motion JPEG 2000 (MJ2) using the two-dimensional discrete wavelet transform (2D-DWT) [Imaizumi et al., 2002]. However, there are still many unsolved problems regarding the efficient exploitation of temporal redundancies (JPEG 2000 only covers still pictures, not motion). Extensions for video exploiting temporal characteristics are still under development.

Last but not least, the three-dimensional discrete wavelet transform (3D-DWT) could be applied to video coding. The 3D-DWT operates not in two dimensions, as the transform used by MJ2, but in three dimensions, i.e. the third dimension is time. This allows treating a video sequence, or a part of it consisting of more than one frame and called a group of pictures (GOP), as a 3-D space of luminance or chrominance values. There is, however, an unacceptable drawback for real-time compression – a time buffer of at least the size of the GOP has to be introduced. On the other hand, the approach better exploits the correlation between pixels in subsequent frames around one point or macro-block of the frame, there is no need to divide the frame into non-overlapping 2-D blocks (which avoids blocking artifacts), and the method inherently allows for scaling [Schelkens et al., 2003]. It is also assumed that, by selecting more important regions, one could achieve compression of 1:500 (vs. 1:60 for DCT-based algorithms [ITU-T Rec. T.81, 1992]). Still, the computing cost of 3D-DWT-based algorithms may be much higher than that of previously published video compression algorithms such as 2D-DWT-based ones. Moreover, only a proprietary prototypical implementation has been developed.
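
As an illustration of the temporal (third) dimension only, the following C sketch applies one level of an integer Haar (lifting) decomposition along the time axis of a small, synthetic GOP; the spatial 2-D transform, quantization and entropy coding of a real 3D-DWT codec are omitted, and the GOP size and pixel values are invented:

#include <stdio.h>

#define W 4
#define H 4
#define GOP_LEN 4              /* must be even for one Haar level          */

int main(void)
{
    int gop[GOP_LEN][H][W];
    int lo[GOP_LEN / 2][H][W]; /* temporal low-pass band (approximation)   */
    int hi[GOP_LEN / 2][H][W]; /* temporal high-pass band (detail)         */

    /* synthetic GOP: a flat picture that slowly brightens over time */
    for (int t = 0; t < GOP_LEN; t++)
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                gop[t][y][x] = 100 + 2 * t;

    /* one level of the integer Haar (lifting) transform along the time
     * axis, applied per pixel position; invertible, hence lossless */
    for (int t = 0; t < GOP_LEN; t += 2)
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++) {
                int a = gop[t][y][x], b = gop[t + 1][y][x];
                int d = b - a;                  /* detail coefficient       */
                hi[t / 2][y][x] = d;
                lo[t / 2][y][x] = a + (d >> 1); /* rounded average          */
            }

    printf("pixel (0,0): lo = %d %d, hi = %d %d\n",
           lo[0][0][0], lo[1][0][0], hi[0][0][0], hi[1][0][0]);
    return 0;
}

Note that all GOP_LEN frames must be buffered before the first coefficient can be produced – the real-time drawback mentioned above – while the high-pass band, holding mostly small values for slowly changing content, is what makes the subsequent entropy coding effective.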

III.1.2. Audio Data

In contrast to the video section, this audio section discusses only lossless and scalable coding schemes, because just a few coding schemes are applicable within this work. In general, there are a few methods of storing audio data without loss: 1) in uncompressed form, e.g. in a RIFF WAVE file using Pulse Code Modulation, Differential PCM, Adaptive DPCM, or any other modulation method; 2) in compressed form using standard lossless compression algorithms like ARJ, ZIP or RAR; or 3) in compressed form using one of the audio-specific algorithms for lossless compression. From the perspective of this work, only the third class is shortly described, due to its capability of layering, its compression efficiency and its relation to audio data.


The best-known and most efficient coding schemes using audio-specific lossless compression are as follows:

• Free Lossless Audio Codec (FLAC) [WWW_FLAC, 2006]
• Monkey's Audio [WWW_MA, 2006]
• LPAC
• MPEG-4 Audio Lossless Coding (ALS) [Liebchen et al., 2005] (LPAC has been used as a reference model for ALS)
• RKAU from M Software
• WavPack [WWW_WP, 2006]
• OptimFrog
• Shorten from SoftSound
• MPEG-4 Scalable Lossless Coding (SLS) [MPEG-4 Part III FDAM5, 2006]

Out of all the mentioned coding schemes, only WavPack, OptimFrog and MPEG-4 SLS are (to some extent) scalable formats, while the others are non-scalable, which means that all data must be read from the storage medium and decoded completely, and there is no way of obtaining lower quality through partial decoding. Both WavPack and OptimFrog have a feature providing rudimentary scalability, such that there are only two layers: a relatively small lossy file (base layer) and an error-compensation file which has to be combined with the decoded lossy data to provide the lossless information (enhancement layer). This scalability feature is called Hybrid Mode in WavPack and DualStream in OptimFrog.

Besides the mentioned two-layer scalability, fine-granular scalability would be possible in WavPack, because the algorithm uses linear prediction coding where the most significant bits are encoded first. So, if the implementation of WavPack were reworked, it would be possible to use every additional bit layer for improving the quality and to turn WavPack into a scalable-to-lossless audio codec with multiple layers of scalability. However, such a rework has not been implemented so far, and no scalability finer than two layers has been proposed.
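
The two-layer principle behind the Hybrid Mode and DualStream features can be sketched as follows (an illustration only, not the WavPack or OptimFrog bitstream: the base layer here is a plain coarse quantization and the correction values are stored unencoded, whereas the real codecs apply prediction and entropy coding to both layers):

#include <stdio.h>

#define N 8
#define SHIFT 6      /* the base layer drops the 6 least significant bits  */

int main(void)
{
    short pcm[N]  = { 1203, -518, 32767, -32768, 7, -7, 250, -251 };
    short base[N], corr[N];

    for (int i = 0; i < N; i++) {
        /* lossy base layer: coarse quantization (arithmetic shift assumed
         * for negative samples, as on all common platforms) */
        base[i] = (short)((pcm[i] >> SHIFT) << SHIFT);
        /* enhancement layer: the error that restores losslessness */
        corr[i] = (short)(pcm[i] - base[i]);
    }

    for (int i = 0; i < N; i++) {
        short rec = (short)(base[i] + corr[i]);     /* lossless decode      */
        printf("%6d = %6d + %2d  (%s)\n", pcm[i], base[i], corr[i],
               rec == pcm[i] ? "exact" : "ERROR");
    }
    return 0;
}

Decoding the base layer alone yields the small lossy version; combining it with the correction layer restores the original samples bit-exactly – the same split the hybrid files make, only with prediction and entropy coding applied to both layers.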

Similarly, two layers are proposed in MPEG-4 SLS by default [Geiger et al., 2006], i.e. the base layer (also called the core) and the enhancement layer. The core of MPEG-4 SLS uses the well-known MPEG-4 AAC coding scheme [MPEG-2 Part VII, 2006], while the EL stores the transform-based encoded error values (i.e. the difference between the decoded core and the lossless source) coded according to the SLS algorithm. The SLS EL, analogously to WavPack, also stores the most significant bits of the difference first, but in contrast the SLS algorithm allows cutting bits off the EL, so fine-granular scalability is possible. Moreover, the transform-based coding of MPEG-4 SLS outperforms the linear prediction coding used in WavPack in coding efficiency at low bit rates. Additionally, MPEG-4 SLS can also be switched into an SLS-only mode where no audio data are stored in the base layer (no core; see footnote 13) and just the enhancement layer encodes the error. Such a solution is equivalent to encoding the error difference between zero and the lossless signal value in a scalable, from-the-first-bit manner. Of course, at low bit rates the SLS-only method cannot compete with the AAC-based core.

III.2. Real-Time Issues in Operating Systems

The idea of transcoding-driven format independence of audio and video data imposes enormous performance requirements on the operating system, including both computing power and data transfers. Secondly, the time constraints of continuous data limit the applicable solutions to only a subset of the existing algorithms. For example, preemptive scheduling with a few priority levels [Tannenbaum, 1995], which has proved to be relatively simple and efficient in managing the workload in best-effort systems such as Linux or Windows, is insufficient in more complex scenarios where many threads with time constraints appear and execution deadlines should be considered.

Additionally, in order to understand the later parts of this work, some background in the field of operating systems is also required. Not all aspects related to OS issues are discussed within this section; just the most critical points and definitions are covered – more detailed aspects building on the definitions introduced here appear whenever required in the respective chapters (as inline explanations or footnotes). First the execution modes, kernel architectures and inter-process communication are discussed, then the real-time processing models are shortly introduced, and next the related scheduling algorithms are referenced.

13 This method is called SLS-based core, but the name may be misunderstood, because no bits are spent for storing audio quanta in the base layer. Only descriptive and structural information such as bit rate, number of channels, sampling rate, etc. is stored in the BL.

III.2.1. OS Kernel – Execution Modes, Architectures and IPC

In general, two kinds of memory space (or execution modes) are distinguishable: user space (user mode) and kernel space (kernel mode). As may be derived from the names, the OS kernel (usually together with the device drivers and kernel extensions or modules) runs in the non-swappable (see footnote 14) kernel memory space in kernel mode, while the applications use the swappable user memory and are executed in user mode [Tannenbaum, 1995]. Of course, programs in user mode cannot access the kernel space, which is very important for system stability and allows handling buggy or malicious applications.

On the other hand, there are two types of OS kernel architecture in common use (see footnote 15): a micro-kernel and a monolithic kernel. Monolithic kernels embed all the OS services, such as memory management (address spaces with mapping and the TLB (see footnote 16), virtual memory, caching mechanisms), threads and thread switching, scheduling mechanisms, interrupt handling, the inter-process communication (IPC; see footnote 17) mechanism [Spier and Organick, 1969], file-system support (block devices, local FS, network FS) and network stacks (protocols, NIC drivers), PnP hardware support, and many more. At the same time, all these services run in the kernel space in kernel mode. Modern monolithic kernels are usually re-configurable through the possibility of dynamically loading additional kernel modules, e.g. when a new hardware driver or support for a new file system is required. In this respect only, they are similar to micro-kernels. What differentiates micro-kernels from monolithic kernels is the embedded set of OS services, and the address space and execution mode used for these services. Only a minimal set of abstractions (address spaces, threads, IPC) is included in the micro-kernel [Härtig et al., 1997], which is used as a base for the user-space implementation of the remaining functionality (i.e. for the other OS services, usually called servers). In other words, only the micro-kernel with its minimal concepts (primitives) runs in kernel mode using kernel-assigned memory, and all the servers are loaded into the protected user memory space and run at the user level [Liedtke, 1996]. Hence the micro-kernels have some advantages over the monolithic kernels, such as a smaller binary image, a reduced memory and cache footprint, a higher resistance to malicious drivers, and easier portability to different platforms [Härtig et al., 1997]. An example of a micro-kernel is L4 [Liedtke, 1996], implementing primitives such as: 1) address spaces with grant, map and flush, 2) threads and IPC, 3) clans and chiefs for kernel-based restriction of IPCs (the clans and chiefs themselves are located in user space), and 4) unique identifiers (UIDs). One remark: device-specific interrupts are abstracted to IPCs and are not handled internally by the L4 micro-kernel, but rather by the device driver in user mode.

14 Swappable means that parts of the virtual memory (VM) may be swapped out to secondary storage (usually to a swap file or swap partition on a hard disk drive) temporarily whenever the process is inactive or the given part of the VM has not been used for some time. Analogously, non-swappable memory cannot be swapped out.
15 A nano-kernel is intentionally left out due to its limitations, i.e. only a minimalist hardware abstraction layer (HAL) including the CPU interface. Interrupt management and the memory management unit (MMU) interface are very often included in nano-kernels (due to the coupled CPU architectures) even though they do not really belong there. Nano-kernels are suitable for real-time single-task applications, and thus are applicable in embedded systems for hardware independence. Alternatively, it might be said that nano-kernels are not OS kernels in the traditional sense, because they do not provide a minimal set of OS services.
16 TLB stands for Translation Lookaside Buffer, which is a cache of mappings from the operating system’s page table.
17 IPC finds its roots in the meta-instructions defined for multiprogrammed computations by [Dennis and Van Horn, 1966] within MIT’s Project MAC.

An efficient implementation of IPC is required by micro-kernels [Härtig et al., 1997] and is also necessary for parallel and distributed processing [Marovac, 1983]. IPC allows for data exchange (usually by sending messages) between different processes or processing threads [Spier and Organick, 1969]. Moreover, by limiting the IPC interface of a given server or by implementing a different IPC redirection mechanism, additional security may be enforced through efficient and more flexible access control [Jaeger et al., 1999]. However, a badly designed IPC introduces additional overhead into the implemented OS kernel [Liedtke, 1996].
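
For readers unfamiliar with the concept, the following minimal POSIX sketch shows message-based data exchange between two processes over a pipe; it is a generic illustration only and does not reflect the synchronous IPC primitives of L4 (the message text is invented):

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    if (pipe(fd) != 0) {
        perror("pipe");
        return 1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {                      /* child process: the "server"    */
        char msg[64];
        ssize_t n = read(fd[0], msg, sizeof msg - 1);   /* receive message */
        if (n > 0) {
            msg[n] = '\0';
            printf("server received: %s\n", msg);
        }
        return 0;
    }

    const char *request = "decode frame 42";            /* send message    */
    if (write(fd[1], request, strlen(request)) < 0)
        perror("write");
    close(fd[1]);
    wait(NULL);                          /* parent waits for the child     */
    return 0;
}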

III.2.2. Real-time Processing Models

III.2.2.1 Best-effort, soft- and hard-real-time

A few processing models applicable to on-line multimedia processing can be distinguished with respect to real-time behavior. The first one is best-effort processing without any consideration of time during scheduling and without any QoS control, for example the aforementioned preemptive scheduling with a few priority levels used in best-effort OSes [Tannenbaum, 1995].

The second processing model is known as soft real-time, where deadline misses are acceptable (with processing being continued) and time buffers are required for keeping the quality constant. However, if the execution time varies so much that the limited buffer overruns, a frame skip may occur and the quality will drop noticeably. Moreover, delays depending on the size of the time buffer are introduced, so the trade-off between the buffer size and the quality guarantees is crucial here (and guarantees of delivering all frames cannot be given). Additionally, a sophisticated scheduling algorithm must allow for variable periods per frame in such a case, so the jitter-constrained periodic streams (JCPS) model [Hamann, 1997] is applicable here.
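
The buffer-size/quality trade-off can be illustrated with a toy simulation (all numbers are synthetic assumptions: a 40 ms frame period and uniformly distributed decoding times): a larger playout buffer hides more execution-time jitter and therefore causes fewer late or skipped frames, but introduces a proportionally larger start-up delay.

#include <stdio.h>
#include <stdlib.h>

#define FRAMES 500
#define PERIOD 40.0                      /* ms per frame at 25 fps         */

static double decode_time(void)          /* content-dependent, jittery     */
{
    return 25.0 + (double)(rand() % 30); /* 25..54 ms, mean below PERIOD   */
}

static int skipped_frames(int buffer_frames)
{
    double reserve = buffer_frames * PERIOD; /* playout time buffered ahead */
    int skipped = 0;

    srand(1);                            /* identical jitter for every run */
    for (int i = 0; i < FRAMES; i++) {
        reserve += PERIOD - decode_time();    /* one frame produced         */
        if (reserve > buffer_frames * PERIOD)
            reserve = buffer_frames * PERIOD; /* buffer cannot grow further */
        if (reserve < 0.0) {                  /* underrun: frame comes late */
            skipped++;
            reserve = 0.0;
        }
    }
    return skipped;
}

int main(void)
{
    for (int b = 0; b <= 8; b += 2)
        printf("buffer = %d frames (start-up delay %3.0f ms): %3d of %d frames late\n",
               b, b * PERIOD, skipped_frames(b), FRAMES);
    return 0;
}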

The third model is hard real-time, where deadline misses are not allowed and the processing must be executed completely. So, guarantees of delivering all the frames may be given here. However, a waste of resources is the problem, because in order to guarantee complete execution on time, worst-case scheduling must be applied [El-Rewini et al., 1994].

III.2.2.2 Imprecise computations

Imprecise computation [Lin et al., 1987] is a scheduling model which obeys the deadlines like the hard-real-time model, but applies to flexible computations, i.e. if there is still time left within the strict period, the result should be improved to achieve better quality. Here deadline misses are not allowed, but the processing is adapted to meet all the deadlines at the cost of graceful degradation of the result’s quality. The idea is that a minimum level of quality is provided by the mandatory part, and improved quality is delivered by optional computations if the resources are available.
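
A minimal sketch of this split into mandatory and optional parts is given below; the period, the worst-case costs and the notion of a content-dependent load factor are invented for the illustration:

#include <stdio.h>

#define PERIOD_MS      40.0   /* one frame period at 25 fps                */
#define MANDATORY_WCET 22.0   /* worst-case cost of the mandatory part     */
#define OPTIONAL_WCET  30.0   /* worst-case cost of the optional part      */

/* Process one frame; the load factor models how content-dependent the
 * actual costs are. Returns the quality level reached within the period. */
static int process_frame(double load_factor)
{
    double used = MANDATORY_WCET * load_factor;   /* mandatory: always run  */
    int quality = 1;                              /* minimum quality        */

    if (used + OPTIONAL_WCET * load_factor <= PERIOD_MS) {
        used += OPTIONAL_WCET * load_factor;      /* optional refinement    */
        quality = 2;                              /* improved quality       */
    }
    printf("load %.2f: %5.1f of %.0f ms used, quality level %d\n",
           load_factor, used, PERIOD_MS, quality);
    return quality;
}

int main(void)
{
    process_frame(0.5);    /* easy content: the optional part still fits   */
    process_frame(1.0);    /* worst case: only the mandatory part runs     */
    return 0;
}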

III.2.3. Scheduling Algorithms and QAS

A scheduling algorithm based on the idea of imprecise computation, applicable to real-time audio-video transformation with minimal quality guarantees and no deadline misses, is quality-assuring scheduling (QAS) proposed by [Steinberg, 2004]. QAS is based on the periodic use of resources, where the application may split a resource request into a few sub-tasks with different fixed priorities. The reservation time is the sub-task’s exclusive time allocated on the resource. This allocation is done a priori, i.e. during the admission test before the application is started. Moreover, the allocation is possible only if there is still free (unused) reservation time of the CPU available.
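
The a-priori reservation can be illustrated by a simplified admission test (a sketch under the assumption of a plain utilization bound; the actual QAS admission analysis in [Steinberg, 2004] is more involved, and the task parameters below are invented):

#include <stdio.h>

#define CAPACITY 1.0    /* fraction of the CPU available for reservations  */

static double admitted_utilization = 0.0;

/* A task asks to reserve 'reservation_ms' of CPU time in every period of
 * 'period_ms'; it is admitted only while unused capacity remains. */
static int admit(const char *name, double reservation_ms, double period_ms)
{
    double u = reservation_ms / period_ms;
    if (admitted_utilization + u > CAPACITY) {
        printf("%-16s rejected (needs %2.0f%%, only %2.0f%% left)\n",
               name, 100.0 * u, 100.0 * (CAPACITY - admitted_utilization));
        return 0;
    }
    admitted_utilization += u;
    printf("%-16s admitted (%2.0f%% reserved, %3.0f%% in total)\n",
           name, 100.0 * u, 100.0 * admitted_utilization);
    return 1;
}

int main(void)
{
    admit("video decode",    18.0, 40.0);   /* 45% of the CPU every 40 ms  */
    admit("video encode",    16.0, 40.0);   /* 40%                         */
    admit("audio transcode",  5.0, 20.0);   /* 25%: exceeds the capacity   */
    return 0;
}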


QAS scheduling [Steinberg, 2004] is the most suitable for the provision of format independence using real-time transformations of continuous multimedia data within the RETAVIC project, because guarantees of delivering all the frames may be given and a minimum level of quality will always be provided. It is also assumed that the complete task does not have to be finished by the deadline, but just its mandatory part delivering the minimum requested quality; thus the optional jobs (sub-tasks) within the periodic task may not be executed at all, or may be interrupted and stopped. So, QAS will be referred to further in this work and clarified whenever needed.

The reader can find an extensive discussion, including advantages and disadvantages, of scheduling algorithms usable for multimedia data processing in the related work of [Hamann et al., 2001a]. This discussion refers to:

1) the extended theoretical research on imprecise computations (Liu, Chong, or Baldwin), resulting in just one attempt to build a system for CPU allocation and no other resources (Hull);

2) time-value functions (Jensen with Locke, or Rajkumar with Lee), where the varying execution time of jobs but no changes in system load are considered;

3) Abdelzaher’s system for QoS negotiation, which considers only long-lived real-time services and assumes that a task’s resource requirements are constant;

4) the network-oriented view on packet scheduling in media streams (West with Poellabauer), also with guarantees of a specific level of quality, but not considering the semantic dependencies between packets;

5) statistical rate-monotonic scheduling (SRMS from Atlas and Bestavros), which relies on the actual execution time and on scheduling at the beginning of each period;

6) the resource kernel (Oikawa and Rajkumar), which manages and provides resources directly in the kernel (in contrast, QAS is server-based on top of a micro-kernel) and, moreover, does not address overloads, quality, or resources for optional parts.

Since the opinion on applying QAS in the RETAVIC project is similar to the one presented by [Hamann et al., 2001a] – namely that it is applicable to real-time multimedia processing – the scheduling algorithms are not discussed further in this work.


Chapter 3 – Design

Entia non sunt multiplicanda praeter necessitatem. (see footnote 18)
William of Occam (14th century, “Occam’s Razor”)

IV. SYSTEM DESIGN

This chapter gives an overview of the RETAVIC conceptual model. First, the architecture requirements are briefly stated. Next, the general idea based on the generic 3-phase video server framework proposed in [Militzer, 2004] is depicted; however, the 4-phase model (as already published in [Suchomski et al., 2005]) is described in even more detail, considering both audio and video, and then each phase is further clarified in a separate subsection. Some drawbacks of the previously described models of the architecture are mentioned in the respective subsections, and extensions are proposed accordingly. Subsequently, the general idea is evaluated in a short summary. Next, the hard-real-time adaptive model is proposed. Finally, the real-time issues in the context of continuous media according to the proposed conceptual model are described.

The subsections of the conceptual model referring to storage, including the internal format and meta-data (IV.2.3), and to real-time transcoding (IV.2.4) are further explained for each media type (media-specific details) separately in the following sections of this chapter: video in Section V, audio in Section VI, and both combined as multi-media in Section VII.

18 In English: Entities should not be multiplied beyond necessity. Based on this sentence the KISS principle has been developed: Keep It Simple, Stupid – but never oversimplify.

IV.1. Architecture Requirements

The main difficulty of media transformation is caused by the huge complexity of the process [Suchomski et al., 2004], especially for continuous media, where it must be done in real time to allow uninterrupted playout. The processing algorithms for audio and video require enormous computational resources, and their needs vary heavily since they depend on the input data. Thus accurate resource allocation is often impossible due to this unpredictable behavior [Liu, 2003] (a huge difference between worst-case and average execution times), and the missing scalability and adaptability of the source data formats inhibit QoS control.

The essential objective of this project is to develop functionality extending today’s multimedia database services that provides efficient and format-transparent access to multimedia data by utilizing real-time format conversion respecting the user’s specific demands. In other words, the RETAVIC project has been initiated with the goal of enabling format independence by real-time audio-video conversion in multimedia database servers. The conversion services will run in a real-time environment (RTE) in order to provide a specific QoS. A few main directions have been defined in which the multimedia DBMS should be extended:

Data independence – various clients' requests should be served by transparent on-line format transformations (format independence) without regard to physical storage and access methodology (data independence).

Data completeness – data should be stored and provided without any loss of information, whenever it is required.

Data access scalability – just a portion of the data should be accessed when lower quality is requested, and a complete reading should only take place if lossless data are requested.


Real-time conversion – a user should be allowed to transparently access data on demand. So, the conversion should be executed just in time, which imposes specific real-time requirements and thus requires QoS control.

Redundancy/transformation trade-off – a single copy (an instance) of each media object should be kept to save space and to ease updates. However, the system must not be limited to exactly one copy, especially when many clients with identical quality and format requirements are expected to be served, e.g. using caching proxies.

Real-time capturing – lossless insertion (recording) should be supported, which is especially important in scientific and medical fields.

These directions do not exclude Codd’s well-known Twelve Rules [Codd, 1995]; on the contrary, they extend them by considering continuous multimedia data. For example, the first direction specifies a method for Codd’s 8th and 9th rules (see footnote 19). Moreover, Codd’s rules have been defined for an ideal relational DBMS, which is not fully applicable to other DBMSes such as ODBMS, ORDBMS or XML-specific DBMS. For example, Codd’s 1st rule (the Information Rule – all data presented in tables) is somewhat useless with respect to multimedia data, where the presentation layer is usually left to the end-user application.

According to the mentioned six directions and the limitations of the existing media-server solutions (previously discussed in Chapter 2), a proposal of a generic real-time media transformation framework for multimedia database management systems and multimedia servers is developed and presented in the following part.

IV.2. General Idea

The generic real-time media transformation framework defined within the RETAVIC project (in short, the RETAVIC framework) finally consists of four phases: real-time capturing, non-real-time preparation, storage, and real-time delivery. The RETAVIC framework is depicted in Figure 4.

19 Codd’s Rules: 1. The Information Rule, 2. Guaranteed Access Rule, 3. Systematic Treatment of Null Values, 4. Dynamic On-Line Catalog Based on the Relational Model, 5. Comprehensive Data Sublanguage Rule, 6. View Updating Rule, 7. High-level Insert, Update, and Delete, 8. Physical Data Independence, 9. Logical Data Independence, 10. Integrity Independence, 11. Distribution Independence, 12. Nonsubversion Rule.

Figure 4. Generic real-time media transformation framework supporting format independence in multimedia servers and database management systems. Remark: dotted lines refer to optional parts that may be skipped within a phase.

Real-time capturing (Phase 1) includes a fast and simple lossless encoding of the captured multimedia stream and an input buffer for the encoded multimedia binary stream (details are given in Section IV.2.1). Phase 2 – the non-real-time preparation phase – prepares the multimedia objects and creates meta-data related to these objects; finally it forwards all the produced data to a multimedia storage system. An optional archiving of the original source, a format conversion from many different input formats into an internal format (which has specific characteristics), and a content analysis are executed in Phase 2 (described in Section IV.2.2). Phase 3 is a multimedia storage system where the multimedia bitstreams are collected and the accompanying meta-data are kept. Phase 3 has to be able to provide real-time access to the next phase (more about that in Section IV.2.3). Finally, Phase 4 – real-time delivery – has two parallel processing channels. The first one is the real-time transcoding sub-phase, where the obligatory processes, namely real-time decoding and real-time encoding, take place, and which may include optional real-time media processing (e.g. resizing, filtering). The second channel – marked as the bypass delivery process in Figure 4 – is used for direct delivery, which does not apply any transformation besides the protocol encapsulation. Both processing channels are treated as isolated, i.e. they work separately and do not influence each other (details about Phase 4 are given in Section IV.2.4).

A few general remarks apply to the proposed framework as a whole:

• the phases may be distributed over a network – then one must consider additional complexity and communication overhead, but can gain an increase in processing efficiency (and thus a higher number of served requests);

• Phase 1 is optional and may be completely skipped, however, only if the application scenario does not require live video and audio capturing;

• Phase 1 is not meant for non-stop live capturing (there must be some breaks in between to empty the media buffer), because of the unidirectional communication problem between the real-time and non-real-time phases (solvable only by guaranteed resources with worst-case assumptions in the non-real-time phase).

There are also two additional boundary elements visible in Figure 4 besides those within the described four phases. They are not considered to be a part of the framework, but they help in understanding its construction. The multimedia data sources and the multimedia clients with an optional feedback channel (useful in failure-prone environments) are the two constituents. The MM data sources represent the input data for the proposed architecture. The sources are generally restricted by the previously mentioned remark about non-stop live capturing and by the available decoder implementations. However, if the assumption is made that a decoder is available for each encoded input format, the input data may be in any format and quality, including lossy compression schemes, because the input data are simply treated as the original information, i.e. the source data for storage in the MMDBMS or MM server. The multimedia clients are analogously restricted only by the available real-time encoder implementations. And again, if the existence of such encoders is assumed for each existing format, the restriction of the client to a given format is dispelled due to the format independence provided by on-demand real-time transcoding. Moreover, classes of multimedia clients consuming identical multimedia data, which have already been requested and cached earlier, are supported by direct delivery without any conversion, the same as clients accessing the original format of the multimedia stream.

IV.2.1. Real-time Capturing

The first phase is designed to deliver real-time capturing, which is required in some application scenarios where no frame drops are allowed. Even though grabbing audio and video on the fly (live media recording) is a rather rare case considering the variety of applications in reality – most of the time the input audio and video data have already been stored on some other digital storage or device (usually pre-compressed to some media-specific format) – it must not be neglected. A typical use case for a capture-heavy scenario are scientific and industrial research systems, especially in the medical field, which regularly collect (multi)media data where loss of information is not allowed and real-time recording of the processes under investigation is a necessity (see footnote 20). Thus, the capturing phase of the RETAVIC architecture must provide a solution for information preservation and has to rely on a real-time system. Of course, if the multimedia data are already available as a media file, then Phase 1 can be completely skipped, as shown in Figure 4 by connecting the oval “Media Files” shape directly to Phase 2.

IV.2.1.1 Grabbing techniques

The process of audio-video capturing includes analog-to-digital conversion (ADC) of the signal, which is a huge topic of its own and will not be discussed in detail. Generally, the ADC is carried out only for an analog signal and produces a digital representation of the given analog input. Of course, ADC is a lossy process, like every discrete mapping of a continuous function. In many cases the ADC is already conducted directly by the recording equipment (e.g. within the digital camera’s hardware). In general, there are a few possibilities of getting audio and video signals from reality into the digital world:

20 In the case of live streaming of already compressed data, like MPEG-2-coded digital TV channels distributed through DVB-T/M/C/S, these streams could simply be dumped to the media buffer in real time. The process of dumping is not investigated due to its simplicity; the media buffer is discussed in the next subsection.


• by directly connecting (a microphone or camera) to standardized digital interfaces like IEEE 1394 (see footnote 21) (commonly known as FireWire, limited to 400 Mbps), Universal Serial Bus (USB) in version 2.0 (limited to 480 Mbps), wireless Bluetooth technology in version 2.0 (limited to 2.1 Mbps), Camera Link using the MDR-26 connector (or the optical RCX C-Link Adapter over MDR-26), or S/P-DIF (see footnote 22) (Sony/Philips Digital Interface Format), being a consumer version of IEC 60958 (see footnote 23) Type II Unbalanced using a coaxial RCA jack, or Type II Optical using an optical F05 connector (known as TOSLINK or EIAJ Optical – an optical fiber connection developed by Toshiba);

• by using specialized audio-video grabbing hardware producing digital audio and video, like PC audio and video cards (see footnote 24) (e.g. Studio 500, Studio Plus, Avid Liquid Pro from Pinnacle, VideoOh! from Adaptec, All-In-Wonder from ATI, REALmagic Xcard from Sigma Designs, AceDVio from Canopus, PIXCI Frame Grabbers from EPIX) – a subset of these is also known as Personal Video Recorder cards (PVRs, e.g. WinTV-PVR from Hauppauge, EyeTV from Elgato Systems) – or stand-alone digital audio and video converters (see footnote 25) (e.g. ADVC A/D Converters from Canopus, DVD Express or Pyro A/V Link from ADS Technologies, PX-AV100U Digital Video Converter from Plextor, DAC DV Converter from Datavideo, Mojo DNA from Avid) – a subset of stand-alone PVRs is also available (e.g. PX-TV402U from Plextor);

21 The first version of IEEE 1394 appeared in 1995, but the IEEE 1394a (2000) and 1394b (2002) amendments are also available. IEEE 1394b is known as FireWire 800 and is limited to 800 Mbps, thus the previous standards may be referred to as FireWire 400; a further amendment, IEEE 1394c, is coming up. However, to our knowledge, the faster FireWire has not yet been applied in audio-video grabbing solutions. i.Link is Sony’s implementation of IEEE 1394 (in which the 2 power pins are removed).
22 There is a small difference in the Channel Status Bit information between the specification of AES/EBU (or AES3) and its implementation by S/P-DIF. Moreover, the optical version of S/P-DIF is usually referred to as TOSLINK (in contrast to “coaxial S/P-DIF” or just “S/P-DIF”).
23 It is known as AES/EBU or AES3 Type II (Unbalanced or Optical).
24 Analog audio is typically connected through RCA (cinch) jack pairs, and analog video is usually connected through: coaxial RF (known as F-type, used for antennas; audio and video signals together), composite video (known as RCA – one cinch for video, a pair for audio), S-video (luminance and chrominance separately; known as Y/C video; audio separately) or component video (each R, G, B channel separately; audio also independently).
25 Digital audio is carried by the mentioned S/P-DIF or TOSLINK. Digital video is transmitted by DVI (Digital Video Interface) or HDMI (High-Definition Multimedia Interface). HDMI is also capable of transmitting a digital audio signal in addition.


• by using network communication, e.g. WLAN-based mobile devices (like stand-alone cameras with built-in WLAN cards) or Ethernet-based cameras (connected directly to the network);

• by using generic data acquisition (DAQ) hardware performing the AD conversion, either stand-alone with USB support (e.g. HSE-HA USB from Harvard Apparatus, Multifunction DAQs from National Instruments, Personal DAQ from IOTech), PCI-based (e.g. HSE-HA PCI from Harvard Apparatus, DaqBoard from IOTech), PXI-based (PCI eXtensions for Instrumentation, e.g. the Dual-Core PXI Express Embedded Controller from National Instruments) or Ethernet-based (e.g. E-PDISO 16 from Measurement Computing).

Regardless of the variety of grabbing technologies named above, it is assumed in the RETAVIC project that the audio and visual signals are already delivered to the system as digital signals for further capturing and storage (ADC is not analyzed any further). Moreover, it is assumed that huge amounts of continuous data (discussed in the next section) must be processed without loss of information (see footnote 26), for example when monitoring a scientific process with high-end 3CCD industrial video cameras connected through 3 digital I/O channels by DAQs, or when recording the digital audio of a high-quality symphonic concert using multi-channel AD converters transmitting over coaxial S/P-DIF.

IV.2.1.2 Throughput and storage requirements of uncompressed media data

A few digital cameras grouped into different classes may serve as examples of throughput and storage requirements (see footnote 27):

1) very high-resolution 1CCD monochrome/color camera e.g. Imperx IPX-11M5-LM/LC, Basler A404k/A404kc or Pixelink PL-A686C,

26 Further loss of information is meant, beyond the loss already introduced by the ADC.
27 The digital cameras used as examples are representative of the market on September 3rd, 2006. The proposed division into five classes should not change in the future, i.e. the characteristics of each class allow existing cameras to be assigned correctly at a given point in time. For example, in the future an HDTV camera might be available in the group of consumer cameras (class 5) instead of professional cameras (class 4).


2) high-resolution 3CCD (3-chips) color camera e.g. Sony DXC-C33P or JVC KY-F75U,

3) high speed camera e.g. Mikrotron MotionBLITZ Cube ECO1/Cube ECO2/Cube3, Mikrotron MC1310/1311, Basler A504k/A504kc,

4) professional digital camera e.g. JVC GY-HD111E, Sony HDC-900 or Thomson LDK-6000 MKII (all three are HDTV cameras),

5) consumer digital (DV) camera e.g. Sony DCR-TRV950E, JVC MG505, Canon DM-XM2E or DC20.

The first three classes comprise the industrial and scientific cameras. Class 4 refers to the professional market, while class 5 covers only the needs of ordinary consumers.

Class  Name       Model                  Width  Height  FPS   Bit Depth  Throughput [Mbps]  File Size of 60 sec. [GB]
1      Imperx     IPX-11M5-LM            4000   2672    5     12         612                4.48
1      Imperx     IPX-11M5-LC            4000   2672    5     12         612                4.48
1      Basler     A404k                  2352   1726    96    10         3717               27.22
1      Basler     A404kc                 2352   1726    96    10         3717               27.22
1      Pixelink   PL-A686C               2208   3000    5     10         316                2.31
2      JVC        KY-F75U                1360   1024    7.5   12         120                0.88
2      Sony       DXC-C33P               752    582     50    10         209                1.53
3      Mikrotron  MotionBLITZ Cube ECO1  640    512     1000  8          2500               18.31
3      Mikrotron  MotionBLITZ Cube ECO2  1280   1024    500   8          5000               36.62
3      Mikrotron  MotionBLITZ Cube3      512    512     2500  8          5000               36.62
3      Basler     A504k                  1280   1024    500   8          5000               36.62
3      Basler     A504kc                 1280   1024    500   8          5000               36.62
3      Mikrotron  MC1310                 1280   1024    500   10         6250               45.78
3      Mikrotron  MC1311                 1280   1024    500   10         6250               45.78
4      JVC        GY-HD111E              1280   720     60    8          422                3.09
4      Sony       HDC-900                1920   1080    25    12         593                4.35
4      Thomson    LDK-6000 MKII          1920   1080    25    12         593                4.35
5      Sony       DCR-TRV950E            720    576     25    8          79                 0.58
5      JVC        MG505                  1173   660     25    8          148                1.08
5      Canon      DM-XM2E                720    576     25    8          79                 0.58
5      Canon      DC20                   720    576     25    8          79                 0.58

Table 1. Throughput and storage requirements for a few digital cameras from different classes.


Table 1 shows how much data has to be handled when recording from the mentioned five classes of digital cameras. The bandwidth of the video data ranges from 5 Mbps up to 6250 Mbps. Interestingly, the highest bit rate is achieved neither by the highest resolution nor by the highest frame rate alone, but by a combination of high resolution and high frame rate (for the high-speed cameras in class 3). Moreover, a file keeping just 60 seconds of the visual signal from the most demanding camera requires almost 46 GB of space on the storage system.

Name | Width | Height | FPS | Pixel Bit Depth | Throughput [Mbps] | File Size of 60 sec. [GB]
HDTV 1080p | 1920 | 1080 | 25 | 8 | 396 | 2,90
HDTV 720p | 1280 | 720 | 25 | 8 | 176 | 1,29
SDTV | 720 | 576 | 50 | 8 | 158 | 1,16
PAL (ITU601) | 720 | 576 | 25 | 8 | 79 | 0,58
CIF | 352 | 288 | 25 | 8 | 19 | 0,14
CIFN | 352 | 240 | 30 | 8 | 19 | 0,14
QCIF | 176 | 144 | 25 | 8 | 5 | 0,04

Table 2. Throughput and storage requirements for selected video standards.

Additionally, the standard resolutions are listed in Table 2 in order to compare their bandwidth and storage requirements as well. The highest quality of the high-definition television standard (HDTV 1080p) requires as much as a low-end camera among the high-resolution industrial cameras (class 1). Hence, the requirements of the consumer-market standards are not as critical as the scientific or industrial demands.

Analogously, the audio requirements are presented in Table 3. The throughput of uncompressed audio ranges from 84 kbps for low-quality speech, through 1,35 Mbps for CD quality, up to 750 Mbps for 128-channel studio recordings. Correspondingly, an audio file of 60 seconds needs from 600 kB, through 10 MB, up to 5,6 GB (please note that the values in the file-size column are given in megabytes).


Name | Sampling frequency [kHz] | Sample Bit Depth | Number of channels | Throughput [Mbps] | File Size of 60 sec. [MB]
Studio 32/192/128 | 192 | 32 | 128 | 750,000 | 5625
Studio 24/96/128 | 96 | 24 | 128 | 281,250 | 2109
Studio 32/192/48 | 192 | 32 | 48 | 281,250 | 2109
Studio 24/96/48 | 96 | 24 | 48 | 105,469 | 791
Studio 32/192/4 | 192 | 32 | 4 | 23,438 | 176
Studio 24/96/4 | 96 | 24 | 4 | 8,789 | 66
Studio 20/96/4 | 96 | 20 | 4 | 7,324 | 55
Studio 16/48/4 | 48 | 16 | 4 | 2,930 | 22
DVD 7+1 | 96 | 24 | 8 | 17,578 | 132
DVD 5+1 | 96 | 24 | 6 | 13,184 | 99
DVD Stereo | 96 | 24 | 2 | 4,395 | 33
DAT | 48 | 16 | 2 | 1,465 | 11
CD | 44,1 | 16 | 2 | 1,346 | 10
HQ Speech | 44,1 | 16 | 1 | 0,673 | 5,0
PC-Quality | 22,05 | 16 | 2 | 0,673 | 5,0
Low-End-PC | 22,05 | 8 | 2 | 0,336 | 2,5
LQ-Stereo | 11 | 8 | 2 | 0,168 | 1,3
LQ Speech | 11 | 8 | 1 | 0,084 | 0,6

Table 3. Throughput and storage requirements for audio data.
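The raw figures in Tables 1–3 follow directly from the stream parameters. The small C sketch below (added here for illustration, not part of the original measurement setup) reproduces the calculation; it assumes, as the tables apparently do, binary prefixes (1 Mbps = 2^20 bit/s, 1 GB = 2^30 bytes, 1 MB = 2^20 bytes).

```c
#include <stdio.h>

/* Raw video bit rate in bit/s: width x height x frame rate x bit depth. */
static double video_bps(double w, double h, double fps, double depth)
{
    return w * h * fps * depth;
}

/* Raw audio bit rate in bit/s: sampling rate x bit depth x channels. */
static double audio_bps(double rate_hz, double depth, double channels)
{
    return rate_hz * depth * channels;
}

int main(void)
{
    /* Mikrotron MC1310/MC1311 from Table 1: 1280x1024, 500 fps, 10 bit */
    double v = video_bps(1280, 1024, 500, 10);
    printf("video: %.0f Mbps, %.2f GB per 60 s\n",
           v / (1 << 20),                  /* 6250 Mbps */
           v * 60.0 / 8.0 / (1UL << 30));  /* ~45,78 GB */

    /* Studio 32/192/128 from Table 3: 192 kHz, 32 bit, 128 channels */
    double a = audio_bps(192000, 32, 128);
    printf("audio: %.3f Mbps, %.0f MB per 60 s\n",
           a / (1 << 20),                  /* 750,000 Mbps */
           a * 60.0 / 8.0 / (1 << 20));    /* 5625 MB */
    return 0;
}
```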

IV.2.1.3 Fast and simple lossless encoding

Given such huge storage requirements, compression should be employed in order to store the captured volumes of data efficiently. However, the throughput requirement must not be neglected. Hence, the trade-off between compression efficiency and algorithm speed must be considered.

In general, a slower algorithm generates smaller output than a faster one due to its more sophisticated compression scheme –the algorithm itself is meant here, not the quality of a particular implementation– including steps such as motion estimation and prediction, scene-change detection, or the decision on a frame type or macro-block type. Unfortunately, slow and complex algorithms cannot be used in the capture process, because their motion estimation and prediction consume too many resources to permit recording of visual signals as provided by high-quality, very high-quality or high-speed digital cameras. Besides, some slow algorithms are not even able to process video from HDTV cameras in real time. What is more, the implementations of the compression algorithms usually


are still best-effort implementations dedicated to non-real-time systems, and thus they merely happen to work in real time (or rather "on time"), i.e. they produce the compressed data fast enough by processing a sufficient number of frames or samples per second on expensive hardware. Secondly, there is very little or no control over the process of on-line encoding. Moreover, and in spite of expectations, some implementations are not even able to handle huge amounts of data at high resolution and high frame rate, or at high sample rate with multiple streams, in real time even on modern hardware.

So, the only solution is to capture the input media signals with a fast and simple compression scheme, which does not contain complicated functions, is easy to control and at least delivers the data on time (preferably running on a real-time system). A very good candidate for such compression of video data is the HuffYUV algorithm proposed by [Roudiak-Gould, 2006]. It is a lossless video codec employing a selectable prediction scheme and Huffman entropy coding. It predicts each sample separately, and the entropy coder encodes the resulting error signal by selecting the most appropriate predefined Huffman table for a given channel (it is also possible to specify an external table). There are three possible prediction schemes: left, gradient and median. The left scheme predicts from the previous sample of the same channel; it is the fastest one but in general delivers the worst results. The gradient method predicts from a calculation of three values – Left plus Above minus AboveLeft – and is a good trade-off between speed and compression efficiency. The median scheme predicts the median of three values: Left, Above and the gradient predictor; it delivers the maximal compression but is the slowest one (yet still much faster than complex algorithms with motion prediction, e.g. DCT-based codecs [ITU-T Rec. T.81, 1992]). The HuffYUV codec supports the YUV 4:2:2 (16 bpp), RGB (24 bpp28) and RGBA (RGB with alpha channel; 32 bpp) color schemes. With respect to the YUV color scheme, the UYVY and YUY2 ordering methods are allowed. As the author claims, a computer "with a modern processor and a modern IDE hard drive should be able to capture CCIR 601 video at maximal compression (…) without problems" [Roudiak-Gould, 2006]29. A known limitation of HuffYUV is that the resolution must be a multiple of four.
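To make the three prediction schemes concrete, the following C sketch shows HuffYUV-style predictors for a single 8-bit channel. It is only an illustration under stated assumptions: buffer layout, function names and the border handling are chosen for clarity and do not reproduce the actual codec's channel interleaving or sign handling.

```c
#include <stdint.h>

/* Sketch of the three HuffYUV-style predictors for one 8-bit channel.
 * row is the current line, prev_row the previously coded line (or NULL
 * for the first line); names and layout are illustrative only. */

static uint8_t predict_left(const uint8_t *row, int x)
{
    return x > 0 ? row[x - 1] : 0;          /* previous sample, same channel */
}

static int predict_gradient(const uint8_t *row, const uint8_t *prev_row, int x)
{
    int left       = x > 0 ? row[x - 1] : 0;
    int above      = prev_row ? prev_row[x] : 0;
    int above_left = (prev_row && x > 0) ? prev_row[x - 1] : 0;
    return left + above - above_left;       /* Left + Above - AboveLeft */
}

static uint8_t predict_median(const uint8_t *row, const uint8_t *prev_row, int x)
{
    int a = predict_left(row, x);
    int b = prev_row ? prev_row[x] : 0;
    int c = predict_gradient(row, prev_row, x);
    if (a > b) { int t = a; a = b; b = t; } /* ensure a <= b              */
    c = c < a ? a : (c > b ? b : c);        /* median of the three values */
    return (uint8_t)c;
}

/* The entropy coder then encodes only the prediction error (modulo 256). */
static uint8_t prediction_error(uint8_t sample, int prediction)
{
    return (uint8_t)(sample - prediction);
}
```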

28 bpp stands for bits per pixel.
29 Some comparison tests have been carried out within the RETAVIC project. For details, the reader is referred to [Militzer et al., 2005].


According to [Suchomski et al., 2006], from among the lossless audio formats available on the market, WavPack [WWW_WP, 2006] and the Free Lossless Audio Codec (FLAC) [WWW_FLAC, 2006] are good candidates for use in the real-time capturing phase. Figure 5 shows a comparison of decoding speed and compression size of some lossless audio codecs, measured on the EBU Sound Quality Assessment Material (SQAM) [WWW_MPEG SQAM, 2006] and on a private set of music samples (Music) [WWW_Retavic - Audio Set, 2006].

[Figure 5: scatter plot of decoding speed (× real-time, logarithmic scale from 1 to 1000) versus compression rate (25% to 70%) for the SQAM and Music test sets, comparing SLS (with 64/256 kbps core and without core), WavPack, Monkey's Audio, FLAC and OptimFROG.]

Figure 5. Comparison of compression size and decoding speed of lossless audio codecs [Suchomski et al., 2006].

It has been observed in [Suchomski et al., 2006] that the codecs achieve different compression rates depending on the content of the input samples, as well as different coding speeds. The best results in terms of speed are achieved by FLAC, whose decoding is about 150 to 160 times faster (on the used test bed) than required for real-time processing. In contrast, WavPack is only about 100 times faster. In terms of compression rate, however, FLAC does not achieve the best results. FLAC is rich in features: it supports bit depths from 4 to 32 bits per sample, sampling rates up to 192 kHz and currently up to 8 audio channels. However, FLAC offers no kind of scalability feature. As the source code of the FLAC libraries is licensed under the BSD license, modifications would be possible without licensing problems. Analogous behavior with respect to the encoding speeds may also be observed, but further investigation should be conducted.


IV.2.1.4 Media buffer as temporal storage

Given the extreme throughput and storage requirements presented in the previous subsection, a solution for storing the audio and video data in real time must be provided. It would be possible to store them directly in the MMDBMS; however, this direct integration is not investigated further with respect to the RETAVIC architecture. One of the arguments for keeping a separate media buffer is the non-real-time preparation phase (Phase 2), which does not allow any real-time processing and which follows Phase 1 and precedes storing the media data in the MMDBMS (Phase 3).

Thus, the solution proposed to handle the storing process defines a media buffer as intermediate temporary storage for the captured media data, associated with the real-time capturing phase. This prevents losing the data and keeps it for further complex analysis and preparation in the subsequent Phase 2. Of course, its content is only temporary and is to be removed after the required steps of Phase 2 have been executed.

Different hardware storage solutions and offered throughput
The media buffer should have a few characteristics derived from the capturing requirements, namely: real-time processing, an appropriate size with respect to the application, and enough bandwidth according to the properties of the grabbed media data (allowing the required throughput). A few hardware solutions are compared in Table 4. The caches on the different levels (L1, L2, L3) are omitted from the MEM class because their small sizes make them inapplicable as a media buffer.

The MEM class is non-permanent memory – the most temporary storage of all four classes given in Table 4, since its content is erased when the power is switched off. It is divided into: simple RAM, dual-channel RAM and NUMA. Simple RAM (random access memory) is delivered in hardware as single in-line memory modules (SIMM) or dual in-line memory modules (DIMM). The best-known representatives of SIMMs are DRAM SIMM (30)30 and FPM31 SIMM (72). Many versions of DIMMs are available: DDR2 SDRAM (240), DDR32

30 The number in brackets gives the number of pins present on the module.
31 FPM – Fast Page Mode (DRAM), optimized for successive reads and writes (bursts of data).
32 DDR – Double Data Rate (of the front-side bus), achieved by transferring data on both the rising and falling edges of the clock.


SDRAM33 (184), (D)RDRAM34 (184), FPM (168), EDO35 (168), SDRAM (168). There are also versions for notebooks called SODIMMs, which include: DDR SDRAM (200 pins), FPM (144), EDO (144), SDRAM (144), FPM (72), EDO (72). Only DDR2 SDRAM is referred to in Table 4, because these modules are the fastest of all listed above.

Class | Type | Example | Peak Bandwidth [MB/s]
MEM | Simple RAM | DIMM DDR2 64bit/800MHz | 6104
MEM | Dual-Channel RAM | 2x DIMM DDR2 64bit/800MHz | 12207
MEM | NUMA | HP Integrity SuperDome | 256000
NAS | Ethernet | 10GbE | 1250
NAS | Channel Bonded Ethernet | 2x 10GbE | 2500
NAS | Myrinet | 4th Generation Myrinet | 1250
NAS | Infiniband | 12x Link | 3750
SAN | SCSI over Fiber Channel | FC SCSI | 500
SAN | SATA over Fiber Channel | FC SATA | 500
SAN | iSCSI | 10GbE iSCSI | 1250
SAN | AoE | 10GbE AoE | 1250
DAS | RAID SAS | 16x SAS Drives Array | 300
DAS | RAID SATA | 16x SATA Drives Array | 150
DAS | RAID SCSI | 16x Ultra-320 SCSI Drives Array | 320
DAS | RAID ATA/ATAPI | 16x ATAPI UDMA 6 Drives Array | 133

Table 4. Hardware solutions for the media buffer.36

Dual-channel RAM is a technology for bonding two memory modules into one logical unit. A present-day motherboard chipset may support two independent memory controllers allowing simultaneous access to the RAM modules; thus, one 64-bit channel may be defined for upstream data and the other for downstream data (IN and OUT), or both channels may be bonded into one 128-bit channel (either IN or OUT).

33 SDRAM – Synchronous Dynamic RAM, i.e. it is synchronized with the CPU speed (no wait states are present).
34 (D)RDRAM – (Direct) Rambus DRAM, a type of SDRAM proposed by Rambus Corp.; however, its latency is higher than that of DDR/DDR2 SDRAM, and its heat output is higher, requiring special metal heat spreaders.
35 EDO – Extended Data Out (DRAM), about 5% faster than FPM, because a new cycle starts while data output from the previous cycle is still in progress (overlapping of operations, i.e. pipelining).
36 This table includes only raw values of the specified hardware solutions, i.e. no protocol, communication or other overheads are considered. In order to take them into account, measurements of user-level or application-level performance must be conducted.


NUMA is Non-Uniform Memory Access (sometimes Architecture), defined as a computer memory design with hierarchical access (i.e. the access time depends on the memory location relative to a processor) and used in multiprocessor systems. A CPU can thus access its own local memory faster than non-local (remote) memory (which is memory local to another processor, or shared memory). ccNUMA is cache-coherent NUMA, in which keeping the caches of the same memory regions consistent is critical; this is usually solved by inter-process communication (IPC) between cache controllers. As a result, ccNUMA performs badly on applications that access the same memory regions in rapid succession. A huge advantage of NUMA and ccNUMA is the much bigger memory space available to the application while still providing very high bandwidth, known as the scalability advantage over symmetric multiprocessors (SMPs – where it is extremely hard to scale beyond 8-12 CPUs). In other words, it is a good trade-off between Shared-Memory MIMD37 and Distributed-Memory MIMD.

The differences between the next classes of storage, namely NAS, SAN and DAS, are depicted in Figure 6, which comes directly from the well-known Auspex Storage Architecture Guide38 [Auspex, 2000].

Figure 6. Location of the network determines the storage model [Auspex, 2000].

37 MIMD – Multiple Instruction Multiple Data (in contrast to SISD, MISD, or SIMD).
38 Auspex is commonly recognized as the first true NAS company, established before the term NAS had even been defined.


The Network Attached Storage (NAS) class is a network-based storage solution and, in general, it is provided by a network file system. The network infrastructure is used as the base layer for a storage solution provided by clusters (e.g. Parallel Virtual File System [Carns et al., 2000], Oracle Clustered File System V2 [Fasheh, 2006], Cluster NFS [Warnes, 2000]) or by dedicated file servers (e.g. software-based SMB/CIFS or NFS servers, Network Appliance Filer [Suchomski, 2001], EMC Celerra, Auspex Systems). NAS may be based on different network types regardless of the type of physical connection (glass fiber or copper), such as Ethernet, Channel Bonded Ethernet, Myrinet or Infiniband. Thus the bandwidth presented in Table 4 is given for each of the mentioned network types without consideration of the overhead introduced by the file-sharing protocol of a specific network file system39. Ethernet is the hardware base of the Internet and its standardized bandwidth ranges from 10 Mbps, through 100 Mbps and 1 Gbps (1GbE), up to the newest 10 Gbps (10GbE). Channel Bonded Ethernet is analogous to dual-channel RAM but refers to two NICs coupled logically into one channel. Myrinet was the first network allowing bandwidth up to 1 Gbps; the current fourth-generation Myrinet supports 10 Gbps. In general, it has two fiber-optic connectors (upstream/downstream). Myricom Corp. offers NICs which support both 10GbE and 10 Gbps Myrinet. Infiniband uses two types of connectors: the Host Channel Adapter (HCA), used mainly for IPC, and the Target Channel Adapter (TCA), used mainly in I/O subsystems. Myrinet and Infiniband are low-latency and low-protocol-overhead networks in comparison to Ethernet.

SAN stands for Storage Area Network and is a hardware solution delivering high I/O bandwidth. It is more often associated with block I/O services (block-based storage in analogy to SCSI/EIDE) than with file access services (a higher abstraction level). It usually uses one of four types of network for block-level I/O: SCSI over fiber channel, SATA over fiber channel, iSCSI, or the recently proposed ATA over Ethernet (AoE) [Cashin, 2005]. The SCSI40/SATA41 over fiber

39 A network file system resides on a higher logical level of the ISO/OSI network model, so in order to measure its efficiency, an evaluation at the user or application level must be carried out, considering the bandwidth of the underlying network layer.
40 SCSI is the Small Computer System Interface, the most commonly used 8- or 16-bit parallel interface for connecting storage devices in servers. It ranges from SCSI (5 MB/s) up to Ultra-320 SCSI (320 MB/s). The newer SAS (Serial Attached SCSI) is also available, being much more flexible (hot-swapping, improved fault tolerance, higher number of devices) and still achieving 300 MB/s.
41 SATA stands for Serial Advanced Technology Attachment (in contrast to the original ATA, which was parallel and is now referred to as PATA) and is used to connect devices with a bandwidth of up to 1,5 Gbps (but due to the 8b/10b encoding on the physical layer


channel usually provides a bandwidth of 1 Gbps, 2 Gbps or 4 Gbps and requires the use of a special Host Bus Adapter (HBA). iSCSI is Internet SCSI (or SCSI over IP), which may use a standard Ethernet NIC (e.g. 10GbE) as the medium. AoE is a network protocol designed for accessing ATA42 storage devices over an Ethernet network, thus enabling cheap SANs based on low-cost, standard technologies. AoE simply puts ATA commands into low-level network packets (replacing the ATA ribbon – the wide 40- or 80-line cable – by an Ethernet cable). Its only drawback (but also a design goal) is that it is not routable beyond the local network (and thus very simple).

Direct Attached Storage (DAS), or simply local storage43, consists of devices (disk drives, disk arrays, RAID arrays) attached directly to the computer by one of the standardized interfaces such as ATA/ATAPI, SCSI, SATA or SAS. Only the fastest representatives of each of the mentioned standards are given in Table 4. Each hardware implementation of a given standard has its limits, in contrast to SATA/SCSI over fiber channel, iSCSI or AoE, where just the protocols are implemented and a different physical layer providing higher transfer speed is used as the carrier.

Evaluation of storage solutions in the context of RETAVIC
From the RETAVIC perspective, each solution has advantages and disadvantages. Memory is very limited in size unless the very expensive NUMA technology is used, which allows for scalability. Besides, its non-permanent characteristic demands that the grabbing hardware stays on-line until all captured data has been processed by Phase 2. An advantage is definitely the ease of implementing real-time support (it might be supported directly in the RTOS kernel without special drivers), and it has extremely high bandwidth – some applications, like capturing with high-speed cameras, may only be realized by using memory directly (e.g. the Mikrotron MC1311 requires at least 6,25 Gbps of bandwidth without overhead). DAS is limited in scalability

it achieves only 1,2 Gbps). The newer version (SATA-2) doubles the bandwidth (3 Gbps/2,4 Gbps respectively), and a third version under development should allow a four times higher bandwidth (6 Gbps/4,8 Gbps).
42 ATA is also commonly referred to as Integrated Drive Electronics (IDE), Enhanced IDE (EIDE), ATA Packet Interface (ATAPI), Programmed Input/Output (PIO), Direct Memory Access (DMA), or Ultra DMA (UDMA), which is incorrect, because ATA is a standard interface for connecting devices; IDE/EIDE only uses ATA, and PIO, DMA and UDMA are different data access methods. ATAPI is an extension of ATA and has replaced it. ATA/ATAPI ranges from 2,1 MB/s up to 133 MB/s (with UDMA 6).
43 DAS is also called captive storage or server-attached storage.


(compared to NAS or SAN); however, real-time support could be provided with only some effort (a block-access driver supporting the real-time kernel usually must be implemented for each specific solution). Moreover, it has a good price-to-benefit ratio. SAN is a scalable and reliable solution which used to be very expensive; however, with the introduction of AoE it seems to be approaching the prices of DAS. Still, its real-time support requires a sophisticated implementation within the RTOS, considering the network driver coupled with the block-access method (additionally, the network communication between the server and the SAN device must be real-time capable). NAS is a relatively cheap and very scalable solution; however, it demands an even more sophisticated implementation to provide real-time capabilities, because none of the existing network file-sharing systems provides real-time and QoS control (due to the missing real-time support in the network layer used as the base layer for the network file system).

Due to this variety of cost-versus-efficiency characteristics, the final decision on using one of the given solutions is left to the system designer exploiting the RETAVIC architecture. For the application within this project, NAS has been chosen as most suitable for testing purposes (limited size / high speed / ease of use).

IV.2.2. Non-real-time Preparation

Phase 2 is responsible for the insertion and update of data in the RETAVIC architecture, i.e. the actual step of putting the media objects into the media server is performed in this non-real-time preparation part. Its main goal is the conversion to a particular internal format, i.e. converting the input media from the source storage format into the internal storage format that is most suitable for real-time conversion. Secondly, it performs a content analysis and stores the meta-data. Additionally, it is able to archive the origin source, i.e. to keep the input media data in the origin format.

Phase 2 does not require any real-time processing and thus may use best-effort implementations of the format conversion and complex analysis algorithms. The drawback of the missing real-time requirement is that an implementation of the architecture cannot guarantee the time needed to conduct the insert or update operations, i.e. they can only be executed in a best-effort manner with respect to the controlling mechanism (e.g. as fast as possible according to the selected thread/process/operation priority). Moreover, the transactional behavior


of the insert and update operations must be provided regardless of whether the real-time properties are considered critical or not.

IV.2.2.1 Archiving origin source

The archiving origin source module is optional and does not have to exist in every scenario; however, it is required in order to cover all possible applications. This module was introduced in the latest version of the proposed RETAVIC architecture (Figure 4) due to two requirements: exact bit-by-bit preservation of the origin stream, and even more flexible application in the real world, allowing simpler delivery with smaller costs in certain use cases (this also required changes in Phase 4). Moreover, where meta-data is embedded in the source encoded bitstream, this meta-information could be dropped by the decoding process; it is now preserved by keeping all the original bits and bytes whenever required.

The first goal is achieved because the source media bit stream is simply stored completely as the origin binary stream regardless of the format used, in analogy to the well-known binary large objects (BLOBs). The problem noticed here is the varying lossiness of the decoding process, i.e. the amount of noise in the decoded raw representation always varies due to different decoder implementations (e.g. with dissimilar rounding functions), even if the considered decoders operate on the same encoded input bitstream. This problem was neglected in earlier versions of the RETAVIC architecture due to the assumption that the media data were regarded as the source only after being decoded from a lossy format by a lossy decoder. Now, however, this problematic aspect is additionally handled by preserving a bit-by-bit copy of the source (encoded) data. Of course, using only BLOBs without any additional (meta) information would be impossible, but in the context of the proposed architecture it makes sense because of the existing relationship between the BLOB and its well-understandable representation in the scalable internal format.

The second goal of achieving higher application flexibility is addressed by giving the user the opportunity to access the source bitstream in its origin format. If the proposed archiving origin source module were not present, the origin format would have to be produced in the real-time transcoding phase like every other requested format. By introducing the module, however, the process may be simplified and the transcoding phase may be skipped. On the other


hand, the probability of a user requesting a format exactly identical to the source format is extremely small. Thus it may be useful only in applications where the requirement of preserving the bitwise origin source is explicitly stated.

IV.2.2.2 Conversion to internal format

Here the integration of media objects is achieved by importing different formats, decoding them to a raw format (uncompressed representation) and then encoding them into the internal format. This integration is depicted as the conversion module in the middle of Phase 2 in Figure 4. The media source can either be a lossless binary stream from the media buffer of Phase 1, described in the previous section IV.2.1, or an origin media file (also called the origin media source) delivered to the MMDBMS in digital (usually compressed) form from the outside environment.

If the media data come from the media buffer, the decoding is fairly simple because the format of the lossless binary stream grabbed by Phase 1 is exactly known. Hence, the decoding process is just the inverse of the previously defined fast and simple lossless encoding (third subsection of IV.2.1 Real-time Capturing) and is similarly easy to handle.

If the media data come from the outside environment as a media file, the decoding is more complex than in the previous case, because it requires an additional step of detecting the storage format with all necessary properties (whenever the format with the required parameters was not specified explicitly), and then the correct decoder for the detected/given digital representation of the media data has to be chosen. Finally, the decoding is executed. Problems may appear if the format can be decoded by more than one available decoder; in this case, a selection scheme must be applied. Some methods have already been proposed, e.g. by Microsoft DirectShow or the Java Media Framework. Both schemes allow manual or automatic selection of the decoders used44.

44 A semi-automatic method would also be possible, where the media application decides on its own if there is just one possible decoder and asks the user for a decision if there is more than one. However, the application must handle this additional use case.


Secondly, this part employs an encoding algorithm to produce a lossless scalable binary stream (according to the next section IV.2.3) and decoder-related control meta-data for each continuous medium separately. A detailed explanation of each media-specific format and its encoder is given in the following chapters: for video in V and for audio in VI. One important characteristic of employing a media-specific encoder is the possibility to exchange (or update) the internal storage format for each medium at will, simply by switching the current encoding algorithm to a new one within the Conversion to internal format module. Moreover, such an exchange should not influence the decoding part of this module, but must be done in coordination with Phase 3 and Phase 4 of the RETAVIC architecture (additional information on how to do this is given in section IV.3.2).

IV.2.2.3 Content analysis

The last but not least important aspect of feeding media data into the media server is the content analysis of these data. This step looks into a media object (MO) and its content, and based on that produces the meta-data (described in the next section) which are required later in the real-time transcoding part (section IV.2.4). The content analysis produces only a certain set of meta-data. The set is specific to each media type, and the proposed MD sets are presented later; however, these MD sets are neither finished nor closed – so let us call them initial MD sets. An MD set may be extended, but due to the close relation between the content analysis and the produced meta-data set, the content analysis has to be extended in parallel.

Since the conversion into the internal storage format and the content analysis of the input data need to be performed just once upon import of a new media file into the system, the resource consumption of the non-real-time preparation phase is not critical for the behavior of the media server. As a result, the mentioned operations are designed as best-effort non-real-time processes that can run at the lowest priority in order not to negatively influence the resource-critical real-time transcoding phase.

IV.2.3. Storage

Phase 3 differs from all other phases in one point, namely it is not a processing phase, i.e. no processing of media data or meta-data is performed here. It would be possible to employ some querying or information retrieval here, but this was clearly defined as out of the scope of the


RETAVIC architecture. Moreover, the access methods and I/O operations are also not research points to be dealt with. It is simply taken for granted that an established storage solution is provided, analogous to one of the usable storage methods like SAN, NAS or DAS – these methods have already been described in detail in the subsection Media buffer as temporal storage (of section IV.2.1).

Moreover, the set of storage methods may be extended by considering higher-level representations of data, e.g. instead of file- or block-oriented methods it may be possible to store the media data in another multimedia database management system or on a media server addressed by a uniform resource identifier (URI), or anything similar.

However, the chosen solution must also offer well-controllable real-time media data access. Thus, the RETAVIC architecture does not limit the storage method beyond requiring real-time support for media access (similar to the real-time capturing), i.e. real-time I/O operations must be provided, e.g. write/store/put/insert&update and read/load/get/select. This real-time requirement may be hard to implement from the hardware or operating-system perspective [Reuther et al., 2006], but again it is the task of other projects to solve the problem of real-time access to the storage facilities. A few examples of research on real-time access with QoS support can be found in [Löser et al., 2001a; Reuther and Pohlack, 2003].

IV.2.3.1 Lossless scalable binary stream

All media objects are to be stored within the RETAVIC architecture as lossless scalable streams to provide application flexibility, efficiency, and scalability of access and processing. There are a few reasons for storing the media data in such a way.

First, there are applications that require lossless storage of audio and video recordings. They would not be supported if the architecture assumed a lossy internal format from the beginning in its design postulation. Conversely, undemanding applications, which do not require lossless preservation of information, can still benefit from the losslessly stored data by extracting just a subset of the data or by lossy transcoding. Hence, the lossless requirement for the internal storage format allows all application fields to be covered, and thus brings application flexibility to the architecture. From another perspective, every DBMS is obliged


to preserve the information as it was stored by the client and deliver it back without any changes. If a lossy internal format were used, such as the FGS-based video codecs described in section III.1.1.1, information would be lost during the encoding process. Moreover, the introduced noise would grow with every update due to the well-known tandem coding problem [Wylie, 1994] of lossy encoders, even if the decoding did not introduce any additional losses45.

Considering application flexibility and information preservation on the one hand, and efficiency and scalability of access and processing on the other, the only possible solution is to make a lossless storage format scalable. Lossless formats have been designed for their lossless properties and hence are not as efficient in compression as lossy formats, which usually exploit perceptual models of the human being46. Moreover, lossless codecs are usually unscalable, e.g. those given in section III.1.1.2, and process all or nothing, i.e. the origin data is stored in one coded file, which requires that the system reads the stored data completely and then also decodes it completely; there is no way to access and decode just a subset of the data. Because the compressed size delivered by lossless codecs usually ranges from 30% to 70% of the original, due to the requirement of information preservation, a relatively large amount of data always has to be read, which is not required in all cases (lossless information is not always needed).

By introducing scalability (e.g. by data layering) into the storage format, scalability of access is provided, i.e. the ability to access and read just a subset of the compressed data (e.g. just one layer). Moreover, it also allows scalability of processing by handling just this smaller set of input data, which is usually faster than dealing with the complete set. As a side effect, the scalability of the binary storage format also provides lossy information (by delivering a lower quality of the information) at a smaller compressed size; for example, just one tenth of the compressed data may be delivered to the user. Examples of scalable and lossless coding

45 The tandem coding problem appears if the data is first selected from the MMDBMS and decoded, and then encoded and updated in the MMDBMS. Of course, if the update occurs without selecting and decoding the media data from the MMDBMS, but by getting them from a different source, the tandem coding problem no longer applies.
46 In any case, a comparison of lossless to lossy encoders does not really make sense, because the lossless algorithms will always lose with respect to compression efficiency.


schemes usable for video and audio data are discussed in the Related Work in sections III.1.1.3 and III.1.2.
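To make the notion of scalability of access more concrete, the following C sketch shows how a layered, losslessly stored bitstream could be read selectively. The layer index structure and the function are hypothetical illustrations under the assumption that each layer's offset and size are recorded next to the bitstream; they do not describe the actual RETAVIC storage layout.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-frame layer index for a layered lossless bitstream.
 * Reading only the first n layers realizes scalability of access:
 * fewer bytes are read and decoded when a lower quality is sufficient. */
typedef struct {
    long   offset;   /* byte offset of the layer within the stream */
    size_t size;     /* compressed size of the layer in bytes      */
} LayerIndexEntry;

/* Read the first n_wanted layers of one frame into a caller-provided buffer.
 * Returns the number of bytes actually read, or -1 on error. */
static long read_layers(FILE *stream, const LayerIndexEntry *index,
                        int n_layers, int n_wanted, unsigned char *buf)
{
    long total = 0;
    if (n_wanted > n_layers)
        n_wanted = n_layers;
    for (int i = 0; i < n_wanted; i++) {
        if (fseek(stream, index[i].offset, SEEK_SET) != 0)
            return -1;
        if (fread(buf + total, 1, index[i].size, stream) != index[i].size)
            return -1;
        total += (long)index[i].size;
    }
    return total;   /* e.g. base layer only: total is far below the full size */
}
```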

IV.2.3.2 Meta data

Multimedia transcoding, and especially audio and video processing, is a very complex and unpredictable task. The algorithms cannot be controlled during execution with respect to the amount of processed data and the time constraints, because their behavior depends on the content of the audio and video, i.e. the functional properties are defined with respect to coding/compression efficiency as well as to quality according to the human perception system – the human hearing system (HHS) for audio information [Kahrs and Brandenburg, 1998] and the human visual system (HVS) for video perception, respectively [Bovik, 2005]. Due to the complexity and unpredictability of the algorithms, the idea of using meta-data (MD) to ease the transcoding process and to make its execution controllable is a core solution used in the RETAVIC project.

As already mentioned, the MD are generated during the non-real-time preparation (Phase 2) by the process called content analysis. The content analysis looks into a media object (MO) and its content, and based on that produces two types of MD: static and continuous. The two MD types have different purposes in the generic media transformation framework.

The static MD describe the MO as it is stored and hold information about the structure of the video and audio binary streams. Thus, static MD keep statistical and aggregated data allowing an accurate prediction of the resource allocation for the media transformation in real time. They must therefore be available before the actual transcoding starts. However, static MD are not required anymore during the transcoding process itself, so they may be stored separately from the MO.

The continuous MD are time-dependent (like the video and audio data themselves) and are to be stored together with the media bit stream in order to guarantee real-time delivery of the data. The continuous MD are meant to help the real-time encoding process by feeding it with the pre-calculated values prepared by the content analysis step. In other words, they are required in order to reduce the complexity and unpredictability of the real-time encoding process (as explained in the subsection Real-time transcoding of Section IV.2.4).


A noticeable fact is that the size of the static MD is very small in comparison to the continuous MD. Moreover, the static MD are sometimes referred to as coarse-granularity data due to their aggregative properties (e.g. the number of existing I-frames, the total number of audio samples), while the continuous MD are correspondingly called fine-granularity data because of their close relation to each quantum (or even to a part of a quantum) of the MO.
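The distinction between the two MD types could be reflected in data structures roughly as sketched below. The concrete fields are illustrative assumptions only; the actual MD sets are defined later in chapters V and VI.

```c
#include <stdint.h>

/* Coarse-granularity (static) MD: aggregated per media object, small,
 * needed before transcoding starts for resource-allocation prediction.
 * Field names are illustrative only. */
typedef struct {
    uint32_t frame_count;     /* total number of video frames       */
    uint32_t i_frame_count;   /* number of intra-coded frames       */
    uint64_t sample_count;    /* total number of audio samples      */
    uint32_t max_frame_bits;  /* worst-case compressed frame size   */
    double   avg_decode_cost; /* estimated decoding cost per frame  */
} StaticMD;

/* Fine-granularity (continuous) MD: one record per quantum (frame or
 * sample block), stored and delivered together with the bitstream. */
typedef struct {
    uint64_t timestamp;       /* presentation time of the quantum    */
    uint8_t  frame_type;      /* e.g. I/P/B decision from Phase 2    */
    uint32_t coded_size;      /* compressed size of this quantum     */
    uint32_t est_cycles;      /* predicted processing cost           */
    /* pre-computed hints such as motion vectors would follow here   */
} ContinuousMD;
```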

IV.2.4. Real-time Delivery

The last but not least phase is Phase 4. The real-time delivery is divided into two processing channels. The first one is real-time transcoding, which is the key task for achieving format independence in the RETAVIC architecture. The second one is an extension of the earlier proposed version of the architecture and provides real-time direct delivery of the stored media objects to the client application. Both processing channels are further described in the following subsections.

IV.2.4.1 Real-time transcoding

This part finally provides format independence of the stored media data to the client application. It employs media transcoding that meets real-time requirements. The processing idea is derived from the converter graph, which extends the conversion chains and is an abstraction of analogous technologies like DirectX/DirectShow, the processors with controls of the Java Media Framework (JMF), media filters, and transformers (for details see II.4.1 Converters and Converter Graphs). However, the processing in the RETAVIC architecture is fed with additional processing information, or in other words it is based on additional meta-data (discussed in Meta data in section IV.2.3). The MD are used to control the real-time transcoding process and to reduce its complexity and unpredictability.

The real-time transcoding is divided into three subsequent tasks –using different classes of converters– namely: real-time decoding, real-time processing, and real-time encoding. The tasks representing the converters use pipelining to process media objects, i.e. they pass so-called media quanta [Meyer-Wegener, 2003; Suchomski et al., 2004] between the consecutive converters.
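A minimal sketch of such a converter chain is given below. The uniform interface, the structure names and the sequential push are illustrative assumptions; a real pipeline would run the converters concurrently under real-time scheduling.

```c
#include <stddef.h>

/* A media quantum passed between converters (a frame or a block of samples). */
typedef struct {
    void   *data;
    size_t  size;
    long    timestamp;
} Quantum;

/* Uniform converter interface: decoder, processor and encoder all implement
 * it, so they can be chained arbitrarily. Illustrative only. */
typedef struct Converter Converter;
struct Converter {
    /* consume one input quantum, produce (at most) one output quantum */
    int (*push)(Converter *self, const Quantum *in, Quantum *out);
    void *state;
    Converter *next;   /* next converter in the pipeline */
};

/* Push a quantum through the whole chain (decode -> process -> encode). */
static int pipeline_push(Converter *head, const Quantum *in, Quantum *final_out)
{
    Quantum cur = *in, next;
    for (Converter *c = head; c != NULL; c = c->next) {
        if (c->push(c, &cur, &next) != 0)
            return -1;   /* converter failed or missed its deadline */
        cur = next;
    }
    *final_out = cur;
    return 0;
}
```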


Real-time decoding
First, the stored bitstreams –this applies to both audio and video data– must be read and decoded, resulting in the raw intermediate format, i.e. uncompressed media-specific data, e.g. a sequence of frames47 with pixel values of luminance and chrominance48 for video or a sequence of audio samples49 for audio. The layered internal storage formats allow the binary media data to be read selectively, so the system is used efficiently, because only the data required for decoding in the requested quality are read. The reading operation must of course support real-time capabilities in this case (as mentioned in the Storage section IV.2.3). Then the meta-data necessary for the decoding processes are read accordingly; depending on the MD type, real-time capabilities are required or not. Next, the media-specific decoding algorithms are executed. The algorithms are designed in such a way that they are scalable not only in data consumption, due to the layered internal formats, but also in processing, i.e. a bigger amount of data needs more processing power. Obviously, the trade-off between the amount of data, the amount of required computation, and the data quality must be considered in the implementation.

Real-time processing
Next, the media-specific data may optionally be processed, i.e. some media conversion [Suchomski et al., 2004] may be applied. This step covers converting operations on media streams with respect to real time. These conversions may be grouped into a few classes: quanta modification, quanta-set modification, quanta rearrangement and multi-stream operations. This grouping applies to both discussed media types: audio as well as video. The first group –quanta modification– refers to direct operations conducted on the content of every quantum of a media stream. Examples of video quanta modification are: color conversion, resizing (scaling up / scaling down), blurring, sharpening, and other video effects. Examples of audio quanta modification are: volume operations, high- and low-pass filters, and re-quantization (changing the bit resolution per sample, e.g. from 24 to 16 bits, gives a smaller possible set of values). Quanta-set modification considers actions on the set of quanta, i.e. the number of quanta in the stream is changed. Examples are conversion of the rate

47 See the term video stream in Appendix A.
48 It may be another uncompressed representation with separate values for each of the red, green and blue (RGB) colors. By default, the luminance and two chrominance values (red, blue) are assumed. Other modes are not forbidden by the architecture.
49 See the term audio stream in Appendix A.


of video frames (known as frames per second – fps), in which 25 fps are transformed to 30 fps (frame-rate upscaling) or 50 fps may be halved to 25 fps (frame-rate downscaling or frame dropping); analogously in audio, the sample rate may be changed (e.g. from the studio sampling frequency of 96 kHz to the CD standard of 44,1 kHz) by downscaling or simply dropping samples. The third category changes neither the quanta themselves nor the set of involved quanta, but the sequence of quanta with respect to time. Here time operations are involved, like double- or triple-speed play, slow (or stopped) play, but also frame reordering. The fourth proposed group covers operations on many streams at the same time, i.e. several streams are always involved on the input or output of the converter. The most suitable representatives are mixing (only of the same type of media) and stream duplication (providing an exact copy of the stream, e.g. for two outputs). Other examples cover the mux operation (merging different types of media), stream splitting (the opposite of mux), and re-synchronization.

In general, the operations mentioned above are linear operations and do not depend on the content of the media, i.e. it does not matter how the pixel values are distributed in a frame or whether the variation of sample values is high. However, the operations do depend on the structure of the raw intermediate format. For example, the number of input pixels, calculated from width and height, together with the output resolution, influences the time required for a resize operation. Similarly, the number of samples influences the amount of time spent on making a sound louder, but the current volume level does not affect the linear processing.
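As an illustration of such a linear quanta modification, the following C sketch combines a volume operation with re-quantization from 24 to 16 bits; its running time depends only on the number of samples, never on their values, which is exactly what makes these operations easy to schedule in real time. The function name and the 24-bit-in-int32 layout are assumptions for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* Linear quanta modification: scale the volume of a block of samples and
 * re-quantize from 24-bit (stored in int32_t) to 16-bit. The cost is O(n)
 * and independent of the sample values. */
static void volume_and_requantize(const int32_t *in24, int16_t *out16,
                                  size_t n, double gain)
{
    for (size_t i = 0; i < n; i++) {
        double v = (double)in24[i] * gain;   /* apply gain in the 24-bit domain */
        long   q = (long)(v / 256.0);        /* drop 8 bits: 24 bit -> 16 bit   */
        if (q >  32767) q =  32767;          /* clip to the 16-bit range        */
        if (q < -32768) q = -32768;
        out16[i] = (int16_t)q;
    }
}
```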

Real-time encoding
Finally, the media data are encoded into the user-requested output format, which involves the respective compression algorithm. Many different compression algorithms for audio and video are available. The best-known representative codecs are:

• For video

- MPEG-2 P.2: MPEG-4 (for MPEG-2), bitcontrol MPEG-2, Elecard MPEG-2, InterVideo, Ligos, MainConcept, Pinnacle MPEG-2
- MPEG-4 P.2: XviD (0.9, 1.0, 1.1), DivX (3.22, 4.12, 5.2.1, 6.0), Microsoft MPEG-4 (v1, v2, v3), (4.5), OpenDivX (0.3)


- H.264 / MPEG-4 P.10 (AVC) [ITU-T Rec. H.264, 2005]: , Mpegable AVC, H.264, MainConcept H.264, Fraunhofer IIS H.264, Ateme MPEG-4 AVC/H.264, Videosoft H.264, ArcSoft H.264, ATI H.264, Elecard H.264, VSS H.264
- 9
- QuickTime (2, 3, 4)
- Real Media Video (8, 9, 10)
- Motion JPEG
- Motion JPEG 2000

• For audio

- MPEG-1 P.3 (MP3): Fraunhofer IIS
- MPEG-2 P.3 (AAC): Fraunhofer IIS, Coding Technologies
- MPEG-4 SLS: Fraunhofer IIS
- AAC+: Coding Technologies
- Lame
-
- FLAC
- Monkey's Audio

All the mentioned codecs are provided for non-real-time systems such as MS Windows, Linux or Mac OS, in which the ratio of worst-case to average execution time of audio and video compression may reach even thousands, so accurate resource allocation for real-time processing is impossible with these standard algorithms. The variations in processing time are caused by the content analysis step (due to the complexity of the algorithms), i.e. for video these are motion-related calculations like prediction, detection and compensation, and for audio these are filter-bank processing (incl. the modified DCT or FFT) and perceptual-model masking [Kahrs and Brandenburg, 1998] (some results are given later for each specific medium in the analysis-related sections V.1 and VI.1). Within the analysis part of the codec, the most suitable representation of the intermediate data for the subsequent standard compression algorithms


(like RLE or Huffman coding) is found, in which the similarity of the data is further exploited (thus making the compression more efficient and leading to a smaller compressed size).

Thus, it is proposed in the RETAVIC architecture to separate the analysis step, and the unpredictability accompanying it, from the real-time implementation and to put it into the non-real-time preparation phase. Secondly, the encoder should be extended with support for MD, which allows the data produced by the analysis to be exploited. This idea is analogous to the two-pass method in compression algorithms [Bovik, 2005; Westerink et al., 1999], where the non-real-time codec first analyzes a video sequence entirely and then optimizes a second encoding pass of the same data by using the analysis results. A two-pass codec, however, uses internal structures for keeping the results and has no possibility to store them externally [Westerink et al., 1999]; the analysis is redone on each run even if the same video is used.
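The difference to a classical two-pass encoder can be sketched as follows: the analysis results are read from the stored continuous MD instead of being recomputed on-line. All names below are illustrative, and the stub stands in for a real codec call; this is a sketch of the idea, not the actual RETAVIC encoder.

```c
#include <stddef.h>

/* Per-frame hints pre-computed during content analysis (Phase 2). */
typedef struct {
    int frame_type;   /* e.g. I/P/B decision taken during content analysis */
    int quant_hint;   /* suggested quantizer                               */
    /* references to pre-computed motion vectors etc. would follow here    */
} FrameHints;

/* Placeholder for the actual encoder of the requested output format. */
static int encode_frame_with_hints(const void *raw_frame,
                                   const FrameHints *hints,
                                   void *out_bitstream)
{
    (void)raw_frame; (void)hints; (void)out_bitstream;
    return 0;
}

/* MD-assisted encoding loop: no motion estimation or frame-type decision
 * happens here, so the per-frame cost becomes far more predictable. */
static int encode_sequence(const void *const *raw_frames,
                           const FrameHints *hints,   /* continuous MD */
                           size_t n_frames,
                           void *out_bitstream)
{
    for (size_t i = 0; i < n_frames; i++) {
        if (encode_frame_with_hints(raw_frames[i], &hints[i], out_bitstream) != 0)
            return -1;
    }
    return 0;
}
```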

Finally, the transcoded media data are delivered outside the RETAVIC architecture – to the client application. The delivery may involve real-time encapsulation into a network protocol capable of delivering data under real-time constraints. The network problems and solutions, such as the Real-time Transport Protocol (RTP) [Schulzrinne et al., 1996], the Microsoft Media Server (MMS) protocol [Microsoft Corp., 2007a] or the Real-Time Streaming Protocol (RTSP) [Schulzrinne et al., 1998], are however not within the scope of the RETAVIC project and are not investigated further.

IV.2.4.2 Direct delivery

The second delivery channel is called bypass delivery, which is a direct delivery method. The idea here is very simple – the media data stored by Phase 3 of RETAVIC are read from the storage and delivered to the client application. Of course, real-time processing is necessary here, so real-time requirements for reading analogous to those of real-time decoding must be considered.

There are three possible scenarios in direct delivery:

1) Delivering internal storage format

2) Delivering origin source


3) Reusing processed media data

In order to deliver media data in the internal storage format, not much has to be done within the architecture. All required facilities are already provided by the real-time transcoding part: obviously, real-time reading and data provision to the outside of the system must already be supported there. So, bypass delivery should simply make use of them.

If one considers storing the origin source within an application scenario, the archiving origin source module proposed as optional (in Section IV.2.1) has to be included. Moreover, the capability of managing the origin source within the storage phase has to be developed; this capability covers the adaptation of the media data and meta-data models. Furthermore, if searching facilities are present, they must also support the origin source. Apart from these issues, the bypass delivery of the origin source is conducted in a similar way to that of the internal format.

The third possible scenario considers reusing the already processed media data and may be referred to as caching. This is depicted by the arrow leading back from the real-time encoding to the multimedia bitstream collection. The idea behind the reuse is to offer the possibility of storing the encoded media data after each request for a format not yet present in the multimedia bitstream collection, in order to speed up further processing of the same request at the cost of storage. As one can notice, the compromise between the costs of storage and the costs of processing is a crucial point which has to be considered, so the application scenario should define whether such a situation (of reusing the processed media data) is really relevant. If it is, extensions analogous to those for delivering the origin source have to be implemented, taking into account the linking of more than one media bitstream to a media object. Searching of already existing instances of the various formats also has to be provided.

IV.3. Evaluation of the General Idea

In contrast to the framework proposed by [Militzer, 2004] and [Suchomski et al., 2005], the extended architecture allows for both storing the origin source of the data and converting it to the internal storage format. The initial RETAVIC approach of keeping the origin source was completely rejected in [Militzer, 2004] and only the internal format was allowed. This rejection may, however, limit the potential application field of the RETAVIC framework, and somewhat


contradicts the MPEG-7 and MPEG-21 models, in which a master copy (origin instance) of the media object (media item) is allowed and supported. On the other hand, keeping the origin instance introduces higher redundancy50, but that is a trade-off between application flexibility and redundancy in the proposed architecture, which in my opinion is targeted well.

The newest proposal of the RETAVIC framework (Figure 4) is a hybrid of the previous assumptions regarding many origin formats and the single internal storage format, thus keeping all the advantages of the previous framework versions (as in [Militzer, 2004] and [Suchomski et al., 2005]) and at the same time delivering higher flexibility to the applications. Moreover, all the drawbacks present in the initial proposal of the RETAVIC architecture (as discussed in sections 2.1 and 2.2 of [Militzer, 2004]) ––like the complexity of many-to-many conversion (of arbitrary input and output formats), the difficulty of accurately determining the behavior of a black-box converter (global average, worst- and best-case behavior), unpredictable resource allocation (due to the lack of influence on the black-box decoding), and no scalability in data access and thus no adaptivity in the decoding process–– are no longer present in the latest version of the framework.

IV.3.1. Insert and Update Delay

The only remaining drawback is the delay when storing new or updating old media data in the MMDBMS. As previously outlined, Phase 1 and Phase 2, even though separated, together serve as the data insert/update facility: the real-time compression to an intermediate format is only required for capturing uncompressed live video (part of Phase 1), the media buffer is required for intermediate data captured in real time regardless of the real-time source (the next part of Phase 1), the archiving of the origin source is only required in some application cases, and the conversion into the internal format and the content analysis are performed in all application cases in subsequent steps. Obviously, the last steps of the insert/update facility, namely the conversion and analysis steps, are computationally complex and run as best-effort processes, thus they

50 Redundancy was not an objective of the RETAVIC architecture; however, it is one of the key factors influencing system complexity and storage efficiency: the higher the redundancy, the more complex the system needed to manage consistency and the less efficient the storage.


may require quite some time to finish51. Actually, it may take a few hours for a longer audio or video object to be completely converted and analyzed, especially when assuming (1) a high load of served requests for media data in general and (2) the conversion/analysis processes running only at low priority (such a setting favors serving many data-requesting clients simultaneously (1) over uploading/updating clients (2)). To summarize, the delay between the actual insertion/update time and the moment when the new data become usable in the system and visible to the client is an unavoidable fact within the proposed RETAVIC framework.

However, it is believed that this sole limitation can be well accepted in practice, considering that support for the few applications demanding real-time capturing is still available. Moreover, considering points (1) and (2) of the previous paragraph, most of the system's resources would be spent on real-time transcoding delivering format-converted media data, i.e. inserting new media data into the MMDBMS is a rather rare case compared to transcoding and transmitting already stored media content to a client for playback. Consequently, the proposed framework delivers higher responsiveness to the media-requesting clients due to the assumed trade-off between delay during insertion and speed-up during delivery.

Additionally, making newly inserted or updated media data available to the clients as soon as possible is not considered to be an essential feature of the MMDBMS. It actually does not matter much to the user whether he or she is able to access newly inserted or updated data just after the real-time import or a couple of hours later; it is more important that the MMDBMS can guarantee reliable behavior and delivery on time. Nevertheless, the newly inserted or updated data should still become accessible within a relatively short period.

IV.3.2. Architecture Independence of the Internal Storage Format

As already mentioned, by employing the media-specific encoder as an atomic and isolated part of the Conversion to internal format module, the architecture gains the possibility of exchanging (or updating) the internal storage format without changing the outside interfaces and

51 Please note that during the input or update of data, real-time insertion is in most cases not required according to the Architecture Requirements of Section IV.1. The few cases requiring real-time capturing are handled by Phase 1. Hence, the unpredictable behavior of the conversion to the internal format and of the analysis process is not considered a drawback.


functionality, i.e. the data formats understood and the data formats provided by an MMDBMS designed according to the RETAVIC architecture will stay the same as before the format update/exchange. Of course, the exchange/update can be done for each medium separately, thus allowing the fastest possible adoption of new results of the autonomous research on each media type52.

IV.3.2.1 Correlation between modules in different phases related to internal format

Figure 7. Correlation between real-time decoding and conversion to internal format.

Before conducting the replacement of the internal format, the correlation between the affected modules has to be explained. Figure 7 depicts this correlation by grey arrows with black edges. The real-time decoding uses two data sources as input: media data and meta-data. The first is simply the binary stream in the known format which the decoder understands. The second is used by the self-controlling process that is part of the decoder, and by the resource allocation for the prediction and allocation of the required resources. Hence, if someone decides to replace the internal storage format of a medium, he must also exchange the real-time decoding module and, accordingly, the related meta-data. On the other hand, the new format and its related meta-data have to be prepared by a new encoder and placed on the storage. This encoder has to be placed in the encoding part of the conversion module of Phase 2. Due to the changes in the meta-data, the DB schema has to be adapted accordingly to keep the new set of data. This few-step exchange,

52 This is exactly the case in research – video processing and audio processing are separate scientific fields and in general do not depend on or relate to each other. Sometimes the achievements from both sides are put together to assemble a new audio-video standard.


however, should influence neither the decoding part of the non-real-time conversion module nor the remaining modules of the real-time delivery phase.

IV.3.2.2 Procedure of internal format replacement

A step-by-step guide for replacement of the internal storage format is proposed as follows:

1. prepare real-time decoding – implement it in the RTE of Phase 4 according to the used real-time decoder interface

2. prepare non-real-time encoding – according to the encoder interface of the Conversion to internal format module

3. design changes in the meta-data schema – those reflecting the data required by real-time decoding

4. encode all stored media bitstreams in the new format (on the temporary storage)

5. prepare meta-data for the new format

6. begin transactional update

a. replace the real-time decoder

b. replace the encoder in the Conversion to internal format module

c. update the schema for meta-data

d. update the meta-data

e. update all required media bitstreams

7. commit

This algorithm is somewhat similar to the distributed two-phase commit protocol [Singhal and Shivaratri, 1994]: it has a preparation phase (commit-request: points 1-5) and an execution phase (commit: points 6-7). It should work without any problems53 for non-critical systems in which access to the data may be blocked exclusively for a longer time. In 24x7 systems, however, such a transactional update, especially point 6.e), may be hard to conduct. Then one of the possibilities

53 The blocking disadvantage of the two-phase commit protocol is not considered here, because the human beings updating the system (the system administrators) act as coordinators and at the same time as cohorts in the sense of this transaction.


would be to prepare the bitstreams on an exact copy of the storage system, but with the updated bitstreams, and to exchange these systems instead of updating all the required media data in place. Another solution would be to skip the transactional update and to update on the basis of a locking mechanism for the separate media bitstreams, allowing two versions of the real-time decoding to operate in parallel. This, however, is a more complex solution and has an impact on the control application of the RT transcoding (because support for more than one real-time decoder per medium was not investigated).
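To illustrate the kind of decoupling discussed in this section, the following minimal C sketch shows one possible way of hiding the internal-format decoder behind a fixed interface, so that exchanging the internal storage format only means registering a different implementation. The names (rt_decoder_iface, register_internal_decoder, etc.) are hypothetical and do not reproduce the actual RETAVIC interfaces.

/* Hypothetical plug-in interface for the real-time decoder of Phase 4.
 * All names are illustrative only. */
#include <stddef.h>
#include <stdio.h>

typedef struct {
    const unsigned char *bitstream;  /* media data in the internal format    */
    size_t               size;
    const void          *meta_data;  /* static and continuous MD of this MO  */
} rt_decode_input;

typedef struct {
    const char *format_name;                                   /* e.g. "LLV1" */
    int  (*open)(const rt_decode_input *in, void **state);
    int  (*decode_frame)(void *state, unsigned char *out, size_t out_size);
    void (*close)(void *state);
} rt_decoder_iface;

/* The delivery phase only ever talks to this interface, so replacing the
 * internal storage format means registering a different rt_decoder_iface. */
static const rt_decoder_iface *active_decoder;

static void register_internal_decoder(const rt_decoder_iface *dec)
{
    active_decoder = dec;
    printf("internal-format decoder switched to: %s\n", dec->format_name);
}

int main(void)
{
    /* a dummy entry standing in for the LLV1 real-time decoder */
    static const rt_decoder_iface llv1 = { "LLV1", NULL, NULL, NULL };
    register_internal_decoder(&llv1);
    return 0;
}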

IV.3.2.3 Evaluation of storage format independence

Summarizing, since the internal format can be replaced at will, the RETAVIC architecture is independent of the internal storage format. As a result, any lossless scalable format may be used. Consequently, the level of scalability may also be chosen at will and depends only on the format selected for the given application scenario, which makes the application flexibility of the RETAVIC architecture even higher. Moreover, because the architecture requirements assume a lossless internal format, the number of replacements is not limited, contrary to the case of lossy formats where the tandem-coding problem occurs.


V. VIDEO PROCESSING MODEL

Chapter V introduces the video-specific processing model based on the logical model proposed in the previous chapter. First, an analysis of a few representatives of existing video codecs and a short discussion of the possible solution are presented. Next, one particular solution is proposed for each phase of the logical model. The video-related static meta-data are described. Then LLV1 is proposed as the internal storage format, and the coding algorithm based on [Militzer, 2004; Militzer et al., 2005] is explained and detailed further. After that, the processing idea is explained: decoding using its own MD subset and then encoding, which also employs its own subset of MD. Only the continuous MD are referred to in the processing part (Section V.5). Finally, the evaluation of the video processing model by means of best-effort prototypes is presented.

The MD set covering the encoding part was first proposed by [Militzer, 2004] and named coarse- and fine-granular MD (as mentioned in the subsection Meta-data of Section IV.2.3, coarse granularity refers to static MD, and analogously fine granularity to continuous MD). The MD set was then extended and refined in [Suchomski and Meyer-Wegener, 2006]. The extension of the continuous part of the MD supporting adaptivity in LLV1 decoding was proposed by [Wittmann, 2005].

V.1. Analysis of the Video Codec Representatives

The analysis of the execution time of representatives of DCT-based video encoding such as FFMPEG54, XVID or DIVX clearly showed that the processing is irregular and unpredictable; the time spent per frame varies to a large extent [Liu, 2003]. For example, the encoding time per frame of the three mentioned codecs for exactly the same video data is depicted

54 Whenever the abbreviation FFMPEG is used, the MPEG-4 algorithm within the FFMPEG codec is meant. FFMPEG is sometimes called a multi-codec implementation because it can also provide differently coded outputs, e.g. MPEG-1, MPEG-2 PS (known as VOB), MJPEG, FLV (Macromedia Flash), RM (Real Media A+V), QT (QuickTime), DV (Digital Video) and others. Moreover, FFMPEG supports many more decoding formats (as input), e.g. all output formats as well as MPEG-2 TS (used in DVB), FLIC, and proprietary formats like GXF, MXF (Media eXchange Format), and NSV (Nullsoft Video). The complete list may be found in Section 6, Supported File Formats and Codecs, of [WWW_FFMPEG, 2003].


in Figure 8⁵⁵. This clearly shows that the execution time per frame differs between codecs and even between their configurations, and that it is also not constant within one codec, i.e. from frame to frame. Moreover, even with the same input data, the behavior of the codecs cannot be generalized into simple behavior functions depending directly on the data: even for the same scene (frames between 170 and 300), one codec speeds up, another slows down, and the third does neither. It is also clearly noticeable that raising the quality parameter56 of XVID to five (Q=5), and thus making the motion estimation and compensation steps more complex, makes the execution time vary more from frame to frame and the overall execution slower (blue line above the green one).

[Figure 8: line chart of the encoding time per frame (Encoding Time [s], range 0–70) over frames 0–400 for DIVX, FFMPEG, XVID Q1 and XVID Q5.]

Figure 8. Encoding time per frame for various codecs.57

55 These curves represent the average encoding time per frame over five repetitions of the given encoding benchmark on exactly the same system configuration. 56 The quality parameter (Q) referred to defines the set of parameters used for controlling the motion estimation and compensation process. Five classes from 1 to 5 have been defined, such that the smaller the parameter, the simpler the algorithms involved. 57 The figure is based on the same data as Figure 4-4 (p. 51) in [Liu, 2003], i.e. measurements of the video clip “Clip no 2” with fast motion and fade-out. The content of the video clip is depicted in Figure 2-3 (p. 23) in [Liu, 2003]. The only difference is that the color conversion from RGB to YV12 was eliminated from the encoding process, i.e. the video data was prepared beforehand in the YV12 color scheme (which is the usual form used in video compression).


[Figure 9: chart of the average (with min/max), deviation and variance of the encoding time per frame for the tested codec configurations; the underlying values (encoding time per frame [s]) are:]

Codec      DIVX   FFMPEG  XVID Q0  XVID Q1  XVID Q2  XVID Q3  XVID Q4  XVID Q5
Deviation   8.4      3.2      3.6      2.4      2.7      4.0      4.9      5.3
Variance   70.9     10.0     13.1      6.0      7.2     15.9     24.3     28.0
Average    34.5     48.2     22.1     19.4     27.4     29.1     31.0     36.1

Figure 9. Average encoding time per frame for various codecs58.

The encoding time was analyzed further and the results are depicted in Figure 9. Obviously, FFMPEG was the slowest on average (black bullets) and XVID with the quality parameter set to one59 (Q=1) the fastest. The minimal and maximal values of the time spent per frame, however, are more interesting. Here the peak-to-average ratio of the execution time plays a key role: it allows us to predict the potential loss of resources (i.e. inefficient allocation/use) if worst-case resource allocation is used for real-time transcoding. The biggest peak-to-average ratio for this specific video data is achieved by XVID Q0, with a factor of 3.32. Other figures worth noticing are the MIN/AVG and MIN/MAX ratios, the variance, and the standard deviation. While MIN/MAX or MIN/AVG may be exploited by the resource allocation algorithm (e.g. to define how important it is to free resources for other processes), the variance and standard deviation60 indicate the required time buffer for real-time processing. So, the more the

58 All possible configurations of the quality parameter Q of the XVID encoder are presented – from 0 to 5, where 5 means the most sophisticated motion estimation and compensation. 59 The quality parameter set to zero (Q=0) lets the XVID encoder take the default values and/or detect them, which requires additional processing; that is why the execution is a bit slower than with Q=1. 60 The variance magnifies the measured differences due to its quadratic nature. Thus it is very useful when constancy is very important and even small changes must be detected. One must be careful, however, when the variance is used with fractional values between 0 and 1 (e.g. coming from the difference of compared values), because then the measured error is decreased. The standard deviation, in contrast, exhibits linear characteristics.


frame-to-frame execution time varies, the bigger the variance and deviation become. For example, DIVX has the biggest variance and deviation, while its average encoding time lands somewhere in the middle of all codec results. Conversely, FFMPEG is the slowest on average, but its variance is much smaller in comparison to DIVX.
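As an illustration of the figures discussed above, the following minimal C sketch computes the average, the peak-to-average and MIN/AVG ratios, the variance and the standard deviation from a series of per-frame encoding times; the function name and the sample values are made up for this example and are not measured data.

/* Illustrative computation of the per-frame statistics discussed above. */
#include <math.h>
#include <stdio.h>

static void frame_time_stats(const double *t, int n)
{
    double sum = 0.0, min = t[0], max = t[0];
    for (int i = 0; i < n; i++) {
        sum += t[i];
        if (t[i] < min) min = t[i];
        if (t[i] > max) max = t[i];
    }
    double avg = sum / n;

    double var = 0.0;
    for (int i = 0; i < n; i++)
        var += (t[i] - avg) * (t[i] - avg);
    var /= n;

    printf("avg=%.2f  peak/avg=%.2f  min/avg=%.2f  var=%.2f  stddev=%.2f\n",
           avg, max / avg, min / avg, var, sqrt(var));
}

int main(void)
{
    /* made-up encoding times per frame [ms] */
    const double t[] = { 22.0, 25.5, 19.8, 73.2, 24.1, 21.7 };
    frame_time_stats(t, (int)(sizeof t / sizeof t[0]));
    return 0;
}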

[Figure 10: stacked-percentage chart (0–100%) of the time spent per frame (frames 0–399) on the XVID encoding stages: Motion Estimation, Motion Compensation, DCT, Quant, IDCT, IQuan, Edges, Interpolation, Transfer, Prediction, Coding, Interlacing.]

Figure 10. Example distribution of time spent on different parts in the XVID encoding process for Clip no 2⁶¹.

Further investigations in [Liu, 2003] delivered detailed results on certain parts of the mentioned codecs. According to the obtained values, the most problematic and time-consuming steps in the processing are motion estimation (ME) and motion compensation (MC). The encoding time distribution, calculated for each frame separately62 along the video stream, is depicted here for just two codecs: XVID in Figure 10 and FFMPEG in Figure 11. The first figure clearly shows that the time required for MC and ME in XVID reaches over 50% of the total time spent on a frame. Interestingly, the MC/ME time rises, but the distribution does not vary much from frame to frame. A different behavior can be noticed for FFMPEG, where the processing

61 The figure is based on the same data as Figure 4-19 (p. 61) in [Liu, 2003], i.e. the same video clip “Clip no 2”. The encoder quality parameter Q was set to 5 (option “–Q 5”). 62 Please note that these are the percentages of time used for ME/MC per frame related to the total time per frame, and not the absolute ME/MC time spent per frame.


time used per frame varies from frame to frame – expressed by the jagged line (peaks and drops alternating). One more conclusion may be drawn when comparing the two graphs: there is a jump in the processing time starting from the 29th frame of the given sequence, but the reaction of each encoder is different. XVID adapts the control settings of its internal processing slowly and distributes the adaptation over many frames, while FFMPEG reacts at once. This is also the reason for the jagged (FFMPEG) versus smooth (XVID) curves. It has also been verified that after encoding there are one intra-frame and 420 inter-frames in XVID, vs. two intra-frames and 419 inter-frames in FFMPEG. The extra intra-frame in FFMPEG was at the 251st position, and thus there is also a peak where the MC/ME time was negligible with respect to the remaining parts.

[Figure 11: stacked-percentage chart (0–100%) of the time spent per frame (frames 0–414) on the FFMPEG MPEG-4 encoding stages: Motion Estimation, MC, Edge, DCT, Quantization, IQ+IDCT, Transfer, Interlacing, Simple Frame Det., IP+CS+PI, Picture Header, Huffman, Preprocessing, Postprocessing, Rest.]

Figure 11. Example distribution of time spent on different parts in the FFMPEG MPEG-4 encoding process for Clip no 2⁶³.

Other measurements from [Liu, 2003] (e.g. Figure 4-19 or 4-21) showed that the execution times of ME/MC vary much more than those of all other parts of the encoding. This is due to the

63 The figure is based on the same data as Figure 4-21 (p. 62) in [Liu, 2003]. The video is the same as in the example used for XVID (Figure 10), with the same limitation. The noticeable peak is due to the encoder’s decision on a different frame type (I) for frame 250.


dependency of the ME and MC on the video content, which is very hard to describe as a mathematical function. Hence, ME/MC is the most unpredictable part of the encoding, where the consumed time varies even between subsequent frames. This effect of ME/MC fluctuations can partly be seen in the changing shape of the ME/MC curves in both figures. For that purpose, however, it is better to consult the curves depicting the time spent in absolute measured values (mentioned in the first sentence of this paragraph).

Summarizing, the ME/MC steps are the problematic part of video encoding. The problem lies both in the complexity and the amount of time required for the processing and in the unpredictable behavior and the variations in the time spent per frame.

V.2. Assumptions for the Processing Model

The first idea was to skip the ME/MC step completely in order to gain control over the encoding process, which has to work in real time. The encoding would thus be roughly twice as fast and much more stable with respect to the time spent on each frame. However, a video processing model based on the straightforward removal of the complex ME/MC step would cause a noticeable drop in coding efficiency and thus a worse quality of the video information – obviously, such additional complexity in the video encoder yields a higher compression ratio and reduced data rates for the decoders [Minoli and Keinath, 1993]. Therefore, the idea of simply excluding the ME/MC step from the processing chain was dropped.

Still, removing the ME/MC step from the real-time encoding seemed to be the only reasonable way of gaining control over the video processing. Thus another, more reasonable concept was to move the ME/MC step out of the real-time processing – not dropping it, but putting it into the non-real-time part – and to use only the data produced by the moved step as additional input to the real-time encoding. Based on that assumption, it was necessary to measure how much data, namely motion vectors, is produced by the ME/MC step. As stated in the published paper [Suchomski et al., 2005], the size overhead of this additional information amounts to about 2% of the losslessly encoded video data. Hence it was clearly acceptable to obtain a twice as fast and more stable real-time encoding – yielding a lower worst-case-to-average ratio of the time consumed per frame and consequently allowing for more accurate resource allocation – at such a small storage cost.


Finally, the non-real-time preparation phase includes the ME/MC steps in the content analysis of the video data, i.e. the ME/MC steps cover all the related activities of the compression algorithm such as scene-change detection, frame-type recognition, global motion estimation, partitioning into macro blocks, detection of motion characteristics and complexity, and the decision about the type of each macro block. Moreover, if there are ME-related parts applicable only to a given compression algorithm and not named in the previous sentence, they should also be carried out in the non-real-time content analysis step. All these ME activities serve to produce the meta-data used in the real-time phase, not only for the compression process itself, but also for scheduling and resource allocation.

The remaining parts responsible for motion compensation, like motion vector encoding and the calculation of texture residuals, and the parts used for compression in DCT-based algorithms [ITU-T Rec. T.81, 1992], such as the DCT, quantization, scanning of the quantized coefficients (e.g. zig-zag scanning), run-length encoding (RLE), Huffman encoding and bit-stream building, are done in the real-time transcoding phase.

This splitting of the video encoding algorithm into two parts, in the non-real-time and the real-time phase, leads to the RETAVIC architecture, which may be regarded as already optimized with respect to processing costs at design time, because the model itself provides for a reduction of resource consumption. The overall optimization is gained first because the analysis step is executed only once, contrary to executing the complete algorithm (in which the analysis is repeated for the same video during every un-optimized encoding on demand). Secondly, encoding-related optimizations are obtained through a simplified encoder construction without the analysis part. This simplification of the encoding algorithm leads to faster execution in real time and thus makes it possible to serve a larger number of potential clients. Additionally, the compression algorithm should behave more smoothly, which decreases the buffer sizes.

Earlier attempts at a processing model for video encoding did not consider the relationship between the behavior and functionality of the encoder and the video data [Rietzschel, 2002]. VERNER [Rietzschel, 2003] – a real-time player and encoder implemented in DROPS – used a real-time implementation of the XVID codec in which the original source code was embedded directly in the real-time environment and no adaptation of the processing was performed, besides treating the encoding operation as the mandatory part and


the post-processing functionality as the optional part [Rietzschel, 2002]. As a result, the proposed solution failed to integrate real-time capability and predictability usable for QoS control, due to the still too high variations of the required processing time per frame within the mandatory part64. Therefore the novel MD-based approach is investigated in the next sections.

V.3. Video-Related Static MD

As mentioned in Section IV.2.3.2, the static MD are used for resource allocation. The initial MD set (introduced in IV.2.2.3) related only to video is discussed in this section. It is a superset of two sets reflecting the MD required by the two algorithms used in the prototypical implementation of the RETAVIC architecture, namely the MD useful for the LLV1 and for the XVID algorithm, respectively.

It is assumed that a media object (MO)65 defined by Equation (1) belongs to the set of media objects O. The MO is uniquely identifiable by its media object identifier (MOID), as expressed by Equation (2), where i and j refer to different MOIDs.

$$\forall i: mo_i = (type_i, content_i, format_i) \;\wedge\; type_i \in T \wedge content_i \in C \wedge format_i \in F$$
$$\forall i: mo_i \in O, \quad 1 \le i \le X \quad (1)$$

where T denotes the set of possible media types66, C the set of content, F the set of formats, and X the number of all media objects in O; none of the sets may be empty.

$$\forall i \,\forall j: i \neq j \Rightarrow mo_i \neq mo_j, \quad 1 \le i \le X \wedge 1 \le j \le X \quad (2)$$

In other words, an MO refers to data of one media type, either video or audio, and represents exactly one stream of the given media type.

64 The mandatory part is assumed to be exactly and deterministically predictable, which is impossible without considering the data in the case of video encoding; otherwise it must be modeled with the worst-case model. More details are explained later. 65 The definitions of MMO and MO used here are analogous to those described in [Suchomski et al., 2004]. An MO is defined to have a type, content and format, and an MMO consists of more than one MO, as mentioned in the fundamentals of the related work. 66 The media type set is limited within this work to {V, A}, i.e. to the video and audio types.


The multimedia object (MMO) is an abstract representation of the multimedia data and consists of more than one MO. An MMO belongs to the set of multimedia objects M and is defined by Equation (3):

$$\forall i: mmo_i = \{\, mo_j \mid mo_j \in O \,\}, \quad \forall i: mmo_i \in M, \quad 1 \le i \le Y \quad (3)$$

where Y is the number of all multimedia objects in M. Analogously to the MO, an MMO is uniquely identified by its MMOID, as formally given by Equation (4):

$$\forall i \,\forall j: i \neq j \Rightarrow mmo_i \neq mmo_j \quad (4)$$

Therefore an MO may be related to an MMO by the MMOID67.

Initial static meta-data are defined for the multimedia object (MMO) as depicted in Figure 12. The static MD describing a specific MO are related to this MO by its identifier (MOID). The MD, however, differ between media types, so the static MD for video are related to MOs having the video type, and the set (StaticMD_Video) is defined as:

$$\forall i: MD_V(mo_i) \subset StaticMD\_Video \Leftrightarrow type_i = V \quad (5)$$

where type_i denotes the type of the media object mo_i, V is the video type, and MD_V is a complex function extracting the set of meta-data related to the video media object mo_i. The index V of MD_V denotes video-specific meta-data, in contrast to the index A, which is used for audio-specific MD later on.

Moreover, the video stream identifier (VideoID) is a one-to-one mapping to the given MOID:

$$\forall i \,\exists j \,\neg\exists k: VideoID_i = MOID_j \wedge VideoID_i = MOID_k, \quad k \neq j \quad (6)$$

67 An MO does not have to be related to an MMO, i.e. the reference attribute in the MO pointing to the MMOID is nullable.


The static MD of a video stream include the sums of each frame type, i.e. the number of frames is calculated separately for each frame type within the video. The frame type (f.type) is defined as I, P, or B:

$$\forall i: f_i.type \in \{I, P, B\}, \quad 1 \le i \le N \quad (7)$$

where I denotes the type of an intra-coded frame (I-frame), P the type of a predicted inter-coded frame (P-frame), B the type of a bidirectionally predicted inter-coded frame (B-frame), and N the number of all frames in the video media object.

The sum for I-frames is defined as:

$$IFrameSum_{mo_i} = \left| \{\, f_j \mid f_j \in mo_i \wedge 1 \le j \le N \wedge f_j.type = I \,\} \right| \quad (8)$$

where f_j is the frame at the j-th position and f_j.type denotes the type of the j-th frame. Analogously for P- and B-frames (N, f_j and f_j.type as in Equation (8)):

$$PFrameSum_{mo_i} = \left| \{\, f_j \mid f_j \in mo_i \wedge 1 \le j \le N \wedge f_j.type = P \,\} \right| \quad (9)$$

$$BFrameSum_{mo_i} = \left| \{\, f_j \mid f_j \in mo_i \wedge 1 \le j \le N \wedge f_j.type = B \,\} \right| \quad (10)$$

Next, static MD are defined per frame of the video – namely the frame number, the frame type, and the calculated sums of each macro block (MB) type. The frame number represents the position of the frame in the frame sequence of the given video, and the frame type specifies the type of this frame, f_j.type. The sum of macro blocks of a given MB type is calculated analogously to the frames in Equations (8), (9) and (10), but N refers to the total number of MBs per frame and the conditions are replaced respectively by:

$$mbs_j = \begin{cases} 1 & \Leftrightarrow mb_j.type = t \\ 0 & \Leftrightarrow mb_j.type \neq t \end{cases} \quad \text{for the respective MB type } t \in \{I, P, B\} \quad (11)$$


The information about the sums of the different block types is stored with respect to the layer type (LLV1 defines four layers – one base and three enhancement layers [Militzer et al., 2005]) in StaticMD_Layer.

Finally, the sums of the different motion vector types are kept per frame in StaticMD_MotionVector. Nine types of vectors that can occur in a video are recognized so far [Militzer, 2004]. For each video frame, a motion vector class histogram is created by counting the number of motion vectors corresponding to each class, and the result is stored in relation to VectorID, FrameNo and VideoID.

A motion vector (x, y) activates one of nine different interpolations of pixel samples depending on the values of the vector’s components [Militzer, 2004]. Therefore nine types of motion vectors are distinguished [Militzer, 2004]:
mv1 – both x and y are in half-pixel precision: luminance and chrominance samples need to be interpolated horizontally and vertically to obtain prediction values;
mv2 – only the x component is given in half-pel precision: the luminance and chrominance match in the reference frame needs to be horizontally interpolated, and no vertical interpolation is applied to the chrominance samples as long as the y component is a multiple of four;
mv3 – as mv2, but y is not a multiple of four: the chrominance samples are additionally filtered vertically;
mv4 – only the y component is in half-pixel precision: the luminance and chrominance pixels in the reference frame need to be vertically interpolated, and no horizontal filtering is employed as long as the x component is a multiple of four;
mv5 – as mv4, but x is not a multiple of four: the chrominance samples are additionally filtered horizontally;
mv6 ÷ mv9 – both x and y have full-pel values; the interpolation complexity then depends only on whether chrominance samples need to be interpolated: if neither component is a multiple of four, the chrominance samples are filtered horizontally and vertically (mv6); if only the y component is a multiple of four, they are filtered horizontally (mv7); if only the x component is a multiple of four, they are filtered vertically (mv8); and if both x and y are multiples of four, no interpolation is required (mv9).
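The classification above can be expressed compactly in code. The following C sketch assumes that the vector components are given in half-pel units, so that an odd value means half-pixel precision – an assumption made only for this illustration; the function names are likewise illustrative and the actual LLV1/XVID representation may differ.

/* Classification of a motion vector into the nine interpolation classes. */
#include <stddef.h>
#include <stdio.h>

static int is_half_pel(int c) { return c % 2 != 0; }   /* half-pel units assumed */
static int is_mult4(int c)    { return c % 4 == 0; }

static int mv_class(int x, int y)
{
    if (is_half_pel(x) && is_half_pel(y)) return 1;              /* mv1       */
    if (is_half_pel(x))                   return is_mult4(y) ? 2 : 3; /* mv2/mv3 */
    if (is_half_pel(y))                   return is_mult4(x) ? 4 : 5; /* mv4/mv5 */
    /* both components have full-pel values */
    if (!is_mult4(x) && !is_mult4(y))     return 6;              /* mv6       */
    if (is_mult4(y) && !is_mult4(x))      return 7;              /* mv7       */
    if (is_mult4(x) && !is_mult4(y))      return 8;              /* mv8       */
    return 9;                                                    /* mv9       */
}

int main(void)
{
    int hist[10] = { 0 };
    /* made-up vectors; in the analysis phase one vector per block is counted */
    const int mv[][2] = { {1, 3}, {2, 4}, {6, 2}, {4, 8}, {0, 0} };
    for (size_t i = 0; i < sizeof mv / sizeof mv[0]; i++)
        hist[mv_class(mv[i][0], mv[i][1])]++;
    for (int c = 1; c <= 9; c++)
        printf("mv%d: %d\n", c, hist[c]);
    return 0;
}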

In order to simplify the understanding of the initial MD set, the current definition is mapped to a relational schema and depicted as a relational schema diagram (with primary keys, foreign keys and integrity constraints for the relationships) in Figure 12.


Figure 12. Initial static MD set focusing on the video data68.

The video static MD mapped according to Figure 12 allow exploiting the power of SQL by calculating the respective sum for a given type with a simple SQL query (Listing 1):

SELECT   VideoID, count(FrameType)
FROM     StaticMD_Frame
WHERE    FrameType = 0   -- for P-frames: FrameType = 1; for B-frames: FrameType = 2
GROUP BY VideoID
ORDER BY VideoID;

Listing 1. Calculation of the sum of I-frames from the existing MD for all described videos (the query conditions for P- and B-frames are given as comments).

68 The figure is based on Figure 2 of [Suchomski and Meyer-Wegener, 2006].


The sums for all frame types may also be calculated in one SQL query (Listing 2):

SELECT   VideoID, FrameType, count(FrameType)
FROM     StaticMD_Frame
GROUP BY VideoID, FrameType
ORDER BY VideoID, FrameType;

Listing 2. Calculation of all sums according to frame type for all described videos.

If the sums of all motion vector types over the whole video are required for all videos, the following SQL query extracts this information (Listing 3):

SELECT   VideoID, VectorID, sum(MVsSum)
FROM     StaticMD_MotionVector
GROUP BY VideoID, VectorID
ORDER BY VideoID, VectorID;

Listing 3. Calculation of all MV types per video for all described videos.

Of course, the entity set StaticMD_Frame must be complete in the sense that all frames existing in the video are described and included in this set. Under this assumption, the sums included in StaticMD_Video are treated as derived attributes computed by the above SQL statements. However, for optimization reasons and because the MD set is rarely updated, these values are materialized and included as regular attributes in StaticMD_Video. On the other hand, if StaticMD_Frame were not complete, i.e. did not include all frames, the sum attributes could not be treated as derived.

V.4. LLV1 as Video Internal Format

LLV1 stands for Lossless Layered Video One and was first proposed by [Militzer, 2004]. The detailed description can be found in [Militzer, 2004]; it covers: 1) the high-level logical design, 2) the advantages of using LLV1, 3) scalable video decoding useful for real-time transcoding, 4) the implementation, and 5) a proposed method of decoding time estimation. A compact description including all the mentioned issues together with the related work is published in [Militzer et al., 2005]. The work of [Wittmann, 2005] focuses only on the decoding part of LLV1 and describes it in more detail, where each step is represented separately by an algorithm. The analysis, implementation and evaluation with respect to real time are also described in [Wittmann,


2005] (they are also extended and described in Section IX.3 RT-MD-LLV1 of this work). Given the available literature, this section will only 1) summarize the most important facts about LLV1 by explaining the algorithm in a simplified way, and 2) refine the mathematical definitions where necessary, thus making the previously ambiguous explanation more precise. Some further extensions of the LLV1 algorithm are given in the Further Work section of the last chapter of this work.

V.4.1. LLV1 Algorithm – Encoding and Decoding

The LLV1 format fulfills the requirement of losslessness while at the same time giving the media server the flexibility to guarantee the best possible picture quality in the face of user requirements and changing QoS characteristics. These two aspects are achieved by layered storage where each layer can be read selectively. It is possible to limit the access to just a portion of the stored media data (e.g. just the base layer, which amounts to about 10% of the whole video [Militzer et al., 2005]). Thus, if lower resolutions of highly compressed videos are requested, only parts of the data are actually accessed. The other layers, i.e. the spatial and temporal enhancement layers, can be used to increase the quality of the output, both in terms of image distortion and frame rate. Though being a layered format designed for real-time decoding, LLV1 is still competitive in compression performance compared to other well-known lossless formats69 [Militzer et al., 2005]. The reason for this competitiveness may derive from its origins – LLV1 is based on the XVID codec and was adapted by exchanging the lossy transform for a lossless one [Tran, 2000] and by refining the variable-length encoding (incl. Huffman tables and other escape codes).

The simplified encoding and decoding algorithms are depicted in Figure 13. The input video sequence (VS IN) is read frame by frame and frame type detection (FTD) is conducted to find out whether the current frame should be an intra- or an inter-coded frame. Usually, the intra-coded type is used when a new scene appears or the difference between subsequent frames crosses a certain threshold. If the frame is assigned to be inter-coded, the next steps of motion detection (MD), producing the motion vectors (MV), and motion error prediction (MEP),

69 At the time development began, no lossless and layered (or scalable) video codecs were available. Thus the conducted benchmarks relate LLV1 only to lossless codecs without scalability characteristics. Nowadays there is ongoing work on the standardized MPEG-4 Scalable Video Coding (SVC) [MPEG-4 Part X, 2007], which is a much more sophisticated and promising solution, but the official standard was only finished in July 2007 and thus could not be used within this work.


calculating the motion compensation errors (MCE), are applied. Otherwise, the pixel values (PV) are delivered to the transform step (binDCT). The binDCT is an integer transform using a lifting scheme; it is similar in its characteristics to the discrete cosine transform (DCT), but it is invertible (i.e. lossless) and produces larger numbers in the output [Liang and Tran, 2001]. Here, depending on the frame type, either the PV or the MEP values are transformed from the time domain (pixel-value domain) to the frequency domain (coefficient domain).

Figure 13. Simplified LLV1 algorithm: a) encoding and b) decoding.


As the next step, a quantization similar to the well-known H.263 quantization from the MPEG-4 reference software [MPEG-4 Part V, 2001] is applied to the coefficient values. A different quantization parameter (QP) is used for each layer: QP=3 for the quantization base layer (QBL), QP=2 for the first quantization enhancement layer (QEL1), QP=1 for the second QEL and QP=0 for the third QEL – denoted correspondingly by Q3, Q2, Q1 and Q0. Subsequently, variable-length encoding (VLE) is applied to the base layer coefficients and the accompanying motion vectors, producing the BL bitstream. In parallel, the quantization bit-plane calculation (QPC) is applied sequentially to the coefficients of QEL1, then QEL2 and finally QEL3. The QPC computes the prediction values and the sign values (q-plane) using the formulas defined later. Each q-plane is then coded separately by the bit-plane variable-length encoding (BP VLE), which produces the encoded bitstreams of all QELs.

The decoding is the inverse of the encoding. The encoded input bitstreams are specified for the decoding – there must of course be the BL and optionally ELs, so the decoder accepts 1, 2, 3 or 4 streams. The BL bitstream is decoded by variable-length decoding (VLD), and the quantization coefficients (QC) or, respectively, the quantization motion compensation error (QMCE) are inverse quantized (IQ). Next, the inverse transform step (IbinDCT) is executed. Motion compensation (MC) using the motion vectors (MV) is additionally performed for predicted frames. For the enhancement layers, the bit-plane variable-length decoding (BP VLD) is the first step. Secondly, the quantization plane reconstruction (QPR), based on the QC of the layer below, is conducted. Finally, IQ and IbinDCT are performed only for the highest-quality quantization plane. The complete decoding algorithm with a detailed explanation is given in Appendix B.

V.4.2. Mathematical Refinement of Formulas

A few details needed to cover all the problematic aspects have been missing in the previous papers, e.g. how exactly to calculate the bit-plane values and when to store the sign information in the quantization enhancement layer (QEL). The formula given in the previous papers (no. 3.2 on p. 22 in [Militzer, 2004] and no. 1 on p. 440 in [Militzer et al., 2005]) for calculating the coefficient of an enhancement layer in the decoding process looks, after refinement, like Equation (12):


$$C_i = \begin{cases} 2 \cdot C_{i-1} + P_i & \Leftrightarrow C_{i-1} > 0 \\ 2 \cdot C_{i-1} - P_i & \Leftrightarrow C_{i-1} < 0 \\ P_i \cdot S_i & \Leftrightarrow C_{i-1} = 0 \end{cases} \;:\; P_i \in \{0,1\} \wedge S_i \in \{-1,1\} \wedge i \in \{1,2,3\} \quad (12)$$

where i denotes the current enhancement layer, C_i is the calculated coefficient of the current layer i, C_{i-1} is the coefficient of the layer below, P_i is the predicted value and S_i is the sign information.

Moreover, the formulas for calculating the prediction value and the sign information of a given bit plane during the encoding process have been neither officially stated nor published before. They are now given in Equation (13) and Equation (14), respectively:

$$P_i = C_i - 2 \cdot C_{i-1} \;\wedge\; i \in \{1,2,3\} \quad (13)$$

$$S_i = \begin{cases} -1 & \Leftrightarrow C_i < 0 \wedge C_{i-1} = 0 \\ \;\;\,1 & \Leftrightarrow C_i \ge 1 \vee C_{i-1} \neq 0 \end{cases} \;:\; i \in \{1,2,3\} \quad (14)$$

Please note that the predicted value and the sign information are stored only for the enhancement layers, so no error occurs when the current layer is the base layer. Moreover, the sign information is stored only when the negative coefficient of the next lower layer has been zeroed and only if the current coefficient is smaller than 0 (i.e. negative).
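A minimal C sketch of the decode-side reconstruction according to Equation (12) may help in reading the formula; the function and variable names are chosen for this example only and do not claim to reproduce the LLV1 decoder code.

/* Reconstruction of an enhancement-layer coefficient according to Equation (12):
 * C_i is rebuilt from the coefficient of the layer below (C_{i-1}), the stored
 * prediction bit P_i and, only when C_{i-1} == 0, the stored sign S_i. */
#include <stdio.h>

static int reconstruct_coeff(int c_prev, int p, int s)
{
    if (c_prev > 0) return 2 * c_prev + p;
    if (c_prev < 0) return 2 * c_prev - p;
    return p * s;                 /* C_{i-1} == 0: sign taken from S_i */
}

int main(void)
{
    /* example: base-layer coefficient 3, prediction bits 1, 0, 1 for QEL1..QEL3 */
    int c = 3;
    const int p[3] = { 1, 0, 1 };
    for (int i = 0; i < 3; i++) {
        c = reconstruct_coeff(c, p[i], 1);   /* the sign only matters if c == 0 */
        printf("QEL%d coefficient: %d\n", i + 1, c);
    }
    return 0;
}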

The calculation of the coefficient according to the algorithm described above is illustrated by an example in Figure 14, where QBL stands for the quantization base layer and QELx for the respective quantization enhancement layer (x ∈ {1,2,3}). The QBL stores signed coefficients, while the QELs store the predicted values as unsigned bit planes. Only the blue-colored data are actually stored in the LLV1 bitstream.


Figure 14. Quantization layering in the frequency domain in the LLV1 format.

V.4.3. Compression Efficiency

Compression efficiency is not the biggest advantage of LLV1, but it is still comparable to that of other lossless codecs. The results of a comparison with a set of popular lossless codecs are depicted for a few well-known sequences in Figure 15. YV12 has been used as the internal color scheme for all tested compression algorithms to guarantee a fair comparison.

The results – the output file sizes – are normalized to the losslessly encoded LLV1 output file sizes (100%), i.e. all layers are included. As can be seen, LLV1 performs no worse than most of the other codecs. In fact, LLV1 provides better compression than all other tested formats except LOCO-I, which outperforms LLV1 by approximately 9% on average. Of course, the other codecs have been designed especially for lossless compression and do not provide scalability features.


[Figure 15: compressed file sizes, normalized to LLV1 (= 100%, scale 0–140%), of HuffYUV, Lossless JPEG, Alparysoft, LOCO-I and LLV1 for the sequences Knightshields (720p), VQEG src20 (NTSC), Paris (CIF) and Silent (QCIF), plus the overall result.]

Figure 15. Compressed file-size comparison normalized to LLV1 ([Militzer et al., 2005]).

V.5. Video Processing Supporting Real-time Execution

This section focuses on MD-supported processing. First, the continuous MD set, which includes just one attribute applicable during LLV1 video decoding, is explained.

V.5.1. MD-based Decoding

The LLV1 decoding was designed from the outset to be scalable in its processing. Moreover, the processing of each layer was meant to be stable, i.e. the time spent per frame along the video was expected to be constant for a given type of frame or macro block. Thus it should have been sufficient to include just the static MD in the decoding process for prediction. This assumption was tested in practice and then refined: in order to support MD-based decoding, the existing LLV1 had to be extended by one important element placed in the continuous MD set, allowing for even more adaptive decoding. The granularity of adaptation was enhanced by allowing the decoding to stop in the middle of a frame, which corresponds to processing at the fine-grained macro block level.


Based on [Wittmann, 2005], the functionality of the existing best-effort LLV1 decoder has been extended70 by the possibility of storing, for all enhancement layers, the size occupied by each frame in the compressed stream. The best-effort LLV1 decoder therefore accepts an additional argument that defines whether the compressed frame sizes should be written to an additional file as continuous MD71. The original best-effort decoder had no need for this meta-data, since the whole frame was read for each enhancement layer that had to be decoded and it did not matter how long this took.

However, if the execution time is limited (e.g. the deadline is approaching), it can happen that not the complete frame is decoded from the compressed stream. In such a case, the real-time decoder should leave out some macro blocks at the end of the allocated time (period or timeslice). The problem is that it then cannot start decoding the next frame, due to the Huffman coding. Therefore the end of the frame (in other words, the position of the beginning of the next frame) has to be passed as meta-data.
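The following C sketch illustrates, under hypothetical names, how the stored compressed frame size can be used: the real-time decoder stops decoding macro blocks when the deadline is missed and then simply seeks to the next frame boundary instead of Huffman-decoding to the end of the frame. The stubs stand in for the real scheduling check and bit-plane decoding.

/* Sketch of resynchronisation via the per-frame size stored as continuous MD. */
#include <stdio.h>

typedef struct {
    FILE *stream;        /* compressed enhancement-layer bitstream       */
    long  frame_start;   /* file offset where the current frame begins   */
} el_decoder;

/* dummy stand-ins for the real-time scheduling check and the actual decoding */
static int deadline_missed(void)                 { return 0; }
static int decode_next_macroblock(el_decoder *d) { (void)d; return 0; }

static void decode_el_frame(el_decoder *d, long compressed_size, int mb_count)
{
    d->frame_start = ftell(d->stream);
    for (int mb = 0; mb < mb_count; mb++) {
        if (deadline_missed())
            break;                     /* leave the remaining MBs undecoded */
        decode_next_macroblock(d);
    }
    /* jump to the next frame boundary using the stored size, whether or not
     * the whole frame was decoded */
    fseek(d->stream, d->frame_start + compressed_size, SEEK_SET);
}

int main(void)
{
    el_decoder d = { tmpfile(), 0 };
    if (!d.stream) return 1;
    decode_el_frame(&d, 0, 396);       /* e.g. a CIF frame has 396 MBs */
    fclose(d.stream);
    return 0;
}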

V.5.2. MD-based Encoding

The idea of MD-supported encoding is explained throughout this work using the MPEG-4 video coding standard [MPEG-4 Part II, 2004]. This, however, does not limit the idea. Different encoders may reuse or partly adopt the existing MD. It may also be necessary to extend the MD set by parameters other than those proposed here.

V.5.2.1 MPEG-4 standard as representative

The MPEG-4 video coding standard was chosen as representative because of a few properties that are common to most video encoding techniques. First, it is a transform-based algorithm. There are several domain transforms that could be applied to video processing, such as the Fourier Transform (FT) [Davies, 1984] in its discrete form (DFT), usually computed by the fast Fourier transform (FFT), the Discrete Sine Transform (DST), the Discrete Cosine Transform (DCT) [Ahmed et al., 1974],

70 The additional argument that defines whether this information should be written to an extra MD file is available through the –m switch of the decoder. 71 This functionality should be included on the encoder side (and used during the analysis phase). However, it required less programming effort to include it on the decoder side, decoding the existing streams in order to extract the sizes of the frames in the ELs of the considered media objects.


the Laplace Transform (LT) [Davies, 1984] or the Discrete Wavelet Transform (DWT) [Bovik, 2005]. All these transforms, if applied to video compression, have to consider two dimensions (2D), the height and width of each frame, but it is also possible to express the 2D transform through 1D transforms. Of the available transforms, only the DCT has been widely accepted in video processing, because: 1) the DCT is equivalent to a DFT of roughly twice the length (which would be more complex to calculate), 2) the FFT could be a competitor but produces larger coefficients from the same input data (so it is harder to handle by entropy coding, and thus less efficient compression is obtained), 3) the DCT operates on real data with even symmetry (in contrast to the odd DST), 4) there are extremely fast 1-D DCT implementations (in comparison to other transforms) operating on 8x8 matrices of values (representing luminance or chrominance)72, and 5) the 2D-DWT does not allow applying ME/MC techniques for exploiting temporal redundancy73. There are floating-point as well as integer implementations of the fast and well-accepted DCT; to be more precise, the 1-D DCT Type II given by Equation (15) [Rao and Yip, 1990] is the most popular due to the wide range of proposed low-complexity approximations and fast integer-based implementations for many computing platforms [Feig and Winograd, 1992]. The DCT Type II is applied in the MPEG standards as well as in the ITU-T standards and in many other non-standardized codecs. Hence, the resulting coefficients are expected to be similar in the codecs using this type of DCT.

$$Z_k = c_k \sqrt{\frac{2}{N}} \sum_{n=0}^{N-1} x_n \cos\frac{k(2n+1)\pi}{2N} \quad \text{where } c_k = \begin{cases} \frac{1}{\sqrt{2}} & \Leftrightarrow k = 0 \\ 1 & \Leftrightarrow k \neq 0 \end{cases} \quad (15)$$
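For illustration, the following C sketch evaluates Equation (15) directly in O(N²); production codecs use fast low-complexity integer approximations instead, so this is not how XVID or LLV1 actually compute the transform. The sample values are made up.

/* Straightforward evaluation of the 1-D DCT Type II from Equation (15). */
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 8

static void dct_ii(const double *x, double *z)
{
    for (int k = 0; k < N; k++) {
        double ck = (k == 0) ? 1.0 / sqrt(2.0) : 1.0;
        double sum = 0.0;
        for (int n = 0; n < N; n++)
            sum += x[n] * cos(k * (2 * n + 1) * M_PI / (2.0 * N));
        z[k] = ck * sqrt(2.0 / N) * sum;
    }
}

int main(void)
{
    /* one row of luminance samples (made-up values) */
    const double x[N] = { 52, 55, 61, 66, 70, 61, 64, 73 };
    double z[N];
    dct_ii(x, z);
    for (int k = 0; k < N; k++)
        printf("Z[%d] = %7.2f\n", k, z[k]);
    return 0;
}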

Secondly, as a generalization of video coding algorithms, almost all transform-based algorithms include two types of frames: intra-coded and inter-coded frames can be distinguished. Intra-coded processing does not include any motion algorithms (estimation/prediction/compensation), while the inter-coding technique uses motion algorithms intensively and encodes not the coefficients of the pixel values (as intra-coding does) but the error coming from the difference between two compared sets of pixels (usually 8x8 matrices).

72 In the newest video encoding algorithms, other derivatives of the DCT are employed. For example, in MPEG-4 AVC it is possible to use an integer-based DCT that operates on 4x4 matrices, which is faster in the calculation of 256 values (4 matrices vs. 1 matrix of the old DCT). 73 There are also fast 3D-DWTs available, already mentioned in the related work, but they are omitted due to their inapplicability in real time.


Figure 16. DCT-based video coding of: a) intra-frames, and b) inter-frames.

Thirdly, it can be noticed that there are common parts in the DCT-based algorithms [ITU-T Rec. T.81, 1992], which may still differ slightly in their implementation. These are namely: 1) quantization (the best-known commonly used types are the MPEG-based and the H.263-based one, the latter of which may also be used in MPEG-compliant codecs), 2) coefficient scanning according to a few patterns (horizontal, vertical, and the most popular zig-zag), 3) run-length encoding, 4) Huffman encoding (the Huffman tables often differ from codec to codec), and, in the case of inter-frames, 5) motion estimation and prediction generating the motion vectors and the motion compensation error (also called prediction error), which uses 6) the frame buffer with the previous and/or following frames relative to the currently processed frame. A more detailed comparison of MPEG-4 vs. H.263 can be found in Appendix C.
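Two of these common parts can be sketched briefly in C: the zig-zag scan order, generated here algorithmically instead of using the usual hard-coded table, and a simplified run-length encoding of the scanned coefficients. Real codecs additionally map the (run, level) pairs onto Huffman/VLC codes, which is omitted here; the sample block values are made up.

/* Zig-zag scanning of an 8x8 block and a simplified RLE of the coefficients. */
#include <stdio.h>

static void zigzag_order(int order[64])
{
    int idx = 0;
    for (int d = 0; d < 15; d++) {           /* d = row + column (anti-diagonal) */
        int lo = (d < 8) ? 0 : d - 7;        /* valid row range on diagonal d    */
        int hi = (d < 8) ? d : 7;
        if (d % 2)                           /* odd diagonal: top-right -> bottom-left */
            for (int r = lo; r <= hi; r++) order[idx++] = r * 8 + (d - r);
        else                                 /* even diagonal: bottom-left -> top-right */
            for (int r = hi; r >= lo; r--) order[idx++] = r * 8 + (d - r);
    }
}

static void rle_block(const int block[64])
{
    int order[64];
    zigzag_order(order);
    int run = 0;
    for (int i = 0; i < 64; i++) {
        int level = block[order[i]];
        if (level == 0) { run++; continue; }
        printf("(run=%d, level=%d) ", run, level);
        run = 0;
    }
    printf("EOB\n");                         /* trailing zeros collapse into EOB */
}

int main(void)
{
    int block[64] = { 0 };                   /* made-up quantized coefficients */
    block[0] = 12; block[1] = -3; block[8] = 5; block[17] = 1;
    rle_block(block);
    return 0;
}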

V.5.2.2 Continuous MD set for video encoding.

Based on the mentioned commonalities of the DCT-based compression algorithms, a few kinds of information are stored as continuous MD for video, namely: the frame coding type, the bipred value, the MB width and height, the MB mode and priority, the three most significant coefficient values, and the motion vector(s). Especially the fine-granularity information on coefficient values and motion vectors makes the size relatively large, and thus the continuous MD should be compressed along with the media bit stream (as tests showed, yielding about 2% of the LLV1 stream size [Suchomski et al., 2005]). Of course, the continuous MD are not interleaved with the video data but stored as a separate bitstream (e.g. to allow media data passed in direct delivery to be decoded by a standard decoder). What is important is the time dependency of the continuous MD due to their close


relationship to the given frame; thus they should be delivered by a real-time-capable system along with the video stream.

Parts of the proposed continuous MD set have already been proposed in the literature, as mentioned at the beginning of Section V within this chapter. However, these parts have not yet been collected in one place to make the continuous MD set comprehensive. Therefore, the elements of the continuous MD set are described in the following.

Frame coding type. It holds information on the type of a given frame in the video sequence (I-, P- or B-frame). The type is detected in the analysis process to eliminate the resource-demanding scene detection and frame-type decision. Furthermore, it allows a better prediction of the compression behavior and resource allocation, because the three types of frames are compressed in very different ways due to the use of dissimilar prediction schemes.

Bipred value. The bipred value is used in addition to the frame coding type and indicates whether two vectors have been used to predict a block – if any block in the frame has two vectors, the value is set. This is a useful optimization for bidirectionally predicted frames (B-frames): if just one reference frame is to be used during encoding, the algorithm may skip the interpolation of the two referenced pixel matrices (with all the accompanying processing).

Macro block width and height. This is simply the number of MBs in the two directions, horizontal and vertical respectively. It allows us to calculate the number of MBs in the frame (it was not assumed that this is always the same).

Macro block mode. Similarly to the frame coding type, the information about the macro block (MB) mode allows distinguishing whether the MB should be coded as intra, inter or bi-directionally predicted. It also stores more detailed information. For example, if we consider only bi-directionally predicted MBs, the following five modes are possible: forward predicted, backward predicted, direct mode with delta vector74, direct mode without MV, and not coded. The special case of a bi-directional prediction with two vectors is considered separately (by the bipred

74 For each luminance block there may be none, one or two vectors, which means that there may be none, 4 x 1 or 4 x 2 vectors for the luminance in an MB. There is, however, one more “delta vector” mode, i.e. the backward or forward vector is the same for all four luminance blocks.


value). If we go further and take H.264/AVC [ITU-T Rec. H.264, 2005] into consideration, there can be 23 different types for similar bidirectional MBs. For intra MBs there are 25 possible types in H.264/AVC, but only 2 in MPEG-4 (inter MBs have 5 types in both standards).

Such a diversity of MB coding types makes the compression algorithm very complex, because it has to find the optimal type. Thus, using meta-data from the analysis process significantly speeds up the compression and allows the execution path to be recognized, which is useful for resource allocation.

Macro block priority. Additionally, if the priority of an MB is considered, the processing can be influenced by calculating the MBs with the highest importance first, which is done in our implementation for the intra blocks (they have the highest priority). Moreover, depending on the complexity of the blocks within the MB, the encoder can assign more or less memory to the respective blocks in the quantization step [Symes, 2001]. Then, not only the complexity of a block but also the importance of the current MB can influence the bit allocation algorithm and optimize the quantization parameter selection, thus positively influencing the overall perceived quality.

Most significant coefficients. The first three coefficients, the DC and two AC coefficients, are stored in addition, but only for the intra-coded blocks. This allows for better processing control, since the coefficient calculation (DCT) can be skipped in case of a lack of resources – in such a situation it is possible to deliver estimated instead of the real pixel values and still provide acceptable picture quality. Since the DCTs of different codecs are expected to work in a similar way, storing the first three coefficients influences the size of the MD only a little, but on the other hand allows skipping the DCT calculation without the cost of dropping the whole macro block. In other words, the quality provided when the MB is skipped and just the three coefficients are used will still be higher than zero.

Motion vectors. They could be stored for the whole MB or for parts of an MB called blocks; our current implementation considers each block separately. Of course, only temporally predicted MBs require associated MVs (MVs do not exist for intra MBs at all). However, different compression algorithms use different numbers of MVs per MB, e.g. H.264/AVC allows keeping up to 16 MVs per MB. So, if all possible


combinations were searched (using e.g. quarter-pel accuracy), the execution time would explode. Thus pre-calculated MVs help a lot in real-time encoding, even though in some cases they are not directly applicable (in such situations they should be used in adapted form).

The loading of the continuous MD is to be implemented as part of the real-time encoder. Pseudo code showing how to implement it is given in Section XVII of Appendix D.
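For illustration only, a possible in-memory layout of the described continuous MD could look like the following C sketch; the field names and widths are assumptions made for this example and do not reproduce the actual RETAVIC bitstream layout of Appendix D.

/* One possible in-memory layout of the continuous MD described above. */
#include <stdint.h>
#include <stdio.h>

enum frame_coding_type { FRAME_I, FRAME_P, FRAME_B };

typedef struct {
    int16_t mv_x, mv_y;           /* pre-calculated motion vector per block    */
} md_block;

typedef struct {
    uint8_t  mode;                /* intra / inter / bi-directional variants   */
    uint8_t  priority;            /* importance of the MB (intra blocks highest) */
    int16_t  coeff[3];            /* DC + two AC coefficients (intra blocks only) */
    md_block blocks[4];           /* per-block motion vectors (luminance)      */
} md_macroblock;

typedef struct {
    enum frame_coding_type type;  /* frame coding type                          */
    uint8_t  bipred;              /* set if any block of the frame uses two MVs */
    uint16_t mb_width, mb_height; /* number of MBs horizontally / vertically    */
    md_macroblock *mb;            /* mb_width * mb_height entries               */
} md_frame;

int main(void)
{
    md_frame f = { FRAME_P, 0, 22, 18, NULL };   /* e.g. CIF: 22 x 18 MBs */
    printf("MBs per frame: %u\n", (unsigned)(f.mb_width * f.mb_height));
    return 0;
}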

V.5.2.3 MD-XVID as proof of concept

XVID is a state-of-the-art open-source implementation of the MPEG-4 standard for visual data [MPEG-4 Part II, 2004], more precisely for rectangular video (natural and artificial) supporting the Advanced Simple Profile. It was chosen as the base for the representative meta-data-based encoder due to its good compression [Vatolin et al., 2005], stable execution and source-code availability. It was adapted to support the proposed continuous MD set, i.e. the algorithm given in the previous section has been implemented. The design details of the MD-based XVID encoder can be found in [Militzer, 2004]; here only an overview is given.

The XVID encoder is depicted in Figure 17 a); it combines the previously discussed DCT algorithm for inter- and intra-frames (Figure 16) and optimizes the quality through an additional decoding part within the encoder. The first step done by XVID is the decision about the frame type, needed to apply the further steps selectively, which is crucial for the coding efficiency. As given in the literature [Henning, 2001], the compression ratio of the MPEG-4 standard depends on the frame type and typically amounts to 7:1 for I-frames, 20:1 for P-frames and 50:1 for B-frames. Of course, these ratios can change depending on the temporal and spatial similarity, i.e. on the number of different MB types encoded within the frame and the accompanying MVs. Then, depending on whether a P- or a B-frame is to be processed, motion estimation, which produces the motion vector data, is applied together with motion prediction delivering the motion compensation error. Additionally, coding reordering must take place in the case of B-frames, and two reference frames are used instead of one, so the ME/MP complexity is higher than for a P-frame. Next, the standard steps of DCT-based encoding are applied: DCT, quantization, zig-zag scanning, RLE and Huffman encoding. Due to the mentioned quality optimization by the additional decoding loop, the lossily transformed and quantized AC/DC coefficients are decoded back through dequantization, inverse DCT and


finally motion compensation, which is applied if the reconstructed frame is an inter-coded frame (precisely a P-frame). Such an encoder extension imitates the client-side behavior of the decoder and provides a common basis for the reference-frame reconstruction, thus avoiding an additional reference error deriving from the difference between the decoded reference frame and the original reference frame.

A representative of the meta-data-based video encoder based on XVID (called MD-XVID for short) is depicted in Figure 17 b). The lossless meta-data decompression step is omitted from the picture for simplicity, but in reality it takes place as suggested in Section V.1. The depicted MD represents not both types of MD but only the continuous set as given in Section V.5.2.2, i.e. all seven MD elements are included, and for each of them an arrowed connector shows the flow of meta-data to the respective module of the encoding process. The Frame Type Decision from Figure 17 a) is replaced by the much faster MD-Based Frame Type Selection. The Motion Estimation step present in Figure 17 a) is completely removed in MD-XVID thanks to the MV data flowing into the Motion Prediction step, and the MP is simplified due to the inflow of three additional elements of the continuous MD set, namely the bipred value, MB mode and MB priority.

Finally, the first three and most significant coefficients are delivered to the Quantization step if and only if they have not been calculated beforehand (the X-circle connector symbolizes the logical XOR operation). This last option allows delivering the lowest possible quality of the processed MBs in high-load situations. For example, if previous steps like motion prediction and the DCT could not be finished on time, or if the most important MBs took too much time and the remaining MBs are not going to be processed, then just the lowest possible quality, consisting of the first three coefficients and zeros for the rest, is further processed and finally delivered.
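The fallback path can be sketched as follows in C; the assumption that the three stored values occupy the first three positions of the scan order (the DC plus the first two AC coefficients) is made only for this example, and the names are illustrative.

/* Fallback: build a block from the three coefficients stored as continuous MD
 * (assumed here to be the first three in scan order) and zeros elsewhere. */
#include <string.h>
#include <stdio.h>

static void fallback_block(int out[64], const int stored_coeff[3])
{
    memset(out, 0, 64 * sizeof out[0]);
    out[0] = stored_coeff[0];     /* DC                              */
    out[1] = stored_coeff[1];     /* first AC (zig-zag position 1)   */
    out[8] = stored_coeff[2];     /* second AC (zig-zag position 2)  */
}

int main(void)
{
    const int md_coeff[3] = { 104, -7, 3 };   /* made-up values from the MD */
    int block[64];
    fallback_block(block, md_coeff);
    printf("block[0]=%d block[1]=%d block[8]=%d\n", block[0], block[1], block[8]);
    return 0;
}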

Optionally, there could also be an arrowed connector in Figure 17 b) symbolizing the flow of the MB priority to the Quantization step, thus allowing for even better bit allocation by an extended bit-rate control algorithm. However, this option has been neither implemented nor investigated yet, and so it is left for further work.


Figure 17. XVID Encoder: a) standard implementation and b) meta-data based implementation.


V.6. Evaluation of the Video Processing Model through Best-Effort Prototypes

The video processing model has been implemented at first as a best-effort prototype in order to evaluate the assumed ideas. A few critical aspects reflecting the core components of the architecture (from Phase 2: conversion to the internal format and content analysis; from Phase 3: decoding and encoding) have been considered, namely: the generation of the MD set by the analysis step, the encoding to the internal storage format demonstrating the scalability in the data amount and quality, the decoding from the internal format exhibiting the scalability of the processing in relation to the data quality, and the encoding using the prepared MD, where the quality of the delivered data and the processing complexity are considered. Finally, the evaluation of the complete processing of the video transcoding chain is done with respect to the execution time.

The evaluation of the static MD is not included in this section, because it can be done only with the help of an environment with well-controlled timing (i.e. it is included in the evaluation of the implementation in the real-time system).

V.6.1. MD Overheads

In the best-effort prototype, the content analysis of Phase 2 generating the MD has been integrated with the conversion step transforming the video data from the source format to the internal format. Here, the implemented LLV1 encoder has been extended by the content analysis functionality, such that the LLV1 encoder delivers the required statistical data describing the structure of the lossless bit stream used for scheduling as well as the data required for the encoding process in real-time transcoding, mainly used by the real-time encoder. Depending on the number of LLV1 layers to be decoded and the requested output format properties of the transcoding process, not all generated MD may be required. In these cases, only the really necessary MD can be accessed selectively, so that a waste of resources is avoided.

The overhead of introducing the continuous meta-data is depicted in Figure 18 a). The average cost for the 25 well-known sequences, calculated as the ratio of the continuous MD size to the LLV1 base layer size multiplied by 100%, was equal to 11.27%. The measure related to the base layer was used instead of the relation to the complete LLV1 (i.e. including all layers) because the differences can be noticed more easily; the cost of the continuous MD in relation to the complete LLV1 amounts on average to 1.63% with a standard deviation of 0.34%.
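Written out, the two overhead measures used above are simply the following (a restatement of the definition from the text, with sizes taken per sequence):

\mathrm{cost}_{\mathrm{BL}} = \frac{size_{\mathrm{cMD}}}{size_{\mathrm{BL}}} \cdot 100\,\%, \qquad \mathrm{cost}_{\mathrm{LLV1}} = \frac{size_{\mathrm{cMD}}}{size_{\mathrm{LLV1}}} \cdot 100\,\%

which over the 25 test sequences average 11.27% and 1.63% respectively.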

[Figure 18 charts: a) "Cost of continuous MD": size_cMD / size_BL [%] (0%-30%) for each tested sequence; b) "Distribution": number of results falling below the given cost range.]

Figure 18. Continuous Meta-Data: a) overhead cost related to LLV1 Base Layer for tested sequences and b) distribution of given costs.

The difference between the average cost for all sequences and the sequence-specific cost is also depicted by the thin line with the marks. Obviously, the size of the continuous MD is not constant, because it heavily depends on the content, i.e. on the motion characteristics including MVs and coefficients, and additionally it is influenced by the lossless entropy compression. However, the precise relationship between the continuous MD and the content of the sequence was neither mathematically defined nor further investigated.

Additionally, the distribution of the MD costs with respect to all tested sequences has been calculated using the frequency of occurrence in equally sized ranges of 5% with a maximum value of 30%75. This distribution is depicted in Figure 18 b). It is clearly visible that in most cases the MD overhead lies in the range between 10% and 15%; an MD overhead lower than 15% occurs in 80% of the cases (20 out of 25 sequences), and lower than 25% in 96% of the cases.

75 The ranges are as follows: (0%; 5%), [5%; 10%), [10%; 15%), [15%; 20%), [20%; 25%), and [25%; 30%). There was no continuous MD cost above 30% measured.

Finally, the static MD overhead was calculated. The average cost in relation to the LLV1 BL amounts to 0.72% with a standard deviation below 0.93%, so such a small overhead can easily be neglected with respect to the total storage space for the complete video data. If one considers that the LLV1 BL occupies between 10% and 20% of the losslessly stored video in the LLV1 format, then the overhead can be treated as unnoticeable, on the order of 1‰ of the complete LLV1.
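As a rough plausibility check of this order of magnitude, multiplying the measured 0.72% (relative to the BL) by the two bounds given above for the BL share yields:

0.72\,\% \cdot 0.10 = 0.072\,\% \approx 0.7\text{‰}, \qquad 0.72\,\% \cdot 0.20 = 0.144\,\% \approx 1.4\text{‰}

i.e. roughly one per mille of the complete LLV1.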

To summarize, the static MD is just a small part of the proposed MD set, and the continuous MD plays the key role here. Still, the continuous MD introduces only a very small overhead (1.6%) when related to the complete lossless data, and a small overhead (11.3%) when measured only against the LLV1 base layer. Please note that the LLV1 BL also has a limited level of data quality, as shown in the next section, but the size of the continuous MD is constant for a given sequence regardless of the number of LLV1 layers used in the later processing.

V.6.2. Data Layering and Scalability of the Quality

The scalability of the data with respect to quality is an important factor in the RETAVIC architecture. The LLV1 was designed with the purpose of having the data layered in an additive way without unnecessary redundancy in the bit streams. So the proposed layering method provides many advantages over traditional non-scalable lossless video codecs such as HuffYUV [Roudiak-Gould, 2006], Motion CorePNG, AlparySoft Lossless Video Codec [WWW_AlparySoft, 2004], Lossless Motion JPEG [Wallace, 1991] or LOCO-I [Weinberger et al., 2000]. Especially in the context of the RETAVIC transformation framework targeting format independence, where a changeable level of quality is a must, the data must not be accessible only on an all-or-nothing basis without scalability; merely reducing the file size of the compressed video to a manageable amount [Dashti et al., 2003] is not enough anymore.

The organization of the data blocks allocated on the storage device is expected to follow the LLV1 layering, such that the separate layers can be read efficiently and independently of each other, ideally sequentially, due to the highest throughput efficiency of today's hard drives, which then achieve their peak performance [Sitaram and Dan, 2000]. On the other hand, the data prefetching mechanism must consider the time constraints of the processed quanta of each layer, which is hardly possible with sequential reading for a varying number of data layers being stored separately. The recently proposed rotational-position-aware RT disk scheduling using a dynamic active subset, which exploits QAS (discussed in the related work), can be helpful here [Reuther and Pohlack, 2003]. This algorithm provides a framework optimizing the disk utilization under hard and statistical service guarantees by calculating the set of active (out of all outstanding) requests at each scheduling point such that no deadlines are missed. The optimization is provided within the active set by employing the shortest-access-time-first (SATF) method considering the rotational position of the request.

The time constraints for write operations are left out intentionally due to their unimportance to the RETAVIC architecture, in which only the decoding process of the video data stored on the server must meet real-time access requirements. Since the conversion from input videos into the internal storage format is assumed to be a non-RT process, it does not require a time-constrained storing mechanism so far.

The major advantage of the layered approach through bit-stream separation in LLV1 is the possibility of defining picture quality levels and temporal resolutions by the user, where the amount of requested as well as really accessed data can vary (Figure 19). In contrast, other scalable formats designed with network scalability in mind cannot simply be accessed in a scalable way; in such a case, the video bit stream has to be read completely (usually sequentially) from the storage medium and unnecessary parts can be dropped only during the decoding or transmission process, where two possibilities exist: 1) the entropy decoding takes place to find out the layers, or 2) the bit stream is packetized such that the layers are distinguishable without decoding [MPEG-4 Part I, 2004]. Regardless of the method, all the bits have to be read from the storage medium.

Obviously, lossless content requires significantly higher throughput than the lossy case. Even though modern hard disks are able to deliver unscaled and losslessly coded content in real time, it is a waste of resources to always read the complete bitstream even when not required. The scalable optimization of data prefetching for further processing in RETAVIC is simply delivered by a scheme where just a certain subset of the data is read, in which the number of layers, and thus the amount of data, can vary, as shown for the Paris (CIF) sequence in Figure 19 a). The base layer takes only 6.0% of the completely coded LLV1 including all the layers (Figure 19 b)). The addition of the temporal enhancement layer to the base layer (BL+TEL) brings the storage space requirement for the LLV1-coded Paris sequence to the level of about 9.6%. A further increase of the data amount by the successive quantization enhancement layers (QELs), i.e. QEL1 and QEL2, raises the required size to 32.1% and 65.6% respectively for this specific sequence. The last enhancement layer (QEL3) needs about 34.4% to make the Paris sequence lossless.
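The cumulative percentages quoted above simply add up the per-layer shares shown in Figure 19 b):

6.0\,\% + 3.6\,\% = 9.6\,\%, \qquad 9.6\,\% + 22.5\,\% = 32.1\,\%, \qquad 32.1\,\% + 33.5\,\% = 65.6\,\%, \qquad 65.6\,\% + 34.4\,\% = 100\,\%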

[Figure 19 charts for the Paris (CIF) sequence compressed in the LLV1 format: a) cumulated sizes (0-100 MB scale) for BL, BL+TEL, BL+TEL+QEL1, BL+TEL+QEL1+QEL2 and all layers; b) share of each layer: BL 6.0%, TEL 3.6%, QEL1 22.5%, QEL2 33.5%, QEL3 34.4%.]

Figure 19. Size of LLV1 compressed output: a) cumulated by layers and b) percentage of each layer.

The LLV1 layering scheme, in which the size of the base layer is noticeably smaller than the volume of all the layers combined (Figure 19), entails a number of disk requests made by the decoding process that is directly proportional to the amount of data actually requested. Thus, LLV1 offers new freedom for real-time admission control and disk scheduling, since most of the time users actually request videos at a significantly lower quality than the internally stored lossless representation due to limited network or playback capabilities (e.g. handheld devices or cellular phones). As a result, by using separated bit streams in the layered representation, like the LLV1 format, for internal storage, the waste of resources used for data access can be reduced. Consequently, the separation of bit streams in LLV1 delivers a more efficient use of the limited hard-disk throughput and allows more concurrent client requests to be served.
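A minimal sketch of such layer-selective access on the server side follows; the layer identifiers, file naming and read routine are hypothetical and only illustrate that a lower requested quality directly translates into fewer byte ranges being fetched from disk.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical layer identifiers following the LLV1 layering (BL, TEL, QEL1..QEL3). */
enum llv1_layer { LLV1_BL = 0, LLV1_TEL, LLV1_QEL1, LLV1_QEL2, LLV1_QEL3, LLV1_LAYERS };

/* Read one separately stored layer bitstream into memory (placeholder I/O). */
static long read_layer(const char *basename, enum llv1_layer layer, unsigned char **buf)
{
    char path[256];
    snprintf(path, sizeof(path), "%s.layer%d", basename, (int)layer);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    *buf = malloc((size_t)size);
    long got = (long)fread(*buf, 1, (size_t)size, f);
    fclose(f);
    return got;
}

/* Only the layers needed for the requested quality are touched on disk;
 * e.g. quality level 1 corresponds to BL+TEL, level 4 to the lossless case. */
static void prefetch_for_quality(const char *basename, int quality_level)
{
    unsigned char *buf;
    for (int l = LLV1_BL; l < 1 + quality_level && l < LLV1_LAYERS; ++l) {
        long n = read_layer(basename, (enum llv1_layer)l, &buf);
        if (n >= 0) { printf("layer %d: %ld bytes\n", l, n); free(buf); }
    }
}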

To verify whether the layering scheme is as efficient for other video sequences as for the Paris sequence (Figure 19), a subset of well-known videos [VQEG (ITU), 2005; WWW_XIPH, 2007] has been encoded, and the size required for the data on each level of the LLV1 layering scheme has been compared to the original size of the uncompressed video sequence. The results are depicted in Figure 20. There the base layer was always investigated together with the temporal enhancement layer in order to retain the frame rate of the original temporal resolution. It can be noticed that the size of the base layer (including TEL) proportionally influences the next layer, the QEL1 built directly on top of BL+TEL. For example, the BL+TEL of the Mobile sequence crosses the 20% line in all three resolutions (CIF, QCIF, CIFN), and likewise Mobile's QEL1s are above the average. In contrast, Mother and Daughter, Foreman or News have BL+TELs as well as QEL1s smaller than the respective average values. Thus it may be derived that the bigger the base layer, the bigger the share of QEL1 in the LLV1-coded sequence. In contrast, the QEL2 and QEL3 are only slightly influenced by the BL+TEL compression size and both are almost constant: the average compression size with respect to the original uncompressed sequence amounts to 19% and 19.2% respectively, and the standard deviation is equal to 0.47% and 0.41%.

[Figure 20 chart: "Relation of LLV1 layers to uncompressed video": percentage (0-100%) of the original size taken by BL+TEL, QEL1, QEL2, QEL3 and all layers together for each tested sequence.]

Figure 20. Relation of LLV1 layers to the original uncompressed video in YUV format.

Summarizing, it may be deduced that QEL2 and QEL3 are resistant to content changes and have an almost constant compression ratio/size, while QEL1 depends somewhat on the content and BL+TEL is very content-dependent. For QEL1 the minimum compression size amounts to 8.2% and the maximum to 17.8%, so the MAX/MIN ratio reaches 2.16; the BL+TEL, demonstrating even less stability, achieves 3.4% and 31.6% respectively, with a MAX/MIN ratio equal to 9.41. In contrast, the MAX/MIN ratios of QEL2 and QEL3 are equal to 1.11 and 1.10. On the other hand, the content-independent compression may indicate a suboptimal compression scheme that is not capable of exploiting the signal characteristics and the entropy of the source data on the higher layers, but some more investigation is required to prove such a statement.

Figure 21. Distribution of layers in LLV1-coded sequences showing average with deviation.

Additionally, the distribution of the layers in the LLV1-coded sequences has been calculated and is depicted in Figure 21. The average is calculated for the same set of videos as in Figure 20, but this time the percentage of each quantization layer76 within the LLV1-coded sequence is presented. The maximum and minimum deviation for each layer is determined; it is given as superscript and subscript assigned to the given average and depicted as red half-rings filled with the layer's color. The BL+TEL occupies less than two-elevenths of the whole, QEL1 requires a bit more than one-fifth, and QEL2 and QEL3 each need a bit less than one-third. Obviously, the percentages of QEL2 and QEL3 are not almost-constant as in the previous case, because the size of the LLV1-coded sequences changes according to the higher or lower compression of BL+TEL and QEL1, whose compression sizes depend on the source data. The small deviation (both negative and positive) of QEL1 confirms its relationship to BL+TEL: if the BL+TEL size changes, the QEL1 size follows these variations such that the percentage of QEL1's size with respect to the changed total size of LLV1 is kept at the same level, fluctuating only in a small range (from -4.2 to +1.7). A higher or lower percentage of BL+TEL is reflected in corresponding losses or gains in the percentages of the two other enhancement layers (QEL2 and QEL3).

76 The base layer together with the temporal enhancement layer is used as the base for the QELs in the benchmarking for the sake of simplicity.

The scalability in the amount of data would be useless if no differentiation of the signal-to-noise ratio (SNR) took place. The LLV1 is designed such that the SNR scalability is directly proportional to the number of layers, and thus to the amount of data. The peak signal-to-noise ratio (PSNR)77 [Bovik, 2005] values of each frame between decoded streams including different combinations of layers are given in Figure 22 and Figure 23. Four combinations of quantization layering (temporal layering is again turned off) are investigated:

1) just base quantization layer,

2) base layer and quantization enhancement layer (BL+QEL1),

3) base layer, first and second quantization layer (BL+QEL1+QEL2)

4) all layers

The detailed results for Paris (CIF) are presented in Figure 22 and for Mobile (CIF) in Figure 23. The quality values of the fourth option, i.e. the decoding of all layers from BL up to QEL3, are not depicted in the graphs since the PSNR values of lossless data are infinite. The PSNR is based on the MSE and is a decreasing function of it, so an MSE of zero is a proof of lossless compression where no difference error between the images exists. Hence, the lossless property was proved by checking whether the mean square error (MSE) is equal to zero. The results show that the difference of the PSNR values between two consecutive layers averages about 5-6 dB, and this difference in quality between the layers seems to be constant for the various sequences. Moreover, there is no significant variation along the frames, so the perceived quality is also experienced as constant.

77 In most cases the PSNR value is in accordance with the compression quality. But sometimes this metric does not reflect the presence of some important visual artefacts. For example, the severity of blocking artefacts cannot be estimated, i.e. the compensation performed by some codecs as well as the presence of "snow" artefacts (strong flickering of isolated pixels) cannot be detected in the compressed video using only the PSNR metric. Moreover, it is difficult to say whether a 2 dB difference is significant or not in some cases.
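For reference (cf. footnote 77), the PSNR used throughout this comparison is the standard MSE-based definition for 8-bit samples (peak value 255), restated here only for convenience [Bovik, 2005]:

\mathrm{MSE} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(x_{i,j}-\hat{x}_{i,j}\right)^{2}, \qquad \mathrm{PSNR} = 10 \log_{10}\frac{255^{2}}{\mathrm{MSE}}\ \mathrm{dB}

so MSE = 0 (bit-exact reconstruction) corresponds to an infinite PSNR, which is why the lossless case is not plotted.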

[Figure 22 chart: "Picture Quality for Different Layers (Paris - CIF)": PSNR [dB] (scale 30.00-44.00) per frame (1-1051) for the BL, QEL1 and QEL2 configurations.]

Figure 22. Picture quality for different quality layers for Paris (CIF) [Militzer et al., 2005]

[Figure 23 chart: "Picture Quality for Different Layers (Mobile - CIF)": PSNR [dB] (scale 30.00-44.00) per frame (1-251) for the BL, QEL1 and QEL2 configurations.]

Figure 23. Picture quality for different quality layers for Mobile (CIF) [Militzer et al., 2005]

Finally, the scalability in quality is achieved such that a lower layer has a proportionally lower quality. The PSNR value for the BL ranges from about 32 dB to roughly 34 dB, for the QEL1 it lies between 37 dB and 39 dB, and for the QEL2 between 43 dB and 44 dB. The lossless property of all layers together (from BL up to QEL3) has been confirmed.

V.6.3. Processing Scalability in the Decoder

Due to the data scalability in LLV1, the quantification of resource requirements in relation to the amount of data seems to be possible. This allows QoS-like definitions of the required resources, e.g. processing time, required memory, required throughput, and thus allows better resource management in a real-time environment (RTE). So, the data scalability enables some control over the decoding complexity with respect to the requested data amount and thus the data quality (Figure 24). The achieved processing scalability, being proportional to the number of layers and thus to the amount of data, fits the needs of the RETAVIC framework.

Figure 24. LLV1 decoding time per frame of the Mobile (CIF) considering various data layers [Militzer et al., 2005].

The decoding process of, e.g., the base layer and the temporal enhancement layer (BL+TEL) takes only about 25% of the time for decoding the complete lossless information (Figure 24). The LLV1 base layer is expected to provide the minimal quality of the video sequence, so only the BL decoding is mandatory. To achieve higher levels of data quality, additional layers have to be decoded at the cost of processing time. Thus, the LLV1 decoding process can be well partitioned into a mandatory part, requiring just very few computational resources, and additional optional parts, requiring the remaining 75% of the resources spent in total. Contrary to the decoding of traditional single-layered video, this scalable processing characteristic can be exploited according to the imprecise computation concept [Lin et al., 1987], where at first a low-quality result is delivered and then, according to the available resources and time, additional calculations are done to elevate the accuracy of the information. Moreover, the processing with mandatory and optional computation parts can be scheduled according to QAS [Hamann et al., 2001a].
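A minimal sketch of this mandatory/optional split, assuming a per-frame time budget, is given below; the stubbed decode routines and the budget value are placeholders and do not reflect the actual LLV1 decoder interfaces or the QAS admission machinery.

#include <time.h>
#include <stdio.h>
#include <stdbool.h>

/* Placeholder decode steps standing in for the real LLV1 routines:
 * BL+TEL is the mandatory part, QEL1..QEL3 are the optional parts. */
static void decode_bl_tel(void)      { /* mandatory, minimal quality */ }
static bool decode_next_qel(int qel) { return qel <= 3; /* pretend layer exists */ }

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

/* Imprecise-computation style decoding of one frame: the mandatory part always
 * runs; enhancement layers are added only while the per-frame budget allows it. */
static int decode_frame_imprecise(double frame_budget_ms)
{
    double start = now_ms();
    int decoded_qels = 0;
    decode_bl_tel();
    for (int qel = 1; qel <= 3; ++qel) {
        if (now_ms() - start >= frame_budget_ms)
            break;                 /* deadline pressure: stop refining */
        if (!decode_next_qel(qel))
            break;
        ++decoded_qels;
    }
    return decoded_qels;           /* how many optional layers made it in time */
}

int main(void)
{
    printf("optional layers decoded: %d\n", decode_frame_imprecise(40.0 /* ms */));
    return 0;
}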

Moreover, the required computational resources for the decoding process can be controlled even more smoothly if the decoding granularity is not only the number of enhancement layers but also the per-frame or even the macroblock level; for example, the complete BL+TEL layers are decoded and QEL1 only partly, where partly means only some of the frames, or even only some important MBs of a given frame. In such a case only some frames or parts of frames will have a higher quality.

When considering LLV1 decoding versus non-scalable decoding, the computational complexity of decoding an LLV1 stream at a specific quality level (e.g. targeting a given PSNR value) is not much higher than that of decoding a single-layered representation at about the same quality. Such behavior is achieved because the dequantization and the inverse transform are executed only once, no matter whether just one or three quantization enhancement layers are used for decoding. The main overhead cost for higher quality then derives neither from the dequantization nor from the inverse transformation, but from the amount of data being decoded by the entropy decoder, which also in the case of single-layered decoders rises in accordance with the higher quality.

The described adaptivity in terms of computational complexity allows a media server to make better use of its resources. By employing LLV1 as the unified source format for real-time transformations, more concurrent client requests with lower quality or a smaller number of clients with higher quality can be served. Thus, the QoS control gains a controlling mechanism based on the construction of the internal storage format and can make decisions according to the established QoS strategy. Such QoS control would not be possible if no scalability were present in the data and in the processing.

Additionally, and in analogy to the sophisticated QoS levels defined for ATM networks, a minimum guaranteed quality can be defined, since there is the possibility of dividing the LLV1 decoding process into mandatory and optional parts. Moreover, if the system has some idle resources, the quality can be improved according to the QAS such that the conversion process calculates data from the enhancement layers but does not allocate the resources exclusively, so that other new processes can still be admitted. For example, let us assume that there is some number of simultaneous clients requesting the minimum guaranteed quality and this number is smaller than the maximum number of possible clients requesting the minimum quality. Rationally thinking, such an assumption is the standard case, because otherwise the allocation would be refused and the client's request would be rejected. So, there are still some idle resources which could be adaptively assigned to the running transformations in order to maximize the overall output quality, and finally deliver a quality higher than the guaranteed minimum.

The comparison to other coding algorithms makes sense only if both the lossless and the scalable properties are assumed. A few wavelet-based video decoders which could have the mentioned characteristics have been proposed [Devos et al., 2003; Mehrseresht and Taubman, 2005]. However, in [Mehrseresht and Taubman, 2005] only the quality evaluation but not the processing complexity is provided. In [Devos et al., 2003] the implementation is just a prototype and not the complete decoding is considered, but just three main parts of the algorithm, i.e. wavelet entropy decoding (WED), inverse discrete wavelet transform (IDWT) and motion compensation (MC). Anyway, the results are far behind those achieved even by the best-effort LLV1, e.g. just the three mentioned parts need from 48 sec. in low quality up to 146 sec. in high quality for Mobile CIF on an Intel Pentium IV 2.0 GHz machine. In contrast, the RT-LLV1 decoder requires about 3 sec. for just the base layer and 10 sec. for the lossless (most complex) decoding on the PC_RT machine, which is specified in Appendix E (section XVIII.2)78.

Additionally, LLV1 was compared to Kakadu, a well-known JPEG 2000 implementation; the results of both best-effort implementations are shown in Figure 25. On average LLV1 takes 11 sec, while Kakadu needs almost 14 sec on the same PC_RT machine, so the processing of LLV1 is less CPU-demanding than Kakadu even though the last few steps of the LLV1 algorithm, such as inverse quantization (IQ) and inverse binDCT (IbinDCT), have to be executed twice.

78 Please note that PC_RT is an AMD Athlon XP 1800+ running at 1.533 GHz, so the results for LLV1 on a machine using an Intel Pentium IV 2 GHz should be even better.


[Figure 25 chart: total execution time [s] (scale 0-16) of LLV1 vs. JPEG2K-Kakadu decoding for ten runs (1-10) and the average (AVG).]

Figure 25. LLV1 vs. Kakadu – the decoding time measured multiple times and the average.

V.6.4. Influence of Continuous MD on the Data Quality in Encoding

In order to evaluate the influence of the continuous MD on the data quality in the encoding process, simulations have been conducted on a set constructed from well-known standard video clips recognized for research evaluation [WWW_XIPH, 2007] and from the sequences of the Video Quality Experts Group [WWW_VQEG, 2007]. In order to check how the motion vectors coming from the continuous MD set influence the quality of the output, a comparison between a best-effort MPEG-4 encoder using the standard full motion estimation step and an MD-based encoder directly exploiting the motion vectors has been done. The results for these two cases are depicted in Figure 26.

The PSNR value (in dB) given on the Y-axis is averaged per compressed sequence and plotted for a number of different sizes of the compressed bit stream specified on the non-linear X-axis. The non-linear scale is logarithmic, because the PSNR measure is a logarithmic function of the peak signal value divided by the noise represented by the mean square error (MSE), and so the differences of the PSNR values, especially at lower bit rates, can be better observed.

It can be observed in Figure 26 that the direct use of the MVs from the continuous MD performs very well when targeting low qualities (from 34 to 38 dB); however, if a very low quality, i.e. a very high compression, is expected, then the direct MV reuse delivers worse results in comparison to the full motion estimation. This is caused by the difference between the currently processed frame and the reference frame reconstructed within the encoder with a higher quantization parameter (for a higher compression ratio), which introduces a higher MSE for the same constant set of MVs. Since the MVs are defined in analogy to the BL of LLV1, which targets a quality of 32-34 dB, errors are also introduced in the case of higher quality (above 40 dB); however, these errors are much smaller than the errors in the case of very low quality (25 to around 32 dB).

Figure 26. Quality value (PSNR in dB) vs. output size of compressed Container (QCIF) sequence [Militzer, 2004].

Contrary to the undoubted interest in the higher qualities, it is questionable whether the very low qualities (below 32 dB) are of interest to users. Thus another type of graph, commonly known as rate-distortion (R-D) curves, has been used for evaluating the applicability of the MD-based method. The R-D curves depict the quality against the more readable bit rates expressed in bits per pixel (bpp), which may be directly transformed into the compression ratio; i.e., assuming the YV12 color space with 12 bpp in the uncompressed video source, it can be derived that achieving 0.65 bpp in the compressed domain delivers a compression ratio of 18.46 (a compressed size of 5.5%), and analogously 0.05 bpp yields 240 (about 0.4%) and 0.01 bpp yields 1200 (0.83‰).
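The bpp-to-compression-ratio conversion used above is just a division by the 12 bpp of the uncompressed YV12 source; the three values quoted in the text follow directly:

\mathrm{CR} = \frac{12\ \mathrm{bpp}}{\mathrm{bpp}_{\mathrm{compressed}}}: \qquad \frac{12}{0.65} \approx 18.5, \qquad \frac{12}{0.05} = 240, \qquad \frac{12}{0.01} = 1200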

Figure 27. R-D curves for Tempete (CIF) and Salesman (QCIF) sequences [Suchomski et al., 2005].

The R-D graphs showing the comparison of the standard XVID encoder and the MD-based XVID have been used for comparisons targeting different bit rates and low, but not very low, quality (option Q=2). Considering various bits per pixel of the compressed output, the quality for both processing cases of two sequences (Tempete and Salesman) is depicted in Figure 27. It can be noticed that the curves overlap in the range of 30 to 40 dB, thus the influence of the direct MV reuse is negligible. In the case of higher quality, e.g. around 49 dB, the cost of introducing MD-based encoding in terms of compression efficiency is still very small, i.e. 0.63 bpp is achieved for Tempete instead of 0.61 bpp, which means a compression ratio of 19.1 instead of 19.7. Even better results are achieved for Salesman.


V.6.5. Influence of Continuous MD on the Processing Complexity

The processing complexity is an important factor in video transcoding, especially if applied in real time for delivering format transformation. The continuous meta-data have been designed such that they influence the complexity in a positive manner, i.e. the processing benefits from a simplification (speed-up) of the algorithm and from a stabilization of the processing time counted per frame.

Figure 28. Speed-up effect of applying continuous MD for various bit rates [Suchomski et al., 2005].

The speed-up effect is noticeable for the different bit rates targeted by the encoder (Figure 28), so the positive effects of using continuous MD cover not only the wide spectrum of qualities, which is proportional to the achieved bit rate (Figure 27), but also the processing complexity, which is inversely proportional to the bit rate, i.e. lowering the bit rate yields a higher fps and thus a smaller complexity. Please note that the Y-axis of Figure 28 represents frames per second, so the higher the value the better. It is clearly visible that the MD-based encoder outperforms the unmodified XVID by allowing many more frames per second to be processed (black line above the gray one). For example, if the bit rate of 0.05 bpp is requested, a speed-up of 1.44 (MD-XVID ~230 fps vs. XVID ~160 fps) is achieved, and respectively a speed-up of 1.47 for the bit rate of 0.2 bpp (MD-XVID ~191 fps vs. XVID ~132 fps); the MD-XVID (145 fps) is 1.32 times faster than XVID for the bit rate of 0.6 bpp (~110 fps).

[Figure 29 chart: "Good Will Hunting Trailer (DVD), PAL, 1000 kbit/s": processing time [ms] (scale 0-35) per frame (1-1639) for XviD 1.0, quality 2, with direct MV reuse and with regular ME.]

Figure 29. Smoothing effect on the processing time by exploiting continuous MD [Suchomski et al., 2005].

The encoding time per frame for one bit rate (of 1000 kbps) is depicted for a real-world sequence, namely the Good Will Hunting Trailer, in Figure 29. Besides the noticeable speed-up (smaller processing time), a smoothing effect in the processing time for the MD-based XVID can be recognized, i.e. the difference between the MIN and MAX frame-processing times and the variations of the time from frame to frame are smaller (Figure 29). Such a smoothing effect is very helpful in real-time processing, because it makes the behavior of the encoder more predictable, and as a result the scheduling of the compression process is easier to manage. Consequently, the resource requirements can be determined more accurately and stricter buffer techniques for continuous data processing can be adopted.

Moreover, the smoothing effect will be even more valuable if more complex video sequences are processed. For example, assuming a close-to-worst-case scenario for a usual (non-MD-based) encoder, where very irregular and unpredictable motion exists, the processing time spent per frame fluctuates much more than the one depicted in Figure 29 or in Figure 8. So, the reuse of the MVs, the frame type or the MB type in such cases eliminates the numerous unnecessary steps leading to misjudged decisions in the motion prediction by delivering the almost-perfect results.

V.6.6. Evaluation of Complete MD-Based Video Transcoding Chain

Finally, the evaluation of a simple but complete chain using the MD-based transcoding has been conducted. As the RETAVIC framework proposes, the video encoder has been split into two parts: the content analysis, which is encapsulated in the LLV1 encoder (in the non-real-time preparation phase), and the MD-based encoding (MD-XVID) using the continuous MD delivered from the outside (read from the hard disk). The video data is stored internally in the proposed LLV1 format and the continuous MD are losslessly compressed using entropy coding. The continuous MD decoder is embedded in the MD-XVID encoder, thus the cost of MD decoding is included in the evaluation.

Figure 30. Video transcoding scenario from internal LLV1 format to MPEG-4 SP compatible (using MD-XVID): a) usual real-world case and b) simplified investigated case.

The video transcoding scenario is depicted in Figure 30. Two cases are presented: a) the usual real-world case where the delivery to the end-client through the network occurs, and b) the simplified case for investigating only the most interesting parts of the MD-based transcoding (marked in color). So, the four simple steps of the example conversion are performed as depicted in Figure 5 b), namely: the LLV1-coded video together with the accompanying MD is read from the storage, next the compressed video data are adaptively decoded to raw data, then the decoding of the continuous MD and the encoding to MPEG-4 Simple Profile using the MD-based XVID is performed, and finally the MPEG-4 stream is written to the storage. For the real-world situation (a), the MPEG-4 stream is converted into the packetized elementary stream (PES) according to [MPEG-4 Part I, 2004] and sent to the end-client through the network instead of being written to the storage (b).
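A structural sketch of the simplified case b) is given below; the data types, function names and stubbed bodies are placeholders for the actual LLV1 decoder, continuous-MD decoder and MD-XVID encoder interfaces and only illustrate the per-frame order of the steps.

#include <stdbool.h>
#include <stdio.h>

/* Opaque stand-ins for the real data units of the chain. */
typedef struct { int dummy; } raw_frame_t;   /* decoded raw video frame    */
typedef struct { int dummy; } frame_md_t;    /* per-frame continuous MD    */
typedef struct { int dummy; } m4v_frame_t;   /* compressed MPEG-4 SP frame */

/* Stubbed component interfaces; the real chain would call the LLV1 decoder,
 * the continuous-MD decoder and the MD-XVID encoder here. */
static int frames_left = 3;
static bool llv1_decode_frame(int layers, raw_frame_t *out) { (void)layers; (void)out; return frames_left-- > 0; }
static bool md_decode_frame(frame_md_t *out) { (void)out; return true; }
static void mdxvid_encode_frame(const raw_frame_t *in, const frame_md_t *md,
                                int target_kbps, m4v_frame_t *out)
{ (void)in; (void)md; (void)target_kbps; (void)out; }
static void write_frame(const m4v_frame_t *frm) { (void)frm; puts("frame written"); }

/* Simplified investigated case b): read -> adaptive LLV1 decode -> continuous-MD
 * decode -> MD-XVID encode -> write, repeated per frame. */
static void transcode(int llv1_layers, int target_kbps)
{
    raw_frame_t raw; frame_md_t md; m4v_frame_t out;
    while (llv1_decode_frame(llv1_layers, &raw) && md_decode_frame(&md)) {
        mdxvid_encode_frame(&raw, &md, target_kbps, &out);
        write_frame(&out);
    }
}

int main(void) { transcode(2 /* e.g. BL+TEL+QEL1 */, 400 /* kbps */); return 0; }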

Figure 31. Execution time of the various data quality requirements according to the simplified scenario.

The results of the simplified scenario (b) for the Paris (CIF) sequence are depicted in Figure 31. Here, the execution time covers the area marked by color (Figure 30), i.e. only the decoding and encoding are summed up and the read/write operations are not included (they influence the results insignificantly anyway). Four sets of quality requirements specified by the user have been investigated (a small configuration sketch follows the list):

1. low quality – where the BL and TEL of LLV1 are decoded and the bit rate of the MPEG-4 bitstream targets 200 kbps;

2. average quality – where BL+TEL and QEL1 of LLV1 are decoded and the targeted bit rate is equal to 400 kbps;

3. high quality – where the layers up to QEL2 of LLV1 are decoded and the higher bit rate of 800 kbps is achieved;

4. excellent quality – where all LLV1 layers are processed (lossless decoding) and the MPEG-4 stream with a bit rate of 1200 kbps is delivered.
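The four configurations can be summarized in a small table; the struct below merely restates the list (the layer count refers to the QELs decoded on top of BL+TEL) and is not part of the prototype's actual interface.

/* Quality presets used in the evaluation: how many LLV1 quantization
 * enhancement layers are decoded and which MPEG-4 bit rate is targeted. */
struct quality_preset {
    const char *name;
    int qel_layers;      /* QELs decoded on top of BL+TEL (0..3)        */
    int target_kbps;     /* bit rate of the produced MPEG-4 SP stream   */
};

static const struct quality_preset presets[] = {
    { "low",       0,  200 },   /* BL+TEL only                    */
    { "average",   1,  400 },   /* + QEL1                         */
    { "high",      2,  800 },   /* + QEL2                         */
    { "excellent", 3, 1200 },   /* all layers, lossless decoding  */
};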

The execution time for each quality configuration is measured per frame in order to see the behavior of the transcoding process. There are still a few peaks present, which may derive from thread preemption in the best-effort system; however, in general the execution time per frame is very stable, and one could even risk the statement that it is almost constant. Obviously, the use of the continuous MD allows for a reduced encoder complexity, making the share of the LLV1 decoder's time in the total execution time bigger in contrast to a chain with the standard XVID encoder. On the other hand, the proven LLV1 processing scalability now significantly influences the total transcoding time and thus allows gaining better control over the amount and quality of the processed data at the encoder's input.

All in all, the whole transcoding using the continuous MD shows much more stable behavior for all considered quality ranges up to the full-content lossless decoding of the LLV1 video. Summarizing, the MD-based framework allows gaining more control over the whole transcoding process and speeds up the execution of the video processing chain in comparison to transcoding without any MD.


VI. AUDIO PROCESSING MODEL

In analogy to chapter V, chapter VI proposes the audio-specific processing model based on the logical model proposed in chapter IV. An analysis of a few well-known representative audio encoders is given at the beginning. Next, the general assumptions for the audio processing are made. Subsequently, the audio-related static MD is defined, and MPEG-4 SLS is proposed as the internal format and described in detail. The MD-based processing is described separately for decoding and encoding. Finally, the evaluation of the decoding part of the processing model is given. This part covers only the MPEG-4 SLS evaluation and its enhancement.

VI.1. Analysis of Representative Audio Encoders

Three codecs have been selected for the analysis as representatives of different perceptual transform-based algorithms. These are the following:

• OggEnc [WWW_OGG, 2006] – standard encoding application of the Ogg Vorbis coder, being part of the vorbis-tools package,
• FAAC [WWW_FAAC, 2006] – open source MPEG-2 AAC compliant encoder,
• LAME [WWW_LAME, 2006] – open source MPEG-1 Layer III (MP3) compliant encoder.

All of them are open source programs and they are recognized as state of the art for their specific audio formats. The decoding part of these lossy coders is not considered as important for RETAVIC, since only the encoding part is executed in the real-time environment. Moreover, the most important factors under analysis cover the constancy and predictability of the encoding time as well as the encoding speed (deriving from the coding complexity). In all cases the default settings of the encoders have been used.

Different types of audio data have been used in the evaluation process. The data covered the instrumental sounds from the MPEG Sound Quality Assessment Material [WWW_MPEG SQAM, 2006] and an own set of music from commercial CDs and self-created audio [WWW_Retavic - Audio Set, 2006]; however, in the later part only three representatives are used (namely silence.wav, male_speech.wav, and go4_30.wav). These three samples are enough to show the behavior of the encoding programs under different circumstances like silence, speech and music, having high and low volume or different dynamic ranges.

The graphs depicting the behavior of all encoders for each of the selected samples are given such that the encoding time is measured per block of PCM samples, defined as 2048 samples for FAAC and OggEnc, and 1152 samples for Lame. Such a division of samples per block derives from the audio coding algorithm itself and cannot simply be changed (i.e. the block size in Lame is fixed); thus the results cannot simply be compared on one graph, and so Lame is depicted separately.

As expected, the silence sample results show a very constant encoding time for all the encoders. The results are depicted in Figure 32 and Figure 33 respectively. The FAAC (pink color) needs about 2 ms per block of PCM samples, while the OggEnc (blue color) requires roughly 2.4 ms. Lame achieves about 2.1 ms per block; however, due to the smaller block size, a normalized time of 3.73 ms per 2048-sample block can be derived (it is not depicted, though, to avoid flattening of the curve).
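The normalization mentioned above simply rescales Lame's per-block time to the 2048-sample block size used by the other two encoders:

2.1\ \mathrm{ms} \cdot \frac{2048}{1152} \approx 3.73\ \mathrm{ms}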

Figure 32. OggEnc and FAAC encoding behavior of the silence.wav (based on [Penzkofer, 2006]).


Figure 33. Behavior of the Lame encoder for all three test audio sequences (based on [Penzkofer, 2006]).

Figure 34. OggEnc and FAAC encoding behavior of the male_speech.wav (based on [Penzkofer, 2006]).

The results for the encoding of male_speech are depicted in Figure 33 for Lame and in Figure 34 for FAAC and OggEnc. Contrary to silence, the encoding time fluctuates much more for male_speech. The FAAC encodes the blocks in a time between 1.5 ms and 2 ms and is still relatively constant with respect to this range, since there are no important peaks that exceed these values. The encoding time of Lame fluctuates much more, but still there are just a few peaks above 3 ms. The OggEnc behaves unstably, i.e. there are plenty of peaks in both directions going even beyond 7 ms per block.

Yet another behavior of the audio encoders can be observed for the music example go4_30 (Figure 35). A narrow tunnel with min and max values can still be defined for the FAAC encoder (1.6 to 2.2 ms), but the OggEnc is now very unstable, having even more peaks than in the case of male_speech. Also Lame (yellow color in Figure 33) manifests the unstable behavior and requires more processing power than in the case of male_speech, i.e. the encoding time ranges from 2 to 4 ms per block of 1152 samples and the average encoding time is longer (yellow curve above the two other curves).

Figure 35. OggEnc and FAAC encoding behavior of the go4_30.wav (based on [Penzkofer, 2006]).

The overall FAAC encoding time of the different files does not differ very much even if different types of audio samples are used. This may ease the prediction of the encoding time. The OggEnc results demonstrate huge peaks in both directions of the time axis for male_speech and for go4_30. These peaks are caused by the variable window length, which is possible in the Ogg Vorbis codec, i.e. the number of required encoding steps is not fixed for a block of 2048 samples and is determined by block analysis. Here, pre-calculated meta-data keeping the window length of every block would be applicable to help in the prediction of the execution time. On the other hand, the peaks oscillate around a relatively constant average value and usually a positive peak is followed by a negative one. Such behavior could be supported by a small time buffer leading to a constant average time.

Finally, the overall encoding time of the analyzed encoders differs significantly, as shown in Figure 36. The male_speech is processed faster because the sample sequence is shorter than the two others. Still, it has to be pointed out that these codecs are lossy encoders, in which the compression ratio, sound quality and processing time are related to each other, and a different relationship exists for each encoder. Thus, in order to compare speed, the quality and the produced size shall also be considered. The measurement of perceived quality is subjective for each person, so it needs a special environment, hardware settings and experienced listeners to evaluate the quality. Anyway, the default configuration settings should be acceptable to a home user. On the other hand, the quality and compression ratio were not the point of the behavior evaluation under the default configuration settings of the selected encoders. So the idea of the graph is to show the different behavior depending on the audio source, i.e. for different sequences one encoder is sometimes faster and sometimes slower than its competitors.

Figure 36. Comparison of the complete encoding time of the analyzed codecs.

VI.2. Assumptions for the Processing Model

Considering the results from the previous section, a better understanding of the coding algorithm is required before proposing enhancements in the processing. All the audio codecs under investigation employ the perceptual audio coding idea, which uses a psychoacoustic model79 to specify which parts of the data are really inaudible and can be removed during compression [Sayood, 2006; Skarbek, 1998]. The goal of the psychoacoustic model is to increase the efficiency of compression by maximizing the compression ratio and to make the reconstructed file as close to the source as possible in terms of perceived quality. Of course, an algorithm using a psychoacoustic model is a lossy coding technique, so there is no bit-exactness between the source and the reconstructed data files.

To make the audio compression effective, it is reasonable to know what kind of sounds a human can actually hear. The distinction between the important-for-perception elements and the less important ones lets the developers concentrate only on the audible parts and ignore the segments which would not be heard anyway. Thanks to this, aggressive compression is possible, or even cutting out some parts of the signal without significant influence on the quality of the heard sound. Above all, the frequencies which exceed the perception threshold (e.g. very high frequencies) can be removed, while the frequencies important for human beings, between 200 Hz and 8 kHz, must be reflected precisely during the compression, because a human can easily hear a decrease of the sound quality in that frequency range.

When two sounds occur simultaneously, the masking phenomenon takes place [Kahrs and Brandenburg, 1998]. For example, when one sound is very loud and the other one is rather quiet, only the loud sound will be perceived. Also, if a quiet sound occurs at a short interval of time after or before a loud, sudden sound, it will not be heard. The inaudible parts of a signal can be encoded with a low precision or not encoded at all. Psychoacoustics helped in the development of effective, almost lossless compression in terms of the perceived quality. The removed parts of the signal do not affect the comfort of the sound perception to a high degree, although they may decrease its signal quality.

79 The ear of an average human can detect sounds in a frequency range between 16 Hz and 22 kHz and within about 120 dB of dynamic range in the intensity domain [Kahrs and Brandenburg, 1998]. The dynamic range is said to span from the lowest level of hearing (0 dB) to the threshold of pain (120-130 dB). The human ear is most sensitive to frequencies from the range of 1000 Hz to 3500 Hz, and human speech communication takes place between 200 and 8000 Hz [Kahrs and Brandenburg, 1998]. Psychoacoustics as a research field aims at the investigation of the dependencies between the sound wave entering the human ear and the subjective perception of this wave. The discipline deals with describing and explaining some specific aspects of auditory perception based on the anatomy and physiology of the human ear and on cognitive psychology. The results of these investigations are utilized when inventing new, modern compression algorithms [Skarbek, 1998].


VI.3. Audio-Related Static MD

As stated for the video-related static MD (section V.3), the MD are different for the various media types; thus also the static MD for audio are related to the MOs having the type audio, for which the set (StaticMD_Audio) is defined as:

\forall i:\; MD_A(mo_i) \subset StaticMD\_Audio \;\Leftrightarrow\; type_i = A, \qquad mo_i \in O,\; 1 \le i \le X \qquad (16)

where type_i denotes the type of the i-th media object mo_i, A is the audio type, MD_A is a function extracting the set of meta-data related to the audio media object mo_i, and X is the number of all media objects in O.

Analogously to video, the audio stream identifier (AudioID) is also a one-to-one mapping to the given MOID:

\forall i\ \exists j\ \neg\exists k:\; AudioID_i = MOID_j \;\wedge\; AudioID_i = MOID_k, \qquad k \ne j \qquad (17)

The static MD set of the audio stream includes the sums of each window type, analogously to the frame types of video:

\forall i:\; w_i.type \in WT \;\wedge\; WT = \{ST, SP, L, M, S\} \qquad (18)

where ST denotes the start window type, SP the stop window type, L the long, M the medium and S the short window type, and WT denotes the set of these window types.

The sum for each window type is defined as:

WindowSum_{mo_i}^{w_i.type} = \{\, w_j \mid w_j \in mo_i \;\wedge\; 1 \le j \le N \;\wedge\; w_j.type = w_i.type \;\wedge\; w_i.type \in WT \,\} \qquad (19)

where w_i.type is one of the window types defined by Equation (18), N denotes the number of all windows, w_j is the window at the j-th position in the audio stream, and w_j.type denotes the type of the j-th window.

The sums of all window types are kept in the respective attributes of the StaticMD_Audio. Analogously to the window type, the information about the sums of the different window switching modes along the complete audio stream is kept in StaticMD_SwitchingModes. Eleven types of window switching modes are differentiated up to now, thus there are eleven derived aggregation attributes.

The current definition of the initial static MD set is mapped to a relational schema (Figure 37).

Figure 37. Initial static MD set focusing on the audio data.

Analogously to the video static MD, the sums of all window types and window switching modes may be calculated by the respective SQL queries:

SELECT AudioID, WindowType, count(WindowType)
FROM StaticMD_Window
GROUP BY AudioID, WindowType
ORDER BY AudioID, WindowType;

Listing 4. Calculation of all sums according to the window type.


SELECT AudioID, WindowSwitchingMode, count(WindowSwitchingMode)
FROM StaticMD_Window
GROUP BY AudioID, WindowSwitchingMode
ORDER BY AudioID, WindowSwitchingMode;

Listing 5. Calculation of all sums according to the window switching mode.

Of course, the entity set of StaticMD_Window must be complete in the sense that all windows existing in the audio are included in this set. Based on this assumption, the sums included in StaticMD_Audio and StaticMD_WindowSwitchingSum are treated as derived attributes computed by the above SQL statements. However, due to optimization issues and the rare updates of the MD set, these values are materialized and included as usual attributes in StaticMD_Audio and StaticMD_WindowSwitchingSum. If the completeness assumption of the StaticMD_Window set were not fulfilled, the sum attributes could not be treated as derived.

VI.4. MPEG-4 SLS as Audio Internal Format

Following the system design in section IV, the internal storage format has to be chosen. The MPEG-4 SLS is proposed as the suitable format for storing the audio data internally in the RETAVIC architecture. The reasons have been discussed in detail in [Suchomski et al., 2006], where the format requirements have been stated and an evaluation has been conducted with respect to the qualitative aspects considering data scalability as well as the quantitative aspects referring to the processing behavior and resource consumption. Other research considering the MPEG-4 SLS algorithm and its evaluation is covered by the recent MPEG verification report [MPEG Audio Subgroup, 2005], but both works are complementary to each other due to their different assumptions, i.e. they discuss different configuration settings of the MPEG-4 SLS and evaluate the format from distinct perspectives. For example, [MPEG Audio Subgroup, 2005] compares the coding efficiency for only two settings, AAC-Core and No-Core, contrary to [Suchomski et al., 2006] where additionally various AAC cores have been used. Secondly, [Suchomski et al., 2006] provides an explicit comparison of coding efficiency and processing complexity (e.g. execution time) to other lossless formats, while [MPEG Audio Subgroup, 2005] reports details on SLS coding efficiency without such a comparison and a very detailed analysis of algorithmic complexity. Finally, the worst-case processing complexity usable for a hard-RT DSP implementation is given in [MPEG Audio Subgroup, 2005], whereas [Suchomski et al., 2006] discusses some scalability issues of the processing applicable in different RT models, e.g. cutting off the enhancement part of the bitstream during on-line processing with consideration of the data quality in the output80.

VI.4.1. MPEG-4 SLS Algorithm – Encoding and Decoding

The MPEG-4 Scalable Lossless Audio Coding (SLS) was designed as an extension of MPEG Advanced Audio Coding (AAC). Combined, these two technologies compose an innovative solution joining scalability and losslessness, which is commercially referred to as High Definition Advanced Audio Coding (HD-AAC) [Geiger et al., 2006] and is based on the standard AAC perceptual audio coder with an additional SLS enhancement layer. Together they ensure that even at lower bit rates a relatively good quality can be delivered. The SLS can also be used as a stand-alone application in a non-core mode thanks to its internal compression engine. The quality scalability varies from the AAC-coded representation through a wide range of in-between near-lossless levels up to the fully lossless representation. An overview of the simplified SLS encoding algorithm is depicted in Figure 38 for the two possible modes: a) with AAC core and b) as stand-alone SLS mode (without core).

Figure 38. Overview of simplified SLS encoding algorithm: a) with AAC-based core [Geiger et al., 2006] and b) without core.

80 The bitstream truncation itself is discussed in [MPEG Audio Subgroup, 2005], however from the format and not from the processing perspective.


VI.4.2. AAC Core

Originally, Advanced Audio Coding (AAC) was invented as an improved successor of the MPEG-1 Layer III (MP3) standard. It is specified in the MPEG-2 standard, Part 7 [MPEG-2 Part VII, 2006] and later in MPEG-4, Part 3 [MPEG-4 Part III, 2005], and can be described as a high-quality multi-channel coder. The psychoacoustic model used is the same as the one in the MPEG-1 Layer III model, but there are some significant differences in the algorithm. AAC did not have to be backward compatible with Layer I and Layer II of MPEG-1 as MP3 had to, so it can offer higher quality at lower bitrates. In addition to its efficiency at standard bitrates, AAC shows great encoding performance at very low bitrates. As an algorithm, it offers new functionalities like low delay, error robustness and scalability, which are introduced as AAC tools/modules and are used by specific MPEG audio profiles (a detailed description is given in section XXII of Appendix H).

The main improvements were made by adding the Long Term Predictor (LTP) [Sayood, 2006] to reduce the bit rate in the spectral coding block. AAC also supports more sample frequencies (from 8 kHz to 96 kHz) and introduces additional sampling half-step frequencies (16, 22.05, 24 kHz). Moreover, the LTP is computationally cheaper than its predecessor and is a forward adaptive predictor. The other improvement over MP3 is the increased number of channels: provision of multi-channel audio with up to 48 channels, but also support for the 5.1 surround sound mode. Generally, the coding and decoding efficiency has been improved, which results in smaller data sizes and better sound quality than MP3 when compared at the same bitrate. The filter bank converts the input audio signal from the time domain into the frequency domain by using the Modified Discrete Cosine Transform (MDCT) [Kahrs and Brandenburg, 1998]. The algorithm supports dynamic switching between two window lengths of 2048 samples and 256 samples. All windows are overlapped by 50% with the neighboring blocks. This results in generating 1024 or respectively 128 spectral coefficients from the samples. For better frequency selectivity, the encoder can switch between two different window shapes, a sine-shaped window and a Kaiser-Bessel Derived (KBD) window with improved far-off rejection of its filter response [Kahrs and Brandenburg, 1998].
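The coefficient counts given above follow from the MDCT with 50% overlap producing N/2 spectral coefficients per window of N samples:

\frac{N}{2}\Big|_{N=2048} = 1024, \qquad \frac{N}{2}\Big|_{N=256} = 128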


The standard contains two types of predictors. The first one, the intra-block predictor, also called Temporal Noise Shaping (TNS)81, is applied to input containing transients. The second one, the inter-block predictor, is useful only under stationary conditions. For stationary signals, a further reduction of the bit rate can be achieved by using prediction to reduce the dynamic range of the coefficients. During stationary periods, the coefficients at a certain frequency generally do not change their values to a significant degree between blocks. This fact allows transmitting only the prediction error between subsequent blocks (backward adaptive predictor).

The noiseless coding block uses twelve Huffman codebooks applied to two- and four-tuples of quantized coefficients. They are used to maximize the redundancy reduction within the spectral data coding. A codebook is applied to sections of scale-factor bands so as to minimize the resulting bitrate. Parts of the signal that are perceived as noise are in most cases indistinguishable from other noise-like signals. This fact is exploited by not transmitting such scale-factor bands themselves but only the power of the coefficients in the "noise" band. The decoder then generates random noise according to the transmitted power and places it in the region where the original noise was.

VI.4.3. HD-AAC / SLS Extension

The encoding process consists of two main parts (see Figure 38 a). At first the input is encoded by the AAC core coder and the core layer is produced. AAC is a lossy algorithm, so not all of the information is encoded. The preservation of all information is attained by the second step, which delivers the enhancement layer. This layer encodes the residual signal missing in the AAC-encoded signal, i.e. the error between the lossy and the source signal. Moreover, the enhancement layer allows scalability in terms of quality due to the bit-significance-based bit-plane coding, which preserves the spectral shape of the quantization noise used by the AAC core [Geiger et al., 2006].

81 TNS is a technique that operates on the spectral coefficients only during pre-echoes. It uses prediction in the frequency domain to temporally shape the introduced quantization noise. The spectrum is filtered, and the resulting prediction residual is quantized and coded. The prediction coefficients are transmitted in the bitstream to allow later recovery of the original signal in the decoder.


The detailed architecture of the HD-AAC coder is depicted as a block diagram in Figure 39 for a) encoding and b) decoding. HD-AAC exploits the Integer Modified Discrete Cosine Transform (IntMDCT) [Geiger et al., 2004], a completely invertible integer transform, to avoid additional errors while performing the transformations. During the encoding process the transform produces coefficients, which are later mapped into both layers. The residual signal introduced by the AAC coding is calculated by the error-mapping process and then mapped into bit-planes in the enhancement layer. The entropy coding of the bit-planes is done by three different methods: Bit-Plane Golomb Coding (BPGC), Context-Based Arithmetic Coding (CBAC) and Low Energy Mode (LEM).


Figure 39. Structure of HD-AAC coder [Geiger et al., 2006]: a) encoder and b) decoder.

The error correction is placed at the beginning of each sample window. The most significant bits (MSB) correspond to the coarser part of the error, while the least significant bits (LSB) refine it; obviously, the more bits of the residual are used, the smaller the loss in the output. The scalability is achieved by truncation of the correction value.

Core and enhancement layers are subsequently multiplexed into the HD-AAC bitstream. The decoding process can be done in two ways, by decoding only the perceptually-coded AAC part, or by using both AAC and the additional residual information from the enhancement layer.

VI.4.3.1 Integer Modified Discrete Cosine Transform

In all audio coding schemes, one of the most important steps is the transformation of the input audio signal from the time domain into the frequency domain. To achieve a block-wise frequency representation of an audio signal, Fourier-related transforms (DCT, MDCT) or filter banks are used. The problem is that these transforms produce floating-point output even for integer input. The data rate must then be reduced by quantizing the floating-point data, which means that some additional error is introduced. For lossless coding, any additional rounding error must be avoided either by using a very precise quantization, so that the error can be neglected, or by applying a different transform. The Integer Modified Discrete Cosine Transform is an invertible integer approximation of the MDCT [Geiger et al., 2004]. It retains the advantageous properties of the MDCT while producing integer output values: it provides a good spectral representation of the audio signal, critical sampling and overlapping of blocks.

The following part describes the IntMDCT in some detail and is based on [Geiger et al., 2001; Geiger et al., 2004]. The properties mentioned above are obtained by applying the lifting scheme to the MDCT. The lifting scheme decomposes the transform into a sequence of simple lifting steps; their most advantageous property is that the inverse transform is a mirror of the forward transform, so each step can be inverted exactly even when rounding is applied. The signal can thus be processed by simple operations and transformed back into one signal without any rounding errors.

The MDCT can be decomposed into Givens rotations, and such a decomposed transform can be approximated by a reversible, lossless integer transform. Figure 40 illustrates this decomposition. To achieve it, the MDCT must be divided into three parts: 1) windowing, 2) Time Domain Aliasing Cancellation (TDAC) and 3) a Discrete Cosine Transform of type IV. Of course, the TDAC is not used directly, but as an integer approximation derived from the decomposition into three lifting steps.

Figure 40. Decomposition of MDCT.

The MDCT is defined by [Geiger et al., 2001]:

X(m) = \sqrt{\tfrac{2}{N}} \sum_{k=0}^{2N-1} w(k)\,x(k)\,\cos\frac{(2k+1+N)(2m+1)\pi}{4N} \qquad (20)

where x(k) is the time-domain input, w(k) is the windowing function, N defines the number of calculated spectral lines, and m is an integer between 0 and N−1.

Figure 41. Overlapping of blocks.

To achieve both the critical sampling and the overlapping of blocks, frequency-domain subsampling is performed. This procedure introduces aliasing in the time domain, so TDAC is used to cancel the aliasing by the "overlap and add" formula applied to two subsequent blocks in the synthesis filter bank. Two succeeding blocks overlap by 50%, so two blocks are needed to obtain one full set of samples. A better preservation of the original data is achieved by this redundancy of information (see Figure 41). Each set of samples (marked with different colors) is first put into the right part of the corresponding block and into the left part of the succeeding block. For each block, 2N time-domain samples are used to calculate N spectral lines, and each block corresponds to one window. The MDCT introduces an aliasing error, which can be cancelled by adding the outputs of the inverse MDCT of two overlapping blocks (as depicted). The windows must fulfill certain conditions in their overlapping parts to ensure this cancellation [Geiger et al., 2001]:

w(k) = \sin\left(\frac{\pi}{4N}\,(2k+1)\right), \quad k = 0,\dots,2N-1 \qquad (21)

To decompose the MDCT (with a window length of 2N) into Givens rotations, it must first be decomposed into windowing, time domain aliasing and a Discrete Cosine Transform of Type IV (DCT-IV) of length N. The combination of windowing and TDAC for the overlapping part of two subsequent blocks results in the application of [Geiger et al., 2001]:

\begin{pmatrix} w(\tfrac{N}{2}+k) & -\,w(\tfrac{N}{2}-1-k) \\ w(\tfrac{N}{2}-1-k) & w(\tfrac{N}{2}+k) \end{pmatrix}, \quad k = 0,\dots,\tfrac{N}{2}-1 \qquad (22)

which is itself a Givens rotation.

The DCT-IV is defined as [Geiger et al., 2001]:

X(m) = \sqrt{\tfrac{2}{N}} \sum_{k=0}^{N-1} x(k)\,\cos\frac{(2k+1)(2m+1)\pi}{4N}, \quad m = 0,\dots,N-1 \qquad (23)

The DCT-IV coefficients build an orthogonal N×N matrix, which means that it can be decomposed into Givens rotations. The rotations applied for windowing and time domain aliasing can also be used in the inverse MDCT, in reversed order and with different angles. The inverse of the DCT-IV is the DCT-IV itself. Figure 42 depicts the decomposition of the MDCT and of the inverse MDCT, where 2N samples are first rotated and then transformed by the DCT-IV to obtain N spectral lines.


Figure 42. Decomposition of MDCT by Windowing, TDAC and DCT-IV [Geiger et al., 2001].

Given the conditions for the TDAC, it can be observed that for certain angles the MDCT can be completely decomposed into Givens rotations; a Givens rotation by an angle α is defined as [Geiger et al., 2001]:

\begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1\cos\alpha - x_2\sin\alpha \\ x_1\sin\alpha + x_2\cos\alpha \end{pmatrix} \qquad (24)

The input values x1 and x2 are rotated and thus transformed into x1·cosα − x2·sinα and x1·sinα + x2·cosα, respectively. Moreover, every Givens rotation can be divided into three lifting steps [Geiger et al., 2001]:

\begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix} = \begin{pmatrix} 1 & \frac{\cos\alpha-1}{\sin\alpha} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ \sin\alpha & 1 \end{pmatrix} \begin{pmatrix} 1 & \frac{\cos\alpha-1}{\sin\alpha} \\ 0 & 1 \end{pmatrix} \qquad (25)

Figure 43. Givens rotation and its decomposition into three lifting steps [Geiger et al., 2001].
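To make the role of the lifting decomposition concrete, the following minimal C sketch applies a Givens rotation through the three lifting steps of Equation (25) with rounding after each step, and then inverts it exactly by mirroring the steps. The function names and the choice of lround() are illustrative assumptions; this is not the IntMDCT reference implementation.

#include <math.h>
#include <stdio.h>

/* Rounded lifting steps of a Givens rotation (cf. Eq. (25) and Figure 43).
 * Each step adds a rounded multiple of one value to the other, so it can be
 * inverted exactly by subtracting the same rounded term again. */
static void int_givens_forward(double alpha, long *x1, long *x2)
{
    double c = (cos(alpha) - 1.0) / sin(alpha), s = sin(alpha);
    *x1 += lround(c * (double)(*x2));   /* lifting step 1 */
    *x2 += lround(s * (double)(*x1));   /* lifting step 2 */
    *x1 += lround(c * (double)(*x2));   /* lifting step 3 */
}

static void int_givens_inverse(double alpha, long *x1, long *x2)
{
    double c = (cos(alpha) - 1.0) / sin(alpha), s = sin(alpha);
    *x1 -= lround(c * (double)(*x2));   /* undo step 3 */
    *x2 -= lround(s * (double)(*x1));   /* undo step 2 */
    *x1 -= lround(c * (double)(*x2));   /* undo step 1 */
}

int main(void)
{
    long a = 12345, b = -6789;
    int_givens_forward(0.7, &a, &b);    /* integer approximation of the rotation  */
    int_givens_inverse(0.7, &a, &b);    /* exact reconstruction despite rounding  */
    printf("%ld %ld\n", a, b);          /* prints the original values: 12345 -6789 */
    return 0;
}

The same mirroring principle, applied to all rotations of the decomposed MDCT, is what makes the IntMDCT losslessly invertible.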


Givens rotations are mostly used to introduce zeros into matrices. This is done to simplify the calculations and thereby reduce the number of required operations and the overall complexity; Givens rotations are therefore often used to decompose matrices or to annihilate selected components.

VI.4.3.2 Error mapping

The error mapping process provides the link between the perceptual AAC core and the scalable lossless enhancement layer of the coder. Instead of encoding all IntMDCT coefficients in the lossless enhancement layer, the information already coded in AAC is used. Only the resulting residuals between the IntMDCT spectral values and their equivalents in the AAC layer are coded.

The residual signal e[k] is given by [Geiger et al., 2006]:

e[k] = \begin{cases} c[k] - thr(i[k]), & i[k] \neq 0 \\ c[k], & i[k] = 0 \end{cases} \qquad (26)

where c[k] is an IntMDCT coefficient, i[k] is the AAC quantized value, and thr(i[k]) is the next quantized value closer to zero with respect to i[k].

If the coefficients belong to a scale-factor band that was quantized and coded in the AAC core, the residual coefficients are obtained by subtracting the quantization thresholds from the IntMDCT coefficients. If the coefficients do not belong to a coded band, or are quantized to zero, the residual spectrum is composed of the original IntMDCT values. This process improves the coding efficiency without losing information.
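As a minimal illustration of Formula (26), the following C sketch computes the residual spectrum from the IntMDCT coefficients and the AAC quantization indices; the threshold function is passed in as a callback because its definition belongs to the AAC core and is outside this sketch. All names are hypothetical.

#include <stddef.h>

/* Error mapping (Formula (26)): e[k] = c[k] - thr(i[k]) if i[k] != 0,
 * and e[k] = c[k] otherwise. c[] are IntMDCT coefficients, i[] are the AAC
 * quantization indices, and thr() returns the next quantized value closer
 * to zero for a given index. */
void error_mapping(const long *c, const int *i, long *e, size_t n,
                   long (*thr)(int))
{
    for (size_t k = 0; k < n; k++)
        e[k] = (i[k] != 0) ? c[k] - thr(i[k]) : c[k];
}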

VI.5. Audio Processing Supporting Real-time Execution

VI.5.1. MD-based Decoding

In analogy to LLV1, the SLS decoding was designed with the goal of scalability; contrary to LLV1, however, it already offers the possibility to adapt the processing to real-time constraints. Thus, in order to support MD-based decoding, the existing SLS does not have to be extended: real-time adaptive decoding can be supported through the possibility of skipping some parts of the extension stream, i.e. the residuals of a given window of samples or even of a set of windows of samples82. Of course, the extension stream stores not the samples themselves but the compressed residual signal used for the error correction.

Such meta-data allow stopping the decoding in the middle of the currently processed window and continuing the enhancement processing at the subsequent window (or at one of the next windows) whenever an unpredictable peak in processing occurs. In other words, this process may be understood as truncation of the bitstream on the fly during a high-load period, and not before the start of the decoder execution (as it is in the standard SLS right now83).

The MD required for this functionality of the existing best-effort SLS is already covered by the analogous extension proposal for LLV1 (as given in section V.5.1), owing to the possibility of storing the size occupied by each data window of the enhancement layer in the compressed domain at the beginning of that window. So, the real-time SLS decoder only has to be able to consume this additional information as continuous MD.

Of course, the original best-effort SLS decoder has no need for such MD, because the stream is read and decoded as specified by the input parameters (i.e. if truncated, then beforehand). The discontinuation of processing of one or a few data windows in the compressed domain is only required under strict real-time constraints and not in a best-effort system: if the execution time is limited (e.g. the deadline is approaching) and a peak in the processing occurs (higher coding complexity than predicted), the decoding of the enhancement layer should be terminated and started again with the next or a later window. Finding the position of the next or a later data window in the compressed stream is not problematic anymore, since the beginning of the next data window is delivered by the mentioned MD. Additionally, the window positions could be stored in an index, which would allow an even faster localization of the next window.

82 In order to skip a few data windows, the information about the size of each window has to be read sequentially, and an fseek operation has to be executed in steps in order to read the size stored at the beginning of each window. 83 There is an additional application, BstTruncation, used for truncating the enhancement layer in the coded domain.
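A possible shape of the MD-based skipping in a real-time SLS decoder is sketched below in C: each enhancement-layer window is assumed to be preceded by its compressed size (the continuous MD proposed above), so that under deadline pressure the decoder can seek over one or more windows. The stream layout, field width and function names are assumptions made only for this illustration.

#include <stdio.h>
#include <stdint.h>

/* Assumed layout of the enhancement layer with continuous MD:
 *   [uint32_t window_size_in_bytes][window_size_in_bytes of coded residual]
 * repeated for every window of samples. */

/* Decode one enhancement window if there is still time, otherwise skip it
 * and fall back to core-only quality for this window. Returns 0 at EOF. */
static int process_enhancement_window(FILE *stream, int deadline_endangered)
{
    uint32_t size;
    if (fread(&size, sizeof size, 1, stream) != 1)
        return 0;                              /* end of the enhancement stream */

    if (deadline_endangered) {
        fseek(stream, (long)size, SEEK_CUR);   /* drop this window's residual   */
        return 1;
    }
    /* ... read 'size' bytes and run the bit-plane decoding here ... */
    fseek(stream, (long)size, SEEK_CUR);       /* placeholder for real decoding */
    return 1;
}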


VI.5.2. MD-based Encoding

To understand the proposed MD-based extensions for audio coding, the general perceptual coding algorithm together with the reference MPEG-4 audio coding standard is explained first. Then the concrete continuous MD are discussed with reference to audio coding.

VI.5.2.1 MPEG-4 standard as representative

Generalization of Perceptual Audio Coding Algorithms

The simplified algorithm of perceptual audio coding is depicted in Figure 44. The input sound in digital form, which is usually Pulse Code Modulation (PCM) audio in 16-bit words, is first divided in the time domain into windows, usually of constant duration (not depicted). Each window corresponds to a specific number of samples, e.g. one window of a voice/speech signal lasts approx. 20 ms, which corresponds to 160 samples at a sampling rate of 8 kHz (8000 samples/s × 0.02 s) [Skarbek, 1998].


Figure 44. General perceptual coding algorithm [Kahrs and Brandenburg, 1998]: a) encoder and b) decoder.

Then the windows of samples are transformed from the time domain to the frequency domain by the analysis filter bank (analogous to the transform step in video), which decomposes the input signal into its sub-sampled spectral components – frequency sub-bands – forming a time-indexed series of coefficients. The analysis filter bank applies the Modified Discrete Cosine Transform (MDCT) or the discrete Fast Fourier Transform (FFT) [Kahrs and Brandenburg, 1998]. The windows can be overlapping or non-overlapping. During this process also a set of parameters is extracted, which gives information about the distribution of the signal, the masking power over the time-frequency plane, and the signal mapping used to shape the coding distortion. All these assist in the perceptual analysis, in perceptual noise shaping and in the reduction of redundancies, and are later needed for quantization and encoding.

The perceptual model, also called psychoacoustic model, is used to simulate the ability of the human auditory system to perceive different frequencies. Additionally, it models the masking effect of loud tones, which mask quieter tones (frequency masking and temporal masking) as well as the quantization noise around their frequency. The masking effect depends on the frequency and the amplitude of the tones. The psychoacoustic model analyzes the parameters from the filter banks to calculate the masking thresholds needed in quantization. After the masking thresholds are calculated, the bit allocation is assigned to the signal representation in each frequency sub-band for the specified bitrate of the data stream.

Analogically to the quantization and coding steps in video, the frequency coefficients are next quantized and coded. The purpose of the quantization is to implement the psychoacoustic threshold while maintaining the required bit rate. The quantization can be uniform (equally distributed) or non-uniform (with a varying quantization step). The coding uses scale factors to determine noise-shaping factors for each frequency sub-band: the spectral coefficients are scaled before quantization in order to influence the scale of the quantization noise, and are grouped into bands corresponding to different frequency ranges. The scale-factor values are found from the perceptual model by using two nested iteration loops, and are themselves coded by Huffman coding (only the difference between the values of subsequent bands is coded). Finally, the windows of samples are packed into the bitstream according to the required bitstream syntax.
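The two nested iteration loops mentioned above are commonly realized as an inner rate-control loop and an outer distortion-control loop. The following C sketch only illustrates this structure under assumed helper functions (count_bits, band_noise); it is not the ISO reference implementation, and the abort conditions of a real encoder are omitted.

/* Assumed helpers (not defined here): the coded size of the window for given
 * scale factors and global gain, and the quantization noise energy of one
 * scale-factor band. */
extern int    count_bits(const double *coef, int bands,
                         const int *scf, int global_gain);
extern double band_noise(const double *coef, int band,
                         const int *scf, int global_gain);

/* Two-loop search (sketch): the inner loop coarsens the global quantizer step
 * until the bit budget is met; the outer loop amplifies bands whose noise
 * still exceeds the masking threshold delivered by the perceptual model. */
void search_scale_factors(const double *coef, int bands,
                          const double *mask_thr, int bit_budget,
                          int *scf, int *global_gain)
{
    int changed = 1;
    while (changed) {
        *global_gain = 0;                         /* inner (rate) loop */
        while (count_bits(coef, bands, scf, *global_gain) > bit_budget)
            (*global_gain)++;

        changed = 0;                              /* outer (distortion) loop */
        for (int b = 0; b < bands; b++) {
            if (band_noise(coef, b, scf, *global_gain) > mask_thr[b]) {
                scf[b]++;                         /* finer quantization here */
                changed = 1;
            }
        }
        /* a real encoder adds abort conditions here (all bands amplified,
         * maximum scale-factor difference reached, no further improvement) */
    }
}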

MPEG-1 Layer 3 and MPEG-2/4 AAC

MPEG-1 Layer 3 (shortly MP3) [MPEG-1 Part III, 1993] and MPEG-2 AAC extended by MPEG-4 AAC (shortly AAC) [MPEG-4 Part III, 2005] are the most widespread and most often used coding algorithms for natural audio, thus this work discusses these two coding algorithms from the encoder perspective. The MP3 encoding block diagram is depicted in Figure 45, the AAC one in Figure 46.


Figure 45. MPEG Layer 3 encoding algorithm [Kahrs and Brandenburg, 1998].

Figure 46. AAC encoding algorithm [Kahrs and Brandenburg, 1998].

There are a few noticeable differences between the MP3 and AAC encoding algorithms [Kahrs and Brandenburg, 1998]. MP3 uses a hybrid analysis filter bank (Figure 44) consisting of two cascaded modules, a Polyphase Quadrature Filter (PQF)84 and the MDCT (Figure 45), while AAC uses only a switched MDCT. Secondly, MP3 uses a window size of 1152 (or 576) values, whereas the length of the AAC window equals 102485. Thirdly, MPEG-4 AAC uses additional tools such as Temporal Noise Shaping (the intra-block predictor), the Long-Term Predictor (not depicted), Intensity (Stereo) / Coupling, (inter-block) Prediction and Mid/Side Processing.

The window-type detection together with the window/block switching, and the determination of the M/S stereo IntMDCT, are the most time-consuming blocks of the encoding algorithm.

VI.5.2.2 Continuous MD set for audio coding

Having explained the audio coding, the definition of the continuous meta-data set can be formulated. There are some differences between the coding algorithms; however, there are also common aspects. Based on these similarities, the following elements of the continuous MD set for audio are defined: window type, window switching mode, and M/S flag.

Window type. It is analogous to the frame type in video, however here the type is derived not from the processing method but primarily from the audio content. The window type is decided upon the computed MDCT coefficients and their energy; an example algorithm for determining the window type is given in [Youn, 2008]. Depending on the current signal, the following window types are defined so far: start, stop, long, medium and short. These can be mapped to the standard AAC window types. In contrast, MPEG-4 SLS allows for more window types, but they can still be classified as sub-groups of the mentioned groups and refined on the fly during execution (which is not as expensive as the complete calculation).
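A very simple, purely illustrative window-type decision is sketched below in C: it compares the energy of the current window with that of the previous one and chooses short windows when a transient (attack) is detected, while the transition (start/stop) windows are left to the window-switching logic discussed next. It uses time-domain energy and an arbitrary threshold, so it is only a stand-in for the MDCT-energy-based algorithm of [Youn, 2008].

enum window_type { WT_LONG, WT_SHORT };

/* Hypothetical energy-based decision between long and short windows: a sharp
 * rise of the window energy relative to the previous window indicates a
 * transient, for which short windows limit pre-echoes. The factor 8.0 is an
 * arbitrary illustrative threshold. */
enum window_type decide_window_type(const double *samples, int n,
                                    double *prev_energy)
{
    double energy = 0.0;
    for (int k = 0; k < n; k++)
        energy += samples[k] * samples[k];

    int attack = (*prev_energy > 0.0) && (energy > 8.0 * (*prev_energy));
    *prev_energy = energy;
    return attack ? WT_SHORT : WT_LONG;
}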

Window switching mode. In the case of different window types following one another, window switching is required. It can be conducted only when the window types of the current and the next frame are known. If the window type is already delivered by the continuous MD, the window switching mode can be calculated; however, even though enhancements have been proposed (e.g. [Lee et al., 2005]), this still requires additional processing power. Thus the window switching mode can also be included in the continuous MD. It is proposed to include the same eleven modes as defined by MPEG-486 in the continuous MD.

84 PQF splits a signal into N = 2^x equidistant sub-bands, allowing an increase of the redundancy removal. There are 32 sub-bands (x = 5) for MP3, with a frequency range between 0 and 16 kHz. However, the PQF introduces aliasing for certain frequencies, which has to be removed by a filter overlapping the neighboring sub-bands (e.g. by an MDCT having such a characteristic). 85 There are a few possible window lengths in MPEG-4 AAC [MPEG-4 Part III, 2005]. Generally only two are used: the short window with 128 samples and the long window with 1024. The short windows are grouped by 8, so the total length is then 1024, which is why usually the window size of 1024 is mentioned. The other possible windows are: medium – 512, long-short – 960, medium-short – 480, short-short – 120, long-ssr – 256, short-ssr – 32.

Mid/Side flag. It holds the information on whether M/S processing may be used. There are three modes of using M/S processing:

• 0 – the M/S processing is always turned off, regardless of the incoming signal characteristics
• 1 – the M/S processing is always turned on, so the left and right channels are always mapped to the M/S signal, which may be inefficient in case of big differences between the channels
• 2 – the M/S processing decides dynamically whether each channel signal is processed separately or in the (M/S) combination.

Obviously, the last option is the default for encoders supporting the M/S functionality. However, checking each window separately is a CPU-consuming operation.
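Taken together, the three elements could be carried as one small record per window of samples, for example as in the following C sketch; the field names and widths are illustrative and do not define a normative bitstream layout.

#include <stdint.h>

/* Hypothetical per-window record of the continuous MD set for audio. */
enum ms_mode { MS_OFF = 0, MS_ON = 1, MS_DYNAMIC = 2 };

struct audio_window_md {
    uint8_t window_type;      /* coarse class: start, stop, long, medium, short  */
    uint8_t switching_mode;   /* one of the eleven MPEG-4 window switching modes */
    uint8_t ms_flag;          /* enum ms_mode: 0 = off, 1 = on, 2 = dynamic      */
};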

There is also an additional opportunity to use meta-data in a continuous manner such that real-time encoding could possibly be conducted faster: Dynamic Range Control (DRC) [Cutmore, 1998; MPEG-4 Part III, 2005]. DRC is a process manipulating the dynamic range of an audio signal, thus (automatically) altering the volume of the audio, such that the gain (level) is reduced when the wave's amplitude crosses a certain threshold (deriving directly from the limitations of the hardware or of the coding algorithm). If the dynamics of the audio are analyzed before it is coded, it is possible to exploit an additional "helper" signal along with the digital audio which "predicts" the gain that will shortly be required [Cutmore, 1998]. This allows modification of the dynamic range of the reproduced audio in a smooth manner. The support for DRC has already been included in MPEG-4 AAC [MPEG-4 Part III, 2005]; however, the DRC meta-data is produced on the encoder side and consumed on the decoder side, and there is no possibility to exploit DRC MD on the encoder side. Eliminating this drawback would allow using the DRC information as a hint for the scale-factor calculation87.

86 Types supported by MPEG-4: STATIC_LONG, STATIC_MEDIUM, STATIC_SHORT, LS_STARTSTOP_SEQUENCE, LM_STARTSTOP_SEQUENCE, MS_STARTSTOP_SEQUENCE, LONG_SHORT_SEQUENCE, LONG_MEDIUM_SEQUENCE, MEDIUM_SHORT_SEQUENCE, LONG_MEDIUM_SHORT_SEQUENCE, FFT_PE_WINDOW_SWITCHING.

VI.6. Evaluation of the Audio Processing Model through Best-Effort Prototypes

VI.6.1. MPEG-4 SLS Scalability in Data Quality

The scalability of MPEG-4 SLS with respect to the quality of the input data has been measured (shown in Figure 47). The SLS enhancement bitstream for different bitrates of the core layer was truncated with the BstTruncation tool to different sizes and then decoded. The resulting audio data have then been compared to the reference files using a PEAQ implementation that yields the Objective Difference Grade (ODG) [PEAQ, 2006]; the closer the ODG value of the perceived sound-quality difference is to zero, the better the quality. Besides the standard SLS core option, SLS was also tested in the non-core mode, referred to as SLS no Core, and for the pure core (i.e. AAC), denoted by SLS Core only (Figure 47). The test set included both the MPEG-based SQAM subset [WWW_MPEG SQAM, 2006] and a private music set [WWW_Retavic - Audio Set, 2006].

Overall, the scalability of SLS is very efficient with respect to the quality gained per added enhancement bits. The biggest gain in ODG is achieved when scaling towards bitrates around 128 kbps. High-bitrate AAC cores affect the sound quality positively until about 256 kbps, with the disadvantage of no scalability below the bitrate of the AAC core. In the area between near-lossless and lossless bitrates the ODG converges towards zero. SLS achieves the lossless state at rates of about 600-700 kbps, which is not possible for lossy AAC. The pure basic AAC stream (SLS Core only) starts to fall behind the scalable-to-lossless SLS from about 160 kbps upwards.

87 The MPEG-4 AAC uses a Spectral Line Vector which is calculated based on the Core Scaling factor and the Integer Spectral Line Vector (IntSLV). The Core Scaling factor is defined as a constant for four possible cases: MDCT_SCALING = 45.254834, MDCT_SCALING_MS = 32, MDCT_SCALING_SHORT = 16, and MDCT_SCALING_SHORT_MS = 11.3137085. The IntSLV is calculated by the MDCT function.

[Plot: ODG (0 to −3.5) versus bitrate of core + enhancement layer (64–384 kbps) for SLS with 64 kbps core, SLS with 128 kbps core, SLS Core only, and SLS no Core.]

Figure 47. Gain of ODG with scalability [Suchomski et al., 2006].

Concluding, SLS provides very good scalability with respect to quality versus occupied size, and at the same time it can still be competitive with other, non-scalable lossless codecs.

VI.6.2. MPEG-4 SLS Processing Scalability

In order to investigate the processing scalability, the decoding speed was compared to the bitrate of the input SLS bitstream. Figure 48 shows the SLS decoding speed with respect to enhancement streams truncated at different bit rates, analogical to the data-quality scalability test.

Obviously, the truncated enhancement streams are decoded faster, as the amount of data and the number of bit planes decrease due to the truncation. Furthermore, the non-core SLS streams are decoded faster than normal SLS, but in both cases the increase in speed is very small – about a factor of 2 from the minimum to the maximum enhancement bitrate (whereas the expected behavior would show much faster decoding for a small amount of data, i.e. the curve should decline more steeply). However, the situation changes dramatically when the FAAD2 AAC decoder [WWW_FAAD, 2006] is used for decoding the AAC core and the enhancement layer is dropped completely: in that case the decoding is over 100 times faster than real-time. But even then, as soon as the decoding of enhancement layers takes place, the decoding speed drops below 10 times faster than real-time.

[Plot: decoding speed (× real-time, 1.0–1000.0, logarithmic) versus bitrate (64 kbps to max) for SLS no core, SLS with 64 kbps core, and SLS 64 kbps core only.]

Figure 48. Decoding speed of SLS version of SQAM with truncated enhancement stream [Suchomski et al., 2006].

Overall, real processing scalability, as would be expected, is not given with MPEG-4 SLS. From this point of view it can be reduced to only two steps of scalability – using the enhancement layer or not – so the scalability of SLS in terms of processing speed cannot compete with its scalability in terms of sound quality. This is caused by the inverse integer MDCT (InvIntMDCT) taking the largest part of the overall processing time: even though it is almost constant and does not depend on the source audio signal, it consumes between 50% and 70% of the decoding time depending on the source audio. A better and faster implementation has been proposed [Wendelska, 2007] and is described in section XXIII, MPEG-4 SLS Enhancements, of Appendix H. It can clearly be seen (Figure 110) that the enhancements in the implementation have brought benefits such as a smaller decoding time.


VII. REAL-TIME PROCESSING MODEL

The time aspect of the data in continuous media such as audio and video influences the prediction, scheduling and execution methods. If real-time processing is considered as a base for the format-independence provision in an MMDBMS, the converters participating in the media transformation process should be treated not as separate entities but as parts of the whole execution. Of course, a converter first of all must be able to run in the real-time environment (RTE); as such, it must be able to execute in the RTOS and to react to the IPC used for controlling real-time processes specific to the given RTOS. Secondly, there exist data dependencies in the processing, which have not been considered by the previously discussed models [Hamann et al., 2001a; Hamann et al., 2001b]. So, at first the modeling of continuous multimedia transformations with the aspects of data dependency is discussed. Then, the real-time issues are discussed in the context of multimedia processing. Finally, the design of real-time media converters is proposed.

VII.1. Modeling of Continuous Multimedia Transformation

VII.1.1. Converter, Conversion Chains and Conversion Graphs

A simple black-box view of the converter [Schmidt et al., 2003] considers as visible: the data source, the data destination, the resource utilization and the processing function, which consumes the resources. The conversion chain described in [Schmidt et al., 2003] assumed that, in order to handle all kinds of conversion, a directed acyclic graph is required (conversion graph), i.e. split and re-join operations are needed in order to support multiplexed multimedia streams; however, the split and re-join operations are treated in [Schmidt et al., 2003] as separate operations lying outside the defined conversion chain. On the other hand, it was stated that, due to the interaction between only two consecutive converters in most cases, only conversion chains shall be investigated. Those two assumptions somewhat contradict each other. If a consistent, sound, and complete conversion model were provided, then the split and re-join operations would be treated as converters as well, and thus the whole conversion graph should be considered.


[Diagram: converter black box with m data inputs, n data outputs, a processing function, and its resource utilisation.]

Figure 49. Converter model – a black-box representation of the converter (based on [Schmidt et al., 2003; Suchomski et al., 2004]).

Accordingly, the conversion model was extended in [Suchomski et al., 2004] such that the incoming data may be represented not by only one input but by many inputs, and the outgoing data not by one output but by many outputs, i.e. the input/output of the converter may consume/provide one or more media data streams, as depicted in Figure 49. Moreover, it is assumed that the converter may convert from m media streams to n media streams, where m∈N, n∈N, and m does not have to be equal to n (but it may happen that m=n). Due to this extension, the simple media converter could be extended to a multimedia converter, and the view on the conversion model could be broadened from the conversion chain to conversion graphs [Suchomski et al., 2004]. Moreover, the model presented in [Suchomski et al., 2004] is a generalization of the multimedia transformation based on the previous research (mentioned in the Related Work in section II.4).
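The extended black-box view can be pictured in code roughly as follows; the structure is only a sketch of Figure 49 with invented names, not a part of the RETAVIC implementation.

#include <stddef.h>

typedef struct stream_buffer stream_buffer;   /* connection between converters */

typedef struct resource_usage {
    double cpu_time_per_quant_ms;   /* expected processing time per quant */
    size_t memory_bytes;            /* working memory of the converter    */
} resource_usage;

typedef struct converter {
    stream_buffer **in;     /* m data inputs;  m == 0 marks a source (e.g. file reading) */
    size_t          m;
    stream_buffer **out;    /* n data outputs; n == 0 marks a sink (e.g. screen)         */
    size_t          n;
    resource_usage  usage;  /* resource utilization of the processing function           */
    int (*process)(struct converter *self);   /* consumes inputs, produces outputs       */
} converter;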

The special case of source and sink, represented as converters on the borders of the conversion graph, should still be mentioned. These two types of converters have, within the model, only the output or only the input, respectively. The source (e.g. file reading) delivers the data for the consecutive converter, while the sink (e.g. screen) only consumes the data from the previous converter. The file itself or the file system (its structure, access method, etc.), as well as the network or the graphics adapter, should not be present in the conversion graph as fully-fledged converters; they are considered representatives of the outside environment having just one type of data interconnection and are mapped to the converter sink or source, respectively. Of course, the conversion graph is not limited to just one source and one sink (there may be many of them), but there must always be at least one instance of a source and one instance of a sink present in the conversion graph. The converter graph is depicted in Figure 50, where (a) presents the meta-model of the converter graph, (b) shows a model of the converter graph derived from the meta-model definition, and (c) depicts three examples being instances of the converter graph model, i.e. the particular converters are selected such that the converter graph is functionally correct and executable (explained later in section VII.2.5.2, Scheduling of the conversion graph).

[Diagram: a) meta-model – a CONVERTER plays Producer and Consumer roles linked at runtime by M:N CONNECTIONs, with the constraints that a SOURCE may not have any Consumer role and a SINK may not have any Producer role; b) model – converter graphs and converter chains built from a converter (source), intermediate converters and a converter (sink) connected through buffers; c) instance examples – disk → demux mpeg-sys → mpeg-video to DivX (V) and mpeg-audio to wma (A) → mux asf → network; selective reading → LLV1 decoding → XVID encoding → network; DVB-T live multiplex stream → demux mpeg-sys → internet radio (A) and archive.]

Figure 50. Converter graph: a) meta-model, b) model and c) instance examples.


VII.1.2. Buffers in the Multimedia Conversion Process

Within the discussed conversion-graph model the data inputs and data outputs are logically joined into connections, which are mapped onto buffers between converters. This mapping has to be designed carefully because of the key problem of data-transmission costs between multimedia converters. Especially if the converters operate on uncompressed multimedia data, the cost of copying data between buffers may heavily influence the efficiency of the system.

Hence, the assumption of having zero-copy operations provided by the operating system should be made. A zero-copy operation can be provided, for example, by buffers shared between processes, called a global buffer cache [Miller et al., 1998] – in other words, by using inter-process shared or mapped memory for input/output operations through pass-by-reference transfers88. Yet another example of a zero-copy operation, which can deliver efficient data transfers, is based on page remapping, "moving" data residing in the main memory across the protection boundaries and thus avoiding copy operations (e.g. IO-Lite [Pai et al., 1997]). In contrast, message passing, where the message is formed out of the function invocation, a signal and data packets, is not a zero-copy operation, due to the fact that data are copied from/to local memory into/from the packets transmitted through the communication channel [McQuillan and Walden, 1975]89.
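As an illustration of the shared-memory variant of zero-copy transfer, the following C sketch maps one POSIX shared-memory object into two converter processes, so that a quant written by the producer is read by the consumer from the very same pages and only a reference (offset) has to be exchanged. Error handling and synchronization are omitted, the object name is arbitrary, and an RTOS used for a real-time prototype would offer an analogous primitive of its own.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a named shared-memory object of the given size; the producer calls it
 * with create != 0, the consumer with create == 0. Both obtain a pointer to
 * the same physical pages, so handing over a quant requires no data copy. */
static void *map_shared_buffer(const char *name, size_t size, int create)
{
    int fd = shm_open(name, create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (create)
        ftruncate(fd, (off_t)size);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                         /* the mapping remains valid after close */
    return p;
}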

Jitter-constrained periodic stream

Obviously, the size of the buffer depends on the size of the media quanta and on the client's QoS requirements [Schmidt et al., 2003], but the buffer size must still somehow be calculated. It can be done, for example, by the statistical approach using jitter-constrained periodic streams (JCPS) proposed by [Hamann et al., 2001b]. A JCPS is defined as two streams, time and size [Hamann et al., 2001b]:

88 Another example of zero-copy operation is well-known direct memory access (DMA) for hard disks, graphical adapters or network cards used by the OS drivers. 89 There may be a message passing implemented with zero-copy method, however, it is commonly acknowledged that message passing introduces some overhead and is less efficient than shared memory on local processor(s) (e.g. on SMPs) [LeBlanc and Markatos, 1992]. This however, may not be the case on distributed shared memory (DSM) multiprocessor architectures [Luo, 1997]. Nevertheless, there is some research done in the direction of providing the software-based distributed shared memory e.g. over virtual interface architecture (VIA) for Linux-based clusters, which seems to be simpler in handling and still competitive solution [Rangarajan and Iftode, 2000], or using openMosix architecture allowing for multi-threaded process migration [Maya et al., 2003].


Time stream: JCPS_t = (T, D, \tau, t_0) \qquad (27)

where T is the average distance between events, i.e. the length of the period (T > 0), D is the minimum distance between events (0 ≤ D ≤ T), τ is the maximum lateness, i.e. the maximum deviation from the beginning of the period over all periods, and t_0 is the starting time point (t_0 ∈ ℝ);

Size stream: JCPS_s = (S, M, \sigma, s_0) \qquad (28)

where S is the average quant size (S > 0), M is the minimum quant size (0 ≤ M ≤ S), σ is the maximum deviation from the accumulated quantum size, and s_0 is the initial value (s_0 ∈ ℝ) [Hamann et al., 2001b].

Leading time and buffer size calculations

According to the JCPS specification it is possible to calculate the leading time of the producer P with respect to the consumer C and the minimum size of the buffer as follows [Hamann et al., 2001b]:

t_{lead} = \frac{\sigma_C}{R} + \tau_P = \frac{T_P \cdot \sigma_C}{S_P} + \tau_P \qquad (29)

B_{min} = \left\lceil (\tau_C + \tau_P) \cdot R + \sigma_P + \sigma_C - s_0 \right\rceil = \left\lceil (\tau_C + \tau_P) \cdot \frac{S_P}{T_P} + \sigma_P + \sigma_C - s_0 \right\rceil \qquad (30)

where R is the ratio of size to time (i.e. the bit rate) of both communicating JCPS streams. Of course, there is only one P and one C for each buffer.
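A small C sketch of the JCPS bookkeeping may help to see how Formulas (29) and (30), as well as the latency Formula (38) introduced later in section VII.1.5, are used; the structure and function names are invented for illustration.

#include <math.h>

/* Jitter-constrained periodic stream, Formulas (27) and (28):
 * time stream (T, D, tau, t0) and size stream (S, M, sigma, s0). */
typedef struct { double T, D, tau, t0; } jcps_time;
typedef struct { double S, M, sigma, s0; } jcps_size;

/* Leading time of producer P w.r.t. consumer C, Formula (29):
 * t_lead = sigma_C / R + tau_P with R = S_P / T_P. */
static double jcps_lead_time(jcps_time pt, jcps_size ps, jcps_size cs)
{
    double R = ps.S / pt.T;
    return cs.sigma / R + pt.tau;
}

/* Minimum buffer size, Formula (30). */
static double jcps_min_buffer(jcps_time pt, jcps_size ps,
                              jcps_time ct, jcps_size cs, double s0)
{
    double R = ps.S / pt.T;
    return ceil((ct.tau + pt.tau) * R + ps.sigma + cs.sigma - s0);
}

/* Start-up latency of a buffer of size B, Formula (38): L = B / S * T. */
static double jcps_latency(double B, jcps_size s, jcps_time t)
{
    return B / s.S * t.T;
}

For the Mobile (CIF) example discussed in section VII.1.4 (P_t = C_t = (40, 40, 0, 0) and P_s = C_s = (1201.5, 1201.5, 0, 0)), these functions return t_lead = 0 and B_min = 0 – exactly the values that the data-dependency discussion below shows to be misleading.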

M:N data stream conversion

However, the JCPS considers only conversion chains based on a simple I/O converter, i.e. a converter accepting one input and one output (such as given in [Schmidt et al., 2003]); multiple streams and stream operations are not supported. The remedy, which allows supporting additional stream operations such as join, split or multiplex, could be delivered in two ways: 1) as an extension of the model such that the condition for buffer allocation supports multiple producers and multiple consumers, or 2) by changing the core assumption, i.e. substituting the used converter model by the one given in [Suchomski et al., 2004] (depicted in Figure 49). In the first case, the assumptions for the producer and for the consumer have to be extended as:

Pt = Pt1 + Pt2 + … + Ptn where Ptn = (TPn, DPn, τPn, t0Pn) (31)

Ps = Ps1 + Ps2 + … + Psn where Psn = (SPn, MPn, σPn, s0Pn) (32)

Ct = Ct1 + Ct2 + … + Ctn where Ctn = (TCn, DCn, τCn, t0Cn) (33)

Cs = Cs1 + Cs2 + … + Csn where Csn = (SCn, MCn, σCn, s0Cn) (34)

where the "+" operator for the time streams symbolizes a function combining the time event streams into one, such that all events are included and the length of the period (T), the minimum event distance (D), the maximum lateness (τ) and the starting point (t0) are correctly calculated, and the "+" operator for the size streams represents the addition of the average sizes (S) together with the calculation of the minimum quantum size (M), the maximum accumulated deviation of the quantum size (σ), and the initial value (s0). However, the definition of these operators is expected to be very complex and has not been given yet. The other method, in contrast, where the new converter model is assumed, does not require changes in the JCPS buffer model, i.e. the outputs of a converter are treated logically as separate producers delivering to separate consumers formed by the inputs of the subsequent converter(s). The drawback is obvious: the converter itself must cope with the synchronization of each input buffer (acting as consumer) and each output buffer (acting as producer).

VII.1.3. Data Dependency in the Converter

As proved in previous work [Liu, 2003] and briefly pointed out within this work, the behavior of a converter depends highly on the processed data. Even the application of the MD-based transcoding approach does not eliminate but only reduces the variations of the processing-time requirements, making a precise definition of the relationship between data and processing impossible, i.e. an exact prediction of the execution time is still hardly feasible. On the other hand, it reduces the required computations appreciably, e.g. the speed-up for video ranges between 1.32 and 1.47. The data influence on the processing has been investigated to some extent and is presented later in this work in the section Real-time Issues in Context of Multimedia Processing (subsection VII.2.1).


VII.1.4. Data Processing Dependency in the Conversion Graph

There are also data dependencies within the conversion graph. Let us assume that there are only two converters: a producer (P) and a consumer (C). P delivers quanta synchronously – 25 decoded frames per second, without delay and with a constant size derived from the color scheme and resolution, or from the number of MBs (each MB consists of 6 blocks of 64 8-bit values) – written to the buffer for the sequence Mobile (CIF), and C consumes exactly the way P produces; in other words, both have no delay in delivery and intake. According to the JCPS model they can be represented as:

Pt = (TP, DP, τP, t0P) where TP = 40 [ms], DP = 40 [ms], τP = 0, t0P = 0, i.e. Pt = (40, 40, 0, 0)

Ps = (SP, MP, σP, s0P) where SP = 1201.5 [kb], MP = 1201.5 [kb], σP = 0, s0P = 0, i.e. Ps = (1201.5, 1201.5, 0, 0)

and analogically Ct = (40, 40, 0, 0) and Cs = (1201.5, 1201.5, 0, 0).

The lead time and the minimal buffer can then be calculated according to Formulas (29) and (30) for this simple theoretical case:

t_lead = 0.0 [s] and B_min = 0 [b]

Even though P and C fulfill the requirement of a constant rate, i.e. R = Ss/Ts = St/Tt, the above result would lead to an error in reality, because there is a hidden data dependency between successive converters90, which is not represented within the model but is very important for any type of scheduling of the conversion process. Namely, C can start consumption only when P has completed the first quant, i.e. when the buffer is filled with the amount of bits C needs. Moreover, considering the times at which subsequent quanta are consumed by C, all of them must have been produced beforehand by P. Thus, if scheduling is considered, P should start at least one full quant (frame/window of samples) before C can work, i.e. t_lead must equal 40 ms. Moreover, if the buffer is used as the transport medium, it should at least allow storing a full frame, which means it should be at least 1201.5 [kb]. Finally, if P and C are modeled as stated above, both must occupy the processor exclusively due to the full use of the processing time (in both cases 40 [ms] × 25 [fps] = 1 [s]), which means that they run either on two processors or on two systems.

90 According to the JCPS model, the contents of the quanta are irrelevant here, i.e. the converter is not dependent on the data.

The data dependency in the conversion graph exists regardless of the scheduling model and the system, i.e. it does not matter whether there is one processor or more. In the case of one processor, the scheduling must create an execution sequence of converters such that the producer occupies the processor, always produces the quant required by the consumer in advance, and shares the time with the consumer. In the case of parallel processing, the consumer on the next processor can start only when the producer on the previous one has delivered the data to the buffer. This is analogous to instruction pipelining, but applied on the thread level instead of the instruction level.

VII.1.5. Problem with JCPS in Graph Scheduling

If the transcoding graph is considered from the user perspective, the output of the last converter in the chain must account for the execution time of all previous converters and must allow for synchronized data delivery. Moreover, if processing on one CPU is assumed, the total time consumed by all converters for a given quant should not exceed the period size of the output stream, because otherwise the rule of a constant quant rate is broken. For example, if the Mobile (CIF) sequence is delivered with 25 fps, the period size equals 40 ms, so the sum of the execution times of all converters on one processor should not exceed this value; otherwise, real-time execution is not possible. Hence, the following inequality must hold for real-time delivery on one CPU:

T_{OUT} \geq \sum_{i=1}^{n} T_i \qquad (35)

where T_OUT is the period size of the output JCPS stream requested by the user, T_i is the JCPS period size for the specific element in the conversion graph and n is the number of converters.


Moreover, the leading times (t_lead) of all converters should be summed up, and analogically the buffer sizes should be considered, as follows:

t_{lead_{OUT}} \geq \sum_{i=1}^{n} t_{lead_i} \qquad (36)

B_{min_{OUT}} \geq \sum_{i=1}^{n} B_{min_i} \qquad (37)

These requirements hold especially for infinite streams. Even though introducing a buffer allows for some deviations in execution time, the extra time consumed by converters that took longer must be balanced by those that took less time, so that the average still fulfills the requirement. On the other hand, a respective buffer size allows for some variation in the processing time, but at the same time introduces start-up latency in the delivery process: the bigger the buffer, the bigger the latency. The start-up latency can be calculated for each conversion-graph element by:

L = \frac{B}{S} \cdot T \qquad (38)

where L is the latency derived from the given buffer and, following the definitions given previously, B is the buffer size, S is the average quantum size and T is the length of the period.

The latency of the complete converter graph is calculated as the sum of the latencies of all converters in the conversion graph:

L_{OUT} \geq \sum_{i=1}^{n} L_i \qquad (39)

An example of a simple transcoding chain is depicted in Figure 51. Here the LLV1 BL+TEL bit stream is read from the disk, put into a buffer, decoded by the LLV1 decoder, put into a buffer, encoded by XVID, put into a buffer, and finally the encoded MPEG-4 bit stream is stored on the disk. The transcoding chain has been executed sequentially on a best-effort OS in exclusive execution mode (occupying 100% of the CPU) in order to allow modeling with JCPS by measuring the time spent by each element of the chain for all frames: at first the selected data have been read from the storage and buffered completely for all frames, then the first converter A (LLV1 decoding) processed the data and put the decoded frames into the next buffer, next the second converter B (XVID encoding) compressed the data and put them into the third buffer, and finally the write-to-disk operation has been executed.

Figure 51. Simple transcoding used for measuring times and data amounts on best-effort OS with exclusive execution mode.

[Charts: a) per-frame transcoding time in ms for frames 1–15, split by chain element; b) each element's share of the total transcoding time – Converter B Encode 61.4%, Converter A Decode 27.9%, Source Read 4.3%, Sink Store 3.5%, and the buffers 2.2%, 0.7% and 0.0%.]

Figure 52. Execution time of simple transcoding: a) per frame for each chain element and b) per chain element for total transcoding time.

The measured execution times together with the amount of produced data are listed for the first 15 quanta of each participating chain element in Table 5 and depicted in Figure 52. The time required for reading (from the buffer) by a given converter can be neglected in the context of the converter's time due to the fast memory access employing the caching mechanism; thus it is measured together with the converter's time – only in the case of the source does the measured time represent reading from the disk. Secondly, the quant size read by the next converter is equal to the one stored in the buffer. Thus the consumer part of each converter is hidden (in order to avoid repetitions in the table), and the data in the buffers are called Data In, which reflects their consumer characteristic, in contrast to Data Out at the source, the converters and the sink (being producers). As can be noticed, the complete chain is evaluated and the execution and buffering times are calculated. The execution time is the simple addition of the time values for source, converters and sink, while the buffering time is the sum of the writing-to-buffer times. Finally, the waiting time required for synchronization is calculated – the synchronization is assumed to be done with respect to the user specification, i.e. the requested 25 fps of the output stream, which is given by the JCPS time stream JCPSt = (40, 40, 0, 0).

Finally, the JCPS is calculated for each element of the hypothetical conversion chain. The time and size streams are both given in Table 6 (analogically to Table 5). The time JCPSs are also calculated for the execution, buffering and wait values. It can easily be noticed that the addition of the respective elements of these three time JCPSs will not give the time JCPS specified by the user. Only the period size (T) can be added; the other attributes (D, τ, t0) have to be calculated in a different, yet unknown way (as mentioned in section VII.1.2 and denoted by the symbol "+").

Summarizing, JCPS alone is not sufficient for scheduling converter graphs, because the influence of the data on the behavior of the converter graph is not considered in the JCPS model, as mentioned above. Thus additional MD are required, which might be analogous to trace information. Trace data are defined as statistical data coming from the analysis of the execution recorded for each quant; they are the most suitable for predicting repetitive executions but are rather expensive. Moreover, JCPS, due to its unawareness of the media data, is not as good as a model based on trace information.

Hence, one goal could be to minimize the trace information and encapsulate it in the MD set. Another possible solution is to define the MD set separately in such a way that a schedule of the processing analogous to the trace can be calculated.

Moreover, the JCPS should be extended by the target frequency of the processed data, because specifying only the period sizes, as originally proposed, will not provide the truly expected synchronous output. Hence, the following definition for the time JCPS is given:

Time stream: JCPS_t = (T, D, \tau, t_0, F) \qquad (40)

where the additional parameter F represents the target frequency of the events delivering the quanta.


Sequence: MOBILE_CIF (356x288, 25 fps); period (1/fps): 40 ms; number of frames: 15. All times in [ms], all data amounts in [kb]. The last three columns summarize the complete chain (execution time, buffering time, synchronization/wait time).

Quant No | Source Read: Time | Data Out | Buffer: Time | Data In | Conv. A Decode: Time | Data Out | Buffer: Time | Data In | Conv. B Encode: Time | Data Out | Buffer: Time | Data In | Sink Store: Time | Data In | Exec Time | Buffer Time | Sync (Wait) Time
1 | 2.23 | 44.6 | 0.36 | 44.6 | 8.03 | 1201.5 | 0.67 | 1201.5 | 17.67 | 35.6 | 0.02 | 35.6 | 1.8 | 35.6 | 29.7 | 1.0 | 9.2
2 | 0.74 | 14.9 | 0.12 | 14.9 | 8.32 | 1201.5 | 0.67 | 1201.5 | 18.31 | 11.9 | 0.01 | 11.9 | 0.6 | 11.9 | 28.0 | 0.8 | 11.2
3 | 1.49 | 29.7 | 0.24 | 29.7 | 9.07 | 1201.5 | 0.67 | 1201.5 | 19.96 | 23.8 | 0.01 | 23.8 | 1.2 | 23.8 | 31.7 | 0.9 | 7.4
4 | 0.76 | 15.1 | 0.12 | 15.1 | 8.34 | 1201.5 | 0.67 | 1201.5 | 18.35 | 12.1 | 0.01 | 12.1 | 0.6 | 12.1 | 28.1 | 0.8 | 11.1
5 | 1.34 | 26.7 | 0.21 | 26.7 | 9.01 | 1201.5 | 0.67 | 1201.5 | 19.83 | 21.4 | 0.01 | 21.4 | 1.1 | 21.4 | 31.2 | 0.9 | 7.9
6 | 0.83 | 16.5 | 0.13 | 16.5 | 8.21 | 1201.5 | 0.67 | 1201.5 | 18.07 | 13.2 | 0.01 | 13.2 | 0.7 | 13.2 | 27.8 | 0.8 | 11.4
7 | 2.08 | 41.6 | 0.33 | 41.6 | 9.03 | 1201.5 | 0.67 | 1201.5 | 19.86 | 33.3 | 0.02 | 33.3 | 1.7 | 33.3 | 32.6 | 1.0 | 6.4
8 | 0.89 | 17.8 | 0.14 | 17.8 | 8.18 | 1201.5 | 0.67 | 1201.5 | 17.99 | 14.3 | 0.01 | 14.3 | 0.7 | 14.3 | 27.8 | 0.8 | 11.4
9 | 1.63 | 32.7 | 0.26 | 32.7 | 9.01 | 1201.5 | 0.67 | 1201.5 | 19.81 | 26.1 | 0.01 | 26.1 | 1.3 | 26.1 | 31.8 | 0.9 | 7.3
10 | 0.74 | 14.9 | 0.12 | 14.9 | 8.23 | 1201.5 | 0.67 | 1201.5 | 18.10 | 11.9 | 0.01 | 11.9 | 0.6 | 11.9 | 27.7 | 0.8 | 11.6
11 | 1.19 | 23.8 | 0.19 | 23.8 | 9.05 | 1201.5 | 0.67 | 1201.5 | 19.91 | 19.0 | 0.01 | 19.0 | 1.0 | 19.0 | 31.1 | 0.9 | 8.0
12 | 0.89 | 17.8 | 0.14 | 17.8 | 8.09 | 1201.5 | 0.67 | 1201.5 | 17.79 | 14.3 | 0.01 | 14.3 | 0.7 | 14.3 | 27.5 | 0.8 | 11.7
13 | 2.52 | 50.5 | 0.40 | 50.5 | 9.01 | 1201.5 | 0.67 | 1201.5 | 19.81 | 40.4 | 0.02 | 40.4 | 2.0 | 40.4 | 33.4 | 1.1 | 5.5
14 | 0.80 | 15.9 | 0.13 | 15.9 | 8.21 | 1201.5 | 0.67 | 1201.5 | 18.07 | 12.8 | 0.01 | 12.8 | 0.6 | 12.8 | 27.7 | 0.8 | 11.5
15 | 1.95 | 39.0 | 0.31 | 39.0 | 9.03 | 1201.5 | 0.67 | 1201.5 | 19.86 | 31.2 | 0.02 | 31.2 | 1.6 | 31.2 | 32.4 | 1.0 | 6.6
Total | 20.1 | 401.3 | 3.2 | 401.3 | 128.8 | 18022.5 | 10.0 | 18022.5 | 283.4 | 321.1 | 0.2 | 321.1 | 16.1 | 321.1 | 448.3 | 13.4 | 138.3

Table 5. Processing time consumed and amount of data produced by the example transcoding chain for Mobile (CIF) video sequence.

For each chain element the time JCPS (JCPSt) and, where applicable, the size JCPS (JCPSs) are given; the rows correspond to the JCPS parameters T/S, D/M, τ/σ and t0/s0.

Parameter | Source JCPSt | Source JCPSs | Buffer JCPSt | Buffer JCPSs | A JCPSt | A JCPSs | Buffer JCPSt | Buffer JCPSs | B JCPSt | B JCPSs | Buffer JCPSt | Buffer JCPSs | Sink JCPSt | Sink JCPSs | Exec JCPSt | Buffer JCPSt | Wait JCPSt
T / S | 1.3 | 26.8 | 0.2 | 26.8 | 8.6 | 1201.5 | 0.7 | 1201.5 | 18.9 | 21.4 | 0.0 | 21.4 | 1.1 | 21.4 | 29.9 | 0.9 | 9.2
D / M | 0.7 | 14.9 | 0.1 | 14.9 | 8.0 | 1201.5 | 0.7 | 1201.5 | 17.7 | 11.9 | 0.0 | 11.9 | 0.6 | 11.9 | 27.5 | 0.8 | 5.5
τ / σ | 0.9 | 17.8 | 0.1 | 17.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 14.2 | 0.0 | 14.2 | 0.7 | 14.2 | 0.0 | 0.2 | 4.0
t0 / s0 | -1.3 | -25.1 | -0.2 | -25.1 | -0.8 | 0.0 | 0.0 | 0.0 | -1.8 | -20.1 | 0.0 | -20.1 | -1.0 | -20.1 | -3.8 | -0.2 | 0.0

Table 6. The JCPS calculated for the respective elements of the conversion graph from the Table 5.


VII.1.6. Operations on Media Streams

Media integration (multiplexing)

The multiplexing (also called merging or muxing) of media occurs when two or more conversion chains are joined by one converter, i.e. when the converter accepts two or more inputs. In such a case, a synchronization problem occurs (see later). A typical example is multiplexing audio and video into one synchronous transport stream.

Media demuxing

Demuxing (or demultiplexing) is the inverse of the muxing operation. It allows separating each media-specific stream from the interleaved multimedia stream. An important element of demuxing is the assignment of time stamps to the media quanta in order to allow synchronization of the media (see later). A typical example is decoding from a multiplexed stream in order to display video and audio together.

Media replication

Replication is a simple copy of the (multi)media stream such that the input is mapped to multiple (at least two) outputs by copying the exact content of the stream. No other special functionality is required here.

Both demuxing and replication are considered one-input/many-outputs converters according to the converter model defined previously (depicted in Figure 49), and muxing, respectively, is considered a many-inputs/one-output converter.

VII.1.7. Media data synchronization

The synchronization problem can be solved by using the digital phase-lock loop (DPLL)91 in two ways: 1) employing the buffer-fullness flag and 2) using time stamps [Sun et al., 2005] (Chapter VI). The second technique has the advantage of allowing asynchronous execution of the producer and the consumer. These two techniques are usually applied in the network area between the transmitter and the receiving terminals, however they are not limited to it. Thus, a global timer representing the time clock of the media stream, and time stamps assigned to the processed quanta, should be applied in the conversion graph (in analogy to the solution described for MPEG in [Sun et al., 2005]92). These elements are required, however they are not sufficient within the conversion graph. Additionally, the target frequencies of all media must be considered to define the synchronization points, which can be calculated integer-based using the least common multiple (LCM) and the greatest common divisor (GCD)93 – otherwise the synchronization may include some minor rounding error. If multiple streams with only a minor difference in quant frequency are considered (e.g. 44.1 kHz vs. 48 kHz, or 25 vs. 24.99 fps), it may be desirable to synchronize more often than the integer-based GCD approach allows, and then the avoidance of rounding errors is not possible.

91 A DPLL is an apparatus/method for generating a digital clock signal which is frequency- and phase-referenced to an external digital data signal. The external digital data signal is typically subject to variations in data frequency and to high-frequency jitter unrelated to changes in the data frequency. A DPLL may consist of a serial shift register receiving digital input samples, a stable local clock signal supplying clock pulses that drive the shift register, and a phase-corrector circuit adjusting the phase of the regenerated clock to match the received signal.

In order to explain this more clearly, let us assume that there are two streams, video and audio. The video should be delivered with a frame rate of 10 fps, i.e. a target frequency of 10 Hz. The audio should be delivered with a sampling rate of 11.025 kHz (standard phone quality). However, the stored source data have a QoD higher than the requested QoD, achieved by a higher frame rate (25 fps) and a higher sampling rate (44.1 kHz). The source data are synchronized at every frame and at every 1764th sample, i.e. GCD(25, 44100)=25, 25/25=1 and 44100/25=1764. In contrast, the produced data can be integer-synchronized at every 2nd frame and every 2205th sample, i.e. GCD(10, 11025)=5, 10/5=2 and 11025/5=2205. If fractional synchronization were used such that the video is synchronized at every frame, the audio would have to be synchronized exactly in the middle between samples 1102 and 1103 (2205/2=1102.5).

For the other examples mentioned previously, GCD(44100, 48000) equals 300, meaning synchronization at every 147th sample of the 44.1 kHz stream and every 160th sample of the 48 kHz stream, and GCD(2499, 2500) equals 1 (with the frame rates scaled to hundredths), meaning synchronization only at every 2499th frame of the 24.99 fps stream and every 2500th frame of the 25 fps stream, i.e. once per 100 seconds.
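The integer-based calculation of the synchronization points described above can be expressed compactly. The following minimal C++ sketch (not part of the original design; the function name is illustrative) computes how many quanta of each stream lie between two consecutive synchronization points for given integer quant frequencies:

```cpp
#include <cstdio>
#include <numeric>   // std::gcd (C++17)

// For two streams with integer quant frequencies fA and fB (quanta per second),
// the streams align every 1/gcd(fA, fB) seconds, i.e. every fA/gcd quanta of the
// first stream and every fB/gcd quanta of the second one.
static void sync_interval(long fA, long fB) {
    const long g = std::gcd(fA, fB);
    std::printf("%ld Hz / %ld Hz: sync every %ld / %ld quanta (every %.4f s)\n",
                fA, fB, fA / g, fB / g, 1.0 / g);
}

int main() {
    sync_interval(25, 44100);    // source video/audio of the example above
    sync_interval(10, 11025);    // requested video/audio of the example above
    sync_interval(44100, 48000); // two audio streams with close sampling rates
    return 0;
}
```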

92 MPEG-2 allows for two types of time stamps (TS): decoding (DTS) and presentation (PTS). Both are based on the system time clock (STC) of the encoder, which is analogous to the global timer. The STC of MPEG-2 has a constant frequency of 27 MHz and is also represented in the stream by program clock references (PCR) or system clock references (SCR).
93 In mathematics, LCM is also known as the smallest common multiple (SCM), and GCD is also called the greatest common factor (GCF) or highest common factor (HCF).


VII.2. Real-time Issues in Context of Multimedia Processing

VII.2.1. Remarks on JCPS – Data Influence on the Converter Description

JCPS is proposed to be used for modeling the converters applicable in timed-media processing. However, it was found empirically that for different media data (two different video or audio streams), the same converter requires the specification of different values according to the JCPS definition. Moreover, the difference exists in both the time and the size specifications of JCPS (Table 7).

               JCPSt (time)                  JCPSs (size)
               T      D     τ      t0        S       M       σ        s0
source         133    133   0      0         67584   67584   0        0
mobile_cif     136    90    1686   50        54147   39344   799983   48273
tempete_cif    134    87    1637   47        35852   24418   501602   34227

Table 7. JCPS time and size for the LLV1 encoder.

To illustrate the difference between the calculation backgrounds of the above JCPSs, two graphs are presented: the cumulated time in Figure 53 and the cumulated size in Figure 54. The values are cumulated over the frame number for the first 20 frames.

[Figure: cumulated time in seconds over the first 20 frames for the source, mobile_cif and tempete_cif curves.]

Figure 53. Cumulated time of source period and real processing time.


[Figure: cumulated size in kB over the first 20 frames for the source, mobile_cif and tempete_cif curves.]

Figure 54. Cumulated size of source and encoded data.

The Tempete and Mobile sequences used for this comparison have the same resolution and frame rate, and thus the same constant data rate (depicted as source). However, the contents of these sequences are different. Please note that JCPS has been designed under the assumption of no data dependency.

As can be noticed, the JCPSs given for the LLV1 encoder processing video sequences with the same constant data rate and time requirements but with different contents (Tempete vs. Mobile) differ in both size and time. If there were really no data dependency, the JCPS for both time and size should be the same for one converter. This is not the case, and especially the difference in size is noticeable (Figure 54).

VII.2.2. Hard Real-time Adaptive Model of Media Converters

Due to these problems with a pure application of JCPS in the conversion graph, another solution has been investigated. It is based on imprecise computations and a hard-real-time assumption, and is therefore called within this work the hard real-time adaptive (HRTA) converter model. As mentioned in the Related Work (subsection III.2.2.2), imprecise computation allows for scalability in the processing time by delivering less or more accurate results. If it is applied such that the minimum time for calculating the lowest quality acceptable (LQA) [ITU-T Rec. X.642, 1998] is guaranteed, some interesting applications to multimedia processing emerge.


For example, frame skips need not occur at all in the imprecise computation model: when decoding/encoding is designed according to the model, a minimal quality of each frame can be computed at lower cost and then enhanced according to the available resources, while this minimal quality is guaranteed in 100% of the cases. Moreover, such a model does not require converter-specific time buffers (besides those used for the reference quanta, which is the case in all processing models) and does not introduce an additional initial processing delay beside the one required by the data dependency in the conversion graph (i.e. beside the graph-specific time buffer, which is present anyway in all processing models). As a result, the model allows for average-case allocation with LQA guarantees and no frame drops, thus achieving a higher perceived quality of the decoded video, which is an advantage over the all-or-nothing model, where a deadline miss causes the frame to be skipped.

The hard real-time adaptive model of media converters uses quality-assuring scheduling (QAS), but the two are not the same. The difference is that the model proposes how to structure a media converter if imprecise computations are to be included, whereas QAS is a tool for mapping from the converter model to the system implementation.

The HRTA converter model is defined as:

C_HRTA = (C_M, C_O, C_D)    (41)

where C_HRTA denotes a converter supporting adaptivity and hard real-time constraints, C_M defines the mandatory part of the processing, C_O specifies the optional part of the algorithm, and C_D is a delivery part providing the results to the next converter (also called the clean-up step).

In order to provide LQA, C_M and C_D are always executed, whereas C_O is executed only when idle resources are available. C_M is responsible for processing the media data and coding them according to the LQA definition. C_O enhances the results, either by repeating the calculation with higher precision on the same data or by operating on additional data and then improving the results or calculating additional results. C_D decides which results to deliver: if no C_O has been executed, the output of C_M is selected; otherwise the output of C_O is integrated with the output of C_M; finally, the results are provided as the output of the converter.
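The C_M/C_O/C_D split can be illustrated with a minimal C++ sketch. The class and member names below are hypothetical and only mirror the decomposition defined above; they do not reproduce the RETAVIC implementation:

```cpp
#include <vector>

// Hypothetical media quantum exchanged between converters.
struct Quant { std::vector<unsigned char> payload; bool enhanced = false; };

// Sketch of an HRTA-compliant converter: the mandatory part (C_M) always
// produces the lowest quality acceptable (LQA), the optional part (C_O) runs
// only if idle resources are left, and the delivery part (C_D) selects and
// hands over the result.
class HrtaConverter {
public:
    Quant process(const Quant& in, bool idle_resources_available) {
        Quant base = mandatory(in);            // C_M: guaranteed LQA result
        if (idle_resources_available) {
            enhance(in, base);                 // C_O: optional refinement
            base.enhanced = true;
        }
        return deliver(base);                  // C_D: integrate and hand over
    }

protected:
    virtual Quant mandatory(const Quant& in) { return in; }
    virtual void  enhance(const Quant& in, Quant& partial) { (void)in; (void)partial; }
    virtual Quant deliver(Quant result) { return result; }
};
```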


The hard real-time adaptive model of a media converter is flexible and still supports the quant-based all-or-nothing method, i.e. it can be applied at different levels of processing quality, such as dropping information at different granularity. A quant, regardless of whether it is an audio sample or a video frame, can be dropped completely in such a way that C_M is empty—no processing is defined—, C_O does all the processing, and C_D simply delivers the output of C_O (if it completed its execution) or reports the quant drop (in which case no LQA is provided). However, the advantage of the model would be wasted in such a case, because its biggest benefit is the possibility to influence the conversion algorithm during the processing of a quant and to keep the partial results, thus raising the final quality.

For example, a video frame may be coded only partially by including only a subset of macro blocks, while the remaining set of MBs is dropped. In other words, the data dropping is conducted on the macro-block level. This obviously raises the quality, because instead of having 0% of the frame (frame skipped) and losing the spent resources, the result includes something between 0% and 100% and no resources are wasted.

VII.2.3. Dresden Real-time Operating System as RTE for RETAVIC

The requirement of real-time transformation and QoS control enforces embedding the conversion graph in the real-time environment (RTE). The reliable RTE can be provided by the real-time operating system, because then reservations of processing time and of resources can be guaranteed. A suitable system is the Dresden Real-Time Operating System (DROPS), which aims at supporting the real-time applications with different timing modes. Moreover, it supports both timesharing (i.e. best-effort) and real-time threads running on the same system where the timesharing threads do not influence the real-time threads [Härtig et al., 1998], and are allowed to be executed only on unreserved (idle) resources.

Architecture
DROPS is a modular system (Figure 55) based on the Fiasco microkernel, which offers event-triggered or time-triggered executions. Fiasco is an L4-based microkernel providing fine-grained as well as coarse-grained timer modes for its scheduling algorithm. The fine-grained timer mode (called one-shot mode) has a precision of 1 µs and is thus able to produce a timer interrupt for every event at the exact time it occurs. In contrast, the coarse-grained timer mode


generates interrupts periodically and is the default mode. The interrupt period has a granularity of roughly 1 ms (976 µs), which might yield unacceptable overhead for applications demanding small periods. On the other hand, the fine-grained timer may introduce additional switching overhead. Three types of clock are possible:

• RTC – the real-time clock generates timer interrupts on IRQ8 and is the default mode
• PIT – generates timer interrupts analogously to RTC but works on IRQ0; it is advised for use with VMware machines and does not work with profiling
• APIC – the most advanced mode using the APIC timer, where the next timer interrupt is computed each time for the scheduling context

The RTC and PIT allow only for coarse-grained timers, and only the APIC mode allows for both timer modes. Thus if the application requires precise scheduling, the APIC mode with one-shot option has to be chosen.

Figure 55. DROPS Architecture [Reuther et al., 2006].

Fiasco uses non-blocking synchronization, ensuring that higher-priority threads do not block waiting on lower-priority threads or the kernel (avoiding priority inversion), and supports static priorities allowing fully preemptible execution [Hohmuth and Härtig, 2001]. In comparison to RTLinux, another real-time operating system, Fiasco can deliver a smaller response time on interrupts for event-triggered executions, such that the maximum response time may be guaranteed [Mehnert et al., 2003]. Additionally, a real-time application may be executed at a given scheduled point of time (time-triggered), where DROPS grants the resources to the given thread, controls the state of the execution and interacts with the thread according to the scheduled time point, thus ensuring QoS control and accomplishment of the task.

Scheduling
The quality assuring scheduling (QAS) [Hamann et al., 2001a] is adopted as the scheduling algorithm for DROPS. The implementation of QAS works such that the preemptive tasks are ordered according to priorities, and from all threads ready to run in the given scheduling period, the thread with the highest priority is always executed. If a thread with a higher priority than the current one becomes ready, i.e. the next period for the given real-time thread starts, the current thread is preempted and the new thread is assigned to the processor. Among threads with the same priority in the same period, a simple round-robin scheduling algorithm is applied. Thus, a time quantum and a priority must be assigned to each thread in order to control the scheduling algorithm on the application level; if one or both of them are not assigned, default values are used. Moreover, a thread with default values is treated as a time-sharing (i.e. non-real-time) thread with the lowest priority and no time constraints.

Real-time thread model
A periodic real-time thread in the system is characterized by its period and arbitrarily many timeslices. The timeslices can be classified into mandatory timeslices, which always have to be executed, and optional timeslices, which improve the quality but can be skipped if necessary. Any combination of mandatory and optional timeslices is imaginable, as the type of a timeslice is only subject to the programmer of the thread on the application level, not to the kernel. Each timeslice has two required properties: length and priority. The intended summed length of the timeslices, together with the length of the period, makes up the thread's reserved context, as shown in Figure 56. In other words, the reserved context has to define the deadline (i.e. the end of the period) and the intended end of each timeslice (if there are three timeslices, then three timeslice ends are defined for each period).


Figure 56. Reserved context and real events for one periodic thread.

Nevertheless, the work required within one timeslice might consume more time than estimated, so that the scheduled thread does not finish its work by the given end of the timeslice, i.e. the timeslice exceeds the reserved time. The kernel is able to recognize such an event and reacts by sending a timeslice overrun (TSO) IPC message connected with the thread's timeslice to the preempter thread (Figure 57), which runs in parallel to the working thread of the application at the highest priority. Similarly, the kernel is able to recognize a deadline miss (DLM) and to send the respective IPC call. A deadline miss happens to a thread when the end of the period is reached and the thread has not yet signaled to the kernel that it is waiting for the end of the period. Normally, the kernel does not communicate directly with the working thread.

The scheduling context 0 (SC0) is meant as the time-sharing part and is always included in any application. A real-time application additionally requires at least one scheduling context (numbered from 1 to n). SC0 is then used as the empty (idle) time-sharing part allowing the thread to wait for the next period. The kernel does not distinguish between mandatory and optional timeslices, but only between scheduling contexts, so it is the responsibility of the application developer to handle each scheduling context according to the timeslice's type by defining its reserved context through a mapping from the timeslice type to the scheduling context. Consequently, SC1 is understood as the mandatory part with the highest priority, SC2…SCn are meant for the optional parts, each with a lower priority than the previous one, and SC0 is the time-sharing part with the lowest priority.
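A possible mapping from the HRTA parts to the scheduling contexts described above can be sketched as follows. The reservation call is a hypothetical placeholder standing in for the corresponding DROPS/L4 primitive and is not the actual kernel API; only the SC1/SC2/SC0 assignment is illustrated:

```cpp
#include <cstdint>

// Hypothetical description of one reserved timeslice: length plus priority,
// mirroring the thread model described above.
struct Timeslice { std::uint64_t length_us; unsigned priority; };

// SC1 carries the mandatory part with the highest priority, SC2 an optional
// part with a lower priority; SC0 is the unreserved time-sharing remainder.
struct HrtaReservation {
    std::uint64_t period_us;
    Timeslice mandatory;   // mapped to scheduling context SC1 (C_M and C_D)
    Timeslice optional;    // mapped to scheduling context SC2 (C_O)
};

// Stub standing in for the real reservation primitive offered by the RTOS.
static bool reserve_context(unsigned /*sc*/, const Timeslice&, std::uint64_t /*period*/) {
    return true;
}

inline bool setup_hrta_thread(const HrtaReservation& r) {
    return reserve_context(1, r.mandatory, r.period_us)   // SC1: mandatory
        && reserve_context(2, r.optional,  r.period_us);  // SC2: optional
}
```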

Figure 57. Communication in DROPS between kernel and application threads (incl. scheduling context).

VII.2.4. DROPS Streaming Interface

The DROPS streaming interface (DSI) has been designed to support the transfer of large amounts of timed data, which is typical for multimedia applications. In particular, it provides controlling facilities enabling real-time applications and efficient data transfer. The DSI implements a packet-oriented zero-copy transfer where the traffic model is based on JCPS. It allows for sophisticated QoS control by supporting non-blocking transport, synchronization, time-limited validity, data (packet) dropping, and resynchronization. Other RTOSes like QNX [QNX, 2001] and RTLinux [Barabanov and Yodaiken, 1996] do not provide such a concept for data streaming.


[Figure: DSI application model — a control application 1) initiates the stream, 2) assigns the stream (shared data area and shared control area) to the producer and the consumer, which then 3) perform signaling and data transfer directly.]

Figure 58. DSI application model (based on [Reuther et al., 2006]).

The DSI application model is depicted in Figure 58 in order to explain how the DSI is used for real-time interprocess communication [Löser et al., 2001b]. Three types of DROPS servers holding socket references are involved: the control application, the producer, and the consumer. The zero-copy data transfer is provided through shared memory areas (called a stream), which are mapped into the servers participating in the data exchange—one as producer and the other as consumer. A control application sets up the connection by creating the stream in its own context and then assigning it to the producer and the consumer through socket references. The control application initiates but is not involved in the data transfer—the producer and the consumer are responsible for arranging the data exchange independently of the control application through signaling employing the ordered control data area in the form of a ring buffer with a limited number of packet descriptors [Löser et al., 2001b]. The control data kept in a packet descriptor, such as the packet position (packet sequence number), the timestamp, and pointers to the start and end positions of the timed data, allow arranging the packets arbitrarily in the shared data area [Löser et al., 2001a].
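The three-step setup of Figure 58 can be outlined in a short C++ sketch; the handle types and function names are hypothetical placeholders and do not reproduce the real DSI interface, only the control flow:

```cpp
#include <cstddef>

// Hypothetical handles; the real DSI works with socket references and a
// shared control/data area created by the control application.
struct StreamHandle { int id; };
struct SocketRef    { int id; };

// Stubs standing in for the DSI calls (illustrative only).
static StreamHandle create_stream(std::size_t /*data_bytes*/, std::size_t /*descriptors*/) {
    return StreamHandle{1};
}
static SocketRef assign_producer(StreamHandle s) { return SocketRef{s.id * 10 + 1}; }
static SocketRef assign_consumer(StreamHandle s) { return SocketRef{s.id * 10 + 2}; }

// 1) initiate the stream, 2) assign it to producer and consumer; afterwards
// 3) both exchange packets directly over the shared areas (zero copy) and
// signal each other without the control application.
void setup_connection() {
    StreamHandle stream = create_stream(4 * 1024 * 1024, 64);
    SocketRef producer  = assign_producer(stream);
    SocketRef consumer  = assign_consumer(stream);
    (void)producer; (void)consumer;  // handed over to the converter servers
}
```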

The communication can be done in blocking or non-blocking mode [Löser et al., 2001b]. The non-blocking mode must rely on correct scheduling, the virtual-time mechanism, and polling or signaling in order to avoid data loss, but the communication avoids the IPC overhead of the un-/blocking mechanism and is therefore a few times faster than the blocking method. On the other hand, the blocking mode handles situations in which the required data is not yet available (empty buffer) or the processed data cannot be sent (full buffer). Additionally, the DSI supports handling data omission [Löser et al., 2001a], e.g. when frame dropping occurs in video processing.

Summarizing, the DSI provides the required streaming interface for the real-time processing of multimedia data. It delivers efficient continuous data transfer avoiding costly copy operations between processes, which is critical in the RETAVIC architecture. It is used with the fastest variant, static mapping in non-blocking mode, in contrast to the slower blocking mode. The possibility of using dynamic mapping is put aside due to its higher costs, which grow linearly with the packet size of the transferred data [Löser et al., 2001b].

VII.2.5. Controlling the Multimedia Conversion

Most of the existing software implementations of media converters (e.g. XVID, FFMPEG) have not been designed for any RTOS but rather for wide-spread best-effort systems such as Microsoft Windows or Linux. Thus, there is no facility for controlling the converter during processing with respect to the processing time. The only exception is stopping the converter; however, this is not a controlling facility in our understanding. By a controlling facility we mean the minimal interface implementation required for the interaction between the RTOS, the processing thread, and the control application. A proposal of such a system was given by [Schmidt et al., 2003]; thus it is only shortly explained here, and the problem is not further investigated within the RETAVIC project.

VII.2.5.1 Generalized control flow in the converter

The converters differ from each other in terms of their processing function, but their control flow can be generalized. The generalization of the control flow has been proposed in [Schmidt et al., 2003] and is depicted in Figure 59. Three parts are distinguished: pre-processing, control loop, and post-processing. The pre-processing allocates memory for the data structures on the media-stream level and arranges the I/O configuration, while the post-processing frees the allocated memory and cleans up the created references. The control loop iterates over all incoming media quanta and produces other quanta. In every iteration of the loop, a processing function is executed on the quant taken from the buffer. This processing function does the conversion of the media data, and the output of the function is then written to the buffer of the next converter.


[Figure: generalized control flow — pre-processing, then a control loop (read a quant, processing function, write a quant), then post-processing.]

Figure 59. Generalized control flow of the converter [Schmidt et al., 2003].

The time constraints connected with the processing have to be considered here. Obviously, the processing function occupies most of the time; however, the other parts cannot be neglected (as already shown in Table 5). The stream-related pre- and post-processing is assumed to be executed only once, during the conversion-chain setup. If there were some quant-related pre- and post-processing, it would become a part of the processing function. The time constraints are considered in the scheduling of the whole conversion chain in cooperation with the OS.
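The generalized control flow of Figure 59 can be sketched in a few lines of C++; the buffer interfaces are hypothetical stand-ins for the streaming interface (e.g. DSI or CSI) and only the structure is reproduced:

```cpp
#include <optional>

struct Quant { /* media quantum (payload omitted in this sketch) */ };

// Hypothetical buffer interfaces; read() returns no value when the stream ends.
struct InBuffer  { std::optional<Quant> read() { return std::nullopt; } };
struct OutBuffer { void write(const Quant&) {} };

class GenericConverter {
public:
    void run(InBuffer& in, OutBuffer& out) {
        pre_process();                         // stream-level setup, executed once
        while (auto q = in.read()) {           // control loop over all incoming quanta
            out.write(process(*q));            // processing function + write a quant
        }
        post_process();                        // free memory, clean up references
    }
protected:
    virtual void  pre_process() {}
    virtual Quant process(const Quant& q) { return q; }
    virtual void  post_process() {}
};
```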

VII.2.5.2 Scheduling of the conversion graph

Each converter uses resources of the underlying hardware and to provide guarantees for timely execution, the converters must be scheduled in RTE. In general, the scheduling process can be divided into multiple steps, starting with a user request and ending with a functionally correct and successfully scheduled conversion chain. The algorithm of the whole scheduling process proposed in the cooperative work on memo.REAL [Märcz et al., 2003] is depicted in Figure 60.

Construct the conversion graph
The first step of the scheduling process creates the conversion graph for the requested transformation. In this graph, the first converter accepts the format of the source media objects from the source, and the last converter produces the destination format requested by the user through the sink. Moreover, each successive pair of converters must be compatible, so that the output produced by the producer is accepted as input by the consumer, i.e. there is a functionally correct coupling.


[Figure: scheduling of the conversion graph — interaction between the media server and the resource managers of the RTOS.]

Figure 60. Scheduling of the conversion graph [Märcz et al., 2003].

There can be several functionally correct conversion graphs, since some of the conversions can be performed in arbitrary order, or there exist converters with different behavior but the same processing function. The advantage of this is that the graph with the lowest resource consumption can be chosen. In fact, some of the graphs may request more resources than available, and thus cannot be considered for scheduling even if they are functionally correct.

A method for constructing a transcoding chain from a pool of available converters has been proposed by [Ihde et al., 2000]. The algorithm uses the capability advertisement (CA) of the converter, represented in a simple Backus-Naur Form (BNF) grammar, and allows for media-type transformations. However, the approach does not consider the performance, i.e. the quantitative properties of the converters, thus no distinction is made between two functionally equal chains.


Predict quant data volume
Secondly, a static resource planning is done individually for the various resource usages based on a description of the resource requirements of each converter. This yields the data volume that a converter turns over on each resource when processing a single media quant. Examples are the data volumes that are written to disk, sent through the network, or calculated by a processing function.

Calculate bandwidth
In the third step, resource requirements in terms of bandwidth are calculated with the help of the quant rate (e.g. frame, sample, or GOP rate). The quant rate is specified by the client request, by the media object format description, and by the function of each converter. The output of this step is the set of bandwidths required on each resource, which may be calculated as shown below in Equation (44).

Check and allocate the resources
Fourth, the available bandwidth of each resource is inquired. With this information, a feasibility analysis of all conversion graphs is performed:

a) Each conversion graph is tested as to whether the available bandwidth on all resources suffices to fulfill the requirements calculated in 3rd step. If this is not the case, the graph must be discarded.

b) Based on the calculated bandwidth (from 3rd step), the runtime for each converter is computed using the data volume processed for a single quant. If the runtime goes beyond the limits defined by the quant rate, the graph is put aside.

c) The final part of the feasibility analysis is to calculate the buffer sizes e.g. according to [Hamann et al., 2001b] where the execution time follows the JCPS model. If some buffer sizes emerge as too large for current memory availability, the whole feasibility analysis must be repeated with another candidate graph.

The details on the feasibility analysis in bandwidth-based scheduling can be found in [Märcz and Meyer-Wegener, 2002]. The available bandwidth of a resource (B_R) is defined with respect to the resource capacity in terms of the maximum data rate (C_R) and the resource utilization (η_R) such that:

B_R = (1 − η_R) · C_R    (42)

The data volume s_R on the resource for a given quant is measured as [Märcz and Meyer-Wegener, 2002]:

s_R,i = C_R · t_R,i = D_R,i · t′_R,i    (43)

where t_R,i is the time required by the processing function for the i-th quant when the converter uses the resource R exclusively, i.e. the used resource bandwidth D_R,i occupies 100% of the resource. In the case of parallel execution with a lower used resource bandwidth (i.e. a percentage below 100%), the longer time t′_R,i on the resource is required by the i-th quant.

The required bandwidth Q_R of each resource can be calculated using the desired output quant rate per second of the converter (f_rate) as [Märcz and Meyer-Wegener, 2002]:

Q_R = S(s_R) · f_rate    (44)

where S is the average size, a function of the given data volume processed on the specific resource (in analogy to the first parameter of JCPSs).

Finally, the feasibility check of the bandwidth allocation is done according to [Märcz and Meyer-Wegener, 2002]:

∑_x Q_R,x ≤ B_R    (45)

where the required bandwidths on the given type of resource R are summed over all x converters, and this sum must not exceed the available bandwidth B_R of that resource (i.e. overlapping bandwidth reservations are not allowed).
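The feasibility check of equations (42)–(45) reduces to a simple summation; the following C++ sketch uses illustrative names and units and is not the memo.REAL implementation:

```cpp
#include <vector>

// One resource of the RTE, e.g. CPU, disk, or network.
struct Resource {
    double capacity;      // C_R: maximum data rate of the resource
    double utilization;   // eta_R: fraction already in use (0..1)
    double available() const { return (1.0 - utilization) * capacity; }  // (42) B_R
};

// Per-converter usage of that resource.
struct ConverterUsage {
    double avg_quant_volume;  // S(s_R): average data volume per quant on this resource
    double quant_rate;        // f_rate: produced quanta per second
    double required() const { return avg_quant_volume * quant_rate; }    // (44) Q_R
};

// (45): the sum of the required bandwidths of all converters in the graph
// must not exceed the available bandwidth of the resource.
bool feasible(const Resource& r, const std::vector<ConverterUsage>& converters) {
    double sum = 0.0;
    for (const auto& c : converters) sum += c.required();
    return sum <= r.available();
}
```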

In case no graph passes all the tests, the client has to be involved to decide on the measures to be taken. For example, it might consider lowering the QoS requirements. In the better case of more than one graph being left, any of them can be selected using different criteria. Minimum resource usage would be an obvious criterion. Finally, the bandwidth-based resource reservations should be conveyed to the resource manager of the operating system.

Summarizing, the result of the scheduling process is a set of bandwidths (dynamic resources) and buffer sizes (static resources) required to run the converters with a guaranteed Quality of Service. To avoid resource scheduling by worst-case execution time one of the already discussed models could be exploited.

VII.2.5.3 Converter’s time prediction through division on function blocks

The time required by the processing function, mentioned in Equation (43), can be measured physically on the running system. However, this time depends on the content of the video sequences (section VII.2.1), so the measured time may vary from sequence to sequence even when it is determined on a per-quant basis. Here, the proposed static meta-data may be very helpful. The idea is to split the converter into basic execution blocks (also called function blocks [Meyerhöfer, 2007]) which process subsets of the given quant (usually iteratively in a loop) and have defined input and output data; for video, the split can be done on the macro-block and then on the block level, e.g. by detailing the LLV1 algorithm in Figure 13 (on p. 101) or the MD-XVID algorithm in Figure 17 b) (on p. 113). The edges of the graphs should represent the control flow rather than the data flow, since the focus is on the processing times; however, the data flow of the multimedia data should not be neglected due to the accompanying transfer costs and buffering, which also influence the final execution time. Moreover, the data dependency between subsequent converters must be considered.

It is natural in media processing that the defined function blocks can have internally multiple alternatives and completely separated execution paths. Of course, it may happen that only one execution path exists or that a nothing-to-do execution path is included. In any case, the execution paths are contextual, i.e. they depend on the processed data, and the decision about which path to take is made based on the current context values of the data. To prove this assumption, the different execution times for different frame types within a given function block should be measured on the same platform (with statistical errors eliminated).


Additionally, the measurement could deliver the platform-specific factors usable by the processing time calculation and prediction if executed for the same data on different platforms. Some details on calculating and measuring the processing time are given in the upcoming section VII.3 Design of Real-Time Converters.

VII.2.5.4 Adaptation in processing

If any mismatch exists between the predicted time and the actually consumed time, i.e. when the reserved time for the processing function is too small or too big, adaptation in the converter processing can be employed. In other words, adaptation during execution is understood as a remedy for an imprecise prediction of the processing time.

Analogously to the hard real-time adaptive model of media converters, the converter including its processing function has to be reworked such that the defined parts of the HRTA converter model are included. Moreover, a mechanism for coping with overrun situations has to be included in the HRTA-compliant media converter.

VII.2.6. The Component Streaming Interface

The separation of the processing part of the control loop from the OS environment and from any data-streaming functionality was the first requirement of the conversion-graph design. The other one was to make the converter's implementation OS-independent. Both goals are realized by the component streaming interface (CSI) proposed by [Schmidt et al., 2003]. The CSI is an abstraction implementing the multimedia converter interface, providing the converter's processing function with a media quant and pushing the produced quant back to the conversion chain for further data flow to the subsequent converter. This abstraction exempts the converter's processing function from dealing directly with an OS-specific streaming interface like DSI, as depicted in Figure 62 a). An obvious benefit of the CSI is the option to adapt a CSI-based converter to another real-time environment with a different streaming interface without changing the converter code itself.

On the converter side, the CSI is implemented in a strictly object-oriented manner. Figure 61 shows the underlying class concept. The converter class itself is an abstract class representing only the basic CSI functionality. For each specific converter implementation, the programmer is supposed to override at least the processing function of the converter. Some examples of implemented converters (filter, multiplexer) are included in the UML diagram as subclasses of the Converter class.
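The override pattern can be illustrated with a small C++ sketch; the class shape and method signature below are assumptions echoing the simplified OO model of Figure 61, not the actual CSI code:

```cpp
#include <cstdint>
#include <vector>

// Assumed shape of a media quantum exchanged through the CSI buffers.
struct Quant { std::vector<std::uint8_t> data; };

// Simplified abstract CSI converter: concrete converters override at least
// the processing function.
class Converter {
public:
    virtual ~Converter() = default;
    virtual Quant process(const Quant& in) = 0;
};

// Example analogous to the Filter subclass of Figure 61: a trivial filter
// that passes every quantum through unchanged.
class PassThroughFilter : public Converter {
public:
    Quant process(const Quant& in) override { return in; }
};
```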

[Figure: simplified UML class diagram with ConverterChain, ConverterConnection, Converter, JCPSendBuffer, JCPReceiveBuffer, Sender, Receiver, and the Converter subclasses Filter and Multiplexer.]

Figure 61. Simplified OO-model of the CSI [Schmidt et al., 2003].

[Figure: a) a chain of CSI converters coupled through DSI; b) a ConverterChain with ConverterConnections controlling the Converters via IPC, each Converter using a JCPReceiveBuffer and a JCPSendBuffer towards the DSI buffers.]

Figure 62. Application model using CSI: a) chain of CSI converters and b) the details of control application and converter [Schmidt et al., 2003].

The chain of converters using the CSI interface is depicted in Figure 62 a), and the details of the application scenario showing the control application managing a ConversionChain with ConverterConnections and the CSI Converters are shown in Figure 62 b) [Schmidt et al., 2003]. The prototypical Converters run as stand-alone servers under DROPS. In analogy to the model of a DSI application (Figure 58), a control application manages the setup of the conversion chain and forwards user-interaction commands to the affected converter applications using interprocess communication (IPC). The ConverterConnection object is affiliated with the Converter running in the run-time RTE and employs the JCPReceiveBuffer and JCPSendBuffer, which help to integrate the streaming operations on multimedia quanta. The prototypical implementation allowed setting up a conversion chain and playing an AVI video, which was converted online from the RGB to the YUV color space [Schmidt et al., 2003].

The CSI prototype was initially selected as the environment for embedding the media converters. However, there have been problems with the versioning of DROPS (see the Implementation chapter), and the CSI implementation has not been maintained anymore due to the early closing of the memo.REAL project. Thus, the CSI has not been used by the prototypes explained in the later part of this work, and the DSI has been used directly. Still, the CSI concepts developed within the memo.REAL project are considered useful and applicable in a further implementation of the control application of the RETAVIC architecture.

VII.3. Design of Real-Time Converters

The best-effort implementations introduced previously and evaluated shortly in Evaluation of the Video Processing Model through Best-Effort Prototypes (section V.6) are the groundwork for the design of the real-time converters. The real-time converters must be able to run in the real-time environment under time restrictions, and as such have to meet certain QoS requirements to allow for interaction during processing. When porting a best-effort converter to a real-time one, the run-time RTE and the selected processing model have to be considered already in the design phase. Since the decision for DROPS with DSI has been made, the converters have to be adapted to this RTOS and to its streaming interface. However, before going into the implementation details, the design decisions connected with the additional functionality supporting the time constraints are investigated, and proposals to rework the converters' algorithms are given.

A systematic and scientific method of algorithm refinement is exploited during the design phase, in analogy to OS performance tuning with its understanding, measuring, and improving steps [Ciliendo, 2006]. These three steps are further separated into the low-level design, which delivers the knowledge about the platform-specific factors influencing processing efficiency, and the high-level design, where the mapping of the algorithms onto the logical level occurs.


Obviously, the low-level design does not influence the functionality of the processing function but demonstrates the influence of the elements of the run-time environment, which includes the hardware and the OS specifics. In contrast, the high-level design does not cover any run-time-specific issue and may lead to changes of the converter's algorithm providing different functionality and thus to altered output results. Furthermore, the quantitative properties shall be investigated with respect to the processing time after each modification implementing a new concept, in order to prove whether the expected gains are achieved. This investigation is called quantitative time analysis in the performance evaluation [Ciliendo, 2006], and it distinguishes between processing-time variations deriving from the algorithmic or implementation-based modifications and other time deviations related to measurement errors (or side effects)—the latter can be classified into statistical errors deriving from the measurement environment and structural errors caused by the specifics of the hardware system [Meyerhöfer, 2007].

VII.3.1. Platform-Specific Factors

There are a few classes of the low-level design in which the behavior and the capabilities of the run-time environment are investigated. These classes are called platform-specific factors and include: hardware-architecture-dependent processing time, compiler optimizations, priorities in thread models, multitasking effects, and caching effects.

VII.3.1.1 Hardware architecture influence

The hardware architecture of the computer should first of all allow for running the software-based RTOS, and secondly, it has to be capable of coping with the requested workload. The performance evaluation of a computer hardware system is usually conducted by well-defined benchmarks delivering a deterministic machine index (e.g. BogoMips evaluating the CPU in the Linux kernel). Since such indexes are hard to interpret in relation to multimedia processing and scheduling, and the CPU clock rate alone cannot tell everything about the final performance, a simple video-encoding benchmark has been defined in order to deliver a kind of multimedia platform-specific index usable for scheduling and processing prediction. In such a case, the obtained measurements should reflect not just one specific subsystem, but the efficiency of the integrated hardware system with respect to the CPU with its L1, L2 and L3 cache sizes, the pipeline length and branch prediction, the implementation of the instruction set architecture (ISA) with SIMD94 processor extensions (e.g. 3DNow!, 3DNow!+, MMX, SSE, SSE2, SSE3), and other factors influencing the overall computational power. Thus, the index could easily characterize the machine and allow for a simple hardware representation in the prediction algorithm.

The video-encoding benchmark has been compiled once with the same configuration, i.e. the video data (the first four frames of Carphone QCIF), the binaries of the system modules, the binaries of the encoder, and the parameters have not been changed. The benchmark has been executed on two different test-bed machines called PC (using an Intel Mobile processor) and PC_RT (with an AMD Athlon processor), which are listed in Appendix E. The run-time environment configuration (incl. DROPS modules) is also stated in this appendix in the section Precise measurements in DROPS. The average time per frame of the four-times-executed processing is depicted in Figure 63, where the minimum and maximum values are also marked.

[Figure: "Hardware Architecture Comparison – Encoding Time" — processing time in ms (about 7–10 ms) per frame 1–4 for the Pentium4- and AMD-based machines, with min/max marks.]

Figure 63. Encoding time of the simple benchmark on different platforms.

It can be noticed that the AMD-based machine is faster; however, that was not the point. Based on these time measurements, an index was proposed such that the execution time of the first frame in the first run is considered the normalization factor, i.e. the other values are normalized

94 SIMD stands for single instruction, multiple data (in contrast to SISD, MISD, or MIMD). It usually refers to processor architecture enhancements including floating-point and integer-based multimedia-specific instructions: Multimedia Extensions (MMX; integer-only), Streaming SIMD Extensions (SSE 1 & 2), AltiVec (for Motorola), the 3DNow! family (for AMD), SSE3 (known as Prescott New Instructions – PNI) and SSE4 (known as Nehalem New Instructions – NNI). A short comparison of SIMD processor extensions can be found in [Stewart, 2005].


to this value and depicted in Figure 64. The average standard deviation of this index with respect to a specific platform and a specific frame is smaller than 0.7%, which is interpreted as the measurement error of multiple runs (and may be neglected). However, the average standard deviation computed with respect to a given frame but across different platforms is equal to 6.1% for all frames and 8.0% for only the predicted frames—such a significant deviation can be regarded neither as measurement error nor as side effect.
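The normalization itself is trivial; a minimal sketch (illustrative names only) of the proposed index calculation looks as follows:

```cpp
#include <vector>

// The proposed index normalizes every per-frame encoding time by the time of
// the first frame of the first run on the same machine.
std::vector<double> machine_index(const std::vector<double>& frame_times_ms) {
    std::vector<double> index;
    if (frame_times_ms.empty()) return index;
    const double norm = frame_times_ms.front();   // first frame, first run
    index.reserve(frame_times_ms.size());
    for (double t : frame_times_ms) index.push_back(t / norm);
    return index;
}
```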

[Figure: "Hardware Architecture Comparison – Machine Index" — execution time normalized to the first frame of the first run (about 0.7–1.3) per frame 1–4 for the Pentium4- and AMD-based machines.]

Figure 64. Proposed machine index based on simple multimedia benchmark.

As a result, the proposed index cannot be used in the prediction of the processing time, because it does not behave in a similar way on different platforms. The various types of frames are executed in different ways on diverse architectures, thus a more sophisticated measure should be developed that allows reflecting the architecture specifics in the scheduling process.

VII.3.1.2 Compiler effects on the processing time

Compiler optimization is another aspect related to the platform-specific factors. It applies whenever source code in a higher-level language (e.g. C/C++) has to be translated into a machine-understandable language, i.e. binary or machine code (e.g. assembler). Since the whole development of the real-time converters (including the source code of the RTOS) is done in C/C++, the language-specific effects have been shortly investigated. The source code of MD-XVID as well as of MD-LLV1 is already assembler-optimized for IA3295-compatible systems; however, there is still an option to turn the optimization off (to investigate the speed-up or to support other architectures).

A simple test has been executed to investigate the efficiency of the executable code delivered by different compilers [Mielimonka, 2006]. MD-XVID has been compiled with assembler optimizations using four different versions of the well-known open-source GNU Compiler Collection (gcc). The executables have then been run on the test machine PC_RT and the execution time has been measured. Additionally, the assembler optimizations have been investigated by turning them off for the most recent version of the compiler (gcc 3.3.6).

[Figure: a) whole-sequence encoding time in seconds (about 1.46–1.64 s) for gcc 2.95.4, gcc 3.3.6, gcc 3.4.5, gcc 4.0.2 and gcc 3.3.6 without assembler optimizations; b) the same times normalized to gcc 3.3.6 (about 98%–108%).]

Figure 65. Compiler effects on execution time for media encoding using MD-XVID.

The results for the encoding of the Carphone QCIF sequence are depicted in Figure 65. Part a) shows the execution time measured for the whole sequence, and part b) presents the values normalized to gcc 3.3.6 in order to better depict the differences between the measured times. It can be noticed that the oldest compiler (2.95.4) is about 6.8% slower than the fastest one. Moreover, the assembler optimizations done by hand are also conducted by the compiler during the compilation process. Even though the hand-made optimizations seem to be slightly better (99.93% of the normalization factor), the difference is still within the measurement-error range of below 0.7% (as discussed in the previous section).

95 IA32 is an abbreviation of 32-bit Intel™ Architecture.


Finally, it can be concluded that the compiler significantly influences the execution time, and thus only the fastest compiler should be used in the real-time evaluations. Additionally, the decision not to optimize the code further by hand is taken, since the gains are small or unnoticeable. Even more, the use of the fastest compiler not only delivers better and more efficient code, but also keeps the opportunity to use the higher-level source code on platforms other than IA32 (which was the assembler-optimization target).

VII.3.1.3 Thread models – priorities, multitasking and caching

Obviously, the thread models and execution modes existing in any OS influence the way every application runs under the given OS. A real-time system already provides some advantages usable for multimedia processing, such as the controlled use of resources, time-based scheduling, or reservation guarantees used by the QoS control. On the other hand, the application must be able to utilize such advantages.

One of the important benefits of DROPS is the possibility to assign priorities to the device drivers, which derives from the microkernel construction, i.e. the device drivers are treated as user processes. The assignment of a lower priority to a device driver allows for lowering or sometimes even avoiding96 the influence of device interrupts, which is especially critical for real-time applications using memory actively, due to potential overload situations deriving from the PCI-bus sharing between memory and device drivers and causing deadline misses [Schönberg, 2003].

Multitasking (or preemptive task switching) is another important factor influencing the execution of an application's thread, because the processor timeline is not equal to the application's processing timeline, i.e. the thread may be active or inactive for some time. The problem for a real-time application is to obtain enough active time to finish its work before the defined deadlines. However, DROPS uses fixed-priority-based preemptive scheduling (with a round-robin algorithm for equal priority values) for non-real-time threads, analogously to a best-effort system, i.e. the highest-priority thread is

96 This is possible only when a device not needed in the multimedia server is not selected at all, e.g. drivers for USB or IEEE hubs and connectors are not loaded.


activated first and equal-priority threads share the processor time equally. Thus, an investigation has been conducted in which the MD-XVID video encoder (1st thread) and a mathematical program (2nd thread) have been executed in parallel with equal priorities. The execution time according to the processor timeline (not the application timeline) has been measured for the encoder in the parallel environment (Concurrent) and compared to the stand-alone execution (Single); the results are depicted in Figure 66. It is clearly noticeable that the concurrent execution takes much more time for some frames than the single one—namely, for those frames where the thread was preempted into the inactive state and was waiting for its turn.

[Figure: "Multitasking Effect" — per-frame execution time in ms (0–30 ms) for frames 1–94 of the encoder in concurrent and single execution.]

Figure 66. Preemptive task switching effect (based on [Mielimonka, 2006]).

It is obvious that a multitasking system is required for a multimedia server, where many threads are the usual case, but preemptive task switching as shown in Figure 66 is too dangerous for real-time applications. Thus, a mechanism for QoS control and time-based scheduling is required; luckily, DROPS already provides a controlling mechanism, namely the QAS, which is applicable here. The QAS may be used with the admission server controlling the current use of resources (especially CPU time) and reserving the required active time of the given real-time thread by allocating the periodic timeslice within the period provided by the resource.


VII.3.2. Timeslices in HRTA Converter Model

Having discussed the platform-specific factors, the timeslice for each part defined in the HRTA converter model has to be introduced. In contrast to the thread model in DROPS (Figure 57 on p. 182), where only one mandatory timeslice, any number of optional timeslices, and one empty (idle) time-sharing timeslice exist, the time for each part of the HRTA converter model is depicted in Figure 67. According to the definition of the HRTA converter model, two mandatory timeslices and one optional timeslice are defined: the time t_base_ts of the mandatory base timeslice is defined for C_M, and analogously t_enhance_ts of the optional enhancement timeslice for C_O, and t_cleanup_ts of the mandatory cleanup timeslice for C_D.

[Figure: one period divided into the timeslices t_base_ts, t_enhance_ts, t_cleanup_ts and t_idle_ts.]

Figure 67. Timeslice allocation scheme in the proposed HRTA thread model of the converter.

The last time value, called t_idle_ts, is introduced in analogy to the time-sharing part of the DROPS model. It is a nothing-to-do part in the HRTA-compatible converter and is used for the inactive state in the multitasking system—other threads may be executed in this time—or, if a timeslice overrun happened, the other converter parts can still exploit this idle time. The period depicted in Figure 67 is assumed to be constant and derives directly from the target frequency F of the converter output (i.e. the period equals the inverse of the target frame rate, 1/fps); it definitely does not have to be equal to the length of the period as defined by T in JCPSt.


VII.3.3. Precise time prediction

The processing time may be predicted based on statistics and real measurements, but one has to keep in mind that it has already been demonstrated that the multimedia data influence the processing time. Therefore, three methods have been investigated during the design of the processing-time estimation:

1) Frame-based prediction

2) MB-based prediction

3) MV-based (or block-based) prediction

They are ordered by complexity and accuracy i.e. the frame-based prediction is the simplest one but the MV-based seems to have the highest accuracy. Moreover, the more complex the estimation algorithm is, the more additional meta-data it requires. So, these methods are directly related to the different levels of the static MD set as given in Figure 12 (on p. 98).

All these methods depend on both: the platform characteristics (as discussed in section VII.3.1) and the converter behavior. The data influence is respected by each method with small to high attention by using a certain subset of the proposed meta-data. Anyway, all of the methods require two steps in the estimation process:

a) Learning phase – where the platform characteristics in context of the used converter and a given set of video sequences are measured and a machine index is defined (please note that the simple one-value index proposed in VII.3.1.1 is insufficient);

b) Working mode – where the defined machine index together with the prepared static meta-data of the video sequences is combined by the estimation algorithm in order to define the time required for the execution (let’s call it default estimation).

Additionally, the working mode could be used to refine the estimation by extending the meta-data set with the execution trace and storing it back in the MMDBMS. The prediction could then be based on the trace or on a combination of the trace and the default estimation, and would thus be more accurate. If the exactly same request appears in the future, not only the trace but also the already produced video data could be stored; this, however, produces an additional amount of media data and should only be considered as a trade-off between processing and storage costs. Neither the estimation refinement by the trace nor the reuse of already-processed data has been further investigated.

VII.3.3.1 Frame-based prediction

This method is relatively simple. It delivers the average execution time per frame, considering the distinction between frame types and the video size. The idea of distinguishing between frame types comes directly from the evaluation. The decoding time per frame type is depicted for a few sequences in different resolutions in Figure 68. In order to make the results comparable, the higher resolutions are normalized as follows: CIF by a factor of 4 and PAL (ITU601) by a factor of 16. Moreover, B-frames are used only in the video sequences with the “_temp” extension.

[Figure: "Normalized Average Decoding Time per Frame Type" — decoding time in ms (0–12 ms) per frame type (I, P, B) for the QCIF, CIF and ITU601 test sequences, with the higher resolutions divided by 4 and 16, respectively.]

Figure 68. Normalized average LLV1 decoding time counted per frame type for each sequence.

Analogously, the MD-XVID encoding is depicted only for the representative first forty frames of Container QCIF in Figure 69, where the average of the B-frame encoding is above the average of the P-frame encoding (about 9.17 ms and 8.36 ms, respectively). Summarizing, it is clearly visible that I-frames are processed fastest and B-frames slowest for both the decoding and the encoding algorithms.


[Figure: "Frame Encoding Time (carphone_qcif, first 40 frames)" — per-frame encoding time in ms (6–11 ms) over the frame-type sequence I, B, P, B, P, …, with the averages of P-frames and B-frames marked.]

Figure 69. MD-XVID encoding time of different frame types for representative number of frames in Container QCIF.

The predicted time for the given resolution is defined as:

T_exec = n · T^avg    (46)

where T^avg is the machine-specific time vector containing the average execution times measured for I-, P- and B-frames during the learning phase, and n is the static MD vector holding the number of frames of each type for the specific video sequence:

n = [n_I, n_P, n_B],  with  n_I = IFrameSum_moi,  n_P = PFrameSum_moi,  n_B = BFrameSum_moi    (47)

The distribution of frame types for the investigated video sequences is depicted in Figure 109 (section XIX.1 Frame-based static MD of Appendix F).

Thus, the predicted time for the converter execution may simply be calculated as:

T_exec = n_I · T_I^avg + n_P · T_P^avg + n_B · T_B^avg    (48)
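Equation (48) maps directly to code; the following C++ sketch uses illustrative type names and assumes the averages were obtained in the learning phase:

```cpp
// Frame-based prediction of equation (48): n holds the frame counts from the
// static meta-data, T_avg the machine-specific average times per frame type.
struct FrameCounts   { int nI, nP, nB; };       // I-, P-, B-frame counts of the sequence
struct AvgFrameTimes { double tI, tP, tB; };    // measured averages in ms (learning phase)

double predicted_execution_time_ms(const FrameCounts& n, const AvgFrameTimes& t) {
    return n.nI * t.tI + n.nP * t.tP + n.nB * t.tB;   // T_exec in ms
}
```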

No distinction between the types of converters has been made yet, but it is obvious that T^avg should be measured for each converter separately (as demonstrated in Figure 68 and Figure 69). The prediction error calculated for the example data is depicted in Figure 70. The difference between the total execution time and the total time predicted according to Equation (48) is shown on the left-side Y-axis, and the error, given as the ratio between this difference and the real value, on the right-side axis. The average absolute error is equal to 7.3%, but the maximal deviation reaches almost 11.6% for overestimation and 12.2% for underestimation.

[Figure: "Prediction Error" — difference between measured and predicted total time in seconds (left scale, about −6 to 4 s) and relative error in percent (right scale, about −60% to 40%) for each investigated test sequence.]

Figure 70. Difference between measured and predicted time.

Moreover, a different video resolution definitely causes a different average processing time per frame. Thus, it is advised to conduct the learning step on at least two well-known resolutions and then to estimate the time for a scaled video according to the following formula:

T_{new}^{avg} = \theta_{exec} \cdot T_{old}^{avg}, \quad \text{where} \quad
\theta_{exec} =
\begin{cases}
\left(1 - \frac{1}{\log(p_{new})}\right) \cdot \frac{p_{new}}{p_{old}} & \Leftrightarrow p_{new} > p_{old} \\
\left(1 - \frac{1}{\log(p_{new})}\right)^{-1} \cdot \frac{p_{new}}{p_{old}} & \Leftrightarrow p_{new} < p_{old}
\end{cases}    (49)


and T_{new} and p_{new} are the time and the number of pixels in the new resolution, while T_{old} and p_{old} refer to the original test video (i.e. the one of the measurement). A linear prediction, where the slope θ is simply the ratio p_{new}/p_{old} between the new and the old number of pixels, may also be applied, but it yields a higher estimation error, as depicted by the thin black lines for the LLV1 decoding in Figure 71. The theta-based prediction performs best for I-frames and in general better when downscaling; when upscaling, however, the predicted time is underestimated in most of the cases.

[Chart: Average and predicted time per frame type – measured QCIF, CIF and PAL values compared with theta-based and linear up-/downscaling predictions; execution time in ms.]

Figure 71. Average of measured time and predicted time per frame type.

VII.3.3.2 MB-based prediction

The MB-based method considers the number of different MBs in the frame. Three types are possible: I-MBs, P-MBs and B-MBs. They are functionally analogous to the frame types, but they are not directly related to them, i.e. I-MBs may appear in all three types of frames, P-MBs may be included only in P- and B-frames, and B-MBs occur solely in B-frames. Thus, a differentiation not on the frame level but on the MB level is more reliable for the average time measurement of the learning phase. Examples depicting the time consumed per MB are given for the different frame types in Figure 72. Interestingly, the B-MBs are coded faster than the I- or P-MBs, while in general B-frames take longer to code than P-frames; for


a comparison see Figure 68, Figure 69 and Figure 70 in the previous section. The high standard deviations for P- and B-frames stem from the different MB-types probably using different MV-types. It may thus be deduced that the more B-MBs there are in a frame, the faster the MB-coding of MD-based XVID and of MD-LLV1 will be; obviously, this may not hold for standard coders using motion estimation without meta-data (it is done in neither MD-LLV1 nor MD-XVID).

[Chart: MB-specific encoding time in μs over the MB number. I-frame: Avg.=43.4, Std.Dev.=4.7, Δ|max-min|=20.4; P-frame: Avg.=42.2, Std.Dev.=9.1, Δ|max-min|=43.8; B-frame: Avg.=30.2, Std.Dev.=7.8, Δ|max-min|=33.8.]

Figure 72. MB-specific encoding time using MD-XVID for Carphone QCIF.

The example distribution of the different MB types for two sequences is depicted in Figure 73 (more investigated examples are depicted in section XIX.2 MB-based static MD of Appendix F). There is a noticeable difference between the pictures because the B-frames do (a) or do not (b) appear in the video sequence. Even if the B-frames appear in the video sequence, it does not have to mean that all MBs within B-frame are B-MBs—this rule is nicely depicted for even frames in Figure 73 a), where B-MBs in yellow color cover only part of all MBs in the frame.


[Charts: Coded MBs L0 – number of I-, P- and B-MBs per frame for a) and b).]

Figure 73. Distribution of different types of MBs per frame in the sequences: a) Carphone QCIF and b) Coastguard CIF (no B-frames).

Now, based on the different amount of each MB-type in the frame the predicted time for each frame may be calculated as:

TF_{exec} = \mathbf{m} \cdot T^{MBavg} + f(T^{Davg})    (50)

where T^{MBavg} is the time vector including the converter-specific average execution time measured respectively for I-, P- or B-MBs during the learning phase (as shown in Figure 72), which includes the operations covering the code for preparation and support of the MB structure (e.g. zeroing MB matrixes), error prediction and interpolation, transform coding, quantization, and entropy coding of MBs incl. quantized coefficients or quantized error values with MVs (MB-related bitstream coding). It is defined as:

T^{MBavg} = [T_I^{MBavg}, T_P^{MBavg}, T_B^{MBavg}]    (51)

The vector \mathbf{m} is analogous to \mathbf{n} (in the previous section), but this static MD vector keeps the sum of MBs of the given type for the specific j-th frame of the video sequence (i.e. of the i-th media object), i.e.:

\mathbf{m} = [m_I, m_P, m_B], \quad m_I = IMBsSum_{mo_{i,j}}, \quad m_P = PMBsSum_{mo_{i,j}}, \quad m_B = BMBsSum_{mo_{i,j}}    (52)


The function f(T^{Davg}) returns the average time per frame required for the frame-type-dependent default operations other than MB processing, related to the specific converter before and after the MB-coding, i.e.:

T^{Davg} = [T_I^{Davg}, T_P^{Davg}, T_B^{Davg}]    (53)

and the function returns one of these values depending on the type of the processed frame for each converter. T_I^{Davg} covers the operations required for assigning internal parameters, zeroing frame-related structures, continuous MD decoding and non-MB-related bitstream coding (e.g. bit-stream structure information like width, height, frame type, frame number in the sequence), T_P^{Davg} includes the operations of T_I^{Davg} plus the preparation of the reference frame (inverse quantization, inverse transform and edging), and T_B^{Davg} includes the operations of T_I^{Davg} plus the preparation of two reference frames. Obviously, not all of the mentioned operations have to be included in a converter, e.g. the LLV1 decoder does not need continuous MD decoding. Moreover, T^{Davg} is related to the frame size analogously to the frame-based prediction, and thus the scaling should be applied respectively.
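As an illustration only, a minimal C sketch of the MB-based per-frame estimate of Equation (50) is given below; all names are hypothetical and the frame-type-dependent default time f(T^{Davg}) is assumed to be either measured or estimated (e.g. via Equation (54) further down):

/* MB-based per-frame prediction, Equation (50): a minimal sketch. */
typedef struct { double t_i, t_p, t_b; } mb_avg_t;     /* T^MBavg [ms] */
typedef struct { unsigned m_i, m_p, m_b; } mb_cnt_t;   /* m from static MD */

double predict_frame_time(const mb_avg_t *t_mb, const mb_cnt_t *m,
                          const double t_default[3],   /* f(T^Davg) per frame type */
                          int ftype /* 0=I, 1=P, 2=B */)
{
    double t_mbs = m->m_i * t_mb->t_i
                 + m->m_p * t_mb->t_p
                 + m->m_b * t_mb->t_b;   /* m . T^MBavg */
    return t_mbs + t_default[ftype];     /* + f(T^Davg) of the processed frame */
}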

[Chart: Encoding Process (carphone_qcif) – cumulated processing time in ms over the MB-encoding progress for an I-, a P- and a B-frame.]

Figure 74. Cumulated processing time along the execution progress for the MD-XVID encoding (based on [Mielimonka, 2006]).


To explain the meaning of f(T^{Davg}), the distribution of time within the encoding algorithm has been investigated from the perspective of MB-coding. The results for three representative frames (of different types) of Carphone QCIF are depicted in Figure 74. Here, the progress between 0 and 1 corresponds to the first part of f(T^{Davg}) and the progress between 99 and 100 denotes its second part. The progress between 1 and 99 is the processing time spent on MB-coding. As shown, P-frames need more time for preparation than I-frames, and B-frames need even more.

However, the opposite holds in the phase after the MB-coding, where I-frames need the most processing time and B-frames the least. This situation is better depicted in Figure 75, in which the time specific for each frame type has been divided into preparation, MB-coding and completion. Obviously, the time for the preparation and completion phases is given by f(T^{Davg}).

[Chart: Coding Time Partitioning – percentage shares of Preparation, MB-Coding and Completion for I-, P- and B-frames.]

Figure 75. Average coding time partitioning in respect to the given frame type (based on [Mielimonka, 2006]).

Now, the relation between f(T^{Davg}) and \mathbf{m} \cdot T^{MBavg} can be derived for the given frame, such that:

f(T^{Davg}) = \Delta \cdot \mathbf{m} \cdot T^{MBavg}, \quad \Delta = [\Delta_I, \Delta_P, \Delta_B] \;\wedge\; \Delta_k = \frac{a + b}{1 - (a + b)}    (54)

and

\begin{cases}
a = 24.7\% \wedge b = 17.5\%, & \text{if } k = I \Leftrightarrow f.type = I \\
a = 43.8\% \wedge b = 8.3\%, & \text{if } k = P \Leftrightarrow f.type = P \\
a = 64.4\% \wedge b = 4.1\%, & \text{if } k = B \Leftrightarrow f.type = B
\end{cases}    (55)

which may be further detailed as:

f(T_I^{Davg}) = \frac{24.7\% + 17.5\%}{100\% - (24.7\% + 17.5\%)} \cdot \mathbf{m} \cdot T^{MBavg}

f(T_P^{Davg}) = \frac{43.8\% + 8.3\%}{100\% - (43.8\% + 8.3\%)} \cdot \mathbf{m} \cdot T^{MBavg}    (56)

f(T_B^{Davg}) = \frac{64.4\% + 4.1\%}{100\% - (64.4\% + 4.1\%)} \cdot \mathbf{m} \cdot T^{MBavg}
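A minimal C sketch of this estimate, assuming the preparation/completion shares of Figure 75 and hypothetical names, could look as follows:

/* Estimate of f(T^Davg) from the MB-related time, Equations (54)-(56). */
double default_ops_time(double t_mbs /* m . T^MBavg */, int ftype /* 0=I, 1=P, 2=B */)
{
    static const double a[3] = { 0.247, 0.438, 0.644 };  /* preparation share */
    static const double b[3] = { 0.175, 0.083, 0.041 };  /* completion share  */
    double delta = (a[ftype] + b[ftype]) / (1.0 - (a[ftype] + b[ftype]));  /* Delta_k */
    return delta * t_mbs;
}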

Of course, f(T^{Davg}) according to the above definitions is only a rough estimate. The values predicted using Equation (50) in combination with the estimates from Equation (56) are presented in relation to the real measured values of Carphone QCIF in Figure 76 and Figure 77. The maximal error of overestimation and underestimation was equal to 15% and 8.6% respectively, while the average absolute error was equal to 3.93%. Moreover, the average error was positive, which means an over-allocation of resources in most cases.

[Chart: Measured and Predicted Time – processing time in ms per frame (Measured vs. Predicted) over the frame sequence.]

Figure 76. Measured and predicted values for MD-XVID encoding of Carphone QCIF.

Finally, the total predicted time for the converter execution may be calculated as:

T_{exec} = \sum_{i=0}^{n_I} TF_{exec,i} + \sum_{j=0}^{n_P} TF_{exec,j} + \sum_{k=0}^{n_B} TF_{exec,k}    (57)


where n_I, n_P and n_B are defined by Equation (47).

The total predicted time for the MD-XVID encoding of Carphone QCIF calculated according to Equations (50), (56) and (57) was equal to 873 ms and the measured one to 836 ms, thus the over-allocation was equal to 4.43% for the given example.

[Chart: Prediction Error per frame – difference in ms (left scale) and relative error in % (right scale).]

Figure 77. Prediction error of MB-based estimation function in comparison to measured values.

VII.3.3.3 MV-based prediction

As can be seen in Figure 72, the time required for each MB-type varies from MB to MB. These variations, measured by the standard deviation, are smallest for the I-MBs (4.7) and almost twice as big for P-MBs (9.1) and B-MBs (7.8) in the obtained results. The main difference in the coding algorithm between intra- and inter-coded macro blocks is the prediction part, which employs nine types of prediction vectors causing different execution paths. This leads to the assumption that the different types of prediction in the case of predicted macro blocks influence the final execution time of each measured MB. Thus the third method, based on the motion vector types, has been investigated, which has led to the function-block decomposition [Meyerhöfer, 2007] for the MD-based encoding, such that the execution of the code has been measured per MV-type separately.

Before going into detailed measurements, the static MD related to motion vectors has to be explained. The distribution of the motion vector types within the video sequence is depicted in Figure 78 and the absolute values of MV sums per frame are shown in Figure 79. The detailed explanation of graphs’ meaning and other examples of MV-based MD are given in section XIX.3 MV-based static MD of Appendix F.

Figure 78. Distribution of MV-types per frame in the Carphone QCIF sequence.

[Charts: MVs per frame – number of MVs per frame split by MV-type (mv1–mv9, no_mv) for a) and b).]

Figure 79. Sum of motion vectors per frame and MV type in the static MD for Carphone QCIF sequence: a) with no B-frames and b) with B-frames.

The time consumed for the encoding, measured per functional block specific to each MV-type, is depicted in Figure 80. It is clearly visible that the encoding time per MV-type is proportional to the number of MVs of the given type in the frame.

The most noticeable is the mv1 execution (black in Figure 80), which can be mapped almost one-to-one onto the absolute values included in the static MD (violet in Figure 79 a). Another behavior visible at once is caused by mv9 (dark blue in both figures). The results in both remarkable cases prove the linear dependency between the coding time and the number of MVs of the considered MV-type. Moreover, the distribution graph (Figure 78) helps to find out very quickly which MV-types influence the frame encoding time, i.e. in the given example mv1 is the darkest for almost the whole sequence besides the middle part, where mv9 is the darkest. The other MVs, namely mv2 and mv4, also have a key impact on the encoding of the Carphone QCIF (96) example.

[Chart: encoding time in ms per frame for the MV-type-specific functional blocks (mv1–mv9).]

Figure 80. MD-XVID encoding time of MV-type-specific functional blocks per frame for Carphone QCIF (96).

The encoding time has been measured per MV-type and the average T_i^{MVavg} is depicted in Figure 81 for each MV-type. This average MV-related time is used as the basis for the prediction calculation:

TF_{exec} = \mathbf{v} \cdot T^{MVavg} + f(T^{Davg})    (58)

where f(T^{Davg}) is the one defined for Equation (50), and T^{MVavg} is the time vector including the converter-specific average execution time measured respectively for the different MV-types during the learning phase (Figure 81), which includes all operations related to MBs using the given MV-type (e.g. zeroing MB matrixes, error prediction and interpolation, transform coding, quantization, and entropy coding of MBs incl. quantized coefficients or quantized error values with MVs). It is defined as a vector holding nine average encoding time values referring to the given MV-types:

T^{MVavg} = [T_1^{MVavg}, T_2^{MVavg}, T_3^{MVavg}, \ldots, T_9^{MVavg}]    (59)

The vector \mathbf{v} is a sum vector keeping the number of MVs of the given type for the specific j-th frame of the i-th video sequence:

\mathbf{v} = [v_1, v_2, v_3, \ldots, v_9], \quad v_i = MVsSum_{mo_{i,j}}(VectorID), \quad 1 \le i \le 9    (60)

where VectorID is a given MV-type, v_i is the sum of the MVs of the given MV-type and:

MVsSum_{mo_{i,j}}(VectorID) = \{ mv_i \mid mv_i \in MV \wedge TYPE(mv_i) = VectorID \wedge 1 \le i \le V \}    (61)

where V is the total number of MVs in the j-th frame, mv_i is the current motion vector belonging to the set MV of all motion vectors of the j-th frame, and the function TYPE(mv_i) returns the type of mv_i.
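As an illustration only, a minimal C sketch of the MV-based per-frame estimate of Equation (58) is shown below; all names are hypothetical, and f(T^{Davg}) is assumed to be obtained as in the MB-based method:

/* MV-based per-frame prediction, Equation (58): a minimal sketch. */
double predict_frame_time_mv(const double t_mv[9],   /* T^MVavg (Figure 81) */
                             const unsigned v[9],    /* MV counts per type  */
                             double t_default        /* f(T^Davg)           */)
{
    double t = t_default;
    for (int i = 0; i < 9; i++)
        t += v[i] * t_mv[i];   /* v . T^MVavg */
    return t;
}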

[Chart: average encoding time per MB for each MV-type, summarized in the following table.]

MV-type:          mv1  mv2  mv3  mv4  mv5  mv6  mv7  mv8  mv9
Avg. Time [μs]:    50   47   39   46   37   46   42   42   31

Figure 81. Average encoding time measured per MB using the given MV-type.


The total time for the video sequence is calculated the same way as in Equation (57).

The measured encoding time and the time predicted according to Equations (58) and (57) are presented in Figure 82. The predicted total time is denoted by TotalPredicted, and it is the sum of TstartupPredicted and TcleanupPredicted, both reflecting f(T^{Davg}), and of TMoCompPredicted, which is the sum of the products of the number of MVs and the average time calculated per MV-type. Analogously, the real measured total time is denoted by Total and the respective components are TimeStartup&1stMB, AllMBsBut1st and TimeCleanUp.

[Chart: predicted encoding time (TotalPredicted = TstartupPredicted + TMoCompPredicted + TcleanupPredicted) and measured encoding time (Total = TimeStartup&1stMB + AllMBsBut1st + TimeCleanUp) in ms per frame.]

Figure 82. MV-based predicted and measured encoding time for Carphone QCIF (no B- frames).

The error of the MV-based prediction is shown in Figure 83 as the difference in absolute values and as a percentage of the measured time.

The total predicted time was equal to 845 ms and the measured one to 836 ms (as in the MB-based case), thus the difference was equal to 9 ms, which resulted in an overestimation of 1.04%. Thus, the MV-based prediction achieved better results than the MB-based prediction.


[Chart: Prediction Error per frame – difference in ms (left scale) and relative error in % (right scale).]

Figure 83. Prediction error of the MV-based estimation function in comparison to measured values.

VII.3.3.4 The compiler-related time correction

Finally, the calculated time should be corrected by the compiler factor. This factor may be derived from the measurement conducted in the previous section (VII.3.1.2). So, the compiler-related time correction is defined as:

T_{exec\_corr} = \upsilon \cdot T_{exec}    (62)

where υ is a factor representing the relation of the execution times in the different runtime environments of the learning phase and of the working mode, i.e. it is equal to 1 if the same compiler was used in both phases, and otherwise it is calculated as the ratio of the execution times (or of the normalized times with the normalization factor) for the different compilers. For example, assume the values presented in section VII.3.1.2: the execution time of the converter compiled with gcc 2.95.4 is equal to 1.6204 (normalized time of 106.78% with a normalization factor of 1.5174) and with gcc 4.0.2 it is equal to 1.5678 (or 103.32% with the same normalization factor) for the same set of media data; if the code was compiled with gcc 2.95.4 for the learning phase and with gcc 4.0.2 for the working mode, then the value of υ used for the time prediction in the working mode is equal to the ratio 1.5678/1.6204 (or 103.32%/106.78%). Moreover, if the normalization factor for gcc 4.0.2 is different, e.g. 1.5489, then the normalized time for gcc 4.0.2 is equal to 101.22% and the ratio is calculated as (101.22%·1.5489)/(106.78%·1.5174). Please note that the normalization factor does not have to be constant, but then two values instead of one have to be stored, so it is advised to normalize the execution time by a constant normalization factor. On the other hand, using the measured execution time directly for calculating υ delivers the same results and does not require storing the normalization factor, but at the same time it hides the straightforward information about which compilers are faster and which are slower.
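A minimal C sketch of this correction, reproducing the example above with hypothetical names, could look as follows:

/* Compiler-related correction, Equation (62): a minimal sketch. */
double compiler_correction(double t_working_compiler, double t_learning_compiler)
{
    return t_working_compiler / t_learning_compiler;   /* upsilon */
}

/* example from the text: upsilon = 1.5678 / 1.6204 (gcc 4.0.2 vs. gcc 2.95.4),
   i.e. roughly 0.9675, so the predicted time is scaled down by about 3.2%:
   t_exec_corr = compiler_correction(1.5678, 1.6204) * t_exec;               */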

VII.3.3.5 Conclusions to precise time prediction

It can easily be noticed that none of the given solutions can predict the exact execution time. Each method delivers estimates burdened with some error. The error is biggest for the frame-based prediction, smaller for the MB-based prediction and smallest for the MV-based prediction. Moreover, the influence of the compiler's optimization cannot be neglected unless the same compiler is used for producing the executables. Finally, the investigation of the time prediction has led to the conclusion that errors in the prediction cannot really be avoided, and the HRTA converter model is the only one that can exploit the predicted time (obtained with any of the methods) at the cost of some drop in the quality of the output data. Obviously, the resulting quality drop is bigger if a less precise prediction method is used.

VII.3.4. Mapping of MD-LLV1 Decoder to HRTA Converter Model

To stick to the hard real-time adaptive converter model and to guarantee the minimal quality and the delivery of all frames, the decoding algorithm had to be split in such a way that the complete base layer without any enhancements is decoded first (before the given deadline) and the enhancement layers up to the highest (lossless) one are decoded next. However, an optimization problem occurred here: the multiple executions of the inverse quantization (IQ), of the inverse lifting scheme, i.e. the inverse binDCT (IbinDCT), and of the correction of pixel values in the output video stream for each layer. So, to avoid the loss of processing time in the case of more enhancement layers, the decoding was finally optimized and split into the three parts of CHRTA as follows:

• CM – decoding of the complete frame in BL – de-quantized and transformed copy of the frame in BL goes to frame store (FS – buffer), and quantized coefficients are used in further EL computations


• CO – decoding of the ELs – the decoded bit planes representing differences between coefficient values on different layers are computed by formula (12) (on page 103)

• CD – cleaning up and delivery – includes the final execution of IQ, IbinDCT and pixel correction for the last enhancement layer, and as such utilizing all readily processed MBs from optional part, and provides the frame to the consumer.

The time required for the base timeslice (guaranteed BL decoding) is calculated as follows:

t_{base\_ts} = t_{avg\_base/MB} \cdot m    (63)

where m denotes the number of MBs in one frame and t_{avg\_base/MB} is the average time consumed for one MB of the BL in the given resolution (regardless of the MB-type).

The time required for the cleanup timeslice (guaranteed frame delivery) is calculated as follows:

t_{cleanup\_ts} = t_{max\_cleanup/MB} \cdot m + t_{max\_enhance/MB}    (64)

where t_{max\_cleanup/MB} is the maximum time consumed for the cleanup step of one MB and t_{max\_enhance/MB} is the maximum time for the enhancement step of one MB in one of the ELs (to care for the last processed MB in the optional part), with respect to the given resolution.

The time required for the enhancement timeslice (complete execution not guaranteed, but behaves like imprecise computations) is calculated according to:

t_{enhance\_ts} = T - (t_{base\_ts} + t_{cleanup\_ts})    (65)

where T denotes the length of the period (analogous to T in JCPSt).

Finally, the decoder must check if it can guarantee the minimal QoS i.e. if the LQA for all the frames is delivered:

t_{base\_ts} + t_{enhance\_ts} \ge t_{max\_base/MB} \cdot m    (66)

The check is relatively simple: the criterion is whether the maximum decoding time per macro block t_{max\_base/MB} multiplied by the number of MBs in one frame m fits into the sum of the first two timeslices. Only if this check holds will the resource allocation providing QoS work and may the LQA guarantees be given. Otherwise, the allocation will be refused.
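A minimal C sketch of this allocation and admission check (Equations (63)-(66)), with hypothetical names and the per-MB times taken from the measurements described below, could look as follows:

/* Timeslice allocation for the RT-MD-LLV1 decoder: a minimal sketch.
   Returns 0 if the LQA guarantee cannot be given (allocation refused). */
typedef struct { double base, enhance, cleanup; } timeslices_t;   /* [ms] */

int allocate_timeslices(timeslices_t *ts, double period,
                        unsigned m,               /* MBs per frame        */
                        double t_avg_base_mb,     /* avg. BL time per MB  */
                        double t_max_base_mb,     /* max. BL time per MB  */
                        double t_max_cleanup_mb,  /* max. cleanup per MB  */
                        double t_max_enhance_mb)  /* max. EL time per MB  */
{
    ts->base    = t_avg_base_mb * m;                          /* (63) */
    ts->cleanup = t_max_cleanup_mb * m + t_max_enhance_mb;    /* (64) */
    ts->enhance = period - (ts->base + ts->cleanup);          /* (65) */
    /* admission check (66): worst-case BL decoding must fit into the
       base plus enhancement timeslice, otherwise refuse the allocation */
    return (ts->base + ts->enhance) >= (t_max_base_mb * m);
}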

The input values for the formulas are obtained by measurements of the set of videos classified along the given resolutions:

• QCIF – container, mobile, mother_and_daughter
• CIF – container, mobile, mother_and_daughter
• PAL – mobcal, parkrun, shields

It has been decided so because the maximum real values can only be measured by executing the real data with the compiled decoder. The average values of time per MB have been calculated per frame for all frames in the different sequences of the same resolution, which means that they are burdened with an error at least as high as that of the frame-based prediction. On the other hand, the average values per MB could be calculated according to one of the proposed methods mentioned earlier for the whole sequence and then averaged per MB for the given frame type, which could yield more exact average times for the processed sequence; this, however, has not been investigated.

VII.3.5. Mapping of MD-XVID Encoder to HRTA Converter Model

Analogously to the MD-LLV1 HRTA decoder, the MD-XVID encoder has been mapped to the HRTA converter model, but only one time prediction method has been used in the latter case.

VII.3.5.1 Simplification in time prediction

None of the prediction methods discussed in the Precise time prediction section is able to predict the exact execution time, and each of them estimates the time with some bigger or smaller error. Because the differences between the MB-based and the MV-based prediction are relatively small (compare Figure 77 and Figure 83), the simpler method, i.e. the MB-based prediction, has been chosen for calculating the timeslices of the HRTA converter model, and moreover, it has been simplified further.


Additionally, it has been decided to allow only constant timeslices driven by the output frame frequency with a strictly periodic execution. The constancy of the timeslices is derived directly from the average execution time, namely it is based on the maximum average frame time for the default operations and on the average MB-specific time. The maximum average frame time is chosen out of the three frame-type-dependent default-operation average times (as given in Equation (53)):

T_{MAX}^{Davg} = \max(T_I^{Davg}, \max(T_P^{Davg}, T_B^{Davg}))    (67)

where \max(x,y) is defined as [Trybulec and Byliński, 1989]:

\max(x,y) = \begin{cases} x, & \text{if } x \ge y \\ y, & \text{otherwise} \end{cases} = \frac{1}{2} \cdot \left( |x - y| + x + y \right)    (68)

and the average MB-specific time is the mean value:

T_{AVG}^{MBavg} = \frac{T_I^{MBavg} + T_P^{MBavg} + T_B^{MBavg}}{3}    (69)

VII.3.5.2 Division of encoding time according to HRTA

Having defined the above simplification, the hard real-time adaptive converter model of the MD-XVID encoder with the minimal quality guarantees, including the processing of all frames, could be defined. The encoding algorithm had to be split in such a way that the default operations are treated completely as the mandatory part, while the MB-specific encoding is treated partly as mandatory and partly as optional.

The time required for the base timeslice according to CHRTA is calculated as follows:

t_{base\_ts} = T_{MAX}^{Davg} \cdot \frac{a}{a + b}    (70)

where a and b are defined according to Equation (55).

Analogically, the cleanup timeslice is calculated as follows:


t_{cleanup\_ts} = T_{MAX}^{Davg} \cdot \frac{b}{a + b}    (71)

The time required for the enhancement timeslice, in which the complete execution of all MBs is not guaranteed, is calculated according to:

t_{enhance\_ts} = T_{AVG}^{MBavg} \cdot m    (72)

where m is the number of MBs to be coded.

Both Equations (70) and (71) work under a worst-case assumption independently of the processed frame type. So an optimization may be introduced that allows "moving" MB-specific processing into the unused time of the base timeslice. Of course, the worst-case assumption should stay untouched for the clean-up step.

Thus, the following additional relaxation condition is proposed:

TF_{exec} \cdot a < t_{base\_ts} \;\Rightarrow\; m_{base} = \left\lfloor \frac{t_{base\_ts} - (TF_{exec} \cdot a)}{T_{AVG}^{MBavg}} \right\rfloor    (73)

and

t_{enhance\_ts} = T_{AVG}^{MBavg} \cdot (m - m_{base})    (74)
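A minimal C sketch of this timeslice calculation including the relaxation (Equations (70)-(74)), with hypothetical names and the shares a and b passed in as in Equation (55), could look as follows:

/* Timeslice calculation for the RT-MD-XVID encoder: a minimal sketch. */
#include <math.h>

typedef struct { double base, enhance, cleanup; } enc_timeslices_t;   /* [ms] */

void encoder_timeslices(enc_timeslices_t *ts,
                        double t_d_max,    /* T^Davg_MAX, Equation (67)       */
                        double t_mb_avg,   /* T^MBavg_AVG, Equation (69)      */
                        double a, double b,/* shares from Equation (55)       */
                        unsigned m,        /* MBs per frame                   */
                        double tf_exec     /* predicted frame time, Eq. (50)  */)
{
    ts->base    = t_d_max * a / (a + b);                       /* (70) */
    ts->cleanup = t_d_max * b / (a + b);                       /* (71) */
    /* relaxation (73)/(74): move MBs into unused base-timeslice time */
    unsigned m_base = 0;
    if (tf_exec * a < ts->base)
        m_base = (unsigned)floor((ts->base - tf_exec * a) / t_mb_avg);
    if (m_base > m)
        m_base = m;
    ts->enhance = t_mb_avg * (m - m_base);                     /* (72)/(74) */
}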

Figure 84 demonstrates the mapping of the MD-XVID to HRTA converter model including the idea of relaxation according to Equations (73) and (74).


[Diagram: reservation according to CHRTA – startup, MB-processing, clean-up and IDLE phases within the periods n-1, n and n+1 for I-, P- and B-frames (top), and the relaxation variant where m_moved MBs are shifted into the unused time of the base timeslice (bottom).]

Figure 84. Mapping of MD-XVID to HRTA converter model.



Chapter 4 – Implementation

As soon as we started programming, we found to our surprise that it wasn’t as easy to get programs right as we had thought. Debugging had to be discovered. I can remember that exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs. Maurice Wilkes (1979, reminiscing about his early days on EDSAC in the 1940s)

VIII. CORE OF THE RETAVIC ARCHITECTURE

The RETAVIC project was divided into parts, i.e. sub-projects. Each part was meant to be covered by one (or more) student work(s); however, not all sub-projects have been conducted due to the time factor or missing human resources. Finally, the focus was on covering the most important and critical parts of the RETAVIC project, proving the idea of controllable meta-data-based real-time conversion for the most complex type of media (i.e. the video type), which is the base assumption for the format independence of multimedia data in an MMDBMS.

VIII.1. Implemented Best-effort Prototypes

In the first phase, the video transcoding chain has been implemented in the best-effort system, or to be more precise on two platforms of best-effort OSes, namely on Windows XP and on Linux. The implementation covered:


• p-domain XVID – extension of the XVID code to support the p-domain-based bit rate control algorithm;
• LLV1 codec – including the encoding and decoding parts – implementation of the temporal and quantization layering schemes together with the binDCT algorithm and adaptation of the Huffman tables;
• MD analyzer – produces the static and continuous MD for the given video sequence;
• MD-LLV1 codec – extension of LLV1 to support additional meta-data allowing skipping frames of the enhancement layers in the coded domain (i.e. the bit stream does not have to be decoded); here both the producing and the consuming parts are implemented;
• MD-XVID codec – extension of XVID to support additional meta-data (e.g. direct MV reuse); it also includes some enhancements in the quality such as: 1) a rotating diamond algorithm based on the diamond search algorithm [Tourapis, 2002] with only two iterations for checking if the full-pel MD-based MVs are suitable for the converted content, 2) a predictor recheck, which allows for checking the MD-based MV against the zero vector and against the median vector of three MD-based vectors (of the MBs in the neighborhood – left, top, top-right), and 3) a subpel refinement, where the MC using the MD-based full-pel MVs is calculated also for half-pel and q-pel.

Having the codec implemented, the functional and quality evaluations have been conducted, and moreover, some best-effort efficiency benchmarks have been accomplished. All these benchmarks and evaluations have already been reflected by charts and graphs in the previous part of this thesis i.e. in the Video Processing Model (V) and Real-Time Processing Model (VII) sections.

VIII.2. Implemented Real-time Prototypes

The second phase of the implementation covered at first the adaptation of the source code of the best-effort prototypes to support the DROPS system and its specific base functions and procedures as non-real-time threads, and secondly the implementation of the real-time threading features allowing for the algorithm division as designed previously. The implementation is based on only two of the previously mentioned best-effort prototypes, i.e. on the MD-LLV1 and MD-XVID codecs, and it covers:


• DROPS porting of MD-LLV1 – adaptation of the MD-LLV1 codec to support the DROPS-specific environment; no distinction is made within this work between the MD-LLV1 implemented in DROPS and the one in Windows/Linux, since only OS-specific hacking activities have been conducted and neither algorithmic nor functional changes have been made; moreover, the implementation in DROPS behaves analogously to the best-effort system implementation since no real-time support has been included, and even the source code is kept in the same CVS tree as the best-effort MD-LLV1;
• DROPS porting of MD-XVID – exactly analogous to MD-LLV1; the only difference is that it is based on the MD-XVID codec;
• RT-MD-LLV1 decoder – the implementation based on the DROPS porting of MD-LLV1 covering the real-time issues, i.e. the division of the algorithm into the mandatory and optional parts (which has been explained in the Design of Real-Time Converters section), the implementation of the preempter thread, special real-time logging through the network, etc.; it is described in detail in the following chapters;
• RT-MD-XVID encoder – the implementation covers aspects analogous to RT-MD-LLV1 but is based on the DROPS porting of MD-XVID (see also the Design of Real-Time Converters section); it is also detailed in the subsequent part.

The real-time implementations allowed evaluating quantitatively the processing time under the real-time constraints and provided the means of assessing the QoS control of the processing steps during the real-time execution.


IX. REAL-TIME PROCESSING IN DROPS

IX.1. Issues of Source Code Porting to DROPS

As already mentioned, the MD-LLV1 and MD-XVID codecs had to be ported to the DROPS environment. The porting steps, which are defined below, have been conducted for both implemented converters, and thus they may be generalized as a guideline of source code porting to DROPS for any type of converter implemented in a best-effort system such as Linux or Windows.

The source code porting to DROPS algorithm consists of the following steps:

1) Adaptation (if exists) or development (if is not available) of the time measurement facility

2) Adaptation to logging environment for delivering standard and error messages obeying the real-time constraints

3) Adaptation to L4 environment

The first step covered the time measurement facility, which may be based on functions returning the system clock or directly on the tick counter of the processor. DROPS provides a set of time-related functions giving the values in nanoseconds, such as l4_tsc_to_ns or l4_tsc_to_s_and_ns; however, as has been tested by measurements, they demonstrate some inaccuracy in the conversion from the number of ticks (expressed by l4_time_cpu_t, a 64-bit integer CPU-internal time stamp counter) into nanoseconds, traded off for a higher performance. While in non-real-time applications such inaccuracy is acceptable, it is intolerable in the continuous-media conversion using the hard real-time adaptive model, where very small time-point distances (e.g. between the begin and the end of the processing of one MB) are measured. Therefore a more accurate implementation has been employed, which also uses the processor's tick counter (by exploiting DROPS functions like l4_calibrate_tsc, l4_get_hz, l4_rdtsc) but is based on floating-point calculations instead of the integer-based, CPU-specific tsc_to_ns. This has led to more precise calculations.
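A minimal C sketch of such a floating-point based conversion is given below; it assumes that the DROPS functions l4_calibrate_tsc(), l4_get_hz() and l4_rdtsc() behave as referenced above, and the helper names are hypothetical:

static double ns_per_tick;                       /* machine-specific scale factor */

void init_time_measurement(void)
{
    l4_calibrate_tsc();                          /* calibrate the TSC once at start-up */
    ns_per_tick = 1e9 / (double)l4_get_hz();     /* CPU frequency in Hz */
}

unsigned long long ticks_to_ns(unsigned long long ticks)
{
    /* floating point avoids the rounding inaccuracy of the integer-based
       conversion at the cost of some performance */
    return (unsigned long long)((double)ticks * ns_per_tick);
}

/* usage: t0 = l4_rdtsc(); ...measured code...; dt_ns = ticks_to_ns(l4_rdtsc() - t0); */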

Secondly, exhaustive logging is required for analysis purposes, especially for investigating the behavior during the execution of the real-time benchmarks. DROPS provides mechanisms for logging to the screen (LogServer) or to the network (LogNetServer). The first option is not really useful for further analysis due to the limited screen size, and only the second one is applicable here. However, the LogNetServer is based on the OSKit framework [Ford et al., 1997] and can be compiled only with gcc 2.95.x, which has been proved to be the least effective in producing efficient binaries (Figure 65 on p. 196), but this is still not a problem. The real drawback is that logging through the LogNetServer influences the system under measurement due to the task-switching effect (Figure 66 on p. 198) caused by the LOG command being executed by the log DROPS server, which is different from the converter DROPS server97, and thus the delivered measures would be distorted. In addition, the LogNetServer, due to its reliance on synchronous IPC over TCP/IP, has unpredictable behavior and thus does not conform to the real-time application model. As a result, logging has to be avoided during the real-time thread execution. On the other hand, it cannot simply be dropped, so the solution of using log buffers in memory, which save the logging messages generated during the real-time phase and are flushed to the network by the LogNetServer during the non-real-time phase, is proposed. Such a solution avoids the task-switching problems and the unpredictable synchronous communication, delivering an intact real-time execution of the conversion process.
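A minimal C sketch of such a buffered logging scheme could look as follows (an assumption for illustration, not the thesis code; LOG() stands for the DROPS logging macro mentioned above, and the buffer size is arbitrary):

#include <string.h>

#define RT_LOG_SIZE (1 << 20)
static char rt_log_buf[RT_LOG_SIZE];
static unsigned rt_log_pos;

void rt_log(const char *msg)                 /* real-time phase: append only */
{
    unsigned len = (unsigned)strlen(msg);
    if (rt_log_pos + len + 1 <= RT_LOG_SIZE) {   /* silently drop if buffer is full */
        memcpy(rt_log_buf + rt_log_pos, msg, len);
        rt_log_buf[rt_log_pos + len] = '\n';
        rt_log_pos += len + 1;
    }
}

void rt_log_flush(void)                      /* non-real-time phase: flush to network */
{
    LOG("%.*s", (int)rt_log_pos, rt_log_buf);
    rt_log_pos = 0;
}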

Finally, the adaptation to the L4-specific environment used in DROPS has to be conducted. DROPS does not conform to the Portable Operating System Interface (POSIX), the fundamental and standard API for Linux/Unix-based systems. Moreover, DROPS is still under development and kernel changes may occur, so it is important to recognize which version of the DROPS kernel is actually used and then to do the respective adaptations for the chosen system configuration: L4Env_base for the upgraded version 1.2 (called l4v2) or L4Env_Freebsd for the previous version 1.0 (called l4v0) [WWW_DROPS, 2006]. L4Env_base differs from L4Env_Freebsd in that the latter is based on the OSKit [Ford et al., 1997] while the first

97 The DROPS’s servers are referred here as applications in the user space and outside the microkernel according to the OS ontology.


one has complete implementation of the fundamental system libraries (e.g. libc, log, names, basic_io, basic_mmap, syslog, time, etc.) on its own [Löser and Aigner, 2007]98.

As a result, the POSIX-specific functions and procedures often simply work, but sometimes they require adaptation. For example, the POSIX-compatible, assembler-based SIGILL check used by MD-LLV1 and MD-XVID to determine the SIMD processor extensions (namely MMX and SSE) is not supported on DROPS in the L4Env_base mode. Thus, both converters have been adapted by simply assuming that the used hardware supports these extensions, and no additional checks are conducted99, which eliminated the use of the problematic command.

Another problem with DROPS is that it still does not support I/O functionality with respect to real-time reads from and writes to the disk, and this is an undoubted limitation for a multimedia server. The supported simple file operations are not sufficient due to the missing real-time abilities. So, another practical solution has been employed for delivering the bit streams with video data, static and continuous meta-data as input: the particular data have been linked after compilation as binaries into the executable, loaded during the DROPS booting process into the memory, which is real-time capable, and accessed through the defined general input interface allowing for the integration of different inputs by calling an input-specific read function. Obviously, such a technique is unacceptable for real-world applications, but it allows for conducting the proof of concept of the assumed format independence through real-time transcoding. Another possibility would be using the low-latency real-time Ethernet-based data transmission [Löser and Härtig, 2004], but there only a specific subset of hardware allowing traffic shaping in a closed system has been used100, so it needs to be found out if this technique may be applied in

98 There are also other modes available like Tiny, Sigma0, L4Env, L4Linux, etc. The detailed specification of all available configurations together with detailed include paths and libraries can be found in [Löser and Aigner, 2007]. 99 Such assumption may be called a hack; however it reflects the reality, because nowadays almost all processors include the MMX- and SSE-based SIMD extensions in their ISA. Still, there has been the option for turning this SIMD support off by setting compilation flags XVID_CPU_MMX and/or XVID_CPU_SSE to zero to prohibit the usage of the given ISA subset. 100 The evaluation has been conducted with only three switches (two fast-ethernet and one gigabit) and two Ethernet cards (one fast-ethernet - Intel EEPro/100 and one gigabit - 3Com 3C985B-SX). The support for other hardware is not stated. On the other hand, the evaluation included the only real-time transmission as well as the shared network between real-time (DROPS) and best-effort (Linux) transmissions and proved the ability of guaranteeing sub-millisecond delays for network utilization of 93% for fast and 49% for gigabit Ethernet.


the wide area of general purpose systems such as multimedia server using common off-the-shelf hardware. Moreover, it would require development of the converters being able to read from and to write to the real-time capable network (e.g. DROPS-compliant RTSP/RTP sender or receiver).

IX.2. Process Flow in the Real-Time Converter Before going into the details of converter’s processing function, the process flow has to be defined. The DSI has already been mentioned as the streaming interface between the converters. This is one of the possible options for I/O operations required in the process flow of the real-time converter. Other possibility covers memory-mapped binaries (as mentioned in previous paragraph) for input and internal buffer (which is a real-time capable memory allocated by the converter). The DROPS simple file system and real-time capable Ethernet are jet another I/O option. The memory-mapped binaries and the internal output buffer have been selected for evaluation purposes due to their simplicity in the implementation. Moreover, some problems have appeared with other options: DROPS Simple FS did not support guaranteed real-time I/O operations, and RT-Ethernet required the specific hardware support (or would require OS- related activities in the driver development).

The process flow of the real-time converter is depicted in Figure 85. All mentioned I/O options (in gray) and the selected ones (in black) have been marked. The abstraction of the I/O can be delivered by the general input/output interface or the CSI proposed by [Schmidt et al., 2003]. However, the CSI has been left out due to its support only for version 1.0 of the DROPS (dependent on OSKit) – the development has been abandoned due to the closing of the memo.REAL project. So, if the abstraction had to be supported, the only option was to write the general I/O interface, such that no unnecessary copy operations appear e.g. by using pointer handing over and shared memory. So, the general input/output interfaces are nothing else but wrappers delivering information about the pointer to the given shared memory, which is delivered by the previous/subsequent element. This is a bit similar to the CSI described in section VII.2.6 (p. 190).


Figure 85. Process flow in the real-time converter.

The real-time converter uses the general input interface consisting of three functions:

int initialize_input(int type);
int provide_input(int type, unsigned char **address_p, unsigned int pos_n, unsigned int size_n);
int p_read_input(int type, unsigned char **address_p, unsigned int size_n);

where type defines the input type (the left-most element of Figure 85) and can be provided by the control application to the converter after building the functionally-correct chain. Obviously, the given type of input has to provide all the required types of data, i.e. media data, static MD, and continuous MD used by the real-time converter (otherwise the input should not be selected for the functionally-correct chain). Then the data in the input buffer are used by calling provide_input, which checks if the requested size of data (size_n) at the given position (pos_n) can be provided by the input, and p_read_input, which, based on the size of the quant (size_n), sets address_p (a pointer) to the correct position of the next quant in the memory (and does not read any data!). The general input interface then forwards the calls to the input-specific implementation based on the type of the input, i.e. for memory-mapped binaries provide_input calls the provide_input_binary function respectively. The nice thing about this is that the real-time converter does not have to know how the input-specific function is implemented but only has to know the type to be called to get the data; thus the flexibility of delivering different transcoding chains by the control application is preserved.


Obviously, the input-specific functions should provide all three fundamental types of I/O functions such as open, read, and lseek in POSIX. It is done in the memory-mapped binaries by: initialize_input_binary (being equivalent to open), provide_input_binary (checks and reads at given position like read), set_current_position and get_current_position (counterparts of lseek).
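The following minimal C sketch is hypothetical (only the function names are given above, so the constant INPUT_TYPE_BINARY and the signature of provide_input_binary are assumptions); it illustrates how the general input interface could forward a call to the input-specific implementation for memory-mapped binaries:

int provide_input(int type, unsigned char **address_p,
                  unsigned int pos_n, unsigned int size_n)
{
    switch (type) {
    case INPUT_TYPE_BINARY:   /* memory-mapped binaries linked into the executable */
        return provide_input_binary(address_p, pos_n, size_n);
    default:
        return -1;            /* unknown input type: not part of a functionally-correct chain */
    }
}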

The general output interface is implemented in analogy to the general input interface i.e. there are functions like initialize_output, provide_output and p_write_output defined. One remark is that, the output internal buffer allocates the memory itself based on the constant size provided by a system value during the start-up phase of the transcoding chain; however, it should be provided by the control application as variable parameter after all. Moreover, the output type should also fulfill the requirement of accepting the different output data produced by the real-time converter (again the rule of functionally-correct chain has to be applied) and should be given to the converter by the control application for calling the output functions properly.

IX.3. RT-MD-LLV1 Decoder

The first step was porting the decoder to the real-time environment by adapting all the standard file access and time measurement functions to those present in DROPS, as stated in the previous sections. Next, the algorithmic changes in the processing function had to be introduced. The timing functions and periods had to be defined in order to obtain a constant frame rate requested by the user, in such a way that exactly one frame is provided within one period. Here, a mechanism of stopping the processing of a frame even if not every macro block (MB) was completely decoded had to be introduced. This has been provided by the additional meta-data (MD) describing the bit-stream structure of each enhancement layer (discussed in section V.5.1 MD-based Decoding), because finding the next frame without Huffman decoding of the remaining MBs of the current frame is not possible. So, the MD allowed for a skip operation and going to the next frame on the binary level, i.e. an operation on the encoded bit stream.

IX.3.1. Setting-up Real-Time Mode

The time assigned to each timeslice is calculated by Equations (63), (64), (65) and (66) given in section VII.3.4 (p. 217) by means of initial measurements with the real-time decoder. Example initial values embedded in the source code as constants are listed in Appendix G in the function load_allocation_params(), but they should rather be included in the machine-dependent and resolution-related static meta-data. These measurements provide the average decoding time per macro block as well as the maximum decoding time per macro block on a given platform. This allows us to avoid a complex analysis of the specific architecture (possible due to the use of the HRTA converter model) and delivers huge simplifications in the allocation algorithm. Alternatively, the predicted time could be calculated according to the formulas given in the Precise time prediction section. The code responsible for setting up the timeslices and the real-time mode is given in Figure 86.

set up RT mode:

  createPreempterThread(preemptPrio);

  // set up RT periodic thread
  registerRTMainThread(preemptThread, periodLength);

  // set up timeslices
  addTimeslice(baseLength, basePrio);
  addTimeslice(enhanceLength, enhancePrio);
  addTimeslice(cleanupLength, cleanupPrio);

  // switch to RT mode
  startRT(timeOffset);
  while(running){
    do_RT_periodic_loop();
  }

Figure 86. Setting up real-time mode (based on [Mielimonka, 2006; Wittmann, 2005]).

IX.3.2. Preempter Definition

The decoder’s adaptive ability on the MB level (mentioned in sub-section VII.3.4 of the Design of Real-Time Converters section) requires handling of the time-related IPCs from the DROPS kernel. The timeslice-overrun IPC is only relevant for the enhancement timeslice of the main thread. In the case of an enhancement timeslice overrun, the rest of the enhancement layer processing has to be stopped and skipped. For the mandatory and cleanup timeslices, timeslice overruns do not affect the processing, i.e. for the base-quality processing the enhancement timeslice can additionally be used, and for the cleaning up in the delivery timeslice a timeslice overrun should never happen – otherwise it is the result of erroneous measurements, as the maximum (worst-case) time should be allocated for it. The deadline-miss IPC (DLM) is absolutely unintended for a hard real-time adaptive system, but it is nevertheless optionally handled by skipping the frame whenever the processing of the current frame is not finished. The system tries to limit the damage by this skipping operation, but a proper processing with the guaranteed quality cannot be assured anymore. It must be clear that a DLM might occur only due to system instability, assuming the correct allocation for the delivery timeslice (which is worst-case-based and thus compliant with the HRTA converter model). Finally, the preempter thread has been defined and is given as pseudo code in Figure 87 (the full listing is given in Appendix G).

preempter:

  while(running){
    receive_preemption_ipc(msg);
    switch(msg.type){
      case TIMESLICE_OVERRUN:
        if(msg.ts_id == ENHANCE_TS) {
          abort_enhance();
        }
        break;
      case DEADLINE_MISS:
        if(frame_processing_finished)
          abort_waiting_for_next_period();
        else {
          skip_frame();
          raise_delivery_error();
        }
        break;
    }
  }

Figure 87. Decoder’s preempter thread accompanying the processing main thread (based on [Wittmann, 2005]).

IX.3.3. MB-based Adaptive Processing

The use of the abort_enhance() function in the preempter enforces the implementation of a checking function in the enhancement timeslice, in order to recognize when the timeslice overrun occurred and to stop the processing correctly. Therefore, the decoder checks the timeslice-overrun (TSO) semaphore before decoding the next MB of the enhancement layer. If the TSO occurred, then all the remaining MBs of this enhancement layer as well as all MBs of the higher layers are skipped, i.e. the semaphore blocks them and the MB loop is left. The prototypical code used for decoding a frame of the given enhancement layer (EL) is listed in Figure 88.

Regardless of the number of already processed MBs in the enhancement timeslice, the current results are processed and arranged by the delivery step. The mandatory and delivery timeslices are required, and accordingly have to be executed completely in a normal processing, but as already mentioned the deadline miss is handled additionally to cope with erroneous processing. The part responsible for the base layer decoding has no time-related functions, i.e. neither timeslice overrun nor deadline miss are checked (the analogous description was given for the preempter pseudo code listed in Figure 87).

decode_frame_enhance(EL_bitstream, EL_no):

  for (x = 0; x < mb_width * mb_height; x++) {
    // check the TSO semaphore before decoding the next MB
    if (TSO_Enhance) {
      // timeslice overrun: skip the remaining MBs of this and all higher ELs
      break;
    }
    decode_mb_enhance(EL_bitstream, EL_no, x);   // per-MB decoding (call name illustrative)
  }

Figure 88. Timeslice overrun handling during the processing of enhancement layer (based on [Wittmann, 2005]).

Finally, the delivery part delivers: 1) the results from the base layer if no MB from the enhancement processing has been produced or 2) the output of the dequantization and the inverse transform of the MBs prepared by the enhancement timeslice. If at least one enhancement layer has been processed completely the dequantization and inverse transform is executed for all (i.e. mb_width· mb_height) MBs in the frame but only once.

IX.3.4. Decoder’s Real-Time Loop

The real-time periodic loop demonstrating all parts of the HRTA converter model is given by the pseudo code listed in Figure 89. It is clearly visible that at first the decoding of the base layer takes place. When it is finished, the context is switched to the next reserved timeslice. Here it does not matter whether the TSO of the mandatory (base-layer) timeslice occurred or not, because only a TSO of the enhancement timeslice would be critical; and the enhancement TSO is anyway not reached during the base-layer processing, since the minimal time has been allocated according to the worst case (for details see condition (66) on page 217). Another interesting event occurs during the enhancement timeslice, namely setting the position to the end of the frame for each decoded enhancement layer. It occurs always: 1) if the decoding of the given EL was finished, the function sets the pointer to the same position, and 2) if the decoding was not finished, the pointer is moved to the end of the frame based on the delivered continuous meta-data, which allows jumping over the skipped MBs in the coded domain of the video stream. The context is switched to the next timeslice just after the enhancement TSO, i.e. at most after the time required for processing exactly one MB; then the delivery step is executed and the processing context is switched to the non-real-time (i.e. idle) part. The delivery step finishes before or exactly at the period deadline (otherwise there is an erroneous situation). If it finishes before the deadline, the converter waits in the idle mode until the begin of the next period.

do_RT_periodic_loop():

  // BASE_TS
  decode_base(BL_Bitstream);
  next_reservation(ts1);

  // ENHANCE_TS
  for (EL = 1; EL <= desiredLayers; EL++){
    if(!TSO_Enhance) {
      decode_frame_enhance(EL_Bitstream[EL], EL);
    }
    setPositionToEndOfFrame(EL_Bitstream[EL]);
  }
  next_reservation(ts2);

  // CLEANUP_TS
  if(desiredLayers > 0){
    decoder_dequant_idc_enhance();
    decoder_put_enhance(outputBuffer);
  } else {
    copy_reference_frame(outputBuffer);
  }
  next_reservation(ts3);

  // NON_RT_TS – do nothing until the deadline
  if(!deadline_miss){ // i.e. normal case
    wait_for_next_period();
  }

Figure 89. Real-time periodic loop in the RT-MD-LLV1 decoder.

IX.4. RT-MD-XVID Encoder

Analogously to RT-MD-LLV1, the first step in the real-time implementation of MD-XVID was the adaptation of all standard procedures and functions for file access and time measurement to the DROPS-specific environment. Also the timing functions and periods had to be defined analogously to RT-MD-LLV1. The same MB-based stopping mechanism of the process had to be introduced for each type of frame. Of course, this mechanism has been possible due to the provided continuous meta-data (see section V.5.2 MD-based Encoding), allowing the definition of a compressed emergency output in two ways: 1) by skipping the frame using a special encoded symbol at the very beginning of each frame, or 2) by avoiding the frame skip through exploiting the processed MBs and zeroing only those which have not been processed yet. Additionally, the reuse of the continuous MD, substituting the not yet coded MBs by refining the first three coefficients and zeroing the rest, has been implemented, but it can be applied directly only if no resolution change of the frame is applied within the transcoding process101.

IX.4.1. Setting-up Real-Time Mode

The time assigned to each timeslice according to the HRTA converter model (see VII.3.5 Mapping of MD-XVID Encoder to HRTA Converter Model) has been measured analogously to the decoder; however, this time the values have not been hard-coded within the encoder source code, but provided as outside parameters through command-line arguments to the encoder process. Such a solution allows separating the time prediction mechanism from the real-time encoder. The possible parameters are listed in Table 8. The code responsible for setting up the timeslices and the real-time mode is exactly the same as for RT-MD-LLV1 (given in Figure 86 on p. 232), such that the variables are set to the values taken from the arguments.

Argument            Description
-period_length      Length of the period used by the real-time main thread; given in [ms] => periodLength
-mandatory_length   Length of the mandatory timeslice, delivering LQA under the worst-case assumption; given in [ms] => baseLength
-optional_length    Length of the optional timeslice for improving the video quality; given in [ms] => enhanceLength
-cleanup_length     Length of the delivery timeslice for exploiting the data processed in the mandatory and optional parts; it should be specified according to the worst-case assumption; in case of specifying zero, the frame skip will always be applied; given in [ms] => cleanupLength

Table 8. Command line arguments for setting up timing parameters of the real-time thread (based on [Mielimonka, 2006])
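As an illustration, the following minimal C sketch shows how such command-line arguments could be mapped onto the timing variables. The parsing helper itself is hypothetical; only the argument names and the target variables follow Table 8.

    #include <stdlib.h>
    #include <string.h>

    /* Timeslice lengths in [ms], filled from the command line (cf. Table 8). */
    static int periodLength, baseLength, enhanceLength, cleanupLength;

    static void parse_timing_args(int argc, char **argv)
    {
        for (int i = 1; i + 1 < argc; i += 2) {
            if      (!strcmp(argv[i], "-period_length"))    periodLength  = atoi(argv[i + 1]);
            else if (!strcmp(argv[i], "-mandatory_length")) baseLength    = atoi(argv[i + 1]);
            else if (!strcmp(argv[i], "-optional_length"))  enhanceLength = atoi(argv[i + 1]);
            else if (!strcmp(argv[i], "-cleanup_length"))   cleanupLength = atoi(argv[i + 1]);
        }
    }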

101 Some kind of indirect application of the first three coefficients is possible in case of a resolution change: the MD-based coefficients would have to be rescaled analogously in the frequency domain, or transformed to the pixel domain and rescaled there. However, this causes an undeniable overhead of additional processing, which in case of a timeslice overrun may not be affordable.


IX.4.2. Preempter Definition

Analogically to RT-MD-LLV1, the encoder’s adaptive ability on the MB level also requires handling the DROPS kernel’s IPCs informing about the time progress. The timeslice overrun (TSO) IPC is handled differently for each timeslice of the main thread. For the mandatory timeslice, a TSO changes the thread state from mandatory to optional, and the processing continues in the enhancement mode. In case of an optional TSO, the rest of the enhancement processing has to be stopped and skipped. For the cleanup timeslice, a TSO should not happen; if it does, it indicates a wrong allocation due to erroneous parameters and is handled by skipping the frame if the processing of the current frame has not been finished yet. Obviously, the worst-case condition should be used for the time allocation of the delivery step. The deadline-miss IPC (DLM) is absolutely unintended for a hard-real-time adaptive encoder and raises a delivery error of the encoder, but only if the processing of the current frame has not been finished before. Contrary to the decoder, the encoder is stopped immediately in case of a DLM. The pseudo code of the encoder’s preempter thread is listed in Figure 90 (the full listing is given in Appendix G).

preempter:

    while (running) {
        receive_preemption_ipc(msg);
        switch (msg.type) {
        case TIMESLICE_OVERRUN:
            if (msg.ts_id == BASE_TS) {
                next_timeslice();
            } else if (msg.ts_id == ENHANCE_TS) {
                next_timeslice();
            } else if (msg.ts_id == CLEANUP_TS) {
                if (!frame_processing_finished)
                    skip_frame();
                next_timeslice();
            }
            break;
        case DEADLINE_MISS:
            if (!frame_processing_finished) {
                raise_delivery_error();
                stop_encoder_immediatly();
            }
            break;
        }
    }

Figure 90. Encoder’s preempter thread accompanying the processing main thread.


IX.4.3. MB-based Adaptive Processing

It can easily be noticed that there is a difference between the preempter in the decoder and in the encoder. The decoder’s preempter only changes the semaphore signaling that the timeslice overrun occurred, and the context of the current timeslice is not changed, i.e. the priority of the execution thread is not changed by the preempter but only after the currently processed MB has been finished. In contrast, the encoder’s preempter switches the context immediately, because there is no clear separation between the base and enhancement layers as in the decoder, i.e. MBs are encoded in the mandatory part within the same loop as in the enhancement part (for details see VII.3.5 Mapping of MD-XVID Encoder to HRTA Converter Model). The consequence is that not all MBs assigned to the optional part may be processed, because when switching to the clean-up step, the MB loop is left in order to allow finishing the frame processing. The deadline miss is again handled within this loop, but it should never occur during correct processing; it may only be expected in case of an erroneous allocation (e.g. when the period deadline occurs before the end of the cleanup timeslice) or system instability. Then, however, a more drastic action is taken than in the case of the real-time LLV1 decoder, namely the encoder is stopped at once by returning the error signal XVID_STOP_IT_NOW, such that the real-time processing is interrupted and no further data is delivered.

encode_frame(frameType):

    for (i = 0; i < mb_width * mb_height; i++) {

        // do calculations for given MB
        encode_MB(MB_Type);

        // REALTIME control within the MB loop – inactive in non-RT mode
        if (realtime_switch == REALTIME) {
            if ((MANDATORY) OR (OPTIONAL)) {
                continue;
            }
            if (CLEANUP) {
                // leave the MB loop and clean up
                break;
            }
            if (DEADLINE) {
                // only in erroneous allocation e.g. DEADLINE before CLEANUP
                // leave the MB loop & stop immediately
                return XVID_STOP_IT_NOW;
            }
        }
    }

Figure 91. Controlling the MB-loop in real-time mode during the processing of enhancement layer.


IX.4.4. Encoder’s Real-Time Loop

Owing to the proposed construction of the preempter (Figure 90) and the real-time support embedded within the code of the MB loop (Figure 91), there is no extra facility for controlling the real-time loop analogous to the one of the RT-MD-LLV1 decoder given in IX.3.4. The functionality responsible for switching the current real-time allocation context is included in the preempter thread, i.e. it calls the next_timeslice() function, which in turn executes the DROPS-specific next_reservation() function for the subsequent timeslice. The final stage after the clean-up step, which switches to the context of the non-real-time timeslice where idle processing by wait_for_next_period() occurs, is executed in all cases except when the XVID_STOP_IT_NOW signal was generated (in other words, not if a deadline miss occurred).
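A minimal sketch of such a context-switching helper is given below. The static state variable and the cyclic mapping onto the reservations ts1 to ts3 are assumptions made for illustration only and do not reproduce the actual RT-MD-XVID code; next_reservation() denotes the DROPS-specific primitive mentioned above.

    /* Hypothetical sketch: advance the real-time allocation context. */
    enum { TS_BASE = 0, TS_ENHANCE = 1, TS_CLEANUP = 2 };
    static int current_ts = TS_BASE;

    static void next_timeslice(void)
    {
        switch (current_ts) {
        case TS_BASE:    next_reservation(ts1); break;  /* leave base, enter enhancement   */
        case TS_ENHANCE: next_reservation(ts2); break;  /* leave enhancement, enter cleanup */
        case TS_CLEANUP: next_reservation(ts3); break;  /* leave cleanup, enter non-RT part */
        }
        current_ts = (current_ts + 1) % 3;
    }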



Chapter 5 – Evaluation and Application

Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital. Aaron Levenstein (Why People Work)

X. EXPERIMENTAL MEASUREMENTS

Sun, Chen and Chiang have proposed in their book [Sun et al., 2005] to conduct the evaluation of transcoding in two phases: one is called static tests, the other dynamic tests. The static tests evaluate only the coding efficiency of the transcoding algorithm, which is obviously done with respect to the data quality (QoD). Typical graphs are PSNR vs. bit rate [kbps] or vs. bits per pixel [bpp] without bit-rate control, or PSNR vs. frame number within the video sequence where a bit-rate control algorithm or a fixed quantization parameter is used (as a constant factor) – e.g. the ρ-domain bit-rate control algorithm with two network profiles: a VBR profile varying in 2-second periods and a constant CBR profile. The dynamic tests are used for evaluating the real-time behavior of the transcoding process. However, [Sun et al., 2005] only proposes the use of a simulation environment, namely a test based on the MPEG-21 P.12 test bed [MPEG-21 Part XII, 2004], to simulate the real-time behavior of the transcoder. It follows the rules of the workload estimation process, which includes the following four stages:

1) developing the workload model generating the controlled media streams (e.g. various bitrates, resolutions, framerates) using different statistical distributions;


2) developing estimation of resource consumption;

3) developing the application for generating the workload according to specified model and allowing load scalability (e.g. number of media streams);

4) measuring and monitoring the workload and the resource consumption.

However, such a model only simulates the real-time behavior and will never precisely reflect a real environment covering a set of real media data. Thus it was decided at the very beginning of the RETAVIC project not to use any modeling or simulation environment, but to prepare prototypes running on a real computing system under the selected RTOS with a few media sequences well recognized by the research community focused on audio-video processing.

X.1. The Evaluation Process

In analogy to the two mentioned phases, the evaluations have been conducted in both directions. The static tests have already been included in the respective sections, for example in Evaluation of the Video Processing Model through Best-Effort Prototypes (section V.6). The dynamic tests covering execution-time measurements have been conducted under best-effort as well as real-time OSes. Those run under best-effort systems, for example depicting the behavior of converters (e.g. in section V.1 Analysis of the Video Codec Representatives), have been used only for imprecise time measurements due to the raised risk of external disruption caused by thread preemption and the potential precision errors of the available timing functions.

In contrast, the benchmarks executed under the RTOS have delivered precise execution-time measurements, which are especially important in real-time systems. They have been used for quantitative evaluation in two ways under DROPS: without scheduling (non-real-time mode) and with scheduling (real-time mode). The dynamic tests under the RTOS in non-RT mode have already been exploited in the previous Design of Real-Time Converters section (VII.3) for evaluating the accuracy of the prediction algorithms. The other dynamic benchmarks under the RTOS in RT as well as non-RT mode are discussed in the following subsections. They quantitatively evaluate the real-time behavior of the converters, either including the scheduling according to the HRTA converter model (with TSO / DLM monitoring) or including the execution time in the non-RT mode for comparisons with the best-effort implementations.

Since the RETAVIC architecture is a complex system requiring a big team of developers to get it implemented completely, only a few parts have been evaluated. Of course, these evaluations derive directly from the implementation, i.e. they are conducted only for the implemented prototypes explained in the previous chapter. Each step in the evolution of the implemented prototypes has to be tested as to whether it fulfills the demanded quantitative requirements, which can only be checked by real measurements. In addition, the real-time converter itself needs measurements for analyzing the performance of the underlying system. The following sections discuss measurements as a basis for time prediction, calibration and admission in the run-up to the encoding process. A time trace, which can be delivered by time recording during the real-time operation, can be used to recognize overrun situations and to adjust the processing in the future.

X.2. Measurement Accuracy – Low-level Test Bed Assumptions

The temporal progress of a computer program is expected to be directly related to the binary code and the processed data, i.e. the execution should go through the same sequence of processor instructions for a given combination of input data (functional correctness), and thus the time spent on a given set of subsequent steps should be the same. However, it has been proven that there exist many factors that can influence the execution time [Bryant and O'Hallaron, 2003]. These factors have to be eliminated as much as possible in every precise evaluation process, but beforehand they have to be identified and classified according to their possible impacts.

X.2.1. Impact Factors

There are two levels of impact factors according to the time scale of the duration of computer-system events [Bryant and O'Hallaron, 2003], and they are depicted in Figure 92. The microscopic granularity covers such events as processor instructions (e.g. addition of integers, floating-point multiplication or floating-point division) and is measured in nanoseconds on a gigahertz machine102. The deviations existing here derive from impacts such as the branch prediction logic and the processor caches [Bryant and O'Hallaron, 2003]. A completely different type of impact can be classified on the macroscopic level and covers external events, for example disc-access operations, refresh of the screen, keystrokes or other devices (network card, USB controller, etc.). The duration of macroscopic events is measured in milliseconds, which means that they last about one million times longer than microscopic events. An external event generates an interrupt (IRQ) to activate the scheduler [Bryant and O'Hallaron, 2003], and if the IRQ handler has a higher priority than the current thread according to the scheduling policy, the preemption of the task occurs (for more details see section VII.3.1.3 Thread models – priorities, multitasking and caching) and unquestionably generates errors in the results of the evaluation process.

Figure 92. Logarithmic time scale of computer events [Bryant and O'Hallaron, 2003].

As mentioned, the impact factors being sources of irregularities have to be minimized to get highly accurate measurements. The platform-specific factors discussed in section VII.3.1 are considered already in the design phase, and especially the third subsection considers the underlying OS’s factors. Contrary to best-effort operating systems, some of the impacts can be eliminated in DROPS. Owing to the closed run-time system as described in section XVIII.3 of Appendix E, the macroscopic events such as device, keystroke or even disc-access interrupts have been eliminated, thus achieving a level acceptable for the evaluation of the transcoding processes.

102 Obviously, they may be measured in tenths or even hundredths of a nanosecond for faster processors, and in hundreds of nanoseconds or in microseconds on megahertz machines.

X.2.2. Measuring Disruptions Caused by Impact Factors

On the other hand, not all impacts can be eliminated, especially those caused by the microscopic events. In that case, they should be measured and then considered in the final results. A simple method applicable here is to measure the same procedure many times and to consider warm-up, measurement and cool-down phases [Fortier and Michel, 2002].
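The following generic C sketch illustrates such a harness; it is not the actual benchmark code, and measure_once() is a hypothetical placeholder for one timed execution. It runs the procedure repeatedly, discards the warm-up and cool-down runs and evaluates only the measurements in-between.

    #include <math.h>
    #include <stdio.h>

    #define RUNS     110
    #define WARMUP   10
    #define COOLDOWN 10

    extern unsigned long long measure_once(void);   /* one execution time, e.g. in ns */

    static void run_benchmark(void)
    {
        unsigned long long t[RUNS];
        for (int i = 0; i < RUNS; i++)
            t[i] = measure_once();

        /* evaluate only the measurements between warm-up and cool-down */
        int n = RUNS - WARMUP - COOLDOWN;
        double sum = 0.0, max_dev = 0.0, mean;
        for (int i = WARMUP; i < RUNS - COOLDOWN; i++)
            sum += (double)t[i];
        mean = sum / n;
        for (int i = WARMUP; i < RUNS - COOLDOWN; i++) {
            double dev = fabs((double)t[i] - mean);
            if (dev > max_dev) max_dev = dev;
        }
        printf("avg = %.0f ns, max. abs. deviation = %.0f ns\n", mean, max_dev);
    }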

X.2.2.1 Deviations in CPU Cycles Frequency

An example of measuring the CPU frequency103, which may influence the time measurement of the transcoding process, is depicted in Figure 93. Here the warm-up as well as the cool-down phase covers 10 executions each, and the 90 measured values in-between are selected as results for the evaluation. Two processors have been evaluated: a) AMD Athlon 1800+ and b) Intel Pentium Mobile 2 GHz.


Figure 93. CPU frequency measurements in kHz for: a) AMD Athlon 1800+ and b) Intel Pentium Mobile 2 GHz.

The maximum absolute deviation and the standard deviation were equal to 1.078 kHz and 0.255 kHz, respectively, for the first processor (Figure 93 a) – please note that the scale range of the graph is on the level of 100 kHz (being 1/10000 of a GHz) in order to show the aberrations. Thus the absolute error, being on the level of 1/100,000th of a percent (i.e. 1·10⁻⁷ of the measured value), is negligible with respect to the one-second period. On the other hand, the maximum absolute deviation can also be expressed in absolute values in nanoseconds. The measurement period of one second is divided into 10⁹ nanoseconds, and during this period there are on average 1,533,086,203 clock ticks plus/minus 1,078 clock ticks. That makes the error expressed in nanoseconds equal to 703 ns, and this value is considered later on.

103 The frequency instability is a well-known impact factor, and it can be recognized by a simple CPU performance evaluation that loops an assembler-based CPU-counter read during a one-second period derived from the timing functions. The responsible prototypical source code is implemented within the MD-LLV1 codec in the utils subtree in the timer.c module (functions: init_timer() and get_freq()).

Another situation can be observed for the second processor. Here the noticeable short drop of the frequency to about 1.33 GHz is caused by an internal processor controlling facility (such as energy saving and overheating protection). Even if this drop is omitted, i.e. only 70 instead of 90 values are considered, both the maximum absolute deviation and the standard deviation are a few hundred times higher than for the previous processor, being equal to 777.591 kHz and 203.497 kHz respectively. The absolute error on the level of 1/100 of a percent (i.e. 1·10⁻⁵) is bigger than the previous one but could still be ignored; however, the unpredictable controlling facility generates impacts of large frequency changes (of about 33%) that are too complex to be considered in the measurements and are not in the scope of this work. Thus the precise performance evaluations have been executed on a processor working with exactly one frequency (e.g. the AMD Athlon), where no frequency switching is possible as in the Intel Pentium Mobile processors104.

Finally, the error caused by the deviations in CPU cycles frequency definitely influences the time estimation of the microscopic events but can be neglected for the macroscopic events. Further, depending on the investigated event type (micro vs. macro), this error should or should not be considered.
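Footnote 103 outlines how the frequency has been determined; the following C sketch merely reproduces this idea on x86 by counting TSC ticks over a one-second interval derived from standard timing functions. It is an illustration only, not the actual timer.c code.

    #include <stdint.h>
    #include <time.h>

    static inline uint64_t read_tsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Estimate the CPU frequency in kHz by counting TSC ticks over one second. */
    static uint64_t estimate_cpu_freq_khz(void)
    {
        time_t start = time(NULL);
        while (time(NULL) == start)      /* align to a second boundary */
            ;
        uint64_t t0 = read_tsc();
        time_t sec = time(NULL);
        while (time(NULL) == sec)        /* busy-wait for one full second */
            ;
        uint64_t t1 = read_tsc();
        return (t1 - t0) / 1000;         /* ticks per second -> kHz */
    }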

X.2.2.2 Deviations in the Transcoding Time

An example of executing the coding of the same data repeatedly is used for recognizing the measurement errors of the transcoding. Here the impact factors have been analyzed in the context of the application by determining the variations of the MD-XVID encoder under DROPS for the Coastguard CIF and Foreman QCIF sequences, with 100 executions for each sequence (with warm-up and cool-down phases of 10 executions each). The interruption by external macroscopic events has been eliminated by running only the critical parts of DROPS (detailed in section XVIII.3 of Appendix E). Next, three independent frames (i.e. the 100th, 200th, and 300th) out of all frames in the sequence have been selected for the comparison of the execution time and the recognition of the level of the deviations. The results are listed in Table 9.

104 The frequency switching is probably the reason for the problems during the attempt to define the machine index described previously in section VII.3.1.1 Hardware architecture influence (on p. 193).

Sequence         Frame No.   Avg. Execution   Max. Abs.        Standard         Max. Abs.   Standard
                             Time [ns]        Deviation [ns]   Deviation [ns]   Error [%]   Error [%]
Foreman QCIF     100         8 669 832        158 198          40 972           1.82%       0.47%
Foreman QCIF     200         8 501 218        148 223          39 212           1.74%       0.46%
Foreman QCIF     300         8 549 199        162 409          43 531           1.90%       0.51%
Coastguard CIF   100         31 398 821       251 021          66 213           0.80%       0.21%
Coastguard CIF   200         30 814 549       260 892          70 911           0.85%       0.23%
Coastguard CIF   300         30 231 784       269 940          64 432           0.89%       0.21%

Table 9. Deviations in the frame encoding time of the MD-XVID in DROPS caused by microscopic factors (based on [Mielimonka, 2006]).

It can be seen that the maximum absolute error is now on the level of one percent (i.e. 1·10⁻²), while the standard error (counted from the standard deviation) is on the level of a few tenths of a percent. Obviously, these errors derive only from the microscopic impact factors, since the maximum absolute deviation is 1000 times smaller than the duration of a macroscopic event; if a macroscopic event had appeared during the execution of the performance test, the absolute deviation would be on the level of milliseconds.

In contrast to the CPU-frequency error expressed in thousands of nanoseconds, the maximum absolute deviation of the frame encoding time is expressed in hundreds of thousands of nanoseconds. Consequently, the contribution of the CPU-frequency fluctuation error to the frame-encoding error amounts to a few tenths of a percent, i.e. 703 ns vs. 158,198 ns gives about 0.44% of the encoding error. In other words, the error measured here is a few hundred times bigger than the one deriving from the measurements based on the clock-cycle counter and the CPU frequency.


Additionally, the standard error calculated here is on the same level as the average standard error obtained in the previous multiple-execution measurements of the machine index, which was equal to or smaller than 0.7% (see section VII.3.1.1 Hardware architecture influence); this may be regarded as a confirmation of the correctness of the measurement.

Due to the facts given above, the CPU frequency deviations, although an undeniable burden for the microscopic events, are neglected in the frame-based transcoding-time evaluations. Secondly, the time values measured per frame can be represented by numbers having only the first four most significant digits.

X.2.3. Accuracy and Errors – Summary

Finally, the accuracy of the measurements in the context of the frame-based transcoding is on the level of a few tenths of a percent. The impact factors of the macroscopic events have been eliminated, which was proved by errors being a thousand times smaller than the duration of a macroscopic event. Moreover, the encoding task per frame in QCIF takes roughly the same time as a macroscopic event.

The measurement inaccuracy is mainly caused by microscopic events being influenced by the branch prediction logic and the processor caches. The fluctuations of the CPU frequency are unimportant for the real-time frame-based transcoding evaluations due to the minor influence on the final error of the transcoding time counted per frame and they are treated as spin-offs or side-effects.

However, if the level of measurement goes beneath the per-frame resolution (e.g. time measured per MB), the CPU frequency fluctuations may gain in importance and thus may influence the results. In such a case, the frequency errors should not be treated as side-effects but considered as meaningful for the results. These facts have to be kept in mind during the evaluation process.


X.3. Evaluation of RT-MD-LLV1

The experimental system on which the measurements have been conducted is the PC_RT described in Appendix E, and the DROPS configuration is given in section XVIII.3 of this appendix.

X.3.1. Checking Functional Consistency with MD-LLV1

To investigate the behavior of the RT-MD-LLV1 converter with respect to the requested data quality, four tests have been executed for Container CIF, in which the time per frame spent in each timeslice type according to the HRTA converter model has been measured. The results are depicted collectively in one graph (Figure 94).


Figure 94. Frame processing time per timeslice type depending on the quality level for Container CIF (based on [Wittmann, 2005]).

Each quality level corresponds to the number of processed QELs, i.e. none, one, two or all three. The times spent in the base (“mandatory_”), enhancement (“optional_”) and cleanup (“delivery_”) timeslices are depicted. The lowest quality level is denoted by the “_QBL” suffix in the graph. The mandatory_QBL curve (thick and black) is on the level of 5.4 ms per frame (it is covered by another curve, namely delivery_QEL1), and respectively the optional_QBL is at 0.03 ms and the delivery_QBL at 0.8 ms, which obviously is correct since in the lowest quality no calculations are done in the optional part. The higher qualities (“_QELx”) required more time than the lowest quality. Taking more time for higher-quality processing was expected for the enhancement and cleanup timeslices, but not for the base timeslice, since it processes exactly the same amount of base layer data; on the other hand, this difference between mandatory_QBL and mandatory_QELx may stem from the optimizations of the RT-MD-LLV1, where no frame is prepared for further enhancing if only the base quality is requested. It is also noticeable that the processing of the base timeslice took the same time for all three higher quality levels (mandatory_QELx).

Summarizing, it is clearly visible that the RT-MD-LLV1 decoder behaves, as assumed, similarly to the best-effort implementation of the LLV1 decoder (see Figure 24 on p. 123), i.e. the higher the requested quality, the more time for the processing of the enhancement layers has to be allocated to the optional_QELn timeslice (see the respective optional curves of QEL1, QEL2, and QEL3). What is more, the curves are nearly constant, flat lines, thus the processing is more stable and better predictable.

X.3.2. Learning Phase for RT Mode

As the allocation is based on average and maximum execution times on the macro-block level (see VII.3.4 Mapping of MD-LLV1 Decoder to HRTA on p. 216), the time consumed in the different timeslices was measured. This was done by setting the framerate down to a value at which all MBs in the highest quality level (up to QEL3) could easily be processed, such that the reserved processing time was even ten times bigger than the average case, e.g. for CIF and QCIF videos this was 10 fps (i.e. a period of 100 ms) and for PAL video 4 fps (i.e. a period of 250 ms). Of course, a waste of idle resources took place in such a configuration. The period had the following timeslices assigned (as defined in VII.3.2 Timeslices in HRTA): 30% of the period for the base timeslice, the next 30% for the enhancement timeslice, 20% for the cleanup timeslice, and the remaining 20% for the idle non-RT part (i.e. waiting for the next period).
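This split can be expressed by a small helper; the following is only a sketch with assumed names, using the 30/30/20/20 percentages stated above.

    typedef struct {
        double base_ms, enhance_ms, cleanup_ms, idle_ms;
    } TimesliceSplit;

    /* Learning-phase allocation: the period derives from the (deliberately low)
     * framerate and is divided 30/30/20/20 among the timeslices. */
    static TimesliceSplit learning_phase_split(double fps)
    {
        double period_ms = 1000.0 / fps;       /* e.g. 10 fps -> 100 ms, 4 fps -> 250 ms */
        TimesliceSplit s;
        s.base_ms    = 0.30 * period_ms;       /* base (mandatory) timeslice       */
        s.enhance_ms = 0.30 * period_ms;       /* enhancement (optional) timeslice */
        s.cleanup_ms = 0.20 * period_ms;       /* cleanup (delivery) timeslice     */
        s.idle_ms    = 0.20 * period_ms;       /* remaining non-RT (idle) part     */
        return s;
    }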

The time really consumed by each timeslice has been measured per frame and normalized per MB according to the number of MBs specific to each resolution (e.g. QCIF => 99 MBs) in order to allow a comparison of different resolutions. This average time per MB for each frame is shown in Figure 95 for the base timeslice, in Figure 96 for the enhancement timeslice, and in Figure 97 for the cleanup timeslice. Please notice that for the enhancement timeslice (Figure 96) the time is measured per MB but across all quality enhancement layers (up to QEL3), i.e. the given time is the sum of the times in QEL1, QEL2 and QEL3 per one MB and leads to the lossless video data.


Figure 95. Normalized average time per MB for each frame consumed in the base timeslice (based on [Wittmann, 2005]).


Figure 96. Normalized average time per MB for each frame consumed in the enhancement timeslice (based on [Wittmann, 2005]).

As can be seen in all three figures above, the average execution time per MB for the same video does not vary much across frames. There are some minor deviations in the curves, but in general the curves are almost constant for each video. Still, there is a noticeable difference between the videos, but it cannot be deduced that this difference is directly related to the resolution, as one could expect.



Figure 97. Normalized average time per MB for each frame consumed in the cleanup timeslice (based on [Wittmann, 2005]).

To investigate the existing differences in more detail, the average and maximum times have been calculated, and the difference between them has been expressed as a percentage of the average. The detailed results of the execution time over all frames of a given video are listed in Table 10. The smallest difference between average and maximum time (Δ%) can be noticed for the cleanup timeslice. For the enhancement timeslice the difference between the average case and the worst case is bigger, and for the base timeslice the differences are the biggest, reaching up to 20% of the average time. On the other hand, a worst-case time being only 20% bigger than the average case is already a very good achievement considering video decoding and its complexity.

Time per MB [µs]
Video Sequence             Base TS                  Enhancement TS           Cleanup TS
Name                       avg     max     Δ%       avg     max     Δ%       avg     max     Δ%
container_cif              17.39   18.23    4.83%   39.15   40.71    3.98%   14.51   14.80   2.00%
container_qcif             18.97   20.69    9.07%   45.97   49.95    8.66%   12.50   12.74   1.92%
mobile_qcif                29.35   30.35    3.41%   54.10   55.31    2.24%   12.67   12.99   2.53%
mother_and_daughter_qcif   19.47   23.24   19.36%   45.80   48.65    6.22%   13.10   13.17   0.53%
shields_itu601             20.33   23.92   17.66%   39.20   45.26   15.46%   14.25   14.88   4.42%

Table 10. Time per MB for each sequence: average for all frames, maximum for all frames, and the difference (max-avg) in relation to the average (based on [Wittmann, 2005]).


There are, however, differences between the videos, e.g. the average case for Container differs much from Mobile (both in QCIF), i.e. 18.97 µs vs. 29.35 µs. This can be explained by the different number of coded blocks in the videos: in Mobile roughly 550 blocks (out of 594, which is 6 blocks·99 MBs) have been coded per frame for the base layer, whereas in Container only about 350 blocks have been coded, while the applied normalization considered the resolution (i.e. the constant number of MBs) and not the really coded blocks.

X.3.3. Real-time Working Mode

Now, with the average and maximum times measured in the learning phase, an allocation is possible for each video. The framerate can be set to an appropriate level in accordance with the user request and the capabilities and characteristics of the decoding algorithm. When the framerate is set gradually higher, the quality naturally drops, because the timeslice for the enhancement-layer processing becomes smaller, i.e. the optional part consumes less and less time. There is, however, a limit where the framerate cannot be raised anymore and the allocation is refused in order to guarantee the base-layer quality being the lowest acceptable quality (for details see the check condition given by Equation (66) on p. 217). The measurements of the percentage of processed MBs for the different enhancement layers are depicted in Figure 98 for Mobile CIF, in Figure 99 for Container QCIF and in Figure 100 for Parkrun ITU601 (PAL).
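Equation (66) itself is not reproduced here; the following sketch (all names are hypothetical) only illustrates the general shape of such an admission test: the allocation is accepted only if even the worst-case base-layer time of a frame fits into the mandatory timeslice derived from the requested framerate.

    /* Hypothetical sketch of an admission test in the spirit of the feasibility
     * check: refuse the allocation if the worst-case base-layer decoding of one
     * frame does not fit into the mandatory timeslice derived from the framerate. */
    static int admit_framerate(double fps, int mbs_per_frame,
                               double worst_case_base_us_per_mb,
                               double base_share)   /* e.g. 0.30 of the period */
    {
        double period_us          = 1e6 / fps;
        double base_timeslice_us  = base_share * period_us;
        double worst_case_base_us = worst_case_base_us_per_mb * mbs_per_frame;
        return worst_case_base_us <= base_timeslice_us;   /* 1 = accept, 0 = refuse */
    }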


Figure 98. Percentage of decoded MBs for enhancement layers for Mobile CIF with increasing framerate (based on [Wittmann, 2005]).



Figure 99. Percentage of decoded MBs for enhancement layers for Container QCIF with increasing framerate (based on [Wittmann, 2005]).


Figure 100. Percentage of decoded MBs for enhancement layers for Parkrun ITU601 with increasing framerate (based on [Wittmann, 2005]).

For framerates small enough (e.g. 29 fps or less for Mobile CIF, 122 fps for Container QCIF and 6 fps for Parkrun ITU601), the complete video can be decoded with all enhancement layers, achieving a lossless reconstruction of the video. For higher framerates the quality has to be adapted by leaving out the remaining unprocessed MBs of the enhancement layers. Thus the quality may be changed not only according to the levels of certain layers (such as the PSNR values for each enhancement layer presented already in section V.6.2, where the levels achieve about 32 dB, 38 dB, 44 dB and ∞ dB respectively, with a difference of roughly 6 dB), but fine-grain scalability is also achievable. Since the LLV1 enhancement algorithm is based on bit planes, the amount of processed bits from the bit plane (deriving from the number of processed MBs) is linearly proportional to the gained quality expressed in PSNR values, assuming an equal distribution of error-bit values in the bit plane. For example, the framerate of 36 fps for Mobile CIF allows achieving more than 44 dB, because the complete QEL2 and 5% of QEL3 are decoded, and if the framerate targets 40 fps, then roughly 41 dB is obtained (QEL1 completely and about 50% of QEL2).
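Under the equal-distribution assumption, the expected quality can thus be approximated by a simple interpolation over the partially processed layer; the following sketch uses the approximate per-layer PSNR values quoted above and is an illustration only, not part of the RT-MD-LLV1 code.

    #include <math.h>

    /* Approximate decoded quality (PSNR in dB) for 'complete_layers' fully processed
     * enhancement layers (0..3) plus 'fraction_of_next' (0..1) of the next layer.
     * ~6 dB per full layer; a partially decoded QEL3 yields only a lower bound,
     * since QEL3 leads to lossless reconstruction. */
    static double approx_psnr(int complete_layers, double fraction_of_next)
    {
        const double layer_psnr[3] = { 32.0, 38.0, 44.0 };  /* BL only, +QEL1, +QEL2 */

        if (complete_layers >= 3)
            return INFINITY;                     /* all enhancement layers -> lossless */
        double psnr = layer_psnr[complete_layers];
        if (complete_layers < 2)
            psnr += 6.0 * fraction_of_next;      /* linear gain within QEL1 or QEL2 */
        return psnr;                             /* for partial QEL3: lower bound 44 dB */
    }
    /* approx_psnr(1, 0.50) ~= 41 dB (QEL1 complete, ~50% of QEL2, cf. 40 fps above) */
    /* approx_psnr(2, 0.05)  > 44 dB (QEL2 complete,   5% of QEL3, cf. 36 fps above) */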

If the framerate is set too high, the feasibility check according to Equation (66) will fail, because the base layer cannot be processed completely anymore. Such a situation occurs for the given test bed when setting the framerate to 61 fps for Mobile CIF, and respectively to 291 fps for Container QCIF and 15 fps for Parkrun ITU601. Consequently, the allocation is refused for such framerates, since the LQA cannot be guaranteed. Of course, the user is to be informed about this fact deriving from the lack of resources.

X.4. Evaluation of RT-MD-XVID

X.4.1. Learning Phase for RT-Mode

The learning phase has been implicitly discussed in the Precise time prediction section, i.e. the measurements have been conducted to find out the calculation factors for each of the mentioned prediction methods. The graphs depicting those measurements are given in: Figure 68, Figure 69, Figure 72, Figure 75, Figure 76, Figure 80, Figure 81 and Figure 82. However, no quality aspects have been investigated there, neither with respect to the duration of the mandatory timeslice nor with respect to the requested lowest quality acceptable (LQA). Thus some more benchmarks have been conducted.

At first the complete coding (without loss of any MB) of the videos has been conducted. The implemented RT-MD-XVID has been used for this purpose; however, no time restrictions have been specified, i.e. the best-effort execution mode has been used such that neither timeslice overruns nor deadline misses nor encoding interruptions have occurred. The goal was to measure the average encoding time per frame with the minimum and maximum values, and to find out the relevant deviations. The minimum value represents the fastest encoding and reflects the I-frame encoding time, while the maximum value represents the slowest encoding and, depending on the use of B-frames, indicates the P- or B-frames. The results are depicted in Figure 101.

[Data shown in Figure 101 – average encoding time per frame and its deviation, both in ms:
Carphone QCIF (IP): 8.60 (0.44); Carphone QCIF (IPB): 8.91 (0.47); Coastguard CIF (IP): 31.12 (0.72);
Football CIFN (IP): 27.14 (0.80); Foreman QCIF (IP): 8.86 (0.36); Mobile QCIF (IP): 9.17 (0.26);
Mobile CIFN (IP): 28.46 (1.18)]

Figure 101. Encoding time per frame of various videos for RT-MD-XVID: a) average and b) deviation.

The deviation is on the level of 2.3% to 5.2%. An interesting fact is that for higher resolutions peaks occur (see the maximum values), but the overall execution is more stable (with a smaller percentage deviation). In any case, the real-time implementation of the MD-based encoding is, in comparison to best-effort standard encoding (see Figure 9), much more stable and predictable, since the deviation is on a lower level (up to 5.2% versus 6.6% to 24.4%) and the max/min ratio is noticeably lower (the slowest frame requires 1.2 to 1.33 times more processing than the fastest frame, in contrast to 1.93 to 4.28 times).

Although the above results prove the positive influence of the MD on processing stability and thus allow using the average coding time in the prediction, it is most interesting to see whether the time constraints can be met during the real-time processing. If it is assumed that a one-CPU system is given (e.g. the one used for the above tests – see section XVIII.3 in Appendix E), that the encoding may use only half of the CPU’s processing time (the other half is meant for decoding and other operations), and that the frame rate of the test videos is equal to 25 fps, then there are only 20 ms per frame available (out of the 40 ms period). Consequently, only the QCIF encoding can be executed within such a time without any loss of MBs (complete encoding). In the other cases (CIF), the computer is not powerful enough to encode the videos completely in real time. The situation is analogous for the sequences with the higher frame rate (CIFN), i.e. the available 50% of the CPU time maps to 16.67 ms out of the 33.33 ms period (due to 30 fps). Then the test-bed machine is also not capable of encoding the complete frame, so quality losses are indispensable.
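The per-frame budget used in this argument follows directly from the framerate and the CPU share reserved for encoding; the trivial helper below only makes the arithmetic explicit.

    /* Available encoding time per frame in ms for a given framerate and CPU share. */
    static double encode_budget_ms(double fps, double cpu_share)
    {
        return cpu_share * (1000.0 / fps);
        /* 0.5 * (1000/25) = 20.00 ms,  0.5 * (1000/30) = 16.67 ms */
    }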

Thus, the next investigation covers the analysis of the time required for a specified quality expressed by a given amount of processed MBs. As already mentioned in Mapping of MD-XVID Encoder to HRTA Converter Model (section VII.3.5), the mandatory and cleanup timeslices are calculated according to the worst-case processing time selected as the maximum of the frame-type-dependent default operations as defined in Equations (70) and (71). Since the time required for the default operations is the smallest for I-frames and the biggest for B-frames, there is always some idle time in the base timeslice for processing at least some of the MBs (see relaxation condition (73)).


Figure 102. Worst-case encoding time per frame and achieved average quality vs. requested Lowest Quality Acceptable (LQA) for Carphone QCIF.


On the other hand, this definition allows delivering all frames, but it may happen that the B-frames have no MBs coded at all, the P-frames have only a few MBs coded, and the I-frames a few more than the P-frames. Since the MB coding is generally treated as optional, there is no possibility to define a minimum quality per frame that should be delivered. Thus the worst-case execution time has been measured per LQA defined on different levels, i.e. different amounts of MBs to be always coded have been requested and the worst encoding time per frame has been measured. This worst-case time is depicted in Figure 102 (right scale). The quality steps correspond to every 10% of all MBs (as given on the X-axis). Having such LQA-specific worst-case times, guarantees of delivering the requested LQA can be given. Please note that the values for the video with P-frames only are lower than for the one with P&B-frames.

Now, if the measured worst-case time of a given LQA is used as the basis for the base timeslice calculation, then the achieved quality will definitely be higher (of course assuming the relaxation condition, which allows moving some of the MB-coding part into the base timeslice). To verify the really obtained quality with guaranteed LQA, the encoding was executed once again, not with the base timeslice calculated according to (70) but with the timeslice set to the measured LQA-specific worst-case execution time. The really achieved quality is depicted in Figure 102 (left scale). It is clearly visible that the obtained quality is higher than the LQA. Moreover, the bigger the requested LQA is, the smaller the differences between the achieved quality and the LQA are, e.g. for LQA=90% the achieved quality is equal to 95% for the P-only as well as the P&B encoded video, and for LQA=10% the P-only video achieves 28% and the P&B encoding achieves 40%.

X.4.2. Real-time Working Mode

Finally, the time-constrained execution fully exploiting the characteristics of the HRTA converter model has to be conducted. Here the encoding is driven by the available processor time and by the requested quality. Of course, the quality is to be mapped onto the predicted time according to the learning phase, and it must be checked whether the execution is feasible at all. For the following tests, the feasibility check was assumed to be always positive, since the goal was to analyze the quality drop with respect to the time constraints (analogically to the previously tested RT-MD-LLV1). Moreover, not all but only the most interesting configurations of the timeslice division are depicted; namely, for the investigated resolutions the respective mandatory and optional timeslices have been chosen such that the ranges depict the quality-drop phase for the first fifty frames of each video sequence. In other words, timeslices too big, where no quality drop occurs, or too small, where none of the MBs is processed, are omitted. Additionally, the cleanup time is not included in the graphs, since its worst-case assumption allowed finishing all frames correctly. The results are depicted in Figure 103, Figure 104, and Figure 105.

Figure 103 shows two sequences with QCIF resolution (i.e. 99 MBs) for two configurations of mandatory and optional timeslices:

1) t_base_ts = 6.6 ms and t_enhance_ts = 3 ms

2) t_base_ts = 5 ms and t_enhance_ts = 3 ms

The first configuration was prepared under the assumption of running three encoders in parallel within the limited period size of 20 ms, and the second one assumes the parallel execution of four encoders. As can be seen, the RT-MD-XVID is able to encode on average 57.2 MBs for Mobile QCIF and 65.4 MBs for Carphone QCIF in the first configuration, with standard deviations of 2.3 and 5.0 MBs respectively. This results in processing about 58% and 66% of all MBs, which according to [Militzer, 2004] should yield a PSNR quality of about 80% to 85% in comparison to the completely coded frame. If the defined enhancement timeslice is used as well, the encoder processes all MBs for both video sequences.

For the second configuration, the RT-MD-XVID encodes on average 21.4 MBs of Mobile QCIF and 24.6 MBs of Carphone QCIF. The standard deviations amount to 1.2 MBs for both videos, which is also visible in the flattened curves. Thus, processing only 21% and 25% of all MBs respectively results in a very low but still acceptable quality [Militzer, 2004]105. And even if the enhancement part is used, the encoded frame will sometimes (Carphone QCIF) or always (Mobile QCIF) have some MBs still not coded. Based on the above results, an acceptable quality range of the encoding time having a minimum of 5 ms and a maximum of 9.6 ms can be derived.

105 The minimal threshold of 20% of all MBs shall be considered for frame encoding.


Figure 103. Time-constrained RT-MD-XVID encoding for Mobile QCIF and Carphone QCIF.

Analogically to the QCIF resolution, the evaluation of the RT-MD-XVID encoding has been conducted for the CIFN (i.e. 330 MBs) and CIF (i.e. 396 MBs) resolutions, but here it was assumed that only one encoder works at a time. So the test timeslices are defined as follows:

1) for CIFN: t_base_ts = 16.6 ms and t_enhance_ts = 5 ms

2) for CIF: t_base_ts = 20 ms and t_enhance_ts = 10 ms

The results are depicted in Figure 104. The achieved quality was better for the CIF than for the CIFN sequence, i.e. the number of coded MBs reached 149.8 MBs for Coastguard CIF vs. 86.8 MBs for Mobile CIFN (respectively 37.8% and 26.3%), and the standard deviation was on the level of 9.9 and 4.8 MBs. The disadvantage of CIFN derives directly from the smaller timeslice, which is caused by the additional costs of having 5 more frames per second. On the other hand, the content may also influence the results. In any case, the encoder was not able to encode all MBs for either video even when the defined enhancement timeslice was used. In the case of Coastguard CIF, 12% to 21% of all MBs per frame were missing, but only for the first few frames.


In contrast, the encoding of Mobile CIFN produced only 38% to 53% of all MBs per frame along the whole sequence.

Figure 104. Time-constrained RT-MD-XVID encoding for Mobile CIFN and Coastguard CIF.

The last experiment has targeted the processing of B-frames. Therefore the RT-MD-XVID has been executed with the additional option -use_bframes. Assumptions analogous to the QCIF case without B-frames have been made, i.e.:

1) t_base_ts = 6.6 ms and t_enhance_ts = 3 ms

2) t_base_ts = 5 ms and t_enhance_ts = 3 ms

The results of the RT-MD-XVID encoding for two QCIF sequences are depicted in Figure 105. The minimal threshold of processed MBs of Carphone QCIF in the mandatory part has been reached only for the I- and P-frames and only for the bigger timeslice (first configuration), but the additional enhancement allowed finishing all MBs for all frames. On average, 45.1 MBs have been processed within the base timeslice; however, there is a big difference between the frame-related processing times, expressed through the standard deviation on the high level of 19.8 MBs. Such behavior may yield noticeable quality changes in the output.

Figure 105. Time-constrained RT-MD-XVID encoding for Carphone QCIF with B-frames.

The execution of the RT-MD-XVID with the second configuration ended up with even worse results. Here the mandatory timeslice was not able to deliver any MBs for the B-frames. If the processing of the other frame types were treated as a separate subset, it would be comparable to the previous results (Figure 103). In reality, however, the average is calculated over all frames, and thus the processing of the mandatory part is unsatisfactory (i.e. an average of 12.2 processed MBs leads to a quality as low as 12.4%). Using the enhancement part allows processing 84 MBs (84.4%) on average, but with a still high standard deviation of 15.3 MBs (15.4%).

As a result, the use of B-frames is not advised, since the jitter in the number of processed MBs results in quality deviations, which contradicts the assumption of delivering a quality level as constant as possible; humans classify fluctuations between good and low quality as lower quality even if the average is higher [Militzer et al., 2003].


Considering all investigated resolutions and frame rates, the processing complexity of the CIF and CIFN sequences marks the limits of real-time MD-based video encoding, at least for the used test bed. The rejection of B-frame processing makes the processing simpler, more predictable and more efficient; nevertheless, it is still possible to use B-frame processing in some specific applications requiring higher compression at the cost of higher quality oscillations or of additional processing power.


XI. COROLLARIES AND CONSEQUENCES

XI.1. Objective Selection of Application Approach based on Transcoding Costs

Before going into details about specific fields of application, some general remarks on the format-independence approach are given. As stated in Related Work, there are three possible solutions to provide at least some kind of multi-format or multi-quality media delivery. It is an important issue to recognize which method is really required for the considered application, and some research has already been done in this direction. In general, two aspects are always considered: dynamic and static transformations. The dynamic transformation refers to any type of processing (e.g. transcoding, adaptation) done on the fly, i.e. during the real-time transmission on demand. The static approach considers the off-line preparation of the multimedia data (regardless of whether it is a multi-copy or a scalable solution).

Based on those two views, the trade-off between storage and processing is investigated. In [Lum and Lau, 2002], a method is proposed which finds an optimum considering the CPU processing (i.e. dynamic) cost, also called the transcoding overhead, and the I/O storage (i.e. static) cost of pre-adapted content. The hybrid model selectively prepares a subset of quality variants of the data and leaves the remaining qualities to the dynamic algorithm, which uses this prepared subset as a base for its calculations. The authors proposed a content-adaptation framework allowing for content negotiation and delivery realization; the latter is responsible for adapting the content by employing the transcoding relation graph (TRG) with transcoding costs on the edges and data versions on the nodes, a modified greedy-selection algorithm supporting time and space, and a simple linear cost model [Lum and Lau, 2002], which is defined as:

t_j = m × |v_i| + c    (75)

where t_j is the processing time of version j of the data transformed from version i (v_i), c is the fixed overhead of synthesizing any content (independent of the content size), m is the transcoding time per unit of the source content, and |·| is the size operator. The authors claim that the algorithm can be applied to any type of data; however, in the case of audio and video the cost defined as in Equation (75) will not work, because the transcoding costs are influenced not only by the amount of data but also by the content. Moreover, neither the frequency of data use (popularity) nor the storage bandwidth for accessing the multimedia data is considered.
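For reference, the linear cost model of Equation (75) amounts to the following trivial computation (a sketch only; the parameters m and c would have to be calibrated per conversion):

    /* Linear cost model of Equation (75): t_j = m * |v_i| + c */
    static double transcoding_time(double m_time_per_unit,
                                   double source_size,
                                   double c_fixed_overhead)
    {
        return m_time_per_unit * source_size + c_fixed_overhead;
    }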

Another solution is proposed in [Shin and Koh, 2004], where the authors consider the skewed access patterns to multimedia objects (i.e. the frequency of use, or popularity) and the storage costs (bandwidth vs. space). The solution is validated by simulating a QoS-adaptive VoD streaming service with Poisson-distributed requests and the popularity represented by a Zipf distribution using various parameters (from 0.1 to 0.4). However, it does not consider the transcoding costs at all.

Based on the two mentioned solutions, a best-suited cost algorithm could be defined in analogy to the first proposal [Lum and Lau, 2002] and applied to audio and video data; but first, both storage size and bandwidth have to be considered as in [Shin and Koh, 2004], the transcoding cost has to be refined, and the popularity of the data should be respected. Then the applicability of the real-time transcoding-based approach could be evaluated based not only on subjective requirements but also on objective measures. Since this was not in the scope of the RETAVIC project, the idea is left for further research.

XI.2. Application Fields

The RETAVIC architecture, if fully implemented, could find application in many fields. Only some of these fields are discussed within this section, and the focus is placed on the three most important aspects.

The undoubtedly most important application area is the archiving of multimedia data for the purpose of long-term storage, in which the group of potential users is represented by TV producers and national and private broadcasting companies, including TV and radio. Private and national museums and libraries also belong to this group, as they are interested in storing high-quality multimedia data, where the separation of the internal format from the presentation format could allow for very long-term storage. The main interest here is to provide the available collections and assets in a digital form through the Internet to usual end-users with average-to-high quality expectations, or through intranets or other distribution media to researchers in archeology, fine arts and other fields.

A second possible application targets scientific databases and simulations. Here the generated high-quality video precisely demonstrating processor-demanding simulations can be stored without loss of information. Such videos can be created by various users from chemical, physical or bio-engineering laboratories, gene research, or other modeling and development centers. Of course, the costs of conducting the calculations and simulations should be much higher than the costs of recording and storage. The time required for calculating the final results should also be considered, since the stored multimedia data can be accessed relatively fast.

A very similar application area comprises multimedia databases for industrial research, manufacturing and applied sciences. The difference to the previous fields is that the multimedia data come not from artificial simulations but from real recordings, which are usually referred to as natural audio-video. A first example of using various data qualities is the medical domain, in which an advanced, untypical and complex surgery is recorded without loss of information and then distributed among physicians, medical students and professors. Another example is the recording of scientific experiments where the execution costs are very high, so that the lossless audio-video recording is critical for further analysis. Here, microscope and telescope observations, where three high-resolution cameras deliver the RGB signals separately, are considered significant. Yet another example is the application to industrial inspection, namely a critical system requiring short but very high-quality periodic observations. Finally, industrial research testing new prototypes may require very high quality in very short periods of time by employing high-speed, high-resolution cameras. Of course, the data generated by such cameras should be kept in a lossless state in order to combine them with other sensor data.

Some other general application fields without specific examples can also be found. Among them, a few interesting ones are worth listing:

• Analysis of fast-moving or explosive scenes
  - Analysis of fast-moving machine elements
  - Optimization of manufacturing machines
  - Tests of explosive materials
  - Crash test experiments
  - Airbag developments

• Shock and vibration analysis

• Material and quality control
  - Material forming analysis
  - Elastic deformations

• High-definition promotional and slow-motion recordings for movies and television

Finally, a partial application in video-on-demand systems can be imagined as well. Here, however, not the distribution chain including network issues like caching or media proxies is meant, but rather the high-end multimedia data centers, where the multimedia data are prepared according to the quality requirements of a class of end-user systems (which is represented as one user exposing specific requirements). Of course, there may also be a few classes of devices interested in the same content during multicasting or broadcasting. However, for a VoD system based on unicast communication, the RETAVIC architecture is not advised at all.

XI.3. Variations of the RETAVIC Architecture

The RETAVIC architecture, beside its direct application in the mentioned fields, could be used as a source for other variants. Three possible ideas are proposed here. The first one is to use the RETAVIC architecture not as an extension of the MMDBMS but just for media servers, providing them with additional functionality like audio-video transcoding. Such a conversion option could be a separate extra module. It could work with the standard format of the media server (usually one lossy format), which can be treated as the internal storage format in RETAVIC. Then only the format-specific decoding would have to be implemented as described in Evaluation of storage format independence (p. 86).


Next, a non-real-time implementation of the RETAVIC architecture could be possible. Here, however, OS-specific scheduling functionality analogous in its behaviour to QAS [Hamann et al., 2001a] has to be guaranteed. Moreover, precise timing functionality allowing the HRTA-compliant converters to be controlled should be provided. For example, [Nilsson, 2004] discusses the possibility of using fine-grained timing under the MS Windows NT family106, in which the timing resolution goes beneath the standard timer with its 10 ms intervals, first to the level of 1 ms and then to 100 ns units. Besides timing, additional issues such as tick frequency, synchronization, protection against system time changes, interrupts (IRQs), thread preemptiveness, and the avoidance of preemption through thread priority (which is limited to a few possible classes) are also discussed [Nilsson, 2004]. Even though the discussed OS could provide an acceptable time resolution for controlling HRTA converters, the scheduling problem still requires additional extensions.

Finally, a mixture of the mentioned variations is also possible, i.e. a direct application on the media-server-specific OS. Then the need for an RTOS could be eliminated. Obviously, the real-time aspects would have to be supported by additional extensions in the best-effort OS, analogously to the discussion of the previous variation.

106 In the article this covers MS Windows NT 4.0, MS Windows 2000, and MS Windows XP.


Chapter 6 – Summary

If we knew what it was we were doing, it would not be called research, would it? Albert Einstein

XII. CONCLUSIONS

The research on format-independence provision for multimedia database systems conducted during the course of this work has brought many answers but even more questions. The main contribution of this dissertation is the RETAVIC architecture, exploiting meta-data-based real-time transcoding and lossless scalable formats that provide quality-dependent processing. RETAVIC has been discussed in many aspects, of which the design is the most important part; however, the other aspects such as implementation, evaluation, and applications have not been neglected.

The system design included the requirements, the conceptual model, and its evaluation. The video and audio processing models covered the analysis of codec representatives, the statement of assumptions, the specification of static and continuous media-type-related meta-data, the presentation of the peculiar media formats, and the evaluation of the models. The need for a lossless, scalable binary format led to the Layered Lossless Video format (LLV1) [Militzer et al., 2005] being designed and implemented within this project. The attached evaluation proved that the proposed internal formats for lossless media data storage are scalable in data quality and in processing, and that meta-data-assisted transcoding is an acceptable solution with lower complexity and still acceptable quality. The real-time processing part discussed aspects connected with multimedia processing in respect of its time-dependence and with the design of real-time media converters, including the hard-real-time adaptive converter model, the evaluation of three proposed prediction methods, and the mapping of the best-effort meta-data-based converters to HRTA-compliant converters.

The implementation part depicted the key elements of the programming phase, covering the pseudo code for the important algorithms and the most critical source code examples. Additionally, it described the systematic course of porting source code from a best-effort to an RTOS-specific environment, which has been referred to as a guideline for porting the source code of any type of converter implemented in a best-effort system such as Linux or Windows to DROPS.

The evaluation has proved that time-constrained media transcoding executed under the DROPS real-time operating system is possible. The prototypical real-time implementation of the critical parts of the transcoding chain for video (the real-time video coders) has been evaluated in respect of functional, quantitative and qualitative properties. The results have shown the possibility of controlling the decoding and encoding processes according to the quality specification and the workload limitation of the available resources. The workload limits of the test-bed machine were already reached when processing sequences in CIF resolution.

Finally, this work delivered the analysis of requirements for the internal media storage format for audio and video, the review of format-independence support in current multimedia management systems (incl. MMDBMSs and media servers), and the discussion of various transcoding architectures potentially applicable for format-independent delivery.


Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world. Albert Einstein (1929, Interview by George S. Viereck in Philadelphia Saturday Evening Post)

XIII. FURTHER WORK

There are a few directions in which further work can be conducted. One is to improve the audio and video formats themselves, e.g. their compression efficiency or processing optimization. Another can be the refinement of the internal storage within the RETAVIC architecture – here proposals of new formats could be expected. Yet another could be the improvement of the RETAVIC architecture itself.

An extension of the RETAVIC architecture is a different aspect than a variant of the architecture. An extension is meant here as an improvement, enhancement, upgrade, or refinement. As such, one enhancement can be proposed in the real-time delivery phase in the direct processing channel, namely a module for bypassing static and continuous MD. Currently, the RETAVIC architecture sends only the multimedia data in the requested format, but it may assist in providing other formats through intermediate caching proxies. Such an extension could be referred to as Network-Assisted RETAVIC, and it would allow pushing the MD-based conversion to the borders of the network (e.g. to media gateways or proxies). For example, the MD could be transmitted together with the audio and video streams in order to allow building converting proxies that are cheaper in processing, since they no longer have to use worst-case scheduling for transcoding. The proxies should be built analogously to the real-time transcoding of the real-time delivery phase (with all the issues discussed in the related sections). Of course, there is an obvious limitation of such a solution – it is not applicable to live transmission (which may not be an issue for current proxies working with the worst-case assumption) until there is a method for predictable real-time MD creation. Moreover, there are two evident disadvantages:


1) the load in the network segment between the central server and the proxies will be higher, and 2) a more sophisticated and distributed management of the proxies is required (e.g. changing the internal format, extending the MD set, adding or changing supported encoders). On the other hand, the transcoding may be applied closer to the client and may reduce the load on the central server. This is especially important if there is a class of clients in the same network segment having the same format requirements.

Another possible extension is the refinement of the internal storage format by exchanging the old and no longer efficient format for a newer and better one, as described in Evaluation of storage format independence. For example, MPEG-4 SVC [MPEG-4 Part X, 2007] could be applied as the internal storage format for video. The WavPack/Vorbis hybrid investigated in [Penzkofer, 2006] could be applied as the internal audio format in systems where only a lower level of audio scalability is expected (just two layers).

The last aspect mentioned in this section, which could be investigated in the future, is an extension of the LLV1 format. Here two functional changes and one processing optimization can be planned: a) new temporal layering, b) new quantization layering, and c) an optimization in the decoding of the lossless stream.

Figure 106. Newly proposed temporal layering in the LLV1 format.

The first functional change covers a new proposal of temporal layering, which is depicted in Figure 106. The idea behind it is to use P-frames instead of B-frames (due to the instability in the processing of B-frames) in the temporal enhancement layer in order to obtain a smoother decoding process and thus a better prediction of the real-time processing. This, however, may introduce some losses in the compression efficiency of the generated bitstream. Thus the trade-off between the coding efficiency and the gain in processing stability should be investigated in detail.

The next change in the LLV1 algorithm refers to a different division into quantization enhancement layers. The cross-layer switching of coefficients between the enhancement bitplanes allows the most important DC/AC coefficients to be reconstructed first. Since the coefficients produced by the binDCT are ordered by the zig-zag scan according to their importance, it could be possible to realize an analogous 3-D zig-zag scanning across the enhancement bitplanes, also according to coefficient importance; namely, it can work as follows:

• In the current LLV1 bitplane, the stored value represents, for each coefficient, the difference to the next quantization layer.

• Assume the first three values from each enhancement layer are taken, namely: c1^QEL1, c2^QEL1, c3^QEL1, c1^QEL2, c2^QEL2, c3^QEL2, c1^QEL3, c2^QEL3, c3^QEL3.
• Then:
  ◦ the first values of each layer, c1^QEL1, c1^QEL2, c1^QEL3, are ordered as the first three values of the new first enhancement layer: c1^QEL1, c2^QEL1, c3^QEL1;
  ◦ the next three values at the second position of each layer, c2^QEL1, c2^QEL2, c2^QEL3, are ordered as the second-next three values of the new first enhancement layer: c4^QEL1, c5^QEL1, c6^QEL1;
  ◦ the next three values at the third position of each layer, c3^QEL1, c3^QEL2, c3^QEL3, are ordered as the third-next three values of the new first enhancement layer: c7^QEL1, c8^QEL1, c9^QEL1.
• Next, groups of three values, one from each layer (ci^QEL1, ci^QEL2, ci^QEL3), are always taken and assigned respectively to the next elements of the current QEL.
• If the current QEL is complete, the values are assigned to the next QEL.

Such a reorganization would definitely raise the quality with respect to the number of coded bits (assuming the coefficient-importance assumption holds), i.e. it would raise the coding efficiency if certain quality levels are considered. In other words, the data quality is distributed not linearly (as now) but with the important values accumulated at the beginning of the whole enhancement part. This method, however, increases the complexity of the coding algorithm, since there is no longer a clear separation into one-layer-specific processing. The algorithm shall be further investigated to prove the data quality and processing changes.
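To make the proposed cross-layer reordering concrete, a minimal sketch is given below. It assumes three equally long QELs holding per-coefficient difference values in zig-zag order; all names are illustrative and do not come from the LLV1 sources.

    /* Sketch of the proposed 3-D zig-zag cross-layer reordering: the i-th value
     * of every enhancement layer is emitted before the (i+1)-th value of any
     * layer, so the new first QEL accumulates the most important differences. */
    #include <stddef.h>

    #define NUM_QELS 3

    /* qel[l][i] - i-th difference value of enhancement layer l (input)
     * out[k]    - reordered value stream filling the new QELs one after another
     * coeffs    - number of coefficient positions per layer */
    static void reorder_cross_layer(const int *qel[NUM_QELS], int *out, size_t coeffs)
    {
        size_t k = 0;
        for (size_t i = 0; i < coeffs; i++)      /* coefficient position  */
            for (int l = 0; l < NUM_QELS; l++)   /* layer at that position*/
                out[k++] = qel[l][i];            /* emit c_i^QEL(l+1)     */
    }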

Finally, a simple processing optimization of the LLV1 decoding can be evaluated. Theoretically, if the lossless stream is requested and QEL3 is encoded (decoded), the quantization (inverse quantization) step may be omitted completely. This follows from the format assumption, i.e. the last enhancement layer produces coefficients quantized with a quantization step equal to 1, which means that the quantized value is equal to the unquantized value. So, the (de-)quantization step is not required anymore. This introduces yet another case in the processing and was not checked during the development of LLV1.
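A minimal sketch of this decoder-side shortcut, with an illustrative block-reconstruction routine (the names are assumptions, not taken from the actual LLV1 code), could look as follows:

    /* When the lossless stream is requested (QEL3 decoded), the quantization
     * step is 1 by design, so dequantization would only copy the values. */
    static void dequantize_block(int *coeff, int n, int quant_step, int lossless_qel3)
    {
        if (lossless_qel3)          /* QEL3 present: quant_step == 1      */
            return;                 /* coefficients are already final     */

        for (int i = 0; i < n; i++) /* regular inverse quantization       */
            coeff[i] *= quant_step;
    }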


Appendix A

XIV. GLOSSARY OF DEFINITIONS

The list of definitions and terms is divided into three groups: data-related, processing-related and quality-related terms. Each group is ordered in a logical (not alphabetical) sequence, i.e. the most fundamental terms come first, followed by more complex definitions, such that a subsequent definition may refer to a previous term, but the previous terms are not based on the following ones.

XIV.1. Data-related Terms

• media data – text, image (natural pictures, 2D and 3D graphics, 3D pictures), audio (natural sound incl. human voice, synthetic), and visual (natural video, 2D and 3D animation, 3D video);
• media object – MO – a special digital representation of media data; it has a type, a content and a format (with structure and coding scheme);
• multimedia data – a collection of media data, where more than one type of media data is involved (e.g. movie, musical video-clip);
• multimedia object – MMO – a collection of MOs; it represents multimedia data in digital form;
• meta data – MD – a description of an MO or an MMO;


• quant107 (based on [Gemmel et al., 1995]) – a portion of digital data that is treated as one logical unit occurring at a given time (e.g. a sample, a frame, a window of samples such as 1024, or a group of pictures – GOP, but also a combination of samples and frames);
• timed data – a type of data that depends on time108, i.e. the data (e.g. a quant) is usable if and only if it occurs at a given point of time; in other words, a too early or too late occurrence makes the data invalid or unusable; other terms equivalent to timed within this work are: time-constrained, time-dependent;
• continuous data – time-constrained data ordered in such a way that continuity of the related parts of the data is present; they may be periodic or sporadic (irregular);
• data stream (or shortly stream) – a digital representation of continuous data; the most similar definition is from [ANS, 2001]: "a sequence of digitally encoded signals used to represent information in transmission";
• audio stream – a sequence of audio samples, i.e. a sequence of numerical values representing the magnitude of the audio signal at identical intervals. The direct equivalent is the pulse-code modulation (PCM) of the audio. There are also extensions of PCM such as Differential (or Delta) PCM (DPCM) or Adaptive DPCM (ADPCM), which represent not the value of the magnitude but the differences between these values. In DPCM, it is simply the difference between the current and the previous value; in ADPCM it is almost the same, but the size of the quantization step additionally varies (thus allowing a more accurate digital representation of small values in comparison to high values of the analog signal);
• video stream – a sequence of video frames; it may be a sequence of half-frames if the interlaced mode is required; by default, the full-frame (one picture) mode is assumed;
• continuous MO – an MO that has properties analogous to continuous data; in other words, it is a data stream in which the continuous data is exactly one type of media data;

107 Quanta is the plural of quant.
108 There is also another research area of database systems which discusses timed data [Schlesinger, 2004]. However, a different perspective on the "timed" issue is presented there and completely different aspects are discussed (global views in Grid computing and their problems with data coming from snapshots at different points in time).


• audio stream, video stream – used interchangeably for a continuous MO of type audio or of type video;
• multimedia stream – a data stream including quanta of more than one MO type; an audio-video (AV) stream is the most common case; it is also called a continuous MMO;
• container format – a (multi)media file format for storing media streams; it may be used for one or many types of media (depending on the container format specification); it may be designed for storage, e.g. RIFF AVI [Microsoft Corp., 2002c] and MPEG-2 Program Stream (PS) [MPEG-2 Part I, 2000], or optimized for transmission, e.g. ASF [Microsoft Corp., 2007b] or MPEG-2 Transport Stream (TS) with packetized elementary streams (PES) [MPEG-2 Part I, 2000];
• compression/coding scheme – a binary compressed/encoded representation of the media stream for exactly one specific media type; it is usually described by a four-character code (FOURCC) being a registered (or well-recognized) abbreviation of the name of the compression/coding algorithm;
• Lossless Layered Video One (LLV1) – a scalable video format having four quantization layers and two temporal layers, allowing a YUV 4:2:0 video source to be stored without any loss of information. The upper layer extends the lower layer by storing additional information, i.e. the upper layer relies on data from the layer below.

XIV.2. Processing-related Terms

• transformation – a process of moving data from one state to another (transforming) – the most general term; the terms conversion/converting are equivalents within this work; a transformation may be lossy, where loss of information is possible, or lossless, where no loss of information occurs;
• multimedia conversion – a transformation that refers to many types of media data; there are three categories of conversion: media-type, format and content changers;
• coding – "the altering of the characteristics of a signal to make the signal more suitable for an intended application (…)" [ANS, 2001]; decoding is the inverse process to coding; coding and decoding are conversions;


• converter – a processing element (e.g. a computer program) that applies a conversion to the processed data;
• coder/decoder – a converter used for coding/decoding; encoder also refers to a coder;
• codec – acronym for coder-decoder [ANS, 2001], i.e. an assembly consisting of an encoder and a decoder in one piece of equipment, or a piece of software capable of encoding to a coding scheme and decoding from this scheme;
• (data) compression – a special case (or a subset) of coding used for 1) increasing the amount of data that can be stored in a given domain, such as space, time, or frequency, or contained in a given message length, or 2) reducing the amount of storage space required to store a given amount of data, or reducing the length of message required to transfer a given amount of information [ANS, 2001]; decompression is an inverse process to compression, but not necessarily mathematically inverse;
• compression efficiency – a general non-quantitative term reflecting the efficiency of a compression algorithm, such that a more compressed output with a smaller size is understood as better algorithm effectiveness; it is also often referred to as coding efficiency;
• compression ratio109 – the uncompressed (origin) to compressed (processed) size; the bigger the value, the better; the compression ratio is usually bigger than 1; however, it may occur that the value is lower than 1 – in that case the compression algorithm is not able to compress anymore (e.g. if already-compressed data is compressed again) and should not be applied;
• compression size109 [in %] – the compressed (processed) to uncompressed (origin) size of the data multiplied by 100%; the smaller the value, the better; the compression size usually ranges between more than 0% and 100%; a value higher than 100% corresponds to a compression ratio lower than 1 (a small numerical illustration is given directly after this list);
• transcoding – according to [ANS, 2001] it is a direct digital-to-digital conversion from one encoding scheme to a different encoding scheme without returning the signals to analog form; however, within this work it is defined in a more general way as a

109 The term "compression rate" is not used here due to its ambiguity, i.e. in many papers it refers once to the compression ratio and elsewhere to the compression size. Moreover, the compression ratio and size are obviously properties of the data, but they derive directly from the processing, and so they are classified as processing-related terms.


conversion from one encoding scheme to a different one, where normally at least two different codecs have to be involved; it is also referred to as heterogeneous transcoding; other special cases of transcoding are distinguished in the later part of this work;
• transcoding efficiency – analogous to the coding efficiency defined before, but with respect to transcoding;
• transcoder – a device or system that converts one bit stream into another bit stream that possesses a more desirable set of parameters [Sun et al., 2005];
• cascade transcoder – a transcoder that fully decodes and then fully encodes the data stream; in other words, it is a decoder-encoder combination;
• adaptation – a subset of transcoding, where only one encoding scheme is involved and no coding scheme or bit-stream syntax is changed, e.g. an MPEG-2 to MPEG-2 conversion used for lowering the quality (i.e. bit rate reduction, spatial or temporal resolution decrease); it is also known as homogeneous or unary-format transcoding;
• chain of converters – a directed, one-path, acyclic graph consisting of a few converters;
• graph of converters – a directed, acyclic graph consisting of a few interconnected chains of converters;
• conversion model of a (continuous) MMO – a model provided for conversion independent of hardware, implementation and environment; the JCPS (event and data) [Hamann et al., 2001b] or the hard-real-time adaptive model (described later) are suggested to be most suitable here; "continuous" is often omitted due to the assumption that audio and video are usually involved in the context of this work;
• (error) drift – an erroneous effect in successively predicted frames caused by a loss of data, regardless of whether intentional or unintentional, causing a mismatch between the reference quant used for the prediction of the next quant and the original quant used before; it is defined in [Vetro, 2001] for video transcoding as "blurring or smoothing of successively predicted frames caused by the loss of high frequency data, which creates a mismatch between the actual reference frame used for prediction in the encoder and the degraded reference frame used for prediction in the transcoder or decoder";
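As a small numerical illustration of the compression ratio and compression size defined in the list above (the sizes below are made-up example values, not measurement results):

    /* Tiny worked example of the two measures defined above (illustrative only). */
    #include <stdio.h>

    int main(void)
    {
        double uncompressed = 30.0 * 1024 * 1024;   /* e.g. 30 MB of raw data   */
        double compressed   = 10.0 * 1024 * 1024;   /* e.g. 10 MB after coding  */

        double ratio = uncompressed / compressed;          /* compression ratio */
        double size  = compressed / uncompressed * 100.0;  /* compression size  */

        printf("compression ratio: %.2f (values > 1 are desirable)\n", ratio);
        printf("compression size : %.1f %%\n", size);      /* here: 3.00 and 33.3 % */
        return 0;
    }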


XIV.3. Quality-related Terms

• quality – a specific characteristic of an object which allows two objects to be compared (objectively or subjectively) and a statement to be made about which one has the higher level of excellence; usually it refers to an essence of an object; however, in computer science it may also refer to a set of characteristics; another common definition is the "degree to which a set of inherent characteristics fulfils requirements" [ISO 9000, 2005];
• objective quality – the quality that is measured by facts using quantitative methods, where the metric110 has an uncertainty according to metrology theory; the idea behind the objective measures is to emulate subjective quality assessment results by metrics and quantitative methods, e.g. for the psycho-acoustic listening test [Rohdenburg et al., 2005];
• subjective quality – the quality that is measured by the end user and heavily depends on his experience and perception capabilities; an example of a standardized methodology for subjective quality evaluation used in speech processing can be found in [ITU-T Rec. P.835, 2003];
• Quality-of-Service – QoS – a set of qualities related to the collective behavior of one or more objects [ITU-T Rec. X.641, 1997], i.e. an assessment of a given service based on its characteristics; it is assumed within this work that it is objectively measured;
• Quality-of-Data – QoD – the objectively measured quality of the stored MO or MMO; it is assumed to be constant with respect to the given (M)MO111;
• transformed QoD – T(QoD) – the objectively measured quality requested by the user; it may be equal to the QoD or worse (but not better), e.g. a lower resolution requested;
• Quality-of-Experience – QoE – the subjectively assessed quality perceived with some level of experience by the end user (also called subjective QoS), which depends on QoD, T(QoD), QoS and human factors; QoE is a well-defined term reflecting the subjective quality given above;

110 A metric is a scale of measurement defined in terms of a standard (i.e. well-defined unit). 111 The QoD may change only when (M)MO has scalable properties i.e. QoD will scale according to the amount of accessed data (which is enforced by the given coding scheme).


Appendix B

XV. DETAILED ALGORITHM FOR LLV1 FORMAT

XV.1. The LLV1 decoding algorithm

To understand how the LLV1 bitstream is processed and how the reconstruction of the video from all the layers is performed, the detailed decoding algorithm is presented in Figure 107. The input for the decoding process is defined by the user, i.e. the user specifies how many layers the decoder should decode. Thus, the decoder accepts the base layer (BL) binary stream (required) and up to three optional QELs. Since the QELs depend on the BL, the video properties as well as other structural data are encoded only within the BL bitstream in order to avoid redundancy.

Three loops can be distinguished in the core of the algorithm: the frame loop, the macro block loop, and the block-based processing. The first one is the outermost loop and is responsible for processing all the frames in the encoded binary stream. For each frame, the frame type is extracted from the BL. Four types are possible: intra-coded, forward-predicted, bi-directionally predicted, and skipped frames. Depending on the frame type, further actions are performed. The next inner distinguishable part is called the macro block loop. The MB type and the coded block patterns (CBPs) of the macro blocks for all layers requested by the user are extracted. Based on that, just some or all blocks are processed within the innermost block loop. In the case of an inter MB (forward or bi-directionally predicted), before entering the block loop, the motion vectors are additionally decoded and the motion-compensated frame is created by calculating the reference sample interpolation, which uses an input reference frame from the frame buffer of the BL.


Figure 107. LLV1 decoding algorithm.


The block loops are executed for the base layer first and then for the enhancement layers. In contrast to the base layer, however, not all steps are executed for all the enhancement layers – only the enhancement layer executed as the last one includes all the steps. The quantization plane (q-plane) reconstruction is the step required to calculate the coefficient values by applying Equation (12) (on p. 103) and using data from the bit plane of the QEL. Dequantization and the inverse binDCT are executed once, if only the BL was requested, or at most twice, if any other QEL was requested. The reconstruction of the base layer is required anyway in both cases, because the frames from the BL are used for the reference sample interpolation mentioned earlier. In the case of intra blocks, an additional step is applied, namely motion error compensation, i.e. the correction of pixel values of the interpolated frames by the motion error extracted from the BL or the respective QEL.
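For illustration, the loop structure described above can be summarized by the following strongly simplified skeleton; all types, helper functions and their signatures are assumptions made for this sketch and do not claim to match the actual LLV1 decoder sources.

    /* Simplified control-flow sketch of the LLV1 decoding loops (names assumed). */
    enum frame_type { FRAME_I, FRAME_P, FRAME_B, FRAME_SKIPPED };

    struct bitstream;                        /* opaque handles for BL and QELs   */
    struct macroblock { int is_inter; };

    /* Helpers assumed to exist in the real decoder: */
    int  more_frames(struct bitstream *bl);
    enum frame_type read_frame_type(struct bitstream *bl);
    void read_mb_info(struct bitstream *bl, struct bitstream **qel, int n, struct macroblock *mb);
    void read_motion_vectors(struct bitstream *bl, struct macroblock *mb);
    void interpolate_reference(struct macroblock *mb);       /* uses BL frame buffer */
    void decode_base_block(struct bitstream *bl, struct macroblock *mb, int b);
    void reconstruct_qplane(struct bitstream *qel, struct macroblock *mb, int b); /* Eq. (12) */
    void dequantize_and_inverse_bindct(struct macroblock *mb, int b);
    void store_reference_frame(void);

    void decode_sequence(struct bitstream *bl, struct bitstream **qel, int num_qels,
                         int mb_width, int mb_height)
    {
        while (more_frames(bl)) {                             /* 1. frame loop       */
            enum frame_type ft = read_frame_type(bl);         /* type from BL only   */
            if (ft == FRAME_SKIPPED)
                continue;                                     /* repeat previous     */

            for (int y = 0; y < mb_height; y++)               /* 2. macro block loop */
                for (int x = 0; x < mb_width; x++) {
                    struct macroblock mb;
                    read_mb_info(bl, qel, num_qels, &mb);     /* MB type + CBPs      */
                    if (mb.is_inter) {
                        read_motion_vectors(bl, &mb);
                        interpolate_reference(&mb);           /* BL reference frames */
                    }
                    for (int b = 0; b < 6; b++) {             /* 3. block loop       */
                        decode_base_block(bl, &mb, b);
                        dequantize_and_inverse_bindct(&mb, b);   /* BL reconstruction */
                        for (int l = 0; l < num_qels; l++)
                            reconstruct_qplane(qel[l], &mb, b);  /* q-plane per QEL   */
                        if (num_qels > 0)
                            dequantize_and_inverse_bindct(&mb, b); /* last QEL only   */
                    }
                }
            store_reference_frame();                          /* kept for later MC   */
        }
    }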


Appendix C

XVI. COMPARISON OF MPEG-4 AND H.263 STANDARDS

XVI.1. Algorithmic differences and similarities

This section describes in short the most important differences between the H.263 and MPEG-4 standards for natural video coding. The differences are organized by features of the standards, according to the part of the encoding process where they are used: Motion Estimation and Compensation, Quantization, Coefficient Re-scanning, and Variable Length Coding. At the end, features are discussed that provide enhanced functionality not specifically related to the previous categories.

Motion Estimation and Compensation: The most interesting tools in this section are without doubt Quarter Pixel Motion Compensation (Qpel), Global Motion Compensation (GMC), Unrestricted Motion Vectors (UMV) and Overlapped Block Motion Compensation. Quarter Pixel Motion Compensation is a feature unique to MPEG-4, allowing the motion compensation process to search for a matching block using ¼ pixel accuracy and thus enhancing the compression efficiency. Global Motion Compensation defines a global transformation (warping) of the reference picture used as a base for motion compensation. This feature is implemented in both standards with some minor differences, and it is especially useful when coding global motion on a scene, such as zooming in/out. Unrestricted Motion Vectors allow the Motion Compensation process to search for a match for a block in the reference picture using larger search ranges, and it is implemented in both standards now. Overlapped Block Motion Compensation has been introduced in H.263 (Annex F) [ITU-T Rec. H.263+, 1998] as a feature to provide better concealment when errors occur in the reference frame, and to enhance the perceptual visual quality of the video.


DCT: The Discrete Cosine Transform algorithm used by any video coding standard is specified statistically so as to comply with the IEEE standard 1180-1990. Both standards are the same in this respect.

Quantization: DCT Coefficient Prediction and MPEG-4 Quantization are the most important features in this category. DCT Coefficient Prediction allows DCT coefficients in a block to be spatially predicted from a neighboring block, to reduce the number of bits needed to represent them, and to enhance the compression efficiency. Both standards specify DCT coefficient prediction now. MPEG-4 Quantization is unique to the MPEG-4 standard; unlike the basic quantization method, which uses a fixed-step quantizer for every DCT coefficient in a block, MPEG-4 uses a weighted quantization table method, in which each DCT coefficient in a block is quantized differently according to a weight table. This table can be customized to achieve better compression depending on the characteristics of the video being coded. H.263 adds one further operation after the quantization of the DCT coefficients, the Deblocking Filter mode, which is particularly efficient at improving the visual quality of video coded at low bit rates by removing blocking effects.
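The difference between the two quantizer styles can be illustrated with a strongly simplified sketch; the scaling constants and the weight handling below are illustrative only and do not reproduce the normative quantizer formulas of either standard.

    #define BLOCK_SIZE 64

    /* H.263-style uniform quantization: one fixed step for all coefficients. */
    static void quant_uniform(const int coeff[BLOCK_SIZE], int level[BLOCK_SIZE], int quant)
    {
        for (int i = 0; i < BLOCK_SIZE; i++)
            level[i] = coeff[i] / (2 * quant);
    }

    /* MPEG-4-style weighted quantization: each position is additionally scaled
     * by an entry of a customizable weight matrix w (e.g. a default intra matrix). */
    static void quant_weighted(const int coeff[BLOCK_SIZE], int level[BLOCK_SIZE],
                               int quant, const unsigned char w[BLOCK_SIZE])
    {
        for (int i = 0; i < BLOCK_SIZE; i++)
            level[i] = (coeff[i] * 16) / (w[i] * quant);
    }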

Coefficient re-scanning: The use of alternate scan modes (vertical and horizontal), besides the common zig-zag DCT coefficient reordering scheme, is a feature now available to both standards. These scan modes are used in conjunction with DCT coefficient prediction to achieve better compression efficiency.

Variable-length coding: Unlike earlier standards, which used only a single VLC table for coding run-length coded (quantized) DCT coefficients in both intra- and inter-frames, MPEG-4 and H.263 specify the use of a different Huffman VLC table for coding intra-pictures, enhancing the compression efficiency of the standards. H.263 goes a little bit further by allowing some inter-frames to be coded using the Intra VLC table (H.263 Annex S) [ITU-T Rec. H.263+, 1998].

Next, additional and new features are discussed which do not fall into the basic encoding categories but define new capabilities of the standards.


Arbitrary-Shaped-Object coding (ASO): Defines algorithms to enable the coding of a scene as a collection of several objects. These objects can then be coded separately allowing a coding with higher quality for more important objects of the scene and higher compression for unimportant details. ASO coding is unique to MPEG-4. H.263 does not offer any comparable capability.

Scalable Coding: Scalable coding allows encoding of a scene using several layers. One base layer contains a low-quality / low-resolution version of the scene, while the enhancement layers code the residual error and consecutively refine the quality and resolution of the image. MPEG-4 and H.263 introduce three types of scalability. Temporal Scalability layers enhance the temporal resolution of a coded scene (e.g. from 15 fps to 25 fps). Spatial scalability layers enhance the resolution of a coded scene (e.g. from QCIF to CIF). SNR Scalability (also known as FGS, Fine Granularity Scalability) layers enhance the Signal-to-Noise Ratio of a coded scene. Although both standards support scalable coding, the standards differ in the approach used to support this capability. In contrast to H.263, MPEG-4 implements SNR scalability by using only one enhancement layer to code the reconstruction error from the base layer. This enhancement layer can be used to refine the scene progressively by truncating the layer according to the capabilities / restrictions of the decoding client to achieve good quality under QoS restrictions.

Error-resilient coding: Error resilience coding features have been introduced in MPEG-4 and H.263 to be able to effectively detect, reduce and conceal errors in the video stream caused by transmission over error-prone communication channels. These features are especially intended for low bit-rate video, but are not restricted to that case. Features such as Reversible Variable Length Codes (unique to MPEG-4), data partitioning, and slices (video packets) fall into this category and enable better error detection, recovery and concealment.

Real-time coding: There are tools that enable better control for the encoding application to adapt to changing QoS restrictions and bandwidth conditions. These tools use a backward channel from the decoder to the encoder, so that the latter can change encoding settings to better control the quality of the video. Reduced Resolution Coding (MPEG-4; Reduced Resolution Update in H.263) is a feature used by both standards. It enables the encoder to code a downsampled image in order to meet a given bit rate without causing the comparably bigger loss of visual quality that occurs when dropping frames in the encoding process. MPEG-4 uses NEWPRED, which enables the encoder to select a different reference picture for the motion compensation process if the current one leads to errors in decoding. H.263 defines a better version of this feature, Enhanced Reference Picture Selection (H.263++, Annex U) [ITU-T Rec. H.263++, 2000], which offers the same capability as NEWPRED but adds the possibility of using multiple reference frames in the motion compensation of P and B pictures.

XVI.2. Application-oriented comparison

In a case study based on the proposed algorithm, many existing solutions as well as possible applications in the near future have been analyzed. This has resulted in four general and most common comparison scenarios: Baseline, Compression efficiency, Real-time, and Scalable coding. Together with the discussion of the standards, some examples of suitable applications for each comparison scenario are given (thus making it tangible for the reader).

Baseline: Here the basic encoding tools proposed by each standard are compared, i.e. MPEG-4 Simple versus H.263 Baseline. It is only a theoretical comparison, since H.263 Baseline, being an earlier standard and the starting point for MPEG-4 too, lacks many of the tools already used in the MPEG-4 Simple profile. This scenario is suitable for applications which do not require high-quality video or high compression efficiency and which use relatively error-free communication channels, with the advantage of widespread compatibility and a cheap, low-complexity implementation. A typical application for this would be capturing video for low-level or mid-range digital cameras, home grabbing, or popular cheap hardware (simple TV cards). Because of the limitations of H.263 Baseline, an MPEG-4 Simple Profile compliant coding solution would be better for this case. However, H.263 Baseline combined with Advanced Intra Coding (H.263 Annex I) [ITU-T Rec. H.263+, 1998] offers almost the same capabilities with similar complexity. So the choice between any of these solutions is a matter of taste, although maybe a series of in-depth tests and benchmarks of available implementations could shed better light on which standard performs better in this case (some of the results are available in [Topivata et al., 2001; Vatolin et al., 2005; WWW_Doom9, 2003]).

Compression efficiency: This is one of the main comparison scenarios. Here a comparison of H.263 and MPEG-4 regarding the tools they offer to achieve high compression efficiency is proposed, that is, the tools that help to encode a scene with a given quality using the least possible amount of bits. For this scenario, the MPEG-4 Advanced Simple Profile is compared against H.263's High Latency Profile. In this application scenario the focus is on achieving high compression and good visual quality. Typical representative applications for this scenario include home-user digital video at VCD and DVD qualities, the digital video streaming and downloading (Video on Demand) industry, as well as the High Definition TV (Digital TV) broadcasting industry, surgery in hospitals, digital libraries and museums, multimedia encyclopedias, video sequences in computer and console games, etc. The standard of choice for this type of application is MPEG-4, as it offers better compression tools, such as MPEG-4 Quantization, Quarter Pixel Motion Compensation and B-frames (H.263 can support B-frames, but only in scalability mode).

Realtime: This is another interesting comparison scenario, in which the tools that each standard offers for dealing with real-time encoding and decoding of a video stream are examined. For this scenario, MPEG-4's Advanced Real-Time Streaming profile is compared with H.263's Conversational Internet Profile. The focus in this scenario lies not on compression or high resolution, but on manageable complexity for real-time encoding and on the error detection, correction and concealment features typical for a real-time communication scenario, where transmission errors are more probable. Applications in this scenario make use of video with low to medium resolution and usually low constant bitrates (CBR) to facilitate its live transmission. Video conferencing is a good representative of this application scenario: video is coded in real time, and there is a continuous CBR communication channel between encoder and decoder, so that information is exchanged for controlling and monitoring purposes. Other applications which need 'live' encoding of video material, such as video telephony, process monitoring applications, surveillance applications, network video recording (NVR), web cams and mobile video applications (GPRS phones, military systems), live TV transmissions, etc., can make use of video encoding solutions for the real-time scenario. Both H.263 and MPEG-4 have put effort into developing features for this type of application. However, H.263 is still the standard of choice here, since it offers tools specially designed to offer better video at low resolutions and low bitrates. Features such as the Deblocking Filter and Enhanced Reference Picture Selection make H.263 a better choice than MPEG-4.


Scalable coding: Last but not least, a comparison of both standards according to the tools they provide for scalable coding is proposed. Scalable coding is an attractive alternative to real-time encoding for satisfying Quality-of-Service restrictions. In this scenario, MPEG-4's Simple Scalable and FGS scalable profiles are compared to H.263 Baseline + Advanced Intra Coding (Annex I) + Scalability (Annex O) [ITU-T Rec. H.263+, 1998]. The goal is to compare the ability of both standards to provide good quality and flexible adaptation to a particular QoS level by using enhancement layers. The general idea of scalable coding is to encode the video just once, but to serve it at several quality / resolution levels. The desired QoS shall be achieved by sending more or fewer enhancement layers according to the network's bandwidth conditions. Scalable coding is designed to suit a large variety of applications, due to its ability to encode and send video at various bit rates. Video-on-Demand applications can make use of the features offered by this scenario and offer low (e.g. modem) / medium (e.g. ISDN) / high (e.g. ADSL) bitrate versions of music videos or movie trailers, without having to keep three different versions of the video stream, one for each bit rate. Other types of applications that benefit from this scenario are those where the communication channel used to transmit video does not offer a constant bandwidth, so that the video bit rate has to adapt to the changing conditions of the network; streaming applications over mobile channels come to mind. Although H.263 offers features to support scalable coding, these features are not as powerful as those offered by MPEG-4. Of special interest here is the new SNR scalability approach of MPEG-4, which is much more flexible than former scalability solutions. One of the potential problems of scalable coding, however, is the limited availability of open and commercial encoders and decoders supporting it at present, since most MPEG-4 compliant products only comply with the Simple or Advanced Simple Profile (compression efficiency), and most H.263 products target only the mobile / real-time low bit rate market.

After defining a comparison scenario, the addressed area of the problem can be presented by a graph of ranges (or graph of coverage). Such graphs can show the dependencies between the application requirements and the area covered by the scenario. As an example (Figure 108), a graph of quality range vs. bandwidth requirements (with roughly estimated H.263 and MPEG-4 functions of behavior) is depicted.



Figure 108. Graph of ranges – quality vs. bandwidth requirements

XVI.3. Implementation analysis

In this part we take a look at current implementations of the MPEG-4 and the H.263 standard (one representative has been chosen for each of them). However, we do not dig into details, because many ad-hoc comparisons are publicly available; for example, [WWW_Doom9, 2003] compared seven different implementations in 2003 and [Vatolin et al., 2005] compared a different set of seven codecs in 2005. Besides, we do not want to provide yet another benchmark and performance evaluation description.

One of the disadvantages of this type of comparison is that the current implementations of each standard target different application markets. MPEG-4-compliant applications are mostly compliant with the Simple Profile (SP) or the Advanced Simple Profile (ASP). Some of these applications are open source, but most are commercial products. Other MPEG-4 profiles do not offer such a variety of current solutions on the market, and for many of them it is even very hard to find more than one company offering solutions for that specific profile. H.263 is exclusively a low bit-rate encoder. There are few non-commercial products based on the standard, and even the reference implementation, now maintained by the University of British Columbia (UBC), has become a commercial product. Even for research purposes, obtaining the source of the encoder is subject to payment. H.263 and MPEG-4 both use many algorithms whose patents and rights are held by commercial companies, and as such, one must be very careful not to break copyright agreements.


XVI.3.1. MPEG-4

This is a list of products based on the MPEG-4 standard (as their owners declare):

• 3viX: SP, ASP
• On2 VP6: SP, ASP
• Ogg (VP3-based): SP
• DivX 5.x: SP, ASP
• XVID 0.9: SP, ASP
• Dicas mpegable: SP, ASP, streaming technology
• QuickTime MPEG-4: SP, streaming technology
• Sorenson MPEG-4 Pro: SP, ASP
• UB Video: SP, ASP

XVID [WWW_XVID, 2003] is open source. As such, it was easier to analyze this product and test its compliance with MPEG-4. The results of the analysis of the source code (version 0.9, stable) show that XVID is at the moment only an SP-compliant encoder. However, the development version of the codec aims for ASP compliance. The ARTS profile should be included in later versions of XVID.

One of the missing parts is the ability to generate an MPEG-4 System stream; only the MPEG-4 Video stream is produced. On the other hand, the video stream may be encapsulated in the AVI container. Moreover, there are tools available to extract the MPEG-4-compliant stream and encapsulate it in an MPEG-4 System stream.

XVI.3.2. H.263

This is only a short list of products based on the H.263 standard (as owners declare):

• Telenor TMN 3.0
• Scalar VTC: H.263+
• UBC H.263 Library 0.3

As the representative for the H.263 standard, we chose the Telenor TMN 3.0 encoder, which was the reference implementation of the standard. This version is, however, somewhat obsolete in comparison to the new features introduced by H.263+++ [ITU-T Rec. H.263+++, 2005]. The software itself only supports the annexes proposed by the H.263 standard document (Version 1 in 1995) and the following annexes from the H.263+ standard document (Version 2): K, L, P, Q, R [ITU-T Rec. H.263+, 1998].


Appendix D

XVII. LOADING CONTINUOUS METADATA INTO ENCODER

The pseudo code showing how to load the continuous MD into the encoder for each frame is presented in Listing 6. The continuous MD are stored in this case as a binary compressed stream, so at first a decompression using simple Huffman decoding should be applied (not depicted in the listing). The resulting stream is then nothing else than a sequence of bits, where the given position(s) is (are) mapped to a certain value of the continuous MD element.

LoadMetaData
    R  bipred      (1 bit)   {0,1}
    R  frame_type  (2 bits)  {I_VOP, P_VOP, B_VOP}
    if !I_VOP
        oR fcode (FCODEBITS) => length, height
        if bipred
            oR bcode (BCODEBITS) => b_length, b_height
        endif
    endif
    R  mb_width  (MBWIDTHBITS)
    R  mb_height (MBHEIGHTBITS)
    // do for all macro blocks
    for 0..mb_height
        for 0..mb_width
            // def. MACROBLOCK pMB
            R  pMB->mode     (MODEBITS)
            R  pMB->priority (PRIORITYBITS)
            // do for all blocks
            for i = 0..5
                if I_VOP
                    oR pMB->DC_coeff[i]   (12)
                    oR pMB->AC_coeff[i]   (12)
                    oR pMB->AC_coeff[i+6] (12)
                elseif (B_VOP && MODE_FORWARD) || (P_VOP && (MODE_INTER || MODE_INTRA))
                    oR MVECTOR(x,y) (length, height)
                    if !B_VOP && MODE_INTRA
                        oR pMB->DC_coeff[i]   (12)
                        oR pMB->AC_coeff[i]   (12)
                        oR pMB->AC_coeff[i+6] (12)
                    elseif B_VOP && MODE_BACKWARD
                        oR MVECTOR(x,y) (b_length, b_height)
                    else
                        // for direct mode: B_VOP && MODE_DIRECT
                        // and other (unsupported yet) modes
                        // do nothing with MD bitstream
                    endif
                endif
                if bipred && (B_VOP || (P_VOP && !MODE_INTRA))
                    oR MVECTOR(x,y) (b_length, b_height)
                endif
            endfor
        endfor
    endfor
endLoadMetaData

Listing 6. Pseudo code for loading the continuous MD.

The rows marked with R mean that the data is always read (highlighted in the original listing), and the rows marked with oR mean an optional read (not always included in the stream – it depends on previously read values). The size of the read data in bits is given in round brackets just after the given MD property (in bold). The size may be represented by a constant in capitals, which relates as follows: FCODEBITS to the maximum size of the forward MV, BCODEBITS to the maximum size of the backward MV, MBWIDTHBITS and MBHEIGHTBITS to the maximum number of MBs in width and in height, MODEBITS to the number of supported MB types, and PRIORITYBITS to the total number of MBs in the frame. The values are given in curly brackets if the domain is strictly defined. The sign "=>" means that the read MD attribute allows other elements used later on to be calculated.
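For illustration only, the mandatory (R) and optional (oR) reads at the beginning of Listing 6 could be realized with a plain MSB-first bit reader as sketched below; all names and the concrete bit widths are assumptions, since the actual widths derive from the format definition.

    #include <stddef.h>
    #include <stdint.h>

    #define FCODEBITS    3   /* width of fcode  - illustrative value only  */
    #define BCODEBITS    3   /* width of bcode  - illustrative value only  */
    #define MBWIDTHBITS  8   /* max number of MBs in width  - illustrative */
    #define MBHEIGHTBITS 8   /* max number of MBs in height - illustrative */

    enum { I_VOP = 0, P_VOP = 1, B_VOP = 2 };

    struct bitreader { const uint8_t *buf; size_t pos; };   /* pos counts bits */

    /* Read the next n bits, most significant bit first. */
    static uint32_t read_bits(struct bitreader *br, unsigned n)
    {
        uint32_t v = 0;
        while (n--) {
            v = (v << 1) | ((br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1u);
            br->pos++;
        }
        return v;
    }

    struct frame_md {
        uint32_t bipred, frame_type;
        uint32_t fcode, bcode;        /* only present for non-intra frames */
        uint32_t mb_width, mb_height;
    };

    static void load_frame_header_md(struct bitreader *br, struct frame_md *md)
    {
        md->bipred     = read_bits(br, 1);             /* R  bipred (1 bit)      */
        md->frame_type = read_bits(br, 2);             /* R  frame_type (2 bits) */
        if (md->frame_type != I_VOP) {
            md->fcode = read_bits(br, FCODEBITS);      /* oR fcode => MV range   */
            if (md->bipred)
                md->bcode = read_bits(br, BCODEBITS);  /* oR bcode => b-MV range */
        }
        md->mb_width  = read_bits(br, MBWIDTHBITS);    /* R  mb_width            */
        md->mb_height = read_bits(br, MBHEIGHTBITS);   /* R  mb_height           */
    }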


Appendix E

XVIII. TEST BED

Many resources have been used for conducting the RETAVIC project. They have been allocated depending on the tasks defined in section X.1 (The Evaluation Process). Due to the type of measurements, three general groups are distinguished: non-real-time processing of high load, imprecise measurements in non-real-time, and precise measurements in real-time. These three groups have used different equipment, and the details are given for each group separately.

XVIII.1. Non-real-time processing of high load

The goal of the test bed for non-real-time processing of high load is to measure the functional aspects of the examined algorithms, i.e. the measurement of compression efficiency, the evaluation of the quality, the dependencies between achieved bitrates and quality, and the scalability of the codecs.

The specially designed MultiMonster server served this purpose. The server was a powerful cluster built of one "queen bee" and eight "bees". The detailed cluster specification is given in Table 11; the hardware details of the queen bee are given in Table 12 and those of the bees in Table 13.


MULTIMONSTER CLUSTER
CONTROL – Administrative Management Console: 19'' LCD + keyboard + mouse; switch 16x
BEES – 8x MM Processing Server: 2x Intel Pentium 4 2.66 GHz (only 1 processor installed)
NETWORK – Switch 1 Gbps, 24 ports
STORAGE – EasyRAID System 3.2 TB: 16x WD 200 GB, 7.2k RPM, 2 MB cache, 8.9 ms; effective storage: RAID Level 5 => 1.3 TB, RAID Level 3 => 1.3 TB
QUEEN BEE – 1x MM System Server: 2x Intel Xeon 2.8 GHz, RAID storage attached, OS management and configuration, cluster tools: ClusterNFS & OpenMOSIX
POWER – UPS 3 kVA (just for QUEEN BEE and STORAGE)
Total available processors: real 10 / virtually seen 12 (Xeon HyperThreading)

Table 11. Configuration of the MultiMonster cluster.

QUEEN BEE
CPU model name: 2x Intel(R) Xeon(TM) CPU 2.80GHz
CPU clock (MHz): 2785
Cache size: 512 KB
Memory (MB): 2560
CPU flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ss ht tm
CPU speed (BogoMips): 5570.56
Network cards: Intel Corp. 82543GC Gigabit Ethernet Controller (Fiber) (rev 02); 2x Broadcom Corporation NetXtreme BCM5703 Gigabit Ethernet (rev 02)
RAID bus controller: Compaq Computer Corporation Smart Array 5i/532 (rev 01)

Table 12. The hardware configuration for the queen-bee server.


BEE
CPU model name: Intel(R) Pentium(R) 4 CPU 2.66GHz
CPU clock (MHz): 2658
Cache size: 512 KB
Memory (MB): 512
CPU flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
CPU speed (BogoMips): 5308.41
Network card: 2x Broadcom Corporation NetXtreme BCM5702 Gigabit Ethernet (rev 02)

Table 13. The hardware configuration for the bee-machines.

The operating system is an adapted version of Linux SuSE 8.2. The special OpenMOSIX kernel in version 2.4.22 was patched to support the Broadcom gigabit network cards (bcm5700 v.6.2.17-1). Moreover, it was extended by the special ClusterNFS v.3.0 functionality. Special kernel configuration options have been applied, among others: support for 64 GB RAM and symmetric multiprocessing (SMP). The software used within the testing environment covers: OpenMOSIX tools (such as view, migmon, mps, etc.), self-written scripts for cluster management (cexce, xcexec, cping, creboot, ckillall, etc.), transcode (0.6.12) and audio-video processing software (codecs, libraries, etc.).

There are also other tools available which have been written by students to support the RETAVIC project, to name just a few: the audio-video conversion benchmark AVCOB (based on transcode 0.6.12) by Shu Liu [Liu, 2003], file analysis extensions for MPEG-4 (based on tcprobe 0.6.12) by Xinghua Liang, the LLV1 codec with analyzer and transcoder (based on XviD) by Michael Militzer, the MultiMonster Multimedia Server by Holger Velke, Jörg Meier and Marc Iseler (based on JBoss AS and JavaServlet Technology), and automation scripts for benchmarking audio codecs by Florian Penzkoffer.

Finally, there are also a few applications by the author, such as a PSNR measurement tool, a YUV2AVI converter, a YUV presenter, and a web application presenting some of the results (written mainly in PHP).
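Since PSNR is the central objective quality measure used throughout the evaluation, a minimal sketch of such a computation over one 8-bit plane is given below; it only illustrates the standard formula PSNR = 10*log10(255^2/MSE) and does not claim to reproduce the internals of the author's tool.

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    /* PSNR between a reference and a test plane of n 8-bit samples. */
    static double psnr_8bit(const uint8_t *ref, const uint8_t *test, size_t n)
    {
        double mse = 0.0;
        for (size_t i = 0; i < n; i++) {
            double d = (double)ref[i] - (double)test[i];
            mse += d * d;
        }
        mse /= (double)n;
        if (mse == 0.0)
            return INFINITY;                  /* identical planes: lossless case */
        return 10.0 * log10((255.0 * 255.0) / mse);
    }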


XVIII.2. Imprecise measurements in non-real-time

This part was used for the first proof of concept with respect to the expected behavior of the audio-video processing algorithms. It was applied in a best-effort system (Linux or Windows), so some error has been allowed. The goal was not to obtain exact measurements but rather just to show the differences between the standard and the developed algorithms and to justify the relevance of the proposed ideas.

To provide imprecise measurements of the analyzed audio and video processing algorithms that are still burdened with only a relatively small error, isolation from the network has to be applied in order to avoid unpredictable outside influence. Moreover, to prove the behavior of the processing on diverse processor architectures, different computer configurations had to be used in some cases. Thus a few other configurations have been employed, as listed below in Table 14 and Table 15.

PC_RT
CPU model name: AMD Athlon(tm) XP 1800+
CPU clock (MHz): 1533
Cache size: 256 KB
Memory (MB): 512
CPU flags: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow
CPU speed (BogoMips): 3022.84
Network card: 3Com 3c905 100BaseTX

Table 14. The configuration of PC_RT.

PC
CPU model name: Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz
CPU clock (MHz): 1596
Cache size: 512 KB
Memory (MB): 512
CPU flags: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
CPU speed (BogoMips): 2396.42
Network card: not used

Table 15. The configuration of PC.


XVIII.3. Precise measurements in DROPS

The precise measurements in the real-time system are done in a closed system. To provide comparability of the measurements, exactly one computer has been used. Its detailed configuration is listed in the previous section as PC_RT in Table 14.

Due to the DROPS requirements, only a specific type of network card could be used. The network card used was based on the 3Com 3c905 chip. Three identical machines with such network cards have been configured for the development under DROPS; however, the real-time measurements have always been conducted on the same machine (faui6p7). This allowed the error to be minimized even further during the measurement process.

If not explicitly stated otherwise in the in-place description, a general ordered set of DROPS modules has been used. The set included the following modules in sequence:

• rmgr – the resource manager with the sigma0 option (reference to the root pager) for handling physical memory, interrupts and tasks, and for loading the kernel
• modaddr 0x0b000000 – allowing higher memory addresses to be allocated
• fiasco_apic – the microkernel with the APIC one-shot mode configured, providing the scheduling
• sigma0 – the root pager, which possesses all the available memory at the beginning and makes it available to other processes
• log_net_elinkvortex – the network logging server using the supported 3Com network card driver; in very rare cases, the standard log was used instead of the network log_net_elinkvortex
• names – the server for registering the names of running modules; it was also used to control the boot sequence of the modules (because GRUB does not provide such functionality)
• dm_phys – the simple dynamic memory (data space) manager providing memory parts for demanding tasks
• simple_ts – the simple, generic task server required for additional address space creation during runtime (L4 tasks)
• real_time_application – the real-time audio-video decoding or encoding application; it runs as a demanding L4 task (thus it needs simple_ts and dm_phys)


Such a defined configuration of a constant set of OS modules allowed a very stable and predictable benchmarking environment to be achieved, in which no outside processes could influence the real-time measurements. The only possible remaining source of interrupts could be the network log server used in DROPS for grabbing the measurement values. However, it did not send the output at once, but only after the real-time execution was finished. Thus, possible interrupts generated by the network card on the given IRQ have been eliminated from the measured values.


Appendix F

XIX. STATIC META-DATA FOR FEW VIDEO SEQUENCES

The attributes' values of the entities in the initial static MD set have been calculated for a few video sequences. Instead of representing these values in tables, which would occupy an enormous amount of space, they are demonstrated in graphical form. There are three levels of values depicted, in analogy to the natural hierarchy of the initial static MD set proposed in the thesis (Figure 12 on p. 98)112. The frame-based static MD represent the StaticMD_Video subset, the MB-based static MD refer to the StaticMD_Frame subset, and the MV-based (or block-based) static MD are connected with the StaticMD_MotionVector (or StaticMD_Layer) subset. Each of these levels is depicted in an individual section.
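As a reading aid only, the three-level hierarchy can be pictured as nested records. The following C sketch uses the counter names that actually appear in this appendix (IFramesSum, PFramesSum, BFramesSum, IMBsSum, PMBsSum, BMBsSum, MVsSum); all other field and type details are illustrative assumptions and not the definitions used in the thesis.

/* Illustrative sketch of the static-MD hierarchy, not the thesis' actual types. */
typedef struct {
    int type;     /* one of the nine MV types of Section V.3, or no_mv (assumption) */
    int MVsSum;   /* number of MVs of this type in the frame */
} StaticMD_MotionVector;

typedef struct {
    int IMBsSum, PMBsSum, BMBsSum;       /* macro-block counts of the frame */
    StaticMD_MotionVector mvs[10];       /* nine MV types + no_mv */
} StaticMD_Frame;

typedef struct {
    int IFramesSum, PFramesSum, BFramesSum;  /* frame-type counts of the sequence */
    StaticMD_Frame *frames;                  /* one entry per frame */
    int frameCount;
} StaticMD_Video;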

XIX.1. Frame-based static MD

The sequences under investigation have usually been prepared in two versions (Figure 109): the normal one, where only I- and P-frames have been defined, and the one with the _temp extension, where additionally B-frames have been included. The distribution of P- and B-frames within the video sequences was enforced by the LLV1 temporal scalability layer, i.e. an equal number of P- and B-frames appeared. The sums of each frame type (IFramesSum, PFramesSum, and BFramesSum) are shown as the distribution within the video sequence.

112 The division into three levels has been used for the precise time prediction during the design of the real-time processing model.


[Chart "Distribution of I-, P- & B-frames": for each test sequence (parkrun_itu601, shields_itu601, mobcal_itu602_temp, mobcal_itu601, mobile_cif, mobile_cif_temp, container_cif, container_cif_temp, mother_and_daugher_cif_temp, mother_and_daugher_cif, mobile_qcif_temp, mobile_qcif, container_qcif, container_qcif_temp, mother_and_daugher_qcif, mother_and_daugher_qcif_temp) the shares of I-frames, P-frames and B-frames are shown on a 0%-100% scale.]

Figure 109. Distribution of frame types within the used set of video sequences.

XIX.2. MB-based static MD

The MB-based static MD have been prepared for a few video sequences in analogy to the frame-based calculations. There are three types of macro blocks distinguished, and the respective sums (IMBsSum, PMBsSum, and BMBsSum) included in the StaticMD_Frame subset of the initial static MD set are calculated for each frame in the video sequence and depicted in the charts below.

[Charts "Coded MBs L0": per-frame counts of I-, P- and B-MBs for the sequence pairs carphone_qcif_96 / carphone_qcif_96_temp, coastguard_cif / coastguard_cif_temp, coastguard_qcif / coastguard_qcif_temp, container_cif / mobile_cif, mobile_cifn_140 / mobile_cifn_140_temp, and mobile_qcif / mobile_qcif_temp. The X-axis gives the frame number, the Y-axis the number of coded macro blocks per frame.]


XIX.3. MV-based static MD

The MV-based static MD have been prepared for a few videos, again in analogy to the previous sections. There are nine types of MVs distinguished, as described in the Video-Related Static MD section (V.3). The respective frame-specific sum (MVsSum) is kept in relation to the MV type in the StaticMD_MotionVector subset of the initial static MD set. Besides the nine types, there is one more value called no_mv. This value refers to the macro blocks in which no motion vector is stored, i.e. the MB is intra-coded. Please note that no_mv is different from the zero MV (i.e. x=0 and y=0), because in the case of no_mv neither backward-predicted nor bi-directionally-predicted interpolation occurs, while in the other case one of these is applied.

XIX.3.1. Graphs with absolute values

The charts below depict the absolute number of MVs per frame depending on the type. The sum of all ten cases (nine MV types + no_mv) is constant for sequences having only I- or P-MBs (or frames), because either exactly one MV or no_mv is assigned per MB. In contrast, two MVs are assigned to bi-directionally predicted MBs, so the total number may vary between the number of MBs (no B-MBs) and twice the number of MBs (only B-MBs).
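As a plausibility check (assuming the usual 16x16-pixel macro blocks), the bounds can be computed directly from the frame sizes:

$$N_{MB}^{QCIF} = \tfrac{176}{16}\cdot\tfrac{144}{16} = 99, \qquad N_{MB}^{CIF} = \tfrac{352}{16}\cdot\tfrac{288}{16} = 396,$$

so the per-frame sum of all ten counters lies between 99 and 198 for the QCIF sequences and between 396 and 792 for the CIF sequences, which matches the value ranges of the charts below.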

[Charts "MVs per frame": per-frame counts of the nine MV types (mv1-mv9) and of no_mv for the sequence pairs carphone_qcif_96 / carphone_qcif_96_temp, coastguard_cif / coastguard_cif_temp, coastguard_qcif / coastguard_qcif_temp, container_cif / mobile_cif, mobile_cifn_140 / mobile_cifn_140_temp, and mobile_qcif / mobile_qcif_temp. The X-axis gives the frame number, the Y-axis the absolute number of MVs per frame.]

XIX.3.2. Distribution graphs


The distribution graphs for the same sequences are depicted below. The small rectangles (bars) depict the sum of the given type of vector within the frame, such that white is equal to zero and black is equal to all MVs in the frame; obviously, the darker the color, the more MVs of the given type exist in the frame. The frame numbers run along the X-axis, starting with the first frame on the left and ending with the last frame on the right; the bar width depends on the number of frames presented in the histogram. The MV types are assigned along the Y-axis, starting with mv1 at the top, going step-by-step down to mv9, and having no_mv at the very bottom. Thus, it is easily noticeable that for the first frame of each video sequence the bottom-left rectangle is dark and all nine rectangles above it in the same column are white; this is due to the use of only closed GOPs in each sequence and thus always an I-frame at the beginning, which contains no MVs at all (because only I-MBs are included).
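One hedged way to formalize the shading described above (this formula is not given explicitly in the thesis and is only an interpretation of the white-to-black mapping) is a per-frame normalization:

$$\mathrm{shade}(t,f) \;=\; 1 - \frac{\mathrm{MVsSum}(t,f)}{\sum_{t'} \mathrm{MVsSum}(t',f)},$$

where $t$ ranges over the ten categories (mv1-mv9 and no_mv) of frame $f$, so a value of 1 corresponds to white (no MVs of that type) and 0 to black (all MVs of the frame belong to that type).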

[Distribution graphs for: carphone_qcif_96, carphone_qcif_96_temp, coastguard_cif, coastguard_cif_temp, coastguard_qcif, coastguard_qcif_temp, container_qcif, container_qcif_temp, mobile_cifn_140, mobile_cifn_140_temp, mobile_qcif, mobile_qcif_temp, and mobile_cif.]


Appendix G

This appendix includes the full listings of the real-time functions of the meta-data-based converters implemented for DROPS, i.e. the RT-MD-LLV1 decoder and the RT-MD-XVID encoder.

XX. FULL LISTING OF IMPORTANT REAL-TIME FUNCTIONS IN RT-MD-LLV1

XX.1. Function preempter_thread()

#if REALTIME
static void preempter_thread (void)
{
  l4_rt_preemption_t _dw;
  l4_msgdope_t _result;
  l4_threadid_t id1, id2;
  extern l4_threadid_t main_thread_id;
  extern volatile char timeslice_overrun_optional;
  extern volatile char timeslice_overrun_mandatory;
  extern volatile char deadline_miss;
  extern int no_base_tso;
  extern int no_enhance_tso;
  extern int no_clean_tso;
  extern int no_deadline_misses;

  id1 = L4_INVALID_ID;
  id2 = L4_INVALID_ID;

  while (1) {
    // wait for preemption IPC
    if (l4_ipc_receive(l4_preemption_id(main_thread_id),
                       L4_IPC_SHORT_MSG, &_dw.lh.low, &_dw.lh.high,
                       L4_IPC_NEVER, &_result) == 0) {
      if (_dw.p.type == L4_RT_PREEMPT_TIMESLICE) {
        /* this is timeslice 1 ==> mandatory */
        if (_dw.p.id == 1) {
          /* mark this TSO */
          timeslice_overrun_mandatory = 1;
          /* count tso */
          no_base_tso++;
        }
        /* this is timeslice 2 ==> optional */
        else if (_dw.p.id == 2) {
          /* mark this TSO for main thread */
          timeslice_overrun_optional = 1;
          /* count tso */
          no_enhance_tso++;
        }
        /* this is timeslice 3 ==> mandatory */
        else if (_dw.p.id == 3) {
          /* count tso */
          no_clean_tso++;
        }
      }
      /* this is a deadline miss !
       * => we're really in trouble! */
      else if (_dw.p.type == L4_RT_PREEMPT_DEADLINE) {
        /* mark deadline miss */
        deadline_miss = 1;
        /* count tso */
        no_deadline_misses++;
      }
    }
    else
      LOG("Preempt-receive returned %x", L4_IPC_ERROR(_result));
  }
}
#endif /*REALTIME*/

XX.2. Function load_allocation_params()

#if REALTIME
/* load parameters for allocation */
void load_allocation_params(void)
{
  /* list with allocations (must be defined as -D with Makefile) */
#ifdef _qcif
  file = "_qcif";
  max_base_per_MB = 0.0;
  avg_base_per_MB_base = 27.15;
  max_base_per_MB_base = 28.17;
  avg_base_per_MB_enhance = 29.35;
  max_base_per_MB_enhance = 30.35;
  max_enhance_per_MB = 18.44;
  avg_cleanup_per_MB_base = 2.22;
  max_cleanup_per_MB_base = 3.05;
  avg_cleanup_per_MB_enhance = 13.10;
  max_cleanup_per_MB_enhance = 13.17;
#elif defined _cif
  file = "_cif";
  max_base_per_MB = 0.0;
  avg_base_per_MB_base = 21.36;
  max_base_per_MB_base = 22.14;
  avg_base_per_MB_enhance = 25.12;
  max_base_per_MB_enhance = 25.97;
  max_enhance_per_MB = 15.17;
  avg_cleanup_per_MB_base = 2.02;
  max_cleanup_per_MB_base = 2.07;
  avg_cleanup_per_MB_enhance = 14.66;
  max_cleanup_per_MB_enhance = 15.30;
#elif defined _itu601
  file = "_itu601";
  max_base_per_MB = 0.0;
  avg_base_per_MB_base = 18.59;
  max_base_per_MB_base = 23.54;
  avg_base_per_MB_enhance = 22.71;
  max_base_per_MB_enhance = 27.69;
  max_enhance_per_MB = 15.09;
  avg_cleanup_per_MB_base = 1.60;
  max_cleanup_per_MB_base = 1.72;
  avg_cleanup_per_MB_enhance = 14.37;
  max_cleanup_per_MB_enhance = 14.96;
#else
  file = "unknown_video";
  max_base_per_MB = 0.0;
  avg_base_per_MB_base = 27.15;
  max_base_per_MB_base = 28.17;
  avg_base_per_MB_enhance = 29.35;
  max_base_per_MB_enhance = 30.35;
  max_enhance_per_MB = 18.44;
  avg_cleanup_per_MB_base = 2.22;
  max_cleanup_per_MB_base = 3.05;
  avg_cleanup_per_MB_enhance = 14.96;
  max_cleanup_per_MB_enhance = 15.30;
#endif
}
#endif /*REALTIME*/
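The listing does not state the unit or the exact use of these constants. Assuming they are per-macro-block processing-time bounds that feed the per-frame timeslice reservations (an assumption, not confirmed by the listing), a worst-case base-layer budget for a CIF frame (396 MBs) would be on the order of

$$T^{\max}_{\mathrm{base}} \approx N_{MB} \cdot \mathrm{max\_base\_per\_MB\_base} = 396 \cdot 22.14 \approx 8.8 \cdot 10^{3}$$

time units per frame.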


XXI. FULL LISTING OF IMPORTANT REAL-TIME FUNCTIONS IN RT-MD-XVID

XXI.1. Function preempter_thread()

#if REALTIME
static void preempter_thread (void)
{
  l4_rt_preemption_t _dw;
  l4_msgdope_t _result;
  l4_threadid_t id1, id2;
  extern l4_threadid_t main_thread_id;
  extern volatile char deadline_miss;
  extern int no_deadline_misses;

  id1 = L4_INVALID_ID;
  id2 = L4_INVALID_ID;

  while (1) {
    // wait for preemption IPC
    if (l4_ipc_receive(l4_preemption_id(main_thread_id),
                       L4_IPC_SHORT_MSG, &_dw.lh.low, &_dw.lh.high,
                       L4_IPC_NEVER, &_result) == 0) {
      if (_dw.p.type == L4_RT_PREEMPT_TIMESLICE) {
        // this is timeslice 1 ==> mandatory
        if (_dw.p.id == 1) {
          realtime_mode = OPTIONAL;
          l4_rt_next_reservation(1, &left);
        }
        // this is timeslice 2 ==> optional
        else if (_dw.p.id == 2) {
          realtime_mode = MANDATORY_CLEANUP;
          l4_rt_next_reservation(2, &left);
        }
        // this is timeslice 3 ==> mandatory
        else if (_dw.p.id == 3) {
          realtime_mode = DEADLINE;
          l4_rt_next_reservation(3, &left);
        }
      }
      // this is a deadline miss !
      // => we're really in trouble!
      else if (_dw.p.type == L4_RT_PREEMPT_DEADLINE) {
        // mark deadline miss
        deadline_miss = 1;
        // count tso
        no_deadline_misses++;
      }
    }
    else
      LOG("Preempt-receive returned %x", L4_IPC_ERROR(_result));
  }
}
#endif /*REALTIME*/



Appendix H

This appendix covers audio-specific aspects, namely the MPEG-4 audio tools and profiles as well as the MPEG-4 SLS enhancements.

XXII. MPEG-4 AUDIO TOOLS AND PROFILES

[Table 16 (flattened in this copy): for each MPEG-4 Audio Object Type (AAC main, AAC LC, AAC SSR, AAC LTP, SBR, AAC Scalable, TwinVQ, CELP, HVXC, TTSI, Main synthetic, Wavetable synthesis, General MIDI, Algorithmic Synthesis and Audio FX, ER AAC LC, ER AAC LTP, ER AAC scalable, ER TwinVQ, ER BSAC, ER AAC LD, ER CELP, ER HVXC, ER HILN, ER Parametric, SSC, Layer-1, Layer-2, Layer-3) the table marks which coding functionality (tools/modules) it employs: gain control, block switching, standard and AAC-LD window shapes, standard and SSR filterbanks, TNS, LTP, intensity coupling, frequency-domain prediction, PNS, MS, SIAQ, FSS, upsampling filter tool, AAC coding & quantization, TwinVQ coding & quantization, BSAC coding & quantization, ER AAC tools, ER payload syntax, EP tool, silence compression, CELP, HVXC, 4kb HVXC, SA tools, SASBF, MIDI, HILN, TTSI, SBR, and MPEG Layer-1/2/3.]

Table 16. MPEG Audio Object Type Definition based on Tools/Modules [MPEG-4 Part III, 2005].

315 Appendix H XXII. MPEG-4 Audio Tools and Profiles

Explanation of the abbreviations used in Table 16 (for a detailed description of the Tools/Modules, readers are referred to [MPEG-2 Part VII, 2006; MPEG-4 Part III, 2005] and the respective MPEG-4 standard amendments):

• LC – Low Complexity
• ER – Error Robust
• SSR – Scalable Sample Rate
• BSAC – Bit Sliced Arithmetic Coding
• LTP – Long Term Predictor
• SBR – Spectral Band Replication
• HILN – Harmonic and Individual Lines plus Noise
• TwinVQ – Transform-domain Weighted Interleaved Vector Quantization
• SSC – Sinusoidal Coding
• LD – Low Delay
• CELP – Code Excited Linear Prediction
• TNS – Temporal Noise Shaping
• HVXC – Harmonic Vector Excitation Coding
• SA – Structured Audio
• SASBF – Structured Audio Sample Bank Format
• TTSI – Text-to-Speech Interface
• MIDI – Musical Instrument Digital Interface
• HE – High Efficiency
• PS – Parametric Stereo

[Table 17 (flattened in this copy): for the selected Audio Object Types, identified by their Object Type ID (1 AAC main, 2 AAC LC, 3 AAC SSR, 4 AAC LTP, 23 ER AAC LD), the table marks their inclusion in the MPEG-4 Audio Profiles (Main Audio, Scalable Audio, High Quality Audio, Low Delay Audio, Natural Audio, Mobile Audio Internetworking, AAC, and High Efficiency AAC). AAC main is used in two profiles, AAC LC in six, AAC SSR in two, AAC LTP in four, and ER AAC LD in three.]

Table 17. Use of a few selected Audio Object Types in MPEG Audio Profiles [MPEG-4 Part III, 2005].


XXIII. MPEG-4 SLS ENHANCEMENTS

This section is based on the work conducted within the framework of the joint master thesis project [Wendelska, 2007] in cooperation with Dipl.-Math. Ralf Geiger from Fraunhofer IIS.

XXIII.1. Investigated versions - origin and enhancements

Origin
• v0 – origin version (not used due to printf overhead)
• v01 – origin version (printf commented out for the measurements)

New interpolations
• v1 – new interpolateValue1to7 (but incomplete measurements – only 2 sequences checked)
• v2 – new interpolateFromCompactTable (1st method)

Vectorizing Headroom
• v3 – new vectorized msbHeadroomINT32 (with the old interpolateFromCompactTable); see the sketch after this list

Vectorizing the 2-level loop of srfft_fixpt
• v4 – new vectorized 1st loop of srfft_fixpt (with the old msbHeadroomINT32)
• v5 – old 1st loop of srfft_fixpt, new vectorized 2nd loop of srfft_fixpt
• v6 – new vectorized 1st and 2nd loop of srfft_fixpt

New interpolation and vectorizing Headroom
• v7 – new interpolateFromCompactTable (1st method) and new vectorized msbHeadroomINT32 [incl. v2 & v3]
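Purely for illustration, the following C sketch shows the kind of computation the name msbHeadroomINT32 suggests: determining the unused most-significant bits of INT32 samples and taking the minimum over a block, which is the natural candidate for vectorization. It is a hypothetical sketch and not the MPEG-4 SLS reference code optimized in [Wendelska, 2007].

#include <stdint.h>

/* Headroom of a single INT32 value: number of redundant MSBs below the sign bit. */
static int msb_headroom_int32(int32_t x)
{
    uint32_t v = (x < 0) ? ~(uint32_t)x : (uint32_t)x;  /* magnitude-like bits */
    int h = 0;
    while (h < 31 && !(v & (1u << (30 - h))))           /* count unused leading bits */
        h++;
    return h;
}

/* Headroom of a whole block = minimum over all samples (the part worth vectorizing). */
static int block_headroom(const int32_t *x, int n)
{
    int h = 31;
    for (int i = 0; i < n; i++) {
        int hi = msb_headroom_int32(x[i]);
        if (hi < h)
            h = hi;
    }
    return h;
}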

XXIII.2. Measurements

[Chart "Average Time": average execution time per version (v01, v2-v7) for the sequences bach44m, barcelona32s, chopin44s, jazz48m and jazz48s, with min/max deviation bars.]

The chart depicts the execution times of the different versions for the five compared sequences. Only the times of v2, v3 and v7 are smaller than the origin time of the unoptimized code for all sequences. The minimum and maximum times measured are depicted as deviations from the average of twelve executions. Out of all 420 measurements, there are only 3 cases where the difference between MAX and AVG exceeds 5.8% (namely 15.5%, 10.7% and 7.5%), and only 2 cases where the MIN-to-AVG difference exceeds 5.0% (namely 6.1% and 5.1%). These five measurements were influenced by outside factors and are thus treated as irrelevant.


[Chart "Average Time Cumulated": execution time accumulated over all sequences (bach44m, barcelona32s, chopin44s, jazz48m, jazz48s) per version (v01, v2-v7).]

The average execution time was accumulated over all sequences for the respective versions. It clearly shows that a smaller time is achieved only by v2, v3 and v7.

[Chart "Execution Time (vN%) vs. Origin Time (v01%)": relative execution time of each version (v2-v7) with respect to v01, per sequence (bach44m, barcelona32s, chopin44s, jazz48m, jazz48s) and on average.]

The execution time of each version in comparison to the origin time (v01) demonstrates the gain for each sequence and for all of them on average. The v2 version needs only 96.6%, v3 only 79.8% and v7 only 76.4% of the origin time of the unoptimized version.

[Chart "Speed-up Ratio (Origin vs. Current)": speed-up of each version relative to v01, per sequence and on average.]

Thus the v7 version finally delivers the best speed-up, ranging from 1.26 to 1.35 for the different sequences and being equal to 1.31 on average.

XXIII.3. Overall Final Improvement

The final benchmark has been conducted in comparison to the origin version. Figure 110 presents the overall speed-up ratios for both the encoder and the decoder, i.e. the percentage of processing time gained by the final code version over the original one. The total execution time of the encoder was decreased by 21%-36%, depending on the input file, while the decoder's total time was reduced only by 15%-25%. All the successfully vectorized functions and operations together obtained about 18% speed-up of the total execution time and about 28% of the IntMDCT time compared to the original code version. The enhancement in the accumulated execution time of the IntMDCT calculations, being the focus of the project, was noticeably larger than the overall results, i.e. the IntMDCT-encoding speed-up achieved 42%-48% and the InvIntMDCT required 45%-50% less time, respectively. As a result, a decrease of the main optimization target by roughly a factor of two has been achieved.
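Reading these time reductions as speed-up factors makes the factor-of-two statement explicit: a reduction of the execution time by a fraction r corresponds to a speed-up of

$$S = \frac{t_{\mathrm{orig}}}{t_{\mathrm{new}}} = \frac{1}{1-r}, \qquad r = 0.42\ldots0.48 \;\Rightarrow\; S \approx 1.7\ldots1.9, \qquad r = 0.45\ldots0.50 \;\Rightarrow\; S \approx 1.8\ldots2.0.$$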

Figure 110. Percentage of the total gained time between the original code version and the final version [Wendelska, 2007].



Bibliography

[Ahmed et al., 1974] Ahmed, N., Natarajan, T., Rao, K. R.: Discrete Cosine Transform. IEEE Trans. Computers Vol. xx, pp.90-93, 1974. [ANS, 2001] ANS: American National Standard T1.523-2001: Telecom Glossary 2000. Alliance for Telecommunications Industry Solutions (ATIS) Committee T1A1, National Telecommunications and Information Administration's Institute for Telecommunication Sciences (NTIA/ITS) - Approved by ANSI, Feb. 28th, 2001. [Assunncao and Ghanbari, 1998] Assunncao, P. A. A., Ghanbari, M.: A Frequency-Domain Video Transcoder for Dynamic Bit-Rate Reduction of MPEG-2 Bit Streams. IEEE Trans. Circuits and Systems for Video Technology Vol. 8(8), pp.953-967, 1998. [Astrahan et al., 1976] Astrahan, M. M., Blasgen, M. W., Chamberlin, D. D., Eswaran, K. P., Gray, J. N., Griffiths, P. P., King, W. F., Lorie, R. A., McJones, P. R., Mehl, J. W., Putzolu, G. R., Traiger, I. L., Wade, B., V., W.: System R: A Relational Approach to Database Management. ACM Transactions on Database Systems Vol. 1(2), pp.97-137, 1976. [Auspex, 2000] Auspex: A Storage Architecture Guide. White Paper, Santa Clara (CA), USA, Auspex Systems, Inc., 2000. [Barabanov and Yodaiken, 1996] Barabanov, Yodaiken: Real-Time Linux. Linux Journal Vol., 1996. [Bente, 2004] Bente, N.: Comparison of Multimedia Servers Available on Nowadays Market— Hardware and Software. Study Project, Database Systems Chair. FAU Erlangen-Nuremberg, Erlangen, Germany [Berthold and Meyer-Wegener, 2001] Berthold, H., Meyer-Wegener, K.: Schema Design and Query Processing in a Federated Multimedia Database System. 6th International Conference on Cooperative Information Systems (CoopIS'01), in Lecture Notes in Computer Science Vol.2172, Trento, Italy, Springer Verlag, Sep. 2001. [Bovik, 2005] Bovik, A. C.: Handbook of Image and Video Processing (2nd Ed.), Academic Press, ISBN 0-12-119792-1, 2005. [Bryant and O'Hallaron, 2003] Bryant, R. E., O'Hallaron, D. R.: Computer Systems – A Programmer's Perspective. Chapter IX. Measuring Program Execution Time, Prentice Hall, ISBN 0- 13-034074-X, 2003. [Campbell and Chung, 1996] Campbell, S., Chung, S.: Database Approach for the Management of Multimedia Information. Multimedia Database Systems. Ed.: K. Nwosu, Kluwer Academic Publishers, ISBN 0-7923-9712-6, 1996. [Candan et al., 1996] Candan, K. S., Subrahmanian, V. S., Rangan, P. V.: Towards a Theory of Collaborative Multimedia IEEE International Conference on Multimedia Computing and Systems (ICMCS'96), Hiroshima, Japan, Jun. 1996.


[Carns et al., 2000] Carns, P. H., Ligon III, W. B., Ross, R. B., Thakur, R.: PVFS: A Parallel File System For Linux Clusters. 4th Annual Linux Showcase and Conference, Atlanta (GA), USA, Oct. 2000. [Cashin, 2005] Cashin, E.: Kernel Korner - ATA Over Ethernet: Putting Hard Drives on the LAN. Linux Journal Vol. 134, 2005. [Chamberlin et al., 1981] Chamberlin, D. D., Astrahan, M. M., Blasgen, M. W., Gray, J. N., King, W. F., Lindsay, B. G., Lorie, R., Mehl, J. W., Price, T. G., Putzolu, F., Selinger, P. G., Schkolnick, M., Slutz, D. R., Traiger, I. L., Wade, B. W., Yost, R. A.: A History and Evaluation of System R. Communications of the ACM Vol. 24(10), pp.632-646, 1981. [Ciliendo, 2006] Ciliendo, E.: Linux-Tuning: Performance-Tuning für Linux-Server. iX Vol. 01/06, pp.130-132, 2006. [CODASYL Systmes Committee, 1969] CODASYL Systmes Committee: A Survey of Generalized Data Base Management Systems. Technical Report (PB 203142), May 1969. [Codd, 1970] Codd, E. F.: A Relational Model of Data for Large Shared Data Banks. Communications of the ACM Vol. 13(6), pp.377-387, 1970. [Codd, 1995] Codd, E. F.: "Is Your DBMS Really Relational?" and "Does Your DBMS Run By the Rules?" ComputerWorld, (Part 1: October 14, 1985, Part 2: October 21, 1985). Vol. xx, 1995. [Connolly and Begg, 2005] Connolly, T. M., Begg, C. E.: Database Systems: A Practical Approach to Design, Implementation, and Management (4th Ed.). Essex, England, Pearson Education Ltd., ISBN 0-321-21025-5, 2005. [Curran and Annesley, 2005] Curran, K., Annesley, S.: Transcoding Media for Bandwidth Constrained Mobile Devices. International Journal of Network Management Vol. 15, pp.75-88, 2005. [Cutmore, 1998] Cutmore, N. A. F.: Dynamic Range Control in a Multichannel Environment. Journal of the Audio Engineering Society Vol. 46(4), pp.341-347, 1998. [Dashti et al., 2003] Dashti, A., Kim, S. H., Shahabi, C., Zimmermann, R.: Server Design, Prentice Hall Ptr, ISBN 0-13-067038-3, 2003. [Davies, 1984] Davies, B.: Integral Transforms and Their Applications (Applied Mathematical Sciences), Springer, ISBN 0-387-96080-5, 1984. [Dennis and Van Horn, 1966] Dennis, J. B., Van Horn, E. C.: Programming semantics for multiprogrammed computations. Communications of the ACM Vol. 9(3), pp.143-155, 1966. [Devos et al., 2003] Devos, H., Eeckhaut, H., Christiaens, M., Verdicchio, F., Stroobandt, D., Schelkens, P.: Performance requirements for reconfigurable hardware for a scalable wavelet video decoder. CD-ROM Proceedings of the ProRISC / IEEE Benelux Workshop on Circuits, Systems and Signal Processing, STW, Utrecht, Nov. 2003. [Ding and Guo, 2003] Ding, G.-g., Guo, B.-l.: Improvement to Progressive Fine Granularity Scalable Video Coding 5th International Conference on Computational Intelligence and Multimedia Applications (ICCIMA'03) Xi'an, China, Sep. 2003. [Dingeldein, 1995] Dingeldein, D.: Multimedia interactions and how they can be realized. SPIE Photonics West Symposium, Multimedia Computing and Networking, San José (CA), USA, SPIE Vol. 2417, pp.46-53, Mar. 1995. [Dogan, 2001] Dogan, S.: Video Transcoding for Multimedia Communication Networks. PhD Thesis. University of Surrey, Guildford, United Kingdom. Oct. 2001.


[Effelsberg and Steinmetz, 1998] Effelsberg, W., Steinmetz, R.: Video Compression Techniques. Heidelberg, Germany, dpunkt Verlag, 1998. [Eisenberg and Melton, 2001] Eisenberg, A., Melton, J.: SQL Multimedia and Application Packages (SQL/MM). SIGMOD Record Vol. 30(4), 2001. [El-Rewini et al., 1994] El-Rewini, H., Lewis, T. G., Ali, H. H.: Task Scheduling in Parallel and Distributed Systems. New Jersey, USA, PTR Prentice Hall, ISBN 0-13-099235-6, 1994. [Eleftheriadis and Anastassiou, 1995] Eleftheriadis, A., Anastassiou, D.: Constrained and General Dynamic Rate Shaping of Compressed Digital Video. 2nd IEEE International Conference on Image Processing (ICIP'95), Arlington (VA), USA, IEEE, Oct. 1995. [Elmasri and Navathe, 2000] Elmasri, R., Navathe, S. B.: Fundamentals of Database Systems. Reading (MA), USA, Addison Wesley Longman Inc., ISBN 0-8053-1755-4, 2000. [Fasheh, 2006] Fasheh, M.: OCFS2: The Oracle Clustered File System, Version 2, retrieved on 21.07.2006, 2006, from http://oss.oracle.com/projects/ocfs2/dist/documentation/fasheh.pdf, 2006. [Feig and Winograd, 1992] Feig, E., Winograd, S.: Fast Algorithms for the Discrete Cosine Transform. IEEE Trans. Signal Processing Vol. 40(9), pp.2174-2193, 1992. [Ford et al., 1997] Ford, B., van Maren, K., Lepreau, J., Clawson, S., Robinson, B., Turner, J.: The FLUX OS Toolkit: Reusable Components for OS Implementation. 6th IEEE Workshop on Hot Topics in Operating Systems, Cape Cod (MA), USA, May 1997. [Fortier and Michel, 2002] Fortier, P. J., Michel, H. E.: Computer Systems Performance Evaluation and Prediction. Burlington (MA), USA, Digital Press, ISBN 1-55558-260-9, 2002. [Fry and Sibley, 1976] Fry, J. P., Sibley, E. H.: Evolution of Data-Base Management Systems. ACM Computing Surveys (CSUR) Vol. 8(1), pp.7-42, 1976. [Geiger et al., 2001] Geiger, R., Sporer, T., Koller, J., Brandenburg, K.: Audio Coding Based On Integer Transforms. 111th Convention AES Convention, New York (NY), USA, AES, Sep. 2001. [Geiger et al., 2004] Geiger, R., Yokotani, Y., Schuller, G., Herre, J.: Improved Integer Transforms using Multi-Dimensional Lifting. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'04), Montreal (Quebec), Canada, IEEE, May, 11-17th, 2004. [Geiger et al., 2006] Geiger, R., Yu, R., Herre, J., Rahardja, S. S., Kim, S.-W., Lin, X., Schmidt, M.: ISO / IEC MPEG-4 High-Definition Scalable Advanced Audio Coding. 120th Convention of Audio Engineering Society (AES), Paris, France, AES No. 6791, May 2006. [Gemmel et al., 1995] Gemmel, D. J., Vin, H. M., Kandlur, D. D., Rangan, P. V., Rowe, L. A.: Multimedia Storage Servers: A Tutorial. IEEE Computer Vol. 28(5), pp.40-49, 1995. [Gibson et al., 1998] Gibson, J. D., Berger, T., Lookabaugh, T., Lindbergh, D., Baker, R. L.: Digital Compression for Multimedia: Principles and Standards. London, UK, Academic Press, 1998. [Hamann, 1997] Hamann, C.-J.: On the Quantitative Specification of Jitter Constrained Periodic Streams. 5th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems MASCOTS’97, Haifa, Israel, Jan. 1997. [Hamann et al., 2001a] Hamann, C.-J., Löser, J., Reuther, L., Schönberg, S., Wolter, J., Härtig, H.: Quality-Assuring Scheduling – Using Stochastic Behavior to Improve Resource Utilization. 22nd IEEE Real-Time Systems Symposium (RTSS 2001), London, UK, Dec. 2001.


[Hamann et al., 2001b] Hamann, C.-J., Märcz, A., Meyer-Wegener, K.: Buffer Optimization in Realtime Media Streams using Jitter-Constrained Periodic Streams. SFB 358 - G3 - 01/2001 Technical Report. TU Dresden, Dresden, Germany. Jan. 2001. [Härtig et al., 1997] Härtig, H., Hohmuth, M., Liedtke, J., Schönberg, S.: The performance of μ- kernel-based systems. 16th ACM Symposium on Operating Systems Principles, Saint Malo, France, 1997. [Härtig et al., 1998] Härtig, H., Baumgartl, R., Borriss, M., Hamann, C.-J., Hohmuth, M., Mehnert, F., Reuther, L., Schönberg, S., Wolter, J.: DROPS — OS Support for Distributed Multimedia Applications. 8th ACM SIGOPS European Workshop (SIGOPS EW'98), Sintra, Portugal, Sep. 1998. [Henning, 2001] Henning, P. A.: Taschenbuch Multimedia (2nd Ed.). München, Germany, Carl Hanser Verlag, ISBN 3-446-21751-7, 2001. [Hohmuth and Härtig, 2001] Hohmuth, M., Härtig, H.: Pragmatic Nonblocking Synchronization for Real-time Systems. USENIX Annual Technical Conference, Boston (MA), USA, Jun. 2001. [IBM Corp., 1968] IBM Corp.: Information Management Systems/360 (IMS/360) - Application Description Manual, New York (NY), USA, IBM Corp. Form No. H20-0524-1 White Plains, 1968. [IBM Corp., 2003] IBM Corp.: DB2 Universal Database: Image, Audio, and Video Extenders - Administration and Programming, Version 8. 1st Ed., Jun. 2003. [Ihde et al., 2000] Ihde, S. C., Maglio, P. P., Meyer, J., Barrett, R.: Intermediary-based Transcoding Framework. Poster Proc. of 9th Intl. World Wide Web Conference (WWW9), 2000. [Imaizumi et al., 2002] Imaizumi, S., Takagi, A., Kiya, H.: Lossless Inter-frame Video Coding using Extended JPEG2000. International Technical Conference on Circuits Systems, Computers and Communications (ITC CSCC '02), Phuket, Thailand Jul. 2002. [ISO 9000, 2005] ISO 9000: Standard 9000:2005 – Quality Management Systems – Fundamentals and Vocabulary, ISO Technical Committee 176 / SC1, Sep. 2005. [ITU-T Rec. H.262, 2000] ITU-T Rec. H.262: Information Technology – Generic Coding of Moving Pictures and Associated Audio Information: Video. Recommendation H.262, ITU-T, Feb. 2000. [ITU-T Rec. H.263+, 1998] ITU-T Rec. H.263+: Video coding for low bit rate communication (called H.263+). Recommendation H.263, ITU-T, Feb. 1998. [ITU-T Rec. H.263++, 2000] ITU-T Rec. H.263++: Video coding for low bit rate communication - Annex U,V,W (called H.263++). Recommendation H.263, ITU-T, Nov. 2000. [ITU-T Rec. H.263+++, 2005] ITU-T Rec. H.263+++: Video coding for low bit rate communication - Annex X and unified specification document (called H.263+++). Recommendation H.263, ITU-T, Jan. 2005. [ITU-T Rec. H.264, 2005] ITU-T Rec. H.264: for Generic Audiovisual Services. Recommendation H.264 & ISO/IES 14496-10 AVC, ITU-T & ISO/IES, Mar. 2005. [ITU-T Rec. P.835, 2003] ITU-T Rec. P.835: Subjective Test Methodology for Evaluating Speech Communication Systems that include Noise Suppression Algorithm. Recommendation P.835, ITU-T, Nov. 2003.


[ITU-T Rec. T.81, 1992] ITU-T Rec. T.81: Information Technology – Digital Compression and Coding of Continuous-Tone Still Images – Requirements and Guidelines. ITU-T Recommendation T.81 and ISO/IEC International Standard 10918-1, JPEG (ITU-T CCITT SG-7 and ISO/IEC JTC-1/SC-29/WG-10), Sep. 1992. [ITU-T Rec. X.641, 1997] ITU-T Rec. X.641: Information technology – Quality of Service: Framework. Recommendation X.641, ITU-T, Dec. 1997. [ITU-T Rec. X.642, 1998] ITU-T Rec. X.642: Information technology – Quality of Service: Guide to Methods and Mechanisms. Recommendation X.642, ITU-T, Sep. 1998. [Jaeger et al., 1999] Jaeger, T., Elphinstone, K., Liedke, J., Panteleenko, V., Park, Y.: Flexible Access Control Using IPC Redirection. 7th Workshop on Hot Topics in Operating Systems (HOTOS), Rio Rico (AZ), USA, IEEE Computer Society, Mar. 1999. [Jankiewicz and Wojciechowski, 2004] Jankiewicz, K., Wojciechowski, M.: Standard SQL/MM: SQL Multimedia and Application Packages. IX Seminarium PLUG "Przetwarzanie zaawansowanych struktur danych: Oracle interMedia, Spatial, Text i XML DB", Warsaw, Poland, Stowarzyszenie Polskiej Grupy Użytkowników systemu Oracle, Mar. 2004. [JTC1/SC32, 2007] JTC1/SC32: ISO/IEC 13249: 2002 Information technology -- Database languages -- SQL multimedia and application packages. ISO/IEC 13249 3rd Ed., ISO/IEC, 2007. [Käckenhoff et al., 1994] Käckenhoff, R., Merten, D., Meyer-Wegener, K.: "MOSS as Multimedia Object Server - Extended Summary". Multimedia: Advanced Teleservices and High Speed Communication Architectures, Proc. 2nd Int. Workshop - IWACA '94 (Heidelberg, Sept. 26-28, 1994), Ed. R. Steinmetz, Lecture Notes in Computer Science Vol.868, Heidelberg, Germany, Springer-Verlag [Kahrs and Brandenburg, 1998] Kahrs, M., Brandenburg, K.: Applications of Digital Signal Processing to Audio and Acoustics, Kluwer Academic Publishers, ISBN 0-7923-8130-0, 1998. [Kan and Fan, 1998] Kan, K.-S., Fan, K.-C.: Video Transcoding Architecture with Minimum Buffer Requirement for Compressed MPEG-2 Bitstream. Signal Processing Vol. 67(2), pp.223- 235, 1998. [Keesman et al., 1996] Keesman, G., Hellinghuizen, R., Hoeksema, F., Heideman, G.: Transcoding of MPEG Bitstream. Signal Processing - Image Communication Vol. 8(6), pp.481-500, 1996. [Khoshafian and Baker, 1996] Khoshafian, S., Baker, A.: MultiMedia and Imaging Databases, Morgan Kaufmann, ISBN 1-55860-312-3, 1996. [King et al., 2004] King, R., Popitsch, N., Westermann, U.: METIS: a Flexible Database Foundation for Unified Media Management. ACM Multimedia 2004 (ACMMM'04), New York (NY), USA, Oct. 2004. [Knutsson et al., 2003] Knutsson, B., Lu, H., Mogul, J., Hopkins, B.: Architecture and Performance of Server-Directed Transcoding. ACM Transactions on Internet Technology (TOIT) Vol. 3(4), pp.392-424, 2003. [Kuhn and Suzuki, 2001] Kuhn, P., Suzuki, T.: MPEG-7 Metadata for Video Transcoding: Motion and Difficulty Hint. SPIE Conference on Storage and Retrieval for Multimedia Databases, San Jose (CA), USA, SPIE Vol. 4315, 2001.


[LeBlanc and Markatos, 1992] LeBlanc, T. J., Markatos, E. P.: Shared Memory vs. Message Passing in Shared-Memory Multiprocessors. 4th IEEE Symposium on Parallel and Distributed Processing, Arlington (TX), USA Dec. 1992. [Lee et al., 2005] Lee, C.-J., Lee, K.-S., Park, Y.-C., Youn, D.-H.: Adaptive FFT Window Switching for Psychoacoustic Model in MPEG-4 AAC, Seoul, Korea, Yonsei University Digital Signal Processing Lab, pp.553, Jul. 2005. [LeGall, 1991] LeGall, D.: MPEG: A Video Compression Standard for Multimedia Applications. Communications of the ACM Vol. 34(4), pp.46-58, 1991. [Li and Shen, 2005] Li, K., Shen, H.: Coordinated Enroute Multimedia Object Caching in Transcoding Proxies for Tree Networks. ACM Transactions on Multimedia Computing, Communications and Applications Vol. 1(3), pp.289-314, 2005. [Li, 2001] Li, W.: Overview of Fine Granularity Scalability in MPEG-4 Video Standard. IEEE Trans. Circuits and Systems for Video Technology Vol. 11(3), pp.301-317, 2001. [Liang and Tran, 2001] Liang, J., Tran, T. D.: Fast Multiplierless Approximation of the DCT with the Lifting Scheme. IEEE Trans. Signal Processing Vol. 49(12), pp.3032-3044, 2001. [Liebchen et al., 2005] Liebchen, T., Moriya, T., Harada, N., Kamamoto, Y., Reznik, Y.: The MPEG-4 Audio Lossless Coding (ALS) Standard - Technology and Applications. 119th AES Convention, New York (NY), USA, Oct. 2005. [Liedtke, 1996] Liedtke, J.: L4 Reference Manual 486 Pentium Pentium Pro Version 2.0. Research Report RC 20549, Yorktown Heights (NY), USA, IBM T. J. Watson Research Center, Sep. 1996. [Lin et al., 1987] Lin, K. J., Natarajan, S., Liu, J. W. S.: Imprecise Results: Utilizing Partial Computations in Real-Time Systems. 8th IEEE Real-Time Systems Symposium (RTSS '87), San Jose (CA), USA, Dec. 1987. [Lindner et al., 2000] Lindner, W., Berthold, H., Binkowski, F., Heuer, A., Meyer-Wegener, K.: Enabling Hypermedia Videos in Multimedia Database Systems Coupled with Realtime Media Servers. International Symposium on Database Engineering & Applications (IDEAS), Yokohama, Japan, Sap. 2000. [Liu, 2003] Liu, S.: Audio-Video Conversion Benchmark “AVCOB” – Analysis, Design and Implementation. Master Thesis, Database Systems Chair. FAU Erlangen-Nuremberg, Erlangen, Germany [Löser et al., 2001a] Löser, J., Härtig, H., Reuther, L.: A Streaming Interface for Real-Time Interprocess Communication. Technical Report TUD-FI01-09, Operating Systems Group. TU Dresden, Dresden, Germany. Aug. 2001. [Löser et al., 2001b] Löser, J., Härtig, H., Reuther, L.: Position Summary: A Streaming Interface for Real-Time Interprocess Communication. 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), Schloss Elmau in Bavaria, Germany, May 2001. [Löser and Härtig, 2004] Löser, J., Härtig, H.: Low-latency Hard Real-Time Communication over Switched Ethernet. 16th Euromicro Conference on Real-Time Systems (ECRTS'04), Catania (Sicily), Italy, Jun.-Jul. 2004. [Löser and Aigner, 2007] Löser, J., Aigner, R.: Building Infrastructure for DROPS (BID) Specification. Publicly-Available Specification, Operating Systems Group. TU Dresden, Dresden, Germany. Apr. 25th, 2007.


[Lum and Lau, 2002] Lum, W. Y., Lau, F. C. M.: On Balancing between Transcoding Overhead and Spatial Consumption in Content Adaptation. 8th Intl. Conf. on Mobile Computing and Networking, Atlangta (GA), USA, ACM, Sep. 2002. [Luo, 1997] Luo, Y.: Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment. Internal Report, Los Alamos (NM), USA, Los Alamos National Laboratory (Scientific Computing Group CIC-19), Apr. 1997. [Märcz and Meyer-Wegener, 2002] Märcz, A., Meyer-Wegener, K.: Bandwidth-based Converter Description for Realtime Scheduling at Application Level in Media Servers. SDA Workshop 2002, Dresden, Germany, pp.10, Mar. 2002. [Märcz et al., 2003] Märcz, A., Schmidt, S., Suchomski, M.: Scheduling Data Streams in memo.REAL. Internal Communication, TU Dresden / FAU Erlangen, pp.8, Jan. 2003. [Marder and Robbert, 1997] Marder, U., Robbert, G.: The KANGAROO Project. Proc. 3rd Int. Workshop on Multimedia Information Systems, Como, Italy, Sep. 1997. [Marder, 2000] Marder, U.: VirtualMedia: Making Multimedia Database Systems Fit for World- wide Access. 7th Conference on Extending Database Technology (EDBT'02) - PhD Workshop, Konstanz, Germany, Mar. 2000. [Marder, 2001] Marder, U.: On Realizing Transformation Independence in Open, Distributed Multimedia Information Systems. Datenbanksysteme in Büro, Technik und Wissenschaft (BTW) Vol., pp.424-433, 2001. [Marder, 2002] Marder, U.: Multimedia Metacomputing in Webbasierten multimedialen Informationssytemen. PhD Thesis. Univeristy of Kaiserslautern, Kaiserslautern, Germany. 2002. [Margaritidis and Polyzos, 2000] Margaritidis, M., Polyzos, G.: On the Application of Continuous Media Filters over Wireless Networks. IEEE Int. Conf. on Multimedia and Expo (ICME'00), New York (NY), USA, IEEE Computer Society, Aug. 2000. [Marovac, 1983] Marovac, N.: On Interprocess Interaction in Distributed Architectures. ACM SIGARCH News Vol. 11(4), pp.17-22, 1983. [Maya et al., 2003] Maya, Anu, Asmita, Snehal, Krushna (MAASK): MigShm - Shared Memory over openMosix. Project Report on MigShm. From http://mcaserta.com/maask/Migshm_Report.pdf, Apr. 2003. [McQuillan and Walden, 1975] McQuillan, J. M., Walden, D. C.: Some Considerations for a High Performance Message-based Interprocess Communication System. ACM SIGCOMM/SIGOPS Workshop on Interprocess Communications - Applications, Technologies, Architectures, and Protocols for Computer Communication, 1975. [Mehnert et al., 2003] Mehnert, F., Hohmuth, M., Härtig, H.: Cost and Benefit of Separate Address Spaces in Real-Time Operating Systems. 23rd IEEE Real-Time Systems Symposium (RTSS'03), Austin, Texas, USA, Dec. 2003. [Mehrseresht and Taubman, 2005] Mehrseresht, N., Taubman, D.: An efficient content-adaptive motion compensated 3D-DWT with enhanced spatial and temporal scalability. Preprint submitted to IEEE Transactions on Image Processing, May 2005. [Meyer-Wegener, 2003] Meyer-Wegener, K.: Multimediale Datenbanken - Einsatz von Datenbanktechnik in Multimedia-Systemen (2. Auflage). Wiesbaden, Germany, B. G. Teubner Verlag / GWV Fachverlag GmbH, ISBN 3-519-12419-X, 2003.


[Meyerhöfer, 2007] Meyerhöfer, M. B.: Messung und Verwaltung von Softwarekomponenten für die Performancevorhersage. PhD Thesis, Database Systems Chair. FAU Erlangen-Nuremberg, Erlangen. Jul. 2004. [Microsoft Corp., 2002a] Microsoft Corp.: Introducing DirectShow for Automotive. MSDN Library - Mobil and Embedded Development Documentation, retrieved on Feb. 10th, 2002a. [Microsoft Corp., 2002b] Microsoft Corp.: The Filter Graph and Its Components. MSDN Library - DirectX 8.1 C++ Documentation, retrieved on Feb. 10th, 2002b. [Microsoft Corp., 2002c] Microsoft Corp.: AVI RIFF File Reference. MSDN Library - DirectX 9.0 DirectShow Appendix, retrieved on Nov. 22nd, from http://msdn.microsoft.com/archive/en-us/directx9_c/directx/htm/avirifffilereference.asp, 2002c. [Microsoft Corp., 2007a] Microsoft Corp.: [MS-MMSP]: Microsoft Media Server (MMS) Protocol Specification. MSDN Library, retrieved on Mar. 10th, from http://msdn2.microsoft.com/en- us/library/cc234711.aspx, 2007a. [Microsoft Corp., 2007b] Microsoft Corp.: Overview of the ASF Format. MSDN Library - Windows Media Format 11 SDK, retrieved on Jan. 21st, from http://msdn2.microsoft.com/en-us/library/aa390652.aspx, 2007b. [Mielimonka, 2006] Mielimonka, A.: The Real-Time Implementation of XVID Encoder in DROPS Supporting QoS for Video Streams. Study Project, Database Systems Chair. FAU Erlangen-Nuremberg, Erlangen, Germany. Sep. 2006. [Militzer et al., 2003] Militzer, M., Suchomski, M., Meyer-Wegener, K.: Improved p-Domain Rate Control and Perceived Quality Optimizations for MPEG-4 Real-time Video Applications. 11th ACM International Conference of Multimedia (ACM MM'03), Berkeley (CA), USA, Nov. 2003. [Militzer, 2004] Militzer, M.: Real-Time MPEG-4 Video Conversion and Quality Optimizations for Multimedia Database Servers. Diploma Thesis, Database Systems Chair. FAU Erlangen- Nuremberg, Erlangen, Germany. Jul. 2004. [Militzer et al., 2005] Militzer, M., Suchomski, M., Meyer-Wegener, K.: LLV1 – Layered Lossless Video Format Supporting Multimedia Servers During Realtime Delivery. Multimedia Systems and Applications VIII in conjuction to OpticsEast, Boston (MA), USA, SPIE Vol. 6015, pp.436-445, Oct. 2005. [Miller et al., 1998] Miller, F. W., Keleher, P., Tripathi, S. K.: General Data Streaming. 19th IEEE Real-Time Systems Sysmposium (RTSS), Madrid, Spain, Dec. 1998. [Minoli and Keinath, 1993] Minoli, D., Keinath, R.: Distributed Multimedia Through Broadband Communication. Norwood, UK, Artech House, ISBN 0-89006-689-2, 1993. [Mohan et al., 1999] Mohan, R., Smith, J. R., Li, C.-S.: Adapting Multimedia Internet Content for Universal Access. IEEE Trans. Multimedia Vol. 1(1), pp.104-114, 1999. [Morrison, 1997] Morrison, G.: Video Transcoders with Low Delay. IEICE Transactions on Communications Vol. E80-B(6), pp.963-969, 1997. [Mostefaoui et al., 2002] Mostefaoui, A., Favory, L., Brunie, L.: SIRSALE: a Large Scale Video Indexing and Content-Based Retrieving System. ACM Multimedia 2002 (ACMMM'02), Juan- les-Pins, France, Dec. 2002. [MPEG-1 Part III, 1993] MPEG-1 Part III: ISO/IEC 11172-3:1993 Information technology – Generic coding of moving pictures and associated audio for digital storage media at up to


about 1,5 Mbit/s – Part 3: Audio. ISO/IEC 11172-3 Audio, MPEG (ISO/IEC JTC-1/SC- 29/WG-11), 1993. [MPEG-2 Part I, 2000] MPEG-2 Part I: ISO/IEC 13818-1:2000 Information technology – Generic coding of moving pictures and associated audio information – Part 1: Systems. ISO/IEC 13818-1 Systems, MPEG (ISO/IEC JTC-1/SC-29/WG-11), Dec. 2000. [MPEG-2 Part II, 2001] MPEG-2 Part II: ISO/IEC 13818-2:2000 Information technology – Generic coding of moving pictures and associated audio information – Part 2: Video. ISO/IEC 13818-2 Video, MPEG (ISO/IEC JTC-1/SC-29/WG-11), Dec. 2000. [MPEG-2 Part VII, 2006] MPEG-2 Part VII: ISO/IEC 13818-2:2000 Information technology – Generic coding of moving pictures and associated audio information – Part 7: Advanced Audio Coding (AAC). ISO/IEC 13818-7 AAC Ed. 4, MPEG (ISO/IEC JTC-1/SC-29/WG- 11), Jan. 2006. [MPEG-4 Part I, 2004] MPEG-4 Part I: ISO/IEC 14496-1:2004 Information technology – Coding of audio-visual objects – Part 1: Systems (3rd Ed.). ISO/IEC 14496-1 3rd Ed., MPEG (ISO/IEC JTC-1/SC-29/WG-11), Nov. 2004. [MPEG-4 Part II, 2004] MPEG-4 Part II: ISO/IEC 14496-2:2004 Information technology – Coding of audio-visual objects – Part 2: Visual (3rd Ed.). ISO/IEC 14496-2 3rd Ed., MPEG (ISO/IEC JTC-1/SC-29/WG-11), Jun. 2004. [MPEG-4 Part III, 2005] MPEG-4 Part III: ISO/IEC 14496-3:2005 Information technology – Coding of audio-visual objects – Part 3: Audio (3rd Ed.). ISO/IEC 14496-3: , MPEG Audio Subgroup (ISO/IEC JTC-1/SC-29/WG-11), Dec. 2005. [MPEG-4 Part III FDAM5, 2006] MPEG-4 Part III FDAM5: ISO/IEC 14496- 3:2005/Amd.3:2006 Scalable Lossless Coding (SLS). ISO/IEC 14496-3 Amendment 3, MPEG Audio Subgroup (ISO/IEC JTC-1/SC-29/WG-11), Jun. 2006. [MPEG-4 Part IV Amd 8, 2005] MPEG-4 Part IV Amd 8: ISO/IEC 14496-4:2004/Amd.8:2005 High Efficiency Advanced Audio Coding, audio BIFS, and Structured Audio Conformance. ISO/IEC 14496-4 Amendment 8, MPEG Audio Subgroup (ISO/IEC JTC-1/SC-29/WG-11), May 2005. [MPEG-4 Part V, 2001] MPEG-4 Part V: ISO/IEC 14496-5:2001 Information technology – Coding of audio-visual objects – Part 5: Reference Software (2nd Ed.). ISO/IEC 14496-5 Software for Visual Part, MPEG (ISO/IEC JTC-1/SC-29/WG-11), 2001. [MPEG-4 Part X, 2007] MPEG-4 Part X: ISO/IEC 14496-10:2005/FPDAM 3 Information technology – Coding of audio-visual objects – Part 10: Advanced Video Coding – Amendment 3: Scalable Video Coding. ISO/IEC 14496-10 Final Proposal Draft, MPEG (ISO/IEC JTC- 1/SC-29/WG-11), Jan. 2007. [MPEG-7 Part III, 2002] MPEG-7 Part III: ISO/IEC 15938-3 Information Technology – Multimedia Content Description Interface – Part 3: Visual. ISO/IEC 15938-3, MPEG (ISO/IEC JTC-1/SC-29/WG-11), Apr. 2002. [MPEG-7 Part V, 2003] MPEG-7 Part V: ISO/IEC 15938-5 Information Technology – Multimedia Content Description Interface – Part 5: Multimedia Description Schemes. ISO/IEC 15938-5 Chapter 8 Media Description Tools, MPEG (ISO/IEC JTC-1/SC-29/WG-11), 2003.


[MPEG-21 Part I, 2004] MPEG-21 Part I: ISO/IEC 21000-1 Information Technology – Multimedia Framework (MPEG-21) – Part 1: Vision, Technologies and Strategy. ISO/IEC 21000-1 2nd Ed., MPEG (ISO/IEC JTC-1/SC-29/WG-11), Nov. 2004. [MPEG-21 Part II, 2005] MPEG-21 Part II: ISO/IEC 21000-2 Information Technology – Multimedia Framework (MPEG-21) – Part 2: Digital Item Declaration. ISO/IEC 21000-2 2nd Ed., MPEG (ISO/IEC JTC-1/SC-29/WG-11), Oct. 2005. [MPEG-21 Part VII, 2004] MPEG-21 Part VII: ISO/IEC 21000-7 Information Technology – Multimedia Framework (MPEG-21) – Part 7: Digital Item Adaptation. ISO/IEC 21000-7, MPEG (ISO/IEC JTC-1/SC-29/WG-11), Oct. 2004. [MPEG-21 Part XII, 2004] MPEG-21 Part XII: MPEG N5640 - ISO/IEC 21000-12 Information Technology – Multimedia Framework (MPEG-21) – Part 12: Multimedia Test Bed for Resource Delivery. ISO/IEC 21000-12 Working Draft 2.0, MPEG (ISO/IEC JTC-1/SC- 29/WG-11), Oct. 2004. [MPEG Audio Subgroup, 2005] MPEG Audio Subgroup: Verification Report on MPEG-4 SLS (MPEG2005/N7687). MPEG Meeting "Nice'05", Nice, France, Oct. 2005. [Nilsson, 2004] Nilsson, J.: Timers: Implement a Continuously Updating, High-Resolution Time Provider for Windows. MSDN Magazine. Vol. 3, 2004. [Oracle Corp., 2003] Oracle Corp.: Oracle interMedia User's Guide. Ver.10g Release 1 (10.1) -- Chapter 7, Section 7.4 Supporting Media Data Processing, 2003. [Östreich, 2003] Östreich, T.: Transcode – Linux Video Stream Processing Tool, retrieved on July 10th, from http://www.theorie.physik.uni-goettingen.de/~ostreich/transcode/ (http://www.transcoding.org/cgi-bin/transcode), 2003. [Pai et al., 1997] Pai, V., Druschel, P., Zwaenopoel, W.: IO-Lite: A Unified I/O Buffering and Caching System. Technical Report TR97-294, Computer Science. Rice University, Houston (TX), USA. 1997. [Pasquale et al., 1993] Pasquale, J., Polyzos, G., Anderson, E., Kompella, V.: Filter Propagation in Dissemination Trees: Trading Off Bandwidth and Processing in Continuous Media Networks. 4th Intl. Workshop ACM Network and Operating System Support for Digital Audio and Video (NOSSDAV'93), pp.259-268, Nov. 1993. [PEAQ, 2006] PEAQ: ITU-R B5.1387-1 – Implementation PQ-Eval-Audio, Part of "AFsp Library and Tools V8 R1" (ftp://ftp.tsp.ece.mc-gill.ca/TSP/AFsp/AFsp-v8r1..gz), Jan. 2006. [Penzkofer, 2006] Penzkofer, F.: Real-Time Audio Conversion and Format Independence for Multimedia Database Servers. Study Project, Database Systems Chair. FAU Erlangen-Nuremberg, Erlangen, Germany. Jul. 2006. [Posnak et al., 1996] Posnak, E. J., Vin, H. M., Lavender, R. G.: Presentation Processing Support for Adaptive Multimedia Applications. SPIE Multimedia Computing and Networking, San Jose (CA), USA, SPIE Vol. 2667, pp.234-245, Jan. 1996. [Posnak et al., 1997] Posnak, E. J., Lavender, R. G., Vin, H. M.: An Adaptive Framework for Developing Multimedia Software Components. Communications of the ACM Vol. 40(10), pp.43- 47, 1997. [QNX, 2001] QNX: QNX Neutrino RTOS (Ver.6.1). QNX Software Systems Ltd., 2001.


[Rakow et al., 1995] Rakow, T., Neuhold, E., Löhr, M.: Multimedia Database Systems – The Notions and the Issues. GI-Fachtagung, Dresden, Germany, Datenbanksysteme in Büro, Technik und Wissenschaft (BTW) [Rangarajan and Iftode, 2000] Rangarajan, M., Iftode, L.: Software Distributed Shared Memory over Virtual Interface Architecture - Implementation and Performance. 4th Annual Linux Conference, Atlanta (GA), USA, Oct. 2000. [Rao and Yip, 1990] Rao, K. R., Yip, P.: Discrete Cosine Transform: Algorithms, Advantages, Applications. San Diego (CA), USA, Academic Press, Inc., ISBN 0-12-580203-X, 1990. [Reuther and Pohlack, 2003] Reuther, L., Pohlack, M.: Rotational-Position-Aware Real-Time Disk Scheduling Using a Dynamic Active Subset (DAS). 24th IEEE International Real-Time Systems Symposium, Cancun, Mexico, Dec. 2003. [Reuther et al., 2006] Reuther, L., Aigner, R., Wolter, J.: Building Microkernel-Based Operating Systems: DROPS - The Dresden Real-Time Operating System (Lecture Notes Summer Term 2006), retrieved on Nov. 25th, 2006, from http://os.inf.tu-dresden.de/Studium/KMB/ (http://os.inf.tu-dresden.de/Studium/KMB/Folien/09-DROPS/09-DROPS.pdf), 2006. [Rietzschel, 2002] Rietzschel, C.: Portierung eines Video-Codecs auf DROPS. Study Project, Operating Systems Group - Institute for System Architecture. TU Dresden, Dresden, Germany. Dec. 2002. [Rietzschel, 2003] Rietzschel, C.: VERNER ein Video EnkodeR uNd -playER für DROPS. Master Thesis, Operating Systems Group - Institute for System Architecture. TU Dresden, Dresden, Germany. Sep. 2003. [Rohdenburg et al., 2005] Rohdenburg, T., Hohmann, V., Kollmeier, B.: Objective Perceptual Quality Measures for the Evaluation of Noise Reduction Schemes. International Workshop on Acoustic Echo and Noise Control '05, High Tech Campus, Eindhoven, The Netherlands, Sep. 2005. [Roudiak-Gould, 2006] Roudiak-Gould, B.: HuffYUV v2.1.1 Manual. Description and Source Code, retrieved on Jul. 15th, 2006, from http://neuron2.net/www.math.berkeley.edu/benrg/huffyuv.html, 2006. [Sayood, 2006] Sayood, K.: Introduction to Data Compression (3rd Ed.). San Francisco (CA), USA, Morgan Kaufman, 2006. [Schaar and Radha, 2000] Schaar, M., Radha, H.: MPEG M6475: Motion-Compensation Based Fine-Granular Scalability (MC-FGS) MPEG Meeting, MPEG (ISO/IEC JTC-1/SC-29/WG- 11), Oct. 2000. [Schäfer et al., 2003] Schäfer, R., Wiegand, T., Schwarz, H.: The emerging H.264/AVC Standard. EBU Technical Review, Berlin, Germany, Heinrich Hertzt Institute, Jan. 2003. [Schelkens et al., 2003] Schelkens, P., Andreopoulos, Y., Barbarien, J., Clerckx, T., Verdicchio, F., Munteanu, A., van der Schaar, M.: A comparative study of scalable video coding schemes utilizing wavelet technology. SPIE Photonics East - Wavelet Applications in Industrial Processing, SPIE Vol. 5266, Providence (RI), USA,. [Schlesinger, 2004] Schlesinger, L.: Qualitätsgetriebene Konstruktion globaler Sichten in Grid- organisierten Datenbanksystemen. PhD Thesis, Database Systems Chair. FAU Erlangen- Nuremberg, Erlangen. Jul. 2004.


[Schmidt et al., 2003] Schmidt, S., Märcz, A., Lehner, W., Suchomski, M., Meyer-Wegener, K.: Quality-of-Service Based Delivery of Multimedia Database Objects without Compromising Format Independence. 9th International Conference on Distributed Multimedia Systems (DMS'03), Miami (FL), USA Sep. 2003. [Schönberg, 2003] Schönberg, S.: Impact of PCI-Bus Load on Applications in a PC Architecture. 24th IEEE International Real-Time Systems Symposium, Cancun, Mexico, Dec. 2003. [Schulzrinne et al., 1996] Schulzrinne, H., Casner, S., Frederick, R., Jacobson, V.: RTP: A transport protocol for Real-Time Applications. RFC 1889, Jan. 1996. [Schulzrinne et al., 1998] Schulzrinne, H., Rao, A., Lanphier, R.: Real Time Streaming Protocol (RTSP). RFC 2326, Apr. 1998. [Shin and Koh, 2004] Shin, I., Koh, K.: Cost Effective Transcoding for QoS Adaptive Multimedia Streaming. Symposium on Applied Computing (SAC'04), Nicosia, Cyprus, ACM, Mar. 2004. [Singhal and Shivaratri, 1994] Singhal, M., Shivaratri, N.: Advanced Concepts in Operating Systems, McGraw-Hill, ISBN 978-0070575721, 1994. [Sitaram and Dan, 2000] Sitaram, D., Dan, A.: Multimedia Servers: Applications, Environments and Design, Morgan Kaufmann, ISBN 1-55860-430-8, 2000. [Skarbek, 1998] Skarbek, W.: Multimedia. Algorytmy i standardy kompresji. Warszawa, PL, Akademicka Oficyna Wydawnicza PLJ, 1998. [Sorial et al., 1999] Sorial, H., Lynch, W. E., Vincent, A.: Joint Transcoding of Multiple MPEG Video Bitstreams. IEEE International Symposium on Circuits and Systems (ISCAS'99), Orlando (FL), USA, May 1999. [Spier and Organick, 1969] Spier, M. J., Organick, E. I.: The multics interprocess communication facility. 2nd ACM Symposium on Operating Systems Principles, Princeton (NJ), USA, 1969. [Steinberg, 2004] Steinberg, U.: Quality-Assuring Scheduling in the Fiasco Microkernel. Master Thesis, Operating Systems Group. TU Dresden, Dresden, Germany. Mar. 2004. [Stewart, 2005] Stewart, J.: An Investigation of SIMD Instruction Sets. Study Project, School of Information Technology and Mathematical Sciences. University of Ballarat, Ballarat (Victoria), Australia. Nov. 2005. [Suchomski, 2001] Suchomski, M.: The Application of Specialist Program Suites in Network Servers Efficiency Research. Master Thesis. New University of Lisbon and Wroclaw University of Technology, Monte de Caparica - Lisbon, Portugal and Wroclaw, Poland. Jul. 2001. [Suchomski et al., 2004] Suchomski, M., Märcz, A., Meyer-Wegener, K.: Multimedia Conversion with the Focus on Continuous Media. Transformation of Knowledge, Information and Data. Ed.: P. van Bommel. London, UK, Information Science Publishing. Chapter XI, ISBN 159140528-9, 2004. [Suchomski et al., 2005] Suchomski, M., Militzer, M., Meyer-Wegener, K.: RETAVIC: Using Meta-Data for Real-Time Video Encoding in Multimedia Servers. ACM NOSSDAV '05, Skamania (WA), USA, Jun. 2005. [Suchomski and Meyer-Wegener, 2006] Suchomski, M., Meyer-Wegener, K.: Format Independence of Audio and Video in Multimedia Database Systems. 5th Internationa Conference on Multimedia and Network Information Systems 2006 (MiSSI '06), Wroclaw, Poland, Oficyna Wydawnicza Politechniki Wroclawskiej, pp.201-212, Sep. 2006.

[Suchomski et al., 2006] Suchomski, M., Meyer-Wegener, K., Penzkofer, F.: Application of MPEG-4 SLS in MMDBMSs – Requirements for and Evaluation of the Format. Audio Engineering Society (AES) 120th Convention, Paris, France, AES Preprint No. 6729, May 2006.
[Sun et al., 1996] Sun, H., Kwok, W., Zdepski, J.: Architectures for MPEG Compressed Bitstream Scaling. IEEE Trans. Circuits and Systems for Video Technology Vol. 6(2), 1996.
[Sun et al., 2005] Sun, H., Chen, X., Chiang, T.: Digital Video Transcoding for Transmission and Storage. Boca Raton (FL), CRC Press. Chapter 11: 391-413, 2005.
[Sun Microsystems Inc., 1999] Sun Microsystems Inc.: Java Media Framework API Guide (Nov. 19th, 1999), retrieved on Jan. 10th, 2003, from http://java.sun.com/products/java-media/jmf/2.1.1/guide/, 1999.
[Suzuki and Kuhn, 2000] Suzuki, T., Kuhn, P.: A proposal for segment-based transcoding hints. ISO/IEC M5847, Noordwijkerhout, The Netherlands, Mar. 2000.
[Symes, 2001] Symes, P.: Video Compression Demystified, McGraw-Hill, ISBN 0-07-136324-6, 2001.
[Tannenbaum, 1995] Tannenbaum, A. S.: Moderne Betriebssysteme - 2nd Ed., Prentice Hall International: 78-88, ISBN 3-446-18402-3, 1995.
[Topivata et al., 2001] Topivata, P., Sullivan, G., Joch, A., Kossentini, F.: Performance evaluation of H.26L, TML 8 vs. H.263++ and MPEG-4. Technical Report N18, ITU-T Q6/SG16 (VCEG), Sep. 2001.
[Tourapis, 2002] Tourapis, A. M.: Enhanced predictive zonal search for single and multiple frame motion estimation. SPIE Visual Communications and Image Processing, SPIE Vol. 4671, Jan. 2002.
[Tran, 2000] Tran, T. D.: The BinDCT – Fast Multiplierless Approximation of the DCT. IEEE Signal Processing Letters Vol. 7(6), 2000.
[Trybulec and Byliński, 1989] Trybulec, A., Byliński, C.: Some Properties of Real Numbers - Operations: min, max, square, and square root. Mizar Mathematical Library (MML) - Journal of Formalized Mathematics Vol. 1, 1989.
[van Doorn and de Vries, 2000] van Doorn, M. G. L. M., de Vries, A. P.: The Psychology of Multimedia Databases. 5th ACM Conference on Digital Libraries, San Antonio (TX), USA, 2000.
[Vatolin et al., 2005] Vatolin, D., Kulikov, D., Parshin, A., Kalinkina, D., Soldatov, S.: MPEG-4 Video Codecs Comparison, retrieved in Mar. 2005, from http://www.compression.ru/video/codec_comparison/mpeg-4_en.html, 2005.
[Vetro et al., 2000] Vetro, A., Sun, H., Divakaran, A.: Adaptive Object-Based Transcoding using Shape and Motion-Based Hints. ISO/IEC M6088, Geneva, Switzerland, May 2000.
[Vetro, 2001] Vetro, A.: Object-Based Encoding and Transcoding. PhD Thesis, Electrical Engineering. Polytechnic University, Brooklyn (NY), USA. Jun. 2001.
[Vetro et al., 2001] Vetro, A., Sun, H., Wang, Y.: Object-Based Transcoding for Adaptable Video Content Delivery. IEEE Transactions on Circuits and Systems for Video Technology Vol. 11(3), pp.387-401, 2001.
[Vetro, 2003] Vetro, A.: Transcoding, Scalable Coding & Standardized Metadata. International Workshop Very Low Bitrate Video (VLBV) Vol. 2849, Urbana (IL), USA, Sep. 2003.
[Vetro et al., 2003] Vetro, A., Christopoulos, C., Sun, H.: Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine Vol. 20(2), pp.18-29, 2003.

[Vetro, 2004] Vetro, A.: MPEG-21 Digital Item Adaptation: Enabling Universal Media Access. IEEE Multimedia Vol. 11, pp.84-87, 2004.
[VQEG (ITU), 2005] VQEG (ITU): Tutorial - Objective Perceptual Assessment of Video Quality: Full Reference Television, ITU Video Quality Expert Group, Mar. 2004.
[Wallace, 1991] Wallace, G. K.: The JPEG Still Picture Compression Standard. Communications of the ACM Vol. 34, pp.30-34, 1991.
[Wang et al., 2004] Wang, Y., Huang, W., Korhonen, J.: A Framework for Robust and Scalable Audio Streaming. ACM Multimedia '04 (ACMMM'04), New York (NY), USA, ACM, Oct. 2004.
[Warnes, 2000] Warnes, G. R.: A Recipe for a diskless MOSIX cluster using Cluster-NFS, retrieved on May 10th, 2000, from http://clusternfs.sourceforge.net/Recipe.pdf, 2000.
[Weinberger et al., 2000] Weinberger, M. J., Seroussi, G., Sapiro, G.: The LOCO-I Lossless Image Compression Algorithm: Principles and Standardization into JPEG-LS. IEEE Trans. on Image Processing Vol. 9(8), pp.1309-1324, 2000.
[Wen et al., 2003] Wen, J.-R., Li, Q., Ma, W.-Y., Zhang, H.-J.: A Multi-paradigm Querying Approach for a Generic Multimedia Database Management System. ACM SIGMOD Record Vol. 32(1), pp.26-34, 2003.
[Wendelska, 2007] Wendelska, J. A.: Optimization of the MPEG-4 SLS Implementation for Scalable Lossless Audio Coding. Diploma Thesis, Database Systems Chair. FAU Erlangen-Nuremberg, Erlangen, Germany. Aug. 2007.
[Westerink et al., 1999] Westerink, P. H., Rajagopalan, R., Gonzales, C. A.: Two-pass MPEG-2 variable-bit-rate encoding. IBM Journal of Research and Development - Digital Multimedia Technology Vol. 43(4), pp.471, 1999.
[Wittmann and Zitterbart, 1997] Wittmann, R., Zitterbart, M.: Towards Support for Heterogeneous Multimedia Communications. 6th IEEE Workshop on Future Trends of Distributed Computing Systems, Bologna, Italy, Nov. 2000.
[Wittmann, 2005] Wittmann, R.: A Real-Time Implementation of a QoS-aware Decoder for the LLV1 Format. Study Project, Database Systems Chair. FAU Erlangen-Nuremberg, Erlangen, Germany. Nov. 2005.
[Wu et al., 2001] Wu, F., Li, S., Zhang, Y.-Q.: A Framework for Efficient Progressive Fine Granularity Scalable Video Coding. IEEE Trans. Circuits and Systems for Video Technology Vol. 11(3), pp.332-344, 2001.
[WWW_AlparySoft, 2004] WWW_AlparySoft: Lossless Video Codec - Ver. 2.0 Alpha, retrieved on Dec. 17th, 2004, from http://www.alparysoft.com/products.php?cid=8, 2004.
[WWW_Doom9, 2003] WWW_Doom9: Codec shoot-out 2003 – 1st Installment, retrieved on Apr. 10th, 2003, from http://www.doom9.org/codecs-103-1.htm, 2003.
[WWW_DROPS, 2006] WWW_DROPS: The Dresden Real-Time Operating System Project, retrieved on Oct. 23rd, 2006, from http://os.inf.tu-dresden.de/drops/, 2006.
[WWW_FAAC, 2006] WWW_FAAC: FAAC - Advanced Audio Coder (Ver. 1.24), retrieved on Dec. 10th, 2006, from http://sourceforge.net/projects/faac/, 2006.
[WWW_FAAD, 2006] WWW_FAAD: FAAD - Freeware Advanced Audio Coder (Ver. 2.00), retrieved on Nov. 10th, 2006, from http://www.audiocoding.com, 2006.

[WWW_FFMPEG, 2003] WWW_FFMPEG: FFmpeg Documentation, retrieved on Nov. 23rd, 2006, from http://ffmpeg.mplayerhq.hu/ffmpeg-doc.html, 2003.
[WWW_FLAC, 2006] WWW_FLAC: Free Lossless Audio Codec (FLAC), retrieved on Feb. 28th, 2006, from http://flac.sourceforge.net/, 2006.
[WWW_LAME, 2006] WWW_LAME: Lame Version 3.96.1, “Lame Ain’t an MP3 Encoder”, retrieved on Dec. 10th, 2006, from http://lame.sourceforge.net, 2006.
[WWW_MA, 2006] WWW_MA: Monkey’s Audio - A Fast and Powerful Lossless Audio Compressor, retrieved on Sep. 23rd, 2006, from http://www.monkeysaudio.com/, 2006.
[WWW_MPEG SQAM, 2006] WWW_MPEG SQAM: MPEG Sound Quality Assessment Material – Subset of EBU SQAM, retrieved on Nov. 15th, from http://www.tnt.uni-hannover.de/project/mpeg/audio/sqam/, 2006.
[WWW_OGG, 2006] WWW_OGG: Ogg (libogg), Vorbis (libvorbis) and OggEnc (vorbis-tools) Version 1.1, retrieved on Dec. 10th, 2006, from http://www.xiph.org/vorbis/, 2006.
[WWW_Retavic - Audio Set, 2006] WWW_Retavic - Audio Set: Evaluation Music Set, retrieved on Jan. 10th, from http://www6.informatik.uni-erlangen.de/research/projects/retavic/audio/, 2006.
[WWW_VQEG, 2007] WWW_VQEG: Official Website of Video Quality Expert Group - Test Video Sequences, retrieved on Feb. 14th, 2007, from http://www.its.bldrdoc.gov/vqeg/ (ftp://vqeg.its.bldrdoc.gov/, mirror with thumbnails: http://media.xiph.org/vqeg/TestSeqences/), 2007.
[WWW_WP, 2006] WWW_WP: WavPack - Hybrid Lossless Audio Compression, retrieved on Feb. 26th, 2006, from http://www.wavpack.com/, 2006.
[WWW_XIPH, 2007] WWW_XIPH: Xiph.org Test Media - Derf's Collection of Test Video Clips, retrieved on Feb. 14th, from http://media.xiph.org/video/derf/, 2007.
[WWW_XVID, 2003] WWW_XVID: XVID MPEG-4 Video Codec v.1.0, retrieved on Apr. 3rd, 2003, from http://www.xvid.org, 2003.
[Wylie, 1994] Wylie, F.: Tandem Coding of Digital Audio Data Compression Algorithms. 96th Convention of Audio Engineering Society (AES), Belfast, N. Ireland, AES No. 3784, Feb. 1994.
[Yeadon, 1996] Yeadon, N. J.: Quality of Service Filtering for Multimedia Communications. PhD Thesis. Lancaster University, Lancaster, UK, 1996.
[Youn, 2008] Youn, J.: Method of Making a Window Type Decision Based on MDCT Data in Audio Encoding. U.S. Patent and Trademark Office (PTO), USA, Sony Corporation (Tokyo, JP), 2008.
