
Article

Two-Dimensional Audio Compression Method Using Video Coding Schemes

Seonjae Kim 1 , Dongsan Jun 1,*, Byung-Gyu Kim 2,*, Seungkwon Beack 3, Misuk Lee 3 and Taejin Lee 3

1 Department of Convergence IT Engineering, Kyungnam University, Changwon 51767, Korea; [email protected]
2 Department of IT Engineering, Sookmyung Women's University, Seoul 04310, Korea
3 Electronics and Telecommunications Research Institute (ETRI), Daejeon 34129, Korea; [email protected] (S.B.); [email protected] (M.L.); [email protected] (T.L.)
* Correspondence: [email protected] (D.J.); [email protected] (B.-G.K.)

Abstract: As video compression is one of the core technologies enabling seamless media streaming within the available network bandwidth, it is crucial that media codecs support powerful coding performance and high visual quality. Versatile Video Coding (VVC) is the latest video coding standard developed by the Joint Video Experts Team (JVET); it can compress original image or video data by a factor of several hundred, whereas the latest audio coding standard, Unified Speech and Audio Coding (USAC), achieves a compression rate of about 20 times for audio or speech data. In this paper, we propose a pre-processing method to generate a two-dimensional (2D) audio signal as an input of a VVC encoder, and investigate the applicability of video coding schemes to 2D audio compression. To evaluate the coding performance, we measure both signal-to-noise ratio (SNR) and bits per sample (bps). The experimental results show the possibility of researching 2D audio encoding using video coding schemes.

Keywords: audio compression; video compression; next generation audio coding; versatile video coding (VVC); mu-law encoding; linear mapping

Citation: Kim, S.; Jun, D.; Kim, B.-G.; Beack, S.; Lee, M.; Lee, T. Two-Dimensional Audio Compression Method Using Video Coding Schemes. Electronics 2021, 10, 1094. https://doi.org/10.3390/electronics10091094

Academic Editor: Juan M. Corchado

Received: 23 March 2021; Accepted: 3 May 2021; Published: 6 May 2021

1. Introduction

As consumer demands for realistic and rich media services on low-end devices are rapidly increasing in the field of multimedia delivery and storage applications, the need for audio and video codecs with powerful coding performance is growing; such codecs must achieve minimum bitrates while maintaining higher perceptual quality relative to the original data. As the state-of-the-art video coding standard, Versatile Video Coding (VVC) [1] was developed by the Joint Video Experts Team (JVET) of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). VVC can achieve a bitrate reduction of 50% with similar visual quality compared with its predecessor, High Efficiency Video Coding (HEVC) [2]. In the field of audio coding technology, Unified Speech and Audio Coding (USAC) [3] was developed by the MPEG audio group as the latest audio coding standard, which integrates speech and audio coding schemes. Whereas VVC can accomplish significant coding performance on original image or video data, USAC can provide about 20 times compression of original audio or speech data.

In this study, we conducted various experiments to verify the possibility of compressing two-dimensional (2D) audio signals using video coding schemes. In detail, we converted a 1D audio signal into a 2D audio signal with the proposed 2D audio conversion process as an input of a VVC encoder. We used both signal-to-noise ratio (SNR) and bits per sample (bps) for performance evaluations.


The remainder of this paper is organized as follows: In Section 2, we overview the video coding tools, which are newly adopted in VVC. In Section 3, we describe the proposed methods for converting a 1D audio signal into a 2D audio signal. Finally, the experimental results and conclusions are provided in Sections 4 and 5, respectively.

2. Overview of VVC

VVC can provide powerful coding performance compared with HEVC. One of the main differences between HEVC and VVC is the block structure. Both HEVC and VVC commonly specify the coding tree unit (CTU) as the largest coding unit, which has a changeable size depending on the encoder configuration. In addition, a CTU can be split into four coding units (CUs) by a quad tree (QT) structure to adapt to a variety of block properties. In HEVC, a CU can be further partitioned into one, two, or four prediction units (PUs) according to the PU splitting type. After obtaining the residual block derived from the PU-level intra- or inter-prediction, a CU can be partitioned into multiple transform units (TUs) according to a residual quad-tree (RQT) structure similar to that of the CU split.

VVC substitutes the concepts of multiple partition unit types (CU, PU, and TU) with a QT-based multi-type tree (QTMTT) block structure, where the MTT is classified into binary tree (BT) split and ternary tree (TT) split to support more flexibility for CU partition shapes. This means that a QT can be further split by the MTT structure after a CTU is first partitioned by a QT structure. As depicted in Figure 1, VVC specifies four MTT split types: vertical binary split (SPLIT_BT_VER), horizontal binary split (SPLIT_BT_HOR), vertical ternary split (SPLIT_TT_VER), and horizontal ternary split (SPLIT_TT_HOR), as well as QT split. In VVC, a QT or MTT node is considered a CU for prediction and transform processes without any further partitioning schemes. Note that CU, PU, and TU have the same block size in the VVC block structure. In other words, a CU in VVC can have either a square or rectangular shape, whereas a CU in HEVC always has a square shape.
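To make the QTMTT split geometry concrete, the sketch below (our own illustrative helper, not part of the VVC reference software) computes the child block sizes produced by each of the five split types; the 1:2:1 ternary ratio follows the VVC design described above.

```python
def split_children(width, height, split_type):
    """Return the (width, height) of each child block for one VVC split type.

    QT quarters the block; BT halves it along one axis; TT divides one axis
    in a 1:2:1 ratio, following the QTMTT block structure of VVC.
    """
    if split_type == "QT":
        return [(width // 2, height // 2)] * 4
    if split_type == "SPLIT_BT_VER":    # vertical binary: split the width
        return [(width // 2, height)] * 2
    if split_type == "SPLIT_BT_HOR":    # horizontal binary: split the height
        return [(width, height // 2)] * 2
    if split_type == "SPLIT_TT_VER":    # vertical ternary: 1:2:1 widths
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    if split_type == "SPLIT_TT_HOR":    # horizontal ternary: 1:2:1 heights
        return [(width, height // 4), (width, height // 2), (width, height // 4)]
    raise ValueError(split_type)

# A 32x32 CU split by SPLIT_TT_VER yields 8x32, 16x32, and 8x32 children,
# i.e., rectangular CUs that HEVC's square-only CU structure cannot represent.
```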

Figure 1. VVC block structure (QTMTT).

Table 1 shows the newly adopted coding tools between HEVC and VVC. In general, intra-prediction generates a predicted block from the reconstructed neighboring pixels of the current block. As shown in Figure 2, VVC can provide 67 intra-prediction modes, where modes 0 and 1 are the planar and DC modes, respectively, and the others are angular prediction modes to represent edge directions. According to [4], the VVC test model (VTM) [5] achieves 25% higher compression performance than the HEVC test model (HM) [6] under the all-intra (AI) configuration recommended by the JVET Common Test Conditions (CTC) [7]. This improvement was mainly realized by the newly adopted coding tools, such as position dependent intra-prediction combination (PDPC) [8], cross component linear model intra-prediction (CCLM) [9], wide angle intra-prediction (WAIP) [10], and matrix-based intra-prediction (MIP) [11]; however, the computational complexity of the encoder increased by approximately 26 times [4].



Table 1. The list of video coding tools: comparison between HEVC and VVC.

Category          HEVC                        VVC
Intra prediction  35 intra-prediction modes   67 intra-prediction modes;
                                              wide angle intra-prediction (WAIP);
                                              position-dependent intra-prediction combination (PDPC);
                                              matrix-based intra-prediction (MIP);
                                              multi reference line prediction (MRL);
                                              intra sub-block partitioning (ISP);
                                              cross-component linear model (CCLM)
Skip/Merge        Regular skip/merge          Regular skip/merge;
                                              sub-block-based temporal motion vector predictors (SbTMVP);
                                              geometric partitioning mode (GPM);
                                              merge with motion vector difference (MMVD);
                                              decoder-side motion vector refinement (DMVR);
                                              bi-directional optical flow (BDOF);
                                              combined inter- and intra-prediction (CIIP)
Inter prediction  Advanced motion vector      Advanced motion vector prediction (AMVP);
                  prediction (AMVP)           symmetric motion vector difference (SMVD);
                                              affine inter-prediction;
                                              adaptive motion vector resolution (AMVR);
                                              bi-prediction with CU-level weight (BCW)

Figure 2. Intra-prediction modes in VVC.

Inter-prediction fetches a predicted block from previously decoded reference frames using motion estimation (ME) and motion compensation (MC) processes. The motion parameters can be transmitted to the decoder in either an explicit or implicit manner. When a CU is coded in SKIP or MERGE mode, the encoder does not transmit any coding parameters, such as significant residual coefficients, motion vector differences, or a reference picture index. In the cases of the SKIP and MERGE modes, the motion information for the current CU is derived from the neighboring CUs, including spatial and temporal candidates. VVC adopted several new coding tools for SKIP/MERGE and inter-prediction, such as affine inter-prediction [12], geometric partitioning mode (GPM) [13], merge with motion vector difference (MMVD) [14], decoder-side motion vector refinement (DMVR) [15], combined inter- and intra-prediction (CIIP) [16], and bi-prediction with CU-level weight (BCW) [17]. VVC also adopted adaptive motion vector resolution (AMVR) [18], symmetric motion vector difference (SMVD) [19], and advanced motion vector prediction (AMVP) to save the bitrate of the motion parameters.
Although these tools can significantly improve the coding performance, the computational complexity of a VVC encoder is up to nine times higher under the random-access (RA) configuration compared with that of HEVC [4].

Per the JVET CTC [7], VVC supports three encoding configurations [20]: all-intra (AI), low-delay (LD), and random-access (RA). In the AI configuration, each picture is encoded as an intra picture and does not use temporal reference pictures. Conversely, in the LD configuration, only the first frame in a video sequence is encoded as an intra frame; subsequent frames are encoded using inter-prediction tools that exploit temporal reference pictures, as well as intra-prediction tools. In the RA coding mode, an intra frame is encoded at approximately one-second intervals in accordance with the intra period, and the other frames are encoded as hierarchical B frames along the group of pictures (GOP) size. This means that a delay may occur in the RA configuration because the display order differs from the encoding order.

Although the aforementioned VVC coding tools provide significant compression performance compared with HEVC, the computational complexity of the encoder is high [21]. Therefore, many researchers have studied various fast encoding algorithms for the block structure, intra-prediction, and inter-prediction of VVC while maintaining the visual quality. In terms of block structure, Fan et al. proposed a fast QTMT partition algorithm based on variance and a Sobel operator to choose only one partition from five QTMT partitions in the process of intra-prediction [22]. Jin et al. proposed a fast QTBT partition method through a convolutional neural network (CNN) [23]. To reduce the computational complexity of intra-prediction, Yang et al. proposed a fast intra-mode decision method with gradient descent search [24], and Dong et al. proposed an adaptive mode pruning method that removes the non-promising intra-modes [25]. In terms of inter-prediction, Park and Kang proposed a fast affine motion estimation method using statistical characteristics of parent and current block modes [26].

3. VVC-Based 2D Audio Compression Method

3.1. Overall Framework of 2D Audio Coding

In general, an original 1D audio signal is generated by pulse code modulation (PCM), which involves sampling, quantization, and binarization processes. In this study, we used a 48 kHz sampled audio signal, and each sample was already mapped to a signed 16 bits binary number by PCM coding. As the inputs of VVC specify a video sequence, which is an unsigned 2D signal with at most 12 bits per integer sample [7], we converted the original 1D audio signal into an unsigned 2D audio signal as a pre-processing step before VVC encoding.

As shown in Figure 3, the overall architecture for 2D audio compression can be divided into the encoder and decoder sides, where the encoder side consists of the 2D audio conversion process and the VVC encoder. In the 2D audio conversion process, signed 1D audio signals with a bit-depth of 16 bits are mapped to unsigned 1D audio signals with a bit-depth of 8 or 10 bits, and then unsigned 2D audio signals are generated by 1D-to-2D packing as an input of the VVC encoder. After receiving a compressed bitstream, the decoder side conducts VVC decoding and the inverse 2D audio conversion process to reconstruct the 1D audio signal with a bit-depth of 16 bits. Note that the coding performance of VVC may strongly depend on how the 1D signal is converted to a 2D signal.

3.2. Proposed 2D Audio Conversion Process

To generate unsigned 1D audio signals, we exploited three approaches as follows:

1. Non-linear mapping (NLM) to generate unsigned 8 bits per sample;
2. Linear mapping (LM) to generate unsigned 10 bits per sample;
3. Adaptive linear mapping (ALM) to generate unsigned 10 bits per sample.

Figure 3. Overall architecture of VVC-based 2D audio coding.

NLM by mu-law [27] is primarily used in 8 bits PCM digital telecommunication systems as one of the two versions of ITU-T G.711. It reduces the dynamic range of an original signed 1D audio signal, as expressed in Equation (1):

NLM(I) = sgn(I) · ln(1 + (2^N − 1)·|I|) / ln(1 + (2^N − 1)),   (1)

where sgn(I) and N denote the sign function and the bit-depth of the output samples, respectively.

Similarly, linear mapping (LM) maps the dynamic range of an original signed 1D audio signal into a specified sample range depending on the maximum absolute value, Max(|I|), of the samples. As shown in Figure 4, the sample dynamic range of LM is from −Max to Max. This range is divided into uniform intervals, and each interval is represented as an unsigned N bits value by Equation (2):

LM(I) = ⌊ (I + Max(|I|)) / (2·Max(|I|)) · (2^N − 1) ⌋,   (2)

where ⌊x⌋ and Max(|I|) indicate the rounding operation and the maximum absolute value over all samples, respectively. Note that the encoder side transmits the value of Max(|I|) with a bit-depth of 16 bits for the signed 1D reconstruction. Since this method has to search for the maximum value over all audio samples, the process becomes slower as the audio signal grows longer. In addition, an unused sample range, called the dead zone, may exist in the range of the unsigned 1D signal, as shown in Figure 4. To resolve the aforementioned problems, we considered adaptive linear mapping (ALM). As shown in Figure 5, ALM first searches for both the Min(I_k) and Max(I_k) values among the signed 1D audio samples of each frame and then performs the LM conversion to represent the unsigned N bits 1D audio signal, as expressed in Equation (3).

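As an illustration of Equations (1) and (2), the two mappings can be sketched in Python as follows. The function names are ours, and the final offset-and-round step that turns the mu-law companded value into an unsigned N-bit code is our assumption, since the paper specifies only the companding itself:

```python
import math

def nlm(x, n_bits=8):
    """Mu-law companding of a normalized sample x in [-1, 1] (Equation (1)),
    followed by an assumed uniform mapping to an unsigned n-bit code."""
    mu = (1 << n_bits) - 1                       # 2^N - 1, i.e., 255 for 8 bits
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)  # in [-1, 1]
    return int((y + 1.0) / 2.0 * mu + 0.5)       # offset-and-round (assumption)

def lm(x, max_abs, n_bits=10):
    """Linear mapping (Equation (2)): the range [-max_abs, +max_abs] is mapped
    uniformly onto the unsigned n-bit codes 0 .. 2^N - 1."""
    levels = (1 << n_bits) - 1
    return int((x + max_abs) / (2.0 * max_abs) * levels + 0.5)
```

For 16 bits PCM input, max_abs would be the transmitted Max(|I|) value; the rounding here corresponds to the ⌊·⌋ operation of Equation (2).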

Because a video encoder generally compresses input data in the unit of a 2D frame, ALM can reduce the encoding delay to the unit of a frame and avoid the dead zone problem compared with LM. Therefore, the encoder side has to transmit both the Min(I_k) and Max(I_k) values for each frame, where k is the index of the frames.

ALM(I_k) = ⌊ (I_k − Min(I_k)) / (Max(I_k) − Min(I_k)) · (2^N − 1) ⌋   (3)

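Equation (3) operates per frame; below is a minimal sketch (hypothetical helper names, same rounding assumption as above), together with the decoder-side inverse that consumes the transmitted Min/Max side information:

```python
def alm_frame(samples, n_bits=10):
    """Adaptive linear mapping (Equation (3)) for one frame of signed samples.
    Returns the unsigned n-bit codes plus the (min, max) side information the
    encoder transmits per frame so the decoder can invert the mapping."""
    lo, hi = min(samples), max(samples)
    levels = (1 << n_bits) - 1
    scale = levels / (hi - lo) if hi != lo else 0.0   # guard a constant frame
    codes = [int((s - lo) * scale + 0.5) for s in samples]
    return codes, (lo, hi)

def alm_inverse(codes, lo, hi, n_bits=10):
    """Decoder-side inverse: map unsigned codes back to the signed range."""
    levels = (1 << n_bits) - 1
    return [lo + c * (hi - lo) / levels for c in codes]
```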
Figure 4. Graphical representation of linear mapping, where the orange area indicates the unused sample range called the dead zone.

Figure 5. Graphical representation of adaptive linear mapping.

After performing the unsigned 1D conversion, the 2D audio signal is generated from the unsigned 1D audio signals used as the inputs of the VVC encoder. For example, if the spatial resolution (width × height) of a frame is 256 × 256, the unsigned 1D audio signals, whose total number is 256 × 256, are arranged in a frame according to the raster-scan order, as shown in Figure 6. Figure 7 shows the 2D packing results according to the aforementioned 2D conversion approaches. Although the 2D audio signals showed different and complex texture patterns between consecutive frames, we confirmed that the 2D signals had a horizontally strong directionality within a frame.
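The raster-scan 1D-to-2D packing described above can be sketched as follows; how a final partial frame is padded is not stated in the paper, so the zero-padding here is our assumption (with 9,371,648 samples and 256 × 256 frames the division is exact anyway):

```python
def pack_2d(samples, width=256, height=256, pad_value=0):
    """Arrange a 1D list of unsigned samples into width x height frames in
    raster-scan order (row by row). The last frame is padded with pad_value
    if the signal length is not a multiple of width * height (assumption)."""
    per_frame = width * height
    frames = []
    for start in range(0, len(samples), per_frame):
        chunk = samples[start:start + per_frame]
        chunk += [pad_value] * (per_frame - len(chunk))       # pad final frame
        frames.append([chunk[r * width:(r + 1) * width] for r in range(height)])
    return frames

# 9,371,648 samples at 256 x 256 samples per frame -> exactly 143 frames,
# matching the frame count reported in Section 4.
```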


Figure 6. The 2D packing of 1D audio signals (256 × 256 spatial resolution).

Figure 7. Visualization of consecutive 2D audio signals in the unit of frames (256 × 256 spatial resolution).

4. Experimental Results

In this study, we used original audio signals (48 kHz, 16 bits) whose total length was approximately 195 s and which consisted of five music, five speech, and five mixed items, as described in Table 2. To compress a 2D audio signal using the VVC encoder, we first set the spatial resolution to 256 × 256. Because the original audio signal had 9,371,648 samples, we generated 143 frames with a 256 × 256 spatial resolution. As demonstrated in the spectrograms in Figure 8, ALM is superior to the other conversion methods in terms of SNR. After generating the 2D audio sequences with 143 frames, we compressed them with the VVC encoder under the experimental environments presented in Table 3.

Table 2. Description of audio signal items.

Category   Item Name           Description
Music      salvation           Classical chorus music
           te15                Classical music
           Music_1             Rock music
           Music_3             Pop music
           phi7                Classical music
Speech     es01                English speech
           louis_raquin_15     French speech
           Wedding_speech      Korean speech
           te1_mg54_speech     German speech
           Arirang_speech      Korean speech
Mixed      twinkle_ff51        Speech with pop music
           SpeechOverMusic_1   Speech with chorus music
           SpeechOverMusic_4   Speech with pop music
           HarryPotter         Speech with background music
           lion                Speech with background music

Figure 8. Spectrograms of the music_1 audio signal.

To evaluate the coding performance, we measured the SNR between the original and reconstructed 1D audio signals after decoding the compressed bitstream. Table 4 shows that NLM produced the worst coding performance due to the incorrectness of the signed 1D conversion. In addition, 10 bits LM had a higher SNR than ALM regardless of the encoding mode, as shown in Table 4 and Figure 9.
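The SNR used here is the standard ratio of signal power to reconstruction-error power in dB; a minimal sketch with our own helper name:

```python
import math

def snr_db(original, reconstructed):
    """Signal-to-noise ratio in dB between two equal-length 1D signals:
    10 * log10(sum(x^2) / sum((x - x_hat)^2))."""
    signal_power = sum(x * x for x in original)
    noise_power = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    if noise_power == 0:
        return float("inf")                      # perfect reconstruction
    return 10.0 * math.log10(signal_power / noise_power)
```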


Table 3. Experimental environments for 2D audio encoding in the VVC.

Input bit-depth: 8 bits (NLM), 10 bits (LM), 10 bits (ALM)
VTM version: VTM10.0
Chroma format: YUV4:0:0
Encoding mode: All-intra, Low-delay-B, Random-access
Target bps: 0.5 and 1.0

Table 4. Experimental results according to the different conversion modes.

                              8 bits NLM       10 bits LM       10 bits ALM
Encoding Mode   Target bps    SNR (dB)  bps    SNR (dB)  bps    SNR (dB)  bps
All-intra       1.0           14.09     1.01   30.55     1.03   29.30     1.02
                0.5           6.78      0.49   23.21     0.50   22.18     0.52
Low-delay-B     1.0           13.78     1.01   30.17     0.97   29.66     1.03
                0.5           7.15      0.49   23.55     0.49   22.64     0.52
Random-access   1.0           13.63     0.99   30.28     0.99   29.21     1.03
                0.5           7.55      0.52   23.62     0.51   21.82     0.50

Figure 9. Rate–distortion (RD) curves according to the different conversion methods.

We further investigated whether the objective performance was well reflected in the subjective quality. For subjective assessment, we conducted a MUSHRA test [28], which is commonly used to evaluate subjective quality in the MPEG audio group. In this experiment, we used 10 audio items with a length of 10 s, and seven listeners participated. To enable comparison with the state-of-the-art audio coding method, the same audio signal was coded by USAC [29]. Because the encoded frames have different quantization parameters (QPs) according to the temporal layer of the LD and RA configurations, the SNR per frame can fluctuate between consecutive frames. Therefore, we performed the MUSHRA tests under the AI configuration, which has no QP variations. In the AI configuration, we set the target to 0.5 bps and conducted MUSHRA listening tests for the original, USAC, ALM, LM, and NLM. As shown in Figure 10, the subjective quality of 10 bits ALM was higher than that of 10 bits LM, which opposes the results in Table 4.
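The conversion modes compared above differ in how each floating-point sample is mapped to the integer range accepted by the video encoder. As an illustration only (the exact mapping parameters used in the paper are not given in this excerpt), an 8-bit non-linear mapping in the spirit of mu-law companding [27] and a 10-bit linear mapping could be sketched as:

```python
import math

MU = 255  # standard mu-law companding constant (as in G.711-style coding)

def mu_law_8bit(x):
    """Non-linear mapping (NLM): compand x in [-1, 1] and quantize to 8 bits."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255))

def linear_10bit(x):
    """Linear mapping (LM): scale x in [-1, 1] directly to a 10-bit code."""
    return int(round((x + 1.0) / 2.0 * 1023))

# Mu-law spends more of the code range on low-amplitude samples,
# which dominate typical audio signals.
print(mu_law_8bit(0.1), linear_10bit(0.1))
```

The design trade-off visible in Table 4 follows from this: the non-linear 8-bit mapping loses resolution that the later quantization cannot recover, while the 10-bit linear mappings preserve amplitude fidelity at the cost of a larger input range.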

Electronics 2021, 10, 1094 10 of 12

Figure 10. Scores of the MUSHRA listening test with a 95% confidence interval for different methods.

Although the average score of ALM or LM was lower than that of USAC, it was necessary to further investigate the possibility of researching 2D audio encoding with respect to 2D packing and video codec optimization. Firstly, the coding performance is strongly dependent on how the 1D signal is converted to a 2D signal. Because we used simple 2D packing in this study, it was necessary to consider a more elaborate 2D packing method suitable for video coding tools. For example, if we generate the 2D signals to efficiently perform block-based inter prediction, the coding performance of the LD or RA configurations can be improved. Secondly, the current VVC has to transmit many unnecessary coding parameters due to the adoption of new tools, even though these tools are not used in the 2D audio encoding process. If a video encoder is optimized with core coding tools and a light syntax structure, the coding performance will be improved.
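The "simple 2D packing" used in this study is not fully specified in this excerpt; one plausible minimal form is raster-scan packing, in which consecutive samples fill the rows of the 2D input frame (function names are illustrative):

```python
def pack_1d_to_2d(samples, width):
    """Raster-scan packing: fill each row of the 2D frame left to right
    with consecutive samples, zero-padding the final row if needed."""
    height = -(-len(samples) // width)  # ceiling division
    padded = samples + [0] * (width * height - len(samples))
    return [padded[r * width:(r + 1) * width] for r in range(height)]

def unpack_2d_to_1d(frame, num_samples):
    """Inverse operation: concatenate the rows and drop the padding."""
    flat = [v for row in frame for v in row]
    return flat[:num_samples]

sig = list(range(10))
frame = pack_1d_to_2d(sig, 4)  # 3 rows of width 4; last row is padded
assert unpack_2d_to_1d(frame, len(sig)) == sig
```

A more elaborate packing, for example one that aligns periodic waveform structure across rows, would give the block-based intra and inter prediction tools more spatial redundancy to exploit, which is the direction suggested above.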
5. Conclusions

In this paper, we proposed a pre-processing method to generate a 2D audio signal as an input of a VVC encoder, and investigated its applicability to 2D audio compression using the video coding scheme. To evaluate the coding performance, we measured both signal-to-noise ratio (SNR) and bits per sample (bps). Additionally, we conducted a MUSHRA test to evaluate the subjective quality. The experimental results showed the possibility of researching 2D audio encoding using video coding schemes.

Author Contributions: Conceptualization, S.K. and D.J.; methodology, S.K. and D.J.; software, S.K.; validation, D.J., B.-G.K., S.B., M.L. and T.L.; formal analysis, S.K. and D.J.; investigation, S.K. and D.J.; resources, D.J.; data curation, S.B., M.L. and T.L.; writing—original draft preparation, S.K.; writing—review and editing, D.J. and B.-G.K.; visualization, S.K.; supervision, D.J. and B.-G.K.; project administration, D.J. and B.-G.K.; funding acquisition, S.B., M.L. and T.L. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the research of basic media content technologies, grant number 21ZH1200.

Acknowledgments: This work was supported by an Electronics and Telecommunications Research Institute (ETRI) grant funded by the Korean government (21ZH1200, the research of basic media content technologies).

Conflicts of Interest: The authors declare no conflict of interest.


References

1. Bross, B.; Chen, J.; Liu, S.; Wang, Y. Versatile Video Coding Editorial Refinements on Draft 10. Available online: https://jvet.hhi.fraunhofer.de/ (accessed on 10 October 2020).
2. Sullivan, G.J.; Ohm, J.R.; Han, W.J.; Wiegand, T. Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [CrossRef]
3. ISO/IEC. Information Technology—MPEG Audio Technologies—Part 3: Unified Speech and Audio Coding; International Standard 23003-3; ISO/IEC: Geneva, Switzerland, 2012.
4. Bossen, F.; Li, X.; Suehring, K. AHG Report: Test Model Software Development (AHG3). In Proceedings of the 20th Meeting of the Joint Video Experts Team (JVET), Document JVET-T0003, Teleconference (Online), 7–16 October 2020.
5. VVC Reference Software (VTM). Available online: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM (accessed on 21 March 2021).
6. HEVC Reference Software (HM). Available online: https://vcgit.hhi.fraunhofer.de/jvet/HM (accessed on 21 March 2021).
7. Bossen, F.; Boyce, J.; Suehring, K.; Li, X.; Seregin, V. VTM Common Test Conditions and Software Reference Configurations for SDR Video. In Proceedings of the 20th Meeting of the Joint Video Experts Team (JVET), Document JVET-T2021, Teleconference (Online), 6–15 January 2021.
8. Auwera, G.; Seregin, V.; Said, A.; Ramasubramonian, A.; Karczewicz, M. CE3: Simplified PDPC (Test 2.4.1). In Proceedings of the 11th Meeting of the Joint Video Experts Team (JVET), Document JVET-K0063, Ljubljana, Slovenia, 10–18 July 2018.
9. Ma, X.; Yang, H.; Chen, J. CE3: Tests of Cross-Component Linear Model in BMS. In Proceedings of the 11th Meeting of the Joint Video Experts Team (JVET), Document JVET-K0190, Ljubljana, Slovenia, 10–18 July 2018.
10. Racapé, F.; Rath, G.; Urban, F.; Zhao, L.; Liu, S.; Zhao, X.; Li, X.; Filippov, A.; Rufitskiy, V.; Chen, J. CE3-Related: Wide-Angle Intra Prediction for Non-Square Blocks. In Proceedings of the 11th Meeting of the Joint Video Experts Team (JVET), Document JVET-K0500, Ljubljana, Slovenia, 10–18 July 2018.
11. Pfaff, J.; Stallenberger, B.; Schäfer, M.; Merkle, P.; Helle, P.; Hinz, T.; Schwarz, H.; Marpe, D.; Wiegand, T. CE3: Affine Linear Weighted Intra Prediction. In Proceedings of the 14th Meeting of the Joint Video Experts Team (JVET), Document JVET-N0217, Geneva, Switzerland, 19–27 March 2019.
12. Zhang, H.; Chen, H.; Ma, X.; Yang, H. Performance Analysis of Affine Inter Prediction in JEM 1.0. In Proceedings of the 2nd Meeting of the Joint Video Experts Team (JVET), Document JVET-B0037, San Diego, CA, USA, 20–26 February 2016.
13. Gao, H.; Esenlik, S.; Alshina, E.; Kotra, A.; Wang, B.; Liao, R.; Chen, J.; Ye, Y.; Luo, J.; Reuzé, K.; et al. Integrated Text for GEO. In Proceedings of the 17th Meeting of the Joint Video Experts Team (JVET), Document JVET-Q0806, Brussels, Belgium, 7–17 January 2020.
14. Jeong, S.; Park, M.; Piao, Y.; Park, M.; Choi, K. CE4 Ultimate Motion Vector Expression. In Proceedings of the 12th Meeting of the Joint Video Experts Team (JVET), Document JVET-L0054, Macao, China, 3–12 October 2018.
15. Sethuraman, S. CE9: Results of DMVR Related Tests CE9.2.1 and CE9.2.2. In Proceedings of the 13th Meeting of the Joint Video Experts Team (JVET), Document JVET-M0147, Marrakech, Morocco, 9–18 January 2019.
16. Chiang, M.; Hsu, C.; Huang, Y.; Lei, S. CE10.1.1: Multi-Hypothesis Prediction for Improving AMVP Mode, Skip or Merge Mode, and Intra Mode. In Proceedings of the 12th Meeting of the Joint Video Experts Team (JVET), Document JVET-L0100, Macao, China, 3–12 October 2018.
17. Ye, Y.; Chen, J.; Yang, M.; Bordes, P.; François, E.; Chujoh, T.; Ikai, T. AHG13: On Bi-Prediction with Weighted Averaging and Weighted Prediction. In Proceedings of the 13th Meeting of the Joint Video Experts Team (JVET), Document JVET-M0111, Marrakech, Morocco, 9–18 January 2019.
18. Zhang, Y.; Chen, C.; Huang, H.; Han, Y.; Chien, W.; Karczewicz, M. Adaptive Motion Vector Resolution Rounding Align. In Proceedings of the 12th Meeting of the Joint Video Experts Team (JVET), Document JVET-L0377, Macao, China, 3–12 October 2018.
19. Luo, J.; He, Y. CE4-Related: Simplified Symmetric MVD Based on CE4.4.3. In Proceedings of the 13th Meeting of the Joint Video Experts Team (JVET), Document JVET-M0444, Marrakech, Morocco, 9–18 January 2019.
20. Chen, J.; Ye, Y.; Kim, S. Algorithm Description for Versatile Video Coding and Test Model 11 (VTM 11). In Proceedings of the 20th Meeting of the Joint Video Experts Team (JVET), Document JVET-T2002, Teleconference (Online), 7–16 October 2020.
21. Chien, W.; Boyce, J.; Chen, W.; Chen, Y.; Chernyak, R.; Choi, K.; Hashimoto, R.; Huang, Y.; Jang, H.; Liao, R.; et al. JVET AHG Report: Tool Reporting Procedure (AHG13). In Proceedings of the 20th Meeting of the Joint Video Experts Team (JVET), Document JVET-T0013, Teleconference (Online), 7–16 October 2020.
22. Fan, Y.; Sun, H.; Katto, J.; Ming’E, J. A fast QTMT partition decision strategy for VVC intra prediction. IEEE Access 2020, 8, 107900–107911. [CrossRef]
23. Jin, Z.; An, P.; Yang, C.; Shen, L. Fast QTBT partition algorithm for intra frame coding through convolutional neural network. IEEE Access 2018, 6, 54660–54673. [CrossRef]
24. Yang, H.; Shen, L.; Dong, X.; Ding, Q.; An, P.; Jiang, G. Low-complexity CTU partition structure decision and fast intra mode decision for versatile video coding. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1668–1682. [CrossRef]
25. Fast Intra Mode Decision Algorithm for Versatile Video Coding. Available online: https://doi.org/10.1109/TMM.2021.3052348 (accessed on 5 May 2021).
26. Park, S.; Kang, J. Fast affine motion estimation for versatile video coding (VVC) encoding. IEEE Access 2019, 7, 158075–158084. [CrossRef]

27. ITU-T G.711: Pulse Code Modulation (PCM) of Voice Frequencies. Available online: http://handle.itu.int/11.1002/1000/911 (accessed on 21 March 2021).
28. International Telecommunication Union. Method for the Subjective Assessment of Intermediate Quality Level of Audio Systems (MUSHRA); Recommendation ITU-R BS.1534-1. Available online: https://www.itu.int/pub/R-REC/en (accessed on 1 January 2016).
29. Beack, S.; Seong, J.; Lee, M.; Lee, T. Single-Mode-Based Unified Speech and Audio Coding by Extending the Linear Prediction Domain Coding Mode. ETRI J. 2017, 39, 310–318. [CrossRef]