Real Time Video Neural Style Transfer on Mobile Devices

Kunster - AR Art Video Maker - Real time video neural style transfer on mobile devices Wojciech Dudzik, Damian Kosowski Netguru S.A., ul. Małe Garbary 9 61-756 Poznan,´ Poland Email: [email protected], [email protected] Abstract—Neural style transfer is a well-known branch of deep style transfer (NST) methods for videos on mobile phones, learning research, with many interesting works and two major especially the iPhone. In order to do this, we investigate drawbacks. Most of the works in the field are hard to use by non- the problem of temporal coherence among existing methods expert users and substantial hardware resources are required. In this work, we present a solution to both of these problems. and propose another approach as we found problems with We have applied neural style transfer to real-time video (over applying them to mobile devices. We also refined the neural 25 frames per second), which is capable of running on mobile network architecture with regard to its size; therefore, our main devices. We also investigate the works on achieving temporal contributions are as follow: coherence and present the idea of fine-tuning, already trained models, to achieve stable video. What is more, we also analyze • Real-time application of neural style transfer to videos the impact of the common deep neural network architecture on on mobile devices (iOS). the performance of mobile devices with regard to number of • Investigation into achieving temporal coherence with ex- layers and filters present. In the experiment section we present isting methods. the results of our work with respect to the iOS devices and • Analyzing the size of the models present in literature, and discuss the problems present in current Android devices as well as future possibilities. At the end we present the qualitative results of proposing a new smaller architecture for the task. stylization and quantitative results of performance tested on the First we shall review the current status of the NST field iPhone 11 Pro and iPhone 6s. The presented work is incorporated related to image and video style transfer (see II). Further in in Kunster - AR Art Video Maker application available in the (section III) we describe the training regime and proposed Apple’s App Store. neural network architecture. And finally, achieving temporal I. INTRODUCTION coherence will be presented in Section III-B. In Section IV, we will discuss the results obtained during our experiment and Painting as a form of art has accompanied us through the show performance on mobile devices. history, presenting all sorts of things, from the mighty kings portraits through, historic battles to ordinary daily activities. II. RELATED WORK It all changed with the invention of photography, and later, In this section, we will briefly review the selected methods digital photography. Nowadays, most of us carry a smart phone for NST related to our work, while for a more compre- equipped with an HD camera. In the past, re-drawing an image hensive review, we recommend [3]. The first method for in a particular style required a well-trained artist and lots of NST was proposed by Gatys et al. [1]. He demonstrated time (and money). exciting results that caught eyes in both academia and industry. This problem has been studied by both artists and computer That method opened many new possibilities and attracted science researchers for over two decades within the field of the attention of other researchers, i.e.: [4], [5], [6], [7], [8], non-photorealistic rendering (NPR). However, most of these whose work is based on Gatys original idea. One of the NPR stylization algorithms were designed for particular artis- best successory projects was proposed by Johnson et al. [4] tic styles and could not be easily extended to other styles. A with the feed-forward perceptual losses model. In his work, common limitation of these methods is that they only use low- Johnson used a pre-trained VGG [9] to compute the content arXiv:2005.03415v1 [cs.CV] 7 May 2020 level image features and often fail to capture image structures and style loss. This allowed real-time inference speed while effectively. The first to use convolution neural networks (CNN) maintaining good style quality. A natural way to extend this for that task was Gatys et al. [1], [2]. They proposed a neural image processing technique to videos is to perform a certain algorithm for automatic image style transfer, which refines image transformation frame by frame. However, this scheme a random noise to a stylized image iteratively constrained inevitably brings temporal inconsistencies and thus causes by a content loss and a style loss. This approach resulted severe flicker artifacts for the methods that consider single in multiple works that attempted to improve the original and image transformation. One of the methods that solved this addressed its major drawbacks, such as: the long time needed issue was proposed by Ruder et al. [10], which was specifically to obtain stylization or applying this method to videos. In our designed for video. Despite its usability for video, it requires a work, we studied the possibility of delivering these neural time-consuming computations (dense optical flow calculation), and may take several minutes to process a single frame. Due quantization options, and hardware usage. to this fact, it makes it not applicable for real-time usage. This motivated us to pursue possibilities for work involved To obtain a consistent and fast video style transfer method, in both areas of interest: NST and mobile applications of some real-time or near real-time models have recently been DNN. As a result, we propose improvements by combining developed. the methods from those fields and propose a real-time video Using a feed-forward network design, Huang et al. [11] pro- neural style transfer on mobile devices. posed a model similar to the Johnson’s [4] with an additional III. PROPOSED METHOD temporal loss. This model provides faster inference times since it neither estimates optical flows nor uses information We propose a reliable method for achieving real-time neural about the previous frame at the inference stage. Another, more style transfer on mobile devices. In our approach, we are recent, development published by Gao et al. [12] describes a primarily focused on iPhones. There are still some differences model that does not estimate optical flows but involves ground- between them and Android-based phones, which we will truth optical flows only in loss calculation in the training stage. address in Section IV-A. In this section, we will present our The use of the ground-truth optical flow allows us to obtain network architecture and training procedure. an occlusion mask. Masks are further used to mark pixels that A. Network architecture are untraceable between frames and should not be included in temporal loss calculations. Additionally, temporal loss is In Fig. 2, we present the architecture of our network, which considered not only on the output image but also at the feature is the architecture present in [12]. In this paper, the architecture level of DNN. Gao’s lightweight and feed-forward network is is composed of three main components: 1) the encoder, 2) the considered to be one of the fastest approach for video NST. decoder, 3) the VGG-16 network. The encoder part is respon- Still, applying methods mentioned above may be trouble- sible for obtaining feature maps while the decoder generates some due to limited capabilities of mobile devices. Even stylized images from these feature maps. The last part of the though modern smartphones are able to run many machine VGG network is used for perceptual loss calculation. During learning models, achieving real-time performance introduces the inference, only the encoder and decoder parts are used. A more strict requirements about the model design. There was detailed description of each layer of the network was presented several noticeable reports dealing with this issue, i.e. [13], in Tab. I, with a focus on the number of filters. [14], [15]. In these papers, authors are focusing on running We provide modifications to the original architecture, in- current methods such as Johnson [4], on mobile platforms. In cluding changes in the number of filters at each layer (showed order to meet the desired performance authors are encompass- in last column of Tab. I), and removing the TanH opertation ing the use of specialized ARM instructions - NEON or GPU at the last layer. Moreover, all of the kernel sizes were equal computation while all of them perform only image to image to 3 × 3 as opposed to [12] where first and last layer have style transfer. Other implementations include [16] and very kernel size of 9 × 9. We used reflection padding for each popular Prisma application. Both of them relay on processing of the convolutions. For the upsample layer, we used the images on server side (although Prisma added later option for nearest neighbors method. A visualization of the residual layer on device processing). As a consequence, both of them heavily architecture is presented in Fig. 1. The β is introduced in order depend on the internet connection and enable processing of to investigate the influence of the number of residual layers single images at the same time. on the final result of stylization. Since the pace of progression in the hardware capabilities TABLE I: Detailed layer by layer architecture of network used (CPU, GPU) of mobile devices is very fast, the computation power grows each year. This trend was shown clearly in Layer Type Filters in [12] Our filters 1 Input Ignatov et al[17], where the authors present a comprehensive 2 Conv + instnorm + Relu 48 α · 32 review of smartphones performance against popular machine 3 Conv + instnorm + Relu (stride 2) 96 α · 48 4 Conv + instnorm + Relu (stride 2) 192 α · 64 learning and deep learning technologies used nowadays.

Load more