Caffe: Convolutional Architecture for Fast Feature Embedding∗

Yangqing Jia∗, Evan Shelhamer∗, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, Trevor Darrell
UC Berkeley EECS, Berkeley, CA 94702
{jiayq,shelhamer,jdonahue,sergeyk,jonlong,rbg,sguada,trevor}@eecs.berkeley.edu

ABSTRACT
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (≈ 2.5 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments.

Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.

Categories and Subject Descriptors
I.5.1 [Pattern Recognition]: [Applications–Computer vision]; D.2.2 [Software Engineering]: [Design Tools and Techniques–Software libraries]; I.5.1 [Pattern Recognition]: [Models–Neural Nets]

General Terms
Algorithms, Design, Experimentation

Keywords
Open Source, Computer Vision, Neural Networks, Parallel Computation

∗Corresponding Authors. The work was done while Yangqing Jia was a graduate student at Berkeley. He is currently a research scientist at Google, 1600 Amphitheater Pkwy, Mountain View, CA 94043.

arXiv:1408.5093v1 [cs.CV] 20 Jun 2014

1. INTRODUCTION
A key problem in multimedia data analysis is discovery of effective representations for sensory inputs: images, soundwaves, haptics, etc. While performance of conventional, handcrafted features has plateaued in recent years, new developments in deep compositional architectures have kept performance levels rising [8]. Deep models have outperformed hand-engineered feature representations in many domains, and made learning possible in domains where engineered features were lacking entirely.

We are particularly motivated by large-scale visual recognition, where a specific type of deep architecture has achieved a commanding lead on the state-of-the-art. These Convolutional Neural Networks, or CNNs, are discriminatively trained via back-propagation through layers of convolutional filters and other operations such as rectification and pooling. Following the early success of digit classification in the 90's, these models have recently surpassed all known methods for large-scale visual recognition, and have been adopted by industry heavyweights such as Google, Facebook, and Baidu for image understanding and search.

While deep neural networks have attracted enthusiastic interest within computer vision and beyond, replication of published results can involve months of work by a researcher or engineer. Sometimes researchers deem it worthwhile to release trained models along with the paper advertising their performance. But trained models alone are not sufficient for rapid research progress and emerging commercial applications, and few toolboxes offer truly off-the-shelf deployment of state-of-the-art models. Those that do are often not computationally efficient and thus unsuitable for commercial deployment.

To address such problems, we present Caffe, a fully open-source framework that affords clear access to deep architectures. The code is written in clean, efficient C++, with CUDA used for GPU computation, and nearly complete, well-supported bindings to Python/Numpy and MATLAB. Caffe adheres to software engineering best practices, providing unit tests for correctness and experimental rigor and speed for deployment. It is also well-suited for research use, due to the careful modularity of the code, and the clean separation of network definition (usually the novel part of research) from actual implementation.

In Caffe, multimedia scientists and practitioners have an orderly and extensible toolkit for state-of-the-art deep learning algorithms, with reference models provided out of the box. Fast CUDA code and GPU computation fit industry needs by achieving processing speeds of more than 40 million images per day on a single K40 or Titan GPU. The same models can be run in CPU or GPU mode on a variety of hardware: Caffe separates the representation from the actual implementation, and seamless switching between heterogeneous platforms furthers development and deployment. Caffe can even be run in the cloud.

While Caffe was first designed for vision, it has been adopted and improved by users in speech recognition, robotics, neuroscience, and astronomy. We hope to see this trend continue so that further sciences and industries can take advantage of deep learning.

Caffe is maintained and developed by the BVLC with the active efforts of several graduate students, and welcomes open-source contributions at http://github.com/BVLC/caffe. We thank all of our contributors for their work!

Framework            License      Core language  Binding(s)      Development
Caffe                BSD          C++            Python, MATLAB  distributed
cuda-convnet [7]     unspecified  C++            Python          discontinued
Decaf [2]            BSD          Python         -               discontinued
OverFeat [9]         unspecified  Lua            C++, Python     centralized
Theano/Pylearn2 [4]  BSD          Python         -               distributed
Torch7 [1]           BSD          Lua            -               distributed

[The table's remaining columns (CPU, GPU, Open source, Training, Pretrained models) hold per-framework check marks that are not reproduced here.]

Table 1: Comparison of popular deep learning frameworks. Core language is the main library language, while bindings have an officially supported library interface for feature extraction, training, etc. CPU indicates availability of host-only computation, no GPU usage (e.g., for cluster deployment); GPU indicates the GPU computation capability essential for training modern CNNs.

2. HIGHLIGHTS OF CAFFE
Caffe provides a complete toolkit for training, testing, finetuning, and deploying models, with well-documented examples for all of these tasks. As such, it's an ideal starting point for researchers and other developers looking to jump into state-of-the-art machine learning. At the same time, it's likely the fastest available implementation of these algorithms, making it immediately useful for industrial deployment.

Modularity. The software is designed from the beginning to be as modular as possible, allowing easy extension to new data formats, network layers, and loss functions. Many layers and loss functions are already implemented, and plentiful examples show how these are composed into trainable recognition systems for various tasks.

Separation of representation and implementation. Caffe model definitions are written as config files using the Protocol Buffer language. Caffe supports network architectures in the form of arbitrary directed acyclic graphs. Upon instantiation, Caffe reserves exactly as much memory as needed for the network, and abstracts from its underlying location in host or GPU. Switching between a CPU and GPU implementation is exactly one function call.

Test coverage. Every single module in Caffe has a test, and no new code is accepted into the project without corresponding tests. This allows rapid improvements and refactoring of the codebase, and imparts a welcome feeling of peacefulness to the researchers using the code.

Python and MATLAB bindings. For rapid prototyping and interfacing with existing research code, Caffe provides Python and MATLAB bindings. Both languages may be used to construct networks and classify inputs. The Python bindings also expose the solver module for easy prototyping of new training procedures.

Pre-trained reference models. Caffe provides (for academic and non-commercial use, not under the BSD license) reference models for visual tasks, including the landmark "AlexNet" ImageNet model [8] with variations and the R-CNN detection model [3]. More are scheduled for release. We are strong proponents of reproducible research: we hope that a common software substrate will foster quick progress in the search over network architectures and applications.

2.1 Comparison to related software
We summarize the landscape of convolutional neural network software used in recent publications in Table 1. While our list is incomplete, we have included the toolkits that are most notable to the best of our knowledge. Caffe differs from other contemporary CNN frameworks in two major ways:

(1) The implementation is completely C++ based, which eases integration into existing C++ systems and interfaces common in industry. The CPU mode removes the barrier of specialized hardware for deployment and experiments once a model is trained.

(2) Reference models are provided off-the-shelf for quick experimentation with state-of-the-art results, without the need for costly re-learning. By finetuning for related tasks, such as those explored by [2], these models provide a warm-start to new research and applications. Crucially, we publish not only the trained models but also the recipes and code to reproduce them.

3. ARCHITECTURE

3.1 Data Storage
Caffe stores and communicates data in 4-dimensional arrays called blobs.

Blobs provide a unified memory interface, holding batches of images (or other data), parameters, or parameter updates. Blobs conceal the computational and mental overhead of mixed CPU/GPU operation by synchronizing from the CPU host to the GPU device as needed. In practice, one loads data from the disk to a blob in CPU code, calls a CUDA kernel to do GPU computation, and ferries the blob off to the next layer, ignoring low-level details while maintaining a high level of performance. Memory on the host and device is allocated on demand (lazily) for efficient memory usage.
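The blob behaviour described in Section 3.1 (a 4-dimensional array whose host and device copies are allocated lazily and synchronized only when needed) can be illustrated with a small pure-Python sketch. This is not Caffe's implementation: the class, the `Head` states, the method names, and the `transfers` counter are assumptions made for illustration, and the "GPU" buffer is simulated with an ordinary list.

```python
from enum import Enum


class Head(Enum):
    """Which copy of the data is current (illustrative state names)."""
    UNINITIALIZED = 0
    AT_CPU = 1
    AT_GPU = 2
    SYNCED = 3


class Blob:
    """Toy 4-D array (num, channels, height, width) with lazy host/device sync."""

    def __init__(self, num, channels, height, width):
        self.shape = (num, channels, height, width)
        self.count = num * channels * height * width
        self._cpu = None      # host buffer, allocated on first use
        self._gpu = None      # stand-in for device memory (a plain list here)
        self._head = Head.UNINITIALIZED
        self.transfers = 0    # number of simulated host<->device copies

    def _to_cpu(self):
        if self._head is Head.UNINITIALIZED:
            self._cpu = [0.0] * self.count      # lazy allocation
            self._head = Head.AT_CPU
        elif self._head is Head.AT_GPU:
            self._cpu = list(self._gpu)         # simulated device-to-host copy
            self.transfers += 1
            self._head = Head.SYNCED

    def _to_gpu(self):
        if self._head is Head.UNINITIALIZED:
            self._gpu = [0.0] * self.count      # lazy allocation
            self._head = Head.AT_GPU
        elif self._head is Head.AT_CPU:
            self._gpu = list(self._cpu)         # simulated host-to-device copy
            self.transfers += 1
            self._head = Head.SYNCED

    def cpu_data(self):
        """Read-only snapshot of the host copy; syncs first if it is stale."""
        self._to_cpu()
        return tuple(self._cpu)

    def mutable_cpu_data(self):
        """Writable host buffer; marks the device copy as stale."""
        self._to_cpu()
        self._head = Head.AT_CPU
        return self._cpu

    def gpu_data(self):
        """Read-only snapshot of the device copy; syncs first if it is stale."""
        self._to_gpu()
        return tuple(self._gpu)

    def mutable_gpu_data(self):
        """Writable device buffer; marks the host copy as stale."""
        self._to_gpu()
        self._head = Head.AT_GPU
        return self._gpu

    def offset(self, n, c, h, w):
        """Flat index of element (n, c, h, w) in row-major order."""
        _, channels, height, width = self.shape
        return ((n * channels + c) * height + h) * width + w
```

In this sketch, writing through `mutable_cpu_data()`, then through `mutable_gpu_data()`, then reading `cpu_data()` twice triggers two simulated copies in total: one in each direction, with the second read served from the already synchronized host buffer.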
Models are saved to disk as Google Protocol Buffers¹, which have several important features: minimal-size binary strings when serialized, efficient serialization, a human-readable text format compatible with the binary version, and efficient interface implementations in multiple languages, most notably C++ and Python.

Large-scale data is stored in LevelDB² databases. In our test program, LevelDB and Protocol Buffers provide a throughput of 150MB/s on commodity machines with minimal CPU impact. Thanks to layer-wise design (discussed below) and code modularity, we have recently added support for other data sources, including some contributed by the open source community.

¹ https://code.google.com/p/protobuf/
² https://code.google.com/p/leveldb/

3.2 Layers
A Caffe layer is the essence of a neural network layer: it takes one or more blobs as input, and yields one or more blobs as output. Layers have two key responsibilities for the operation of the network as a whole: a forward pass that takes the inputs and produces the outputs, and a backward pass that takes the gradient with respect to the output, and computes the gradients with respect to the parameters and to the inputs, which are in turn back-propagated to earlier layers.

Caffe provides a complete set of layer types including: convolution, pooling, inner products, nonlinearities like rectified linear and logistic, local response normalization, element-wise operations, and losses like softmax and hinge. These are all the types needed for state-of-the-art visual tasks. Coding custom layers requires minimal effort due to the compositional construction of networks.

Figure 1: An MNIST digit classification example of a Caffe network, where blue boxes represent layers and yellow octagons represent data blobs produced by or fed into the layers.

3.3 Networks and Run Mode
Caffe does all the bookkeeping for any directed acyclic graph of layers, ensuring correctness of the forward and backward passes. Caffe models are end-to-end machine learning systems. A typical network begins with a data layer that loads from disk and ends with a loss layer that computes the objective for a task such as classification or reconstruction.

The network is run on CPU or GPU by setting a single switch. Layers come with corresponding CPU and GPU routines that produce identical results (with tests to prove it). The CPU/GPU switch is seamless and independent of the model definition.
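The forward and backward responsibilities of a layer (Section 3.2) and their composition into a network ending in a loss layer (Section 3.3) can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not Caffe's code: the class names are hypothetical, each layer handles a single input vector instead of a batched blob, the network is a simple chain (the simplest directed acyclic graph), and a squared-error loss stands in for the softmax and hinge losses named above.

```python
class InnerProduct:
    """Fully connected layer: y = W x + b for one input vector."""

    def __init__(self, weights, bias):
        self.W = weights          # list of rows, one row per output
        self.b = bias
        self.dW = None            # gradient w.r.t. the weights
        self.db = None            # gradient w.r.t. the bias

    def forward(self, x):
        self.x = x                # remember the input for the backward pass
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.W, self.b)]

    def backward(self, top_grad):
        # Gradients with respect to the parameters.
        self.dW = [[g * xi for xi in self.x] for g in top_grad]
        self.db = list(top_grad)
        # Gradient with respect to the input, passed to the earlier layer.
        return [sum(self.W[j][i] * top_grad[j] for j in range(len(top_grad)))
                for i in range(len(self.x))]


class ReLU:
    """Rectified linear nonlinearity: y = max(0, x), element-wise."""

    def forward(self, x):
        self.x = x
        return [max(0.0, v) for v in x]

    def backward(self, top_grad):
        return [g if v > 0 else 0.0 for g, v in zip(top_grad, self.x)]


class SquaredErrorLoss:
    """Loss layer: 0.5 * sum((prediction - target)^2)."""

    def forward(self, prediction, target):
        self.diff = [p - t for p, t in zip(prediction, target)]
        return 0.5 * sum(d * d for d in self.diff)

    def backward(self):
        return list(self.diff)


class Net:
    """Chain of layers followed by a loss layer."""

    def __init__(self, layers, loss):
        self.layers = layers
        self.loss = loss

    def forward(self, x, target):
        for layer in self.layers:
            x = layer.forward(x)
        return self.loss.forward(x, target)

    def backward(self):
        grad = self.loss.backward()
        for layer in reversed(self.layers):   # back-propagate to earlier layers
            grad = layer.backward(grad)
        return grad
```

A network built as `Net([InnerProduct(W, b), ReLU()], SquaredErrorLoss())` runs `forward(x, target)` to obtain the loss and then `backward()` to fill in `dW` and `db` on the inner-product layer. An update rule such as stochastic gradient descent would consume those gradients.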

3.4 Training A Network
Caffe trains models by the fast and standard stochastic gradient descent algorithm. Figure 1 shows a typical example of a Caffe network (for MNIST digit classification) during training: a data layer fetches the images and labels from disk, passes it through multiple layers such as convolution, pooling and rectified linear transforms, and feeds the final prediction into a classification loss layer that produces the loss and gradients which train the whole network. This example is found in the Caffe source code at examples/lenet/lenet_train.prototxt. Data are processed in mini-batches that pass through the network sequentially. Vital to training are learning rate decay schedules, momentum, and snapshots for stopping and resuming, all of which are implemented and documented.

Finetuning, the adaptation of an existing model to new architectures or data, is a standard method in Caffe. From a snapshot of an existing network and a model definition for the new network, Caffe finetunes the old model weights for the new task and initializes new weights as needed. This capability is essential for tasks such as knowledge transfer [2], object detection [3], and object retrieval [5].

4. APPLICATIONS AND EXAMPLES
In its first six months since public release, Caffe has already been used in a large number of research projects at UC Berkeley and other universities, achieving state-of-the-art performance on a number of tasks. Members of Berkeley EECS have also collaborated with several industry partners such as Facebook [11] and Adobe [6], using Caffe or its direct precursor (Decaf) to obtain state-of-the-art results.

Object Classification. Caffe has an online demo³ showing state-of-the-art object classification on images provided by the users, including via mobile phone. The demo takes the image and tries to categorize it into one of the 1,000 ImageNet categories⁴. A typical classification result is shown in Figure 2.

Furthermore, we have successfully trained a model with all 10,000 categories of the full ImageNet dataset by finetuning this network. The resulting network has been applied to open vocabulary object retrieval [5].

Figure 2: An example of the Caffe object classification demo. Try it out yourself online!

³ http://demo.caffe.berkeleyvision.org/
⁴ http://www.image-net.org/challenges/LSVRC/2013/

[Figure 5 diagram, titled "R-CNN: Regions with CNN features": 1. Input image; 2. Extract region proposals (~2k); 3. Compute CNN features on each warped region; 4. Classify regions (aeroplane? no. person? yes. tvmonitor? no.).]

Figure 5: The R-CNN pipeline that uses Caffe for object detection.
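The four stages named in Figure 5 (input image, region proposals, CNN features on each warped region, per-class classification) can be laid out as a toy pipeline. Everything below is a hypothetical stand-in chosen so the example runs with the standard library alone: sliding windows replace a real proposal method such as Selective Search, flattening the patch replaces a CNN forward pass, and the per-class linear scorers use hand-picked weights. It shows only how the stages connect, not how R-CNN [3] is implemented.

```python
def propose_regions(width, height, size, stride):
    """Hypothetical proposal step: square sliding windows as (x0, y0, x1, y1)."""
    return [(x, y, x + size, y + size)
            for y in range(0, height - size + 1, stride)
            for x in range(0, width - size + 1, stride)]


def warp(image, box, out_size):
    """Crop `box` and resample it to out_size x out_size (nearest neighbour)."""
    x0, y0, x1, y1 = box
    return [[image[y0 + r * (y1 - y0) // out_size][x0 + c * (x1 - x0) // out_size]
             for c in range(out_size)]
            for r in range(out_size)]


def cnn_features(patch):
    """Placeholder for a CNN forward pass: simply flattens the warped patch."""
    return [float(v) for row in patch for v in row]


def classify(features, class_weights):
    """One linear scorer per class; a positive score is read as 'yes'."""
    return {name: sum(w * f for w, f in zip(weights, features))
            for name, weights in class_weights.items()}


def detect(image, class_weights, size, stride, out_size=2):
    """Run proposals -> warp -> features -> classification over one image."""
    height, width = len(image), len(image[0])
    detections = []
    for box in propose_regions(width, height, size, stride):
        features = cnn_features(warp(image, box, out_size))
        for name, score in classify(features, class_weights).items():
            if score > 0:
                detections.append((name, box, score))
    return detections
```

Each proposal is scored independently for every class, which mirrors the per-region "aeroplane? person? tvmonitor?" questions in the figure.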

Learning Semantic Features. In addition to end-to-end training, Caffe can also be used to extract semantic features from images using a pre-trained network. These features can be used "downstream" in other vision tasks with great success [2]. Figure 3 shows a two-dimensional embedding of all the ImageNet validation images, colored by the coarse category each one comes from. The clean separation testifies to a successful semantic embedding.

Figure 3: Features extracted from a deep network, visualized in a 2-dimensional space. Note the clear separation between categories, indicative of a successful embedding.

Intriguingly, this learned feature is useful for far more than object categories. For example, Karayev et al. have shown promising results finding images of different styles such as "Vintage" and "Romantic" using Caffe features (Figure 4) [6].

[Figure 4 panels: Ethereal, HDR, Melancholy, Minimal.]

Figure 4: Top three most-confident positive predictions on the Flickr Style dataset, using a Caffe-trained classifier.

Object Detection. Most notably, Caffe has enabled us to obtain by far the best performance on object detection, evaluated on the hardest academic datasets: the PASCAL VOC 2007-2012 and the ImageNet 2013 Detection challenge [3].

Girshick et al. [3] have combined Caffe with techniques such as Selective Search [10] to effectively perform simultaneous localization and recognition in natural images. Figure 5 shows a sketch of their approach.

Beginners' Guides. To help users get started with installing, using, and modifying Caffe, we have provided instructions and tutorials on the Caffe webpage. The tutorials range from small demos (MNIST digit recognition) to serious deployments (end-to-end learning on ImageNet).

Although these tutorials serve as effective documentation of the functionality of Caffe, the Caffe source code additionally provides detailed inline documentation on all modules. This documentation will be exposed in a standalone web interface in the near future.

5. AVAILABILITY
Source code is published BSD-licensed on GitHub.⁵ Project details, step-wise tutorials, and pre-trained models are on the homepage.⁶ Development is done in Linux and OS X, and users have reported Windows builds. A public Caffe Amazon EC2 instance is coming soon.

⁵ https://github.com/BVLC/caffe/
⁶ http://caffe.berkeleyvision.org/

6. ACKNOWLEDGEMENTS
We would like to thank NVIDIA for GPU donation, the BVLC sponsors (http://bvlc.eecs.berkeley.edu/), and our open source community.

7. REFERENCES
[1] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A MATLAB-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[2] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[4] I. Goodfellow, D. Warde-Farley, P. Lamblin, V. Dumoulin, M. Mirza, R. Pascanu, J. Bergstra, F. Bastien, and Y. Bengio. Pylearn2: a machine learning research library. arXiv preprint 1308.4214, 2013.
[5] S. Guadarrama, E. Rodner, K. Saenko, N. Zhang, R. Farrell, J. Donahue, and T. Darrell. Open-vocabulary object retrieval. In RSS, 2014.
[6] S. Karayev, M. Trentacoste, H. Han, A. Agarwala, T. Darrell, A. Hertzmann, and H. Winnemoeller. Recognizing image style. arXiv preprint 1311.3715, 2013.
[7] A. Krizhevsky. cuda-convnet. https://code.google.com/p/cuda-convnet/, 2012.
[8] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[9] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[10] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.
[11] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. Panda: Pose aligned networks for deep attribute modeling. In CVPR, 2014.