Deep learning tools for the measurement of animal behavior in neuroscience

Mackenzie Weygandt Mathis1,* and Alexander Mathis1

1Harvard University, Cambridge, MA USA | *[email protected]

Recent advances in computer vision have made accurate, fast and robust measurement of animal behavior a reality. In the past years, powerful tools specifically designed to aid the measurement of behavior have come to fruition. Here we discuss how capturing the postures of animals - pose estimation - has been rapidly advancing with new deep learning methods. While challenges still remain, we envision that the fast-paced development of new deep learning tools will rapidly change the landscape of realizable real-world neuroscience.

Highlights:

1. Deep neural networks are shattering performance benchmarks in computer vision for various tasks.

2. Using modern deep learning approaches (DNNs) in the lab is a fruitful approach for robust, fast, and efficient measurement of animal behavior.

3. New DNN-based tools allow for customized tracking approaches, which opens new avenues for more flexible and ethologically relevant real-world neuroscience.

Introduction

Behavior is the most important output of the underlying neural computations in the brain. Behavior is complex, often multi-faceted, and highly context dependent, both in how conspecifics or other observers understand it and in how it is emitted. The study of animal behavior - ethology - has a rich history rooted in the understanding that behavior gives an observer a unique look into an animal's umwelt (1-3); what are the motivations, instincts, and needs of an animal? What survival value do they provide? In order to understand the brain, we need to measure behavior in all its beauty and depth, and distill it down into meaningful metrics. Observing and efficiently describing behavior is a core tenet of modern ethology, neuroscience, medicine, and technology.

In 1973 Tinbergen, Lorenz, and von Frisch were the first ethologists awarded the Nobel Prize in Physiology or Medicine, for their pioneering work on the patterns of individual and social group behavior (4). The award heralded a coming-of-age for behavior, and for how rigorously documenting behavior can influence how we study the brain (4). Manual methods are powerful, but they are also highly labor intensive and subject to the limits of our senses. Matching (and extending) the capabilities of biologists with technology is a highly non-trivial problem (5, 6), yet it harbors tremendous potential. How does one compress an animal's behavior over long time periods into meaningful metrics? How does one use behavioral quantification to build a better understanding of the brain and an animal's umwelt (1)?

In this review we discuss the advances, and challenges, in animal pose estimation and its impact on neuroscience. Pose estimation refers to methods for measuring posture, while posture denotes the geometrical configuration of body parts. While there are many ways to record behavior (7-9), videography is a non-invasive way to observe the posture of animals. Estimated poses across time can then, depending on the application, be transformed into kinematics, dynamics, and actions (3, 5-7, 10). Due to the low-dimensional nature of posture, these applications are computationally tractable.

A very brief history of pose estimation.

The postures and actions of animals have been documented as far back as cave paintings, illustrating the human desire to distill the essence of an animal for conveying information. As soon as it was possible to store data on a computer, researchers have built systems for automated analysis. Over time, these systems reflected all flavors of artificial intelligence, from rule-based via expert systems to machine learning (11, 12). Traditionally, posture was measured by placing markers on the subject (3), or markerlessly by using body models (i.e. cylinder-based models with edge features (13)). Other computer vision techniques, such as using texture or color to segment the person from the background to create silhouettes (14, 15), or using so-called hand-crafted features with decoders (11, 12, 16), were also popular before deep learning flourished.

The deep learning revolution for posture.

Pose estimation is a challenging problem, but it has been tremendously advanced in the last five years due to advances in deep learning. Deep neural networks (DNNs) are computational algorithms that consist of simple units, which are organized in layers and then serially stacked to form "deep networks". The connections between the units are trained on data and therefore learn to extract information from raw data in order to solve tasks. The current deep learning revolution started with achieving human-level accuracy for object recognition on the ImageNet challenge, a popular benchmark with many categories and millions of images (16, 17). A combination of large annotated data sets, sophisticated network architectures, and advances in hardware made this possible, and it quickly impacted many problems in computer vision (see reviews (12, 16, 18)).

Figure 1. 2D pose estimation, 3D pose estimation & dense representations of humans and other animals. a: Example 2D multi-human pose estimation from OpenPose (19). b: Example 3D human pose estimation from (20). c: Dense representations of humans with DensePose, adapted from Guler et al. (21). d: Animals have diverse bodies and experimenters are often interested in specific key points, making tailored networks attractive solutions. DNNs open a realm of possibilities: from mice to cuttlefish. e: 3D pose estimation requires multiple camera views, or 2D-to-3D "lifting". f: The new SMALST model, which fits full 3D models to images, from Zuffi et al. (22), applied to zebras.

2D and 3D (human) pose estimation.

In 2014 "DeepPose" was the first paper to apply deep learning to human 2D pose estimation (23), and immediately new networks were proposed that improved accuracy by introducing a translation invariant model (24) and convolutional networks plus geometric constraints (25, 26). In the few years since, numerous human pose estimation papers have appeared (approx. 4,000 on ), as well as new benchmarks with standardized datasets plus evaluation metrics, which allow better comparisons of "state-of-the-art" performance (27). This culture has driven rapid and remarkable increases in performance: from 44% of body parts correctly labeled to nearly 94%, with the top 15 networks being within a few percentage points of each other (an example top network is shown in Figure 1a) (19, 28-32). The history and many advances in 2D human pose estimation are comprehensively reviewed in (11, 33).

3D human pose estimation is a more challenging task, and 3D labeled data is more difficult to acquire. There have been massive improvements in networks; see review (34). Yet currently the highest accuracy is achieved by using multiple 2D views to reconstruct a 3D estimate (Figure 1b; (20, 35)), though other ways of "lifting" 2D into 3D are being actively explored (20, 36, 37).

Dense-representations of bodies.

Other video-based approaches for capturing the posture and soft tissue of humans (and other animals) also exist. Depth cameras such as the Microsoft Kinect have been used in humans (38, 39) and rodents (40, 41). Recently, dense-pose representations, i.e. 3D point clouds or meshes (Figure 1c), have become a popular and elegant way to capture the soft tissue and shape of bodies, which are highly important features for person identification, fashion (i.e. clothing sales), and medicine (21, 42, 43). However, state-of-the-art performance currently requires body-scanning of many subjects to make body models. Typically, large datasets are collected to enable the creation of robust algorithms for inference on diverse humans (for animals, scanning toy models has been fruitful (44)). Recently, outstanding improvements have been made in capturing the shapes of animals from images (22, 45, 46). However, there are no animal-specific toolboxes geared towards neuroscience applications, although we believe that this will change in the near future, as for many applications, i.e. obesity or pregnancy research, having the soft tissue measured will be highly important.
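Most of the 2D networks above share a common output format: a per-body-part score map ("heatmap") whose peaks are read out as keypoints (e.g. (26, 28, 29)). As a minimal illustration of this decoding step - a generic NumPy sketch, not the exact procedure of any particular toolbox (stride, offset, and refinement details vary):

```python
import numpy as np

def heatmap_to_keypoint(scoremap, stride=8.0, offset=0.5):
    """Convert one body part's score map to (x, y, confidence).

    scoremap: 2D array (H x W) of part probabilities at reduced
    resolution; `stride` maps grid cells back to image pixels.
    """
    iy, ix = np.unravel_index(np.argmax(scoremap), scoremap.shape)
    confidence = scoremap[iy, ix]
    # Map the coarse grid location back to image coordinates.
    x = (ix + offset) * stride
    y = (iy + offset) * stride
    return x, y, confidence

# Example: a dummy 64x64 score map with a single peak.
sm = np.zeros((64, 64))
sm[20, 30] = 0.9
print(heatmap_to_keypoint(sm))  # -> (244.0, 164.0, 0.9)
```

Toolboxes typically refine this coarse argmax readout with learned offsets or sub-pixel interpolation.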

Animal pose estimation

The remarkable performance of deep learning for human 2D & 3D pose estimation, plus dense representations, made this large body of work ripe for exploring its utility in neuroscience (Figure 1d-f). In the past two years, deep learning tools for laboratory experiments have arrived (Figure 2a-d).

Many of the properties of DNNs were extremely appealing: remarkable and robust performance, relatively fast inference due to GPU hardware, and efficient code due to modern packages like TensorFlow and PyTorch (reviewed in (47)). Furthermore, unlike for many previous algorithms, neither body models nor tedious manual tuning of parameters is required. Given the algorithms, the crucial ingredient for human pose estimation success was large-scale, well-annotated data sets of humans with the locations of the body parts.

Here, we identify 5 key areas that were important for making DNN-based pose estimation tools useful for neuroscience laboratories, and review the progress in the last two years:

1. Can DNNs be harnessed with small training datasets? Due to the nature of "small-scale" laboratory experiments, labeling 20,000 or more frames (the typical human benchmark dataset size) is not a feasible approach.

2. The end result must be as accurate as human manually-applied labels (i.e. the gold standard), and computationally tractable (fast).

3. The resulting networks should be robust to changes in experimental setups and, for long-term storage and re-analysis of video data, to video compression.

4. Animals move in 3D, thus having efficient solutions for 3D pose estimation would be highly valuable, especially in the context of studying motor learning and control.

5. Tracking multiple subjects and objects is important for many experiments studying social behaviors, as well as for animal-object interactions.

1. Small training sets for lab-sized experiments.

While the challenges discussed above for human pose estimation also apply to other animals, one important challenge for applying these methods to neuroscience was annotated data sets: could DNNs be harnessed for much smaller datasets, at sizes reasonable for typical labs? Thus, while it was clear that, given enough annotated frames, the same algorithms would be able to learn to track the body parts of any animal, there were feasibility concerns.

Human networks are typically trained on thousands of images, and nearly all the current state-of-the-art networks provide tailored solutions that utilize the skeleton structure during inference (19, 30). Thus, applying these tools to new datasets was not immediately straightforward, and to create animal-specific networks one would potentially need to curate large datasets of the animal(s) they wanted to track. Additionally, researchers would need tailored DNNs to track their subjects (plus the ability to track unique objects, such as the corners of a box, or an implanted fiber).

Thus, one of the most important challenges is creating tailored DNNs that are robust and generalize well with little training data. One potential solution for making networks for animal pose estimation that could generalize well, even with little data, was to use transfer learning - the ability to take a network that has been trained on one task and use it to perform another. The advantage is that these networks are pretrained on larger datasets (for different tasks where a lot of data is available, like ImageNet), and are therefore effectively imbued with good image representations.

This is indeed what "DeepLabCut," the first tool to leverage the advances in human pose estimation for application to animals, did (48). DeepLabCut was built on a subset of "DeeperCut" (29), which was an attractive option due to its use of ResNets, which are powerful for transfer learning (49, 50). Moreover, transfer learning reduces training times (49-51), and there is a significant performance gain over using randomly-initialized networks, especially for smaller datasets (50).

The major result from DeepLabCut was benchmarking on smaller datasets and finding that only a few hundred annotated images are enough to achieve excellent results for diverse pose estimation tasks like locomotion, reaching and trail-tracking in mice, egg-laying in flies, and hunting in cheetahs, due to transfer learning (Figure 2f,g,h) (50, 52, 53). "DeepBehavior," which utilized different DNN packages for various pose estimation problems, also illustrated the gain of transfer learning (51).
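To make the transfer-learning recipe concrete, here is a minimal, hypothetical PyTorch sketch of the general idea - an ImageNet-pretrained backbone with a new decoder head that predicts one score map per body part. This is an illustration only; DeepLabCut itself is TensorFlow-based and uses a different (deconvolutional ResNet) architecture (48):

```python
import torch
import torch.nn as nn
from torchvision import models

class PoseNet(nn.Module):
    """ImageNet-pretrained ResNet-50 backbone plus a small decoder
    that predicts one score map (heatmap) per body part."""

    def __init__(self, num_bodyparts):
        super().__init__()
        backbone = models.resnet50(pretrained=True)  # transfer learning
        # Keep everything up to the last residual block (drop avgpool/fc).
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        # Upsample the 2048-channel features into per-part heatmaps.
        self.head = nn.Sequential(
            nn.ConvTranspose2d(2048, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, num_bodyparts, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.head(self.features(x))

net = PoseNet(num_bodyparts=4)
heatmaps = net(torch.randn(1, 3, 256, 256))  # -> shape (1, 4, 32, 32)
```

Fine-tuning such a network end-to-end on a few hundred labeled frames, rather than training from random initialization, is the essence of the transfer-learning gain reported above (49, 50).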
2. Accuracy & speed.

To be useful, pose estimation tools need to be as good as human annotation of frames (or tracking markers, another proxy for a human-applied label). DeepLabCut was shown to reach human labeling accuracy (48), and can replace physical markers (58). Moreover, these tools need to be efficient (fast) for both offline and online analysis. Speed is often related to the depth of the network. Stacked-hourglass networks, which use iterative refinement (28, 59) and many fewer layers, are fast. Two toolboxes, "LEAP" (55) and "DeepPoseKit" (60), adopted variants of stacked-hourglass networks. LEAP allows the user to rapidly compute postures, and then perform unsupervised behavioral analysis (Figure 2d,e) (61). This is an attractive solution for real-time applications, but it is not quite as accurate. For various datasets, DeepPoseKit reports that it is about three times as accurate as LEAP, yet similar to DeepLabCut (60). They also report about twice-as-fast video processing compared to DeepLabCut and LEAP for batch processing (on small frame sizes).

Deeper networks are slower, but often have more generalization ability (49). DeepLabCut was designed for generalization and therefore utilized deeper networks (ResNets) that are inherently slower than stacked-hourglass networks, yet DeepLabCut can match human accuracy in labeling (Figure 2f) (48). The speed has been shown to be compatible with online-feedback applications (52, 62, 63). Other networks recently added to DeepLabCut (with a MobileNetV2 backbone) give slightly lower accuracy, with twice the speed (50). Overall, on GPU hardware all packages are fast, and reach speeds of several hundred frames per second in offline modes.
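Accuracy in these comparisons is usually summarized as the root-mean-square error (RMSE), in pixels, between predicted and human-annotated keypoints (cf. Figure 2f). A minimal sketch of this comparison (the arrays here are illustrative stand-ins):

```python
import numpy as np

def keypoint_rmse(predicted, ground_truth):
    """RMSE in pixels between two (num_frames, num_bodyparts, 2)
    arrays of (x, y) keypoint coordinates."""
    err = np.linalg.norm(predicted - ground_truth, axis=-1)  # per-keypoint error
    return np.sqrt(np.mean(err ** 2))

pred = np.random.rand(100, 4, 2) * 640        # hypothetical predictions
human = pred + np.random.randn(100, 4, 2)     # human labels ~1 px away
print(f"test RMSE: {keypoint_rmse(pred, human):.2f} px")
```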

Figure 2. DNNs applied to animal pose estimation. a: Knee tracking during cycling, adapted from (54). b: 3D limb estimates from (52). c: A Lichen Katydid tracked with DeepLabCut, courtesy of the authors. d: Fly with LEAP-annotated body parts. The circles indicate the fraction for which predicted positions of the particular body part are closer to the ground truth than the radii on test images (adapted from (55)). e: Density plot of t-SNE plus frequency-transformed freely moving fly body-part trajectories. Patches with higher probability indicate more common movements, like different types of grooming behaviors (adapted from (55)). f: DeepLabCut requires little data to match human performance. Average error (RMSE) for several splits of training and test data vs. number of training images, compared to the RMSE of a human scorer. Each split is denoted by a cross, the average by a dot. For 80% of the data, the algorithm achieves human-level accuracy on the test set. As expected, test RMSE increases for fewer training images. Around 100 diverse frames are enough to provide high tracking performance (<5-pixel accuracy; adapted from (48)). g: Networks that perform better on ImageNet perform better for predicting 22 body parts on horses, on both within-domain (similar data distribution as the training set, red) and out-of-domain data (novel horses, black). The faint lines are individual splits (adapted from (50)). h: Due to the convolutional network architecture, when trained on one mouse the network generalizes to detect body parts of three mice (adapted from (48)). i: 3D reaching kinematics of a rat (adapted from (56)). j: 3D pose estimation on a cheetah for 2 example poses from 6 cameras, as well as example 2D views (adapted from (53)). k: Pupil and pupil-size tracking (adapted from (57)).
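As an aside on panel e: such behavioral maps, following Berman et al. (61), are built by frequency-transforming postural time series and embedding them into two dimensions, where density peaks correspond to stereotyped behaviors. A rough, hypothetical sketch of that flavor of pipeline (the published method uses Morlet wavelets and careful normalization; the spectrogram here is a simplified stand-in):

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.manifold import TSNE

# poses: (num_frames, num_bodyparts * 2) array of tracked coordinates.
rng = np.random.default_rng(0)
poses = rng.standard_normal((5000, 16)).cumsum(axis=0)  # stand-in data

# Frequency-transform each coordinate over time (wavelets in (61)).
feats = []
for i in range(poses.shape[1]):
    f, t, Sxx = spectrogram(poses[:, i], fs=100.0, nperseg=64, noverlap=48)
    feats.append(np.log1p(Sxx).T)      # time x frequency features
X = np.concatenate(feats, axis=1)      # time x (parts * frequencies)

# Embed into 2D; density peaks correspond to stereotyped behaviors.
emb = TSNE(n_components=2, perplexity=30).fit_transform(X)
```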

3. Robustness.

Neuroscience experiments based on video recordings produce large quantities of data and are collected over extensive periods of time. Thus, analysis pipelines should be robust to a myriad of perturbations, such as changes in setups (backgrounds, light sources, cameras, etc.), in subject appearance (due to different animal strains), and from compression algorithms (which allow storage of perceptually good videos with little memory demands (64)).

How can robustness be increased within the DNN? Both transfer learning (discussed above) and data augmentation strategies are popular and rapidly evolving approaches to increase robustness in DNNs (see review (65)). Moreover, active learning approaches allow an experimenter to continuously build more robust and diverse datasets for large-scale projects, by expanding the training set with images where the network fails (48, 53, 55). So far, the toolboxes have been tested on data from the same distribution (i.e. by splitting frames from videos into test and training data), which is important for assessing performance (48, 55, 60), but does not directly test out-of-domain robustness.

Over the course of long-term experiments the background or even the animal strain can change, which means having robust networks would be highly advantageous. We recently tested the generalization ability of DeepLabCut with different network backbones for pose estimation. We find that pretraining on ImageNet strongly improves out-of-domain performance, and that better ImageNet-performing networks are more robust (Figure 2g) (50). There is still a gap to close in out-of-domain performance, however.

DeepLabCut is also robust to video compression, as compression by more than 1,000X only mildly affects accuracy (less than 1 pixel of added average error) (52). The International Brain Lab (IBL) independently and synergistically showed that, for tracking multiple body parts in a rodent decision-making task, DeepLabCut is robust to video compression (66). Thus, in practice users can substantially compress videos while retaining accurate posture information.
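In practice, such compression is easy to apply before analysis and archiving, for example by re-encoding with the widely used H.264 codec (64). A small sketch calling ffmpeg from Python (the file names and CRF setting are illustrative, not a recommendation from the studies above):

```python
import subprocess

def compress_video(src, dst, crf=23):
    """Re-encode a raw video with H.264; higher CRF = more compression.

    CRF ~17-28 is perceptually reasonable; pose estimation has been
    shown to tolerate far stronger compression than this (52, 66).
    """
    subprocess.run(
        ["ffmpeg", "-i", src, "-c:v", "libx264", "-crf", str(crf),
         "-preset", "medium", dst],
        check=True,
    )

compress_video("raw_behavior.avi", "behavior_compressed.mp4")
```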

4. 3D animal pose estimation.

Currently, there are several animal pose estimation toolboxes that explicitly support 2D and 3D key-point detection (53, 67-69). DeepLabCut uses 2D pose estimation to train a single camera-invariant 2D network (or multiple 2D networks) that is then used to perform traditional triangulation to extract 3D key points (Figure 2i, j; (53, 56)). A pipeline built on DeepLabCut called "Anipose" allows for 3D reconstruction from multiple cameras using a wider variety of methods (67). "DeepFly3D" (68) uses the network architecture from Newell et al. (28) and then adds elegant tools to compute an accurate 3D estimate of Drosophila melanogaster by using the fly itself vs. standard calibration boards. Zhang et al. use epipolar geometry to train across views and thereby improve 3D pose estimation for mice, dogs, and monkeys (69).
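The triangulation step that turns matched 2D detections into a 3D key point is classical multi-view geometry. A minimal direct linear transform (DLT) sketch, assuming calibrated 3x4 camera projection matrices; the toolboxes above (53, 67, 68) wrap camera calibration and more robust variants of this step:

```python
import numpy as np

def triangulate(proj_mats, points_2d):
    """DLT triangulation of one key point seen in >= 2 cameras.

    proj_mats: list of 3x4 projection matrices (one per camera).
    points_2d: list of (x, y) pixel detections in the same order.
    Returns the 3D point minimizing the algebraic reprojection error.
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # Solution: right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize
```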
5. Multi-animal & object tracking.

Many experiments in neuroscience require measuring interactions of multiple animals, or interactions with objects. Having the ability to flexibly track user-defined objects or multiple animals is therefore highly desirable. There are many pre-deep learning algorithms that allow tracking of objects (one modern example, called "Tracktor", also nicely summarizes this body of work (70)). Recently, researchers have also applied deep learning to this problem. For example, the impressive "idTracker:ai" (71) allows users to track a hundred individual, unmarked animals. Arac et al. used YOLO, a popular and fast object localization network, for tracking two mice during a social behavior (51). These, and others, can then be combined with pose estimation packages for estimating the pose of multiple animals. Currently, two paths are possible: one is to apply pose estimation algorithms after tracking individuals (for which any package could be used); the other is to extract multiple detections for each body part on each animal (Figure 2h; (48, 72)) and link them using part affinity fields (19), pairwise predictions (29), or geometrical constraints (72), plus combinatorics.

The impact on experimental neuroscience

In the short time these tools have been available, there has been rather wide adoption by the neuroscience and ethology communities. Beyond the original publications, DeepLabCut has already been used for pose estimation and behavioral analysis in many contexts: for instance, knee movement quantification during cycling (Figure 2a) (54), postural analysis during underwater running (73), social behavior in bats (74), benchmarking thermal constraints with optogenetics (75), fly leg movement analysis (76, 77), 3D rat reaching (Figure 2i) (56), hydra in a thermal stimulation assay (78), and pupillometry (Figure 2k) (57). Inanimate objects can also be tracked; it has indeed been used to track metal beads subjected to a high voltage (79), and magic tricks (i.e. coins and the magician) (80). LEAP (55) has been used to track ants (81) and mice (82).

Pose estimation is just the beginning; the next steps involve careful analysis of kinematics, building detailed, multi-scale ethograms of behaviors, new modeling techniques to understand large-scale brain activity and behaviors across a multitude of timescales, and beyond. We envision three branches where powerful feature tracking and extensions will be useful: motor control studies (often involving complex motor actions), naturalistic behaviors in the lab and in the wild, and better quantification of robust and seemingly simple "non-motor" tasks (Figure 3).

Many paradigms in neuroscience can be loosely arranged along three branches: natural (i.e. mouse parenting behavior), simple trial-based tasks (i.e. classical conditioning), and/or complex motor actions like skilled reaching (Figure 3). For example, one can have simple and natural tasks, such as licking for water, or complex and natural behaviors, such as escaping from a looming stimulus, which would rarely produce repeated trajectories. For simplicity, here we discuss how pose estimation can potentially enhance studies along those three branches, namely complex movements (Motor control & kinematics), natural behaviors (Natural behaviors & ethologically relevant features), and simple motor-output tasks (Revisiting classic tasks).

Motor control & kinematics.

In neuroscience-minded motor control studies, end-effector proxies (such as manipulandums or joysticks) are often used to measure the motor behavior of subjects or animals. There are relatively few marker-tracking-based movement neuroscience studies in which many degrees of freedom were measured alongside neural activity, with notable exceptions like (83, 84). The ease with which the kinematics of limbs and digits can now be quantified (48, 51, 56, 77), for example as sketched below, should greatly simplify such studies in the future. We expect many more highly detailed kinematic studies will emerge that utilize DNN-based analyses, especially for freely moving animals, for small and aquatic animals that cannot be marked, and for motor control studies that can leverage large-scale recordings and behavioral monitoring.
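Once body parts are tracked, kinematic variables follow directly from the pose time series; for example, a joint angle and its angular velocity can be computed from three tracked points. A minimal, hypothetical sketch (real analyses would add filtering and camera calibration):

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (radians) at joint b formed by points a-b-c.

    a, b, c: (num_frames, 2) arrays of tracked 2D coordinates."""
    u, v = a - b, c - b
    cos = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Stand-in trajectories, e.g. shoulder / elbow / wrist from a tracker.
rng = np.random.default_rng(1)
shoulder, elbow, wrist = (rng.random((1000, 2)) for _ in range(3))
theta = joint_angle(shoulder, elbow, wrist)
omega = np.gradient(theta) * 200.0  # angular velocity at 200 fps, rad/s
```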

Figure 3. The Behavioral Space in Neuroscience: new applications for deep learning-assisted analysis. This diagram depicts how pose estimation with non-invasive videography can benefit behavioral paradigms that span from "trial-based behaviors," such as classical conditioning, to "complex motor behaviors/tasks," as in 3D reaching assays, to "naturalistic tasks," often without any trial structure, that are more akin to the real-world 'tasks' animals undertake. For example, a looming stimulus is ethologically relevant and complex, and pup retrieval can be natural yet repeatable in a laboratory setting. With new tools that allow fast and accurate analysis of movement, these types of experiments become more feasible (with much less human labor than previously required).

Natural behaviors & ethologically relevant features.

There is a trend in motor neuroscience towards natural behaviors, i.e. less constrained tasks, everyday skills, and even "in the wild" studies (85). For instance, we used DeepLabCut for 3D pose estimation in hunting cheetahs captured via multiple GoPro cameras (Figure 2j; (53)). Another "in the wild" example is given by a recent study by Chambers et al. (86), who revisited the classic question of how people synchronize their walking, but with a modern twist: using videos from YouTube and analysis with OpenPose (19). Consistent with studies performed in the laboratory, they found a tendency for pairs of people to walk either in or exactly out of phase (86).

How else can DNNs help? Specialized body parts often play a key role in ethologically relevant behaviors. For instance, ants use their antennae to follow odor trails (87), while moles use their snouts for sampling bilateral nasal cues to localize odorants (88). To accurately measure such behaviors, highly accurate feature detectors of often tiny, highly dexterous body parts are needed. This is a situation where deep learning algorithms can excel. Pose estimation algorithms can not only be used to detect the complete "pose"; due to their flexibility, they are also extremely useful for tracking ethologically relevant body parts in challenging situations. Incidentally, DeepLabCut was created, in part, to accurately track the snouts of mice following odor trails that were printed onto a treadmill (48). There are of course many other specialized body parts that are hard to track, like whiskers, bee stingers, jellyfish tentacles, or octopus arms, and we believe that studying these beautiful systems in more natural and ethologically relevant environments has now gotten easier.

Revisiting classic tasks.

Measuring behavior is already impacting "classic" decision-making paradigms. For example, several groups could show broad movement encoding across the brain during decision-making tasks by carefully quantifying behavior (89, 90). Moreover, large-scale efforts to use these "simple" yet robust trial-based behaviors across labs and brain areas are leveraging deep learning, and comparing its utility to classical behavior-monitoring approaches. For example, the IBL has surmised that DeepLabCut could replace traditional methods used for eye, paw, and lick detection (66). We believe that detailed behavioral analysis will impact many paradigms that were historically not considered "motor" studies, as it is now much easier to measure movement.

Remaining challenges in pose estimation.

Advances in deep learning have changed how easily posture can be measured and have impacted many studies. However, pose estimation remains a hard computer vision problem, and challenges remain (16, 27, 33, 91-93). In the context of multi-animal/human pose estimation, dealing with highly crowded scenes, in which different individuals cover each other, remains highly challenging (27, 33, 91). In general, dealing with occlusions remains a challenge, and in some experiments occlusions are hard to avoid. Thus, networks that can constrain body part detection based on anatomical relationships can be advantageous, but they are computationally more complex and slower (29). As we highlighted in the robustness section, it is hard to train networks to generalize to out-of-domain scenarios (50, 92, 94). Even though very large data sets have been amassed to build robust DNNs (19, 30-32), they can still fail in sufficiently different scenarios (50, 92, 93). Making robust networks will be highly useful for creating shareable behavior- or animal-specific networks that can generalize across laboratories. There will also be much work towards even faster, lighter models.

Outlook & Conclusions

The field of 2D, 3D, and dense pose estimation will continue to evolve, for example with respect to handling occlusions and robustness to out-of-domain data. Perhaps larger and more balanced datasets will be created to better span the behavioral space, more temporal information will be utilized when training networks or analyzing data, and new algorithmic solutions will be found.

Will we always need training data? A hot topic in object recognition is training from very few examples (one-shot or zero-shot learning) (95). Can this be achieved in pose estimation? Perhaps as new architectures and training regimes come to fruition this could be possible. Alternatively, specialized networks could now be built that leverage large datasets of specific animals. It is hard to envision a universal "animal pose detector" network (for object recognition this is possible), as animals have highly diverse body plans and experimentalists often have extremely different needs. Currently many individual labs create their own specialized networks, but we plan to create shareable networks for specific animals (much like the specific networks, i.e. the hand network in OpenPose (96), or the human network in DeepLabCut (29, 48)). For example, many open-field experiments could benefit from robust and easy-to-use DNNs for video analysis across similar body points of the mouse. Indeed, efforts are underway to create networks where one can simply analyze their data without training, and we hope the community will join these efforts. Nothing improves DNNs more than more training data. These efforts, together with making code open source, will contribute to the reproducibility of science and make these tools broadly accessible.

In summary, we aimed to review the progress in computer vision for human pose estimation, how it influenced animal pose estimation, and how neuroscience laboratories can leverage these tools for better quantification of behavior. Taken together, the tremendous advance of computer vision has provided tools that are practical for use in the laboratory, and they will only get better. They can be as accurate as human labeling (or marker-based tracking), and are fast enough for closed-loop experiments, which is key for understanding the link between neural systems and behavior. We expect that, in light of shared, easy-to-use tools and additional deep learning advances, there will be thrilling and unforeseen advances in real-world neuroscience.
cal relationships can be advantageous, but are computation- Acknowledgments: ally more complex and slower (29). As we highlighted in the robustness section, it is hard to train networks to We thank Eldar Insafutdinov, Alex Gomez-Marin, the M. generalize to out-of-domain scenarios (50, 92, 94). Even Mathis Lab, and the anonymous reviewers for comments on though very large data sets have been amassed to build robust the manuscript, and Julia Kuhl for illustrations. Funding DNNs (19, 30–32), they can still fail in sufficiently different was provided by the Rowland Institute at Harvard Univer- scenarios (50, 92, 93). Making robust networks will highly sity (M.W.M.) and NIH U19MH114823 (A.M). The authors useful for creating shareable behavior- or animal-specific net- declare no conflicts of interest. works that can generalize across laboratories. There will also be much work towards even faster, lighter models. References 1. J. Baron Uexküll. Streifzüge durch die Umwelten von Outlook & Conclusions Tieren und Menschen Ein Bilderbuch unsichtbarer Welten. Springer, Berlin, Heidelberg, 1956. The field of 2D, 3D, and dense pose estimation will continue 2. Nikolaas Tinbergen. On aims and methods of Ethology. to evolve. For example, with respect to handling occlusions Zeitschrift für Tierpsychologie, 20(4):410–433, 1963. and robustness to out-of-domain data. Perhaps larger and 3. Nikolai A. Bernstein. The co-ordination and regulation of more balanced datasets will be created to better span the movements, volume 1. Oxford, New York, Pergamon Press, behavioral space, more temporal information will be utilized 1967. 4. Nikolaas Tinbergen. Nobel lecture. the nobel prize in physi- when training networks or analyzing data, and new algorith- ology or medicine 1973, 1973. mic solutions will be found. 5. Andreas T Schaefer and Adam Claridge-Chang. The surveillance state of behavioral automation. In Current Will we always need training data? A hot topic in object Opinion in Neurobiology, 2012. recognition is training from very few examples (one-shot or 6. David J Anderson and Pietro Perona. Toward a science of computational ethology. Neuron, 84(1):18–31, 2014. zero-shot learning) (95). Can this be achieved in pose esti- 7. Anthony I Dell, John A Bender, Kristin Branson, Iain D mation? Perhaps as new architectures and training regimes Couzin, Gonzalo G de Polavieja, Lucas PJJ Noldus, Alfonso come to fruition this could be possible. Alternatively, Pérez-Escudero, Pietro Perona, Andrew D Straw, Martin

8. Roian Egnor and Kristin Branson. Computational analysis of behavior. Annual Review of Neuroscience, 39:217–236, 2016.
9. Valentina Camomilla, Elena Bergamini, Silvia Fantozzi, and Giuseppe Vannozzi. Trends supporting the in-field use of wearable inertial sensors for sport performance evaluation: A systematic review. Sensors, 18(3):873, 2018.
10. Alex Gomez-Marin, Joseph J Paton, Adam R Kampff, Rui M Costa, and Zachary F Mainen. Big behavioral data: psychology, ethology and the foundations of neuroscience. Nature Neuroscience, 17(11):1455, 2014.
11. Ronald Poppe. Vision-based human motion analysis: An overview. Computer Vision and Image Understanding, 108(1):4–18, 2007. doi: 10.1016/j.cviu.2006.10.016. Special Issue on Vision for Human-Computer Interaction.
12. Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen AWM Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.
13. David Hogg. Model-based vision: a program to see a walking person. Image and Vision Computing, 1(1):5–20, 1983. doi: 10.1016/0262-8856(83)90003-3.
14. Christopher Richard Wren, Ali Azarbayejani, Trevor Darrell, and Alex Paul Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785, 1997.
15. Daniel Cremers, Mikael Rousson, and Rachid Deriche. A review of statistical approaches to level set segmentation: integrating color, texture, motion and shape. International Journal of Computer Vision, 72(2):195–215, 2007.
16. Thomas Serre. Deep learning: the good, the bad, and the ugly. Annual Review of Vision Science, 5:399–426, 2019. doi: 10.1146/annurev-vision-091718-014951.
17. Md Zahangir Alom, Tarek M Taha, Christopher Yakopcic, Stefan Westberg, Paheding Sidike, Mst Shamima Nasrin, Brian C Van Esesn, Abdul A S Awwal, and Vijayan K Asari. The history began from AlexNet: a comprehensive survey on deep learning approaches. arXiv preprint arXiv:1803.01164, 2018.
18. Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015. doi: 10.1016/j.neunet.2014.09.003.
19. Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
20. Dushyant Mehta, Helge Rhodin, Dan Casas, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation using transfer learning and improved CNN supervision. CoRR, abs/1611.09813, 2016.
21. Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7297–7306, 2018.
22. Silvia Zuffi, Angjoo Kanazawa, Tanja Berger-Wolf, and Michael Black. Three-D Safari: Learning to estimate zebra pose, shape, and texture from images "in the wild". In ICCV. IEEE Computer Society, 2019.
23. Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
24. Arjun Jain, Jonathan Tompson, Yann LeCun, and Christoph Bregler. MoDeep: A deep learning framework using motion features for human pose estimation. CoRR, abs/1409.7963, 2014.
25. Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems 27, pages 1799–1807. Curran Associates, Inc., 2014.
26. Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
27. Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, and Bernt Schiele. PoseTrack: A benchmark for human pose estimation and tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5167–5176, 2018.
28. Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
29. Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pages 34–50. Springer, 2016.
30. Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu Tang, Evgeny Levinkov, Bjoern Andres, and Bernt Schiele. ArtTrack: Articulated multi-person tracking in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
31. Sven Kreiss, Lorenzo Bertoni, and Alexandre Alahi. PifPaf: Composite fields for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11977–11986, 2019.
32. Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212, 2019.
33. Q. Dang, J. Yin, B. Wang, and W. Zheng. Deep learning based 2d human pose estimation: A survey. Tsinghua Science and Technology, 24(6):663–676, Dec 2019. doi: 10.26599/TST.2018.9010100.
34. Nikolaos Sarafianos, Bogdan Boteanu, Bogdan Ionescu, and Ioannis A Kakadiaris. 3d human pose estimation: A review of the literature and analysis of covariates. Computer Vision and Image Understanding, 152:1–20, 2016.
35. Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2640–2649, 2017.
36. Denis Tomè, Chris Russell, and Lourdes Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. CoRR, abs/1701.00295, 2017.
37. Ching-Hang Chen and Deva Ramanan. 3d human pose estimation = 2d pose estimation + matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7035–7043, 2017.
38. Jamie Shotton, Ross B. Girshick, Andrew W. Fitzgibbon, Toby Sharp, Mat Cook, Mark Finocchio, Richard Moore, Pushmeet Kohli, Antonio Criminisi, Alex Kipman, and Andrew Blake. Efficient human pose estimation from single depth images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:2821–2840, 2012.
39. Š. Obdržálek, G. Kurillo, F. Ofli, R. Bajcsy, E. Seto, H. Jimison, and M. Pavel. Accuracy and robustness of Kinect pose estimation in the context of coaching of elderly population. In 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 1188–1193, Aug 2012. doi: 10.1109/EMBC.2012.6346149.

40. Alexander B Wiltschko, Matthew J Johnson, Giuliano Iurilli, Ralph E Peterson, Jesse M Katon, Stan L Pashkovski, Victoria E Abraira, Ryan P Adams, and Sandeep Robert Datta. Mapping sub-second structure in mouse behavior. Neuron, 88(6):1121–1135, 2015.
41. Weizhe Hong, Ann Kennedy, Xavier P Burgos-Artizzu, Moriel Zelikowsky, Santiago G Navonne, Pietro Perona, and David J Anderson. Automated measurement of mouse social behaviors using depth sensing, video tracking, and machine learning. Proceedings of the National Academy of Sciences, 112(38):E5351–E5360, 2015.
42. Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
43. Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Computer Vision and Pattern Recognition (CVPR), 2018.
44. Silvia Zuffi, Angjoo Kanazawa, David Jacobs, and Michael Black. 3D menagerie: Modeling the 3d shape and pose of animals. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
45. Benjamin Biggs, Thomas Roddick, Andrew Fitzgibbon, and Roberto Cipolla. Creatures great and SMAL: Recovering the shape and motion of animals from video. In Asian Conference on Computer Vision, pages 3–19. Springer, 2018.
46. Silvia Zuffi, Angjoo Kanazawa, and Michael J. Black. Lions and tigers and bears: Capturing non-rigid, 3d, articulated shape from images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2018.
47. Giang Nguyen, Stefan Dlugolinsky, Martin Bobák, Viet Tran, Álvaro López García, Ignacio Heredia, Peter Malík, and Ladislav Hluchý. Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artificial Intelligence Review, 52(1):77–124, Jun 2019. doi: 10.1007/s10462-018-09679-z.
48. Alexander Mathis, Pranav Mamidanna, Kevin M Cury, Taiga Abe, Venkatesh N Murthy, Mackenzie Weygandt Mathis, and Matthias Bethge. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience, 21:1281–1289, 2018.
49. Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better ImageNet models transfer better? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2661–2671, 2019.
50. Alexander Mathis, Mert Yüksekgönül, Byron Rogers, Matthias Bethge, and Mackenzie W Mathis. Pretraining boosts out-of-domain robustness for pose estimation. arXiv preprint arXiv:1909.11229, 2019.
51. Ahmet Arac, Pingping Zhao, Bruce H Dobkin, S Thomas Carmichael, and Peyman Golshani. DeepBehavior: A deep learning toolbox for automated analysis of animal and human behavior imaging data. Frontiers in Systems Neuroscience, 13:20, 2019.
52. Alexander Mathis and Richard A. Warren. On the inference speed and video-compression robustness of DeepLabCut. bioRxiv, 2018. doi: 10.1101/457242.
53. Tanmay Nath, Alexander Mathis, An Chi Chen, Amir Patel, Matthias Bethge, and Mackenzie W Mathis. Using DeepLabCut for 3d markerless pose estimation across species and behaviors. Nature Protocols, 14:2152–2176, 2019.
54. Oral Kaplan, Goshiro Yamamoto, Takafumi Taketomi, Alexander Plopski, and Hirokazu Kato. Video-based visualization of knee movement in cycling for quantitative and qualitative monitoring. In 2019 12th Asia Pacific Workshop on Mixed and Augmented Reality (APMAR), pages 1–5. IEEE, 2019.
55. Talmo D Pereira, Diego E Aldarondo, Lindsay Willmore, Mikhail Kislin, Samuel S-H Wang, Mala Murthy, and Joshua W Shaevitz. Fast animal pose estimation using deep neural networks. Nature Methods, 16(1):117, 2019.
56. Alexandra Bova, Krista Kernodle, Kaitlyn Mulligan, and Daniel Leventhal. Automated rat single-pellet reaching with 3-dimensional reconstruction of paw and digit trajectories. Journal of Visualized Experiments: JoVE, 2019.
57. Balaji Sriram, Lillian Li, Alberto Cruz-Martín, and Anirvan Ghosh. A sparse probabilistic code underlies the limits of behavioral discrimination. Cerebral Cortex, 2019.
58. Dalton D. Moore, Jeff D. Walker, Jason N. MacLean, and Nicholas G. Hatsopoulos. Comparing pose estimation performance using DeepLabCut and marker-based tracking. Society for Neuroscience, 2019.
59. Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
60. Jacob M Graving, Daniel Chae, Hemal Naik, Liang Li, Benjamin Koger, Blair R Costelloe, and Iain D Couzin. DeepPoseKit, a software toolkit for fast and robust animal pose estimation using deep learning. eLife, 8:e47994, Oct 2019. doi: 10.7554/eLife.47994.
61. Gordon J. Berman, Daniel M. Choi, William Bialek, and Joshua W. Shaevitz. Mapping the stereotyped behaviour of freely moving fruit flies. Journal of The Royal Society Interface, 11(99), 2014. doi: 10.1098/rsif.2014.0672.
62. Brandon Forys, Dongsheng Xiao, Pankaj Gupta, Jamie D Boyd, and Timothy H Murphy. Real-time markerless video tracking of body parts in mice using deep neural networks. bioRxiv, page 482349, 2018.
63. Vilim Štih, Luigi Petrucco, Andreas M Kist, and Ruben Portugues. Stytra: An open-source, integrated system for stimulation, tracking and closed-loop behavioral experiments. PLoS Computational Biology, 15(4):e1006699, 2019.
64. Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.
65. Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, Jul 2019. doi: 10.1186/s40537-019-0197-0.
66. Guido Meijer, Michael Schartner, Gaëlle Chapuis, and Karel Svoboda. Video hardware and software for the International Brain Laboratory. International Brain Laboratory, 2019.
67. Pierre Karashchuk. lambdaloop/anipose: v0.5.0, August 2019.
68. Semih Günel, Helge Rhodin, Daniel Morales, João H Campagnolo, Pavan Ramdya, and Pascal Fua. DeepFly3D, a deep learning-based approach for 3d limb and appendage tracking in tethered, adult Drosophila. eLife, 8:e48571, Oct 2019. doi: 10.7554/eLife.48571.
69. Yilun Zhang and Hyun Soo Park. Multiview supervision by registration. arXiv preprint arXiv:1811.11251v2, 2019.
70. Vivek Hari Sridhar, Dominique G. Roche, and Simon Gingins. Tracktor: Image-based automated tracking of animal movement and behaviour. Methods in Ecology and Evolution, 10(6):815–820, 2019. doi: 10.1111/2041-210X.13166.
71. Francisco Romero-Ferrero, Mattia G Bergomi, Robert C Hinz, Francisco JH Heras, and Gonzalo G de Polavieja. idtracker.ai: tracking all individuals in small or large collectives of unmarked animals. Nature Methods, 16(2):179, 2019.

72. Zheheng Jiang, Zhihua Liu, Long Chen, Lei Tong, Xiangrong Zhang, Xiangyuan Lan, Danny Crookes, Ming-Hsuan Yang, and Huiyu Zhou. Detection and tracking of multiple mice using part proposal networks. arXiv preprint arXiv:1906.02831, 2019.
73. Neil J Cronin, Timo Rantalainen, Juha P Ahtiainen, Esa Hynynen, and Ben Waller. Markerless 2d kinematic analysis of underwater running: A deep learning approach. Journal of Biomechanics, 87:75–82, 2019.
74. Wujie Zhang and Michael M Yartsev. Correlated neural activity across the brains of socially interacting bats. Cell, 178, 2019.
75. Scott F Owen, Max H Liu, and Anatol C Kreitzer. Thermal constraints on in vivo optogenetic manipulations. Nature Neuroscience, 22, 2019.
76. Salil S Bidaye, Meghan Laturney, Amy K Chang, Yuejiang Liu, Till Bockemuhl, Ansgar Buschges, and Kristin Scott. Two brain pathways initiate distinct forward walking programs in Drosophila. bioRxiv, page 798439, 2019.
77. Anthony W Azevedo, Pralaksha Gurung, Lalanti Venkatasubramanian, Richard Mann, and John C Tuthill. A size principle for leg motor control in Drosophila. bioRxiv, page 730218, 2019.
78. Constantine Nicholas Tzouanas, Soonyoung Kim, Krishna N Badhiwala, Benjamin W Avants, and Jacob Robinson. Thermal stimulation temperature is encoded as a firing rate in a Hydra nerve ring. bioRxiv, page 787648, 2019.
79. Benjamin De Bari, James A Dixon, Bruce A Kay, and Dilip Kondepudi. Oscillatory dynamics of an electrically driven dissipative structure. PLoS ONE, 14(5):e0217305, 2019.
80. Regina Zaghi-Lara, Miguel Ángel Gea, Jordi Camí, Luis M Martínez, and Alex Gomez-Marin. Playing magic tricks to deep neural networks untangles human deception. arXiv preprint arXiv:1908.07446, 2019.
81. Glenna Clifton, David Holway, and Nick Gravish. Rough substrates constrain walking speed in ants through modulation of stride frequency and not stride length. bioRxiv, page 731380, 2019.
82. Yuanlei Yue, Pan Xu, Zhichao Liu, Zekai Chen, Juntao Su, Ai Chen, Ryan T. Ash, Emre Barut, Rahul Simha, Stelios Smirnakis, Chen Zeng, and Hui Lu. Mecp2 deletion impaired layer 2/3-dominant dynamic reorganization of cortical circuit during motor skill learning. bioRxiv, 2019. doi: 10.1101/786822.
83. Carlos E Vargas-Irwin, Gregory Shakhnarovich, Payman Yadollahpour, John MK Mislow, Michael J Black, and John P Donoghue. Decoding complete reach and grasp actions from local primary motor cortex populations. Journal of Neuroscience, 30(29):9659–9669, 2010.
84. Stefan Schaffelhofer, Andres Agudelo-Toro, and Hansjörg Scherberger. Decoding a wide range of hand configurations from macaque motor, premotor, and parietal cortices. Journal of Neuroscience, 35(3):1068–1081, 2015.
85. Alexander Mathis, Andrea R Pack, Rodrigo S Maeda, and Samuel David McDougle. Highlights from the 29th annual meeting of the Society for the Neural Control of Movement. Journal of Neurophysiology, 122:1777–1783, 2019. doi: 10.1152/jn.00484.2019.
86. Claire Chambers, Gaiqing Kong, Kunlin Wei, and Konrad Kording. Pose estimates from online videos show that side-by-side walkers synchronize movement under naturalistic conditions. PLoS ONE, 14(6):e0217861, 2019.
87. Ryan W Draft, Matthew R McGill, Vikrant Kapoor, and Venkatesh N Murthy. Carpenter ants use diverse antennae sampling strategies to track odor trails. Journal of Experimental Biology, 221(22):jeb185124, 2018.
88. Kenneth C Catania. Stereo and serial sniffing guide navigation to an odour source in a mammal. Nature Communications, 4:1441, 2013.
89. Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Charu Bai Reddy, Matteo Carandini, and Kenneth D. Harris. Spontaneous behaviors drive multidimensional, brain-wide activity. Science, 364(6437), 2019. doi: 10.1126/science.aav7893.
90. Simon Musall, Matthew T. Kaufman, Ashley L. Juavinett, Steven Gluf, and Anne K. Churchland. Single-trial neural dynamics are dominated by richly varied movements. Nature Neuroscience, 22, 2019. doi: 10.1038/s41593-019-0502-4.
91. Thomas Golda, Tobias Kalb, Arne Schumann, and Jürgen Beyerer. Human pose estimation for real-world crowded scenarios. arXiv preprint arXiv:1907.06922, 2019.
92. Helge Rhodin, Jörg Spörri, Isinsu Katircioglu, Victor Constantin, Frédéric Meyer, Erich Müller, Mathieu Salzmann, and Pascal Fua. Learning monocular 3d human pose estimation from multi-view images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8437–8446, 2018.
93. Nidhi Seethapathi, Shaofei Wang, Rachit Saluja, Gunnar Blohm, and Konrad P Kording. Movement science needs different pose tracking algorithms. arXiv preprint arXiv:1907.10226, 2019.
94. Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484, 2019.
95. Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning — a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2251–2265, Sep 2019. doi: 10.1109/TPAMI.2018.2857768.
96. Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Highlighted References:

[1*] A survey on deep learning in medical image analysis (12). Comprehensive review of all deep learning algorithms used in medical image analysis as of 2017, as well as a discussion of the most successful approaches, together with [2**], a fantastic introduction for newcomers to the field.

[2**] Deep learning: The good, the bad, and the ugly (16). Excellent review of deep learning progress, including a detailed description of recent successes as well as limitations of computer vision algorithms.

[3**] Realtime multi-person 2d pose estimation using part affinity fields (19). OpenPose was the first real-time multi-person system to jointly detect human body parts by using part affinity fields, a great way to link body part proposals across individuals. The toolbox is well maintained and now boasts body, hand, facial, and foot keypoints (in total 135 keypoints), as well as 3D support.

[4*] Densepose: Dense human pose estimation in the wild (21). Using a large dataset of humans (50K), they build dense correspondences between RGB images and human bodies. They apply this to humans "in the wild," and build tools for efficiently dealing with occlusions. It is highly accurate and runs up to 25 frames per second.

[5**] Three-D Safari: Learning to estimate zebra pose, shape, and texture from images "in the wild" (22). Zuffi et al. push dense pose estimation by using a new SMALST model for capturing zebra pose, soft-tissue shape, and even texture "in the wild." This is a difficult challenge, as zebras are designed to blend into the background in the safari. This paper makes significant improvements on accuracy and realism, and builds on a line of elegant work from these authors (44, 46).

[6**] DeeperCut: A deeper, stronger, and faster multi-person pose estimation model (29). DeeperCut is a highly accurate algorithm for multi-human pose estimation due to improved deep-learning-based body part detectors, and image-conditioned pairwise terms to predict the location of body parts based on the location of other body parts. These terms are then used to find accurate poses of individuals via graph cutting. In ArtTrack (30) the work was extended to fast multi-human pose estimation in videos.

[7*] Recovering the shape and motion of animals from video (45). The authors combine multiple methods in order to efficiently fit 3D shape to multiple quadrupeds, from camels to bears. They also provide a novel dataset of joint annotations and silhouette segmentations for eleven animal videos.

[8**] DeepLabCut: markerless pose estimation of user-defined body parts with deep learning (48). DeepLabCut was the first deep learning toolbox for animal pose estimation. The key advance was to benchmark a subset of the feature detectors in DeeperCut (29). This paper showed nearly human-level performance with only 50-200 images. It benchmarked flies moving in a 3D chamber, hand articulations and open-field behavior in mice, and provided open-source tools for creating new datasets and data loaders to train deep neural networks, and for post-hoc analysis. Subsequent work has improved accuracy and speed, and introduced more network variants into the Python package (50, 52, 53).

[9*] DeepBehavior: A deep learning toolbox for automated analysis of animal and human behavior imaging data (51). Arac et al. use three different DNN packages (OpenPose (19), YOLO and Tensorbox) for 3D analysis of pellet reaching, the three-chamber test, and social behavior in mice, and for 3D human kinematics analysis for clinical motor function assessment with OpenPose (19). They also provide MATLAB scripts for additional analysis (after pose estimation).

[10*] Using DeepLabCut for 3D markerless pose estimation across species and behaviors (53). A Nature Protocols user guide to DeepLabCut 2.0, with 3D pose estimation of hunting cheetahs and improved network performance. The toolbox is provided as a Python package with graphical user interfaces for labeling and active-learning-based network refinement, together with Jupyter Notebooks that can be run on cloud resources such as Google Colaboratory (for free).

[11*] Fast animal pose estimation using deep neural networks (55). LEAP (LEAP Estimates Animal Pose), a DNN method for predicting the positions of animal body parts. This framework consists of a graphical interface for labeling of body parts and training the network.
Training and inference times are fast due to the lightweight architecture. The authors also analyzed insect gait based on unsupervised behavioral methods (61), which they directly applied to the posture, rather than to compressed image features.
