Social Relation Recognition from Videos via Multi-scale Spatial-Temporal Reasoning

Xinchen Liu†, Wu Liu†, Meng Zhang†, Jingwen Chen§, Lianli Gao¶, Chenggang Yan§, and Tao Mei†

†JD AI Research, Beijing, China
§Hangzhou Dianzi University, Hangzhou, China
¶University of Electronic Science and Technology of China, Chengdu, China

{liuxinchen1, zhangmeng1208}@jd.com, [email protected], {jingwenchen, cgyan}@hdu.edu.cn, [email protected],
[email protected]

[Figure: two example video clips labeled "Colleague" and "Couple", each annotated with persons (Man A, Woman A, Woman B), actions (Standing, Talking, Writing, Smiling, Hugging), objects (Chair, Cup, Door, Desk, Document, Hat, Bag, Tree, Grass, Mower), and scenes (Meeting Room, Outdoors).]

Abstract

Discovering social relations, e.g., kinship, friendship, etc., from visual content can help machines better interpret the behaviors and emotions of human beings. Existing studies mainly focus on recognizing social relations from still images while neglecting another important medium: video. On the one hand, the actions and storylines in videos provide more important cues for social relation recognition. On the other hand, the key persons may appear at arbitrary spatial-temporal locations, and may not even appear together in the same image from beginning to end. To overcome these challenges, we propose a Multi-scale Spatial-Temporal Reasoning (MSTR) framework to recognize social relations from videos. For the spatial representation, we not only adopt a temporal segment network to learn global action and scene information, but also design a Triple Graphs model to capture visual relations between persons and objects.