

Spatial News: Exploring Augmented Reality as a Format for Content Production, Organization, and Consumption

By

Hisham Bedri

M.S. Technology and Policy, MIT, 2015

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences at the Massachusetts Institute of Technology.

September 2018

© Massachusetts Institute of Technology 2018. All rights reserved.

Signature of Author: [Signature redacted] Hisham Bedri, Program in Media Arts and Sciences

Certified by: [Signature redacted] Andrew Lippman, Senior Research Scientist and Associate Director of the MIT Media Lab, Thesis Supervisor

Accepted by: [Signature redacted] Tod Machover, Academic Head, Program in Media Arts and Sciences


Spatial News: Exploring Augmented Reality as a Format for Content Production, Organization, and Consumption

By Hisham Bedri

The following served as a reader for this thesis: Thesis reader: [Signature redacted] V. Michael Bove, Principal Research Scientist, Object-Based Media Group, MIT Media Lab

Spatial News: Exploring Augmented Reality as a Format for Content Production, Organization, and Consumption

By Hisham Bedri

The following served as a reader for this thesis:

Thesis reader: [Signature redacted] Iyad Rahwan, Associate Professor of Media Arts and Sciences, AT&T Career Development Professor of Media Arts and Sciences

Abstract: News has been criticized for being fake and for promoting echo chambers. At the same time, spatial technologies have become more accessible, enabling affordable virtual reality (VR) and augmented reality (AR) systems. These systems open a new channel for interfaces and content. Can these technologies establish a connection between space and news, resulting in a stronger connection between viewers and the news? We address these questions by building tools for news production and content consumption that use spatial technology. Through user tests we show that spatial organization of news information can result in greater news exposure. We evaluate spatial production tools by creating three live broadcasts in VR and comparing them to broadcasts done by a production team. We also show that users have a bimodal response to 2.5D videos shown in AR. This thesis presents and evaluates a series of interactive spatial experiences to address the potential of spatial technologies for media-based journalism.

Acknowledgments

In the name of God, the most gracious, the most merciful

This thesis and its work would not have been possible without the persistent support, inspiration, and criticism from my advisor Andrew Lippman. One fine Passover, Lippman gathered his graduate students for a Seder. As is tradition, he hid a piece of matzo in his house, and the Seder would not end until the piece of matzo was found and delivered to Lippman. It was I who found that piece of matzo, and I who began negotiating with him about the terms of handing it over. I saw this as a chance to graduate. I looked Lippman dead in the eye and demanded that he allow me to graduate. Lippman, with a mix of both understanding and resolve, said: "I'll give you three olives." A few rounds of negotiation later, I found my way out of MIT. I won't reveal how many olives I got.

I want to thank the readers of my thesis, Mike Bove and Iyad Rahwan, who offered expert guidance. Thanks for your patience with my incoherence. I would also like to thank the staff who held me up along the way, including Deborah Widener, Linda Peterson, Keira Horowitz, Amanda Stoll, Monica Orta, Amna Carreiro, Maggie Church, and many others.

A gigantic thank you to my Viral family, in no particular order: Mike Jiang, Agnes Cameron, Kalli Retzepi, Britney Johnson, David Anderton, Nchinda Nchinda, Travis Rich, Jasmin Rubinovitz, Brian Tice, Leo Mebaza, and Tomer Weller. To my UROP, Meital Hoffman: thank you, and I'm sure your style of thinking and open-mindedness will take you far.

The ideas in this thesis were developed alongside the creativity and energy of Michael Draskovic, who first came up with the idea of "holographic news".

I am indebted to my intellectual muses who really pushed my thinking and outlook. A big thank you to each of you: Gabe Fields, Erin Hong, and Jonathan Harvey Buschell.

I have the unique privilege of completing two master's degrees here at MIT, and that could only have been possible thanks to the help and inspiration of those who put in effort to develop my potential. Thank you for taking a chance on me: Ayush Bhandari, Matt Hirsch, Nick Ashford, Frank Fields, Pattie Maes, Scott Greenwald, and Petros Boufounos.

Finally, I want to thank my family (Babiker Bedri, Dina Bedri, Salwa Elarabi, and Howeida Elarabi) for always being there for me. Thank you to my co-captain Thariq Shihipar, who has really seen me through every up and down and has provided a framework to analyze each. Thank you to all in my religious community, and finally, I am grateful to God for the opportunity and test which is academia.

Table of Contents

Spatial News
Abstract
Acknowledgments
Chapter 1: Introduction and Background
    Motivation
    Contribution
    Overview of the Thesis
    Background
Chapter 2: Content Creation Using Spatial Tools
    Producing 2D content using spatial tools: Broadercasting
    Broadercasting System
        Customization
        Video Inputs
        On-Screen Graphics
        Twitter
        Clips
    Broadercasting Live Tests
        Broadercasting Test 1
        Broadercasting Live Test 2
        Broadercasting Live Test 3
Chapter 3: Spatial Organization of Content
    GlobAR system overview
    System Design
        Wall-based
        Mobile-AR
    Experimental Design
    Evaluation
Chapter 4: Placing 2.5D Video Content in a User's Space
    Stereo-capture for 2.5D Volumetric Video
    2.5D Video Interview Experiment Design
    2.5D Video Interview Evaluation
    Spatial content and interactivity: AR basketball
    Creating 2.5D content from archival footage: Bringing back Malcolm X
Conclusion
Appendix
    Part A: GlobAR application
    Part B: Volumetric Video User Study

Chapter 1: Introduction and Background

Motivation: We live in the most connected time in history, yet it feels like we are more disconnected than ever. News is as old as the town crier; humans have always shared information about current events and rumors. News transmission has evolved alongside communication technologies, including runners, visible-light communication (smoke signals, semaphores), print, the telegraph, radio, TV, and eventually the internet. Along the way, news sources have tended to concentrate, culminating in a small number of channels during the golden age of broadcast television. Recent years have shown a divergence in news channels (cable news), followed by the blog, which enabled anyone to become a publisher of written news. As access to communication tools increases, one would expect a greater diversity of stories to be told. Instead, we've observed that most content interfaces are now feeds which display only a filtered set of sources. This combination of interface, algorithms, and advertisers has led to more news filters and social echo chambers.

The question I address in this thesis is whether technologies can enable a link between space and news. Can a spatial interface reduce cognitive load, focus our attention, and still present us with a breadth of sources? Can we build interfaces that pop a user's information bubble while still giving breadth and depth?

Additionally, can we utilize these spatial technologies as tools of production for both standard 2D media and new 3D content? By making this process more accessible, we can enable new stories to be introduced into the live-broadcast format.

AR and VR carry a lot of hope and hype. Journalists, users, and technologists alike are excited about the prospect of a new format.1,2 There isn't much precedent, however, as the format is still novel. Shedding more light on media in AR/VR can alleviate the disillusionment that follows when the hype curve comes down.

Contribution: This thesis explores spatial technologies for the production and consumption of 2D and 3D media. The contributions of this thesis are:

" A proposal for using spatial tools to produce news, rendered into practice through a VR studio for live-production broadcasting, as well as an evaluation of its use " A spatial-news interface which leads to greater news exposure

1 NYTimes spotlight: Immersive AR/VR https://www.nytimes.com/spotlight/augmented-reality 2 Roberts, Graham. How We'll Bring The News Into Your Home. NYTimes website 8 " An algorithm for capture and processing of stereo video to generate 2.5D velimetfie video content " A mobile AR application to place 2.5D video in space and the design of a psychological study to evaluate AR videos and spatial memory

Overview of the Thesis: This thesis explores the connection between news and space in three parts: spatial-tools for production, spatial-organization of news information, and spatial consumption of 2.5D videos.

Chapter 2 of this thesis focuses on using spatial technology for the production of media, specifically spatial interfaces for generating 2D (linear video) content. The Broadercasting VR studio is introduced, discussed, and evaluated.

Chapter 3 of this thesis focuses on the effect of spatial technologies on interfaces for the consumption of news and media content. A spatial news interface called GlobAR is introduced, discussed, and evaluated. This section focuses on the problem of spatial organization.

Chapter 4 of this thesis focuses on the issues of presence, interactivity, and spatial anchoring. The Popout project is introduced, evaluated, and discussed. This project focuses on the concept of presence and whether AR presentation of 2.5D video in a user's space has an effect on the user's perception of presence.

Background:

In the not-too-distant past, globes were found in many homes. These objects occupied physical space and allowed people to directly access the concept of the geography of a planet. Similarly, there were many scientific instruments which tracked the weather, the stars, and the time and date. In addition, table-top models were used as early as Napoleon's time to plan wars and battles. These physical objects, however, have been rapidly replaced by 2D projections on our phones, tablets, and screens.

Humans have remarkable 3D capacities. Cognitive scientists have studied the spatial capacity of our brains, both for navigation and for memory. The visuospatial sketchpad is part of our working memory, and is utilized by the central executive in retaining and using information.3 Research has also shown that spatial memory is hierarchical: humans can identify landmarks within a layout and orient themselves within that layout using those landmarks.4 Landmarks can also serve to store and retrieve information, as in the method of loci (memory palace), wherein memorizers increase their capacity by relating information to a physical space (real or imagined) and performing a walkthrough of that space.5 Furthermore, studies have shown that people are better at recognizing real-world 3D objects than 2D photos of them, indicating that the spatial/3D/present nature of objects has a different effect on the viewer than our current 2D interfaces can provide.6

3 Ang, Su Yin, and Kerry Lee. "Central executive involvement in children's spatial memory." Memory 16.8 (2008): 918-933.
4 Chun, Marvin M., and Yuhong Jiang. "Contextual cueing: Implicit learning and memory of visual context guides spatial attention." Cognitive Psychology 36.1 (1998): 28-71.
5 Gross, Richard. Psychology: The Science of Mind and Behaviour, 7th edition. Hodder Education, 2015.

Augmented Reality (AR) is a catch-all term describing displays and interfaces which overlay digital information on the real world. AR has taken root in heads-up displays as well as in screens which display the real world behind them as captured by a camera. The first heads-up AR display was created by Ivan Sutherland in 1968.7 One key to these technologies is the process of localizing the device spatially with six degrees of freedom (6DOF: forward/back, left/right, up/down, pitch, yaw, roll). The family of algorithms that does this using cameras and other sensors is called Simultaneous Localization and Mapping (SLAM).8 SLAM has been performed using visual markers9 and natural features in the world.10 Recently this has been commercialized in the Microsoft HoloLens, and in ARKit and ARCore for mobile devices. These devices can be used to create persistent 3D objects that a user sees in their space and can interact with.

Throughout this thesis, I will refer to technology that allows 6DOF viewing and interaction as spatial technologies. A majority of this thesis talks about mobile AR when referring to AR.

Feature                   | Google Cardboard (simple VR viewers) | Vive (full VR viewers) | HoloLens (AR headset) | Mobile-phone-based AR
Roll/Pitch/Yaw            | Yes                                  | Yes                    | Yes                   | Yes
3D movement               | No                                   | Yes                    | Yes                   | Yes
Controller Roll/Pitch/Yaw | No                                   | Yes                    | No                    | No
3D controller             | No                                   | Yes                    | No                    | No

Table 1: A taxonomy of spatial devices available to the consumer at the time of writing this thesis.

VR and AR research over the years has produced paradigms for interaction and locomotion in a 3D space. One paradigm that is particularly interesting is World in Miniature (WIM), described by Doug Bowman in 3D User Interfaces.11 Objects and spaces are shrunk down into dollhouse versions of themselves and presented to the user. The user can then make changes to the dollhouse version to effect change on the 1:1-scale version. This interaction paradigm enables more intuitive locomotion, interaction, and a bird's-eye perspective of the 3D space. This is a paradigm I explored for spatial news.

6 Snow, Jacqueline C., et al. "Real-world objects are more memorable than photographs of objects." Frontiers in Human Neuroscience 8 (2014): 837.
7 Sutherland, Ivan E. "The ultimate display." Multimedia: From Wagner to Virtual Reality (1965).
8 Thrun, Sebastian, and John J. Leonard. "Simultaneous localization and mapping." Springer Handbook of Robotics. Springer Berlin Heidelberg, 2008. 871-889.
9 Kato, Hirokazu, and Mark Billinghurst. "Marker tracking and HMD calibration for a video-based augmented reality conferencing system." Augmented Reality, 1999 (IWAR '99) Proceedings, 2nd IEEE and ACM International Workshop on. IEEE, 1999.
10 Engel, Jakob, Thomas Schöps, and Daniel Cremers. "LSD-SLAM: Large-scale direct monocular SLAM." European Conference on Computer Vision. Springer, Cham, 2014.
11 Bowman, Doug, et al. 3D User Interfaces: Theory and Practice, CourseSmart eTextbook. Addison-Wesley, 2004.

Figure 1: World in Miniature (WIM) technique for locomotion and interaction in VR.

Web news has evolved from 2D print news, and often does not utilize the spatial aspect or the third dimension. There have been news-interface experiments that utilize 3D to visualize aspects of news. MSNBC's NewsWare studio produced Spectra,12 a visually stunning interface organized by topic. Spectra organizes topics across the 2D space and allows the user to explore in a delightful way. Jonathan Speiser created WorldLens,13 a 3D organization of the news shown on a large touchscreen display. The interactive experience had a topic view and a map view. In the map view, one could see where news originated from and what places were being covered.

Figure 2: WorldLens screenshots; news information is displayed geographically.

Unfiltered News is a news aggregator that maps news stories to their country of origin. The news is displayed on a 2D map, with each country represented as a circular bubble. The interface supports multiple languages, so it is possible to view stories written from a local perspective.14

12 Spectra news viewer, MSNBC.
13 Speiser, Jonathan Eliezer. WorldLens: Exploring World Events Through Media. Diss. Massachusetts Institute of Technology, 2014.
14 Unfiltered.news from Jigsaw. http://unfiltered.news

Figure 3: Unfiltered News: an interface for viewing news spatially.15

Both interfaces used 3D projections onto 2D space to create interactive experiences; however, the experiences were trapped in 2D screens which users navigate by clicking or gaze. AR offers the possibility of embedding stories in a user's space, and of forcing a user to see content they weren't looking for as they maneuver their phone across an augmented-reality space.

15 Unfiltered.news from Jigsaw. http://unfiltered.news

Chapter 2: Content Creation Using Spatial Tools: When a major event occurs, one does not necessarily have the resources or time to bring in a production truck. During the demonstrations in Tahrir Square, protestors captured photos and videos on smartphones to document and share experiences instantly, creating pop-up amateur broadcasts.

If one wanted to become their own CNN, including cut-aways, multi-camera production, and info-graphics, one would need a production truck and a highly experienced crew. Spatial technologies can help to reduce the cost-of-entry for content creators. In this chapter, I will discuss producing 2D live content without a production truck. This enables new kinds of content from creators we would not normally view. This can also facilitate new kinds of relationships between consumers, prosumers, and the big networks.

Producing 2D content using spatial tools: Broadercasting. The group we belong to has a distorting effect on our perception of reality. This was the conclusion of a seminal 1954 case study into confirmation bias called "They Saw a Game."16 The case study documents the reactions of Princeton and Dartmouth students to a rather dirty football game in which both quarterbacks were removed due to heavy injuries. The study showed that bias about who started the rough play was colored by which school the participants attended. This bias permeated into details such as the number of rule infractions committed by each team.

So how can two people see the same game yet take away completely different realities? If we could make two groups of people watch the same game and agree on a shared reality, we could find a way to depolarize populations.

Fast forward to today, where sporting and political events are broadcast for two audiences, home and away (or conservative and liberal). There are usually at least two broadcast trucks present at live events, each with its own production staff, engineers, and distribution capacity. Both trucks get access to the same camera feeds, although each has a wildly different audience it's feeding to. Thus stories, shots, and commentary are colored by the interests of each audience. This channelization polarizes an audience into two groups, home and away. This traditional broadcasting architecture lends itself to picking a side. This style of broadcasting can be viewed below.

16 Hastorf, Albert H., and Hadley Cantril. "They saw a game; a case study." The Journal of Abnormal and Social Psychology 49.1 (1954): 129.

Figure 4: Traditional broadcasting. Diagram describing the two broadcasts generated from a sports game for the home and away audiences.

However, audiences are not easily sorted into two categories. There are those who follow a fantasy game, receiving points based on the performance of individual players spread throughout the league. There are those who are interested in the fashion and celebrity appearances at the games. There are even those who are not interested in sports at all, but follow the live event for the commercials. What if, from a single event, it were possible to produce content that satisfies the complex interests of these audiences? What if there were a different production truck for each of these interests? This is shown in the diagram below.

Figure 5: Group-forming broadcasting. Diagram describing a potential multitude of broadcasts generated from a sports game for a wide array of overlapping audiences (home, away, fashion, fantasy football, celebrity appearances).

By enabling more groups to form, we can take an audience that is normally polarized along one axis, and allow it to form more complex and cross-cutting groups. By showing an event in its full complexity, or at least giving the audience the option to see it as such, one can begin to see a shared reality and reduce the tribalism associated with traditional broadcasting.

In order to enable a multitude of broadcasts from the same live event, one traditionally requires many production trucks, each with its own staff. We observe, however, that much of the raw footage, camera feeds, and clips can be shared by these multiple trucks. We aim to ease the friction of producing a broadcast by creating a production truck which is easy to set up, customize, and share content with. We also observe that advances in VR enable us to make experiences that are larger than the screen of an app or a website. We propose and have developed a VR broadcasting truck which enables fast and inexpensive pop-up broadcasts for events. By reducing the cost of producing a broadcast, we hope to enable group-forming broadcast networks.

Traditional broadcasting networks grow linearly with the number of participants tuning in. Phone networks grow with the square of the number of participants in the network. The number of groups, however, grows exponentially with the number of members in the network.17 This is illustrated below.

17 Reed, David P. "The Law of the Pack." Harvard Business Review, February 2001, pp. 23-24.

Figure 6: Diagram showing the number of connections in a standard broadcast (left), a phone network (middle), and a group network such as WhatsApp (right). Here n is the number of users in the network and k is the size of a group (an integer between 2 and n-1). The number of broadcasts is B = n, the number of connections is C = n(n-1)/2, and the number of groups is G = Σ_{k=2}^{n-1} n! / (k!(n-k)!), with each group corresponding to a combination of k users drawn from the total group of n users.
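As a quick sanity check of these scaling relations, the following is a minimal worked example in Python for a hypothetical network of n = 10 members (the value of n is chosen for illustration and is not taken from the thesis):

```python
# Worked example of the scaling relations in Figure 6 for a hypothetical
# network of n = 10 members (n chosen for illustration only).
from math import comb

n = 10
broadcasts = n                                   # B = n
connections = n * (n - 1) // 2                   # C = n(n-1)/2
groups = sum(comb(n, k) for k in range(2, n))    # G = sum over k = 2..n-1 of n!/(k!(n-k)!)

print(broadcasts, connections, groups)           # 10 45 1012
```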

Previous work in this space by Dan Sawada (Recast)18 enabled users to produce their own newscast. In this interface, users could produce their own daily-show program that incorporated clips from the news cycle. In a big way, YouTube and meme-ification on WhatsApp have demonstrated that this idea can work well for viralizing and redistributing content.

Figure 7: Example of Recast, an interface that allows you to combine news clips.18

We were also inspired by emerging accounts on social media such as House of Highlights on Instagram. House of Highlights is a popular Instagram account which produces highlight clips of NBA games. The account doesn't focus on any individual team. This is a successful example of amateur broadcasting.


Figure 8: House of Highlights is an Instagram account that compiles highlights from games.19

18 Sawada, Dan. Recast.
19 House of Highlights, Instagram.

Broadercasting System: By making broadcasting easy and accessible, we can enable more broadcasts of the same events, leading to more stories than the two polarizing ones we see today. We developed a VR broadcasting system which puts the broadcast truck on your head and makes it easier for anyone to produce a broadcast. We provide tools by which a producer can draw upon images, graphics, data, and live cameras to create a video stream equivalent to a broadcast. This permits several people to cooperate, and it is designed to work for organized sports as well as a pop-up studio for breaking news or live events. We build the space in virtual reality and operate it through a VR/AR head-mounted display. Broadercasting explores using mixed reality as a tool for live, collaborative content creation.

The main advantages of the Broadercasting studio over a broadcast truck are:
* Customizability: multiple screens that you can position anywhere
* Direct manipulation of graphics on screen
* Collaboration: clips can be shared
* Social media integration

The Broadercasting studio is implemented as an application on the Vive VR headset. This system has two controllers and a headset that are tracked in 3D with sub-millimeter accuracy. The necessary equipment for the system is a VR headset and a PC. The system is compatible with saved videos, and will soon be extended to live video-stream inputs. This is demonstrated in the high-level diagram below.

Figure 9: Diagram of the Broadercasting system, showing live video inputs and graphics feeding the studio.

This culminates in the VR studio visible below:

Figure 10: Diagram showing the main components of the user's perspective while using the Broadercasting system: the final broadcast, the main screen, editable graphics, the Twitter feed, and clip selection.

Customization: The VR system has been tested with up to 8 different video inputs. Each input can be placed on as many moveable screens as the user wants. The user can position these screens anywhere in the space in 3D. This enables the user to customize their space for efficiency, or to utilize their spatial memory to locate a specific clip.


Figure 11: Example of a user resizing the capture window of a panoramic soccer game in the Broadercasting system. Camera feeds courtesy of Pixellot.

In the studio, the user can manipulate the output of regular cameras, or utilize new wide-field cameras which capture the whole field at once. In this mode, the user can identify the window within this field that is of interest. This window responds to both scale and position through a two handed pinch/zoom.

Figure 12: Example of a user resizing the capture window of a panoramic soccer game in the Broadercasting system. Camera feeds courtesy of Pixellot.

Video Inputs: When the user steps into the Broadercasting studio, they have a number of small screens in front of them and one large main screen. The smaller screens represent the video input feeds. By clicking on any of the smaller screens with the thumb-pad, the user activates that feed and sends it to the main broadcast screen. This is the "program" monitor: the feed that gets transmitted.

On-Screen Graphics: On the main screen, the user can directly manipulate the graphics that will be transmitted. By clicking and grabbing a graphic, the user can move it along a grid on the screen. The graphic snaps to the grid to make it easier for the user to place it in a suitable position. By clicking on the right or left thumb-pad, the user can change the size of the graphic element. The user can input any .JPG or .PNG for the graphic elements.
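As a rough illustration, this snap-to-grid behavior amounts to rounding a dragged position to the nearest grid cell. The sketch below assumes a hypothetical grid pitch in normalized screen coordinates; it is illustrative, not the studio's actual Unity code.

```python
# Illustrative snap-to-grid helper; the grid pitch (in normalized screen
# units) is an assumed value, not taken from the Broadercasting implementation.
def snap_to_grid(x, y, pitch=0.05):
    return round(x / pitch) * pitch, round(y / pitch) * pitch

print(snap_to_grid(0.137, 0.702))   # approximately (0.15, 0.70)
```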

Figure 13: Example of a user manipulating on-screen graphics in the Broadercasting system. Camera feeds courtesy of NBC.

Twitter: Broadercasting attempts to close the loop with the viewer of the broadcast by directly incorporating social media. In the studio, there is a twitter widget which enables the broadcaster to scroll through the latest tweets. When the user finds a tweet they would like to put on the main screen, the user can select the tweet in the same way that they would select the input feeds. This tweet will then be pushed to the main-screen.

Figure 14: Example of a user selecting tweets. Camera feeds courtesy of NBC.

Clips: Underneath every video input is a button with three bars. When the user clicks this button, a menu pops up with a set of thumbnails. Each thumbnail represents a clip that the user can put on the main screen. The user can scroll through the clips with the thumb-pad to select the appropriate one. When a clip is selected, it pops into a mini-screen to the right of the menu, which lets the user view the clip in GIF form. In addition to being able to display this mini-clip on the main screen, the user is able to click on the share button, which creates a copy of the clip that the user can pick up and move around. This enables a division of tasks, wherein one user can perform the tasks of selecting clips and sharing them with other users, while other users focus only on switching between live feeds.

Figure 15: Example of a user selecting clips in the Broadercasting system. Camera feeds courtesy of NBC.

The decision to convert a live feed into a set of thumbnails was made in order to optimize the VR experience. While the 3D-tracked controllers in VR are great for spatially placing and moving objects, they are not adept at setting start and stop times on a video timeline. On-surface interaction using a mouse is faster and more accurate for setting clip start and end times. Despite the superiority of a mouse for this task, the VR-style task of searching through a video by looking at visual thumbnails through time is an engaging way of going about it.

23 Broadercasting Live Tests:

Broadercasting Test 1: The first evaluation of the Broadercasting system was done on February 16, 2018, during the Media Lab Talk on quantified forgiveness. The goal of this test was to establish the functionality and compatibility of the VR system with professional broadcast equipment. The VR broadcasting studio was connected to 4 live cameras and 1 audio feed, and the operator was in VR for over an hour producing a 2D linear video. The operator switched between camera feeds and graphics, and monitored and displayed tweets.

The Media Lab Talks is a series in which a lab representative (usually Joi Ito) has a conversation with a guest or group of guests about their work. This is usually followed by a question-and-answer session open to the audience. The event is usually broadcast live online and recorded. Two studios work together to accomplish this, Studio 125 and Diginovations. This presented a convenient and instructive opportunity to evaluate the Broadercasting system.

The crew of Studio 125 set up four cameras on site and established one dedicated feed for graphics. All these signals are transmitted at 1080i resolution at a frame rate of 59.94 Hz via SDI. One camera is focused entirely on the speaker, the second is focused on Joi Ito, the third is a wider shot focused on both, and the fourth camera roams the audience. The Studio 125 crew communicates with the camera operators to zoom in and out, usually on Joi Ito and the guests.

In order to capture the feeds, we utilized a Blackmagic DeckLink Duo 2 capture card, which can digitize up to 4 HD-SDI camera feeds (3G). During this test, we did not have a motherboard large enough (due to video-card crowding) to capture all 5 feeds, so the roaming camera feed was dropped. These feeds were brought into Unity using the AVPro Live Camera Unity asset plugin. Audio was captured using an XLR-to-3.5mm adapter into the motherboard's sound card. This adapter wasn't great and didn't sit well, resulting in a scratchy noise added to the captured audio.

Two operators tried the system, with the first operator spending approximately 60 minutes in the system and the second operator broadcasting the last 20 minutes. The experiment led to the following observations:

It was very easy to switch between camera feeds and the browser; furthermore, seeing what was coming through on the live cameras was also intuitive. Controlling the web browser to navigate to various Twitter pages was difficult, because it utilized both head-pose and mouse-pointing and was neither intuitive nor effective. In addition, the tweets displayed were cropped on the right and left sides.

The largest hurdle was the attention problem. Monitoring four screens and a Twitter feed in real time was a daunting task for a single operator. Often, while searching through tweets, the operator would not pay attention to what was happening on the other screens. This would lead the operator to not switch away from the screen when the guests were doing uninteresting things, such as checking their phones. Another issue for this experiment was the lack of communication with the camera operators. While the Studio 125 crew was able to tell the camera operators to focus on a specific person or shot, the operator in the VR system was only able to react to the changes. This led to quick cutaways as soon as the camera started moving. Overall, it was promising to observe that the Broadercasting system was capable of producing content from a live event. There wasn't a clear indication, however, that the experimental setup was introducing something new to the broadcasting task.

Broadercasting Live Test 2:

Figure 16: Screen capture from the second Broadercasting live test. On the left side is a screenshot from the output of the system, including on-screen logos on the top left and bottom left. On the bottom right is a view of the user in the VR studio switching between sources and a Twitter feed. Camera feeds courtesy of Studio 125, Diginovations, and the MIT Media Lab.

This was the second time trying the Broadercasting system. The primary goal of this test was to capture a full video feed using the VR system and compare it to the feed generated by the traditional broadcast crew. The footage captured during the first test was cut off in places, which prevented us from properly comparing the two feeds. A bug occurred during this broadcast such that I was not able to remove the Viral Communications logo from the top left of the screen. This presented a problem in the VR broadcast because Joe Paradiso (left) was often blocked.


Figure 17: Screenshot of a video comparison tool which enables a user to view two broadcasts simultaneously. This was used to compare the broadcast done by a production team (left) and the broadcast done in the VR studio (right). Camera feeds courtesy of Studio 125, Diginovations, and the MIT Media Lab.

Comparing the two broadcasts, many of the angle choices were the same between the VR broadcast and the traditional broadcast. The VR broadcast had cuts that occurred more often than in the traditional broadcast, including cuts between the main cameras and cuts to outside content. In general, the traditional broadcast appeared much smoother. The cuts during the VR broadcast to outside content were to the Twitter website and to Google searches for images of the speaker's books. While these made sense at the time during the broadcast, in review they seem too distracting, and I would have rather seen the author's face in the broadcast. Halfway through the VR broadcast, I began to rearrange the monitors into a configuration where it would be easier for me to see and contextualize changes in the event. The audience view was placed towards the front (similar to the physical audience configuration), while the other monitors were placed in their respective positions relative to the stage. I felt this helped slightly in following the attention of the story, but it was a minor change.

Broadercasting Live Test 3: In the third test of the Broadercasting system, I spent more time preparing external video and image content for the broadcast. This resulted in a much more colorful broadcast than the traditional one. Content was found by reading the speaker's New Yorker articles and compiling a list of content in a Google Doc (images, YouTube videos). Although the list covered most everything the speaker had written about, only about 25% of the content shown came from that list and was directly relevant to the conversation. The other 75% was improvised, but was very pertinent to the conversations being had.

Figure 18: Screen capture from the third Broadercasting live test. On the left side is a screenshot from the output of the system, including on-screen logos on the bottom left. On the bottom right is a view of the user in the VR studio switching between camera sources and multiple browsers. Camera feeds courtesy of Studio 125, Diginovations, and the MIT Media Lab.

Figure 19: Screen capture from the third Broadercasting live test where a YouTube video was shown. On the left side is a screenshot from the output of the system, including an on-screen logo on the bottom left. On the bottom right is a view of the user in the VR studio switching between camera sources and multiple browsers.

For example, when Jill Lepore was talking about communications revolutions and party realignments, we cut to a MAGA propaganda video of Donald Trump on YouTube.

Jill Lepore: "Every shift to a new party system is associated with a communications revolution. That is a really interesting pattern and we are absolutely in the party re-alignment and communications revolution right now." On-Screen Video: Pro-trump MAGA video from Youtube

In addition to more external content, we introduced an external camera which streamed video from an Android tablet in WebM format, accessible by the in-studio browser.

The differences between the two broadcasts can be seen in this "time into space"20 view of the broadcasts. The live broadcast done by the production team can be seen on the left, while the VR broadcast can be seen on the right. With this view, it is clear that a lot more external content was added into the VR broadcast.

20 Ohmata, Hisayuki. Time into Space.

Figure 20: Time-into-space view of the multi-camera production broadcast done by the crew of Studio 125 and Diginovations.

Figure 21: Time-into-space view of multi-camera production broadcast done by an amateur user of the broadercasting system.

Chapter 3: Spatial Organization of Content

"It is all connected, it is all interdependent. You look out the window, and in my case, I saw the thinness of the atmosphere, and it really hit home, and I thought, 'Wow, this is a fragile ball of life that we're living on.' It is hard for you to appreciate that until you are outside of it." NASA Astronaut Sandra Magnus.

The overview effect is a shift in awareness reported by a number of astronauts upon returning from spaceflight.21 Astronauts report heightened feelings of fragility for the planet, feelings of oneness across national borders, and a motivation to accomplish global and societal goals.

The interfaces and points of contact for services such as Facebook and Twitter can even influence the content which is generated. Click-bait headlines and inflammatory statements are now a staple of journalism. Facebook has no dislike button and is often criticized as being an echo chamber. Twitter has the hashtag, which allows a tweet to be crudely categorized, and is often criticized for encouraging tribalism. Interfaces matter to behavior.

Spatial tech offers a way to generate new perspectives, which has potential impact for personal connection, exploration, and discovery in news. The challenge of spatial organization, however, is not well explored in the mobile AR platform.

This chapter focuses on the design of a world-event platform (GlobAR) and its evaluation through a user study of 20 participants. The results of the study showed that the interface exposed users to news stories with a greater diversity of localizations than the stock Android Google News application.

21 White, Frank. The Overview Effect: Space Exploration and Human Evolution. AIAA, 1998.

GlobAR system overview: In the field of news, the internet has siphoned away most of the contextual visual design which connoted context and importance. News on the web is largely consumed in a feed, with each item on the feed having the same importance. Furthermore, algorithms based on user behavior and advertisers heavily influence what can appear on that feed. The net result of interface design, data collection, and advertiser interest is a news feed which heavily filters and often polarizes the news.

Figure 22: Distorted graph showing the localization of the subject of news stories reported on by the Guardian in 2012.

AR is still new as a platform and design norms have not been established yet; however, it offers two solutions to the news-feed problem. First, it offers a spatial interface that can reintroduce perspective. Second, it is entertaining and interactive, which can help to alleviate the convenience problem that led us to prefer feeds over newspapers.

The GlobAR app is an attempt to break through the news filter and offer users a glimpse of the overview effect reported by astronauts. The app is a news aggregator which places stories on an augmented-reality globe that a user places in space. The app engages users to fly around the globe and discover news stories.

The app works in both an active and passive mode, allowing the user to watch random stories pop-up around the globe, or to allow their curiosity to click into and find out more about a specific region.

The news aggregator works to enliven a 3D globe with live news sources. The globe, whether on a wall display or projected into space in AR, acts as a portal connecting the user to the world in real time. This interface acts as an answer to the anxiety or itch one feels when looking at a map and imagining what is going on in far-flung locations.

System Design
The system sources all of its news from Google News. Google News is a web crawler which aggregates stories from across the web. Google's algorithms are not described explicitly; however, the inclusion guidelines (https://support.google.com/news/publisher/answer/7526811?hl=en) indicate that it prefers sources which:
* Write news about recent events
* Share original content
* Are accountable and transparent
* Do not misrepresent on purpose
* Limit ads (more content than ads on the page)

Thus it is difficult to define which sources Google picks up and which it does not. In practice, over 4,500 independent sources had published on Google News on the topic of Osama Bin Laden before 2011, including foreign and domestic sources.22

Since Google News does not list entries by latitude and longitude, we captured news items from a set of 250 location search terms. These included state names and country names. The system scrapes Google News responses for these search terms every 2 hours and assigns each item the corresponding country's or state's geographic coordinates. These are all aggregated on a separate server, and the apps communicate with this server.
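The sketch below illustrates this aggregation loop. It assumes a hypothetical fetch_news(term) helper and a small COORDS lookup table standing in for the 250 location terms; it is not the actual server code used by GlobAR.

```python
# Sketch of the scraping/aggregation loop described above. fetch_news() and
# COORDS are illustrative stand-ins, not the system's real implementation.
import time

LOCATION_TERMS = ["Massachusetts", "France", "Mali"]   # the real system uses ~250 terms
COORDS = {"Massachusetts": (42.4, -71.4), "France": (46.2, 2.2), "Mali": (17.6, -4.0)}

def fetch_news(term):
    """Placeholder for querying Google News for a location search term."""
    return []   # would return a list of story dicts

def scrape_once(store):
    for term in LOCATION_TERMS:
        lat, lon = COORDS[term]
        for story in fetch_news(term):
            story["lat"], story["lon"] = lat, lon   # tag each story with coordinates
            store.append(story)

if __name__ == "__main__":
    store = []                   # aggregated on a separate server; the apps read from it
    while True:
        scrape_once(store)
        time.sleep(2 * 60 * 60)  # re-scrape every 2 hours
```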

In addition to the news information, the app requests live precipitation data to display on the globe as well.

Wall-based: The wall-based version of the app was demoed on a PLANAR touch display. The application has a globe at the center, whose position is fixed at the center of the screen. The rotation of the globe is controlled automatically in passive mode, or manually in active mode. In passive mode, the globe spins slowly, at 1 revolution every 960 seconds. There is a cursor at the center of the screen, and every 0.2 seconds the app retrieves new news items at the geographic location of this cursor. The app populates new news items which are not currently present on the globe. The total number of items is currently capped at 100, and the newest item boots off the oldest one. When a news item is near or directly underneath the cursor, the news item moves to the front of the screen and is magnified. When it is far enough from the cursor, the news item returns to its floating position.

In active mode, the user can spin the globe with a touch and drag. This briefly populates new news items into the focal position; once the drag continues for long enough, a new news item populates the focal position. When the user is not touching the screen, the app returns to passive mode and gently rotates on its own. This creates a showcase of news items as the globe spins.
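A minimal sketch of the passive-mode update loop follows, using the constants above (1 revolution per 960 seconds, a 0.2-second polling interval, and a 100-item cap); cursor_location() and news_at() are assumed helpers, not the application's actual code.

```python
# Passive-mode globe update sketch; constants follow the text above, while
# cursor_location() and news_at() are assumed (hypothetical) helpers.
from collections import deque

REV_PERIOD_S = 960.0   # one full revolution every 960 seconds
POLL_S = 0.2           # retrieve items under the cursor every 0.2 s
MAX_ITEMS = 100        # the newest item boots off the oldest

items = deque(maxlen=MAX_ITEMS)   # news items currently floating on the globe
rotation_deg = 0.0

def step(cursor_location, news_at, dt=POLL_S):
    """Advance the globe by dt seconds and populate new items under the cursor."""
    global rotation_deg
    rotation_deg = (rotation_deg + 360.0 * dt / REV_PERIOD_S) % 360.0
    lat, lon = cursor_location(rotation_deg)
    for story in news_at(lat, lon):
        if story not in items:        # only add items not already present on the globe
            items.append(story)
```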

22 Christian, Jon. We Still Don't Know How Google News Works. (https://theoutline.com/post/2512/we-still-don-t-know-how-google-news-works)

Figure 23: Examples of someone using the GlobAR wall-based system. The user can see world news in passive mode, and can highlight news from a specific location in active mode.

Mobile-AR:

Figure 23: Example shown from the GlobAR app. Here the user has focused over Mali in Africa, and a story populated on the main Carousel.

The mobile-AR app allows the user to view the globe on their ARCore-enabled phone. The user can place the globe into their physical space. There are two buttons presented to the user in the interface: rotate-globe and place-globe. When the user hits the rotate button, the globe rotates on its own, and news stories populate underneath the cursor. When the user clicks on the globe to rotate it themselves, this rotation mode turns off. The rotation mode can also be turned off by clicking the rotate button again. The place-globe button places the globe 30 centimeters in front of the user's phone. Thus users can place the globe anywhere, such as on top of a coffee mug or on a certain spot on their desk. Once placed, the globe will not track changes in the environment, which can lead to impossible-occlusion situations. For example, if the globe is placed on a tabletop and a large stack of books is then placed on that particular spot, the globe will not adjust its position.

The largest difference between the wall and mobile interfaces is how the cursor is populated. The cursor point populates at the intersection of the globe and a line drawn between the globe center and the user's phone. The cursor will not populate, however, unless the user is clicking on the main screen and manually moving the globe. The reason for this design choice is that the user's hand position in 3D is never consistent and clean, as it is difficult to hold a phone in a constant position for that long. Thus, to avoid noise in story selection, the cursor only populates and captures news when the user is dragging on screen (while in active mode). If the globe is in rotation mode (passive mode), then the user's movement of their phone without touch will cause the cursor to move. This choice was made after testing the app and seeing that in passive mode, users were not as focused on specific stories as they were in trying to see what was happening around the world.
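This cursor geometry can be illustrated with a small calculation: take the direction from the globe's center toward the phone, intersect it with the sphere's surface, and convert that point to latitude and longitude. The helper below is an assumption about how such a projection could be done, not the app's Unity implementation.

```python
# Illustrative cursor projection: intersect the globe with the line from its
# center toward the phone and convert the hit point to lat/lon. The axis
# convention (y up, toward the north pole) is an assumption.
import math

def cursor_latlon(phone_pos, globe_center):
    dx, dy, dz = (p - c for p, c in zip(phone_pos, globe_center))
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    ux, uy, uz = dx / r, dy / r, dz / r          # unit direction toward the phone
    lat = math.degrees(math.asin(uy))
    lon = math.degrees(math.atan2(ux, uz))
    return lat, lon

# Phone held 0.3 m above and 0.4 m in front of the globe's center:
print(cursor_latlon((0.0, 0.3, 0.4), (0.0, 0.0, 0.0)))   # roughly (36.9, 0.0)
```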

Another difference between the mobile and wall-based versions is the positioning of the stories while in focus mode. In the mobile version, two of the top stories are stacked vertically in the focus region. Whenever a user clicks on a story, a browser window will open on the phone. If the user hits the back button, it returns to the globe app. The app was designed to allow the user to see focus items while still being able to observe the globe. The focus items follow the user (locked in place with the camera), so that until the user interacts to move the globe, the focus items will not change.

Figure 24: A description of the on-screen UI for the GlobAR application (focal story, place-globe button, passive-mode button). The application here is shown running on a Google Pixel 2.

Figure 25: Example of stories populated into the carousel (top area). These remain sticky even as the phone moves (left). Clicking on one of these stories opens a browser link (right).

Experimental Design
In order to test the key hypotheses about the interface, we conducted user tests of the system. The main hypothesis was whether the GlobAR interface allowed for more news exposure than a standard 2D news interface while providing the same user engagement as a news feed. To explore this, we gave users the same allotment of time to use both the Android Google News application and the GlobAR interface. The sites visited and exposure to stories were tracked. After the experiment, we asked users to fill out a survey about their use of the systems. We asked them to report on their behavior while using both systems, and on their normal sources and news-viewing behaviors. By tracking their behavior, we recovered quantitative data about user behavior and exposure. Through the questionnaire, we collected data on ease of use and engagement while using both applications.

In order to determine the source of a story's content, entities were extracted from story titles and geo-located. Stories that did not mention a location were not localized and not counted. From the spatial news app, stories that appeared on the carousel (bottom enlarged area) of the app were counted. From the google news app, any story that appeared was counted.

From the google news application, 248 / 342 unique stories were localizable and included. From the spatial news application, 335 / 481 unique stories were localizable and included.
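A minimal sketch of this localization step is shown below. It uses a tiny hand-built gazetteer as the geocoder; the actual entity extractor and geocoding service used in the study are not reproduced here.

```python
# Illustrative localization of story titles via a small gazetteer lookup.
# The GAZETTEER contents and the matching rule are assumptions for illustration.
GAZETTEER = {"armenia": (40.1, 45.0), "mali": (17.6, -4.0), "monaco": (43.7, 7.4)}

def localize(title):
    for word in title.lower().replace(",", " ").split():
        if word in GAZETTEER:
            return GAZETTEER[word]
    return None   # stories with no recognizable location are not counted

print(localize("Protesters return after talks fail in Armenia"))   # (40.1, 45.0)
print(localize("Markets rally on strong earnings"))                # None
```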

Participants utilized both interfaces for 60 seconds each. The order of interface use was not randomized: users used the Google News application before using the spatial news application. Thus a valid interpretation of the quantitative results is that users were searching for more stories in the spatial news application after their exposure to a set of stories in the Google News application. There is evidence, however, that users were actually seeing more stories (and were not merely biased by ordering). Users reported seeing more stories they would not normally see while using the spatial news application, though this difference is not statistically significant (p=0.1511).

Evaluation: The user study was completed with 18 participants. The localization of stories is shown in the heat maps below.

Figure 26: Story localization heat maps for the spatial news app (left) and the Google News app (right), aggregated over all users.

The results show that the heat map is more spread-out while utilizing the spatial news application, while most of the news on the google news application is localized in North America, China, and the Middle-East.

The increased exposure could be due to the scrolling nature of the globe. As there is only one focused story at a time, stories pop up as the user navigates to the area they want to look at. This is supported by viewing the navigational pathways of the users, which show that users did indeed take pathways around the globe that covered large swaths of latitude and longitude.


Figure 27: User pathways as they used the spatial news application. These paths represent motion of the user's phone in physical space as they hovered above the AR globe. A sphere representing the globe can be viewed at point (0,0,0). Units are in meters.


Figure 28: Projection of a subset of user pathways while using the spatial news application. Discontinuities are due to relocalization in tracking or replacement of the globe.

In addition to the distribution of stories, users reported on how they felt about the experience. Users were prompted with a series of statements, such as "the application was intuitive," and were asked to rate their agreement with each statement (1 = Strongly Disagree, 7 = Strongly Agree). Figure 29 shows the responses to "It is intuitive" and "It is simple to use."


Figure 29: Responses for the Google News application are in blue, and responses for the spatial news application are in red. The plots show the means of the responses to "The interface was simple to use"; the plot on the left represents the Google News app, and the plot on the right represents the spatial news app. Plots and calculations were created with this calculator.23 The whiskers represent 1 standard deviation in either direction. A two-tailed, unordered t-test gives a p-value of 0.001078, so the difference in means is statistically significant under a normal-distribution assumption.
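For reference, a comparison like the one in Figure 29 can be reproduced with a standard two-sample t-test. The ratings below are made-up placeholders (the study's raw responses are not reproduced here), so the printed p-value will not match the reported 0.001078.

```python
# Hedged example of a two-sample t-test on simplicity ratings; the data are
# illustrative placeholders, not the study's actual responses.
from scipy import stats

google_news_ratings  = [6, 7, 6, 5, 7, 6, 6, 7, 5, 6]   # placeholder values
spatial_news_ratings = [4, 5, 3, 6, 4, 5, 2, 5, 4, 3]   # placeholder values

t, p = stats.ttest_ind(google_news_ratings, spatial_news_ratings)
print(f"t = {t:.3f}, p = {p:.6f}")
```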

23 http://www.physics.csbsju.edu/cgi-bin/uncgi/plot/bulkbox

While users reported that both interfaces were intuitive, they were split about whether the spatial news application was simple to use. 11/19 users agreed that it was simple to use, while 14/18 users agreed that the Google News application was simple to use (see Figure 29). This is also reflected in the short-answer feedback on the spatial news application:

"It would take a little getting used to in order to use it as quickly and efficiently as a scroll-through app, and I can see myself being too lazy to dig for the stories that are hidden behind other stories. I enjoyed physically zooming in on stories, but I might feel a little silly doing that in public, or it might be inconvenient to do in a crowded space like the train."

This user highlighted the difficulty of picking stories from a specific location (as it requires positioning the phone above that location or rotating the globe with screen touches). The AR nature of the application also makes for a paradigm which is not amenable to use on public transport or when not near a desk.

Users were in agreement that the spatial news application had stories that they would not usually see. It can be seen below in red that the spatial news responses are skewed towards the right.

There were stories I wouldn't usually see


Figure 30: Results for whether there were stories the user would not normally see. Strongly Disagree is 1, and Strongly Agree is 7. The responses in red are for the GlobAR application, and the responses in blue are for the stock Google News application.

After using both interfaces, users reported which application they preferred. Below you can see results aggregated over all users. Users selected the Google News application twice as often as the spatial news application. A majority of users, however, agreed that the spatial news application was more fun to use.


Figure 31: Results of the user study asking which interface they would prefer and which is more fun to use. Blue is the proportion of users selecting the stock google news application. Red is the proportion of users selecting the GlobAR application.

Later in the user study, an interface change was implemented to address user feedback by increasing the size of stories, placing fewer stories in the carousel (reducing the number of stories that show at a time), and allowing for faster navigation. Results are shown below for how the interface change affected user experience, for the statements "It is simple to use" and "I can recover from mistakes easily" (old vs. new globe).


Figure 32: Results of user experience for using the spatial news application before the interface change (blue) and after the interface change (red).

The interface changes had a positive effect on user preference towards using the spatial news application over the google news application:


Figure 33: Results showing how an interface change midway through the study affected which application users preferred. The old globe app is on top, and the globe application after the change is on the bottom. Users preferred the spatial news application 12% more after the interface change.

The results of the study indicate that the spatial news application leads to the discovery of stories users would not typically see; however, the AR format introduces a dynamic which only a subset of users liked. The final graph shows that with the new interface implementation, 60% of users preferred the spatial news application over the Google News application. With further refinement of the interface, it seems possible to produce an application that could provide an alternative to the Google News feed.

Chapter 4: Placing 2.5D Video Content in a User's Space

"Travel is fatal to prejudice, bigotry, and narrow-mindedness." - Mark Twain

Foreign correspondents have the difficult job of relaying the full context of a geo-political situation in a short segment that appears as part of a news feed on a 2D website. There are many layers of embedding that the story appears in, so it is challenging to express the full context of culture, language, and history. The mobile AR format presents an opportunity to break through those layers of embedding.

2.5D video introduces presence and interactivity, and it spatially anchors the information. Organizations have begun to embrace VR journalism; one example is an experience where a user wears a Google Cardboard viewer to be immersed in a 360 video of a refugee's journey24. AR experiences have also been introduced on AR headsets to put static cut-outs of images (often of refugees) into a user's space25. These techniques require the use of a headset; however, similar telepresence experiences have been achieved with head-coupled perspective26 and autostereoscopic displays27. Mobile AR offers the opportunity to put a video of a 3D-captured person into a user's space without the user wearing a headset. A user could hold up their phone and see an interview being conducted in their space.

The mobile AR format has challenges associated with production. Proper capture of volumetric video has been accomplished with large rigs and setups28 29. Content must be placed inside the capture setup, making this prohibitive for field reporters or for content that has already been captured. In addition, transferring 3D meshes can put a strain on the bandwidth and processing capacity of the mobile phone.

In this section I show some techniques to capture new 2.5D video content and to convert archived 2D content into volumetric video content. I also conduct a user study with 20 participants to establish a baseline for the effect of mobile AR on feelings of presence. Little has been established about how mobile AR actually affects feelings of presence beyond being fun or cool, so in this chapter I propose a way to measure presence.

The result of this experiment was that the mobile AR format had no significant effect on confidence in recollecting details. When asked about their feeling of presence, 12 of 20 users felt the AR video left more of an impression than the 2D video, with a majority of those 12 strongly agreeing. When asked specifically about their confidence in noticing specifics from those videos, mobile AR did not have an effect on confidence of recollection.

24 The Displaced. VR experience.
25 Holograms from Syria. Asad Malik.
26 Holosuite: An Exploration into Interactive Holographic Telepresence. Ermal Dreshaj.
27 Maimone, Andrew, et al. "Enhanced personal autostereoscopic telepresence system using commodity depth cameras." Computers & Graphics 36.7 (2012): 791-807.
28 8i: Real Human Holograms for augmented, virtual, and mixed reality.
29 Intel 360 Replay, Intel True View.

Stereo-capture for 2.5D Volumetric Video:

Figure 34: Example of a "popout video" captured with a stereoscopic camera and displayed in mobile AR. On the left, the user selects a point on the ground to deploy the video. In the middle, a popout version of the video appears. On the right, the user can select to see more of the video's background and explore. Video can be seen at: https://www.youtube.com/watch?v=_124hMG9jlo

Video journalism manifests itself in our lives through videos viewed on flat displays (TVs, laptops, and mobile phones). While motion pictures have been a staple of media for the last 140 years, they struggle to portray things like scale. Furthermore, the content of the story is boxed into the frame of the flat viewing device. This creates a barrier, or fourth wall, between the viewer and the content.

AR offers us an opportunity to break through that 4th wall and embed content directly into our world. While this can enable new forms of story-telling, I'm particularly interested in the effect this can have on understanding the real-world. Can we utilize AR to break the physical bubbles we shell ourselves into? Can a foreign correspondent do an interview in another country and have that interview streamed directly into your living room, so that the interviewee is standing a few feet in front of you?

My goal was to prove out mobile AR and low-cost capture technology for this task. I set out to create a workflow accessible to the greatest number of journalists. Most journalists do not have access to new equipment or new workflows, so to make these findings accessible, the workflow and capture equipment must be low-cost and simple to use. For this reason, I focused on inexpensive 3D capture and workflows.

Figure 34 shows an app where a user opens the application on a mobile phone, searches for a floor or table surface (indicated by the white triangular pattern), taps on the surface, and a volumetric cutout video appears. In this case, it is a video of myself dancing. Since the phone is tracked with six degrees of freedom, the user can walk around the video, place it in another location, and even copy and paste it.

In order to generate this experience, it is necessary to extract the depth from the video. I explored both stereo-depth extraction and deep-learning based segmentation for this task. For raw video, I captured videos using two webcams or the SVPRO dual camera ($40). This provided two views of the same scene, each in 720p resolution.

Figure 35: Stereoscopic camera and its 2-view output captured in OBS.

The first task in processing the dual images was registering the two camera views to one another. This calibration step uses 25 frames randomly sampled from the video and performs feature matching.
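A minimal sketch of this registration step is shown below, assuming OpenCV with ORB features and RANSAC homography estimation; the thesis tool's exact detector and parameters are not specified, so those choices and the function names here are illustrative.

import cv2
import numpy as np

def register_right_to_left(left_frames, right_frames, n_samples=25):
    """Estimate a homography that aligns the right view to the left view,
    using feature matches pooled from randomly sampled frame pairs."""
    orb = cv2.ORB_create(nfeatures=2000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    src_pts, dst_pts = [], []

    idx = np.random.choice(len(left_frames), size=min(n_samples, len(left_frames)), replace=False)
    for i in idx:
        kp_l, des_l = orb.detectAndCompute(left_frames[i], None)
        kp_r, des_r = orb.detectAndCompute(right_frames[i], None)
        if des_l is None or des_r is None:
            continue
        for m in matcher.match(des_r, des_l):
            src_pts.append(kp_r[m.queryIdx].pt)   # right-view points
            dst_pts.append(kp_l[m.trainIdx].pt)   # corresponding left-view points

    H, _ = cv2.findHomography(np.float32(src_pts), np.float32(dst_pts), cv2.RANSAC, 3.0)
    return H

# Warping a right frame into the left frame's coordinate system:
# registered_right = cv2.warpPerspective(right_frame, H, (width, height))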


Figure 36: Result of registering the two views to align them. The original left image is on the left. The registered right image is in the middle. The original right image is on the right.

Once the images were registered, depth was extracted by finding the disparity between the two images. The raw depth frames suffer from salt-and-pepper noise (from featureless regions in the image), which is fixed by applying a median filter. In addition to the depth map, a segmentation map was formed by utilizing a DeepLab segmentation implementation in TensorFlow30. This model was trained on the object classes in the PASCAL VOC dataset31.
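The disparity-plus-median-filter step could look roughly like the following, assuming OpenCV's block-matching stereo on the registered pair; the block size and disparity range here are illustrative rather than the values used in the actual pipeline.

import cv2

def depth_from_stereo(left_gray, right_gray, num_disparities=64, block_size=15):
    """Compute a disparity map from a registered grayscale stereo pair and
    suppress salt-and-pepper noise from featureless regions with a median filter."""
    stereo = cv2.StereoBM_create(numDisparities=num_disparities, blockSize=block_size)
    disparity = stereo.compute(left_gray, right_gray)    # 16-bit fixed-point disparity
    disparity = (disparity / 16.0).astype('float32')     # convert to pixel units
    disparity = cv2.medianBlur(disparity, 5)             # remove speckle noise
    return disparity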


Figure 37: Examples of generating a depth image (middle) and person-segmented image (right).

It is worth noting that the depth-based segmentation can run in real time; the deep-segmentation algorithm, however, cannot, and requires up to 2 hours of offline processing for a minute of video. The deep segmentation has the advantage of not requiring any change to the journalist's capture workflow (a monocular camera can be used), but it comes at the cost of accurate depth information and increased processing time.

30 DeepLab ResNet v2: http://liangchiehchen.com/projects/DeepLabv2_resnet.html, DeepLab ResNet: https://github.com/DrSleep/tensorflow-deeplab-resnet
31 PASCAL VOC dataset. http://host.robots.ox.ac.uk/pascal/VOC/


Figure 38: Result of segmentation using depth alone (top right), deep-learning based segmentation (bottom left), and a combination of both (bottom right).

After the frames have been processed, they need to be displayed in 3D in some manner. While each frame can be saved as a mesh, it would require a lot of bandwidth to transmit the meshes and textures. I opted to create a height-shader that takes in two videos, one for the texture and the other for the depth. Once in the shader, the app can use standard video decoding to run quickly. The shader has two variables for modifying the volumetric video: the first changes which depth plane will be shown, and the other changes the extent of "popout" of the video.


Figure 39: Example of loading the depth map into a shader and selecting the cutoff depth. Triangles below this cutoff depth are made to be transparent. Here is an example of slowly increasing the cutoff depth from left to right.

Figure 40: Example of loading the depth map into a shader and selecting the extrusion parameter. The z-component of the triangles is stretched based on the depth map and the extrusion parameter. Here is an example of slowly increasing the extrusion parameter from left to right. The shader has issues, however, with the heavy depth-edges of the image.
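The shader itself runs on the GPU, but its two controls can be prototyped offline. The sketch below uses numpy to mirror the same logic under assumed parameter names (cutoff and extrusion): each depth pixel displaces a vertex along z by the extrusion factor, and vertices whose depth falls below the cutoff are made transparent.

import numpy as np

def popout_vertices(depth_frame, cutoff=0.2, extrusion=0.5):
    """Prototype of the height-shader logic: depth_frame is an (H, W) array
    normalized to [0, 1]; returns displaced vertex positions and an alpha mask."""
    h, w = depth_frame.shape
    xs, ys = np.meshgrid(np.linspace(0, 1, w), np.linspace(0, 1, h))
    zs = depth_frame * extrusion                        # "popout" amount per vertex
    alpha = (depth_frame >= cutoff).astype('float32')   # hide everything behind the cutoff plane
    vertices = np.stack([xs, ys, zs], axis=-1)
    return vertices, alpha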

2.5D Video Interview Experiment Design:

For the experiment, we sought to examine whether the volumetric video format had an effect on retention of the information and people presented. In order to test this, we recruited users to view 2D and 3D videos of a person wearing various masks. The users were told to pay attention to the content of the videos. In each video, a person walks into frame wearing an animal mask, states their favorite food, and then exits the frame. The direction of exit from the frame was randomized by flipping the video horizontally. After watching both videos, users performed a memory task unconnected to the videos to help clear their short-term memory: they were asked to list as many countries as they could for 90 seconds. This duration was decided based on a series of pilot studies, tuned so that a majority of the users would still remember both videos.

After the short-term memory-wiping task, users were asked to report on the favorite food in each video, the direction the person exited (left/right), and their confidence in that answer (1-10). If the augmented reality experience caused more of a feeling of presence or utilized more of a user's spatial-cognition resources, then they would have a better memory of the direction of exit of the person in the augmented reality video.


Figure 41: Example of a volumetric video placed on a table. Users can move their phone around to get a sense of the depth of the person present in the video. The subject of the video wore different animal masks to differentiate the people the user would see. The direction of exit of the subject of the video was randomized.

2.5D Video Interview Evaluation

The results show that augmented reality had no effect on the confidence of user recollection of the direction of exit of the subject of the video. Of the 20 users tested, 11 remembered both videos after the short-term memory task. Of those 11, 4 reported higher confidence in the AR video, 4 reported higher confidence in the 2D video, and 3 reported the same level of confidence for the AR and 2D videos.

[Histogram: "The AR person left more of an impression than the video person"]

Figure 42: Histogram of user responses. 1 meaning strongly disagree, 7 meaning strongly agree.

On the qualitative questions, users agreed more with the statement that the AR person left more of an impression than the 2D video person. After the first day of testing, I realized this question could be too leading, so an additional question was introduced, this time framed in the negative: "It is easier for me to remember the 2D person than the 3D person."

[Histogram: "It's easier for me to remember the 2D video in my mind than the AR person"]

Figure 43: Histogram of user responses. 1 meaning strongly disagree, 7 meaning strongly agree.

The results of this question were bimodal, with roughly half the users disagreeing with the statement. This implies that roughly half the users were better able to remember the 3D person. No users reported an answer in the middle (neither agree nor disagree).

Spatial content and interactivity: AR basketball

Augmented reality content is hard to produce due to its 3D nature. In the case of sports, companies are investing in making smart stadiums with 30+ cameras to create volumetric content32. Extending this interactivity to sports and news events that do not have smart stadiums can lead to a broader availability of volumetric content. Here we explore using 2D basketball video and computer vision to create a pseudo-3D experience that can be consumed with an AR headset or mobile phone. The user can walk around the projection, pick up players, and reposition them. The experience is engaging and enables a user to "remix" and share reality.

Figure 44: Image of the AR basketball experience, captured on a Microsoft HoloLens. The user is able to pause the game, pick up players, and freely place them in 3D.

In order to generate this experience, we started with a video of a European basketball game captured with a fixed camera33. The background was separated from the foreground by capturing the median color of 800 frames. Given that players move on and off the court, this is an effective way of separating out the background.
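A minimal sketch of this background-extraction step, assuming OpenCV for decoding and numpy for the per-pixel median; the frame count comes from the text, everything else is illustrative.

import cv2
import numpy as np

def median_background(video_path, n_frames=800):
    """Estimate the static court background as the per-pixel median
    over the first n_frames of the video."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return np.median(np.stack(frames), axis=0).astype('uint8')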

The pose of the camera can be estimated by identifying 4 or more key points on the basketball court. Luckily, these dimensions are well documented for both USA and European courts34.
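One way to recover the camera pose from such key points is OpenCV's solvePnP, sketched below; the landmark coordinates, pixel locations, and camera intrinsics are placeholders, and the thesis pipeline may have used a different estimation routine.

import cv2
import numpy as np

# Court-plane coordinates (meters) of four known landmarks, e.g. corners of
# the free-throw lane; z = 0 because the court is flat. Values are illustrative.
court_points = np.float32([[0, 0, 0], [4.9, 0, 0], [4.9, 5.8, 0], [0, 5.8, 0]])

# Pixel coordinates of the same landmarks, clicked once in a reference frame.
image_points = np.float32([[412, 530], [880, 518], [955, 700], [360, 715]])

# Approximate intrinsics (focal length in pixels, principal point at image center).
K = np.float32([[1200, 0, 960], [0, 1200, 540], [0, 0, 1]])

ok, rvec, tvec = cv2.solvePnP(court_points, image_points, K, None)
# rvec/tvec give the camera pose relative to the court, so court coordinates can be
# projected into the image, and image points on the floor mapped back onto the court.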

32 Intel 360 Replay.
33 Parisot, Pascaline, and Christophe De Vleeschouwer. "Scene-specific classifier for effective and efficient team sport players detection from a single calibrated camera." Computer Vision and Image Understanding 159 (2017): 74-88.
34 Basketball court dimensions. https://en.wikipedia.org/wiki/Basketball_court

Figure 45: Result of capturing the median of each pixel across 800 frames from the basketball game.

After extracting the background, the foreground of each frame is identified by finding the difference between the frame and the background. Each major grouping of foreground pixels is identified as a player object. The lower-most pixel is identified as the point where the player touches the ground, and the player is projected onto that point on the court.
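A sketch of that per-frame player detection is shown below, assuming OpenCV connected components; the difference threshold and minimum blob area are illustrative values.

import cv2
import numpy as np

def player_ground_points(frame, background, thresh=30, min_area=200):
    """Find foreground blobs by differencing against the median background and
    return, for each player-sized blob, the lowest pixel (the ground-contact point)."""
    diff = cv2.absdiff(frame, background)
    mask = (diff.max(axis=2) > thresh).astype('uint8') * 255
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    points = []
    for i in range(1, n):                            # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] < min_area:
            continue
        ys, xs = np.where(labels == i)
        lowest = ys.argmax()                         # largest row index = lowest pixel
        points.append((xs[lowest], ys[lowest]))      # (x, y) where the player meets the floor
    return points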


Figure 46: Identifying the point of projection for all foreground players on the court.

After identifying where all players will go, 3D meshes are generated for each of the players by finding the convex hull of each player, then producing low-polygonal models. The players are all 2D cutouts, however this does not detract too much from the feeling that the players are popping out in 3D. Writing out a mesh of a player:


Figure 47: Writing out the mesh of a player and placing them in their correct position on a court.
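A sketch of how a player cut-out could be turned into a flat low-polygon mesh and written out, assuming OpenCV for the convex hull and a simple triangle-fan OBJ export; the actual export format and simplification used in the thesis pipeline may differ.

import cv2

def write_player_mesh(mask, path, scale=0.01):
    """Build a flat low-poly mesh from a player's foreground mask and save it as
    an OBJ file: convex hull -> polygon simplification -> triangle fan."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    blob = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(blob)
    poly = cv2.approxPolyDP(hull, epsilon=2.0, closed=True).reshape(-1, 2)

    with open(path, 'w') as f:
        for x, y in poly:                                # vertices on the z = 0 plane
            f.write(f"v {x * scale} {-y * scale} 0\n")   # flip y: image rows grow downward
        for i in range(1, len(poly) - 1):                # triangle fan (OBJ is 1-indexed)
            f.write(f"f 1 {i + 1} {i + 2}\n")
    return len(poly)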

Once all the players are assembled onto the court, it is possible to interact with each of the players in 3D. This was demonstrated on the Microsoft HoloLens (AR headset), Vive (VR headset), and Google Pixel (mobile phone).

Figure 48: Picking up two player cut-outs in the VR version of the experience. The user could fly around, grab players, and reposition them in 3D.

Creating 2.5D content from archival footage: Bringing back Malcolm X

A difficult challenge in placing 2.5D video content in a user's space is generating the content when the subject can't be captured with a 3D camera. Archived footage often focuses on a human subject. Recorded history has latched onto remarkable people as explanations and embodiments of sea changes in culture. Interviews are a method of scaling understanding, in which the interviewee has a direct link to the entire audience. The audience grants attention to the interviewee by watching the interview. With augmented reality, it's possible to pay attention to history and to give space to it.

Presenting historical content on a mobile phone allows the user to see videos and interviews that are spatially or temporally localized. A user can walk up to a poster with a QR code / AR marker and see a life-size version of a historical figure giving a speech. This can commemorate a specific event, or be triggered by an important location.

In order to demonstrate this, we recreated a Malcolm X speech on the anniversary of his assassination. The figure of Malcolm was segmented from a video taken during the 1960s using a deep-segmentation algorithm. The frames were recombined to make a transparent video that appeared when the user pointed their phone at a series of posters placed throughout the Media Lab.


Figure 49: Poster featuring an AR-trackable marker and a QR code linking to the website users can visit to activate the experience. This was placed on walls during the anniversary of Malcolm X's assassination.

The poster was made to be compatible with AR.js, which meant that both iPhone and Android users could activate it from their browsers without needing to download an app. In order to enable a transparent video in AR.js, we generated a green-background video which was keyed to transparent on a video element in A-Frame.

The green-background video was generated by processing each frame of the Malcolm X video using a deep segmentation algorithm. The algorithm is an implementation of DeepLab-ResNet35, which is trained on the PASCAL VOC dataset36. The dataset contains 11,530 images, with each pixel labeled with one of 20 different class labels.
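The compositing step that turns the per-frame person masks into a green-background video could be sketched as follows, assuming a binary DeepLab person mask per frame; the helper mask_for() and the output settings are hypothetical.

import cv2
import numpy as np

CHROMA_GREEN = np.array([0, 255, 0], dtype=np.uint8)  # color keyed out at playback

def composite_green(frame, person_mask):
    """Keep only the segmented person and fill the rest of the frame with chroma green."""
    out = np.where(person_mask[..., None].astype(bool), frame, CHROMA_GREEN)
    return out.astype(np.uint8)

# Example per-frame loop (mask_for() stands in for the DeepLab person mask):
# writer = cv2.VideoWriter("malcolm_green.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 24, (w, h))
# for frame in frames:
#     writer.write(composite_green(frame, mask_for(frame)))
# writer.release()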

Figure 50: Example of segmenting a scene from the 1960's video. Original image on the left. Segmented image on the right.

At the end of the video, we placed a few quotations of Malcolm X alongside photos from his life.

The demo yielded the following key observations. It was much easier and more accessible to use the mobile browser than to download an app. The AR posters had the dual use of decorating the space and drawing attention when the phone was not out.

Having the segmented video stuck to the wall was not immersive; however, with future adoption of web AR, a full experience can be generated that is free from the wall and free from marker tracking.

35 DeepLab ResNet v2: http://liangchiehchen.com/projects/DeepLabv2_resnet.html, DeepLab ResNet: https://github.com/DrSleep/tensorflow-deeplab-resnet
36 PASCAL VOC dataset. http://host.robots.ox.ac.uk/pascal/VOC/

Conclusion

The first iteration of consumer AR is upon us with mobile AR. It is appealing to the industry and content creators alike because it already has a large install base. AR is compelling, engaging, and fun, but where does it fit into the news-media ecosystem?

AR can enable a connection between news-content and space. This thesis presents three interactive experiences which demonstrate the applicability of spatial tools to the production and consumption of news.

Chapter 3 shows that spatial organization of content can increase user exposure to stories they do not normally see through a news-feed. Chapter 2 shows that spatial tools can facilitate content creation without specialized broadcasting equipment. Furthermore, Chapter 4 shows that 2.5D video placed in a user's space can increase feelings of presence for many users.

Despite the potential, there are key limitations observed in this thesis for mobile AR. For example, the user study in Chapter 4 demonstrates that the feeling of presence one would expect from an AR experience is not observed in half of users, and that mobile AR is not necessarily better at communicating spatial information than a 2D video.

The lack of presence in a mobile experience is affected by the small size of the screen. A user views the experience through a hand-held window into the virtual world, which frames the mixed-reality experience. This limitation is not often apparent in video recordings of the experiences, since the screen capture abstracts away the border of the phone.

Since full immersion cannot be achieved through the mobile AR format, it is difficult to justify the consumption of linear content such as video. The benefits of spatial technologies are not realized by simply porting content into a new format, and poor interface implementation can result in an experience which is worse than removing AR entirely.

Non-linear content, such as an interactive basketball game, is facilitated by the technology, but more review of interactive-television research37 is necessary to establish an argument for why AR interactive television would be a fundamentally different experience.

The content creation side, however, is a different story, and the spatial tools will likely have an impact in increasing accessibility to 3D modeling, animation, and live-production studios.

The conclusions drawn from this thesis beget the following questions, which can be answered with modified versions of the implemented experiments.

* Do the mixed results of mobile AR presence translate to AR headsets such as the HoloLens?
* Are there social/security/trust applications that can be unlocked with spatial technologies or with anchoring content spatially?
* What other spatial content is easier to produce when you turn a mobile phone into a 6-degree-of-freedom mouse?

37 Lee, Barbara, and Robert S. Lee. "How and why people watch TV: Implications for the future of interactive television." Journal of Advertising Research 35.6 (1995): 9-19.

Through experimentation, some key learnings can be drawn for the design of future AR experiences. The first is that presence is tricky to measure and also tricky to communicate. For example, during many of the trials, users were not able to remember which experience was in AR and which was in 2D, even though they reported a perceptual difference between the two experiences. Using memory as a measure of presence is indirect and can be influenced by many external factors. It would be interesting to pursue other methods to measure presence directly, such as galvanic skin response38, perception of the passage of time39, and long-term observation of user preference (AR vs. 2D).

Another recommendation for AR researchers is to refocus attention on on-surface interactions. When a user tries to hold their hand steady in the air, their positional variability is much greater than when their hand is resting on a surface, such as a mouse, mouse pad, or touch screen. In order to cope with this, interfaces must have larger buttons and interactions, which can become tiring quickly. That being said, certain interactions, such as rotations, are difficult to learn in 2D. Thus a mix of 2D and 3D interactions may create an interface which captures the best of both worlds.

Lastly, I would like to offer a word of caution to the industry: evaluate objectively what happened to the recent wave of VR. There is very little compelling content, even years after impressive technology was introduced. Mobile AR and headset AR may go the same way unless content is created and advertised around the limitations of the devices. Without this temperance, the industry may be saddled with expectations that cannot be fulfilled.

38 Shi, Yu, et al. "Galvanic skin response (GSR) as an index of cognitive load." CHI'07 extended abstracts on Human Factors in Computing Systems. ACM, 2007.
39 Brown, Scott W. "Time perception and attention: The effects of prospective versus retrospective paradigms and task demands on perceived duration." Perception & Psychophysics 38.2 (1985): 115-124.

Appendix:

Part A: GlobAR application

Figure 51: An example of a few news stories grouped together on instantiation using the GlobAR application. One of the collision surfaces (green) is highlighted. The stories are like electrons that push off from each other. The stories are also attracted to a center point. The balance of these two forces keeps them grouped.
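A minimal sketch of the grouping behavior described in this caption (pairwise repulsion between stories plus attraction to a shared center point); the constants and the function name are illustrative, not the values used in the GlobAR implementation.

import numpy as np

def settle_stories(positions, center, repulsion=0.05, attraction=0.02, steps=200):
    """Iteratively push story positions apart while pulling them toward a shared
    center, so they settle into a loose cluster without overlapping."""
    pos = np.array(positions, dtype=float)
    center = np.asarray(center, dtype=float)
    for _ in range(steps):
        deltas = pos[:, None, :] - pos[None, :, :]          # pairwise offsets
        dist = np.linalg.norm(deltas, axis=-1) + 1e-6
        np.fill_diagonal(dist, np.inf)                      # ignore self-interaction
        repel = (deltas / dist[..., None] ** 2).sum(axis=1) * repulsion
        attract = (center - pos) * attraction
        pos += repel + attract
    return pos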


Figure 52: Full paths of all the users of the GlobAR system. Each user has a different color.

[Chart: "Which interface is easier for news discovery?"]

Figure 53: Graph of user responses showing which application is easier for news discovery.

Figure 54: General user responses from the study (Google News app in blue, GlobAR in red). Histograms of responses on a 1-7 scale are shown for the following survey items:

[Histogram: "It is Simple to Use"]
[Histogram: "It is intuitive"]
[Histogram: "I can use it without written instructions"]
[Histogram: "I can recover from mistakes easily"]
[Histogram: "I felt informed about the news"]
[Histogram: "There were stories I wouldn't usually see"]
[Histogram: "I have a good sense about what is going on in the world today"]
[Histogram: "There were stories I've never heard of before"]

Part B: Volumetric Video User Study

Histograms of user responses on a 1-7 scale (1 meaning strongly disagree, 7 meaning strongly agree) for the following survey items:

[Histogram: "I felt present in the AR person's space."]
[Histogram: "I felt like the person in the AR app was present in my space."]
[Histogram: "The AR person left more of an impression than the video person"]
[Histogram: "It's easier for me to remember the 2D Video in my mind than the AR person"]
