A Scalable Data-Driven Journalism Platform for Hyper-Local News
Total Page:16
File Type:pdf, Size:1020Kb
Crosstown Foundry: A Scalable Data-driven Journalism Platform for Hyper-local News Luciano Nocera George Constantinou Luan V. Tran University of Southern California University of Southern California University of Southern California Los Angeles, CA, USA Los Angeles, CA, USA Los Angeles, CA, USA [email protected] [email protected] [email protected] Seon Ho Kim Gabriel Kahn Cyrus Shahabi University of Southern California University of Southern California University of Southern California Los Angeles, CA, USA Los Angeles, CA, USA Los Angeles, CA, USA [email protected] [email protected] [email protected] ABSTRACT Scaling of this emerging journalistic model is challenging with the Generating hyper-local news at scale is challenging because shift from printed material to the online news format and tighter publicly available data is not provided at the desired spatial integration of news into social media platforms. It is even more and temporal granularity. Besides, there is a lack of automated challenging for local news outlets who wish to create hyper-local analytical and publishing tools. Crosstown Foundry, which is being news [7] to stay competitive by offering readers news that engages actively developed and used by engineers and journalists, is a novel their local communities, because they operate on limited budgets data-driven system that leverages a massive multi-modal dataset to and resources. Local outlets face a particular obstacle with scale. generate personalized newsletters for Los Angeles County readers. To cover new communities, they must hire more reporters and editors. For local outlets that reach a small audience, that is CCS CONCEPTS cost-prohibitive. Consequently, for local news sources, the smarter business decision has been to shrink rather than to expand. That • Information systems ! Data management systems; creates widening news deserts [6] – areas that receive little or no Document representation; • Social and professional topics news coverage. It leads to reduced government accountability and ! Professional topics; • Human-centered computing ! citizen engagement. Visualization. Several challenges prevent a journalist from harnessing data to KEYWORDS create timely news stories of interest to local communities. Cities, states, and countries have created sizable public data repositories data mining; visualization; data-driven journalism; hyper-local that are relevant. However, these data resources do not usually offer news insights that journalists can readily utilize. Besides, they are often ACM Reference Format: not standardized at a sufficient spatial resolution to use in a regional Luciano Nocera, George Constantinou, Luan V. Tran, Seon Ho Kim, Gabriel analysis (e.g., to compare trends across neighborhoods). Tools such Kahn, and Cyrus Shahabi. 2021. Crosstown Foundry: A Scalable Data-driven as Many Eyes and, most recently, Data Commons help "democratize" Journalism Platform for Hyper-local News. In Proceedings of the 2021 regional data sources with statistics and data visualizations and International Conference on Management of Data (SIGMOD ’21), June 20–25, provide broad user-level accessibility while fostering collaborative 2021, Virtual Event, China. ACM, New York, NY, USA, 5 pages. https: usage. However, they are not explicitly targeted for journalistic //doi.org/10.1145/3448016.3452751 applications, do not support online publishing at scale, and their databases are from public sources which are not adapted to 1 INTRODUCTION hyper-local data. Today’s data-driven journalism [8, 11] relies on Like any other modern industry, journalism is impacted by the data science and business intelligence tools such as Tableau [4] that ever-increasing amount of data and information available. News require a skill-set only available to few major outlets. publishers respond to this trend by developing in-house expertise To address these shortcomings we have developed the Crosstown to provide editors with the needed data-science skills, including Foundry, a platform that allows news editors to generate data analytics and data visualization, to create data-driven stories. scalable hyper-local news products quickly. Crosstown Foundry was developed in partnership by engineers and journalists Permission to make digital or hard copies of all or part of this work for personal or specifically for Crosstown, a nonprofit news organization based classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the University of Southern California’s Annenberg School on the first page. Copyrights for components of this work owned by others than ACM for Communication and Journalism. Crosstown Foundry gives must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, editors the means to outsource and craft stories leveraging curated to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. local data sources. It automates news stories replication and SIGMOD ’21, June 20–25, 2021, Virtual Event, China specialization over large areas to provide local audiences with news © 2021 Association for Computing Machinery. products targeted to local neighborhoods. This work builds on our ACM ISBN 978-1-4503-8343-1/21/06...$15.00 https://doi.org/10.1145/3448016.3452751 previous work to help Crosstown editors assemble datasets and from official sources contain spatio-temporal information such as timestamps and neighborhood identity collected from diverse sources and standardized to enable hyper-local news. To stay up to date, we have automated the crawling process to collect data periodically, at the same rate as each data source. 2.1.1 Crime dataset. We collect the crime data directly from the law enforcement agencies or through open data portals such as Socrata [20]. In Los Angeles, there are 46 agencies, but only the largest make their data easily accessible to the public, namely the Los Angeles Police Department and the Los Angeles County Sheriff’s Department, and a few smaller agencies such as Pasadena and Santa Monica police departments. Though there are some national standards for agencies Figure 1: Crosstown Foundry and Products overview. reporting their data, called Uniform Crime Report, there are often inconsistencies from agency to agency. Some departments use different nomenclature for the same crime (e.g., larceny instead source data-driven news. Data-driven news stories can be sourced of petty theft), while others report the crime’s address instead from other news stories by contrasting with the local data or around of the latitude and longitude. This can impact the number of predictions, e.g., with our previous work to predict pollution [9], crimes that are reported within a neighborhood. To handle the traffic [10, 16] and crime [13]. Another way to outsource news crime type inconsistency, Crosstown uses a bundling tool which is to build stories around data outliers, which is the topic of our is a customized, reader-friendly hierarchy of crimes. Each crime related work with the Crosstown Detective that implements an type reported from the various agencies, is mapped to a unique outlier detector on the Crosstown datasets to report data points that nickname. After that, each nickname is mapped to a small bundle satisfy a discordancy test assuming a hypothesized distribution [2]. name which contains several similar nicknames (e.g., burglaries, Crosstown Foundry allows local news to do something never before attempted burglaries and residential burglaries are all mapped possible: create numerous local reports at scale. Previously, when to burglaries). This provides the flexibility to integrate new law news outlets wanted to cover more neighborhoods, they had to enforcement agencies with minimal effort and a unified framework hire more people. With Foundry, numerous local reports can be to generate crime reports. We have released publicly this dataset generated automatically without added costs. with our related work [14] where we have predicted crime trends. 2 SYSTEM 2.1.2 Traffic dataset. The traffic dataset consists of traffic data, parking citations and traffic collisions. Crosstown uses the Archived Figure 1 presents an overview of the Crosstown ecosystem, Data Management System (ADMS) [1] which collects, stores, including the Foundry and Products Dashboards. We follow a manages, and analyzes a variety of data sources, including massive three-tier architecture with the presentation, access and storage. traffic sensor data. ADMS collects data from 14,000 loop detectors The front-end layer includes the Foundry Dashboard used to deployed in arterial and highways covering 5,400 miles of roadway create newsletters and other hyper-local news content, and the in Los Angeles county and records traffic signals in real-time. Products Dashboards that provide data exploration to readers Data collected from these sensors, enable to calculate the volume of the Crosstown website and newsletter subscribers. The APIs of cars and traffic speed at various spatial and temporal resolutions. layer provides access to data and services for these dashboards, For a particular neighborhood, arterial sensor data deployed within specifically for the Foundry to publish and distribute content and the boundaries of the neighborhood are aggregated in hourly to the Products to access and explore the Crosstown datasets. The buckets with their