COVID-19 Image Data Collection
Total Page:16
File Type:pdf, Size:1020Kb
COVID-19 Image Data Collection Joseph Paul Cohen 1 2 Paul Morrison 3 Lan Dao 4 Abstract This paper describes the initial COVID-19 open image data collection. It was created by assem- bling medical images from websites and publi- cations and currently contains 123 frontal view X-rays. 1. Motivation In the context of a COVID-19 pandemic, is it crucial to (a) Day 10 (b) Day 13 streamline diagnosis. Data is the first step to developing any diagnostic tool or treatment. While there exist large public datasets of more typical chest X-rays (Wang et al., 2017; Bustos et al., 2019; Irvin et al., 2019; Johnson et al., 2019; Demner-Fushman et al., 2016), there is no collection of COVID-19 chest X-rays or CT scans designed to be used for computational analysis. In this paper, we describe the public database of pneumonia cases with chest X-ray or CT images, specifically COVID- 19 cases as well as MERS, SARS, and ARDS. Data will be (c) Day 17 (d) Day 25 collected from public sources in order not to infringe patient confidentiality. Example images shown in Figure1. Figure 1. Example images from the same patient (#19) extracted Our team believes that this database can dramatically im- from Cheng et al.(2020). This 55 year old female survived a prove identification of COVID-19. Notably, this would COVID-19 infection. provide essential data to train and test a Deep Learning- based system, likely using some form of transfer learning. These tools could be developed to identify COVID-19 char- pneumonia. Similarly to the outcome of the Chest Xray14 acteristics as compared to other types of pneumonia or in (Wang et al., 2017) dataset which enabled significant ad- order to predict survival. vances in medical imaging, tools can be developed to pre- dict not only the type of pneumonia, but also its outcome. Currently, all images and data are released under the Eventually, our model could take inspiration from work by arXiv:2003.11597v1 [eess.IV] 25 Mar 2020 following URL: https://github.com/ieee8023/ Rajpurkar et al.(2017), which could predict pneumonia, as covid-chestxray-dataset. As stated above, im- well as Cohen et al.(2019), which deployed such models. ages collected have already been made public. Tools could be built to triage cases in the absence of phys- 2. Expected outcome ical tests, particularly in the context of polymerase chain reaction (PCR) tests shortage (Satyanarayana, 2020; Kelly This dataset can be used to study the progress of COVID-19 Geraldine Malone, 2020). These tools could predict patient and how its radiological findings vary from other types of outcomes such as survival, allowing a physician to plan ahead for specific patients and facilitate management. In 1 2 Mila, Quebec Artificial Intelligence Institute University extreme situations, where physicians could be faced with of Montreal 3Department of Mathematics and Computer Sci- ence, Fontbonne University 4Faculty of Medicine, Univer- the extraordinary decision to choose which patient should sity of Montreal. Correspondence to: Joseph Paul Cohen be allocated healthcare resources (Yascha Mounk, 2020), <[email protected]>. such a tool could potentially serve as a measuring device. COVID-19 Image Data Collection Figure 2. Demographics for each frontal X-ray image Type of pneumonia PA AP AP Supine Attribute Description SARSr-CoV-2 or COVID-19 76 11 13 Patient ID Internal identifier SARSr-CoV-1 or SARS 11 0 0 Offset Number of days since the start of Streptococcus spp. 6 0 0 symptoms or hospitalization for each Pneumocystis spp. 1 0 0 image. If a report indicates ”after a ARDS 4 0 0 few days”, then 5 days is assumed. Sex Male (M), Female (F), or blank Table 1. Counts of each finding and view. PA = posteroanterior, Age Age of the patient in years AP = anteroposterior, AP Supine = laying down, SARSr-CoV-2 = Finding Type of pneumonia Severe acute respiratory syndrome-related coronavirus 2, SARS- Survival Yes (Y) or no (N) CoV-1 or SARS = Severe acute respiratory syndrome-related coro- View Posteroanterior (PA), Anteroposterior navirus 1, ARDS = acute respiratory distress syndrome (AP), AP Supine (APS), or Lateral (L) for X-rays; Axial or Coronal for CT scans Furthermore, these tools could monitor the progression of Modality CT, X-ray, or something else COVID-19 positive patients in order to better track the evo- Date Date on which the image was acquired lution of their condition. Ultimately, this dataset and its Location Hospital name, city, state, country analysis could help us better understand the dynamics of the Filename Name with extension disease and better prepare treatments. doi Digital object identifier (DOI) of the 3. Dataset research article url URL of the paper or website where the The current statistics as of March 25th 2020 are shown in image came from Table1. For each image, attributes shown in Table2 are License License of the image such as CC BY- collected. Data is largely compiled from websites such NC-SA. Blank if unknown as Radiopaedia.org, the Italian Society of Medical and In- Clinical notes Clinical notes about the image and/or 1 2 terventional Radiology , and Figure1.com . Images are the patient extracted from online publications, website, or directly from Other notes e.g. credit the PDF using the tool pdfimages3. The goal during this process is to maintain the quality of the images. Table 2. Descriptions of each attribute of the metadata Data was collected from the following papers: (Phan et al., 2020; Liu et al., 2020; Chen et al., 2020; Paul et al., 2004; Silverstein et al., 2020; Shi et al., 2020; Holshue et al., 2020; References Ng et al., 2020; Kong & Agarwal, 2020; Lim et al., 2020; Bustos, Aurelia, Pertusa, Antonio, Salinas, Jose-Maria, and Zu et al., 2020; Cheng et al., 2020; jin Zhang et al., 2020; de la Iglesia-Vaya,´ Maria. PadChest: A large chest x-ray Lee et al., 2020; Wu et al., 2020; Yoon et al., 2020; Hsih image dataset with multi-label annotated reports. arXiv et al., 2020; Cuong et al., 2020; Thevarajan et al., 2020; Wei preprint, 1 2019. et al., 2020) 1https://www.sirm.org/category/senza-categoria/covid-19/ Chen, Nanshan, Zhou, Min, Dong, Xuan, Qu, Jieming, 2https://www.figure1.com/covid-19-clinical-cases Gong, Fengyun, Han, Yang, Qiu, Yang, Wang, Jingli, Liu, 3https://poppler.freedesktop.org/ Ying, Wei, Yuan, Xia, Jia'an, Yu, Ting, Zhang, Xinxin, COVID-19 Image Data Collection and Zhang, Li. Epidemiological and clinical characteris- Curtis P., Patel, Bhavik N., Lungren, Matthew P., and tics of 99 cases of 2019 novel coronavirus pneumonia in Ng, Andrew Y. CheXpert: A Large Chest Radiograph wuhan, china: a descriptive study. The Lancet, February Dataset with Uncertainty Labels and Expert Comparison. 2020. doi: 10.1016/s0140-6736(20)30211-7. In AAAI Conference on Artificial Intelligence, 1 2019. Cheng, Shao-Chung, Chang, Yuan-Chia, Chiang, Yu- jin Zhang, Jin, Dong, Xiang, yuan Cao, Yi, dong Yuan, Ya, Long Fan, Chien, Yu-Chan, Cheng, Mingte, Yang, Chin- bin Yang, Yi, qin Yan, You, Akdis, Cezmi A., and dong Hua, Huang, Chia-Husn, and Hsu, Yuan-Nian. First case Gao, Ya. Clinical characteristics of 140 patients infected of coronavirus disease 2019 (COVID-19) pneumonia in with SARS-CoV-2 in wuhan, china. Allergy, February taiwan. Journal of the Formosan Medical Association, 2020. doi: 10.1111/all.14238. March 2020. doi: 10.1016/j.jfma.2020.02.007. Cohen, Joseph Paul, Bertin, Paul, and Frappier, Vincent. Johnson, Alistair E. W., Pollard, Tom J., Berkowitz, Seth J., Chester: A Web Delivered Locally Computed Chest X- Greenbaum, Nathaniel R., Lungren, Matthew P., Deng, Ray Disease Prediction System. arXiv:1901.11210, 2019. Chih-ying, Mark, Roger G., and Horng, Steven. MIMIC- CXR: A large publicly available database of labeled chest Cuong, Le Van, Giang, Hoang Thi Nam, Linh, Le Khac, radiographs. Nature Scientific Data, 1 2019. doi: 10. Shah, Jaffer, Sy, Le Van, Hung, Trinh Huu, Reda, Ab- 1038/s41597-019-0322-0. dullah, Truong, Luong Ngoc, Tien, Do Xuan, and Huy, Nguyen Tien. The first vietnamese case of COVID-19 Kelly Geraldine Malone. Testing backlog linked to shortage acquired from china. The Lancet Infectious Diseases, of chemicals needed for COVID-19 test — CTV News, 3 February 2020. doi: 10.1016/s1473-3099(20)30111-0. 2020. Demner-Fushman, Dina, Kohli, Marc D., Rosenman, Kong, Weifang and Agarwal, Prachi P. Chest imaging Marc B., Shooshan, Sonya E., Rodriguez, Laritza, Antani, appearance of COVID-19 infection. Radiology: Car- Sameer, Thoma, George R., and McDonald, Clement J. diothoracic Imaging, January 2020. doi: 10.1148/ryct. Preparing a collection of radiology examinations for dis- 2020200028. tribution and retrieval. Journal of the American Medical Informatics Association, 3 2016. doi: 10.1093/jamia/ Lee, Nan-Yao, Li, Chia-Wen, Tsai, Huey-Pin, Chen, Po-Lin, ocv080. Syue, Ling-Shan, Li, Ming-Chi, Tsai, Chin-Shiang, Lo, Ching-Lung, Hsueh, Po-Ren, and Ko, Wen-Chien. A Holshue, Michelle L., DeBolt, Chas, Lindquist, Scott, Lofy, case of COVID-19 and pneumonia returning from macau Kathy H., Wiesman, John, Bruce, Hollianne, Spitters, in taiwan: Clinical course and anti-SARS-CoV-2 IgG Christopher, Ericson, Keith, Wilkerson, Sara, Tural, Ah- dynamic. Journal of Microbiology, Immunology and met, Diaz, George, Cohn, Amanda, Fox, LeAnne, Patel, Infection, March 2020. doi: 10.1016/j.jmii.2020.03.003. Anita, Gerber, Susan I., Kim, Lindsay, Tong, Suxiang, Lu, Xiaoyan, Lindstrom, Steve, Pallansch, Mark A., Weldon, Lim, Jaegyun, Jeon, Seunghyun, Shin, Hyun-Young, Kim, William C., Biggs, Holly M., Uyeki, Timothy M., and Moon Jung, Seong, Yu Min, Lee, Wang Jun, Choe, Pillai, Satish K. First case of 2019 novel coronavirus Kang-Won, Kang, Yu Min, Lee, Baeckseung, and Park, New England Journal of Medicine in the united states.