Federated learning for predicting outcomes in SARS-CoV-2 patients

Ittai Dayan1,*, Holger Roth2,*, Aoxiao Zhong3,*, Ahmed Harouni2, Amilcare Gentili4, Anas Abidin2, Andrew Liu2, Anthony Beardsworth Costa5, Bradford J. Wood6, Chien-Sung Tsai7, Chih-Hung Wang8, Chun-Nan Hsu9, CK Lee2, Peiying Ruan2, Daguang Xu2, Dufan Wu3, Eddie Huang2, Felipe Campos Kitamura10, Griffin Lacey2, Gustavo César de Antônio Corradi10, Gustavo Nino11, Hao-Hsin Shin12, Hirofumi Obinata13, Hui Ren3, Jason C. Crane14, Jesse Tetreault2, Jiahui Guan2, John W. Garrett15, Josh D Kaggie16, Jung Gil Park17, Keith Dreyer1,18, Krishna Juluru12, Kristopher Kersten2, Marcio Aloisio Bezerra Cavalcanti Rockenbach18, Marius George Linguraru19, Masoom A. Haider20, Meena AbdelMaseeh21, Nicola Rieke2, Pablo F. Damasceno14, Pedro Mario Cruz e Silva2, Pochuan Wang22, Sheng Xu23, Shuichi Kawano13, Sira Sriswasdi24, Soo Young Park25, Thomas M. Grist26, Varun Buch18, Watsamon Jantarabenjakul27, Weichung Wang28, Won Young Tak25, Xiang Li3, Xihong Lin29, Young Joon Kwon5, Abood Quraini2, Andrew Feng2, Andrew N Priest30, Baris Turkbey31, Benjamin Glicksberg32, Bernardo Bizzo18, Byung Seok Kim33, Carlos Tor Diez19, Chia-Cheng Lee34, Chia-Jung Hsu34, Chin Lin35, Chiu-Ling Lai28, Christopher P. Hess14, Colin Compas2, Deepeksha Bhatia2, Eric K Oermann36, Evan Leibovitz18, Hisashi Sasaki13, Hitoshi Mori13, Isaac Yang2, Jae Ho Sohn14, Krishna Nand Keshava Murthy12, Li-Chen Fu37, Matheus Ribeiro Furtado de Mendonça10, Mike Fralick38, Min Kyu Kang17, Mohammad Adil2, Natalie Gangai12, Peerapon Vateekul39, Pierre Elnajjar12, Sarah Hickman16, Sharmila Majumdar14, Shelley L. McLeod40, Sheridan Reed23, Stefan Graf41, Stephanie Harmon42, Tatsuya Kodama13, Thanyawee Puthanakit27, Tony Mazzulli43, Vitor de Lima Lavor10, Yothin Rakvongthai44, Yu Rim Lee25, Yuhong Wen2, Fiona J Gilbert6,*, Mona G. Flores2,*, & Quanzheng Li3,*

1MGH Radiology and Harvard Medical School, Boston, MA, USA. 2NVIDIA, Santa Clara, CA, USA. 3Center for Advanced Medical Computing and Analysis, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA. 4San Diego VA Health Care System, San Diego, CA, USA. 5Department of Neurosurgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA. 6Radiology & Imaging Sciences / Clinical Center, National Institutes of Health, Bethesda, MD, USA. 7Division of Cardiovascular Surgery, Department of Surgery, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan, R.O.C. 8Department of Otolaryngology-Head and Neck Surgery, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan, R.O.C. and Graduate Institute of Medical Sciences, National Defense Medical Center, Taipei, Taiwan, R.O.C. 9Center for Research in Biological Systems, University of California, San Diego, CA, USA. 10Diagnósticos da América SA (DASA), Brazil. 11Division of Pediatric Pulmonary and Sleep Medicine, Children's National Hospital, Washington, DC, USA. 12Memorial Sloan Kettering Cancer Center, New York, NY, USA. 13Self-Defense Forces Central Hospital, Tokyo, Japan. 14Center for Intelligent Imaging, Department of Radiology and Biomedical Imaging, University of California, San Francisco, California, USA. 15Departments of Radiology and Medical Physics, The University of Wisconsin-Madison School of Medicine and Public Health, Madison, WI, USA. 16Department of Radiology, NIHR Cambridge Biomedical Research Centre, University of Cambridge, Cambridge, UK.
17Department of Internal Medicine, Yeungnam University College of Medicine, Daegu, South Korea. 18Center for Clinical Data Science, Massachusetts General Brigham, Boston, MA, USA. 19Sheikh Zayed Institute for Pediatric Surgical Innovation, Children's National Hospital, Washington, DC, USA. 20Joint Dept. of Medical Imaging, Sinai Health System, University of Toronto, Toronto, Canada and Lunenfeld-Tanenbaum Research Institute, Toronto, Canada. 21Lunenfeld-Tanenbaum Research Institute, Toronto, Canada. 22MeDA Lab and Institute of Applied Mathematical Sciences, National Taiwan University, Taipei, Taiwan. 23Center for Interventional Oncology, National Institutes of Health, Bethesda, MD, USA. 24Research Affairs, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand and Center for Artificial Intelligence in Medicine, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand. 25Department of Internal Medicine, School of Medicine, Kyungpook National University, Daegu, South Korea. 26Departments of Radiology, Medical Physics, and Biomedical Engineering, The University of Wisconsin-Madison School of Medicine and Public Health, Madison, WI, USA. 27Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand and Thai Red Cross Emerging Infectious Diseases Clinical Center, King Chulalongkorn Memorial Hospital, Bangkok, Thailand. 28Medical Review and Pharmaceutical Benefits Division, National Health Insurance Administration, Taipei, Taiwan. 29Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA, USA. 30Department of Radiology, NIHR Cambridge Biomedical Research Centre, Cambridge University Hospital, Cambridge, UK. 31Department of Radiology and Imaging Sciences, National Institutes of Health, Bethesda, MD, USA and National Cancer Institute, National Institutes of Health, Bethesda, MD, USA. 32Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA. 33Department of Internal Medicine, Catholic University of Daegu School of Medicine, Daegu, South Korea. 34Planning and Management Office, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan, R.O.C. 35School of Medicine, National Defense Medical Center, Taipei, Taiwan, R.O.C. and School of Public Health, National Defense Medical Center, Taipei, Taiwan, R.O.C. and Graduate Institute of Life Sciences, National Defense Medical Center, Taipei, Taiwan, R.O.C. 36Department of Neurosurgery, NYU Grossman School of Medicine, New York, NY, USA. 37MOST/NTU All Vista Healthcare Center, Center for Artificial Intelligence and Advanced Robotics, National Taiwan University, Taipei, Taiwan. 38Division of General Internal Medicine and Geriatrics (Fralick), Sinai Health System, Toronto, Canada. 39Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand. 40Schwartz/Reisman Emergency Medicine Institute, Sinai Health, Toronto, ON, Canada and Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada. 41Department of Medicine, NIHR Cambridge Biomedical Research Centre, University of Cambridge, Cambridge, UK. 42National Cancer Institute, National Institutes of Health, Bethesda, MD, USA and Clinical Research Directorate, Frederick National Laboratory for Cancer Research, National Cancer Institute, Frederick, MD, USA.
43Department of Microbiology, Sinai Health/University Health Network, Toronto, Canada and Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada and Public Health Ontario Laboratories, Toronto, Canada. 44Chulalongkorn University Biomedical Imaging Group and Division of Nuclear Medicine, Department of Radiology, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand.

*These authors contributed equally: Ittai Dayan, Holger Roth, Aoxiao Zhong, Fiona J Gilbert, Quanzheng Li, Mona G. Flores. e-mail: mflores@nvidia.com

Abstract

'Federated Learning' (FL) is a method to train Artificial Intelligence (AI) models with data from multiple sources while maintaining anonymity of the data, thus removing many barriers to data sharing. During the SARS-CoV-2 pandemic, 20 institutes collaborated on a healthcare FL study to train a global AI model, "EXAM" (EMR CXR AI Model), that predicts future oxygen requirements of symptomatic patients using inputs of vital signs, laboratory data, and chest X-rays. EXAM achieved an average Area Under the Curve (AUC) of over 0.92, an average improvement of 16%, and a 38% increase in generalisability over local models. FL was successfully applied to facilitate a rapid data science collaboration without data exchange, resulting in a model that generalised across heterogeneous, unharmonized datasets. This provided the broader healthcare community with a validated model to respond to COVID-19 challenges, and set the stage for broader use of FL in healthcare.

Main

The scientific, academic, medical and data science communities have come together in the face of the pandemic crisis to rapidly assess novel paradigms in Artificial Intelligence (AI) that are secure, and that potentially incentivize data sharing and model training and testing without the usual privacy and data-ownership hurdles of conventional collaborations1,2. Healthcare providers, researchers and industry have pivoted their focus to address unmet and critical clinical needs created by the crisis, with remarkable results3–9. Clinical trial recruitment has been expedited and facilitated by national regulatory bodies and an international cooperative spirit10–12. The data analytics and AI disciplines have always fostered open and collaborative approaches, embracing concepts such as open-source software, reproducible research, data repositories, and making anonymized datasets publicly available13,14. The pandemic has emphasized the need to expeditiously conduct data collaborations that empower the clinical and scientific communities when responding to rapidly evolving and widespread global challenges. Data sharing has ethical, regulatory and legal complexities that are underscored, and perhaps somewhat complicated, by the recent entrance of large tech companies into the healthcare data world15–17.

A concrete example of these types of collaborations is our previous work on an AI-based SARS-CoV-2 Clinical Decision Support (CDS) model. The CDS model was developed at Mass General Brigham (MGB) and was validated across multiple health systems' data. The inputs to the CDS model were Chest X-Ray (CXR) images, vital signs, demographic data, and lab values that were shown in previous publications to be predictive of COVID-19 patient outcomes18–21.
CXR was selected as the imaging input because it is widely available and commonly indicated by guidelines, such as those provided by the ACR22, the Fleischner Society23, the WHO24, national thoracic societies25, national health ministry COVID-19 handbooks and radiology societies throughout the world26. The output of the CDS model was a score, termed 'CORISK'27, that corresponded to oxygen support requirements and that could aid front-line clinicians in triaging patients28–30.

Healthcare providers have been known to prefer models that were validated on their own data27. To date, most AI models, including the aforementioned CDS model, have been trained and validated on 'narrow' data that often lacks diversity31,32, potentially resulting in over-fitting and lower generalisability. This can be mitigated by training with diverse data from multiple sites without centralising the data33, using methods such as Transfer Learning34,35 or FL. FL is a method to train AI models on disparate data sources, without data being transported or exposed outside its original location. While applicable to many industries, FL has recently been proposed for cross-institutional healthcare research36. FL supports the rapid launch of centrally orchestrated experiments with improved traceability of data and assessment of algorithmic changes and impact37. One approach to FL, called 'client-server', sends an 'un-trained' model to other servers ('nodes') that conduct partial training tasks, in turn sending the results back to be merged in the central ('federated') server. This is conducted as an iterative process until training is complete36. Governance of data for FL is maintained locally, alleviating privacy concerns, with only model weights or gradients communicated between the client-sites and the federated server38,39. FL has already shown promise in recent medical imaging applications40–43, including in COVID-19 analysis8,44,45. A notable example is a mortality prediction model in patients infected with SARS-CoV-2 that uses clinical features, albeit limited in terms of number of modalities and scale45.

Our objective was to develop a robust, generalizable model that could assist in triaging patients. We theorized that the CDS model could be federated successfully, given its use of data inputs that are relatively common in clinical practice and that do not heavily rely on operator-dependent assessments of patient condition (such as clinical impressions or reported symptoms). Rather, lab results, vital signs, an imaging study and a commonly captured demographic (i.e., age) were used. We therefore retrained the CDS model with diverse data using a client-server FL approach to develop a new global FL model, named 'EXAM' (EMR CXR AI Model), using CXR and EMR features as input. By leveraging FL, the participating institutions would not have to transfer data to a central repository, but would instead leverage a distributed data framework. Our hypothesis was that EXAM would perform better than local models and would generalise better across healthcare systems.

Results

The EXAM Model Architecture

EXAM is based on the CDS model mentioned above27. In total, 20 features (19 from the EMR and a CXR) were used as input to the model. The outcome (i.e., "ground truth") labels were assigned based on patient oxygen therapy after 24-hour and 72-hour periods from initial admission to the ED. A detailed list of the requested features and outcomes can be seen in Table 1. The outcome labels of patients were set to 0, 0.25, 0.5, and 0.75 depending on the most intensive oxygen therapy the patient received in the prediction window. The oxygen therapy categories were, respectively, room air (RA), low-flow oxygen (LFO), high-flow oxygen (HFO)/non-invasive ventilation (NIV), or mechanical ventilation (MV). If the patient died within the prediction window, the outcome label was set to 1. This resulted in each case being assigned two labels in the range of 0 to 1, corresponding to each of the prediction windows (i.e., 24 hours and 72 hours).
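To make this encoding concrete, the labeling logic can be sketched in a few lines of Python (a minimal sketch; the category names and per-case record format are hypothetical, not the study's actual schema):

```python
# Severity ordering of the oxygen-therapy categories, per the encoding above.
SEVERITY = {"RA": 0.0, "LFO": 0.25, "HFO/NIV": 0.5, "MV": 0.75}

def outcome_label(devices_in_window, died_in_window):
    """Map the most intensive oxygen therapy observed in a prediction window
    (24 h or 72 h) to a label in [0, 1]; death sets the label to 1."""
    if died_in_window:
        return 1.0
    return max(SEVERITY[d] for d in devices_in_window)

# Example: escalation from room air to HFO/NIV within the window -> 0.5
assert outcome_label(["RA", "LFO", "HFO/NIV"], died_in_window=False) == 0.5
```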
For EMR features, only the first values captured in the ED were used, and data pre-processing included de-identification, missing value imputation and normalization to zero-mean and unit variance. For CXR images, only the first one obtained in the ED was used. The model therefore fuses information from both the EMR features and CXR features, using a ResNet-34 to extract features from a CXR and a Deep & Cross network to concatenate these features together with the EMR features (see Methods for details). The model output is a risk score, termed the EXAM score, which is a continuous value in the range of 0–1 for each of the 24-hour and 72-hour predictions, corresponding to the labels described above.

Federating The Model

EXAM was trained using a cohort of 16,148 cases, making it not only one of the first FL models for COVID-19, but also one of the largest multi-continent development projects in clinically relevant AI (Fig. 1a,b). Data were not harmonized between sites prior to extraction and, reflecting real-life clinical informatics circumstances, the authors did not conduct a meticulous harmonization of the data inputs, as seen in Fig. 1c,d. We compared the locally trained models with the global FL model on each client's test data. Training the model through FL resulted in a significant performance improvement (p<<1e-3, Wilcoxon signed-rank test) of 16% (as defined by the average AUC when running the model on the respective local test sets; from 0.795 to 0.920, or 12.5 percentage points) (Fig. 2a). It also resulted in a 38% generalisability improvement (as defined by the average AUC when running the model on all test sets; from 0.667 to 0.920, or 25.3 percentage points) of the best global model for predicting 24h oxygen treatment compared to models trained only on a site's own data (Fig. 2b). For the prediction of 72h oxygen treatment, the best global model resulted in an average performance improvement of 18% compared to locally trained models, while the generalisability of the global model improved on average by 34% (Extended Data Fig. 1). The stability of our results was validated by repeating three runs of local and FL training on different randomized data splits.
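As a side note, the gains above are reported in two ways, relative improvement and absolute percentage-point gain; both follow directly from the stated average AUCs (illustrative arithmetic only):

```python
# Relative improvement vs. absolute (percentage-point) gain, from the text.
local_auc, fl_auc = 0.795, 0.920
print(f"{100 * (fl_auc - local_auc) / local_auc:.0f}%")  # ~16% relative improvement
print(f"{fl_auc - local_auc:.3f}")                       # 0.125, i.e., 12.5 points

gen_local = 0.667                                        # avg AUC on all test sets
print(f"{100 * (fl_auc - gen_local) / gen_local:.0f}%")  # ~38% generalisability gain
```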
Local models that were trained using unbalanced cohorts (e.g., mostly mild cases of COVID-19) markedly benefited from the FL approach, with a substantial improvement in average AUC for the categories with only a few cases. This was evident at client-site #16, which had an unbalanced dataset in which most patients experienced mild disease, with only a few severe cases. The FL model achieved a higher true positive rate for the two positive (severe) cases and a markedly lower false positive rate compared to the local model, both shown in the receiver operating characteristic (ROC) plots and confusion matrices (Fig. 3a and Extended Data Fig. 2). More importantly, the generalisability of the FL model was considerably increased over the locally trained model. In the case of client-sites with relatively small datasets, the best FL model markedly outperformed the 'local' model as well as models trained on larger datasets from 5 client-sites in the Boston area (Fig. 3b). The global model performed well in predicting oxygen need at 24/72h in both COVID-positive and COVID-negative patients (Extended Data Fig. 3).

Validation At Independent Sites

Following the initial training, EXAM was subsequently tested at three independent sites, Cooley Dickinson Hospital (CDH), Martha's Vineyard Hospital (MVH), and Nantucket Cottage Hospital (NCH), all in Massachusetts, USA. These hospitals had different patient population characteristics and were not part of the EXAM training sites. The model was not retrained at these sites; it was used only for validation. The validation data included patients from March 2020 to February 2021 who satisfied the inclusion criteria detailed in the Methods section. The cohort sizes, labels and model inference results are summarized in Table 2, and the ROC curves and confusion matrices for the largest dataset, from CDH, are shown in Fig. 4. The operating point was set to discriminate between non-mechanical-ventilation treatment and mechanical ventilation (MV) treatment (or death). The globally trained FL model, EXAM, achieved an average AUC of 0.944 and 0.924 for the 24h and 72h prediction tasks, respectively (Table 2b), which exceeded the average performance among sites used in training EXAM. For MV at 72h, EXAM had a low false-negative rate of 7.1%. Representative failure cases are presented in Extended Data Fig. 4, showing two false-negative cases from CDH: one case had many missing EMR data features, and the other had a CXR with motion artifact and some missing EMR features.

Use Of Differential Privacy

A primary motivation for healthcare institutes to use FL is to preserve the security and privacy of their data, as well as to adhere to data compliance measures. In FL, there remains the potential risk of model 'inversion'47, or even the reconstruction of training images from the model gradients themselves48. To counter these risks, we used security-enhancing measures that mitigate risk in the event of data 'interception' during site-server communication49, adding a layer of protection that we believe could encourage more institutions to use FL. We thus validated previous findings showing that partial weight sharing and other differential privacy techniques can successfully be applied in FL50. Through investigating a partial weight-sharing scheme50,51, we showed that models can reach comparable performance even when only 25% of the weight updates are shared (Extended Data Fig. 5).

Discussion

To our knowledge, this is the largest real-world healthcare FL study to date in terms of the number of sites and the number of data points used. We believe that it provides a powerful proof-of-concept of the feasibility of using FL for fast and collaborative development of needed AI models in healthcare. Our study involved multiple sites across four continents, under the oversight of different regulatory bodies, and the resulting model thus holds the promise of being provided to different regulated markets in an expedited way.
The global FL model, EXAM, proved to be more robust and achieved better results on individual sites than any model that was trained only on local data. We believe that this consistent improvement was achieved not only because of a larger and more diverse dataset, but also because of the use of data inputs that can be standardized and the avoidance of clinical impressions and reported symptoms. These factors played a significant part in increasing the benefits of this FL approach and its impact on performance, generalisability and, ultimately, the model's usability.

For a client-site with a relatively small dataset, two typical approaches could be used for fitting a useful model: one is to train locally with its own data; the other is to apply a model trained on a larger dataset elsewhere. For sites with small datasets, it would have been virtually impossible to build a performant deep learning model using only their local data. The finding that these two approaches were outperformed on all three prediction tasks by the global FL model indicates that the benefit to client-sites with small datasets from participating in FL collaborations is substantial. This likely reflects FL's ability to capture more diversity than local training, and to mitigate the bias present in models trained on a homogenous population. A population or age group under-represented in one hospital or region might be highly represented in another, such as children, who might be differentially affected by COVID-19, including in disease manifestations on lung imaging46.

The validation results confirmed that the global model is robust, supporting our hypothesis that FL-trained models generalise across healthcare systems. They provide a compelling case for the use of predictive algorithms in COVID-19 patient care, and for the use of FL in model creation and testing. By participating in this study, the client-sites received access to EXAM, to be further validated ahead of pursuing any regulatory approval or future introduction into clinical care. Plans are underway to validate EXAM prospectively in 'production' settings at MGB, leveraging COVID-19-targeted resources53, as well as at different sites that were not part of the EXAM training.

Over 200 prediction models to support decision making in patients with COVID-19 have been published19. Unlike the majority of these publications, which focused on diagnosing COVID-19 or predicting mortality, we predicted oxygen requirements, which have direct implications for patient management. We also used cases with unknown SARS-CoV-2 status, so the model can provide input to the physician ahead of an RT-PCR test result, making it useful in a real-life clinical setting. The model's imaging input is used in common practice, in contrast with models that use chest Computed Tomography (CT), a diagnostic modality not recommended by consensus guidelines for this indication. The model's design was constrained to objective predictors, unlike many published studies that leveraged subjective clinical impressions. The data collected reflect varied incidence rates, and thus the 'population momentum' we encountered is more diverse; this implies that the algorithm can be useful for populations with different incidence rates. Patient cohort identification and data harmonization are not novel issues in research and data science54, but they are further complicated when using FL, given the lack of visibility into other sites' datasets.
Improvements to clinical information systems are needed to streamline data preparation and to better leverage a network of sites participating in FL. This, in conjunction with hyperparameter engineering, can allow algorithms to 'learn' more effectively from larger data batches and to adapt model parameters to a particular site for further personalization, for example through additional fine-tuning on that site39. A system that allows seamless, close-to-real-time model inference and results processing would also be of benefit and would 'close the loop' from training to model deployment.

Because the data was not centralized, it is not readily accessible, and any future analysis of the results beyond what was already derived and collected is therefore limited. Similar to other machine learning models, EXAM is limited by the quality of the training data. Institutions interested in deploying this algorithm for clinical care need to understand potential biases in the training. For example, the labels used as 'ground truth' in the training of the EXAM model were derived from 24- and 72-hour oxygen consumption in the patient. It is assumed that the oxygen delivered to the patient equates with the oxygen need. However, in the early phase of the COVID-19 pandemic, many patients were provided high-flow oxygen prophylactically, regardless of their oxygen need. Such clinical practice could skew the predictions made by this model.

Since our data access was limited, we did not have sufficient information to generate meaningful statistics regarding failure causes post hoc at most sites. However, we did study the failure cases from the largest independent test site, CDH, and were able to generate hypotheses that we can test in the future. For high-performing sites, it seems that most failure cases fall into two categories: 1) low quality of input data, e.g., missing data or motion artifact in the CXR; 2) out-of-distribution data, e.g., a very young patient. In the future, we also intend to investigate the potential for a 'population drift' due to different phases of disease progression. We believe that, due to the diversity across 20 sites, this risk may have been mitigated.

A feature that would enhance these kinds of large-scale collaborations is the ability to predict each client-site's contribution towards improving the global FL model. This would help in selecting client-sites and in prioritizing data acquisition and annotation efforts. The latter is especially important given the high costs and difficult logistics of these large consortium endeavours, and it would enable them to capture diversity rather than sheer quantity of data samples. Future approaches may incorporate automated hyperparameter searching55, neural architecture search56, and other automated machine learning (AutoML)57 approaches to find the optimal training parameters for each client-site more efficiently.

Known issues of Batch Normalization (BN) in FL58 motivated us to keep our base model for image feature extraction fixed49 in order to reduce the divergence between unbalanced client-sites. Future work might explore different types of normalization techniques to allow more effective FL training of AI models when the clients' data is non-independent and identically distributed (non-IID). Recent works on privacy attacks within the FL setting have raised concerns about data leakage during model training59. Meanwhile, protection algorithms are still under-explored and constrained by multiple factors.
While differential privacy algorithms36,48,49 show good protection, they may weaken the model's performance. Encryption algorithms, such as Homomorphic Encryption60, maintain performance but may significantly increase message sizes and training time. A quantifiable way to measure privacy would allow better-informed choices of the minimal privacy parameters necessary to maintain clinically acceptable performance36,48,49.

Following further validation, we envision the deployment of the EXAM model in the ED setting, as a way to evaluate risk on a per-patient and population level, and to provide clinicians with an additional reference point in the often-difficult task of triaging patients. We also envision using the model as a more sensitive population-level metric, to help balance resources between regions, hospitals and departments. Our hope is that similar FL efforts can break the data silos and allow for faster development of much-needed AI models in the near future.

Main References

1. Budd, J. et al. Digital technologies in the public-health response to COVID-19. Nat. Med. 26, 1183–1192 (2020). 2. Moorthy, V., Henao Restrepo, A. M., Preziosi, M.-P. & Swaminathan, S. Data sharing for novel coronavirus (COVID-19). Bull. World Health Organ. 98, 150 (2020). 3. Chen, Q., Allot, A. & Lu, Z. Keep up with the latest coronavirus research. Nature 579, 193 (2020). 4. Fabbri, F., Bhatia, A., Mayer, A., Schlotter, B. & Kaiser, J. BCG IT Spend Pulse: How COVID-19 Is Shifting Tech Priorities. (2020). 5. Candelon, F., Reichert, T., Duranton, S., di Carlo, R. C. & De Bondt, M. The Rise of the AI-Powered Company in the Postcrisis World. (2020). 6. Chao, H. et al. Integrative analysis for COVID-19 patient outcome prediction. Medical Image Analysis 67, 101844 (2021). 7. Zhu, X. et al. Joint prediction and time estimation of COVID-19 developing severe symptoms using chest CT scan. Medical Image Analysis 67, 101824 (2021). 8. Yang, D. et al. Federated semi-supervised learning for COVID region segmentation in chest CT using multi-national data from China, Italy, Japan. arXiv 1–19 (2020) doi:10.1016/j.media.2021.101992. 9. Minaee, S., Kafieh, R., Sonka, M., Yazdani, S. & Jamalipour Soufi, G. Deep-COVID: Predicting COVID-19 from chest X-ray images using deep transfer learning. Medical Image Analysis 65, 101794 (2020). 10. COVID-19 Studies from the World Health Organization Database. https://clinicaltrials.gov/ct2/who_table (2020). 11. ACTIV. https://www.nih.gov/research-training/medical-research-initiatives/activ (2020). 12. Food and Drug Administration (FDA). Coronavirus Treatment Acceleration Program (CTAP). https://www.fda.gov/drugs/coronavirus-covid-19-drugs/coronavirus-treatment-acceleration-program-ctap (2020). 13. Gleeson, P., Davison, A. P., Silver, R. A. & Ascoli, G. A. A Commitment to Open Source in Neuroscience. Neuron 96, 964–965 (2017). 14. Piwowar, H. et al. The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles. PeerJ 6, e4375 (2018). 15. European Society of Radiology (ESR), Neri, E., de Souza, N. et al. What the radiologist should know about artificial intelligence – an ESR white paper. Insights Imaging 10, 44 (2019). https://doi.org/10.1186/s13244-019-0738-2 16. Pesapane, F., Codari, M. & Sardanelli, F. Artificial intelligence in medical imaging: threat or opportunity? Radiologists again at the forefront of innovation in medicine. European Radiology Experimental 2, (2018). 17. Price, W. N., 2nd & Cohen, I. G.
Privacy in the age of medical big data. Nat. Med. 25, 37–43 (2019). 18. Liang, W. et al. Development and Validation of a Clinical Risk Score to Predict the Occurrence of Critical Illness in Hospitalized Patients With COVID-19. JAMA Intern. Med. 180, 1081–1089 (2020). 19. Wynants, L. et al. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal. BMJ 369, m1328 (2020). 20. Zhang, L. et al. D-dimer levels on admission to predict in-hospital mortality in patients with Covid-19. J. Thromb. Haemost. 18, 1324–1329 (2020). 21. Sands, K. E. et al. Patient characteristics and admitting vital signs associated with coronavirus disease 2019 (COVID-19)-related mortality among patients admitted with noncritical illness. Infect. Control Hosp. Epidemiol. 1–7 (2020) doi:10.1017/ice.2020.461. 22. ACR Recommendations for the use of Chest Radiography and Computed Tomography (CT) for Suspected COVID-19 Infection | American College of Radiology. https://www.acr.org/Advocacy-and-Economics/ACR-Position-Statements/Recommendations-for-Chest-Radiography-and-CT-for-Suspected-COVID19-Infection. 23. Rubin, G. D. et al. The Role of Chest Imaging in Patient Management during the COVID-19 Pandemic: A Multinational Consensus Statement from the Fleischner Society. Radiology 296, 172–180 (2020). 24. World Health Organization. Use of chest imaging in COVID-19. https://www.who.int/publications/i/item/use-of-chest-imaging-in-covid-19 (2020) 25. American Thoracic Society. Diagnosis and Management of COVID-19 Disease. 201, 15–19 (2020). 26. Redmond, C. E., Nicolaou, S., Berger, F. H., Sheikh, A. M. & Patlas, M. N. Emergency Radiology During the COVID-19 Pandemic: The Canadian Association of Radiologists Recommendations for Practice. Canadian Association of Radiologists Journal 71, 425–430 (2020). 27. Zhong, A. et al. Deep metric learning-based image retrieval system for chest radiograph and its clinical applications in COVID-19. Medical Image Analysis 70, 101993 (2021). 28. Lyons, C. & Callaghan, M. The use of high-flow nasal oxygen in COVID-19. Anaesthesia 75, 843–847 (2020). 29. Whittle, J. S., Pavlov, I., Sacchetti, A. D., Atwood, C. & Rosenberg, M. S. Respiratory support for adult patients with COVID-19. J Am Coll Emerg Physicians Open (2020) doi:10.1002/emp2.12071. 30. Ai, J., Li, Y., Zhou, X. & Zhang, W. COVID-19: treating and managing severe cases. Cell Res. 30, 370–371 (2020). 31. Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29 (2019). 32. Cahan, E. M., Hernandez-Boussard, T., Thadaney-Israni, S. & Rubin, D. L. Putting the data before the algorithm in big data addressing personalized healthcare. npj Digital Medicine 2, 1–6 (2019). 33. Thrall, J. H. et al. Artificial Intelligence and Machine Learning in Radiology: Opportunities, Challenges, Pitfalls, and Criteria for Success. J. Am. Coll. Radiol. 15, 504–508 (2018). 34. Shilo, S., Rossman, H. & Segal, E. Axes of a revolution: challenges and promises of big data in healthcare. Nat. Med. 26, 29–38 (2020). 35. Gao, Y. & Cui, Y. Deep transfer learning for reducing health care disparities arising from biomedical data inequality. Nat. Commun. 11, 5131 (2020). 36. Rieke, N. et al. The Future of Digital Health with Federated Learning. (2020). 37. Yang, Q., Liu, Y., Chen, T. & Tong, Y. Federated Machine Learning: Concept and Applications. (2019). 38. Ma, C. et al. On Safeguarding Privacy and Security in the Framework of Federated Learning. IEEE Netw. 34, 242–248 (2020). 39. Brisimi, T. S. et al. 
Federated learning of predictive models from federated Electronic Health Records. Int. J. Med. Inform. 112, 59–67 (2018). 40. Roth, H. R. et al. Federated Learning for Breast Density Classification: A Real-World Implementation. in Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning (eds. Albarqouni, S. et al.) vol. 12444 181–191 (Springer International Publishing, 2020). 41. Sheller, M. J. et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci. Rep. 10, 12598 (2020). 42. Remedios, S. W., Butman, J. A., Landman, B. A. & Pham, D. L. Federated Gradient Averaging for Multi-Site Training with Momentum-Based Optimizers. Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning 170–180 (2020) doi:10.1007/978-3-030-60548-3_17. 43. Xu, Y. et al. A collaborative online AI engine for CT-based COVID-19 diagnosis. medRxiv (2020) doi:10.1101/2020.05.10.20096073. 44. Raisaro, J. L. et al. SCOR: A secure international informatics infrastructure to investigate COVID-19. Journal of the American Medical Informatics Association (2020) doi:10.1093/jamia/ocaa172. 45. Vaid, A. et al. Federated Learning of Electronic Health Records to Improve Mortality Prediction in Hospitalized Patients With COVID-19: Machine Learning Approach. JMIR Medical Informatics 9, (2021). 46. Nino, G. et al. Pediatric lung imaging features of COVID-19: A systematic review and meta-analysis. Pediatric Pulmonology 56, 252–263 (2021). 47. Fredrikson, M., Jha, S. & Ristenpart, T. Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures. Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security 1322–1333 (2015) doi:10.1145/2810103.2813677. 48. Zhu, L., Liu, Z. & Han, S. Deep Leakage from Gradients. in Advances in Neural Information Processing Systems 32 (eds. Wallach, H. et al.) 14774–14784 (Curran Associates, Inc., 2019). 49. Kaissis, G. A., Makowski, M. R., Rückert, D. & Braren, R. F. Secure, privacy-preserving and federated machine learning in medical imaging. Nature Machine Intelligence vol. 2 305–311 (2020). 50. Li, W. et al. Privacy-Preserving Federated Brain Tumour Segmentation. Machine Learning in Medical Imaging 133–141 (2019) doi:10.1007/978-3-030-32692-0_16. 51. Shokri, R. & Shmatikov, V. Privacy-preserving deep learning. 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton) (2015) doi:10.1109/allerton.2015.7447103. 52. Li, X. et al. Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results. Med. Image Anal. 65, 101765 (2020). 53. Estiri, H. et al. Predicting COVID-19 mortality with electronic medical records. npj Digital Medicine 4, (2021). 54. Jiang, G. et al. Harmonization of detailed clinical models with clinical study data standards. Methods Inf. Med. 54, 65–74 (2015). 55. Yang, D. et al. Searching Learning Strategy with Reinforcement Learning for 3D Medical Image Segmentation. Medical Image Computing and Computer Assisted Intervention – MICCAI 2019 3–11 (2019) doi:10.1007/978-3-030-32245-8_1. 56. Elsken, T., Metzen, J. H. & Hutter, F. Neural Architecture Search: A Survey. arXiv [stat.ML] (2018). 57. Yao, Q. et al. Taking Human out of Learning Applications: A Survey on Automated Machine Learning.
arXiv [cs.AI] (2018). 58. Ioffe, S. & Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv [cs.LG] (2015). 59. Kaufman, S., Rosset, S. & Perlich, C. Leakage in Data Mining: Formulation, Detection, and Avoidance. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 556–563 (2011). 60. Zhang, C. et al. BatchCrypt: Efficient homomorphic encryption for cross-silo federated learning. Proceedings of the 2020 USENIX Annual Technical Conference, ATC 2020 493–506 (2020).

Tables

Table 1 | EMR (electronic medical record) data used in the EXAM study

Category | Subcategory | Component Name | Definition | Units | LOINC Code
Demographic | - | Patient Age | - | Years | 30525-0
Imaging | Portable Chest X-Ray | - | AP or PA Portable Chest X-ray | - | 36554-4
Lab Value | C-Reactive Protein | C-Reactive Protein | Blood C-Reactive Protein Concentration | mg/L | 1988-5
Lab Value | CBC (Complete Blood Count) | Neutrophils | Blood Absolute Neutrophils | 10^9/L | 751-8
Lab Value | CBC (Complete Blood Count) | White Blood Cells | Blood White Blood Cell Count | 10^9/L | 33256-9
Lab Value | D-Dimer | D-Dimer | Blood D-Dimer Concentration | ng/mL | 7799-0
Lab Value | Lactate | Lactate | Blood Lactate Concentration | mmol/L | 2524-7
Lab Value | LDH (Lactate Dehydrogenase) | LDH | Blood Lactate Dehydrogenase Concentration | U/L | 2532-0
Lab Value | Metabolic Panel | Creatinine | Blood Creatinine Concentration | mg/dL | 2160-0
Lab Value | Procalcitonin | Procalcitonin | Blood Procalcitonin Concentration | ng/mL | 33959-8
Lab Value | Metabolic Panel | eGFR | Estimated Glomerular Filtration Rate | mL/min/1.73m2 | 69405-9
Lab Value | Troponin | Troponin-T | Blood Troponin Concentration | ng/mL | 67151-1
Lab Value | Hepatic Panel | AST | Blood AST Concentration | IU/L | 1920-8
Lab Value | Metabolic Panel | Glucose | Blood Glucose Concentration | mg/dL | 2345-7
Vital Sign | - | Oxygen Saturation | Oxygen Saturation | % | 59408-5
Vital Sign | - | Systolic Blood Pressure | Systolic Blood Pressure | mmHg | 8480-6
Vital Sign | - | Diastolic Blood Pressure | Diastolic Blood Pressure | mmHg | 8462-4
Vital Sign | - | Respiratory Rate | Respiratory Rate | breaths per minute | 9279-1
Vital Sign | COVID PCR test | PCR for RNA | [not used as input to model] | - | 95425-5
Vital Sign | Oxygen Device used at Emergency Department (ED) | Oxygen Device | Ventilation, High-flow/NIV, Low-flow, Room Air | - | 41925-9
Outcome | 24Hr Oxygen Device | Oxygen Device | Ventilation, High-flow/NIV, Low-flow, Room Air | - | 41925-9
Outcome | 72Hr Oxygen Device | Oxygen Device | Ventilation, High-flow/NIV, Low-flow, Room Air | - | 41925-9
Outcome | Death | - | - | - | -
Outcome | Time of Death | - | - | Hours | -

Table 2 | Performance of the best global model, EXAM, on independent datasets. a, Breakdown of patients by level of oxygen need at different time points across the 3 independent datasets, CDH, MVH and NCH. b, AUC for predicting level of oxygen need at 24 hrs and 72 hrs on the 3 independent datasets (with 95% confidence intervals). The AUC at NCH for ≥MV at 24 hrs could not be calculated, as there were no mechanically ventilated patients. AUC - Area Under the Curve, RA - Room Air, LFO - Low-Flow Oxygen, HFO-NV - High-Flow Oxygen, No Mechanical Ventilation, MV - Mechanical Ventilation
a, # Cases for each level of Oxygen Therapy

Site | # Cases | # Pos. Cases | Prediction Interval | RA | LFO | HFO_NV | MV&DEATH
CDH | 840 | 244 | 24 hours | 608 | 162 | 48 | 22
CDH | | | 72 hours | 575 | 173 | 62 | 30
MVH | 399 | 30 | 24 hours | 356 | 36 | 3 | 4
MVH | | | 72 hours | 351 | 39 | 3 | 6
NCH | 264 | 29 | 24 hours | 237 | 23 | 4 | 0
NCH | | | 72 hours | 235 | 22 | 4 | 3

b, AUC for each level of Oxygen Therapy

Site | Prediction Interval | ≥LFO | ≥HFO_NV | ≥MV | Average AUC
CDH | 24 hours | 0.925 (0.903, 0.945) | 0.950 (0.926, 0.971) | 0.956 (0.918, 0.984) | 0.944
CDH | 72 hours | 0.902 (0.881, 0.924) | 0.931 (0.905, 0.955) | 0.938 (0.893, 0.927) | 0.924
MVH | 24 hours | 0.904 (0.844, 0.954) | 0.836 (0.620, 0.978) | 0.964 (0.925, 1.000) | 0.901
MVH | 72 hours | 0.887 (0.827, 0.940) | 0.872 (0.663, 0.992) | 0.988 (0.973, 0.997) | 0.916
NCH | 24 hours | 0.895 (0.833, 0.950) | 0.984 (0.957, 1.000) | N/A | N/A
NCH | 72 hours | 0.904 (0.850, 0.949) | 0.947 (0.890, 0.991) | 0.931 (0.897, 0.959) | 0.927

Figures

Fig. 1 | Data used in the EXAM Federated Learning study. a, EXAM included 20 different sites from around the globe. b, Number of cases that each institution or site contributed (client #1 being the largest site). c, CXR intensity distributions at each client-site. d, Age of patients included at each client-site, showing the min. and max. ages (asterisks) and mean and standard deviation (length of bars).

Fig. 2 | Federated Learning vs. local training performance. a, Test performance of models predicting 24h oxygen treatment trained on local data only (Local) versus the performance of the best global model available on the server (FL (gl. best)). b, Generalisability (average performance on other sites' test data) as a function of a client's dataset size (# cases). The average performance improved by 16% (from 0.795 to 0.920, or 12.5 percentage points) compared to locally trained models alone, while the average generalisability of the global model improved by 38% (from 0.667 to 0.920, or 25.3 percentage points). Note, we show the performance for 18 of 20 clients here, as client 12 had only outcomes for 72 hours (see Extended Data Fig. 1) and client 14 had only cases with room-air treatment, making the evaluation metric (avg. AUC) not applicable (see Methods). Therefore, client 14 was also excluded from the computation of the baseline average generalisability numbers.

Fig. 3 | Comparison of Federated Learning trained vs. locally trained models. a, ROC at a site with unbalanced data and mostly mild cases. b, ROC of the local model at client-site 12 (small dataset), mean ROC of models trained on larger datasets, and ROC of the best global model to predict oxygen treatment at 72h. The mean ROC is calculated based on 5 locally trained models, with the gray area showing the standard deviation of the ROCs. We show the ROCs for three different cut-off values t of the EXAM risk score.

Fig. 4 | Performance of the best global model on the largest independent dataset. a, ROC performance and confusion matrices on the independent dataset, CDH, predicting oxygen treatment at 24 hr. b, ROC performance and confusion matrices on the independent dataset, CDH, predicting oxygen treatment at 72 hr. We show the ROCs for three different cut-off values t of the EXAM risk score.

Methods

Ethics Approval

All procedures were conducted in accordance with principles for human experimentation as defined in the Declaration of Helsinki and International Conference on Harmonization Good Clinical Practice guidelines and approved by the relevant institutional review boards (e.g., the MGB ethics board, reference # 2020P002673).
Since no patient data was transferred between any of the participants and the study was considered of no more than minimal risk to patients, the requirement of a full IRB process was largely waived according to the Ethical Principles and Guidelines for the Protection of Human Subjects of Research (the "Belmont Report") and the requirements of the Health Insurance Portability and Accountability Act (HIPAA) of 1996.

Study Setting

The study included data from 20 institutions (Fig. 1a): Mass General Brigham (MGB) affiliated hospitals (Mass General Hospital (MGH), Brigham and Women's Hospital, Newton-Wellesley Hospital, North Shore Medical Center, Faulkner Hospital); Children's National Hospital in Washington, D.C.; NIHR Cambridge Biomedical Research Centre; The Self-Defense Forces Central Hospital in Tokyo; National Taiwan University MeDA Lab and MAHC and Taiwan National Health Insurance Administration; Tri-Service General Hospital in Taiwan; Kyungpook National University Hospital in South Korea; Faculty of Medicine, Chulalongkorn University in Thailand; Diagnosticos da America SA in Brazil; University of California, San Francisco; VA San Diego; University of Toronto; National Institutes of Health in Bethesda, Maryland; University of Wisconsin-Madison School of Medicine and Public Health; Memorial Sloan Kettering Cancer Center in New York; and Mount Sinai Health System in New York. Institutions were recruited between March and May 2020. The dataset curation started in June 2020 and the last data cohort was added in September 2020. Between August and October 2020, 140 independent FL runs were conducted to develop the EXAM model, and by end-October 2020, EXAM was made public on NVIDIA NGC61.

Data Collection

The 20 client-sites prepared a total of 16,148 cases (both positive and negative) for the purpose of training, validating, and testing the model (Fig. 1b). Medical data were pulled for patients who satisfied the study inclusion criteria. Client-sites strove to include all the COVID-positive cases they had from the beginning of the pandemic in December 2019 up to the time they started local training for the EXAM study. All local training had started by September 30, 2020. The sites also included other patients from the same period who had negative RT-PCR test results. Since most of the sites had more SARS-CoV-2-negative than -positive patients, we limited the number of negative patients included to at most 95% of the total cases at each client-site. A 'case' included a CXR and the requisite data inputs taken from the patient's medical record. A breakdown of the cohort size of the dataset for each client-site is shown in Fig. 1b. The distribution and patterns of CXR image intensities (pixel values) varied significantly among the sites due to a multitude of patient- and site-specific factors, such as different device manufacturers and imaging protocols, as shown in Fig. 1c,d. Patient age and EMR feature distributions varied greatly between sites, as expected given the differing demographics of globally distributed hospitals (Extended Data Fig. 6).

Patient inclusion criteria

Patient inclusion criteria were:
1. patient presented to the hospital's ED or equivalent;
2. patient had an RT-PCR test done anytime between presentation to the ED and discharge from the hospital;
3. patient had a CXR in the ED;
4. patient's record had at least 5 of the EMR values detailed in Table 1, all obtained in the ED, and the relevant outcomes captured during the hospitalization.

Of note, the CXR, lab values, and vitals used were the first available captured during the visit to the ED. The model did not incorporate any CXR, lab values, or vitals acquired after leaving the ED.
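A hypothetical pandas sketch of these criteria (the column names are illustrative stand-ins, not the actual site schemas) might look like:

```python
import pandas as pd

def select_exam_cohort(visits: pd.DataFrame, emr_cols: list) -> pd.DataFrame:
    """Apply the four inclusion criteria to a per-visit table (sketch only)."""
    mask = (
        visits["presented_to_ed"]                      # 1. ED presentation
        & visits["rt_pcr_before_discharge"]            # 2. RT-PCR before discharge
        & visits["cxr_in_ed"]                          # 3. CXR obtained in the ED
        & (visits[emr_cols].notna().sum(axis=1) >= 5)  # 4. >= 5 EMR values in the ED
        & visits["outcomes_captured"]                  # 4. outcomes recorded
    )
    return visits.loc[mask]
```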
Model input

In total, 20 features (19 EMR features and a CXR) were used as input to the model. The outcome (i.e., "ground truth") labels were assigned based on patient oxygen requirements after 24-hour and 72-hour periods from initial admission to the ED. A detailed list of the requested EMR features and outcomes can be seen in Table 1. The distribution of oxygen treatment using different devices at different client-sites is shown in Extended Data Fig. 7, which details the device usage at admission to the ED and after 24-hour and 72-hour periods. The difference in dataset distribution for the largest and the smallest client-sites can be seen in Extended Data Fig. 8. The numbers of positive COVID-19 cases, confirmed by a single RT-PCR test obtained anytime between presentation to the ED and discharge from the hospital, are listed in Supplemental Table 1. Each client-site was asked to randomly split its dataset into 3 parts: 70% for training, 10% for validation, and 20% for testing. For both the 24h and 72h outcome prediction models, the random splits for each of the three repeated local and FL training and evaluation experiments were independently generated.

EXAM Model Development

There is wide variation in the clinical course of patients who present to the hospital with symptoms of COVID-19, with some experiencing rapid deterioration in respiratory function requiring different interventions to prevent or mitigate hypoxemia62,63. A critical decision made during the evaluation of a patient at the initial point of care or the ED is whether the patient is likely to require more invasive or resource-limited countermeasures or interventions (such as mechanical ventilation or monoclonal antibodies), and should therefore receive a scarce but effective therapy, a therapy with a narrow risk-benefit ratio due to side effects, or a higher level of care, such as admittance to the ICU64. In contrast, a patient who is at lower risk of requiring invasive oxygen therapy may be placed in a less intensive care setting, such as a regular ward, or even released from the ED for continued self-monitoring at home65. EXAM was developed to help triage these patients. Of note, the model is not approved by any regulatory agency at this time, and it should only be used for research purposes.

EXAM score

EXAM was trained using FL, and it outputs a risk score, termed the EXAM score, that is similar to CORISK27 (Extended Data Fig. 9a) and can be used in the same way to triage patients. It corresponds to a patient's oxygen support requirements within two windows, 24 hours and 72 hours after initial presentation to the ED. Extended Data Fig. 9b illustrates how CORISK and the EXAM score can be used for patient triage. CXR images were pre-processed to select the anteroposterior (AP) image and exclude lateral-view images, and then scaled to a resolution of 224x224. As shown in Extended Data Fig. 9a, the model fuses information from both the EMR features and CXR features (based on a modified ResNet-34 with spatial attention67, pre-trained on the CheXpert dataset68) using a Deep & Cross network69. To combine these different data types, a 512-dimensional feature vector was extracted from each CXR image using the pre-trained ResNet-34 with spatial attention, and then concatenated with the EMR features as the input to the Deep & Cross network. The final output was a continuous value in the range of 0–1 for both the 24-hour and 72-hour predictions, corresponding to the labels described above, as shown in Extended Data Fig. 9b.
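A minimal tf.keras sketch of this fusion architecture is given below. It is illustrative only: tf.keras ships no ResNet-34, so an ImageNet-pretrained ResNet50 stands in for the study's CheXpert-pretrained ResNet-34 with spatial attention, the cross layers follow the Deep & Cross formulation of Wang et al.69, and the deep-branch layer sizes are assumptions.

```python
import tensorflow as tf

def build_exam_like_model(n_emr=19, n_cross=3):
    cxr = tf.keras.Input(shape=(224, 224, 3), name="cxr")
    emr = tf.keras.Input(shape=(n_emr,), name="emr")

    # Stand-in backbone; the study used a ResNet-34 with spatial attention
    # pre-trained on CheXpert, reduced to a 512-dimensional feature vector.
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", pooling="avg")
    img_feat = tf.keras.layers.Dense(512, activation="relu")(backbone(cxr))

    x0 = tf.keras.layers.Concatenate()([img_feat, emr])  # fused input vector

    # Cross branch, DCN-style: x_{l+1} = x0 * (w_l . x_l + b_l) + x_l
    x = x0
    for _ in range(n_cross):
        x = x0 * tf.keras.layers.Dense(1)(x) + x

    # Deep branch: stacked fully connected layers (sizes are assumptions)
    d = x0
    for units in (256, 128):
        d = tf.keras.layers.Dense(units, activation="relu")(d)

    merged = tf.keras.layers.Concatenate()([x, d])
    # Two continuous scores in [0, 1]: the 24-hour and 72-hour predictions
    out = tf.keras.layers.Dense(2, activation="sigmoid", name="exam_scores")(merged)
    return tf.keras.Model(inputs=[cxr, emr], outputs=out)

model = build_exam_like_model()
model.compile(optimizer=tf.keras.optimizers.Adam(5e-5), loss="binary_crossentropy")
```

The Adam optimizer with a 5e-5 learning rate mirrors the training setup described under 'Federated Learning Details' below.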
We used cross-entropy as the loss function and 'Adam' as the optimizer. The model was implemented in TensorFlow70 using the NVIDIA Clara Train SDK71. The average AUC over the classification tasks (≥LFO, ≥HFO/NIV, or ≥MV) was calculated and used as the final evaluation metric.

Feature imputation & normalization

A MissForest algorithm66 was used to impute EMR features, based on the local training dataset. If an EMR feature was completely missing from a client-site dataset, the mean value of that feature, calculated exclusively on data from the MGB client-sites, was used. Then, EMR features were rescaled to zero-mean and unit variance based on statistics calculated on data from the MGB client-sites.

Details of the EMR-CXR data fusion using the Deep & Cross network

To model the interactions of features from the EMR and CXR data on a case level, a deep feature scheme was used, based on a Deep & Cross Network architecture69. Binary and categorical features from the EMR inputs, as well as the 512-dimensional image features from the CXR, were transformed into fused dense vectors of real values by embedding and stacking layers. The transformed dense vectors served as input to the fusion framework, which specifically employed a crossing network to enforce fusion among inputs from different sources. The crossing network performed explicit feature crossing within its layers by conducting inner products between the original input feature vector and the output of the previous layer, thus increasing the degree of interaction across features. At the same time, two individual classic deep neural networks with several stacked fully connected feed-forward layers were trained. The final output of our framework was then derived from the concatenation of both the classic and crossing networks.

Federated Learning Details

Arguably, the most established form of FL implements the Federated Averaging algorithm proposed by McMahan et al.72, or variations thereof. This algorithm can be realized using a client-server setup, where each participating site acts as a client. One can think of FL as a method aiming to minimize a global loss function by reducing a set of local loss functions, which are estimated at each site. By minimizing each client-site's local loss while also synchronizing the learned client-site weights on a centralized aggregation server, one can minimize the global loss without needing access to the entire dataset in a centralized location. Each client-site learns locally and shares model weight updates with a central server, which aggregates contributions using secure SSL encryption and communication protocols. The server then sends an updated set of weights to each client-site after the aggregation, and sites resume training locally. The server and client-sites iterate back and forth until the model converges (Extended Data Fig. 9c). A pseudo-algorithm of FL is shown in the Supplemental Note. In our experiments, we set the number of federated rounds to T=200, with one local training epoch per round t at each client.
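Concretely, Federated Averaging minimizes the weighted global objective L(w) = Σ_k (n_k/n) L_k(w) over K clients. A minimal sketch of the server loop, assuming the model weights are handled as a single flat numpy vector and that client_train_fns stand in for the per-site local training routines, could look like this:

```python
import numpy as np

def federated_averaging(global_w, client_train_fns, n_iters, rounds=200):
    """Sketch of FedAvg (McMahan et al.72): in each round, every client runs
    one local epoch and the server averages the results, weighted by the
    local iteration counts n_k."""
    total = float(sum(n_iters))
    for _ in range(rounds):
        local_ws = [train(global_w.copy()) for train in client_train_fns]
        global_w = sum((n_k / total) * w
                       for n_k, w in zip(n_iters, local_ws))
    return global_w
```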
The number of clients K was up to 20, depending on the network connectivity of clients and on the available data for a specific targeted outcome period (24h or 72h). The number of local training iterations n_k depends on the dataset size at each client k and is used to weight each client's contribution when aggregating the model weights in Federated Averaging. During the FL training task, each client-site selects its best local model by tracking the model's performance on its local validation set. At the same time, the server determines the best global model based on the average validation scores sent from each client-site to the server after each FL round. After the FL training finishes, the best local models and the best global model are automatically shared with all client-sites and evaluated on their local test data.

When training on local data only (the baseline), we set the epoch number to 200. The Adam optimizer was used for both local training and FL, with an initial learning rate of 5e-5 and a stepwise learning rate decay by a factor of 0.5 after every 40 epochs, which is important for the convergence of Federated Averaging73. Random affine transformations, including rotations, translations, shear and scaling, as well as random intensity noise and shifts, were applied to the images for data augmentation during training. Due to the sensitivity of Batch Normalization (BN) layers58 when dealing with different clients in a non-IID setting, we found the best model performance to occur when keeping the parameters of the pre-trained ResNet-34 with spatial attention67 fixed during FL training (i.e., using a learning rate of zero for those layers). The Deep & Cross network that combines the image features with the EMR features does not contain BN layers and hence was not affected by BN's instability issues.

In this study, we investigated a privacy-preserving scheme that shares only partial model updates between the server and client-sites. The weight updates were ranked during each iteration by magnitude of contribution, and only a certain percentage of the largest weight updates was shared with the server. To be exact, the weight updates (that is, gradients) were shared only if their absolute value was above a certain percentile threshold τ_k(t) (Extended Data Fig. 5), which was computed from all non-zero gradients ΔW_k(t) and could be different for each client k in each FL round t. Variations of this scheme could include additional clipping of large gradients, or differential privacy schemes49 that add random noise to the gradients or even to the raw data before feeding it to the network51.
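A numpy sketch of this selective sharing, under the assumption that a client's full update is available as one flat array, could be:

```python
import numpy as np

def partial_update(delta_w, share_fraction=0.25):
    """Zero out all but the largest-magnitude fraction of a weight update
    before sending it to the server; the percentile threshold tau_k(t) is
    computed over the non-zero entries, as described above."""
    nonzero = np.abs(delta_w[delta_w != 0])
    if nonzero.size == 0:
        return delta_w
    tau = np.percentile(nonzero, 100.0 * (1.0 - share_fraction))
    return np.where(np.abs(delta_w) >= tau, delta_w, 0.0)
```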
Statistical Analysis
We conducted a Wilcoxon signed-rank test to confirm the significance of the observed improvement in performance between the locally trained models and the FL model at the 24- and 72-hour time points (see Fig. 2 and Extended Data Fig. 1). The null hypothesis was rejected with a one-sided p-value << 1e-3 in both cases. A Pearson correlation was used to assess the generalisability (robustness of the average AUC on other client-sites' test data) of locally trained models in relation to the respective local dataset size. Only a moderate correlation was observed (r=0.43, p=0.035, df=17 for the 24h model and r=0.62, p=0.003, df=16 for the 72h model), indicating that dataset size alone does not determine a model's robustness to unseen data. To compare the ROC curves of local models trained at different sites with that of the global FL model (Extended Data Fig. 3), we bootstrapped 1,000 samples from the data and computed the resulting AUCs. We then calculated the difference between the two series and standardized it using the formula D = (AUC1 − AUC2)/s, where s is the standard deviation of the bootstrap differences and AUC1 and AUC2 are the corresponding bootstrapped AUC series. By comparing D with the normal distribution, we obtained the p-values shown in Supplemental Table 2. The null hypothesis was rejected with very small p-values, indicating the statistical significance of the FL model's superior performance. The p-values were computed in R with the pROC library74. Since the ground truth is a discrete outcome while the model predicts a continuous score from 0 to 1, a straightforward calibration evaluation such as a Q-Q plot is not possible. Hence, as a quantitative proxy for calibration, we quantified discrimination (Extended Data Fig. 10). We conducted one-way ANOVA tests to compare local and FL model scores among the four ground-truth categories (RA, LFO, HFO, MV). The F-statistic, calculated as the variation between the sample means divided by the variation within the samples and representing the degree of dispersion among different groups, was used to quantify the models. The F-values of the local models at different sites were 245.7, 253.4, 342.3 and 389.8, while the F-value of the FL model was 843.5. Given that larger F-values mean that groups are more separable, the scores from our FL model clearly show a greater dispersion among the four ground-truth categories. Furthermore, the p-value of the ANOVA test on the FL model was <2e-16, indicating that the FL prediction scores differ statistically significantly among the prediction classes.
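For illustration, the sketch below reproduces the logic of the bootstrap ROC comparison and the one-way ANOVA in Python (the actual analysis was conducted in R with the pROC library74). The function names, the assumption of NumPy-array inputs, and the two-sided normal approximation are assumptions of this sketch.

import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

def bootstrap_roc_test(y_true, scores_a, scores_b, n_boot=1000, seed=0):
    # Standardized difference D = (AUC1 - AUC2) / s, where s is the
    # standard deviation of the bootstrapped AUC differences; D is then
    # compared against the normal distribution to obtain a p-value.
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, n)           # resample with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                          # need both classes for an AUC
        diffs.append(roc_auc_score(y_true[idx], scores_a[idx]) -
                     roc_auc_score(y_true[idx], scores_b[idx]))
    s = np.std(diffs, ddof=1)
    d = (roc_auc_score(y_true, scores_a) -
         roc_auc_score(y_true, scores_b)) / s
    return 2.0 * stats.norm.sf(abs(d))        # two-sided p-value

def anova_f(scores_by_class):
    # One-way ANOVA across the ground-truth categories (RA, LFO, HFO, MV):
    # a larger F-statistic means the prediction scores separate the
    # categories more strongly.
    return stats.f_oneway(*scores_by_class)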
Data availability
The datasets from the 20 institutions that participated in this study remain under their custody. These data were used for training at each of the local sites, were not shared with any of the other participating institutions or with the federated server, and are not publicly available.
Code availability
The model, the code for training, validating and testing the model, a readme file, an installation guideline and license files can be accessed at NVIDIA NGC61: https://ngc.nvidia.com/catalog/models/nvidia:med:clara_train_covid19_exam_ehr_xray
Methods References
61. NVIDIA NGC Catalog: COVID-19 Related Models. https://ngc.nvidia.com/catalog/models?orderBy=scoreDESC&pageNumber=0&query=covid&quickFilter=models&filters= (2020).
62. Marini, J. J. & Gattinoni, L. Management of COVID-19 respiratory distress. JAMA 323, 2329–2330 (2020).
63. Cook, T. M. et al. Consensus guidelines for managing the airway in patients with COVID-19: guidelines from the Difficult Airway Society, the Association of Anaesthetists, the Intensive Care Society, the Faculty of Intensive Care Medicine and the Royal College of Anaesthetists. Anaesthesia 75, 785–799 (2020).
64. Galloway, J. B. et al. A clinical risk score to identify patients with COVID-19 at high risk of critical care admission or death: an observational cohort study. J. Infect. 81, 282–288 (2020).
65. Kilaru, A. S. et al. Return hospital admissions among 1419 COVID-19 patients discharged from five U.S. emergency departments. Acad. Emerg. Med. 27, 1039–1042 (2020).
66. Stekhoven, D. J. & Bühlmann, P. MissForest--non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118 (2012).
67. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016). doi:10.1109/CVPR.2016.90.
68. Irvin, J. et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proceedings of the AAAI Conference on Artificial Intelligence 33, 590–597 (2019).
69. Wang, R., Fu, B., Fu, G. & Wang, M. Deep & Cross Network for ad click predictions. In Proceedings of ADKDD'17 (2017). doi:10.1145/3124749.3124754.
70. Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016).
71. NVIDIA Clara Imaging. https://developer.nvidia.com/clara-medical-imaging (2020).
72. McMahan, H. B., Moore, E., Ramage, D., Hampson, S. & Agüera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In AISTATS (2017).
73. Hsieh, K., Phanishayee, A., Mutlu, O. & Gibbons, P. B. The non-IID data quagmire of decentralized machine learning. arXiv [cs.LG] (2019).
74. Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).
Acknowledgements
The views expressed in this study are those of the authors and not necessarily those of the NHS, the NIHR, the Department of Health and Social Care or any of the organizations associated with the authors. MGB would like to acknowledge the following individuals for their support: James Brink MD, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA; Mannudeep Kalra MD, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA; Nir Neumark MD, MSc, Center for Clinical Data Science, Massachusetts General Brigham, Boston, MA; Thomas Schultz, Department of Radiology, Massachusetts General Hospital, Boston, MA; Ning Guo, Center for Advanced Medical Computing and Analysis, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA; Jayashree Kalpathy Cramer PhD, Director, QTIM lab at the Athinoula A. Martinos Center for Biomedical Imaging at MGH; Stuart Pomerant, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA; Giles Boland MD, Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA; and William Mayo-Smith MD, Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA. UCSF would like to acknowledge Peter B. Storey, Jed Chan and Jeff Block for implementing the UCSF FL client infrastructure, and Wyatt Tellis, PhD, for providing the source imaging repository for this work. The UCSF EMR and clinical notes for this study were accessed via the COVID-19 Research Data Mart, https://data.ucsf.edu/covid19. The Faculty of Medicine, Chulalongkorn University would like to acknowledge the Ratchadapisek Sompoch Endowment Fund RA (PO) 001/63 for the Collection and Management of COVID-19 Related Clinical Data and Biological Specimens for Research Task Force, Faculty of Medicine, Chulalongkorn University.
NIHR Cambridge Biomedical Research Centre would like to acknowledge that Andrew Priest is supported by the National Institute for Health Research (Cambridge Biomedical Research Centre at the Cambridge University Hospitals NHS Foundation Trust). National Taiwan University MeDA Lab and MAHC and the Taiwan National Health Insurance Administration would like to acknowledge the MOST Joint Research Center for AI Technology and All Vista Healthcare (AINTU); the National Health Insurance Administration, Taiwan; the Ministry of Science and Technology, Taiwan; and the National Center for Theoretical Sciences Mathematics Division. The National Institutes of Health (NIH) would like to acknowledge that the NIH Medical Research Scholars Program is a public-private partnership supported jointly by the NIH and generous contributions to the Foundation for the NIH from the Doris Duke Charitable Foundation, the American Association for Dental Research, the Colgate-Palmolive Company, Genentech, alumni of student research programs, and other individual supporters.
Author information
These authors contributed equally: Ittai Dayan, Holger Roth, Aoxiao Zhong, Fiona J Gilbert, Quanzheng Li, Mona G. Flores
Affiliations
1. MGH Radiology and Harvard Medical School, Boston, MA, USA: Keith Dreyer & Ittai Dayan
2. NVIDIA, Santa Clara, CA, USA: Holger Roth, Ahmed Harouni, Anas Abidin, Andrew Liu, CK Lee, Colleen Ruan, Daguang Xu, Eddie Huang, Griffin Lacey, Jesse Tetreault, Jiahui Guan, Kristopher Kersten, Nicola Rieke, Pedro Mario Cruz e Silva, Mona G. Flores, Abood Quraini, Andrew Feng, Colin Compas, Deepeksha Bhatia, Isaac Yang, Mohammad Adil & Yuhong Wen
3. Center for Advanced Medical Computing and Analysis, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA: Aoxiao Zhong, Dufan Wu, Hui Ren, Xiang Li & Quanzheng Li
4. San Diego VA Health Care System, San Diego, CA, USA: Amilcare Gentili
5. Department of Neurosurgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA: Anthony Beardsworth Costa & Young Joon Kwon
6. Radiology & Imaging Sciences / Clinical Center, National Institutes of Health, Bethesda, MD, USA: Bradford J. Wood
7. Division of Cardiovascular Surgery, Department of Surgery, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan, R.O.C.: Chien-Sung Tsai
8. Department of Otolaryngology-Head and Neck Surgery, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan, R.O.C. and Graduate Institute of Medical Sciences, National Defense Medical Center, Taipei, Taiwan, R.O.C.: Chih-Hung Wang
9. Center for Research in Biological Systems, University of California, San Diego, CA, USA: Chun-Nan Hsu
10. Diagnósticos da América SA (DASA), Brazil: Felipe Campos Kitamura, Gustavo César de Antônio Corradi, Matheus Ribeiro Furtado de Mendonça & Vitor de Lima Lavor
11. Division of Pediatric Pulmonary and Sleep Medicine, Children's National Hospital, Washington, DC, USA: Gustavo Nino
12. Memorial Sloan Kettering Cancer Center, New York, NY, USA: Hao-Hsin Shin, Krishna Juluru, Krishna Nand Keshava Murthy, Natalie Gangai & Pierre Elnajjar
13. Self-Defense Forces Central Hospital, Tokyo, Japan: Hirofumi Obinata, Shuichi Kawano, Hisashi Sasaki, Hitoshi Mori & Tatsuya Kodama
14. Center for Intelligent Imaging, Department of Radiology and Biomedical Imaging, University of California, San Francisco, California, USA: Jason C. Crane, Pablo F. Damasceno, Christopher P. Hess, Jae Ho Sohn & Sharmila Majumdar
15. Departments of Radiology and Medical Physics, The University of Wisconsin-Madison School of Medicine and Public Health, Madison, WI, USA: John W. Garrett
16. Department of Radiology, NIHR Cambridge Biomedical Resource Centre, University of Cambridge, Cambridge, UK: Josh D Kaggie, Fiona J Gilbert & Sarah Hickman
17. Department of Internal Medicine, Yeungnam University College of Medicine, Daegu, South Korea: Jung Gil Park & Min Kyu Kang
18. Center for Clinical Data Science, Mass General Brigham, Boston, MA, USA: Keith Dreyer, Marcio Aloisio Bezerra Cavalcanti Rockenbach, Varun Buch, Bernardo Bizzo & Evan Leibovitz
19. Sheikh Zayed Institute for Pediatric Surgical Innovation, Children's National Hospital, Washington, DC, USA: Carlos Tor Diez & Marius George Linguraru
20. Joint Dept. of Medical Imaging, Sinai Health System, University of Toronto, Toronto, Canada and Lunenfeld-Tanenbaum Research Institute, Toronto, Canada: Masoom A. Haider
21. Lunenfeld-Tanenbaum Research Institute, Toronto, Canada: Meena AbdelMaseeh
22. MeDA Lab and Institute of Applied Mathematical Sciences, National Taiwan University, Taipei, Taiwan: Pochuan Wang & Weichung Wang
23. Center for Interventional Oncology, National Institutes of Health, Bethesda, MD, USA: Sheng Xu & Sheridan Reed
24. Research Affairs, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand and Center for Artificial Intelligence in Medicine, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand: Sira Sriswasdi
25. Department of Internal Medicine, School of Medicine, Kyungpook National University, Daegu, South Korea: Soo Young Park, Won Young Tak & Yu Rim Lee
26. Departments of Radiology, Medical Physics, and Biomedical Engineering, The University of Wisconsin-Madison School of Medicine and Public Health, Madison, WI, USA: Thomas M. Grist
27. Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand and Thai Red Cross Emerging Infectious Diseases Clinical Center, King Chulalongkorn Memorial Hospital, Bangkok, Thailand: Watsamon Jantarabenjakul & Thanyawee Puthanakit
28. Medical Review and Pharmaceutical Benefits Division, National Health Insurance Administration, Taipei, Taiwan: Weichung Wang & Chiu-Ling Lai
29. Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA, USA: Xihong Lin
30. Department of Radiology, NIHR Cambridge Biomedical Resource Centre, Cambridge University Hospital, Cambridge, UK: Andrew N Priest
31. Department of Radiology and Imaging Sciences, National Institutes of Health, Bethesda, MD, USA and National Cancer Institute, National Institutes of Health, Bethesda, MD, USA: Baris Turkbey
32. Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA: Benjamin Glicksberg
33. Department of Internal Medicine, Catholic University of Daegu School of Medicine, Daegu, South Korea: Byung Seok Kim
34. Planning and Management Office, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan, R.O.C.: Chia-Jung Hsu & Chia-Cheng Lee
35. School of Medicine, National Defense Medical Center, Taipei, Taiwan, R.O.C. and School of Public Health, National Defense Medical Center, Taipei, Taiwan, R.O.C. and Graduate Institute of Life Sciences, National Defense Medical Center, Taipei, Taiwan, R.O.C.: Chin Lin
36. Department of Neurosurgery, NYU Grossman School of Medicine, New York, NY, USA: Eric K Oermann
37. MOST/NTU All Vista Healthcare Center, Center for Artificial Intelligence and Advanced Robotics, National Taiwan University, Taipei, Taiwan: Li-Chen Fu
38. Division of General Internal Medicine and Geriatrics (Fralick), Sinai Health System, Toronto, Canada: Mike Fralick
39. Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand: Peerapon Vateekul
40. Schwartz/Reisman Emergency Medicine Institute, Sinai Health, Toronto, ON, Canada and Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada: Shelley L. McLeod
41. Department of Medicine, NIHR Cambridge Biomedical Resource Centre, University of Cambridge, Cambridge, UK: Stefan Graf
42. National Cancer Institute, National Institutes of Health, Bethesda, MD, USA and Clinical Research Directorate, Frederick National Laboratory for Cancer, National Cancer Institute, Frederick, MD, USA: Stephanie Harmon
43. Department of Microbiology, Sinai Health/University Health Network, Toronto, Canada and Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada and Public Health Ontario Laboratories, Toronto, Canada: Tony Mazzulli
44. Chulalongkorn University Biomedical Imaging Group and Division of Nuclear Medicine, Department of Radiology, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand: Yothin Rakvongthai
Contributions
Ittai Dayan and Mona G. Flores contributed to the acquisition of the data, study support, drafting and revising the manuscript, study design, study concept, and analysis and interpretation of the data; Holger Roth, Aoxiao Zhong and Quanzheng Li contributed to the acquisition of the data, study support, drafting and revising the manuscript, study design, and analysis and interpretation of the data; Fiona J Gilbert contributed to the acquisition of the data, study support, and drafting and revising the manuscript; Jiahui Guan contributed to the support of the study, drafting and revising the manuscript, and analysis and interpretation of the data; Varun Buch contributed to the acquisition of the data, study support and study design; Daguang Xu contributed to the acquisition of the data, study support, drafting and revising the manuscript, and analysis and interpretation of the data; Anthony Beardsworth Costa, Bradford J. Wood, John W. Garrett and Krishna Juluru contributed to the acquisition of the data and drafting and revising the manuscript; Nicola Rieke contributed to the support of the study and drafting and revising the manuscript; Ahmed Harouni, Anas Abidin, Andrew Liu, CK Lee, Colleen Ruan, Eddie Huang, Griffin Lacey, Jesse Tetreault, Kristopher Kersten, Pedro Mario Cruz e Silva, Abood Quraini, Andrew Feng, Colin Compas, Deepeksha Bhatia, Isaac Yang, Mohammad Adil and Yuhong Wen contributed to the support of the study; Amilcare Gentili, Chien-Sung Tsai, Chih-Hung Wang, Chun-Nan Hsu, Dufan Wu, Felipe Campos Kitamura, Gustavo César de Antônio Corradi, Gustavo Nino, Hao-Hsin Shin, Hirofumi Obinata, Hui Ren, Jason C. Crane, Josh D Kaggie, Jung Gil Park, Keith Dreyer, Marcio Aloisio Bezerra Cavalcanti Rockenbach, Marius George Linguraru, Masoom A. Haider, Meena AbdelMaseeh, Pablo F. Damasceno, Pochuan Wang, Sheng Xu, Shuichi Kawano, Sira Sriswasdi, Soo Young Park, Thomas M. Grist,
Watsamon Jantarabenjakul, Weichung Wang, Won Young Tak, Xiang Li, Xihong Lin, Young Joon Kwon, Andrew N Priest, Baris Turkbey, Benjamin Glicksberg, Bernardo Bizzo, Byung Seok Kim, Carlos Tor Diez, Chia-Cheng Lee, Chia-Jung Hsu, Chin Lin, Chiu-Ling Lai, Christopher P. Hess, Eric K Oermann, Evan Leibovitz, Hisashi Sasaki, Hitoshi Mori, Jae Ho Sohn, Krishna Nand Keshava Murthy, Li-Chen Fu, Matheus Ribeiro Furtado de Mendonça, Mike Fralick, Min Kyu Kang, Natalie Gangai, Peerapon Vateekul, Pierre Elnajjar, Sarah Hickman, Sharmila Majumdar, Shelley L. McLeod, Sheridan Reed, Stefan Graf, Stephanie Harmon, Tatsuya Kodama, Thanyawee Puthanakit, Tony Mazzulli, Vitor de Lima Lavor, Yothin Rakvongthai and Yu Rim Lee contributed to the acquisition of the data.
Corresponding authors
Correspondence to Mona G. Flores.
Ethics declarations
Competing interests
This study was organized and coordinated by NVIDIA. Y.W., M.A., I.Y., A.Q., C.C., D.B., A.F., H.R., J.G., D.X., N.R., A.H., K.K., C.R., A.A., C.K.L., E.H., A.L., G.L., P.M.C.S., J.T. and M.G.F. are employees of NVIDIA and own stock as part of the standard compensation package. I.D. is the CEO of Rhino HealthTech Inc. and owns stock in the company. J.G. declared ownership of NVIDIA stock. C.H. declared research travel (Siemens Healthineers AG), conference travel (EUROKONGRESS GmbH) and personal fees (consultant, GE Healthcare LLC; DSMB member, Focused Ultrasound Foundation). F.J.G. declared research collaborations with Merantix, Screen-Point, Lunit, Volpara and GE Healthcare, and undertakes paid consultancy for Kheiron and Alphabet. M.L. declared that he is the co-founder of PediaMetrix Inc. and is on the Board of the SIPAIM Foundation. S.E.H. declared research collaborations with Merantix, Screen-Point, Lunit and Volpara. B.J.W. and S.X. declared that NIH and NVIDIA have a Cooperative Research and Development Agreement. This work was supported in part by the NIH Center for Interventional Oncology and the Intramural Research Program of the National Institutes of Health, via intramural NIH grants Z1A CL040015 and 1ZIDBC011242, and by the NIH Intramural Targeted Anti-COVID-19 (ITAC) Program, funded by the National Institute of Allergy and Infectious Diseases. NIH may have intellectual property in the field.
Additional information
Correspondence and requests for materials should be addressed to Mona G. Flores, MD. Reprints and permissions information is available at www.nature.com/reprints.
Supplemental data figures and tables
Extended Data Fig. 1 | a, Test performance of models predicting 72h oxygen treatment trained on local data only (Local) versus the performance of the best global model available on the server (FL (gl. best)). b, Generalisability (average performance on other sites' test data) as a function of a site's dataset size (# cases). The average performance improved by 18% (from 0.760 to 0.899, or 13.9 percentage points) compared to locally trained models alone, while the average generalisability of the global model improved by 34% (from 0.669 to 0.899, or 23.0 percentage points).
Extended Data Fig. 2 | Confusion matrices at a site with unbalanced data and mostly mild cases. a, Confusion matrices on the test data at site 16 predicting oxygen treatment at 72h using the locally trained model. b, Confusion matrices on the test data at site 16 predicting oxygen treatment at 72h using the best Federated Learning global model. We show the confusion matrices for two different cut-off values t of the EXAM risk score.
Extended Data Fig. 3 | Effect of dataset size on model performance. ROCs of the best global model in comparison to the mean ROCs of models trained on local datasets to predict 24/72-hour oxygen treatment devices for COVID positive/negative patients, respectively (panels: 24h prediction for COVID positive patients, 24h prediction for COVID negative patients, 72h prediction for COVID positive patients, 72h prediction for COVID negative patients), using the test data of 5 large datasets from sites in the Boston area. The mean ROC is calculated based on 5 locally trained models, with the gray area showing the standard deviation of the ROCs. We show the ROCs for three different cut-off values t of the EXAM risk score.
Extended Data Fig. 4 | Failure cases at an independent test site. CXRs from two failure cases at CDH, with the corresponding input features listed below. These are noisy data in which each available value has been anonymized by adding zero-mean Gaussian noise with a standard deviation of 1/5 of the standard deviation of the cohort distribution.
Case 1: FEAT_VITAL_DBP_FIRST: 54.0; FEAT_VITAL_SBP_FIRST: 136.0; FEAT_PT_AGE: 87; FEAT_LAB_LDH_FIRST: NaN; FEAT_LAB_CRP_FIRST: NaN; FEAT_VITAL_SPO2_FIRST: 97.0; FEAT_VITAL_RR_FIRST: 17.0; FEAT_LAB_AST_FIRST: 26.0; FEAT_LAB_PCLC_FIRST: NaN; FEAT_LAB_LAC_FIRST: NaN; FEAT_LAB_NEUT_FIRST: 4.23; FEAT_LAB_GLU_FIRST: 79.0; FEAT_LAB_WBC_FIRST: 6.34; FEAT_LAB_TNT_FIRST: 16.0; FEAT_LAB_GFR_FIRST: 45.0; FEAT_LAB_CR_FIRST: 1.1; FEAT_LAB_DDMR_FIRST: NaN; FEAT_ED_OD: RA; PCR POS ED: True; PCR POS EVER: True.
Case 2: FEAT_VITAL_DBP_FIRST: 84.0; FEAT_VITAL_SBP_FIRST: 146.0; FEAT_PT_AGE: 72; FEAT_LAB_LDH_FIRST: 228.0; FEAT_LAB_CRP_FIRST: 102.0; FEAT_VITAL_SPO2_FIRST: 94.0; FEAT_VITAL_RR_FIRST: 16.0; FEAT_LAB_AST_FIRST: 4.0; FEAT_LAB_PCLC_FIRST: NaN; FEAT_LAB_LAC_FIRST: 1.09; FEAT_LAB_NEUT_FIRST: 6.63; FEAT_LAB_GLU_FIRST: 165.0; FEAT_LAB_WBC_FIRST: 10.52; FEAT_LAB_TNT_FIRST: NaN; FEAT_LAB_GFR_FIRST: 47.0; FEAT_LAB_CR_FIRST: 1.2; FEAT_LAB_DDMR_FIRST: NaN; FEAT_ED_OD: RA; PCR POS ED: True; PCR POS EVER: True.
Extended Data Fig. 5 | Safety-enhancing features used in EXAM. Additional data-safety-enhancing features were assessed by sharing only a certain percentage of the weight updates with the largest magnitudes with the server after each round of learning52. We show that, by using partial weight updates during FL, models can be trained that reach a performance comparable to training while sharing the full information. This differential privacy technique decreases the risk of model inversion or reconstruction of the training image data through gradient interception.
Extended Data Fig. 6 | Characteristics of EMR data used in EXAM. Minimum and maximum values (asterisks) and mean and standard deviation (length of bars) for each EMR feature used as an input to the model. n specifies the number of sites that had this particular feature available. Missing values were imputed using a MissForest algorithm.
Extended Data Fig. 7 | Distribution of oxygen treatments between EXAM sites. The boxplots show the minimum, the maximum, the sample median, and the first and third quartiles (excluding outliers) of the oxygen treatments applied at different sites at the time of Emergency Department admission and after the 24- and 72-hour periods. The types of oxygen treatments administered are 'room air', 'low-flow oxygen', 'high-flow oxygen (non-invasive)' and 'ventilator'.
Extended Data Fig. 8 | Site variations in oxygen usage.
Normalized distributions of oxygen devices at different time points, comparing the site with the largest dataset size (site 1) and a site with unbalanced data including mostly mild cases (site 16).
Extended Data Fig. 9 | Description of the EXAM Federated Learning study. a, Previously developed model, CDS, to predict a risk score that corresponds to respiratory outcomes in patients with SARS-COV-2. b, Histogram of CORISK results at MGB, with an illustration of how the score can be used for patient triage, in which 'A' is an example threshold for safe discharge that has a 99.5% negative predictive value, and 'B' is an example threshold for Intensive Care Unit (ICU) admission that has a 50.3% positive predictive value. For the purpose of the NPV calculation (threshold A), we defined the model inference to be positive if it predicted oxygen need as LFO or above (COVID risk score ≥0.25) and negative if it predicted oxygen need as RA (<0.25). We defined the disease to be negative if the patient was discharged and not readmitted, and positive if the patient was readmitted for treatment. For the purpose of the PPV calculation (threshold B), we defined the model inference to be positive if it predicted oxygen need as MV or above (≥0.75) and negative if it predicted oxygen need as HFO or less (<0.75). We defined the disease to be positive if the patient required MV or died, and negative if the patient survived and did not require MV. The EXAM score can be used in the same way. c, Federated Learning using a client-server setup.
Extended Data Fig. 10 | Calibration plots for the MGB data and the new independent dataset, CDH, used for model validation.
Supplemental Table 1 | Number of PCR positive and negative cases across the sites

Site    # Cases   # Pos. Cases   # Neg. Cases   % Pos. Cases
1       2994      1057           1937           35.3%
2       2825      139            2686           4.9%
3       2697      258            2439           9.6%
4       1786      618            1168           34.6%
5       1065      347            718            32.6%
6       853       427            426            50.1%
7       724       168            556            23.2%
8       637       232            405            36.4%
9       565       342            223            60.5%
10      485       304            181            62.7%
11      400       400            0              100.0%
12      346       346            0              100.0%
13      213       114            99             53.5%
14      176       72             104            40.9%
15      102       102            0              100.0%
16      99        99             0              100.0%
17      74        49             25             66.2%
18      55        55             0              100.0%
19      28        28             0              100.0%
20      24        15             9              62.5%
Total   16148     5172           10976          32.0%

Supplemental Table 2 | p-values and 95% CIs of the ROC comparisons of local training to Federated Learning shown in Extended Data Fig. 1.
These ROC comparisons are for three different cut-off values t of the EXAM risk score. Entries are p-values with the 95% CI of the improvement in AUC in parentheses.

Site      t>=0.25                      t>=0.5                        t>=0.75
Site 1    0.02674 (0.0103, 0.0134)     0.03455 (0.0096, 0.0222)      0.04015 (0.0457, 0.0122)
Site 4    4.866e-05 (0.0277, 0.0358)   1.08e-05 (0.0465, 0.0850)     3.92e-05 (0.0614, 0.1264)
Site 5    3.005e-06 (0.0329, 0.0403)   5.098e-05 (0.04428, 0.0786)   0.0001362 (0.0473, 0.0928)
Site 6    1.717e-14 (0.0782, 0.0938)   5.816e-08 (0.1044, 0.1678)    2.357e-05 (0.0953, 0.1882)
Site 8    2.872e-10 (0.0520, 0.0618)   3.19e-06 (0.0570, 0.0906)     5.872e-05 (0.0575, 0.1095)
Site 12   0.08332 (0.0401, 0.0565)     0.0507 (0.006, 0.891)         0.02664 (0.0235, 0.1633)

Supplemental Note | Client-server based Federated Learning using the Federated Averaging algorithm53,58 as implemented in the NVIDIA Clara Train SDK52
Require: number of federated rounds T.
Require: number of local training iterations nk for client k.
1:  procedure FederatedAveraging
2:    Initialize model weights.
3:    for round t of T do
4:      for client k of K do            ▷ executed in parallel
5:        Send current global model to client.
6:        Evaluate global model on local validation data.
7:        Initialize optimizer and momentums.
8:        Perform training on local data.
9:        Apply privacy-preserving scheme.
10:       Send weight updates and validation scores to server.
11:       Select locally best model based on validation score.
12:     end for
13:     Select best global model based on local validation scores.
14:     Aggregate the client weight updates.
15:     Update the global model, weighted by local iterations nk.
16:   end for
17:   Exchange models and evaluate on each client.
18:   return validation results.
19:   return final global model.
20:   return best global model.
21:   return locally best model for each client.