Person: Crosas, Merce
Publication How Do Astronomers Share Data? Reliability and Persistence of Datasets Linked in AAS Publications and a Qualitative Study of Data Practices among US Astronomers
(Public Library of Science, 2014) Pepe, Alberto; Goodman, Alyssa; Muench, August; Crosas, Merce; Erdmann, Christopher
We analyze data sharing practices of astronomers over the past fifteen years. An analysis of URL links embedded in papers published by the American Astronomical Society reveals that the total number of links included in the literature rose dramatically from 1997 until 2005, when it leveled off at around 1500 per year. The analysis also shows that the availability of linked material decays with time: in 2011, 44% of links published a decade earlier, in 2001, were broken. A rough analysis of link types reveals that links to data hosted on astronomers' personal websites become unreachable much faster than links to datasets on curated institutional sites. To gauge astronomers' current data sharing practices and preferences further, we performed in-depth interviews with 12 scientists and online surveys with 173 scientists, all at a large astrophysical research institute in the United States: the Harvard-Smithsonian Center for Astrophysics, in Cambridge, MA. Both the in-depth interviews and the online survey indicate that, in principle, there is no philosophical objection to data-sharing among astronomers at this institution. Key reasons that more data are not presently shared more efficiently in astronomy include: the difficulty of sharing large data sets; over reliance on non-robust, non-reproducible mechanisms for sharing data (e.g. emailing it); unfamiliarity with options that make data-sharing easier (faster) and/or more robust; and, lastly, a sense that other researchers would not want the data to be shared. We conclude with a short discussion of a new effort to implement an easy-to-use, robust, system for data sharing in astronomy, at theastrodata.org, and we analyze the uptake of that system to-date.
Publication Automating Open Science for Big Data
(SAGE Publications, 2015) Crosas, Merce; King, Gary; Honaker, James; Sweeney, Latanya
The vast majority of social science research presently uses small (MB or GB scale) data sets. These fixed scale sets are commonly downloaded to the researcher's computer where the analysis is performed locally, and are often shared and cited with well-established technologies, such as the Dataverse Project (see Dataverse.org), to support the published results. The trend towards Big Data - including large scale streaming data - is starting to transform research and has the potential to impact policy-making and our understanding of the social, economic, and political problems that affect human societies. However, this research poses new challenges in execution, accountability, preservation, reuse, and reproducibility. Downloading these data sets to a researcher's computer is infeasible or not practical; hence, analyses take place in the cloud, require unusual expertise, and benefit from collaborative teamwork and novel tool development. The advantage of these data sets in how informative they are also means that they are much more likely to contain highly sensitive personally identifiable information. In this paper, we discuss solutions to these new challenges so that the social sciences can realize the potential of Big Data.
Publication Ten Simple Rules for the Care and Feeding of Scientific Data
(Public Library of Science (PLoS), 2014) Goodman, Alyssa; Pepe, Alberto; Blocker, Alexander Weaver; Borgman, Christine L.; Cranmer, Kyle; Crosas, Merce; Di Stefano, Rosanne; Gil, Yolanda; Groth, Paul; Hedstrom, Peg; Hogg, David W.; Kashyap, Vinay; Mahabal, Ashish; Siemiginowska, Aneta; Slavkovic, Aleksandra
Publication Data publication with the structural biology data grid supports live analysis
(Nature Publishing Group, 2016) Meyer, Peter A.; Socias, Stephanie; Key, Jason; Ransey, Elizabeth; Tjon, Emily C.; Buschiazzo, Alejandro; Lei, Ming; Botka, Chris; Withrow, James; Neau, David; Rajashankar, Kanagalaghatta; Anderson, Karen S.; Baxter, Richard H.; Blacklow, Stephen C.; Boggon, Titus J.; Bonvin, Alexandre M. J. J.; Borek, Dominika; Brett, Tom J.; Caflisch, Amedeo; Chang, Chung-I; Chazin, Walter J.; Corbett, Kevin D.; Cosgrove, Michael S.; Crosson, Sean; Dhe-Paganon, Sirano; Di Cera, Enrico; Drennan, Catherine L.; Eck, Michael J.; Eichman, Brandt F.; Fan, Qing R.; Ferré-D'Amaré, Adrian R.; Fromme, J. Christopher; Garcia, K. Christopher; Gaudet, Rachelle; Gong, Peng; Harrison, Stephen; Heldwein, Ekaterina E.; Jia, Zongchao; Keenan, Robert J.; Kruse, Andrew C.; Kvansakul, Marc; McLellan, Jason S.; Modis, Yorgo; Nam, Yunsun; Otwinowski, Zbyszek; Pai, Emil F.; Pereira, Pedro José Barbosa; Petosa, Carlo; Raman, C. S.; Rapoport, Tom; Roll-Mecak, Antonina; Rosen, Michael K.; Rudenko, Gabby; Schlessinger, Joseph; Schwartz, Thomas U.; Shamoo, Yousif; Sondermann, Holger; Tao, Yizhi J.; Tolia, Niraj H.; Tsodikov, Oleg V.; Westover, Kenneth D.; Wu, Hao; Foster, Ian; Fraser, James S.; Maia, Filipe R. N. C.; Gonen, Tamir; Kirchhausen, Tom; Diederichs, Kay; Crosas, Merce; Sliz, Piotr
Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models and indispensable for development of structural biology processing methods. Here, we established a diffraction data publication and dissemination system, Structural Biology Data Grid (SBDG; data.sbgrid.org), to preserve primary experimental data sets that support scientific publications. Data sets are accessible to researchers through a community driven data grid, which facilitates global data access.
Our analysis of a pilot collection of crystallographic data sets demonstrates that the information archived by SBDG is sufficient to reprocess data to statistics that meet or exceed the quality of the original published structures. SBDG has extended its services to the entire community and is used to develop support for other types of biomedical data sets. It is anticipated that access to the experimental data sets will enhance the paradigm shift in the community towards a much more dynamic body of continuously improving data analysis.
Publication If these data could talk
(Springer Nature, 2017) Pasquier, Thomas; Lau, Matthew; Trisovic, Ana; Boose, Emery; Couturier, Ben; Crosas, Merce; Ellison, Aaron; Gibson, Valerie; Jones, Chris R.; Seltzer, Margo
In the last few decades, data-driven methods have come to dominate many fields of scientific inquiry. Open data and open-source software have enabled the rapid implementation of novel methods to manage and analyze the growing flood of data. However, it has become apparent that many scientific fields exhibit distressingly low rates of repeatability and reproducibility. Although there are many dimensions to this issue, we believe that there is a lack of formalism used when describing end-to-end published results, from the data source to the analysis to the final published results. Even when authors do their best to make their research and data accessible, this lack of formalism reduces the clarity and efficiency of reporting, which contributes to issues of reproducibility. Data provenance aids both repeatability and reproducibility through systematic and formal records of the relationships among data sources, processes, datasets, publications and researchers.
Publication Repository Approaches to Improving the Quality of Shared Data and Code
(MDPI AG, 2021-02-03) Trisovic, Ana; Mika, Katherine; Boyd, Ceilyn; Feger, Sebastian; Crosas, Merce
Sharing data and code for reuse has become increasingly important in scientific work over the past decade. However, in practice, shared data and code may be unusable, or published results obtained from them may be irreproducible. Data repository features and services contribute significantly to the quality, longevity, and reusability of datasets. This paper presents a combination of original and secondary data analysis studies focusing on computational reproducibility, data curation, and gamified design elements that can be employed to indicate and improve the quality of shared data and code. The findings of these studies are sorted into three approaches that can be valuable to data repositories, archives, and other research dissemination platforms.