The ongoing artificial intelligence (AI) revolution is poised to reshape almost every line of work. Despite enormous efforts devoted to understanding AI's economic impacts, we lack a systematic understanding of the benefits to scientific research associated with the use of AI.
Drawing from the literature on the future of work and the science of science, here we develop a measurement framework to estimate the direct use of AI and associated benefits in science based on millions of scientific publications and patents.
We find that the use and benefits of AI appear widespread throughout the sciences, growing especially rapidly since 2015. Moreover, papers that use AI exhibit a citation premium: they are more likely to be highly cited both within and outside their disciplines.
Despite the considerable potential for AI to benefit numerous scientific fields, there is a substantial gap between AI education and its application in research, highlighting a misalignment between the supply and demand of AI expertise.
Our analysis also reveals demographic disparities: disciplines with higher proportions of women or Black scientists reap fewer benefits from AI, suggesting that AI's growing impact on research may further exacerbate existing inequalities in science.
As the connection between AI and scientific research deepens, these findings may become increasingly important, with implications for the equity and sustainability of the research enterprise.
To estimate the use and potential benefits of AI for science, we use a variety of datasets that include information regarding scientific publications, patents, course syllabi, and the demographics of researchers.
MAG: We use the Microsoft Academic Graph (MAG) database for publication data. We collect information on 74.6 million publications between 1960 and 2019. These publications are categorized into 19 disciplines (e.g., "computer science") and 292 fields (e.g., "machine learning") under the MAG "field of study" taxonomy.
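As a rough illustration of how records might be restricted to the study window and tallied at the two taxonomy levels described above, consider the following minimal sketch. The `Publication` record, its attribute names, and the sample rows are hypothetical and do not reflect the actual MAG schema; the sketch also assumes one discipline and one field per paper, which is a simplification.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical record layout for illustration only; not the actual MAG schema.
@dataclass(frozen=True)
class Publication:
    paper_id: int
    year: int
    discipline: str  # top taxonomy level, e.g. "computer science"
    field: str       # second taxonomy level, e.g. "machine learning"

# Study window stated in the text (inclusive on both ends).
START_YEAR, END_YEAR = 1960, 2019

def in_window(pub: Publication) -> bool:
    """Return True if the publication year falls inside the study window."""
    return START_YEAR <= pub.year <= END_YEAR

def count_by_level(pubs):
    """Count in-window publications per discipline and per (discipline, field)."""
    kept = [p for p in pubs if in_window(p)]
    by_discipline = Counter(p.discipline for p in kept)
    by_field = Counter((p.discipline, p.field) for p in kept)
    return by_discipline, by_field

# Toy rows, made up for this example.
sample = [
    Publication(1, 1959, "computer science", "machine learning"),  # before window
    Publication(2, 1960, "computer science", "machine learning"),
    Publication(3, 2019, "computer science", "computer vision"),
    Publication(4, 2020, "biology", "genetics"),                   # after window
    Publication(5, 2005, "biology", "genetics"),
]

by_discipline, by_field = count_by_level(sample)
```

Keying the field-level counter on the (discipline, field) pair keeps the two levels linked, so field counts can be rolled up to their parent discipline without a separate lookup table.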
USPTO: We collect information on 7.1 million patents granted between 1976 and 2019 from PatentsView, a data platform based on bulk data from the U.S. Patent and Trademark Office (USPTO).
OSP: We use syllabus data from the Open Syllabus Project (OSP), the world's first large-scale database of university course syllabi. Our syllabus dataset contains 4.2 million English-language syllabi published between 2000 and 2018.
SDR: We use the Survey of Doctorate Recipients (SDR) for de-identified demographic data regarding individuals with a U.S. research doctoral degree in a science, engineering, or health field. We use the 2017 SDR data on scientists and engineers, including the discipline of their doctorate, their sex, and their race and ethnicity.
The MAG data are available at https://zenodo.org/record/6511057. The USPTO patent data are available at https://patentsview.org. The OSP dataset is available from the paper at https://www.pnas.org/doi/10.1073/pnas.1804247115. The SDR data are available at https://www.nsf.gov/statistics/srvydoctoratework, and the datasets used in this study are de-identified, containing only summary statistics for each discipline.
The data and code necessary to reproduce all main plots and statistical analyses are freely available for download.