ENHANCING NIH RESEARCH ON AUTOIMMUNE DISEASE
PREPUBLICATION COPY: Uncorrected Proofs

Appendix H

Topic Analysis of National Institutes of Health Autoimmune Research Grant Abstracts

Authored by: Chris Barousse

INTRODUCTION

In 2019, Congress tasked the National Academies of Sciences, Engineering, and Medicine (NASEM) with evaluating National Institutes of Health (NIH) research on autoimmune disease. The scope of work for the committee included a review of trends in the focus (topics) of autoimmune disease research. The goal of this paper is to provide insight into the most popular research topics associated with 8,470 NIH autoimmune disease research grant abstracts using latent Dirichlet allocation (LDA), a statistical modeling technique used in natural language processing.

BACKGROUND

LDA is a popular, well-documented statistical method used in natural language processing (NLP) settings. LDA groups what the NLP field refers to as a corpus of texts by "latent" topics, which are found by looking at the similarity of the texts' contents (Blei et al., 2003). As of 2021, the original paper describing LDA methodology has been cited 5,714 times. Several software packages in the R statistical language can implement LDA, and this method has been applied specifically to scientific abstracts to analyze funding patterns and trends in research (Bittermann and Fischer, 2018; Park et al., 2016; Porturas and Taylor, 2021).

An LDA model consists of probabilities for each word belonging to each topic, and probabilities of each document belonging to each topic. LDA makes several assumptions about the corpus. First, it assumes that each document is a collection of words and disregards the sequence and grammar of the document; this is called the bag-of-words model. Second, LDA assumes that the corpus contains knowledge about many topics k and that the user has already removed stopwords and words that are either too rare or too common. Stopwords are words that do not provide any meaningful information; in most situations they include pronouns, prepositions, articles, and conjunctions. If words are sparse throughout the corpus, the model will take a long time to search through the corpus for the rare words. Including words that are too common will generate topics that are too similar to each other and make it difficult to tell them apart. LDA sees documents as consisting of one or more words, and words can belong to one or more topics with a different probability of belonging to each topic.

The LDA algorithm is iterative, meaning that the user must decide how many iterations to run on a corpus. Running more iterations increases the chance of finding distinct topics, but running LDA is time-consuming and computationally intensive. First, each word in each document is randomly assigned to a topic. The code then goes through each word w belonging to document d and computes three quantities: the proportion of words in document d that are assigned to topic t, the proportion of assignments to topic t over all the documents that contain the word w, and the updated probability of the word w belonging to topic t. Averaging the probabilities of each word belonging to a specific topic gives the topic probabilities for each document.

Methods

LDA works well to "assess trends in the focus of NIH research and address whether the trends are reflective of the changes in epidemiology as compared to other factors such as availability of research tools and technologies, and emerging biomedical knowledge and concepts" (NASEM, 2021).
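The iterative procedure described in the Background can be illustrated with a small sketch. The code below uses collapsed Gibbs sampling, which is one common way to fit LDA; the appendix does not state which fitting procedure or R package was used, so this should be read as an illustration rather than the method behind the reported results. The function name `fit_lda_gibbs`, the smoothing values `alpha` and `beta`, and the toy corpus are all assumptions made for the example.

```python
import random
from collections import defaultdict


def fit_lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs is a list of tokenized documents (bag of words); k is the number
    of topics. alpha and beta are smoothing values chosen for this example.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * k for _ in docs]                 # words in doc d assigned to topic t
    topic_word = [defaultdict(int) for _ in range(k)]   # times word w is assigned to topic t
    topic_total = [0] * k                               # all words assigned to topic t
    assignments = []

    # Step 1: assign every word occurrence to a random topic.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(k)
            z_d.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(z_d)

    # Step 2: repeatedly revisit each word and resample its topic.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = assignments[d][i]
                # Take the current word out of the counts before resampling.
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1
                # Weight for topic t: (share of doc d in topic t) times
                # (share of topic t's assignments that are word w), smoothed.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(k)
                ]
                t_new = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1

    # Step 3: per-document topic proportions (each row sums to 1).
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + k * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(k)])
    return theta, topic_word


if __name__ == "__main__":
    # Hypothetical, already-cleaned toy corpus (stopwords removed).
    toy_docs = [
        ["insulin", "diabetes", "islet", "insulin"],
        ["diabetes", "islet", "glucose"],
        ["psoriasis", "skin", "keratinocyte"],
        ["skin", "psoriasis", "plaque", "skin"],
    ]
    theta, _ = fit_lda_gibbs(toy_docs, k=2)
    for row in theta:
        print([round(p, 2) for p in row])
```

Because the sampler starts from random assignments, which topic index ends up holding which theme is arbitrary, and results on a corpus this small can vary with the seed. A real analysis would rely on an established LDA implementation rather than a hand-written sampler.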
Given a set of 8,470 NIH research grant abstracts related to autoimmune diseases that were funded between 2008 and 2020, topic modeling using LDA was implemented to discern which topics are prevalent within the abstracts.

In preparing the grant abstracts for analysis, words that were too common in the abstracts, including the words research and studies, were removed. Because there was no predetermined value of k to use, coherence scores were calculated to determine an optimal number of topics (Syed and Spruit, 2017). Coherence is a measure of how similar the words within a topic are and how distinct topics are from each other. Coherence is calculated on a full LDA model; this means that the LDA algorithm was run 60 times to compare 60 values of k. Figure H-1 plots the coherence of models fit using various values of k, the number of topics. A higher coherence score implies that the topics generated using k topics fit the data well. In consultation with the committee, it was decided to use the model specified with 30 topics.

Once the LDA model was chosen, the names of the 30 topics needed to be determined. LDA does not assign names to the topics, and it is common practice to look at the top words assigned to each topic to determine what idea each topic is trying to convey. There is no significance in the numbers associated with each topic or in the order of topics given by the model. Figure H-2 shows the 10 most frequent words assigned to each topic. In consultation with committee members, a name was assigned to each topic. Table H-1 lists the final topic names; some topics were given the same name because they were deemed too similar, and their topic assignments were combined in the later plots.

Final Results

One of the outputs of the LDA model is the theta matrix, which shows the proportion of each abstract assigned to each topic. For example, 30 percent of an abstract may be attributed to topic 1, and 70 percent of it may be attributed to topic 2. Theta has N (the number of abstracts) rows and K (the number of topics) columns, and each row in the matrix sums to 1. For each fiscal year, the proportion of each topic attributed to all abstracts funded that year was summed, and a separate plot was made for each topic. In other words, the y-axis is the proportion of abstracts funded in a given year that was attributed to that topic. Figures H-3 through H-6 group the topics by theme: immune response related, clinical, disease focused, and administrative.

Conclusion

Figures H-3 through H-6 can be used to determine trends in topics over time using the LDA model.
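The yearly aggregation of theta described under Final Results, which underlies these trend plots, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the appendix: the function name `yearly_topic_share`, the toy theta rows, and the fiscal years are hypothetical. Dividing each yearly sum by the number of abstracts funded that year is also an assumption, made here so that the values for each year are proportions that sum to 1. Topics that share a name are combined by adding their columns, as the text says was done for similar topics.

```python
from collections import defaultdict


def yearly_topic_share(theta, fiscal_years, topic_names):
    """Average topic proportions per fiscal year, merging same-named topics.

    theta:        N rows (abstracts) by K columns (topics); each row sums to 1.
    fiscal_years: length-N list with the funding year of each abstract.
    topic_names:  length-K list of labels; repeated labels are combined.
    """
    names = sorted(set(topic_names))
    sums = defaultdict(lambda: dict.fromkeys(names, 0.0))
    counts = defaultdict(int)

    for row, year in zip(theta, fiscal_years):
        counts[year] += 1
        for share, name in zip(row, topic_names):
            # Columns with the same label accumulate into one entry.
            sums[year][name] += share

    # Assumption: normalize by the number of abstracts funded in each year.
    return {
        year: {name: total / counts[year] for name, total in sums[year].items()}
        for year in sorted(sums)
    }


if __name__ == "__main__":
    # Hypothetical theta for three abstracts and three topics.
    theta = [
        [0.30, 0.70, 0.00],
        [0.10, 0.50, 0.40],
        [0.60, 0.20, 0.20],
    ]
    fiscal_years = [2008, 2008, 2009]
    # Topics 1 and 3 carry the same label, so their columns are merged.
    topic_names = ["Type 1 Diabetes", "Psoriasis", "Type 1 Diabetes"]

    shares = yearly_topic_share(theta, fiscal_years, topic_names)
    for year, by_topic in shares.items():
        print(year, {name: round(value, 2) for name, value in by_topic.items()})
```

Plotting one series per merged topic name against fiscal year would then give a trend line for each topic.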
Treatment/therapy, lung disease, diagnostic [tools], imaging, and IBD have trended upward from 2008, in contrast to animal models, genetics, and pathogenesis. Diabetes is consistently prevalent among topics over time. Cancer, multiple sclerosis, cardiovascular, psoriasis, lung disease, and rheumatoid arthritis are also consistent over time but not as popular as diabetes. It is difficult to explain the spikes in popularity in administrative topics. This could be related to NIH funding policies or other funding pattern changes.

One of the downsides of using LDA for topic modeling is that the topics must be discovered "latently" within the texts; the topics cannot be specified in advance as input to the algorithm. Furthermore, it would have been interesting to see how popular specific diseases of interest, including celiac disease, autoimmune thyroid disease (consisting of Hashimoto and Graves' disease), antiphospholipid syndrome, primary biliary cholangitis, and Sjögren's disease, are over time. If the LDA model had been allowed to include a greater number of topics as input, the likelihood of seeing smaller, more specific topics would increase. However, increasing the number of topics could also allow the algorithm to detect groupings of words that are not meaningful as it tries to find more topics within the texts. It can be concluded from the omission of specific diseases from the final list of topics that they are not well represented within the abstracts analyzed.

FIGURE H-1 Coherence scores for k number of topics.

FIGURE H-2 Top 10 words per topic.
SOURCE: NIH, 2021.

TABLE H-1 LDA Generated Topics for NIH Autoimmune Disease Grants, Fiscal Years 2008–2020

Topic Number    Topic Name
1               Immune Response (innate immunity)
2               Immune Response (adaptive immunity [antibodies])
3               Psoriasis
4               Other (non-antibody) Mechanisms of Adaptive Immunity
5               Centers/Core Project Funding
6               Treatment/Therapy
7               Lung Disease
8               Animal Model
9               Adaptive Immunity
10              Inflammatory Response (innate)
11              Type 1 Diabetes
12              Gene Expression
13              Epithelial Barrier
14              Type 1 Diabetes
15              Disease Progression
16              SLE
17              Cancer
18              Rheumatoid Arthritis
19              Cardiovascular
20              Centers/Core Project Funding
21              Adaptive Immunity
22              Training (funding)
23              Genetics
24              Quality of Life
25              Inflammatory Bowel Disease
26              Multiple Sclerosis
27              Pathogenesis
28              Diagnostic
29              Virus (infectious etiology)
30              Imaging

SOURCE: NIH, 2021.

FIGURE H-3 LDA Generated Topics for NIH Autoimmune Disease Grants, FY 2008–2020: Immune Response Related Topics.
SOURCE: NIH, 2021.

FIGURE H-4 LDA Generated Topics for NIH Autoimmune Disease Grants, FY 2008–2020: Clinical Topics.
SOURCE: NIH, 2021.

FIGURE H-5 LDA Generated Topics for NIH Autoimmune Disease Grants, FY 2008–2020: Disease Focused Topics.
SOURCE: NIH, 2021.

FIGURE H-6 LDA Generated Topics for NIH Autoimmune Disease Grants, FY 2008–2020: Administrative Topics.
SOURCE: NIH, 2021.

REFERENCES

Bittermann, A., and A. Fischer. 2018. How to identify hot topics in psychology using topic modeling. Zeitschrift für Psychologie 226(1):3–13.

Blei, D. M., A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3:993–1022.

NASEM (National Academies of Sciences, Engineering, and Medicine). 2021. Assessment of NIH research on autoimmune diseases. https://www.nationalacademies.org/our-work/assessment-of-nih-research-on-autoimmune-diseases#sectionWebFriendly (accessed January 3, 2022).

NIH (National Institutes of Health). 2021. NIH RePORTER database.

Park, J., M. Blume-Kohout, R. Krestel, E. Nalisnick, and P. Smyth. 2016. Analyzing NIH funding patterns over time with statistical text analysis. Association for the Advancement of Artificial Intelligence.

Porturas, T., and R. A. Taylor. 2021. Forty years of emergency medicine research: Uncovering research themes and trends through topic modeling. American Journal of Emergency Medicine 45:213–220.

Syed, S., and M. Spruit. 2017. Full-text or abstract? Examining topic coherence scores using latent Dirichlet allocation. Paper read at 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 19–21 Oct. 2017.