Journal of Computer Science and Artificial Intelligence

Employing Statistical Machine Reading for Inferring Key Concepts of A Research Field from a Body of Abstracts and Blog Posts
Author(s): Wenfa Ng

The world of science is drowning in a wealth of information. Making sense of this mass of published articles, blog posts and abstracts has become an important challenge given the importance of science to many aspects of societal function. At the crux of the issue is the growing extent to which scientific discovery informs decision making at the societal level. For example, elucidation of the ozone hole led to the Montreal Protocol in 1987, and documentation of rising atmospheric carbon dioxide concentration led to climate action and the adoption of the Paris Agreement in 2015. Hence, understanding a research field has become an important need for decision makers across different sectors of society. But the scientific literature is cryptic and esoteric, and presents a significant barrier to comprehension. One approach to ameliorating the problem is statistical machine reading, which provides the critical capability of identifying the key concepts that underpin a research field. Such concepts provide an entry point for gaining further understanding of the field and for initiating further conversation about it. This work sought to test whether applying statistical machine reading to a body of literature comprising short blog posts and abstracts of published articles helps in understanding the field of metabolic engineering. One important angle pursued in this research is whether the tabulated list of terms and phrases identified by statistical machine reading could be creatively analyzed to gain a deeper understanding of the research field. For example, the most frequently occurring terms and phrases could describe key concepts of the research field. Further down the frequency ranking would be terms and phrases that describe methodologies and approaches of the field. Finally, less frequently occurring terms and phrases may be tools and resources used in the research field.
Results validated the utility of statistical machine reading in identifying important terms and phrases associated with the research field. However, the small dataset of blog posts and abstracts used in this study severely hampered the identification of most of the key concepts of metabolic engineering, which is a fairly broad field of research. Overall, statistical machine reading shows utility in identifying terms and phrases that could describe a field, but the level of understanding gained is closely tied to the breadth and depth of the reading material available, which means that the methodology is data intensive in nature. Future use of supercomputing or quantum computing could help alleviate constraints of computational capacity, and help tackle the exponential rise in computational complexity as the size of the reading material for machine reading expands.
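
The frequency-tiered reading of the term list described above can be illustrated with a minimal sketch. The abstract does not name the software used, so this stands in plain n-gram frequency counting for statistical machine reading. The stop-word list, the function names (`tokenize`, `count_terms`, `tier_terms`) and the tier cut-offs are all hypothetical choices made for illustration, not the author's method.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for",
              "is", "are", "by", "with", "on", "that", "this"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens, allowing inner hyphens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def count_terms(documents, max_n=2):
    """Count terms (1 word) and phrases (up to max_n words) across documents.

    N-grams that start or end with a stop word are skipped.
    """
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts


def tier_terms(counts, top_fraction=0.2, middle_fraction=0.3):
    """Split terms, ranked by frequency, into three bands.

    The fractions are illustrative assumptions; the abstract only says that
    the most frequent terms may be key concepts, the next band methodologies,
    and the least frequent terms tools and resources.
    """
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = [term for term, _ in
              sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    top_end = max(1, round(len(ranked) * top_fraction)) if ranked else 0
    middle_end = top_end + round(len(ranked) * middle_fraction)
    return {
        "concepts": ranked[:top_end],
        "methods": ranked[top_end:middle_end],
        "tools": ranked[middle_end:],
    }
```

Calling `tier_terms(count_terms(abstracts))` on a list of abstract strings returns three candidate lists for manual inspection. With only a handful of documents most terms occur once or twice, so the bands are largely arbitrary, which is consistent with the limitation the abstract reports for its small dataset.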