Science Research Management ›› 2021, Vol. 42 ›› Issue (1): 20-32.

A study of the research front identification method based on machine learning

Li Xin, Wen Yang, Huang Lucheng, Miao Hong   

  1. College of Economic and Management, Beijing University of Technology, Beijing 100124, China
  • Received: 2020-11-07 Revised: 2020-12-03 Online: 2021-01-20 Published: 2021-01-22

Abstract: Research fronts are the most promising and forward-looking research directions in the process of technological innovation. Identifying research fronts early is very important for scientific research, for the optimal allocation of enterprises' R&D resources, and for the formulation of governments' innovation strategies. Faced with a massive amount of data on scientific research results, how to identify research fronts quickly and accurately has become a focus of the academic community. Many scholars have used bibliometric methods to identify research fronts. Citation analysis is one of the most commonly used of these methods, and highly cited papers are regarded as an important data source. However, papers take time to accumulate citations. Existing citation analysis methods therefore cannot incorporate newly published papers, or papers that will become highly cited in the future, into the set of highly cited papers used to identify research fronts. To address this deficiency, this paper proposes a novel framework for identifying research fronts based on machine learning methods. The research steps of this framework are as follows.
   First, we used Web of Science (WoS) as the data source to download historical highly cited papers and their references. Second, we constructed an index system for identifying highly cited papers, calculated the corresponding index values, and divided the resulting data into a training set and a test set for the machine learning models. Third, we constructed support vector machine (SVM), random forest (RF), and eXtreme Gradient Boosting (XGBoost) models and iteratively tuned the parameters of each model to optimize its performance. Fourth, we downloaded newly published papers from WoS to verify the generalization ability of each model, selected the model with the best generalization ability to predict the future citations of the newly published papers, identified potentially highly cited papers, and incorporated them into the core data set of highly cited papers. Fifth, we used this core data set as the data source to identify research front topics by cluster analysis. Finally, we compared and evaluated the research front topics to identify the research fronts. We selected solar cells as a case study to verify the validity and flexibility of this framework.
    The results show that the emerging research fronts in the field of solar cells include ternary organic solar cells/ternary polymer solar cells, PbS quantum-dot solar cells, and inverted planar perovskite solar cells; the growing research fronts include non-fullerene polymer solar cells/non-fullerene organic solar cells and CH3NH3PbI3 perovskite solar cells. The research fronts we identified were largely consistent with those reported for the field of solar cells in existing authoritative research reports. In addition, we invited three well-known experts in the field of solar cells to evaluate the research fronts identified in this paper, and they largely agreed with the results. This verifies the effectiveness and feasibility of the method proposed in this paper.
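
The model-selection and core-set steps described in the abstract (fit several candidate classifiers, keep the one that generalizes best, then add the papers it predicts to become highly cited to the core data set) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The abstract does not list the indexes or the evaluation metric, so the two feature columns, the accuracy metric, the paper IDs, and the helper names `ThresholdModel`, `select_best_model`, and `extend_core_set` are all assumptions. `ThresholdModel` is a toy stand-in for the SVM, RF, and XGBoost models; the helpers only rely on `fit(X, y)` and `predict(X)`, which scikit-learn estimators and `xgboost.XGBClassifier` also provide.

```python
from dataclasses import dataclass


@dataclass
class ThresholdModel:
    """Toy stand-in classifier (hypothetical): predicts 1 when one feature
    reaches a threshold learned from the training data."""
    feature: int
    threshold: float = 0.0

    def fit(self, X, y):
        # Threshold = midpoint between the two class means of the chosen feature.
        pos = [row[self.feature] for row, label in zip(X, y) if label == 1]
        neg = [row[self.feature] for row, label in zip(X, y) if label == 0]
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, X):
        return [1 if row[self.feature] >= self.threshold else 0 for row in X]


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly (assumed metric)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def select_best_model(models, X_train, y_train, X_val, y_val):
    """Fit every candidate and return (name, model, score) of the best one."""
    scored = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        scored.append((accuracy(y_val, model.predict(X_val)), name, model))
    best_score, best_name, best_model = max(scored, key=lambda s: s[0])
    return best_name, best_model, best_score


def extend_core_set(core_ids, model, new_ids, X_new):
    """Add papers predicted to become highly cited (label 1) to the core set."""
    predicted = model.predict(X_new)
    return set(core_ids) | {pid for pid, label in zip(new_ids, predicted) if label == 1}


# Made-up data: label 1 = highly cited, 0 = not; both feature columns are hypothetical.
X_train = [[9.0, 3], [8.0, 40], [7.5, 12], [2.0, 35], [1.5, 5], [3.0, 20]]
y_train = [1, 1, 1, 0, 0, 0]
X_val = [[8.5, 10], [2.5, 30], [6.0, 25], [1.0, 8]]
y_val = [1, 0, 1, 0]

candidates = {"feature_0": ThresholdModel(0), "feature_1": ThresholdModel(1)}
best_name, best_model, best_score = select_best_model(
    candidates, X_train, y_train, X_val, y_val
)

# Newly published papers whose future citation status is unknown.
new_ids = ["P1", "P2", "P3"]
X_new = [[7.0, 15], [1.2, 50], [5.5, 2]]
core = extend_core_set({"H1", "H2"}, best_model, new_ids, X_new)
```

In a real pipeline, the `candidates` dictionary would hold the tuned SVM, RF, and XGBoost estimators, and the resulting `core` set would feed the cluster analysis step.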

Key words: machine learning, research front, citation analysis, evaluation, identification