Science Research Management, 2021, Vol. 42, Issue (1): 20-32.

• Article •

A study of the research front identification method based on machine learning

Li Xin, Wen Yang, Huang Lucheng, Miao Hong

  1. College of Economics and Management, Beijing University of Technology, Beijing 100124, China
  • Received: 2020-11-07 Revised: 2020-12-03 Online: 2021-01-20 Published: 2021-01-22
  • Corresponding author: Li Xin
  • Funding:
    General Program of the National Natural Science Foundation of China: "Research on the formation mechanism of emerging technologies based on multi-source heterogeneous data" (71673018, 2017.01-2020.12).

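Abstract: Research fronts are the most promising and forward-looking research directions in the process of scientific and technological innovation, and identifying them early is crucial for scientific research, for the optimal allocation of enterprise R&D resources, and for the forward-looking deployment of government innovation strategies. To address the shortcomings of existing work on research front identification, this paper proposes a research front identification method based on machine learning. First, the method builds machine learning models to identify potentially highly cited papers, which addresses the time-lag problem of using citation analysis to identify research fronts, and adds these papers to the core document set of highly cited papers used for research front identification. Second, taking the core document set of highly cited papers as the data source, it uses cluster analysis to identify research front topics, then compares and evaluates these topics to identify the research fronts. Finally, an empirical study in the field of solar photovoltaic cell research verifies the feasibility and effectiveness of the method, which offers a new approach to research front identification.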

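With the rapid development of science and technology and the massive volume of research output data, identifying research fronts quickly and accurately is crucial for scientific research, for the optimal allocation of enterprise R&D resources, and for the forward-looking deployment of government innovation strategies.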
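  Many scholars have used bibliometric methods to identify research fronts. Citation analysis is one of the most commonly used and most important of these methods, and highly cited papers are regarded as an important data source for identifying research fronts. However, because citation analysis takes highly cited papers as its object of study, it identifies research fronts only with a time lag. To address this shortcoming, this paper proposes a research front identification method based on machine learning and verifies its effectiveness and feasibility using the solar cell research field as an example. The case study results indicate that the method can effectively address the time-lag problem of citation analysis and can identify research fronts quickly and accurately.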
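  The main contribution of this paper is a research front identification method based on machine learning. When citation analysis is used to identify research fronts, highly cited papers are usually the object of study, but a paper needs time to accumulate citations, so existing citation analysis cannot include potentially highly cited papers (recently published papers that will be highly cited over some future period) in the set of highly cited papers used to identify research fronts. This paper therefore uses machine learning to predict a paper's citation count over a future period, which makes it possible to identify potentially highly cited papers and addresses the time-lag problem of using highly cited papers as the core data set for research front identification. In addition, the proposed machine learning framework is open: different machine learning algorithms can be used to analyze the historical highly cited papers of a field and to learn how paper features, author features, and journal features relate to citation counts. As soon as a new paper in the field is published, its paper, author, and journal feature data can be collected and passed to the machine learning model to estimate its future citations, so that the field's research fronts can be identified effectively. This opens the way to fast, accurate, and intelligent identification of research fronts. The machine-learning-based method thus provides a new approach for research on research front identification.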
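  The first stage described above (labelling historical highly cited papers and preparing data for a classifier) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `PaperRecord` fields, the `top_fraction` cutoff, and the `test_ratio` value are hypothetical choices made for illustration, since the indicator system and thresholds are not given in this text.

```python
import random
from dataclasses import dataclass


@dataclass
class PaperRecord:
    # Hypothetical indicators, one per feature group named in the text.
    n_references: int       # paper feature (assumed)
    author_h_index: float   # author feature (assumed)
    journal_impact: float   # journal feature (assumed)
    citations: int          # observed citations (historical papers only)


def label_highly_cited(papers, top_fraction=0.1):
    """Label papers 1 if their citation count is in the top `top_fraction`, else 0.

    Papers tied with the cutoff value are all labelled 1.
    """
    if not papers:
        return []
    ranked = sorted((p.citations for p in papers), reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    threshold = ranked[k - 1]
    return [1 if p.citations >= threshold else 0 for p in papers]


def train_test_split(rows, labels, test_ratio=0.3, seed=0):
    """Shuffle indices reproducibly and split rows/labels into train and test parts."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return (
        [rows[i] for i in train_idx],
        [labels[i] for i in train_idx],
        [rows[i] for i in test_idx],
        [labels[i] for i in test_idx],
    )
```

  In this sketch a trained classifier would then be applied to newly published papers, for which `citations` is not yet meaningful and only the three feature groups are available.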
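  Several aspects of this work remain for further study. First, the indicator system for identifying highly cited papers was compiled from the indicators used in earlier studies on predicting highly cited papers, without considering the importance of each indicator; future work could include more indicators and use indicator reduction to obtain the best combination. Second, this paper selected only three machine learning models, namely support vector machine, random forest, and extreme gradient boosting; future work could consider more machine learning models and choose the one with the best classification accuracy for identifying research fronts.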
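  Choosing among candidate models by classification accuracy, as discussed above, could look roughly like the following Python sketch. It assumes each trained model is wrapped as a callable that maps a feature vector to a 0/1 label; the function names `accuracy` and `select_best_model` are hypothetical and not taken from the paper, and any real SVM, random forest, or XGBoost model would have to be trained separately before being passed in.

```python
def accuracy(model, X, y):
    """Fraction of samples for which the model's predicted label equals the true label."""
    if not y:
        raise ValueError("cannot compute accuracy on an empty data set")
    correct = sum(1 for xi, yi in zip(X, y) if model(xi) == yi)
    return correct / len(y)


def select_best_model(models, X_holdout, y_holdout):
    """Score every candidate on held-out data and return (best_name, all_scores).

    `models` maps a model name to a callable: feature vector -> 0/1 label.
    """
    scores = {name: accuracy(m, X_holdout, y_holdout) for name, m in models.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores


if __name__ == "__main__":
    # Toy stand-ins for trained classifiers (illustrative only).
    candidates = {
        "threshold_rule": lambda x: 1 if x[0] > 0.5 else 0,
        "always_negative": lambda x: 0,
    }
    X = [[0.9], [0.1], [0.8], [0.2]]
    y = [1, 0, 1, 0]
    print(select_best_model(candidates, X, y))
```

  Because the candidates are passed in as a dictionary, adding further models in future work would only require adding entries, which mirrors the open framework described in the text.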

Abstract: Research fronts are the most promising and forward-looking research directions in the process of technological innovation. Identifying research fronts early is very important for scientific research, for the optimal allocation of enterprises' R&D resources, and for the formulation of governments' innovation strategies. Faced with a massive amount of scientific research output data, how to identify research fronts quickly and accurately has become a focus of the academic community. Many scholars have used bibliometric methods to identify research fronts. Citation analysis is one of the most commonly used methods, and highly cited papers are regarded as an important data source. However, papers take time to accumulate citations, so existing citation analysis methods cannot include newly published papers that will be highly cited in the future in the set of highly cited papers used to identify research fronts. To address this deficiency, this paper proposes a novel framework for identifying research fronts based on machine learning methods. The research steps of this framework are as follows.
   First, we used Web of Science (WoS) as the data source to download historical highly cited papers and their references. Second, we constructed an indicator system for identifying highly cited papers and calculated the corresponding indicator values, then divided the resulting data into a training set and a test set for the machine learning models. Third, we built support vector machine (SVM), random forest (RF), and eXtreme Gradient Boosting (XGBoost) models and iteratively tuned their parameters to optimize each model. Fourth, we downloaded newly published papers from WoS to verify the generalization ability of each model, selected the model with the best generalization ability to predict the future citations of the newly published papers and identify potentially highly cited papers, and incorporated these papers into the core data set of highly cited papers. Fifth, we used the core data set of highly cited papers as the data source to identify research front topics by cluster analysis. Finally, we compared and evaluated the research front topics to identify the research fronts. We selected solar cells as a case study to verify the validity and feasibility of this framework.
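   The cluster analysis step can be illustrated with a small Python sketch. The clustering algorithm and the features it operates on are not specified here, so the sketch makes its own assumptions: each paper is represented by a set of keywords, similarity is the Jaccard index, and papers are grouped by single linkage above a hypothetical `min_similarity` threshold. It is a stand-in for illustration, not the procedure used in the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_keywords(keyword_sets, min_similarity=0.3):
    """Group paper indices whose keyword sets are linked by sufficient overlap.

    Uses union-find, so two papers end up in the same cluster if a chain of
    pairwise-similar papers connects them (single linkage).
    """
    parent = list(range(len(keyword_sets)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(keyword_sets)):
        for j in range(i + 1, len(keyword_sets)):
            if jaccard(keyword_sets[i], keyword_sets[j]) >= min_similarity:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(keyword_sets)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Toy keyword sets (illustrative only, not data from the study).
    papers = [
        {"perovskite", "stability"},
        {"perovskite", "stability", "film"},
        {"PbS", "quantum dot"},
        {"PbS", "quantum dot", "ligand"},
    ]
    print(cluster_by_keywords(papers))
```

   Each resulting cluster would correspond to one candidate research front topic, to be compared and evaluated in the final step.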
    The research results show that the emerging research fronts in the field of solar cells include ternary organic solar cells/ternary polymer solar cells, PbS quantum-dot solar cells, and inverted planar perovskite solar cells; the growing research fronts include non-fullerene polymer solar cells/non-fullerene organic solar cells and CH3NH3PbI3 perovskite solar cells. The research fronts we identified were broadly consistent with those reported for the field of solar cells in existing authoritative research reports. In addition, we invited three well-known experts in the field of solar cells to evaluate the identified research fronts, and they broadly agreed with the results. This verifies the effectiveness and feasibility of the method proposed in this paper.

Key words: machine learning, research front, citation analysis, evaluation, identification