A study of the research front identification method based on machine learning

Abstract

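Abstract: Research fronts are the most promising and forward-looking research directions in the process of scientific and technological innovation. Identifying them early is essential for scientific research, for the optimal allocation of corporate R&D resources, and for the forward-looking deployment of government innovation strategies. To address shortcomings in current work on research front identification, this paper proposes a machine-learning-based identification method. First, a machine learning model is built to identify potentially highly cited papers, which addresses the time lag of identifying research fronts through citation analysis, and these papers are added to the core document set of highly cited papers used for identification. Second, with that core document set as the data source, cluster analysis is used to identify research front topics, which are then compared and evaluated to determine the research fronts. Finally, an empirical study of the solar photovoltaic cell research field verifies the feasibility and effectiveness of the method, which offers a new approach to research front identification.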

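With the rapid development of science and technology and the resulting mass of research output data, identifying research fronts quickly and accurately is essential for scientific research, for the optimal allocation of corporate R&D resources, and for the forward-looking deployment of government innovation strategies.
Many scholars use bibliometric methods to identify research fronts. Citation analysis is one of the most common and important of these methods, and highly cited papers are regarded as an important data source for it. However, citation analysis built on highly cited papers suffers from a time lag. To address this shortcoming, this paper proposes a machine-learning-based method for identifying research fronts and uses the solar cell research field as a case to verify its effectiveness and feasibility. The case study results indicate that the method can effectively address the time lag of citation analysis and can identify research fronts quickly and accurately.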
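The main contribution of this paper is a machine-learning-based method for identifying research fronts. Citation analysis usually takes highly cited papers as its object of study, but a paper needs time to accumulate citations, so existing citation analysis cannot include potentially highly cited papers (recently published papers that will be highly cited over some future period) in the set of highly cited papers used to identify research fronts. This paper therefore uses machine learning to predict a paper's citation count over a future period and thereby identifies potentially highly cited papers, which addresses the time lag of using highly cited papers as the core data set. In addition, the proposed machine learning framework is open: different machine learning algorithms can be applied to a field's historical highly cited papers to learn how paper features, author features, and journal features relate to citation counts. As soon as a new paper in the field is published, its paper, author, and journal features can be collected and fed to the model to estimate its future citations, so that the field's research fronts can be identified effectively. This makes fast, accurate, and intelligent identification of research fronts possible. The machine-learning-based method therefore offers a new approach to research front identification.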
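The workflow described above (learn from historical papers, score new papers, add the predicted ones to the core set) can be sketched as follows. This is an illustration on synthetic data, not the authors' implementation: the indicator names in `FEATURES`, the `synthetic_papers` helper, the labelling rule, and the paper IDs are all hypothetical, and a random forest is used only because it is one of the three models the paper names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical indicators covering paper, author, and journal features.
FEATURES = ["n_references", "n_authors", "first_author_h_index", "journal_impact_factor"]


def synthetic_papers(n):
    """Generate synthetic indicator values and a synthetic 'highly cited' label."""
    X = rng.normal(size=(n, len(FEATURES)))
    # Assumed rule for the toy data: author and journal indicators drive citations.
    score = 1.5 * X[:, 2] + 1.0 * X[:, 3] + rng.normal(scale=0.5, size=n)
    y = (score > 1.0).astype(int)
    return X, y


# Train on "historical" papers whose citation outcome is already known.
X_hist, y_hist = synthetic_papers(500)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_hist, y_hist)

# Score "newly published" papers, for which only the indicators are available.
X_new, _ = synthetic_papers(40)
new_ids = [f"new-{i:03d}" for i in range(len(X_new))]
pred = model.predict(X_new)
potential = {pid for pid, p in zip(new_ids, pred) if p == 1}

# Merge the potentially highly cited papers into the existing core document set.
core_set = {"hist-001", "hist-002"}  # placeholder IDs of known highly cited papers
core_set |= potential
print(f"{len(potential)} potentially highly cited papers added; core set size {len(core_set)}")
```

Treating the task as binary classification (highly cited or not) is a simplifying assumption here; the same structure would work with a regressor that predicts citation counts followed by a threshold.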
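Several aspects of this work remain for future research. First, the indicator system for identifying highly cited papers was compiled from indicators used in earlier studies on predicting highly cited papers, without considering the importance of each indicator; future work could include more indicators and use indicator reduction to find the best combination. Second, this paper uses only three machine learning models: support vector machine, random forest, and extreme gradient boosting. Future work could consider more models and select the one with the best classification accuracy for identifying research fronts.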
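One way to explore the indicator reduction suggested above is recursive feature elimination with cross-validation. The sketch below is an assumption about how that could look, not something the paper reports: it uses scikit-learn's `RFECV` around a random forest, synthetic data in which only the first two indicators carry signal, and hypothetical indicator names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

rng = np.random.default_rng(0)

# Hypothetical indicator names; only the first two are informative in this toy data.
INDICATORS = [
    "first_author_h_index", "journal_impact_factor", "n_references",
    "n_authors", "title_length", "n_keywords",
]

X = rng.normal(size=(300, len(INDICATORS)))
y = (X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=300) > 0).astype(int)

# Drop the least important indicator one at a time and keep the subset
# with the best cross-validated accuracy.
selector = RFECV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    step=1,
    cv=3,
    scoring="accuracy",
).fit(X, y)

kept = [name for name, keep in zip(INDICATORS, selector.support_) if keep]
print("indicators kept:", kept)
print("elimination ranking (1 = kept):", dict(zip(INDICATORS, selector.ranking_.tolist())))
```

The `ranking_` attribute also gives a rough ordering of indicator importance, which speaks to the first limitation noted above.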

Key words: machine learning, research front, citation analysis, evaluation, identification

Abstract: Research fronts are the most promising and forward-looking research directions in the process of technological innovation. Identifying research fronts early is very important for scientific research, for the optimal allocation of enterprises' R&D resources, and for the formulation of governments' innovation strategies. Faced with a massive amount of research output data, how to identify research fronts quickly and accurately has become a focus of the academic community. Many scholars have used bibliometric methods to identify research fronts. Citation analysis is one of the most commonly used of these methods, and highly cited papers are regarded as an important data source. However, papers take time to accumulate citations. Existing citation analysis cannot include newly published papers that will be highly cited in the future in the set of highly cited papers used to identify research fronts. Therefore, to address this deficiency in research front identification, this paper proposes a novel framework for identifying research fronts based on machine learning methods. The research steps of this framework are as follows.
First, we used Web of Science (WoS) as the data source to download historical highly cited papers and their references. Second, we constructed an indicator system for identifying highly cited papers, calculated the corresponding indicator values, and divided the resulting data into a training set and a test set for the machine learning models. Third, we built support vector machine (SVM), random forest (RF), and eXtreme Gradient Boosting (XGBoost) models and tuned their parameters until each model performed as well as possible. Fourth, we downloaded newly published papers from WoS to verify the generalization ability of each model, selected the model with the best generalization ability to predict the future citations of the newly published papers, identified potentially highly cited papers, and added them to the core data set of highly cited papers. Fifth, we used this core data set as the data source to identify research front topics by cluster analysis. Finally, we compared and evaluated the research front topics to identify the research fronts. We selected solar cells as a case study to verify the validity and feasibility of this framework.
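The model-building and selection steps above can be sketched as a small comparison loop. The sketch makes several assumptions that are not from the paper: the data come from `make_classification` rather than WoS, the parameter grids are illustrative, a held-out split stands in for the newly published papers used to check generalization, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the example depends on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the indicator values and highly-cited labels.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Each candidate is an estimator plus a small, illustrative parameter grid.
candidates = {
    "SVM": (make_pipeline(StandardScaler(), SVC()), {"svc__C": [0.1, 1, 10]}),
    "RF": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
    "GB": (  # gradient boosting stand-in for XGBoost
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
    ),
}

scores = {}
for name, (estimator, grid) in candidates.items():
    # Tune on the training set with cross-validation, then score on held-out data.
    search = GridSearchCV(estimator, grid, cv=3, scoring="accuracy").fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, search.predict(X_test))

best_name = max(scores, key=scores.get)
print({name: round(score, 3) for name, score in scores.items()})
print("selected model:", best_name)
```

Accuracy is used as the selection criterion only for simplicity; with few highly cited papers relative to ordinary ones, a metric such as F1 or recall on the highly cited class may be a better fit.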
The results show that the emerging research fronts in the solar cell field include ternary organic solar cells/ternary polymer solar cells, PbS quantum-dot solar cells, and inverted planar perovskite solar cells, while the growing research fronts include non-fullerene polymer solar cells/non-fullerene organic solar cells and CH3NH3PbI3 perovskite solar cells. The research fronts we identified were largely consistent with those reported for the solar cell field in existing authoritative research reports. In addition, we invited three well-known experts in the field to evaluate the identified research fronts, and they largely agreed with the results. This supports the effectiveness and feasibility of the proposed method.

Key words: machine learning, research front, citation analysis, evaluation, identification

李欣, 温阳, 黄鲁成, 苗红. 一种基于机器学习的研究前沿识别方法研究[J]. 科研管理, 2021, 42(1): 20-32.

Li Xin, Wen Yang, Huang Lucheng, Miao Hong. A study of the research front identification method based on machine learning[J]. Science Research Management, 2021, 42(1): 20-32.

