A study of the research front identification method based on machine learning

Li Xin, Wen Yang, Huang Lucheng, Miao Hong

Science Research Management (科研管理) ›› 2021, Vol. 42 ›› Issue (1) : 20-32.

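Abstract

Research fronts are the most promising and forward-looking research directions in the process of scientific and technological innovation. Identifying them early is crucial for scientific research, for the optimal allocation of enterprise R&D resources, and for the forward-looking deployment of government innovation strategies. To address shortcomings in existing work on research front identification, this paper proposes a machine-learning-based identification method. First, the method builds machine learning models to identify potentially highly cited papers, which addresses the time lag of identifying research fronts through citation analysis, and adds these papers to the core document set of highly cited papers used for research front identification. Second, taking the core document set of highly cited papers as the data source, it applies cluster analysis to identify research front topics, then compares and evaluates these topics to identify the research fronts. Finally, an empirical study in the field of solar photovoltaic cells verifies the feasibility and effectiveness of the method, which offers a new approach to research front identification.

Extended abstract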

Research fronts are the most promising and forward-looking research directions in the process of technological innovation. Identifying research fronts early is very important for scientific research, for the optimal allocation of enterprises' R&D resources, and for the formulation of governments' innovation strategies. Faced with a massive amount of data on scientific research results, how to identify research fronts quickly and accurately has become a focus of the academic community. Many scholars have used bibliometric methods to identify research fronts. Citation analysis is one of the most commonly used of these methods, and highly cited papers are regarded as an important data source. However, papers take time to accumulate citations, so existing citation analysis methods cannot incorporate newly published papers, or papers that will be highly cited in the future, into the set of highly cited papers used to identify research fronts. To address this deficiency, this paper proposes a novel framework for identifying research fronts based on machine learning methods. The research steps of this framework are as follows.
First, we used Web of Science (WoS) as the data source to download historical highly cited papers and the references of those papers. Second, we constructed an index system for identifying highly cited papers and calculated the corresponding index values, then divided the resulting data into a training set and a testing set for the machine learning models. Third, we constructed support vector machine (SVM), random forest (RF), and eXtreme Gradient Boosting (XGBoost) models and iteratively tuned the parameters of each model to optimize its performance. Fourth, we downloaded newly published papers from WoS to verify the generalization ability of each machine learning model. We then selected the model with the best generalization ability to predict the future citations of the newly published papers, identified the potentially highly cited papers among them, and incorporated these into the core data set of highly cited papers. Fifth, we used the core data set of highly cited papers as the data source to identify research front topics by applying cluster analysis. Finally, we compared and evaluated the research front topics to identify the research fronts. We selected solar cells as a case study to verify the feasibility and effectiveness of this framework.
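The model-selection step described above (train several candidate models on historical papers, then keep the one that generalizes best to newly published papers) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two learners are dependency-free stand-ins for SVM, RF, and XGBoost, the two feature columns are hypothetical indicators, the toy data are invented, and the F1 score is an assumed measure of generalization ability that the abstract does not name.

```python
from math import dist
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Model = Callable[[Vector], int]  # 1 = predicted highly cited, 0 = not
Learner = Callable[[List[Vector], List[int]], Model]


def fit_nearest_centroid(X: List[Vector], y: List[int]) -> Model:
    """Stand-in learner: assign the class whose centroid is closer."""
    def centroid(label: int) -> List[float]:
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    return lambda x: 1 if dist(x, c1) < dist(x, c0) else 0


def fit_one_nn(X: List[Vector], y: List[int]) -> Model:
    """Stand-in learner: copy the label of the nearest training paper."""
    def predict(x: Vector) -> int:
        nearest = min(range(len(X)), key=lambda i: dist(x, X[i]))
        return y[nearest]
    return predict


def f1(y_true: List[int], y_pred: List[int]) -> float:
    """F1 score for the positive (highly cited) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best(learners: Dict[str, Learner],
                X_train: List[Vector], y_train: List[int],
                X_new: List[Vector], y_new: List[int]) -> Tuple[str, Dict[str, float]]:
    """Fit every learner on historical papers, score it on newer papers."""
    scores = {}
    for name, fit in learners.items():
        model = fit(X_train, y_train)
        scores[name] = f1(y_new, [model(x) for x in X_new])
    return max(scores, key=scores.get), scores


# Hypothetical, already-scaled indicator values (invented toy data).
X_train = [(0.10, 0.20), (0.20, 0.10), (0.30, 0.30),
           (0.80, 0.90), (0.90, 0.70), (0.70, 0.80), (0.45, 0.40)]
y_train = [0, 0, 0, 1, 1, 1, 1]  # the last row acts as a noisy positive
X_new = [(0.20, 0.30), (0.85, 0.80), (0.60, 0.60), (0.40, 0.35)]
y_new = [0, 1, 1, 0]

best, scores = select_best(
    {"nearest_centroid": fit_nearest_centroid, "one_nn": fit_one_nn},
    X_train, y_train, X_new, y_new)
print(best, scores)
```

In a real setting each entry of `learners` would wrap a tuned SVM, RF, or XGBoost model, and the selected model would then score papers too recent to have accumulated citations.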
The results show that the emerging research fronts in the field of solar cells include ternary organic solar cells/ternary polymer solar cells, PbS quantum-dot solar cells, and inverted planar perovskite solar cells. The growing research fronts include non-fullerene polymer solar cells/non-fullerene organic solar cells and CH3NH3PbI3 perovskite solar cells. The research fronts we identified were largely consistent with those reported for the field of solar cells in existing authoritative research reports. In addition, we invited three well-known experts in the field of solar cells to evaluate the identified research fronts, and they largely agreed with the results. This verifies the effectiveness and feasibility of the method proposed in this paper.
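The topic-identification step (cluster analysis over the core set of highly cited papers) can be sketched in a similar spirit. The abstract does not say which clustering algorithm was used, so k-means is an assumption here, as are the bag-of-words vectorization, the explicit seed indices, and the four toy "documents", which simply reuse topic names from the results above and are not the paper's data.

```python
from typing import List, Sequence


def vectorize(docs: List[str]) -> List[List[float]]:
    """Binary bag-of-words vectors over the shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return [[1.0 if w in d.lower().split() else 0.0 for w in vocab]
            for d in docs]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[List[float]], seeds: Sequence[int],
           max_iter: int = 20) -> List[int]:
    """Plain k-means; `seeds` are indices of the initial centroids."""
    centroids = [list(points[i]) for i in seeds]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(len(centroids)),
                          key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Toy stand-ins for paper titles or keywords in the core document set.
docs = [
    "ternary organic solar cells",
    "ternary polymer solar cells",
    "inverted planar perovskite solar cells",
    "CH3NH3PbI3 perovskite solar cells",
]
labels = kmeans(vectorize(docs), seeds=(0, 2))
print(labels)
```

Each resulting cluster is a candidate research front topic; comparing and evaluating those topics, as the framework's final step describes, is a separate analysis not shown here.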

Key words

machine learning / research front / citation analysis / evaluation / identification


Cite this article

Li Xin, Wen Yang, Huang Lucheng, Miao Hong. A study of the research front identification method based on machine learning[J]. Science Research Management. 2021, 42(1): 20-32

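Funding

National Natural Science Foundation of China, General Program: "Research on the formation mechanism of emerging technologies based on multi-source heterogeneous data" (71673018, 2017.01-2020.12).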
