A study of the research front identification method based on machine learning

Abstract

A research front is the most promising and forward-looking research direction in the process of technological innovation. Identifying research fronts early is very important for scientific research, for the optimal allocation of enterprises' R&D resources, and for the formulation of governments' innovation strategies. Faced with a massive amount of research output data, how to identify research fronts quickly and accurately has become a focus of the academic community. Many scholars have used bibliometric methods to identify research fronts. Citation analysis is one of the most commonly used of these methods, and highly cited papers are regarded as an important data source. However, papers take time to accumulate citations, so existing citation analysis methods cannot include newly published papers, or papers that will become highly cited in the future, in the set of highly cited papers used to identify research fronts. To address this deficiency, this paper proposes a novel framework for identifying research fronts based on machine learning methods. The research steps of this framework are as follows.
First, we used Web of Science (WoS) as the data source to download historical highly cited papers and their references. Second, we constructed an index system for identifying highly cited papers and calculated the corresponding index values, then divided the resulting data into a training set and a testing set for the machine learning models. Third, we constructed support vector machine (SVM), random forest (RF), and eXtreme Gradient Boosting (XGBoost) models and iteratively tuned their parameters to optimize each model. Fourth, we downloaded newly published papers from WoS to verify the generalization ability of each model, selected the model with the best generalization ability to predict the future citations of the newly published papers, identified potentially highly cited papers, and incorporated them into the core data set of highly cited papers. Fifth, we used this core data set as the data source to identify research front topics by applying cluster analysis. Finally, the research front topics were compared and evaluated to identify the research fronts. We selected solar cells as a case study to verify the validity and flexibility of this framework.
The results show that the emerging research fronts in the field of solar cells include ternary organic solar cells/ternary polymer solar cells, PbS quantum-dot solar cells, and inverted planar perovskite solar cells; the growing research fronts include non-fullerene polymer solar cells/non-fullerene organic solar cells and CH3NH3PbI3 perovskite solar cells. The research fronts we identified are broadly consistent with those reported for the field of solar cells in existing authoritative research reports. In addition, we invited three well-known experts in the field of solar cells to evaluate the research fronts identified in this paper, and they broadly agreed with the results. This supports the effectiveness and feasibility of the proposed method.
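The fourth step (choosing the model that generalizes best, then adding its predicted highly cited papers to the core data set) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the feature names `journal_impact` and `early_citations`, the two toy threshold rules standing in for the trained SVM, RF, and XGBoost models, the use of the F1 score as the generalization measure, and all the data values. None of these come from the paper.

```python
from typing import Callable, Dict, List, Set, Tuple

Paper = Dict[str, float]          # hypothetical feature values for one paper
Model = Callable[[Paper], bool]   # True means "predicted to be highly cited"
Labelled = List[Tuple[Paper, bool]]


def f1_score(model: Model, labelled: Labelled) -> float:
    """F1 of `model` on (paper, is_highly_cited) pairs."""
    tp = sum(1 for p, y in labelled if model(p) and y)
    fp = sum(1 for p, y in labelled if model(p) and not y)
    fn = sum(1 for p, y in labelled if not model(p) and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best(models: Dict[str, Model], held_out: Labelled) -> str:
    """Name of the model with the highest F1 on held-out papers."""
    return max(models, key=lambda name: f1_score(models[name], held_out))


def augment_core_set(core_ids: Set[str],
                     new_papers: Dict[str, Paper],
                     model: Model) -> Set[str]:
    """Add new papers predicted to be highly cited to the core set."""
    predicted = {pid for pid, feats in new_papers.items() if model(feats)}
    return core_ids | predicted


# Toy stand-ins for trained classifiers (assumption, not the paper's models).
models: Dict[str, Model] = {
    "rule_a": lambda p: p["journal_impact"] > 5.0,
    "rule_b": lambda p: p["early_citations"] >= 3,
}

# Made-up held-out papers with known outcomes.
held_out: Labelled = [
    ({"journal_impact": 8.0, "early_citations": 5}, True),
    ({"journal_impact": 6.0, "early_citations": 0}, False),
    ({"journal_impact": 2.0, "early_citations": 4}, True),
    ({"journal_impact": 1.0, "early_citations": 1}, False),
]

# Made-up newly published papers with no citation history yet.
new_papers: Dict[str, Paper] = {
    "new-1": {"journal_impact": 9.0, "early_citations": 1},
    "new-2": {"journal_impact": 3.0, "early_citations": 6},
}

best = select_best(models, held_out)
core = augment_core_set({"old-1", "old-2"}, new_papers, models[best])
print(best, sorted(core))
```

In this toy run the second rule classifies every held-out paper correctly, so it is selected, and only the new paper it flags joins the core set that would then be passed to cluster analysis.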

Key words: machine learning, research front, citation analysis, evaluation, identification

Li Xin, Wen Yang, Huang Lucheng, Miao Hong. A study of the research front identification method based on machine learning[J]. Science Research Management, 2021, 42(1): 20-32.
