Science Research Management ›› 2022, Vol. 43 ›› Issue (1): 176-183.


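A keyword analysis method for research achievements based on time series clustering

Li Hailin, Lin Chunpei

  1. College of Business Administration, Huaqiao University, Quanzhou 362021, Fujian, China
  • Received: 2018-12-15 Revised: 2019-05-24 Publication date: 2022-01-20 Release date: 2022-01-19
  • Corresponding author: Lin Chunpei
  • Funding:
    National Natural Science Foundation of China (71771094, 2018.01 to 2021.12); Natural Science Foundation of Fujian Province (2019J01010557, 2019.05 to 2021.05).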

An analysis of keywords of research achievements based on time series clustering

Li Hailin, Lin Chunpei   

  1. College of Business Administration, Huaqiao University, Quanzhou 362021, Fujian, China
  • Received: 2018-12-15 Revised: 2019-05-24 Online: 2022-01-20 Published: 2022-01-19
  • Supported by:
    A mechanism research of Knowledge modular for generalized manufacturing system oriented industry cluster

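Abstract (Chinese version): Traditional methods of studying the keywords of research achievements are strongly affected by subjective judgment and give little consideration to time. To address these problems, this paper proposes a keyword analysis method for research achievements based on time series clustering. The method first uses statistical analysis to verify that the order in which keywords appear reflects, to some extent, how important each keyword is in expressing the main theme. It then converts keyword importance into time series data and, from the two perspectives of the value and the trend of importance, uses dynamic time warping to measure the similarity between keyword importance time series. Affinity propagation is then applied to cluster the similarity matrix of the keyword time series, thereby realizing keyword analysis of research achievements. A study of the keywords of papers published in an important science research management journal from 2008 to 2017 shows that the new method can cluster keywords by the level and trend of attention they receive, adaptively find central keywords as representative objects of the corresponding clusters, and provide theoretical methods and decision support for topic analysis of research achievement keywords.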

Keywords: keyword analysis, research achievements, time series, cluster analysis, affinity propagation

Abstract: Keywords are an important part of the scientific research literature. A small set of keywords can clearly describe key aspects of a research achievement, including the main research objects, the problems to be solved, the methods used, and the conclusions, and to some extent they also reflect its research theme. Academic papers are currently the main form of scientific research achievement. By analyzing the keywords of academic papers, we can identify the themes that a research field is concerned with at a given time and how those themes evolve. Keyword-based topic analysis requires establishing similarities between keywords and combining them with a clustering method to find and track hot topics. To establish the similarity between keywords, co-word analysis is often used. It assumes that keywords appearing in the same document are related and uses co-occurrence statistics to define similarity: the more often two keywords appear together, the greater their similarity. Hierarchical clustering is one of the most commonly used methods in scientific literature analysis. It reveals the similarity of keywords or themes intuitively and so helps to divide hot issues and topics. However, hierarchical clustering requires the subject categories to be delimited manually, which makes the clustering results vulnerable to subjective factors. In addition, the traditional approach considers only the frequency and position of keywords and divides all keywords in a general way based on statistics, neglecting the role of time in keyword classification.
In view of the limitations of traditional methods, which fail to account for the time factor of keywords, this paper proposes a keyword analysis method for research achievements based on time series clustering.
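For orientation, the co-word similarity used by the traditional approach can be sketched as a simple pair count. This is a minimal illustration, not the authors' code: the function name `co_occurrence` and the input layout (one keyword list per paper) are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(keyword_lists):
    """Count how often each unordered keyword pair shares a paper.

    keyword_lists: one list of keywords per paper (hypothetical layout).
    Returns a Counter keyed by alphabetically sorted keyword pairs.
    """
    counts = Counter()
    for keywords in keyword_lists:
        # Deduplicate and sort so each pair is counted once per paper.
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts
```

A higher count means a higher co-word similarity. Note that the count carries no information about when the papers were published, which is the gap the proposed method addresses.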
    First, keywords are collected from academic papers on a given topic, parsed, and stored in a database. The importance of each keyword is obtained from its rank order and its frequency across all relevant papers, with weights calculated according to the order. The distribution of keyword importance over time then forms a time series of importance values for each keyword. Second, dynamic time warping (DTW) is used to measure the distance between keyword time series, and the resulting distance matrix is transformed into a similarity matrix that serves as the input to affinity propagation (AP) clustering. Both AP, based on the similarity matrix, and the hierarchical clustering method of traditional literature analysis, based on the distance matrix, are used to cluster the keyword time series. Because keyword time series with the same trend are clustered together, keyword trends are highly similar within a cluster and dissimilar between clusters. We also compare the clustering results obtained by the two methods for keyword analysis. Finally, the clustering results are combined with visualization techniques to analyze the keywords of scientific research achievements.
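The first two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the weighting formula, so a reciprocal-rank weight (1/rank) is assumed here, and the names `importance_series` and `dtw_distance` are hypothetical. The DTW routine is the textbook dynamic-programming form with absolute difference as the local cost.

```python
from collections import defaultdict


def importance_series(papers, years):
    """Build one importance time series per keyword.

    papers: list of (year, [keywords in the author's order]).
    years:  ordered list of years that defines the time axis.
    """
    index = {year: t for t, year in enumerate(years)}
    series = defaultdict(lambda: [0.0] * len(years))
    for year, keywords in papers:
        for rank, keyword in enumerate(keywords, start=1):
            # Assumption: earlier keywords weigh more (1/rank).
            series[keyword][index[year]] += 1.0 / rank
    return dict(series)


def dtw_distance(x, y):
    """Classic dynamic time warping distance between two sequences."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[n][m]
```

Calling `dtw_distance` on every pair of keyword series yields the distance matrix that the clustering step consumes.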
    The keywords of research papers published in an important journal of scientific and technological innovation management from 2008 to 2017 are taken as the research object. The effectiveness of the proposed method is tested, and its application to keyword analysis of scientific research achievements is further elaborated. On the one hand, the proposed method adaptively determines the division into clusters while taking into account the influence of time on that division. It places keywords with similar trends in the same cluster. Keywords within a cluster may interact with or promote one another, so close relationships between them can be found. More importantly, they may jointly describe the same topic. The effect of the proposed method on keyword analysis can thus be tested and observed from the new perspective of change over time. On the other hand, the trends of the main keywords in the papers published by the target journal are analyzed. This reveals the journal's main research topics from 2008 to 2017, the trends of the main keywords and the topics to which they belong, and the overall trend of these topic categories. This knowledge helps to identify the degree of attention paid to the published topics and how they evolve.
    The main contributions of this study are as follows: (1) Authors usually list the more important keywords first, which suggests that the order of keywords reflects their importance. This study confirms the conjecture that the importance of a keyword is related to the order in which the author lists it. Keyword weights are designed according to the order of the keywords given in the literature and converted into time series data, so that the correlation between keywords is studied from the perspective of time series. (2) Affinity propagation clusters the keyword time series data adaptively, which avoids the subjective influence on the clustering results of specifying the number of clusters in advance. The center object of each cluster represents the theme reflected by all of its members, which provides a theoretical basis for topic extraction and evolutionary analysis. (3) Dynamic time warping is used to measure the similarity of keyword importance time series. The changing themes are studied from the perspectives of numerical value and morphology, respectively. Furthermore, the numerical characteristics and trends of different clusters can be analyzed, which provides a feature description method and a visualization technique for scientific research topic analysis.
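To make contribution (2) concrete, the sketch below runs a plain affinity propagation on a precomputed distance matrix. It is an illustration under assumptions, not the authors' code: similarity is taken as the negated distance, the shared preference is the median off-diagonal similarity, and the damping factor and iteration count are arbitrary choices. The toy distance matrix stands in for a DTW distance matrix and is invented for the example.

```python
from statistics import median


def distance_to_similarity(dist):
    """Assumption: similarity is simply the negated distance."""
    return [[-d for d in row] for row in dist]


def affinity_propagation(sim, damping=0.5, iterations=200):
    """Return (exemplars, labels) for a square similarity matrix."""
    n = len(sim)
    s = [list(row) for row in sim]
    # Assumption: every item shares one preference, the median
    # off-diagonal similarity, so no cluster count is set by hand.
    pref = median(s[i][k] for i in range(n) for k in range(n) if i != k)
    for i in range(n):
        s[i][i] = pref
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        for i in range(n):
            for k in range(n):
                rival = max(a[i][j] + s[i][j] for j in range(n) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        for k in range(n):
            pos = [max(0.0, r[j][k]) for j in range(n)]
            total = sum(pos)
            for i in range(n):
                if i == k:
                    new = total - pos[k]
                else:
                    new = min(0.0, r[k][k] + total - pos[i] - pos[k])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:  # no exemplar emerged within the iteration budget
        return [], [-1] * n
    labels = [i if i in exemplars else max(exemplars, key=lambda e: s[i][e])
              for i in range(n)]
    return exemplars, labels


if __name__ == "__main__":
    # Toy stand-in for a DTW distance matrix: two well-separated groups.
    points = [0, 10, 20, 100, 110, 120]
    toy_dist = [[abs(p - q) for q in points] for p in points]
    print(affinity_propagation(distance_to_similarity(toy_dist)))
```

Each label is the index of the exemplar that represents the item, which corresponds to the "center object" of a cluster in the text. The number of clusters follows from the preference value rather than being specified directly.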

Key words: keyword analysis, research achievement, time series, clustering analysis, affinity propagation