基于深度学习的我国科技政策属性识别

李牧南, 王良, 赖华鹏

科研管理 ›› 2024, Vol. 45 ›› Issue (2): 1-11. DOI: 10.19571/j.cnki.1000-2995.2024.02.001

Identification of China's S&T policy properties based on deep learning


摘要

当前基于深度学习算法的文本分析更多聚焦于微博、评论和新闻头条为代表的舆情监测和情感分析等短文本信息处理,而针对各类政策文本、论文和专利全文的属性识别和长文本分类等相关研究较少,存在一定拓展空间。与传统的机器学习模型相比,深度学习在自然语言处理和文本特征提取方面具有显著优势,其可通过预训练语言模型降低特征工程的人工干预,从而在政策属性和政策工具识别等领域具有较好的应用前景。本文针对我国科技政策属性(引导型、强制型和鼓励型)的自动识别问题,导入当前流行的几种深度学习模型进行了对比分析。与此同时,本文还针对政策文本的取词长度、数据增强和文本信息量估算等关联计算问题也进行了理论解析,从而进一步丰富了深度学习模型在科学计量,尤其是科技政策文本分析领域的应用。理论和实证分析结果显示,经过基于EDA(Easy Data Augmentation)方法的文本数据增强之后,当前几种代表性的深度学习模型在面向较为抽象的科技政策属性识别问题上均显著提升了处理能力,其中EDA+Bi-LSTM-Attention的识别准确率超过88%,其他参与实验的深度学习模型(TextCNN、Bi-LSTM、RCNN、CapsNet和FastText等)在文本增强之后的平均识别率也超过了80%;但是,文本取词长度从500词增加到2000词对中文科技政策属性识别的效果提升不显著。本文的研究对于科技政策属性自动识别、中文长文本分类和政策工具识别等科技管理相关量化分析具有一定的启示意义和参考价值。

Abstract

Text analysis based on deep learning currently focuses on short-text processing tasks such as public-opinion monitoring and sentiment analysis of microblogs, online comments and news headlines, while relatively little research addresses property identification and long-text classification for policy texts and the full texts of papers and patents, leaving considerable room for exploration. Compared with traditional machine learning models, deep learning models have significant advantages in natural language processing (NLP) and text feature extraction: pre-trained language models reduce the manual intervention required by feature engineering, which makes deep learning promising for tasks such as policy property identification and policy-instrument recognition. This paper addresses the automatic identification of the properties of China's science and technology (S&T) policies, with policy properties divided into three types: guiding, compulsory and encouraging. Several popular deep learning models were imported for comparative analysis. The paper also provides a theoretical analysis of related computational issues, including (1) the impact of different policy-text interception lengths on property identification; (2) the effect of augmenting the text data; and (3) the estimation of the information content of policy texts. To further enrich the application of deep learning models in scientometrics and informetrics, especially in the text analysis of S&T policies, experiments on property identification were conducted on S&T policies issued by Chinese local governments, using deep learning models that are popular in recent text-classification studies.
The theoretical and empirical analysis showed that, after data augmentation based on the EDA (Easy Data Augmentation) method, which had previously demonstrated excellent performance on English text in related studies, all of the representative deep learning models significantly improved their capacity for identifying the properties of S&T policies. The identification accuracy of EDA+Bi-LSTM-Attention exceeded 88%, and the average accuracy of the other deep learning models (TextCNN, Bi-LSTM, RCNN, CapsNet and FastText, etc.) also exceeded 80% after EDA-based text augmentation. However, increasing the text interception length from 500 words to 2,000 words had no significant effect on identifying the properties of Chinese S&T policies; this result may also be useful for subsequent studies on policy-text analysis, since it implies that the full text of a policy may be unnecessary in similar long-text processing tasks. This research offers insights and reference value for quantitative analyses in science and technology management, such as the automatic identification of S&T policy properties, Chinese long-text classification and the identification of policy instruments. Meanwhile, because the policy-text corpus was limited, the conclusions may be open to debate: whether the deep learning models discussed here remain effective on other policy corpora, e.g., energy, environmental or financial policies, should be further explored in future work.
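The EDA method referenced above combines four lightweight token-level operations: synonym replacement, random insertion, random swap and random deletion. As a minimal illustrative sketch (not the authors' implementation), the two lexicon-free operations can be written as follows; synonym replacement and random insertion would additionally require a synonym resource, e.g., a Chinese thesaurus such as Tongyici Cilin.

```python
import random

def random_swap(words, n_swaps, rng=random):
    """Return a copy of `words` with `n_swaps` random position swaps."""
    out = list(words)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(out)), rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng=random):
    """Drop each word independently with probability `p`; keep at least one."""
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]
```

In Wei and Zou's formulation, each operation is applied at a rate proportional to sentence length (a parameter α), and several augmented copies are generated per training example; both are tuning choices.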

关键词

深度学习 / 科技政策 / 属性识别 / 数据增强 / 文本分类

Key words

deep learning / science and technology policy / property identification / data augmentation / text classification
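Of the models compared in the abstract, Bi-LSTM-Attention differs from plain Bi-LSTM mainly in how the sequence of hidden states is pooled into a document vector: an attention layer scores every timestep against a learned context vector and takes a weighted sum, instead of keeping only the final state. A dependency-free sketch of that pooling step (the hidden states and query vector below are illustrative stand-ins for learned values):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Attention-weighted sum of per-timestep hidden states.

    hidden_states: T vectors of dimension D (e.g. Bi-LSTM outputs).
    query: a context vector of dimension D scoring each timestep's relevance.
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights
```

The pooled vector then feeds a softmax classifier over the three policy-property classes; in a trained model the query vector is learned jointly with the rest of the network.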

引用本文

李牧南, 王良, 赖华鹏. 基于深度学习的我国科技政策属性识别[J]. 科研管理, 2024, 45(2): 1-11. https://doi.org/10.19571/j.cnki.1000-2995.2024.02.001
Li Munan, Wang Liang, Lai Huapeng. Identification of China's S&T policy properties based on deep learning[J]. Science Research Management, 2024, 45(2): 1-11. https://doi.org/10.19571/j.cnki.1000-2995.2024.02.001
中图分类号: F124.3   
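One of the abstract's findings is that extending the interception window from 500 to 2,000 words yields no significant gain. A rough way to probe this is to compare an information estimate of a truncated prefix against the full text; the sketch below uses empirical Shannon entropy over already-segmented tokens as a generic stand-in for the paper's information-estimation procedure.

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Empirical Shannon entropy of a token sequence, in bits per token."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def truncation_entropy(tokens, cutoff=500):
    """Entropy of the first `cutoff` tokens vs. the full sequence."""
    return shannon_entropy(tokens[:cutoff]), shannon_entropy(tokens)
```

If the prefix entropy is already close to the full-text value, a longer window mostly adds redundant tokens, which is consistent with the reported 500-word versus 2,000-word result.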

参考文献

[1]
SILVER D, HUANG A, MADDISON C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587):484-489.
[2]
SILVER D, SCHRITTWIESER J, SIMONYAN K, et al. Mastering the game of Go without human knowledge[J]. Nature, 2017, 550(7676):354-359.
[3]
LITJENS G, KOOI T, BEJNORDI B E, et al. A survey on deep learning in medical image analysis[J]. Medical Image Analysis, 2017, 42: 60-88.
[4]
GU J X, WANG Z H, KUEN J, et al. Recent advances in convolutional neural networks[J]. Pattern Recognition, 2018, 77: 354-377.
[5]
王超. 深度学习在行业指数技术分析中的应用研究[J]. 管理评论, 2021, 33(3):75-83.
WANG Chao. Study on applications of deep learning in technical analysis of sector indexes[J]. Management Review, 2021, 33(3):75-83.
[6]
GUNNARSSON B R, BROUCKE S V, BAESENS B, et al. Deep learning for credit scoring: Do or don't?[J]. European Journal of Operational Research, 2021, 295(1): 292-305.
[7]
LEHRER S, XIE T, ZHANG X Y. Social media sentiment, model uncertainty, and volatility forecasting[J]. Economic Modelling, 2021, 102: 105556. DOI:10.1016/j.econmod.2021.105556.
[8]
钟佳娃, 刘巍, 王思丽, 等. 文本情感分析方法及应用综述[J]. 数据分析与知识发现, 2021, 5(6):1-13.
ZHONG Jiawa, LIU Wei, WANG Sili, et al. Review of methods and applications of text sentiment analysis[J]. Data Analysis and Knowledge Discovery, 2021, 5(6): 1-13.


[9]
WU H P, LIU Y L, WANG J W. Review of text classification methods on deep learning[J]. CMC-Computers Materials & Continua, 2020, 63 (3): 1309-1321.
[10]
张海涛, 王丹, 徐海玲, 等. 基于卷积神经网络的微博舆情情感分类研究[J]. 情报学报, 2018, 37(7): 695-702.
ZHANG Haitao, WANG Dan, XU Hailing, et al. Sentiment classification of micro-blog public opinion based on convolution neural network[J]. Journal of the China Society for Scientific and Technical Information, 2018, 37(7): 695-702.
[11]
GUO B, ZHANG C X, LIU J M, et al. Improving text classification with weighted word embeddings via a multi-channel TextCNN model[J]. Neurocomputing, 2019, 363: 366-374.
[12]
吴鹏, 应杨, 沈思. 基于双向长短期记忆模型的网民负面情感分类研究[J]. 情报学报, 2018, 37(8):845-853.
WU Peng, YING Yang, SHEN Si. Negative emotions of online users' analysis based on bidirectional long short-term memory[J]. Journal of the China Society for Scientific and Technical Information, 2018, 37(8): 845-853.
[13]
贺鸣, 孙建军, 成颖. 基于朴素贝叶斯的文本分类研究综述[J]. 情报科学, 2016, 34(7): 147-154.
HE Ming, SUN Jianjun, CHENG Ying. Text classification based on naive Bayes:A review[J]. Information Science, 2016, 34(7): 147-154.
[14]
SINOARA R A, CAMACHO-COLLADOS J, ROSSI R G, et al. Knowledge-enhanced document embeddings for text classification[J]. Knowledge-Based Systems, 2019, 163: 955-971.
[15]
LI Q, LI P F, MAO K Z, et al. Improving convolutional neural network for text classification by recursive data pruning[J]. Neurocomputing, 2020, 414: 143-152.
[16]
XIE J B, HOU Y J, WANG Y J, et al. Chinese text classification based on attention mechanism and feature-enhanced fusion neural network[J]. Computing, 2020, 102(3): 683-700.
[17]
DENG J F, CHENG L L, WANG Z W. Attention-based BiLSTM fused CNN with gating mechanism model for Chinese long text classification[J]. Computer Speech and Language, 2021, 68: 101182. DOI:10.1016/j.csl.2020.101182.
[18]
李娜, 姜恩波, 朱一真, 等. 政策工具自动识别方法与实证研究[J]. 图书情报工作, 2021, 65(7): 115-122.
LI Na, JIANG Enbo, ZHU Yizhen, et al. Policy tool identification method and empirical research based on deep learning[J]. Library and Information Service, 2021, 65(7): 115-122.
[19]
胡吉明, 付文麟, 钱玮, 等. 融合主题模型和注意力机制的政策文本分类模型[J]. 情报理论与实践, 2021, 44(7): 159-165.
HU Jiming, FU Wenlin, QIAN Wei, et al. Research on policy text classification model based on topic model and attention mechanism[J]. Information Studies: Theory & Application, 2021, 44(7): 159-165.
[20]
ARRIETA A B, DIAZ-RODRIGUEZ N, DEL SER J, et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI[J]. Information Fusion, 2020, 58: 82-115.
[21]
TSAI K C, WANG L, HAN Z. Caching for mobile social networks with deep learning: Twitter analysis for 2016 US election[J]. IEEE Transactions on Network Science and Engineering, 2020, 7(1): 193-203.
[22]
SAHOO S R, GUPTA B B. Multiple features based approach for automatic fake news detection on social networks using deep learning[J]. Applied Soft Computing, 2021, 100: 106983. DOI:10.1016/j.asoc.2020.106983.
[23]
LI S B, HU J, CUI Y X, et al. DeepPatent: Patent classification with convolutional neural networks and word embedding[J]. Scientometrics, 2018, 117(2): 721-744.
[24]
YI X, WALIA E, BABYN P. Generative adversarial network in medical imaging: A review[J]. Medical Image Analysis, 2019, 58: 101552. DOI:10.1016/j.media.2019.101552.
[25]
WANG X, WANG K, LIAN S G. A survey on face data augmentation for the training of deep neural networks[J]. Neural Computing & Applications, 2020, 32(19): 15503-15531.
[26]
张一珂, 张鹏远, 颜永红. 基于对抗训练策略的语言模型数据增强技术[J]. 自动化学报, 2018, 44(5): 891-900.
ZHANG Yike, ZHANG Pengyuan, YAN Yonghong. Data augmentation for language models via adversarial training[J]. Acta Automatica Sinica, 2018, 44(5): 891-900.
[27]
李茜茜, 沈晓燕, 任福继, 等. 面向数据增强的多种语音情感分类算法研究[J]. 智能系统学报, 2021, 16(1): 170-177.
LI Qianqian, SHEN Xiaoyan, REN Fuji, et al. Investigation of multiple speech emotion classification algorithms based on data enhancement[J]. CAAI Transactions on Intelligent Systems, 2021, 16(1): 170-177.
[28]
陆垚杰, 林鸿宇, 韩先培, 等. 基于语言学扰动的事件检测数据增强方法[J]. 中文信息学报, 2019, 33(7): 110-117.
LU Yaojie, LIN Hongyu, HAN Xianpei, et al. Linguistic perturbation based on data augmentation for event detection[J]. Journal of Chinese Information Processing, 2019, 33(7): 110-117.
[29]
刘彤, 刘琛, 倪维健. 多层次数据增强的半监督中文情感分析方法[J]. 数据分析与知识发现, 2021, 5(5): 51-58.
LIU Tong, LIU Chen, NI Weijian. A semi-supervised sentiment analysis method for Chinese based on multi-level data augmentation[J]. Data Analysis and Knowledge Discovery, 2021, 5(5): 51-58.


[30]
WEI J, ZOU K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks[J/OL]. arXiv preprint arXiv:1901.11196, 2019. https://arxiv.org/abs/1901.11196.
[31]
EKBOIR J M. Research and technology policies in innovation systems: Zero tillage in Brazil[J]. Research Policy, 2003, 32(4): 573-586.
[32]
MOROSINI P. Industrial clusters, knowledge integration and performance[J]. World Development, 2004, 32(2):305-326.
[33]
ONISHI A. A new challenge to economic science: Global model simulation[J]. Journal of Policy Modeling, 2010, 32(1):1-46.
[34]
张永安, 耿喆, 王燕妮. 区域科技创新政策分类与政策工具挖掘:基于中关村数据的研究[J]. 科技进步与对策, 2015, 32(17): 116-122.
ZHANG Yongan, GENG Zhe, WANG Yanni. Regional science and technology innovation policy classification and policy tools mining based on Zhongguancun data[J]. Science & Technology Progress and Policy, 2015, 32(17): 116-122.
[35]
张宝建, 李鹏利, 陈劲, 等. 国家科技创新政策的主题分析与演化过程:基于文本挖掘的视角[J]. 科学学与科学技术管理, 2019, 40(11): 15-31.
ZHANG Baojian, LI Pengli, CHEN Jin, et al. Thematic analysis and evolution process of national science and technology innovation policy: Based on the perspective of text mining[J]. Science of Science and Management of S. & T., 2019, 40(11): 15-31.
[36]
陈玲, 段尧清. 我国政府开放数据政策的实施现状和特点研究:基于政府公报文本的量化分析[J]. 情报学报, 2020, 39(7): 698-709.
CHEN Ling, DUAN Yaoqing. Analyzing implementation of the Chinese government open data policy using government bulletin text as example[J]. Journal of the China Society for Scientific and Technical Information, 2020, 39(7): 698-709.
[37]
杨锐, 陈伟, 何涛, 等. 融合主题信息的卷积神经网络文本分类方法研究[J]. 现代情报, 2020, 40(4): 42-49.
YANG Rui, CHEN Wei, HE Tao, et al. Text classification method based on convolutional neural network using topic information[J]. Journal of Modern Information, 2020, 40(4): 42-49.
[38]
黄水清, 王东波. 新时代人民日报分词语料库构建、性能及应用(二):深度学习自动分词模型构建[J]. 图书情报工作, 2019, 63(23): 5-12.
HUANG Shuiqing, WANG Dongbo. Construction, performance and application of new era People's Daily segmented corpus (Ⅱ):Constructing automatic word segmentation model of deep learning[J]. Library and Information Service, 2019, 63(23): 5-12.
[39]
王昊, 邓三鸿, 苏新宁, 等. 基于深度学习的情报学理论及方法术语识别研究[J]. 情报学报, 2020, 39(8): 817-828.
WANG Hao, DENG Sanhong, SU Xinning, et al. A study on Chinese terminology recognition of theory and method from information science: Based on deep learning[J]. Journal of the China Society for Scientific and Technical Information, 2020, 39(8): 817-828.
[40]
HUANG Y, CHEN J, ZHENG S, et al. Hierarchical multi-attention networks for document classification[J]. International Journal of Machine Learning and Cybernetics, 2021, 12(6): 1639-1647.
[41]
ZHENG X, CHEN W Z. An attention-based Bi-LSTM method for visual object classification via EEG[J]. Biomedical Signal Processing and Control, 2021, 63: 102174. DOI:10.1016/j.bspc.2020.102174.
[42]
施国良, 陈宇奇. 文本增强与预训练语言模型在网络问政留言分类中的集成对比研究[J]. 图书情报工作, 2021, 65(13): 96-107.
SHI Guoliang, CHEN Yuqi. A comparative study on the integration of text enhanced and pre-trained language models in the classification of Internet political messages[J]. Library and Information Service, 2021, 65(13): 96-107.
[43]
KIM S, PARK H, LEE J. Word2vec-based latent semantic analysis (W2V-LSA) for topic modeling: A study on blockchain technology trend analysis[J]. Expert Systems with Applications, 2020, 152: 113401. DOI:10.1016/j.eswa.2020.113401.
[44]
李牧南, 王良, 赖华鹏. 中文科技政策文本分类:增强的TextCNN视角[J]. 科技管理研究, 2023, 43 (2): 160-166.
LI Munan, WANG Liang, LAI Huapeng. Text classification of Chinese S&T policies: Enhanced TextCNN perspective[J]. Science and Technology Management Research, 2023, 43 (2): 160-166.
[45]
DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J/OL]. arXiv preprint arXiv:1810.04805, 2018. https://arxiv.org/abs/1810.04805.
[46]
冯国明, 张晓冬, 刘素辉. 基于CapsNet的中文文本分类研究[J]. 数据分析与知识发现, 2018, 2(12): 68-76.
FENG Guoming, ZHANG Xiaodong, LIU Suhui. Classifying Chinese texts with CapsNet[J]. Data Analysis and Knowledge Discovery, 2018, 2(12): 68-76.


基金

国家自然科学基金面上项目："基于多源数据融合与机器学习的新兴技术风险挖掘研究"(72074081, 2021.01—2024.12)
国家社会科学基金重点项目："加快我国科技自立自强发展战略问题研究"(22AZD035, 2022.05—2025.12)
