
Computer Engineering

• Artificial Intelligence and Recognition Technology •

Website Automatic Summarization Based on Website Hierarchy and Latent Dirichlet Allocation

LI Shu'ai, YANG Jing, GU Junzhong

  1. (Department of Computer Science and Technology, East China Normal University, Shanghai 200241, China)
  • Received: 2016-04-18 Online: 2017-04-15 Published: 2017-04-14
  • About the authors: LI Shu'ai (born 1993), female, master's student, main research interest: data mining; YANG Jing, associate professor; GU Junzhong, professor.
  • Funding:

    National Key Technology R&D Program of China (2015BAH01F02); Special Development Fund Program of the Shanghai Zhangjiang National Independent Innovation Demonstration Zone (201411-JA-B108-002).

Abstract:

In recent years, research on automatic summarization has focused mostly on multi-document and Web page summarization, with comparatively little work on summarizing entire websites. To address this, an algorithm that automatically generates website summaries is proposed, based on the Latent Dirichlet Allocation (LDA) topic model and the hierarchical structure of the website. The algorithm collects and integrates information from the web pages across the whole website, computes a weight for each sentence using the proposed sentence weighting formula, and selects the highest-weighted sentences as the website summary. Experiments on 20 commercial and academic websites, evaluated with ROUGE, show that compared with summaries generated using LDA alone, ROUGE-1 and ROUGE-L both increase by 0.32 without stop words, and with stop words ROUGE-1 increases by 0.39 and ROUGE-L by 0.38. Compared with summaries of the homepage alone, ROUGE-1 increases by 0.03 and ROUGE-L by 0.06 without stop words, and ROUGE-1 increases by 0.08 and ROUGE-L by 0.07 with stop words.

Key words: Web pages, website automatic summarization, Latent Dirichlet Allocation (LDA), website hierarchy, breadth-first search
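
The pipeline outlined in the abstract (traverse the site breadth-first, weight sentences, keep the top-weighted ones) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the abstract does not give the sentence weighting formula, so `sentence_weight` below is a hypothetical stand-in that damps a topic score by page depth; `topic_word_prob` is a hand-written table standing in for the word probabilities a trained LDA model would supply; and the site is a toy in-memory link graph with illustrative names throughout.

```python
from collections import deque


def bfs_depths(links, home):
    """Breadth-first traversal from the homepage; returns {page: depth}."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for child in links.get(page, []):
            if child not in depths:  # first visit gives the shallowest depth
                depths[child] = depths[page] + 1
                queue.append(child)
    return depths


def sentence_weight(sentence, topic_word_prob, depth, decay=0.5):
    """Hypothetical weight (NOT the paper's formula): mean topic-word
    probability of the sentence, damped by decay ** depth."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    topical = sum(topic_word_prob.get(w, 0.0) for w in words) / len(words)
    return topical * decay ** depth


def summarize(pages, links, home, topic_word_prob, k=2):
    """Score every sentence on every reachable page; return the top k."""
    depths = bfs_depths(links, home)
    scored = []
    for page, depth in depths.items():
        for sentence in pages.get(page, []):
            weight = sentence_weight(sentence, topic_word_prob, depth)
            scored.append((weight, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:k]]


# Toy site: all pages, links, and probabilities are made up for illustration.
links = {"/": ["/about", "/products"], "/about": ["/team"], "/products": ["/"]}
pages = {
    "/": ["we build search software", "welcome to our site"],
    "/about": ["our search software indexes documents"],
    "/products": ["buy our software today"],
    "/team": ["search software search software"],
}
# Stand-in for the dominant LDA topic's word distribution (assumed values).
topic_word_prob = {"search": 0.4, "software": 0.3, "documents": 0.1}

depths = bfs_depths(links, "/")
summary = summarize(pages, links, "/", topic_word_prob, k=2)
print(depths)
print(summary)
```

In this sketch, a sentence on a deep page needs a much stronger topic score to outrank a homepage sentence, which is one simple way to let the site hierarchy influence selection. A real system would replace `topic_word_prob` with the output of a trained LDA model and use the weighting formula defined in the paper itself.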

CLC Number: