A corpus of plant-disease relations in the biomedical domain.

PubMed ID
Publication date 2019

Source PloS one
Authors Kim Baeksoo, Choi Wonjun, Lee Hyunju

Title A corpus of plant-disease relations in the biomedical domain.

Abstract

BACKGROUND

Many new medicines have been derived from natural sources such as plants, which have a long history of being used for disease treatment. Thus, their benefits and side effects have been studied, and plant-related information, including plant and disease relations, has been accumulated in Medline articles. Because numerous articles are available in Medline and are written in natural language, text-mining is important. However, a corpus of plant and disease relations is not available yet. Thus, we aimed to construct such a corpus.

METHODS AND RESULTS

In this study, we designed and annotated a plant-disease relations corpus, and proposed a computational model to predict plant-disease relations using the corpus. We categorized plant and disease relations into four types: treatments of diseases, causes of diseases, associations, and negative relations. To construct a corpus of plant-disease relations, we first created its annotation guidelines and randomly selected 200 Medline abstracts. From these abstracts, we identified 1,405 and 1,755 plant and disease mentions, annotated to 105 and 237 unique plant and disease identifiers, respectively. When we selected sentences containing at least one plant and one disease mention, we extracted 878 plant and 1,077 disease entities, which finally generated a corpus of plant-disease relations including 1,309 relations from 199 abstracts. To verify the effectiveness of the corpus, we proposed a convolutional neural network model with the shortest dependency path (SDP-CNN) and applied it to the constructed corpus. The micro F-score with ten-fold cross-validation was found to be 0.764. We also applied the proposed SDP-CNN model to all Medline abstracts. When we measured its performance for 483 randomly selected plant-disease co-occurring sentences, the model showed a precision of 0.707.
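
The micro F-score reported above pools true positives, false positives, and false negatives across relation types before computing precision and recall. The following is a minimal sketch of that calculation, not the authors' evaluation code: the label strings, the `micro_f1` helper, and the toy predictions are all hypothetical, and the abstract does not say whether the negative-relation class is counted as a positive label, so the set of counted labels is left as a parameter.

```python
# Hypothetical label names; the abstract lists four relation types
# but does not give their identifiers in the corpus files.
LABELS = ("treatment", "cause", "association", "negative")


def micro_f1(gold, pred, labels=LABELS):
    """Micro-averaged F1: pool TP/FP/FN over `labels`, then compute P, R, F."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            if g in labels:
                tp += 1  # correct prediction of a counted label
        else:
            if p in labels:
                fp += 1  # predicted a counted label that is wrong
            if g in labels:
                fn += 1  # missed a counted gold label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only; these are not corpus annotations.
gold = ["treatment", "cause", "association", "negative", "treatment"]
pred = ["treatment", "treatment", "association", "negative", "cause"]

score_all = micro_f1(gold, pred)
score_no_negative = micro_f1(
    gold, pred, labels=("treatment", "cause", "association")
)
```

When every instance carries exactly one label and all labels are counted, micro F1 equals plain accuracy; excluding a class such as the negative relation is what makes the two diverge. In a ten-fold setting, one common convention (assumed here, not stated in the abstract) is to pool predictions from all folds before calling such a function.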

CONCLUSION

The plant-disease relations corpus is unique and represents an important resource for biomedical text-mining. The corpus of plant and disease relations is available at http://gcancer.org/pdr/.


Full text 10.1371/journal.pone.0221582