MEDDOPROF corpus: complete gold standard annotations for occupation detection in medical documents in Spanish
Publisher
European Organization for Nuclear Research
Abstract
<strong>UPDATE 27/09/2022: </strong>A complete normalization of all mentions in the corpus to SNOMED CT has been added to the 'meddoprof-norm.tsv' file.

<strong>Description</strong>
This repository contains the complete MEDDOPROF Gold Standard, a collection of 1,844 clinical cases in Spanish annotated for occupations, working statuses and activities. MEDDOPROF is a shared task held in 2021 that explores the application of natural language processing to occupational health. To learn more, please visit: https://temu.bsc.es/meddoprof.

<strong>Folder and File Structure</strong>
The corpus files are provided in the format used by the annotation tool brat: for each clinical case there is a .txt file with the text and a .ann file with the corresponding annotations.

<em>- meddoprof-ner/</em> Clinical cases annotated with these labels: PROFESION (PROFESSION), SITUACION_LABORAL (WORKING_STATUS) or ACTIVIDAD (ACTIVITY).
<em>- meddoprof-class/</em> Clinical cases with the same mentions as 'meddoprof-ner', labeled instead according to who holds the occupation: PACIENTE (patient), FAMILIAR (family member), SANITARIO (health professional) or OTRO (other).
<em>- ner_class_joint/</em> Clinical cases with both levels of annotation (ner and class) combined. For example, a mention labeled PROFESION in meddoprof-ner and PACIENTE in meddoprof-class is labeled PROFESION-PACIENTE here.
<em>- meddoprof-norm.tsv</em> Tab-separated file (.tsv) mapping each mention in the corpus to ESCO and SNOMED CT. The file has five columns: filename, mention text, span, ESCO code and SNOMED code.

Additionally, two files listing the filenames of the train and test partitions are included.

<strong>Please cite if you use this resource:</strong>
Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Brivá-Iglesias and Martin Krallinger. NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. In Procesamiento del Lenguaje Natural, 67. 2021.

<pre><code>@article{meddoprof,
  title={NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts},
  author={Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin},
  journal = {Procesamiento del Lenguaje Natural},
  volume = {67},
  year={2021},
  issn = {1989-7553},
  url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6393},
  pages = {243--256}
}</code></pre>

<strong>Related Resources:</strong>
- Web
- Training Data
- Test set
- Codes Reference List (for MEDDOPROF-NORM)
- Annotation Guidelines
- Occupations Gazetteer

MEDDOPROF is part of the IberLEF 2021 workshop, which is co-located with the SEPLN 2021 conference. For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es.

MEDDOPROF is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL) and the Spanish government's 2020 Proyectos de I+D+i RTI Tipo A (AI4PROFHEALTH - DESCIFRANDO EL PAPEL DE LAS PROFESIONES EN LA SALUD DE LOS PACIENTES A TRAVES DE LA MINERIA DE TEXTOS (PID2020-119266RA-I00)).
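As a rough illustration of the brat layout described under Folder and File Structure, the following Python sketch reads the text-bound annotations of a .ann file and builds a joint label of the kind found in ner_class_joint. It is not part of the corpus distribution: the names parse_ann, joint_label and Mention are hypothetical, and the sample sentence and offsets are invented for the example rather than taken from a corpus file. It assumes the standard brat standoff layout (ID, tab, label and character offsets, tab, mention text).

```python
from typing import NamedTuple


class Mention(NamedTuple):
    ann_id: str   # brat identifier, e.g. "T1"
    label: str    # e.g. "PROFESION", "SITUACION_LABORAL", "ACTIVIDAD"
    spans: list   # list of (start, end) character offsets into the .txt file
    text: str     # mention text as stored in the .ann file


def parse_ann(ann_text: str) -> list:
    """Parse the text-bound (T) annotations from the content of a brat .ann file."""
    mentions = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip notes, relations and other non-entity lines
        ann_id, label_and_offsets, text = line.split("\t", 2)
        label, offsets = label_and_offsets.split(" ", 1)
        # brat separates the fragments of a discontinuous span with ';'
        spans = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
        mentions.append(Mention(ann_id, label, spans, text))
    return mentions


def joint_label(ner_label: str, class_label: str) -> str:
    """Combine the two annotation levels, e.g. PROFESION + PACIENTE."""
    return f"{ner_label}-{class_label}"


# Invented sample, for illustration only (not a real clinical case).
sample_text = "Paciente de 45 años, profesor de primaria, actualmente jubilado."
sample_ann = (
    "T1\tPROFESION 21 41\tprofesor de primaria\n"
    "T2\tSITUACION_LABORAL 55 63\tjubilado\n"
)

for mention in parse_ann(sample_ann):
    start, end = mention.spans[0]
    # The offsets index into the text of the matching .txt file.
    print(mention.label, repr(sample_text[start:end]))
```

With real data, the content of each .ann file would be passed to parse_ann and the offsets checked against the .txt file of the same name. Whether meddoprof-norm.tsv stores its span column in the same offset notation is not stated above, so that should be verified against the file itself.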
Citations: 2