Browsing by Author "Martin Krallinger"

Now showing 1 - 18 of 18

MEDDOPROF corpus: complete gold standard annotations for occupation detection in medical documents in Spanish
(European Organization for Nuclear Research, 2022) Salvador Lima López; Eulàlia Farré-Maduell; Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Martin Krallinger
UPDATE 27/09/2022: A complete normalization of all mentions in the corpus to SNOMED CT has been added to the 'meddoprof-norm.tsv' file.
Description
This repository contains the complete MEDDOPROF Gold Standard, a collection of 1,844 clinical cases in Spanish with annotations for occupations, working statuses and activities. MEDDOPROF is a shared task held in 2021 that explores the application of natural language processing to occupational health. To learn more, please visit https://temu.bsc.es/meddoprof.
Folder and File Structure
The corpus files are presented in the format used by the annotation tool brat: for each clinical case there is a .txt file with the text and a .ann file with its corresponding annotations.
- meddoprof-ner/ Clinical cases annotated with these labels: PROFESION (PROFESSION), SITUACION_LABORAL (WORKING_STATUS) or ACTIVIDAD (ACTIVITY).
- meddoprof-class/ Clinical cases with the same annotations as 'meddoprof-ner' but with these labels instead: PACIENTE (patient), FAMILIAR (family member), SANITARIO (health professional) or OTRO (other).
- ner_class_joint/ Clinical cases with both levels of annotation (ner and class) joined (that is, a mention classified as PROFESION in meddoprof-ner and as PACIENTE in meddoprof-class would be PROFESION-PACIENTE here).
- meddoprof-norm.tsv Tab-separated file (.tsv) with the mapping of each mention in the corpus to ESCO and SNOMED CT. The file has five columns: filename, mention text, span, ESCO code and SNOMED code.
Additionally, two files with the filenames of the train and test partitions are included.
Please cite if you use this resource:
Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Brivá-Iglesias and Martin Krallinger. NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. In Procesamiento del Lenguaje Natural, 67. 2021.
<pre><code>@article{meddoprof,
  title = {NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts},
  author = {Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin},
  journal = {Procesamiento del Lenguaje Natural},
  volume = {67},
  year = {2021},
  issn = {1989-7553},
  url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6393},
  pages = {243--256}
}</code></pre>
Related Resources:
- Web
- Training Data
- Test set
- Codes Reference List (for MEDDOPROF-NORM)
- Annotation Guidelines
- Occupations Gazetteer
MEDDOPROF is part of the IberLEF 2021 workshop, which is co-located with the SEPLN 2021 conference. For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es
MEDDOPROF is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL) and the Spanish government's 2020 Proyectos de I+D+i RTI Tipo A (AI4PROFHEALTH - DESCIFRANDO EL PAPEL DE LAS PROFESIONES EN LA SALUD DE LOS PACIENTES A TRAVES DE LA MINERIA DE TEXTOS (PID2020-119266RA-I00)).
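As a rough illustration of the brat layout described above, the following Python sketch reads the text-bound annotations (the lines whose ID starts with T) from the contents of a .ann file. It relies only on the general brat standoff format; the Mention class and the parse_ann function are names made up for this example and are not part of the MEDDOPROF distribution.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mention:
    """One text-bound brat annotation (hypothetical helper type)."""
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann(ann_text: str) -> List[Mention]:
    """Parse the text-bound ("T") lines of a brat .ann file.

    Each such line has three tab-separated fields:
    ID, "LABEL START END" (fragments separated by ";"), and the mention text.
    Other line types (notes, relations, ...) are skipped.
    """
    mentions = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not a well-formed text-bound annotation
        ann_id, header, text = parts
        label, _, offsets = header.partition(" ")
        # Discontinuous mentions look like "12 20;25 30";
        # this sketch keeps the first start and the last end.
        fragments = [frag.split() for frag in offsets.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        mentions.append(Mention(ann_id, label, start, end, text))
    return mentions
```

A caller would typically read a .ann file as UTF-8, pass its contents to parse_ann, and check each result against the matching .txt file, since the offsets count characters in that text.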
MEDDOPROF corpus: complete gold standard annotations for occupation detection in medical documents in Spanish
(European Organization for Nuclear Research, 2021) Salvador Lima López; Eulàlia Farré-Maduell; Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Martin Krallinger
The MEDDOPROF Shared Task tackles the detection of occupations and employment statuses in Spanish clinical cases from different specialties. Systems capable of automatically processing clinical texts are of interest to the medical community, social workers, researchers, the pharmaceutical industry, computer engineers, AI developers, policy makers, citizens' associations and patients. Additionally, other NLP tasks (such as anonymization) can also benefit from this type of data.
MEDDOPROF has three sub-tasks:
1) MEDDOPROF-NER: Participants must find the beginning and end of occupation mentions and classify them as PROFESION (PROFESSION), SITUACION_LABORAL (WORKING_STATUS) or ACTIVIDAD (ACTIVITY).
2) MEDDOPROF-CLASS: Participants must find the beginning and end of occupation mentions and classify them according to their referent (PACIENTE [patient], FAMILIAR [family member], SANITARIO [health professional] or OTRO [other]).
3) MEDDOPROF-NORM: Participants must find the beginning and end of occupation mentions and normalize them according to a reference code list.
This is the complete Gold Standard. Annotations for the NER and CLASS sub-tracks are provided both separately and joined together (with each annotation level separated by a dash, e.g. PROFESION-PACIENTE). The normalized mentions are given as a tab-separated file (.tsv) with four columns: filename, mention text, span and code.
Please cite if you use this resource:
Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Brivá-Iglesias and Martin Krallinger. NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. In Procesamiento del Lenguaje Natural, 67. 2021.
<pre><code>@article{meddoprof,
  title = {NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts},
  author = {Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin},
  journal = {Procesamiento del Lenguaje Natural},
  volume = {67},
  year = {2021},
  issn = {1989-7553},
  url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6393},
  pages = {243--256}
}</code></pre>
Resources:
- Web
- Training Data
- Test set
- Codes Reference List (for MEDDOPROF-NORM)
- Annotation Guidelines
MEDDOPROF is part of the IberLEF 2021 workshop, which is co-located with the SEPLN 2021 conference. For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es
MEDDOPROF is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL).
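A minimal sketch of how the four-column normalization file might be loaded in Python. It assumes the columns appear in the order listed in the description (filename, mention text, span, code), that fields are not quoted, and that a header row may or may not be present; the has_header flag and the names NormRow and read_norm_tsv are assumptions made for this example. The span is kept as a raw string because its exact layout is not specified here.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class NormRow(NamedTuple):
    """One row of the normalization TSV (hypothetical helper type)."""
    filename: str
    mention: str
    span: str   # kept verbatim; the span layout is not specified in the description
    code: str


def read_norm_tsv(handle: TextIO, has_header: bool = False) -> Iterator[NormRow]:
    """Yield rows from a tab-separated file with exactly four columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # drop the header row if the file has one
    for fields in reader:
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        yield NormRow(*fields)


# Usage with in-memory data; the file name and codes below are placeholders.
sample = io.StringIO("case1.txt\talbañil\t21 28\tCODE_A\n")
rows = list(read_norm_tsv(sample))
```

For a real file, the handle would usually come from open(path, encoding="utf-8", newline=""), which is the way the csv module documentation recommends opening files.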
MEDDOPROF corpus: complete gold standard annotations for occupation detection in medical documents in Spanish
(European Organization for Nuclear Research, 2022) Salvador Lima López; Eulàlia Farré-Maduell; Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Martin Krallinger
The MEDDOPROF Shared Task tackles the detection of occupations and employment statuses in Spanish clinical cases from different specialties. Systems capable of automatically processing clinical texts are of interest to the medical community, social workers, researchers, the pharmaceutical industry, computer engineers, AI developers, policy makers, citizens' associations and patients. Additionally, other NLP tasks (such as anonymization) can also benefit from this type of data.
MEDDOPROF has three sub-tasks:
1) MEDDOPROF-NER: Participants must find the beginning and end of occupation mentions and classify them as PROFESION (PROFESSION), SITUACION_LABORAL (WORKING_STATUS) or ACTIVIDAD (ACTIVITY).
2) MEDDOPROF-CLASS: Participants must find the beginning and end of occupation mentions and classify them according to their referent (PACIENTE [patient], FAMILIAR [family member], SANITARIO [health professional] or OTRO [other]).
3) MEDDOPROF-NORM: Participants must find the beginning and end of occupation mentions and normalize them according to a reference code list.
This is the complete Gold Standard. Annotations for the NER and CLASS sub-tracks are provided both separately and joined together (with each annotation level separated by a dash, e.g. PROFESION-PACIENTE). The normalized mentions are given as a tab-separated file (.tsv) with four columns: filename, mention text, span and code.
Please cite if you use this resource:
Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Brivá-Iglesias and Martin Krallinger. NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. In Procesamiento del Lenguaje Natural, 67. 2021.
<pre><code>@article{meddoprof,
  title = {NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts},
  author = {Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin},
  journal = {Procesamiento del Lenguaje Natural},
  volume = {67},
  year = {2021},
  issn = {1989-7553},
  url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6393},
  pages = {243--256}
}</code></pre>
Resources:
- Web
- Training Data
- Test set
- Codes Reference List (for MEDDOPROF-NORM)
- Annotation Guidelines
- Occupations Gazetteer
MEDDOPROF is part of the IberLEF 2021 workshop, which is co-located with the SEPLN 2021 conference. For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es
MEDDOPROF is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL).
MEDDOPROF corpus: sample set
(European Organization for Nuclear Research, 2021) Salvador Lima López; Eulàlia Farré-Maduell; Vicent Brivá-Iglesias; Antonio Miranda-Escalada; Martin Krallinger
The MEDDOPROF corpus is a collection of 2000 clinical cases from over 20 different specialties annotated with professions, employment statuses and other work-related activities. It is used for the MEDDOPROF Shared Task on occupation and employment status detection and normalization in Spanish medical documents, which will be held as part of IberLEF 2021. The sample set is composed of 15 clinical cases extracted from the training set, covering four different specialties: radiology, oncology, psychiatry and occupational health.
The files are distributed as follows:
- For subtask 1 (MEDDOPROF-NER), annotations are distributed in Brat standoff format with PROFESION/SITUACION_LABORAL tags only.
- For subtask 2 (MEDDOPROF-CLASS), annotations are distributed in Brat standoff format with PACIENTE/FAMILIAR/SANITARIO/OTROS tags only.
- For subtask 3 (MEDDOPROF-NORM), annotations are distributed in a tab-separated file (TSV) with a code column that includes the mapping of entities to ESCO and SNOMED-CT.
For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es
MEDDOPROF corpus: test set
(European Organization for Nuclear Research, 2021) Eulàlia Farré-Maduell; Salvador Lima López; Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Martin Krallinger
The MEDDOPROF Shared Task tackles the detection of occupations and employment statuses in Spanish clinical cases from different specialties. Systems capable of automatically processing clinical texts are of interest to the medical community, social workers, researchers, the pharmaceutical industry, computer engineers, AI developers, policy makers, citizens' associations and patients. Additionally, other NLP tasks (such as anonymization) can also benefit from this type of data.
MEDDOPROF has three sub-tasks:
1) MEDDOPROF-NER: Participants must find the beginning and end of occupation mentions and classify them as PROFESION (PROFESSION), SITUACION_LABORAL (WORKING_STATUS) or ACTIVIDAD (ACTIVITY).
2) MEDDOPROF-CLASS: Participants must find the beginning and end of occupation mentions and classify them according to their referent (PACIENTE [patient], FAMILIAR [family member], SANITARIO [health professional] or OTRO [other]).
3) MEDDOPROF-NORM: Participants must find the beginning and end of occupation mentions and normalize them according to a reference code list.
Please cite if you use this resource:
Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Brivá-Iglesias and Martin Krallinger. NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. In Procesamiento del Lenguaje Natural, 67. 2021.
<pre><code>@article{meddoprof,
  title = {NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts},
  author = {Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin},
  journal = {Procesamiento del Lenguaje Natural},
  volume = {67},
  year = {2021},
  issn = {1989-7553},
  url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6393},
  pages = {243--256}
}</code></pre>
Resources:
- Web
- Complete corpus
- Training Data
- Codes Reference List (for MEDDOPROF-NORM)
- Annotation Guidelines
MEDDOPROF is part of the IberLEF 2021 workshop, which is co-located with the SEPLN 2021 conference. For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es
MEDDOPROF is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL).
MEDDOPROF corpus: test set
(2021) Eulàlia Farré-Maduell; Salvador Lima López; Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Martin Krallinger
The MEDDOPROF Shared Task tackles the detection of occupations and employment statuses in Spanish clinical cases from different specialties. Systems capable of automatically processing clinical texts are of interest to the medical community, social workers, researchers, the pharmaceutical industry, computer engineers, AI developers, policy makers, citizens' associations and patients. Additionally, other NLP tasks (such as anonymization) can also benefit from this type of data.
MEDDOPROF has three sub-tasks:
1) MEDDOPROF-NER: Participants must find the beginning and end of occupation mentions and classify them as PROFESION (PROFESSION), SITUACION_LABORAL (WORKING_STATUS) or ACTIVIDAD (ACTIVITY).
2) MEDDOPROF-CLASS: Participants must find the beginning and end of occupation mentions and classify them according to their referent (PACIENTE [patient], FAMILIAR [family member], SANITARIO [health professional] or OTRO [other]).
3) MEDDOPROF-NORM: Participants must find the beginning and end of occupation mentions and normalize them according to a reference code list.
This repository hosts the 344 files that make up the test set. ONLY the text files to be used for the evaluation phase have been uploaded. The Gold Standard annotations will be included once the evaluation phase is over.
Resources:
- Web
- Training Data
- Codes Reference List (for MEDDOPROF-NORM)
- Annotation Guidelines
MEDDOPROF is part of the IberLEF 2021 workshop, which is co-located with the SEPLN 2021 conference. For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es
MEDDOPROF is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL).
MEDDOPROF corpus: training set
(European Organization for Nuclear Research, 2021) Eulàlia Farré-Maduell; Salvador Lima-López; Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Martin Krallinger
The MEDDOPROF Shared Task tackles the detection of occupations and employment statuses in Spanish clinical cases from different specialties. Systems capable of automatically processing clinical texts are of interest to the medical community, social workers, researchers, the pharmaceutical industry, computer engineers, AI developers, policy makers, citizens' associations and patients. Additionally, other NLP tasks (such as anonymization) can also benefit from this type of data.
MEDDOPROF has three sub-tasks:
1) MEDDOPROF-NER: Participants must find the beginning and end of occupation mentions and classify them as PROFESION (PROFESSION) or SITUACION_LABORAL (WORKING_STATUS).
2) MEDDOPROF-CLASS: Participants must find the beginning and end of occupation mentions and classify them according to their referent (PACIENTE [patient], FAMILIAR [family member], SANITARIO [health professional] or OTRO [other]).
3) MEDDOPROF-NORM: Participants must find the beginning and end of occupation mentions and normalize them according to a reference code list.
MEDDOPROF is part of the IberLEF 2021 workshop, which is co-located with the SEPLN 2021 conference. For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es
MEDDOPROF is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL).
Resources:
- Web
- Annotation Guidelines
MEDDOPROF corpus: training set
(Figshare (United Kingdom), 2021) Eulàlia Farré-Maduell; Salvador Lima López; Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Martin Krallinger
The MEDDOPROF Shared Task tackles the detection of occupations and employment statuses in Spanish clinical cases from different specialties. Systems capable of automatically processing clinical texts are of interest to the medical community, social workers, researchers, the pharmaceutical industry, computer engineers, AI developers, policy makers, citizens' associations and patients. Additionally, other NLP tasks (such as anonymization) can also benefit from this type of data.
MEDDOPROF has three sub-tasks:
1) MEDDOPROF-NER: Participants must find the beginning and end of occupation mentions and classify them as PROFESION (PROFESSION), SITUACION_LABORAL (WORKING_STATUS) or ACTIVIDAD (ACTIVITY).
2) MEDDOPROF-CLASS: Participants must find the beginning and end of occupation mentions and classify them according to their referent (PACIENTE [patient], FAMILIAR [family member], SANITARIO [health professional] or OTRO [other]).
3) MEDDOPROF-NORM: Participants must find the beginning and end of occupation mentions and normalize them according to a reference code list.
MEDDOPROF is part of the IberLEF 2021 workshop, which is co-located with the SEPLN 2021 conference. For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es
MEDDOPROF is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL).
UPDATE 26/04/21: A new version of the training data has been uploaded after minor errors were detected in some of the annotations. Training data for Task 3 (MEDDOPROF-NORM) has also been added. Please make sure to download the latest version!
Resources:
- Web
- Codes Reference List (for MEDDOPROF-NORM)
- Annotation Guidelines
MEDDOPROF corpus: training set + complementary entities
(Figshare (United Kingdom), 2021) Eulàlia Farré-Maduell; Salvador Lima López; Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Martin Krallinger
The MEDDOPROF Shared Task tackles the detection of occupations and employment statuses in Spanish clinical cases from different specialties. Systems capable of automatically processing clinical texts are of interest to the medical community, social workers, researchers, the pharmaceutical industry, computer engineers, AI developers, policy makers, citizens' associations and patients. Additionally, other NLP tasks (such as anonymization) can also benefit from this type of data.
MEDDOPROF has three sub-tasks:
1) MEDDOPROF-NER: Participants must find the beginning and end of occupation mentions and classify them as PROFESION (PROFESSION), SITUACION_LABORAL (WORKING_STATUS) or ACTIVIDAD (ACTIVITY).
2) MEDDOPROF-CLASS: Participants must find the beginning and end of occupation mentions and classify them according to their referent (PACIENTE [patient], FAMILIAR [family member], SANITARIO [health professional] or OTRO [other]).
3) MEDDOPROF-NORM: Participants must find the beginning and end of occupation mentions and normalize them according to a reference code list.
UPDATE 20/05/21: We have updated the .zip file to include additional, automatically labelled annotations that complement the Gold Standard MEDDOPROF corpus. This complement is called MEDDOPROF-CE (Complementary Entities). The CE version of the training data includes the Shared Task's original manual annotations (with the labels for tasks one and two joined together, e.g. "PACIENTE-PROFESION") and automatically generated clinical and linguistic entities. All in all, nine new entity types have been included: "síntoma" (symptom), "enfermedad" (disease), "procedimiento" (procedure), "fármaco" (drug), "org_vivo" (living organisms), "neg"/"nsco" (negation trigger and scope) and "unc"/"usco" (uncertainty trigger and scope).
The entities in the MEDDOPROF-CE version will not be evaluated in the task, but they can be used to test the impact of other entity types on the Shared Task's tracks or for information discovery. We encourage participants to be creative and incorporate these additional layers into their systems as they wish.
Please cite if you use this resource:
Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Brivá-Iglesias and Martin Krallinger. NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. In Procesamiento del Lenguaje Natural, 67. 2021.
<pre><code>@article{meddoprof,
  title = {NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts},
  author = {Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin},
  journal = {Procesamiento del Lenguaje Natural},
  volume = {67},
  year = {2021},
  issn = {1989-7553},
  url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6393},
  pages = {243--256}
}</code></pre>
Resources:
- Web
- Complete corpus
- Test set
- Codes Reference List (for MEDDOPROF-NORM)
- Annotation Guidelines
MEDDOPROF is part of the IberLEF 2021 workshop, which is co-located with the SEPLN 2021 conference. For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es
MEDDOPROF is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL).
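The joined labels mentioned above combine one NER label and one referent label with a dash, and the two examples given in these records use opposite orders (PROFESION-PACIENTE and "PACIENTE-PROFESION"). The Python sketch below shows one way such a label might be split regardless of order. The function name split_joint_label is made up for this example, and the label sets are taken from the sub-task descriptions; whether every file uses exactly these spellings is an assumption.

```python
from typing import Optional, Tuple

# Label inventories as listed in the sub-task descriptions.
NER_LABELS = {"PROFESION", "SITUACION_LABORAL", "ACTIVIDAD"}
CLASS_LABELS = {"PACIENTE", "FAMILIAR", "SANITARIO", "OTRO"}


def split_joint_label(label: str) -> Optional[Tuple[str, str]]:
    """Split a joined label into (ner_label, class_label).

    Accepts either order ("PROFESION-PACIENTE" or "PACIENTE-PROFESION").
    Returns None for anything else, e.g. complementary entity types
    such as "neg" or "enfermedad".
    """
    parts = label.split("-")
    if len(parts) != 2:
        return None
    first, second = parts
    if first in NER_LABELS and second in CLASS_LABELS:
        return first, second
    if first in CLASS_LABELS and second in NER_LABELS:
        return second, first
    return None
```

Returning None rather than raising an error lets a caller pass every label found in a CE file through the same function and simply skip the automatically generated entity types.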
MEDDOPROF corpus: training set + complementary entities
(European Organization for Nuclear Research, 2021) Eulàlia Farré-Maduell; Salvador Lima López; Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Martin Krallinger
The MEDDOPROF Shared Task tackles the detection of occupations and employment statuses in clinical cases in Spanish from different specialties. Systems capable of automatically processing clinical texts are of interest to the medical community, social workers, researchers, the pharmaceutical industry, computer engineers, AI developers, policy makers, citizen’s associations and patients. Additionally, other NLP tasks (such as anonymization) can also benefit from this type of data. MEDDOPROF has three different sub-tasks: 1) MEDDOPROF-NER: Participants must find the beginning and end of occupation mentions and classify them as PROFESION (PROFESSION), SITUACION_LABORAL (WORKING_STATUS) or ACTIVIDAD (ACTIVIDAD). 2) MEDDOPROF-CLASS: Participants must find the beginning and end of occupation mentions and classify them according to their referent (PACIENTE [patient], FAMILIAR [family member], SANITARIO [health professional] or OTRO [other]). 3) MEDDOPROF-NORM: Participants must find the beginning and end of occupation mentions and normalize them according to a reference codes list. UPDATE 20/05/21: We have updated the .zip file to include additional mentions of automatically labelled annotations to complement the Gold Standard MEDDOPROF corpus. This complement is called MEDDOPROF-CE (Complementary Entities). The CE version of the training data includes the Shared Task's original manual annotations (with the labels for task one and two joint together, e.g. "PACIENTE-PROFESION") and automatically generated clinical and linguistic entities. All in all, nine new entity types have been included: "síntoma" (symptom), "enfermedad" (disease), "procedimiento" (procedure), "fármaco" (drug), "org_vivo" (living organisms), "neg"/"nsco" (negation trigger and scope) and "unc"/"usco" (uncertainty trigger and scope). 
The entities in the MEDDOPROF-CE version will not be evaluated in the task, but they can be used to test the impact of other entity types in the Shared Task's tracks or for information discovery. We encourage participants to be creative and incorporate these additional layers into their systems as they wish. Please cite if you use this resource: Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Brivá-Iglesias and Martin Krallinger. NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. In Procesamiento del Lenguaje Natural, 67. 2021. <pre><code>@article{meddoprof, title={NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts}, author={Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin}, journal = {Procesamiento del Lenguaje Natural}, volume = {67}, year={2021}, issn = {1989-7553}, url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6393}, pages = {243--256} }</code></pre> Resources: - Web - Complete corpus - Test set - Codes Reference List (for MEDDOPROF-NORM) - Annotation Guidelines MEDDOPROF is part of the IberLEF 2021 workshop, which is co-located with the SEPLN 2021 conference. For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es MEDDOPROF is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL).
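The joint labels described above (e.g. "PACIENTE-PROFESION") can be separated back into their NER and referent parts. The following is a minimal sketch in Python, not part of the official MEDDOPROF tooling: the function name `split_joint_label` is hypothetical, and accepting both label orders is an assumption, since the description only gives one example.

```python
# Label inventories as listed in the task description.
NER_LABELS = {"PROFESION", "SITUACION_LABORAL", "ACTIVIDAD"}
CLASS_LABELS = {"PACIENTE", "FAMILIAR", "SANITARIO", "OTRO"}


def split_joint_label(joint):
    """Split a joint label into (ner_label, class_label).

    Assumption: the two parts are separated by a single hyphen and may
    appear in either order.
    """
    parts = joint.split("-")
    if len(parts) != 2:
        raise ValueError(f"expected exactly two parts in {joint!r}")
    first, second = parts
    if first in NER_LABELS and second in CLASS_LABELS:
        return first, second
    if first in CLASS_LABELS and second in NER_LABELS:
        return second, first
    raise ValueError(f"unknown label combination: {joint!r}")


ner_label, class_label = split_joint_label("PACIENTE-PROFESION")
print(ner_label, class_label)
```

Because SITUACION_LABORAL uses an underscore rather than a hyphen, splitting on the hyphen does not break that label apart.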
ProfNER corpus: gold standard annotations for profession detection in Spanish COVID-19 tweets
(European Organization for Nuclear Research, 2020) Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Eulàlia Farré-Maduell; Salvador Lima López; Marvin Aguero; Martin Krallinger
THERE IS A NEWER VERSION (1.3) THAT INCORPORATES THE TEXT FILES AUTOMATICALLY TRANSLATED TO ENGLISH Gold Standard annotations for SMM4H-Spanish shared task. SMM4H 2021 has been accepted at NAACL (scheduled in Mexico City in June) https://2021.naacl.org/. Introduction: The entire corpus contains 10,000 annotated tweets. It has been split into training, validation and test (60-20-20). The current version contains the training and development set of the shared task with Gold Standard annotations. In future versions of the dataset, test and background sets will be released. For subtask-1 (classification), annotations are distributed in a tab-separated file (TSV). The TSV format follows the format employed in SMM4H 2019 Task 2: tweet_id class For subtask-2 (Named Entity Recognition, profession detection), annotations are distributed in 2 formats: Brat standoff and TSV. See the Brat webpage for more information about the Brat standoff format (https://brat.nlplab.org/standoff.html). The TSV format follows the format employed in SMM4H 2019 Task 2: tweet_id begin end type extraction In addition, we provide a tokenized version of the dataset, for participants' convenience. It follows the BIO format (similar to CoNLL). The files were generated with the brat_to_conll.py script (included), which employs the es_core_news_sm-2.3.1 spaCy model for tokenization. Zip structure: txt-files: folder with text files. One text file per tweet. One sub-directory per corpus split (train and valid). subtask-1: One file per corpus split (train.tsv and valid.tsv). subtask-2: brat: folder with annotations in Brat format. One sub-directory per corpus split (train and valid). tsv: folder with annotations in TSV. One file per corpus split (train and valid). BIO: folder with corpus in BIO tagging. One file per corpus split (train and valid). Annotation quality: We have performed a consistency analysis of the corpus. 
10% of the documents have been annotated by an internal annotator as well as by the expert linguists, following the same annotation guidelines. The preliminary Inter-Annotator Agreement (pairwise agreement) is 0.919. Important shared task information: SYSTEM PREDICTIONS MUST FOLLOW THE TSV FORMAT. Systems will only be evaluated on the PROFESION and SITUACION_LABORAL predictions (even though the Gold Standard contains 2 extra entity classes). For more information about the evaluation scenario, see the Codalab link, or the evaluation webpage. For further information, please visit https://temu.bsc.es/smm4h-spanish/ or email us at encargo-pln-life@bsc.es Do not share the data with other individuals/teams without permission from the task organizer. Tweet IDs are the primary source of information. Tweet texts are provided as support material. By downloading this resource, you agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy. Resources: Web Annotation guidelines (in Spanish) Annotation guidelines (in English) FastText COVID-19 Twitter embeddings
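To make the subtask-2 TSV layout (tweet_id, begin, end, type, extraction) and the evaluation restriction concrete, here is a minimal Python sketch. It is not the organizers' evaluation code. The names `Mention`, `read_mentions` and `keep_evaluated` are hypothetical, the sample rows are made up, and the presence of a header row is an assumption that the reader simply tolerates.

```python
import csv
import io
from typing import NamedTuple

# Labels that the task description says are evaluated.
EVALUATED_LABELS = {"PROFESION", "SITUACION_LABORAL"}


class Mention(NamedTuple):
    tweet_id: str
    begin: int
    end: int
    label: str
    text: str


def read_mentions(handle):
    """Read tab-separated rows of: tweet_id, begin, end, type, extraction."""
    mentions = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Skip a possible header row and malformed lines (assumption).
        if len(row) != 5 or not (row[1].isdigit() and row[2].isdigit()):
            continue
        mentions.append(Mention(row[0], int(row[1]), int(row[2]), row[3], row[4]))
    return mentions


def keep_evaluated(mentions):
    """Drop mentions whose label is not scored in the shared task."""
    return [m for m in mentions if m.label in EVALUATED_LABELS]


# Made-up sample rows for illustration only; not taken from the corpus.
SAMPLE = (
    "tweet_id\tbegin\tend\ttype\textraction\n"
    "1001\t14\t37\tPROFESION\ttrabajadores sanitarios\n"
    "1002\t0\t10\tACTIVIDAD\tvoluntario\n"
)
mentions = read_mentions(io.StringIO(SAMPLE))
evaluated = keep_evaluated(mentions)
print(len(mentions), len(evaluated))
```

Filtering predictions to the evaluated labels before submission keeps the output aligned with what is scored, although predictions for other classes may simply be ignored by the evaluator.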
ProfNER corpus: gold standard annotations for profession detection in Spanish COVID-19 tweets
(European Organization for Nuclear Research, 2020) Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Eulàlia Farré-Maduell; Salvador Lima López; Marvin Aguero; Martin Krallinger
THERE IS A NEWER VERSION (1.3) THAT INCORPORATES THE UNANNOTATED TEST AND BACKGROUND FILES Gold Standard annotations for SMM4H-Spanish shared task. SMM4H 2021 has been accepted at NAACL (scheduled in Mexico City in June) https://2021.naacl.org/. Introduction: The entire corpus contains 10,000 annotated tweets. It has been split into training, validation and test (60-20-20). The current version contains the training and development set of the shared task with Gold Standard annotations. In future versions of the dataset, test and background sets will be released. For subtask-1 (classification), annotations are distributed in a tab-separated file (TSV). The TSV format follows the format employed in SMM4H 2019 Task 2: tweet_id class For subtask-2 (Named Entity Recognition, profession detection), annotations are distributed in 2 formats: Brat standoff and TSV. See the Brat webpage for more information about the Brat standoff format (https://brat.nlplab.org/standoff.html). The TSV format follows the format employed in SMM4H 2019 Task 2: tweet_id begin end type extraction In addition, we provide a tokenized version of the dataset, for participants' convenience. It follows the BIO format (similar to CoNLL). The files were generated with the brat_to_conll.py script (included), which employs the es_core_news_sm-2.3.1 spaCy model for tokenization. Zip structure: txt-files: folder with text files. One text file per tweet. One sub-directory per corpus split (train and valid). txt-files-english: folder with text files Machine Translated to English. subtask-1: One file per corpus split (train.tsv and valid.tsv). subtask-2: brat: folder with annotations in Brat format. One sub-directory per corpus split (train and valid). tsv: folder with annotations in TSV. One file per corpus split (train and valid). BIO: folder with corpus in BIO tagging. One file per corpus split (train and valid). Annotation quality: We have performed a consistency analysis of the corpus. 
10% of the documents have been annotated by an internal annotator as well as by the expert linguists, following the same annotation guidelines. The preliminary Inter-Annotator Agreement (pairwise agreement) is 0.919. Important shared task information: SYSTEM PREDICTIONS MUST FOLLOW THE TSV FORMAT. Systems will only be evaluated on the PROFESION and SITUACION_LABORAL predictions (even though the Gold Standard contains 2 extra entity classes). For more information about the evaluation scenario, see the Codalab link, or the evaluation webpage. For further information, please visit https://temu.bsc.es/smm4h-spanish/ or email us at encargo-pln-life@bsc.es Do not share the data with other individuals/teams without permission from the task organizer. Tweet IDs are the primary source of information. Tweet texts are provided as support material. By downloading this resource, you agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy. Resources: Web Annotation guidelines (in Spanish) Annotation guidelines (in English) FastText COVID-19 Twitter embeddings Occupations gazetteer
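The BIO version of the corpus mentioned above pairs each token with a B-, I- or O tag. The sketch below illustrates the idea in Python under stated assumptions: it uses plain whitespace tokenization rather than the spaCy model used by the included brat_to_conll.py script, it treats end offsets as exclusive, and the function name `spans_to_bio` and the example sentence are made up for illustration.

```python
import re


def spans_to_bio(text, spans):
    """Tag whitespace tokens with BIO labels.

    spans: list of (begin, end, label) character offsets, end exclusive
    (an assumption). A token is tagged if it overlaps a span; the first
    overlapping token gets B-, the following ones get I-.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for begin, end, label in spans:
            if match.start() < end and match.end() > begin:
                prefix = "B-" if match.start() <= begin else "I-"
                tag = prefix + label
                break
        tagged.append((match.group(), tag))
    return tagged


# Made-up example sentence; offsets are computed rather than hard-coded.
text = "Gracias a los trabajadores sanitarios"
mention = "trabajadores sanitarios"
begin = text.index(mention)
bio = spans_to_bio(text, [(begin, begin + len(mention), "PROFESION")])
for token, tag in bio:
    print(token, tag)
```

Whitespace tokenization leaves punctuation attached to words, so the output will differ from the distributed BIO files wherever the spaCy tokenizer splits tokens more finely.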
ProfNER corpus: gold standard annotations for profession detection in Spanish COVID-19 tweets
(2020) Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Eulàlia Farré-Maduell; Salvador Lima López; Marvin Aguero; Martin Krallinger
Gold Standard annotations for SMM4H-Spanish shared task and unannotated test and background files. SMM4H 2021 has been accepted at NAACL (scheduled in Mexico City in June) https://2021.naacl.org/. Please cite: Miranda-Escalada, A., Farré-Maduell, E., Lima-López, S., Gascó, L., Briva-Iglesias, V., Agüero-Torales, M., & Krallinger, M. (2021, June). The profner shared task on automatic recognition of occupation mentions in social media: systems, evaluation, guidelines, embeddings and corpora. In Proceedings of the Sixth Social Media Mining for Health (# SMM4H) Workshop and Shared Task (pp. 13-20). <pre><code>@inproceedings{miranda2021profner, title={The profner shared task on automatic recognition of occupation mentions in social media: systems, evaluation, guidelines, embeddings and corpora}, author={Miranda-Escalada, Antonio and Farr{\'e}-Maduell, Eul{\`a}lia and Lima-L{\'o}pez, Salvador and Gasc{\'o}, Luis and Briva-Iglesias, Vicent and Ag{\"u}ero-Torales, Marvin and Krallinger, Martin}, booktitle={Proceedings of the Sixth Social Media Mining for Health (\# SMM4H) Workshop and Shared Task}, pages={13--20}, year={2021} }</code></pre> Introduction: The entire corpus contains 10,000 annotated tweets. It has been split into training, validation and test (60-20-20). The current version contains the training and development set of the shared task with Gold Standard annotations. In addition, it contains the unannotated test and background sets. Participants must submit predictions for the files under the directory "test-background-txt-files". For subtask-1 (classification), annotations are distributed in a tab-separated file (TSV). The TSV format follows the format employed in SMM4H 2019 Task 2: tweet_id class For subtask-2 (Named Entity Recognition, profession detection), annotations are distributed in 2 formats: Brat standoff and TSV. See the Brat webpage for more information about the Brat standoff format (https://brat.nlplab.org/standoff.html). 
The TSV format follows the format employed in SMM4H 2019 Task 2: tweet_id begin end type extraction In addition, we provide a tokenized version of the dataset, for participants' convenience. It follows the BIO format (similar to CoNLL). The files were generated with the brat_to_conll.py script (included), which employs the es_core_news_sm-2.3.1 spaCy model for tokenization. Zip structure: subtask-1: files of tweet classification subtask. Content: One TSV file per corpus split (train and valid). train-valid-txt-files: folder with training and validation text files. One text file per tweet. One sub-directory per corpus split (train and valid). train-valid-txt-files-english: folder with training and validation text files Machine Translated to English. test-background-txt-files: folder with the test and background text files. You must make your predictions for these files and upload them to CodaLab. subtask-2: files of Named Entity Recognition subtask. Content: brat: folder with annotations in Brat format. One sub-directory per corpus split (train and valid). TSV: folder with annotations in TSV. One file per corpus split (train and valid). BIO: folder with corpus in BIO tagging. One file per corpus split (train and valid). train-valid-txt-files: folder with training and validation text files. One text file per tweet. One sub-directory per corpus split (train and valid). train-valid-txt-files-english: folder with training and validation text files Machine Translated to English. test-background-txt-files: folder with the test and background text files. You must make your predictions for these files and upload them to CodaLab. Annotation quality: We have performed a consistency analysis of the corpus. 10% of the documents have been annotated by an internal annotator as well as by the expert linguists, following the same annotation guidelines. The preliminary Inter-Annotator Agreement (pairwise agreement) is 0.919. 
For further information, please visit https://temu.bsc.es/smm4h-spanish/ or email us at encargo-pln-life@bsc.es Do not share the data with other individuals/teams without permission from the task organizer. Tweet IDs are the primary source of information. Tweet texts are provided as support material. By downloading this resource, you agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy. Resources: Web Annotation guidelines (in Spanish) Annotation guidelines (in English) FastText COVID-19 Twitter embeddings Occupations gazetteer Conference Proceedings
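For readers unfamiliar with the Brat standoff format referenced above, the following Python sketch parses a single text-bound annotation line (ID, then label and character offsets, then the covered text, separated by tabs). It is an illustrative reader, not the parser used by the organizers; the function name `parse_ann_line` and the sample line are hypothetical.

```python
def parse_ann_line(line):
    """Parse one text-bound ('T') line of a Brat .ann file.

    Returns a dict, or None for other annotation kinds (notes,
    relations, attributes), which this sketch ignores.
    """
    if not line.startswith("T"):
        return None
    ann_id, label_and_offsets, covered_text = line.rstrip("\n").split("\t", 2)
    label, offsets = label_and_offsets.split(" ", 1)
    # Discontinuous annotations list several "start end" pairs joined by ';'.
    fragments = [
        tuple(int(number) for number in fragment.split())
        for fragment in offsets.split(";")
    ]
    return {
        "id": ann_id,
        "label": label,
        "fragments": fragments,
        "text": covered_text,
    }


# Made-up sample line for illustration only.
parsed = parse_ann_line("T1\tPROFESION 14 37\ttrabajadores sanitarios\n")
print(parsed)
```

Each parsed fragment can then be written out as one TSV row of tweet_id, begin, end, type and extraction, taking the tweet ID from the .ann file name (an assumption about how the files are named).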
ProfNER corpus: gold standard annotations for profession detection in Spanish COVID-19 tweets
(European Organization for Nuclear Research, 2020) Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Eulàlia Farré-Maduell; Salvador Lima López; Marvin Aguero; Martin Krallinger
Gold Standard annotations for SMM4H-Spanish shared task and unannotated test and background files. SMM4H 2021 has been accepted at NAACL (scheduled in Mexico City in June) https://2021.naacl.org/. Please cite: Miranda-Escalada, A., Farré-Maduell, E., Lima-López, S., Gascó, L., Briva-Iglesias, V., Agüero-Torales, M., & Krallinger, M. (2021, June). The profner shared task on automatic recognition of occupation mentions in social media: systems, evaluation, guidelines, embeddings and corpora. In Proceedings of the Sixth Social Media Mining for Health (# SMM4H) Workshop and Shared Task (pp. 13-20). <pre><code>@inproceedings{miranda2021profner, title={The profner shared task on automatic recognition of occupation mentions in social media: systems, evaluation, guidelines, embeddings and corpora}, author={Miranda-Escalada, Antonio and Farr{\'e}-Maduell, Eul{\`a}lia and Lima-L{\'o}pez, Salvador and Gasc{\'o}, Luis and Briva-Iglesias, Vicent and Ag{\"u}ero-Torales, Marvin and Krallinger, Martin}, booktitle={Proceedings of the Sixth Social Media Mining for Health (\# SMM4H) Workshop and Shared Task}, pages={13--20}, year={2021} }</code></pre> Introduction: The entire corpus contains 10,000 annotated tweets. It has been split into training, validation and test (60-20-20). The current version contains the training and development set of the shared task with Gold Standard annotations. In addition, it contains the unannotated test and background sets. Participants must submit predictions for the files under the directory "test-background-txt-files". For subtask-1 (classification), annotations are distributed in a tab-separated file (TSV). The TSV format follows the format employed in SMM4H 2019 Task 2: tweet_id class For subtask-2 (Named Entity Recognition, profession detection), annotations are distributed in 2 formats: Brat standoff and TSV. See the Brat webpage for more information about the Brat standoff format (https://brat.nlplab.org/standoff.html). 
The TSV format follows the format employed in SMM4H 2019 Task 2: tweet_id begin end type extraction In addition, we provide a tokenized version of the dataset, for participants' convenience. It follows the BIO format (similar to CoNLL). The files were generated with the brat_to_conll.py script (included), which employs the es_core_news_sm-2.3.1 spaCy model for tokenization. Zip structure: subtask-1: files of tweet classification subtask. Content: One TSV file per corpus split (train and valid). train-valid-txt-files: folder with training and validation text files. One text file per tweet. One sub-directory per corpus split (train and valid). train-valid-txt-files-english: folder with training and validation text files Machine Translated to English. test-background-txt-files: folder with the test and background text files. You must make your predictions for these files and upload them to CodaLab. subtask-2: files of Named Entity Recognition subtask. Content: brat: folder with annotations in Brat format. One sub-directory per corpus split (train and valid). TSV: folder with annotations in TSV. One file per corpus split (train and valid). BIO: folder with corpus in BIO tagging. One file per corpus split (train and valid). train-valid-txt-files: folder with training and validation text files. One text file per tweet. One sub-directory per corpus split (train and valid). train-valid-txt-files-english: folder with training and validation text files Machine Translated to English. test-background-txt-files: folder with the test and background text files. You must make your predictions for these files and upload them to CodaLab. Annotation quality: We have performed a consistency analysis of the corpus. 10% of the documents have been annotated by an internal annotator as well as by the expert linguists, following the same annotation guidelines. The preliminary Inter-Annotator Agreement (pairwise agreement) is 0.919. 
For further information, please visit https://temu.bsc.es/smm4h-spanish/ or email us at encargo-pln-life@bsc.es Do not share the data with other individuals/teams without permission from the task organizer. Tweet IDs are the primary source of information. Tweet texts are provided as support material. By downloading this resource, you agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy. Resources: Web Annotation guidelines (in Spanish) Annotation guidelines (in English) FastText COVID-19 Twitter embeddings Occupations gazetteer Conference Proceedings
ProfNER corpus: gold standard annotations for profession detection in Spanish COVID-19 tweets
(European Organization for Nuclear Research, 2020) Antonio Miranda-Escalada; Vicent Brivá-Iglesias; Eulàlia Farré-Maduell; Salvador Lima López; Marvin Aguero; Martin Krallinger
THE LATEST DATASET VERSION IS 1.3 Gold Standard annotations for SMM4H-Spanish shared task. SMM4H 2021 has been accepted at NAACL (scheduled in Mexico City in June) https://2021.naacl.org/. Introduction: The entire corpus contains 10,000 annotated tweets. It has been split into training, validation and test (60-20-20). The current version contains the training and development set of the shared task with Gold Standard annotations. In future versions of the dataset, test and background sets will be released. Annotations are distributed in 2 formats: Brat standoff and TSV. See the Brat webpage for more information about the Brat standoff format (https://brat.nlplab.org/standoff.html). The TSV format follows the format employed in SMM4H 2019 Task 2: Tweet ID Begin End Class text In addition, we provide a tokenized version of the dataset, for participants' convenience. It follows the BIO format (similar to CoNLL). The files were generated with the brat_to_conll.py script (included), which employs the es_core_news_sm-2.3.1 spaCy model for tokenization. Zip structure: txt-files: folder with text files. One text file per tweet. One sub-directory per corpus split (train and valid). brat: folder with annotations in Brat format. One sub-directory per corpus split (train and valid). TSV: folder with annotations in TSV. One file per corpus split (train and valid). BIO: folder with corpus in BIO tagging. One file per corpus split (train and valid). Important shared task information: SYSTEM PREDICTIONS MUST FOLLOW THE TSV FORMAT. Systems will only be evaluated on the PROFESION and SITUACION_LABORAL predictions (even though the Gold Standard contains 2 extra entity classes). For more information about the evaluation scenario, see the Codalab link, or the evaluation webpage. For further information, please visit https://temu.bsc.es/smm4h-spanish/ or email us at encargo-pln-life@bsc.es Do not share the data with other individuals/teams without permission from the task organizer. 
Tweet IDs are the primary source of information. Tweet texts are provided as support material. By downloading this resource, you agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy. Resources: Web Annotation guidelines