Title: | Clinical Natural Language Processing using 'spaCy', 'scispaCy', and 'medspaCy' |
---|---|
Description: | Performs biomedical named entity recognition, Unified Medical Language System (UMLS) concept mapping, and negation detection using the Python 'spaCy', 'scispaCy', and 'medspaCy' packages, and transforms extracted data into a wide format for inclusion in machine learning models. The development of the 'scispaCy' package is described by Neumann (2019) <doi:10.18653/v1/W19-5034>. The 'medspacy' package uses 'ConText', an algorithm for determining the context of clinical statements described by Harkema (2009) <doi:10.1016/j.jbi.2009.05.002>. Clinspacy also supports entity embeddings from 'scispaCy' and UMLS 'cui2vec' concept embeddings developed by Beam (2018) <arXiv:1804.01486>. |
Authors: | Karandeep Singh [aut, cre], Benjamin Kompa [aut], Andrew Beam [aut], Allen Schmaltz [aut] |
Maintainer: | Karandeep Singh <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.2.9000 |
Built: | 2025-02-19 05:29:36 UTC |
Source: | https://github.com/kdpsingh/clinspacy |
This function binds columns containing either the lemma of the entity or the UMLS concept unique identifier (CUI) with frequencies to a data frame. The resulting data frame can be used to train a machine learning model or for additional feature selection.
bind_clinspacy( clinspacy_output, df, cs_col = NULL, df_id = NULL, subset = "is_negated == FALSE" )
clinspacy_output |
A data.frame or file name containing the output from
|
df |
The data.frame to which you would like to bind the output of
|
cs_col |
Name of the column in the |
df_id |
The name of the |
subset |
Logical criteria represented as a string by which the
|
A data frame containing the original data frame as well as an additional column for each lemma or UMLS concept unique identifier found, with values containing frequencies.
## Not run:
mtsamples <- dataset_mtsamples()
mtsamples[1:5,] %>%
  clinspacy(df_col = 'description') %>%
  bind_clinspacy(mtsamples[1:5,])
## End(Not run)
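The wide-format binding described above can be pictured with a small base R sketch. This is not the package's implementation: the toy data frame stands in for clinspacy() output, and every column name except is_negated (which appears in the default subset string) is an assumption made for illustration.

```r
# Toy stand-in for clinspacy() output. Column names other than
# is_negated are assumptions, not the documented output schema.
clinspacy_output <- data.frame(
  clinspacy_id = c(1, 1, 1, 2, 2),
  lemma = c("diabetes", "ckd", "htn", "ckd", "htn"),
  is_negated = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)
df <- data.frame(
  clinspacy_id = c(1, 2),
  note = c("note one", "note two"),
  stringsAsFactors = FALSE
)

# Evaluate the string criteria against the output columns,
# mirroring subset = "is_negated == FALSE".
subset_expr <- "is_negated == FALSE"
keep <- eval(parse(text = subset_expr), envir = clinspacy_output)
kept <- clinspacy_output[keep, ]

# Count each lemma per note and spread the counts into wide columns.
counts <- as.data.frame.matrix(table(kept$clinspacy_id, kept$lemma))
counts$clinspacy_id <- as.numeric(rownames(counts))

# Bind the frequency columns to the original data frame.
wide <- merge(df, counts, by = "clinspacy_id", all.x = TRUE)
print(wide)
```

In this sketch the negated mention of "htn" in note 1 is dropped before counting, so only non-negated entities contribute to the frequency columns.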
dataset_cui2vec_embeddings dataset included with this package. The embeddings are derived from Andrew Beam's cui2vec R package.
bind_clinspacy_embeddings( clinspacy_output, df, type = "scispacy", df_id = NULL, subset = "is_negated == FALSE" )
clinspacy_output |
A data.frame or file name containing the output from
|
df |
The data.frame to which you would like to bind the output of
|
type |
The type of embeddings to return. One of |
df_id |
The name of the |
subset |
Logical criteria represented as a string by which the
|
Citation
Beam, A.L., Kompa, B., Schmaltz, A., Fried, I., Griffin, W., Palmer, N.P., Shi, X., Cai, T., and Kohane, I.S., 2019. Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data. arXiv preprint arXiv:1804.01486.
License
The cui2vec data is made available under a CC BY 4.0 license. The only change made to the original dataset is the renaming of columns.
A data frame containing the original data frame as well as the concept embeddings. For scispacy embeddings, this returns 200 columns of embeddings. For cui2vec embeddings, this returns 500 columns of embeddings. The resulting data frame can be used to train a machine learning model.
## Not run:
mtsamples <- dataset_mtsamples()
mtsamples[1:5,] %>%
  clinspacy(df_col = 'description', return_scispacy_embeddings = TRUE) %>%
  bind_clinspacy_embeddings(mtsamples[1:5,])
## End(Not run)
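To make the idea of binding embeddings to a data frame concrete, here is a minimal base R sketch. It assumes that entity-level embeddings are averaged within each note before being merged, which this page does not state. The column names (clinspacy_id, emb_001, and so on) and the three-dimensional toy vectors are hypothetical.

```r
# Toy entity-level embeddings with 3 dimensions instead of 200.
# All column names and values here are hypothetical.
entity_embeddings <- data.frame(
  clinspacy_id = c(1, 1, 2),
  is_negated = c(FALSE, FALSE, FALSE),
  emb_001 = c(0.2, 0.4, 1.0),
  emb_002 = c(0.0, 1.0, 0.5),
  emb_003 = c(-1.0, 1.0, 0.0)
)
df <- data.frame(
  clinspacy_id = c(1, 2),
  note = c("note one", "note two"),
  stringsAsFactors = FALSE
)

# Keep non-negated entities, as the default subset string would.
emb_cols <- grep("^emb_", names(entity_embeddings), value = TRUE)
kept <- entity_embeddings[entity_embeddings$is_negated == FALSE, ]

# Average each embedding dimension within a note (assumed aggregation).
note_embeddings <- aggregate(
  kept[emb_cols],
  by = list(clinspacy_id = kept$clinspacy_id),
  FUN = mean
)

# Bind the per-note embedding columns to the original data frame.
bound <- merge(df, note_embeddings, by = "clinspacy_id", all.x = TRUE)
print(bound)
```

The result has one row per note and one column per embedding dimension, which is the shape a downstream model would consume.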
This is the primary function for processing both data frames and character vectors in the clinspacy package.
clinspacy( x, df_col = NULL, df_id = NULL, threshold = 0.99, semantic_types = c(NA, "Acquired Abnormality", "Activity", "Age Group", "Amino Acid Sequence", "Amino Acid, Peptide, or Protein", "Amphibian", "Anatomical Abnormality", "Anatomical Structure", "Animal", "Antibiotic", "Archaeon", "Bacterium", "Behavior", "Biologic Function", "Biologically Active Substance", "Biomedical Occupation or Discipline", "Biomedical or Dental Material", "Bird", "Body Location or Region", "Body Part, Organ, or Organ Component", "Body Space or Junction", "Body Substance", "Body System", "Carbohydrate Sequence", "Cell", "Cell Component", "Cell Function", "Cell or Molecular Dysfunction", "Chemical", "Chemical Viewed Functionally", "Chemical Viewed Structurally", "Classification", "Clinical Attribute", "Clinical Drug", "Conceptual Entity", "Congenital Abnormality", "Daily or Recreational Activity", "Diagnostic Procedure", "Disease or Syndrome", "Drug Delivery Device", "Educational Activity", "Element, Ion, or Isotope", "Embryonic Structure", "Entity", "Environmental Effect of Humans", "Enzyme", "Eukaryote", "Event", "Experimental Model of Disease", "Family Group", "Finding", "Fish", "Food", "Fully Formed Anatomical Structure", "Functional Concept", "Fungus", "Gene or Genome", "Genetic Function", "Geographic Area", "Governmental or Regulatory Activity", "Group", "Group Attribute", "Hazardous or Poisonous Substance", "Health Care Activity", "Health Care Related Organization", "Hormone", "Human", "Human-caused Phenomenon or Process", "Idea or Concept", "Immunologic Factor", "Indicator, Reagent, or Diagnostic Aid", "Individual Behavior", "Injury or Poisoning", "Inorganic Chemical", "Intellectual Product", "Laboratory or Test Result", "Laboratory Procedure", "Language", "Machine Activity", "Mammal", "Manufactured Object", "Medical Device", "Mental or Behavioral Dysfunction", "Mental Process", "Molecular Biology Research Technique", "Molecular Function", "Molecular Sequence", "Natural 
Phenomenon or Process", "Neoplastic Process", "Nucleic Acid, Nucleoside, or Nucleotide", "Nucleotide Sequence", "Occupation or Discipline", "Occupational Activity", "Organ or Tissue Function", "Organic Chemical", "Organism", "Organism Attribute", "Organism Function", "Organization", "Pathologic Function", "Patient or Disabled Group", "Pharmacologic Substance", "Phenomenon or Process", "Physical Object", "Physiologic Function", "Plant", "Population Group", "Professional or Occupational Group", "Professional Society", "Qualitative Concept", "Quantitative Concept", "Receptor", "Regulation or Law", "Reptile", "Research Activity", "Research Device", "Self-help or Relief Organization", "Sign or Symptom", "Social Behavior", "Spatial Concept", "Substance", "Temporal Concept", "Therapeutic or Preventive Procedure", "Tissue", "Vertebrate", "Virus", "Vitamin"), return_scispacy_embeddings = FALSE, verbose = TRUE, output_file = NULL, overwrite = FALSE )
x |
Either a data.frame or a character vector |
df_col |
If |
df_id |
If |
threshold |
Defaults to 0.99. The confidence threshold value used by
clinspacy (can be higher than the |
semantic_types |
Character vector containing any combination of the following: c(NA, "Acquired Abnormality", "Activity", "Age Group", "Amino Acid Sequence", "Amino Acid, Peptide, or Protein", "Amphibian", "Anatomical Abnormality", "Anatomical Structure", "Animal", "Antibiotic", "Archaeon", "Bacterium", "Behavior", "Biologic Function", "Biologically Active Substance", "Biomedical Occupation or Discipline", "Biomedical or Dental Material", "Bird", "Body Location or Region", "Body Part, Organ, or Organ Component", "Body Space or Junction", "Body Substance", "Body System", "Carbohydrate Sequence", "Cell", "Cell Component", "Cell Function", "Cell or Molecular Dysfunction", "Chemical", "Chemical Viewed Functionally", "Chemical Viewed Structurally", "Classification", "Clinical Attribute", "Clinical Drug", "Conceptual Entity", "Congenital Abnormality", "Daily or Recreational Activity", "Diagnostic Procedure", "Disease or Syndrome", "Drug Delivery Device", "Educational Activity", "Element, Ion, or Isotope", "Embryonic Structure", "Entity", "Environmental Effect of Humans", "Enzyme", "Eukaryote", "Event", "Experimental Model of Disease", "Family Group", "Finding", "Fish", "Food", "Fully Formed Anatomical Structure", "Functional Concept", "Fungus", "Gene or Genome", "Genetic Function", "Geographic Area", "Governmental or Regulatory Activity", "Group", "Group Attribute", "Hazardous or Poisonous Substance", "Health Care Activity", "Health Care Related Organization", "Hormone", "Human", "Human-caused Phenomenon or Process", "Idea or Concept", "Immunologic Factor", "Indicator, Reagent, or Diagnostic Aid", "Individual Behavior", "Injury or Poisoning", "Inorganic Chemical", "Intellectual Product", "Laboratory or Test Result", "Laboratory Procedure", "Language", "Machine Activity", "Mammal", "Manufactured Object", "Medical Device", "Mental or Behavioral Dysfunction", "Mental Process", "Molecular Biology Research Technique", "Molecular Function", "Molecular Sequence", "Natural Phenomenon or Process", 
"Neoplastic Process", "Nucleic Acid, Nucleoside, or Nucleotide", "Nucleotide Sequence", "Occupation or Discipline", "Occupational Activity", "Organ or Tissue Function", "Organic Chemical", "Organism", "Organism Attribute", "Organism Function", "Organization", "Pathologic Function", "Patient or Disabled Group", "Pharmacologic Substance", "Phenomenon or Process", "Physical Object", "Physiologic Function", "Plant", "Population Group", "Professional or Occupational Group", "Professional Society", "Qualitative Concept", "Quantitative Concept", "Receptor", "Regulation or Law", "Reptile", "Research Activity", "Research Device", "Self-help or Relief Organization", "Sign or Symptom", "Social Behavior", "Spatial Concept", "Substance", "Temporal Concept", "Therapeutic or Preventive Procedure", "Tissue", "Vertebrate", "Virus", "Vitamin") |
return_scispacy_embeddings |
Defaults to |
verbose |
Defaults to |
output_file |
Defaults to |
overwrite |
Defaults to |
If output_file is NULL (the default), then this function returns a data frame containing the UMLS concept unique identifiers (cui), entities, lemmatized entities, CyContext negation status (TRUE means negated, FALSE means *not* negated), other CyContext contexts, and section title from the clinical sectionizer. If output_file points to a file name, then the name of the created file will be returned.
## Not run:
clinspacy('This patient has diabetes and CKD stage 3 but no HTN.')

clinspacy(c('This pt has CKD and HTN', 'Pt only has CKD but no HTN'))

data.frame(text = c('This pt has CKD and HTN', 'Diabetes is present'),
           stringsAsFactors = FALSE) %>%
  clinspacy(df_col = 'text')

if (!dir.exists(rappdirs::user_data_dir('clinspacy'))) {
  dir.create(rappdirs::user_data_dir('clinspacy'), recursive = TRUE)
}

clinspacy(c('This pt has CKD and HTN', 'Has CKD but no HTN'),
          output_file = file.path(rappdirs::user_data_dir('clinspacy'),
                                  'output.csv'),
          overwrite = TRUE)
## End(Not run)
Initializes clinspacy. This function is optional to run but gives you more control over the parameters used by scispacy at initiation. If you do not run this function, it will be run with default parameters the first time that any of the package functions are run.
clinspacy_init( miniconda = TRUE, use_linker = FALSE, linker_threshold = 0.99, ... )
miniconda |
Defaults to TRUE, which results in miniconda being installed
(~400 MB) and configured with the "clinspacy" conda environment. If you
want to override this behavior, set |
use_linker |
Defaults to |
linker_threshold |
Defaults to 0.99. This argument is only relevant if
|
... |
Additional settings available from: https://github.com/allenai/scispacy. |
No return value.
This dataset contains definitions for the Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). These come from Andrew Beam's cui2vec R package.
dataset_cui2vec_definitions()
A data frame with 3053795 rows and 3 variables:
A Unified Medical Language System (UMLS) Concept Unique Identifier (CUI)
Semantic type of the CUI
Definition of the CUI
License
This data is made available under an MIT license. The data is copyrighted in 2019 by Benjamin Kompa, Andrew Beam, and Allen Schmaltz. The only change made to the original dataset is the renaming of columns.
Returns the cui2vec UMLS definitions as a data frame.
https://github.com/beamandrew/cui2vec
This dataset contains Unified Medical Language System (UMLS) concept embeddings from Andrew Beam's cui2vec R package. There are 500 embeddings included for each concept.
dataset_cui2vec_embeddings()
A data frame with 109053 rows and 501 variables:
A Unified Medical Language System (UMLS) Concept Unique Identifier (CUI)
Concept embedding vector #1
Concept embedding vector #2
and so on...
Concept embedding vector #500
This dataset is not viewable until it has been downloaded, which will occur the very first time you run clinspacy_init() after installing this package.
Citation
Beam, A.L., Kompa, B., Schmaltz, A., Fried, I., Griffin, W., Palmer, N.P., Shi, X., Cai, T., and Kohane, I.S., 2019. Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data. arXiv preprint arXiv:1804.01486.
License
This data is made available under a CC BY 4.0 license. The only change made to the original dataset is the renaming of columns.
Returns the cui2vec UMLS embeddings as a data frame.
https://figshare.com/s/00d69861786cd0156d81
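One common use of concept embeddings like these is comparing two concepts by cosine similarity. The sketch below uses a made-up table with the same general shape as the dataset (one identifier column followed by embedding columns), but with 4 dimensions instead of 500. The CUIs, values, and column names are invented for illustration and are not taken from the real data.

```r
# Toy embedding table: 1 id column + 4 embedding columns.
# The CUIs, values, and column names are made up.
embeddings <- data.frame(
  cui = c("C0000001", "C0000002", "C0000003"),
  emb_001 = c(1, 2, -1),
  emb_002 = c(0, 0, 0),
  emb_003 = c(2, 4, -2),
  emb_004 = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Cosine similarity: dot product divided by the product of vector norms.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Pull one concept's embedding as a plain numeric vector.
get_vector <- function(tbl, id) {
  unlist(tbl[tbl$cui == id, -1], use.names = FALSE)
}

sim_12 <- cosine_similarity(get_vector(embeddings, "C0000001"),
                            get_vector(embeddings, "C0000002"))
sim_13 <- cosine_similarity(get_vector(embeddings, "C0000001"),
                            get_vector(embeddings, "C0000003"))
print(c(sim_12 = sim_12, sim_13 = sim_13))
```

Here the second toy vector is a scaled copy of the first, so their similarity is 1, while the third points in a partly opposite direction and scores below zero.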
This dataset contains sample medical transcriptions for various medical specialties.
dataset_mtsamples()
A data frame with 4999 rows and 6 variables:
A unique identifier for each note
A description or chief concern
Medical specialty of the note
mtsamples.com note name
Transcription of note text
Keywords
Acknowledgements
This data was scraped from https://mtsamples.com by Tara Boyle.
License
This data is made available under a CC0: Public Domain license.
Returns the mtsamples dataset as a data frame.
https://www.kaggle.com/tboyle10/medicaltranscriptions/data