Combining Entities Recognized by Different Models & by the AbbreviationDetector

I recently started using both spaCy and ScispaCy, and so far ScispaCy seems like an awesome tool for identifying biomedical entities in text and linking them to concepts in UMLS and other knowledge bases.

I think it would be even more powerful if the entities identified by different models and by the AbbreviationDetector could be combined. One model's shortcomings could then be compensated for by another model, or by the long forms of any detected abbreviations.

For example, the identified entities in "Spinal and bulbar muscular atrophy (SBMA)" using the `en_core_sci_lg` model in the [ScispaCy Demo](url) are: 
- "Spinal"
- "bulbar muscular atrophy"
- "SBMA"

However, after adding the AbbreviationDetector as a pipe, "SBMA" is recognized as an abbreviation for "Spinal and bulbar muscular atrophy". So the entities should really be the following, but the entity list is not corrected to reflect this:
- "Spinal and bulbar muscular atrophy"
- "SBMA"
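To make the desired correction concrete, here is a minimal sketch in plain Python. It is not part of ScispaCy: `apply_long_forms` and `find` are hypothetical helpers, and spans are plain `(start_char, end_char)` tuples typed in by hand. In a real pipeline the entity spans would presumably come from `doc.ents` and the long-form spans from `abrv._.long_form` for each item in `doc._.abbreviations`.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive


def apply_long_forms(entities: List[Span], long_forms: List[Span]) -> List[Span]:
    """Drop entities that overlap a long form, then add the long forms themselves."""

    def overlaps(a: Span, b: Span) -> bool:
        return a[0] < b[1] and b[0] < a[1]

    kept = [e for e in entities if not any(overlaps(e, lf) for lf in long_forms)]
    return sorted(set(kept) | set(long_forms))


def find(text: str, phrase: str) -> Span:
    """Hypothetical helper: character span of the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase))


text = "Spinal and bulbar muscular atrophy (SBMA)"

# Entities as reported by the demo for en_core_sci_lg (typed in by hand here).
entities = [find(text, "Spinal"), find(text, "bulbar muscular atrophy"), find(text, "SBMA")]
# Long form that the AbbreviationDetector associates with "SBMA".
long_forms = [find(text, "Spinal and bulbar muscular atrophy")]

merged = apply_long_forms(entities, long_forms)
print([text[s:e] for s, e in merged])
# ['Spinal and bulbar muscular atrophy', 'SBMA']
```

The fragments "Spinal" and "bulbar muscular atrophy" are dropped because they overlap the long form, while "SBMA" is kept because it does not.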

Similarly, one model may split a phrase into separate entities while another recognizes the whole phrase as one entity, and some models may recognize entities that others miss entirely. Consolidating the entities found by different models should give a more accurate and complete list than any single model provides on its own.
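One possible consolidation rule is "pool the spans from every model and keep the longest non-overlapping ones". The sketch below shows that rule in plain Python. The function name `consolidate`, the model names `model_a` and `model_b`, and their outputs are all made up for illustration. As far as I know, spaCy's `spacy.util.filter_spans` applies a similar longest-first rule to `Span` objects, so it might be usable for the real thing.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive


def consolidate(spans_by_model: Dict[str, List[Span]]) -> List[Span]:
    """Pool spans from all models and greedily keep the longest non-overlapping ones."""
    candidates = {span for spans in spans_by_model.values() for span in spans}
    # Longest first; ties are broken by the earlier start offset.
    ordered = sorted(candidates, key=lambda s: (-(s[1] - s[0]), s[0]))
    chosen: List[Span] = []
    for start, end in ordered:
        if all(end <= c_start or start >= c_end for c_start, c_end in chosen):
            chosen.append((start, end))
    return sorted(chosen)


def find(text: str, phrase: str) -> Span:
    """Hypothetical helper: character span of the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase))


text = "inherited motor neuron disease caused by the expansion of a polyglutamine tract"

# Made-up outputs for two hypothetical models, for illustration only.
spans_by_model = {
    "model_a": [find(text, "inherited motor neuron disease"), find(text, "expansion")],
    "model_b": [
        find(text, "inherited"),
        find(text, "motor neuron disease"),
        find(text, "polyglutamine tract"),
    ],
}

print([text[s:e] for s, e in consolidate(spans_by_model)])
# ['inherited motor neuron disease', 'expansion', 'polyglutamine tract']
```

The whole phrase from `model_a` wins over the fragments from `model_b`, and each model contributes an entity the other one missed.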

A longer entity is not always better, though, because its matches may fall below the desired mention threshold for a given knowledge base. For example, in the [ScispaCy Demo](https://scispacy.apps.allenai.org/), the `en_core_sci_md` model identifies "inherited motor neuron disease" as an entity but returns no results that satisfy the mention threshold of 0.85. On the other hand, the `en_core_sci_sm` model identifies "inherited" and "motor neuron disease" as separate entities, each of which has matches above the 0.85 mention threshold. It would therefore help to keep track of the original, unconsolidated entities from each model and fall back to the next-longest entities whose matches are above the desired mention threshold.
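The fallback described here could look roughly like the sketch below: discard every candidate whose best linking score is under the threshold, then keep the longest remaining non-overlapping spans. The function `select_linkable` is hypothetical, and the scores are invented to mirror the demo behaviour described above. In a real pipeline the score would presumably be the best score among the `(cui, score)` pairs in `entity._.kb_ents`.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive


def select_linkable(scored: Dict[Span, float], threshold: float) -> List[Span]:
    """Keep the longest non-overlapping spans whose best link score meets the threshold."""
    linkable = [span for span, score in scored.items() if score >= threshold]
    # Longest first, so shorter spans are only used where a longer one was rejected.
    linkable.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    chosen: List[Span] = []
    for start, end in linkable:
        if all(end <= c_start or start >= c_end for c_start, c_end in chosen):
            chosen.append((start, end))
    return sorted(chosen)


text = "inherited motor neuron disease"

# Invented best-match scores per candidate span, for illustration only.
scored = {
    (0, 30): 0.80,   # "inherited motor neuron disease"
    (0, 9): 0.95,    # "inherited"
    (10, 30): 1.00,  # "motor neuron disease"
}


def names(spans: List[Span]) -> List[str]:
    return [text[s:e] for s, e in spans]


print(names(select_linkable(scored, 0.85)))
# ['inherited', 'motor neuron disease']
print(names(select_linkable(scored, 0.75)))
# ['inherited motor neuron disease']
```

With a threshold of 0.85 the long phrase is rejected and the two shorter entities take its place. With a lower threshold the long phrase is kept.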

Overall, a function with the following components would be roughly what I'm looking for:
- Parameters to take in:
   - The text string from which entities will be identified.
   - A boolean for whether or not to identify the long forms of abbreviations as entities. (e.g., True)
   - A list of the desired models to use (e.g., ["en_core_sci_sm", "en_core_sci_scibert", "en_ner_bc5cdr_md"]).
   - A dictionary with any desired configurations of the scispacy linker, including the linker name (e.g., {"resolve_abbreviations": True, "filter_for_definitions": False, "no_definition_threshold": 0.85, "linker_name": "umls"})
  
- Output: A tuple with the following two items:
   - The nlp object that can be used to make the linker to the utilized knowledge base.
   - A Doc object with the longest length entities that also have matches above the user's desired mention threshold.

Here is what using the proposed function, which I call `consolidated_entities_tuple`, might look like (this is NOT functioning code, just an example of how I imagine the functionality):
```
import spacy
import scispacy

from scispacy.linking import EntityLinker
from scispacy.abbreviation import AbbreviationDetector

def consolidated_entities_tuple(text: str, long_form_abbrev_ents: bool,
                                model_list: list, scispacy_linker_config: dict):
    # place code for function here, likely to utilize the imported modules above
    return (nlp, doc)

text = "Spinal and bulbar muscular atrophy (SBMA) is an \
inherited motor neuron disease caused by the expansion \
of a polyglutamine tract within the androgen receptor (AR). \
SBMA can be caused by this easily."

# Unpack the (nlp, doc) tuple returned by the proposed function
nlp, doc = consolidated_entities_tuple(
    text,
    True,
    ["en_core_sci_sm", "en_core_sci_scibert", "en_ner_bc5cdr_md"],
    {"resolve_abbreviations": True, "filter_for_definitions": False,
     "no_definition_threshold": 0.85, "linker_name": "umls"},
)

# Let's look at the first entity
entity = doc.ents[0]

print("Name: ", entity)
>>> Name: Spinal and bulbar muscular atrophy

# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
linker = nlp.get_pipe("scispacy_linker")
for umls_ent in entity._.kb_ents:
    print(linker.kb.cui_to_entity[umls_ent[0]])

>>> CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
>>> Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the
                gene encoding the ANDROGEN RECEPTOR.
>>> TUI(s): T047
>>> Aliases (abbreviated, total: 50):
         Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal Atrophy, X-Linked, ....

>>> CUI: C0752353, Name: Atrophy, Muscular, Spinobulbar
>>> Definition: .....
>>> TUI(s): T047
>>> Aliases: (total: ?):
         ... , ... , ... , ...

>>> .....

# Now let's look at the abbreviations in the text
print("Abbreviation", "\t", "Span", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

>>> Abbreviation	 Span	    Definition
>>> SBMA 		 (33, 34)   Spinal and bulbar muscular atrophy
>>> SBMA 	   	 (6, 7)     Spinal and bulbar muscular atrophy
>>> AR   		 (29, 30)   androgen receptor
```

Thank you for taking the time to read this. If this sort of function already exists in ScispaCy, please let me know. Otherwise, if this sort of function or some other code that accomplishes the same thing can be added to ScispaCy, that would be awesome. I believe it can be a powerful addition to the library. Let me know your thoughts.


Combining Entities Recognized by Different Models & by the AbbreviationDetector #388
