spaCy NLP Pipeline Order of Operations

December 13, 2023

Does anyone have a chronological list of the operations performed by the following code?

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

I can see the major components with nlp.pipe_names.

Solution 1:

The answer to your question is more complicated than I originally thought, so here is a detailed explanation.

spaCy lemmatization is usually performed using a lookup table. That means it is independent of the pipeline components, and lemmatization happens before the pipeline runs. However, for English and Greek, rule-based lemmatization can be performed when the POS tag is available. If the tagger is enabled, spaCy can use the POS tag to find the best lemma for a word based on its tag. In this case, lemmatization happens just after the tagger pipeline component.

Briefly, if the tagger is disabled, a static lemmatization procedure is used: a lookup table maps words to their lemmas, and lemmatization happens before any pipeline component. In contrast, when the tagger is enabled, lemmatization is rule-based and depends on the POS tag, so it happens after the tagger. Again, this second case applies only to languages that support rule-based lemmatization, such as English and Greek.

A code example:

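import spacy

# 'en' is a spaCy 2.x shortcut link; use the full model name
# (e.g. 'en_core_web_sm') if the shortcut is not set up.
nlp = spacy.load('en')
nlp.remove_pipe('parser')
# Uncommenting the following line removes the tagger, so lemmatization
# falls back to the lookup table:
# nlp.remove_pipe('tagger')
nlp.remove_pipe('ner')
doc = nlp('those are random words')
for token in doc:
    print(token.lemma_)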

Output with the line commented out (tagger enabled, rule-based lemmatization): those be random word

Output with the line uncommented (tagger removed, lookup lemmatization): that be random word
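
To make the ordering easier to see, here is a small toy model in plain Python. It is only a sketch of the behaviour described in this answer, not spaCy's actual implementation: the lookup table, the tag dictionary, the rules, and the names `LOOKUP`, `toy_tagger`, `rule_lemma` and `run_toy_pipeline` are all made up for illustration. It records the order in which each step runs and reproduces the two outputs shown above.

```python
# Toy model of the ordering described in this answer. Nothing here is
# spaCy code; every table, rule and name is a hypothetical stand-in.

# Hypothetical lookup table (word -> lemma), used when no tagger runs.
LOOKUP = {"those": "that", "are": "be", "words": "word"}

# Hypothetical tag dictionary standing in for a statistical POS tagger.
TAGS = {"those": "DET", "are": "VERB", "random": "ADJ", "words": "NOUN"}


def toy_tagger(tokens):
    """Assign a coarse POS tag to every token in place."""
    for tok in tokens:
        tok["pos"] = TAGS.get(tok["text"], "X")


def rule_lemma(tok):
    """Pick a lemma using rules that depend on the POS tag."""
    text, pos = tok["text"], tok["pos"]
    if pos == "VERB" and text in ("am", "is", "are"):
        return "be"
    if pos == "NOUN" and text.endswith("s"):
        return text[:-1]
    return text  # other tags (DET, ADJ, ...) keep their surface form


def run_toy_pipeline(text, use_tagger):
    """Return (lemmas, order_of_steps) for a whitespace-tokenized text."""
    order = ["tokenize"]
    tokens = [{"text": t, "pos": None, "lemma": None} for t in text.split()]

    if use_tagger:
        # Rule-based path: lemmas are assigned only after tagging.
        toy_tagger(tokens)
        order.append("tagger")
        for tok in tokens:
            tok["lemma"] = rule_lemma(tok)
        order.append("rule_lemmatize")
    else:
        # Lookup path: lemmas are assigned before any pipeline component.
        for tok in tokens:
            tok["lemma"] = LOOKUP.get(tok["text"], tok["text"])
        order.append("lookup_lemmatize")

    return [tok["lemma"] for tok in tokens], order


if __name__ == "__main__":
    for flag in (True, False):
        lemmas, order = run_toy_pipeline("those are random words", flag)
        print("tagger enabled:", flag, "|", " ".join(lemmas), "|", order)
```

With the toy tagger enabled, "those" keeps its surface form because the rule for determiners leaves it unchanged. With the tagger disabled, the lookup table maps it to "that", which is the same difference the real example shows.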

I hope that clarifies it.
