Disabling Gensim's Removal Of Punctuation Etc. When Parsing A Wiki Corpus
Solution 1:
I wouldn't be surprised if spaCy were operating at the level of sentences. For that it very likely relies on sentence boundaries (period, question mark, etc.). That is why spaCy's NER (or maybe even a POS tagger earlier in the pipeline) might be failing for you once the punctuation has been removed.
As for how to represent named entities for gensim's LSI: I would recommend replacing each entity with an artificial identifier (a non-existent word). From the model's perspective it makes no difference, and it may save you the burden of reworking gensim's preprocessing.
To check that the identifier is not already a real word, you can look it up in model.wv.vocab, where model = gensim.models.Word2Vec(...). In gensim 4.x this attribute was replaced by model.wv.key_to_index.
For that you would have to train the model twice. Alternatively, build a vocabulary set from the raw text and pick a random string of letters that does not already occur in it.
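The vocabulary-from-raw-text approach can be sketched with the standard library alone. This is only an illustration under stated assumptions: the helper names (build_vocabulary, make_placeholder, replace_entities) are hypothetical, the entity strings are supplied by hand rather than by an NER model, and the placeholders are kept to lowercase letters on the assumption that the downstream preprocessing keeps purely alphabetic tokens.

```python
import random
import re
import string


def build_vocabulary(raw_text):
    """Collect every lowercase alphabetic token that appears in the raw text."""
    return set(re.findall(r"[a-z]+", raw_text.lower()))


def make_placeholder(taken, length=10, rng=None):
    """Return a random lowercase string that is not in the `taken` set."""
    rng = rng or random.Random()
    while True:
        candidate = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if candidate not in taken:
            return candidate


def replace_entities(raw_text, entities, vocabulary, rng=None):
    """Swap each entity string for its own placeholder; return text and mapping."""
    mapping = {}
    taken = set(vocabulary)
    # Longest entities first, so "New York City" is handled before "New York".
    for entity in sorted(entities, key=len, reverse=True):
        placeholder = make_placeholder(taken, rng=rng)
        taken.add(placeholder)  # never hand out the same placeholder twice
        mapping[entity] = placeholder
        pattern = r"\b" + re.escape(entity) + r"\b"
        raw_text = re.sub(pattern, placeholder, raw_text)
    return raw_text, mapping


# Hypothetical example text and hand-picked entities.
text = "Dr. Smith moved to New York. New York is large."
vocab = build_vocabulary(text)
new_text, mapping = replace_entities(
    text, ["New York", "Dr. Smith"], vocab, rng=random.Random(0)
)
print(new_text)
print(mapping)
```

Keep the mapping around: it is what lets you translate the artificial identifiers back into the original entities after the model has been trained.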
Solution 2:
You can use a gensim word2vec pretrained model in spaCy, but the problem here is your processing pipeline's order:
- You pass the texts to gensim
- Gensim parses and tokenizes the strings
- You normalize the tokens
- You pass the tokens back to spaCy
- You make a w2v corpus (with spaCy) (?)
That means the docs are already tokenized when spaCy gets them, and yes, its NER is... complex: https://www.youtube.com/watch?v=sqDHBH9IjRU
What you'd probably like to do is:
- You pass the texts to spaCy
- spaCy parses them with NER
- spaCy tokenizes them accordingly, keeping entities as one token
- you load the gensim w2v model with spacy.load()
- you use the loaded model to create the w2v corpus in spaCy
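To make "keeping entities as one token" concrete, here is a library-free sketch of the merging step. It assumes the entities are given as (start, end) token index pairs with an exclusive end, which is how spaCy exposes them through ent.start and ent.end on doc.ents. The function name merge_entity_spans and the underscore joiner are my own choices for illustration; in spaCy itself the built-in merge_entities pipeline component does this work for you.

```python
def merge_entity_spans(tokens, entity_spans, joiner="_"):
    """Merge each (start, end) token span into one token; `end` is exclusive."""
    merged = []
    position = 0
    for start, end in sorted(entity_spans):
        if start < position:
            raise ValueError("entity spans must not overlap")
        merged.extend(tokens[position:start])            # tokens before the entity
        merged.append(joiner.join(tokens[start:end]))    # the entity as one token
        position = end
    merged.extend(tokens[position:])                     # tokens after the last entity
    return merged


# Hypothetical tokens and entity spans, as an NER step might produce them.
tokens = ["Barack", "Obama", "visited", "New", "York", "City", "."]
spans = [(0, 2), (3, 6)]
merged = merge_entity_spans(tokens, spans)
print(merged)  # ['Barack_Obama', 'visited', 'New_York_City', '.']
```

The point of the ordering in the list above is that this merge has to happen while the text still has its punctuation and casing, i.e. before any gensim-style normalization.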
All you need to do is download the model from gensim and tell spaCy to look for it from the command line:
- wget [url to model]
- python -m spacy init-model [options] [file you just downloaded]
Here is the command line documentation for init-model: https://spacy.io/api/cli#init-model (this is the spaCy v2 command; in spaCy v3 it was replaced by spacy init vectors).
Then load the resulting model with spacy.load(), just as you would load en_core_web_md. You can use .txt, .zip or .tgz models.