Disabling Gensim's Removal Of Punctuation Etc. When Parsing A Wiki Corpus

October 30, 2022

I want to train a word2vec model on the English Wikipedia using Python with gensim. I closely followed https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw for that. It works

Solution 1:

I wouldn't be surprised if spaCy operates at the level of sentences. For that it very likely relies on sentence boundaries (period, question mark, etc.). Once the punctuation is stripped, those boundaries are gone, which is why spaCy's NER (or maybe even a POS tagger earlier in the pipeline) might be failing for you.

As for how to represent named entities for gensim's LSI, I would recommend replacing each one with an artificial identifier (a non-existent word). From the model's perspective it makes no difference, and it may save you the burden of reworking gensim's preprocessing.
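A minimal sketch of that substitution, in plain Python. The entity list, the placeholder strings and the helper name replace_entities are all made up for illustration; in practice the entities would come from your NER step. The placeholders are kept short, lowercase and letters-only on the assumption that gensim's default tokenizer would otherwise alter or drop them.

```python
import re

def replace_entities(text, entity_to_placeholder):
    """Replace each known entity string with its single-token placeholder."""
    # Handle longer entities first so "New York City" is not
    # partially consumed by the shorter "New York".
    for entity in sorted(entity_to_placeholder, key=len, reverse=True):
        pattern = r"\b" + re.escape(entity) + r"\b"
        text = re.sub(pattern, entity_to_placeholder[entity], text)
    return text

# Hypothetical entities and placeholders, for illustration only.
entities = {"New York City": "entnyc", "New York": "entny"}
sentence = "She left New York for New York City."
replaced = replace_entities(sentence, entities)
print(replaced)
```

Run this over the raw text before handing it to gensim, so that each multi-word entity reaches the tokenizer as one ordinary-looking word.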

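To make sure the identifier does not collide with a real word, you may want to check it against model.wv.vocab, where model = gensim.models.Word2Vec(...) (in gensim 4.x the vocab attribute was replaced by model.wv.key_to_index). For that you would have to train the model twice. Alternatively, try creating a vocabulary set from the raw text and pick a random set of letters that does not already exist in the vocabulary.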
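The second option, building the vocabulary from the raw text and drawing a random unused string, could look roughly like this. The function names, the sample text and the token length of 8 are assumptions for the sake of the example, not anything gensim prescribes.

```python
import random
import re
import string

def build_vocabulary(raw_text):
    """Collect every lowercase alphabetic word that occurs in the text."""
    return set(re.findall(r"[a-z]+", raw_text.lower()))

def unused_token(vocabulary, length=8, rng=None):
    """Draw random lowercase strings until one is not in the vocabulary."""
    rng = rng or random.Random()
    while True:
        candidate = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if candidate not in vocabulary:
            return candidate

# Hypothetical sample text; in practice this would be the Wikipedia dump text.
raw_text = "Punctuation matters. New York is not the same as new and york!"
vocabulary = build_vocabulary(raw_text)
placeholder = unused_token(vocabulary, rng=random.Random(0))
print(placeholder)
```

With 26^8 possible strings a collision is unlikely even on a large corpus, but the explicit membership check makes the guarantee exact and avoids training the model twice.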

Solution 2:

You can use a gensim word2vec pretrained model in spaCy, but the problem here is your processing pipeline's order:

1. You pass the texts to gensim
2. Gensim parses and tokenizes the strings
3. You normalize the tokens
4. You pass the tokens back to spaCy
5. You make a w2v corpus (with spaCy) (?)

That means the docs are already tokenized when spaCy gets them, and yes, its NER is... complex: https://www.youtube.com/watch?v=sqDHBH9IjRU

What you'd probably like to do is:

1. You pass the texts to spaCy
2. spaCy parses them with NER
3. spaCy tokenizes them accordingly, keeping entities as one token
4. You load the gensim w2v model with spacy.load()
5. You use the loaded model to create the w2v corpus in spaCy

All you need to do is download the model from gensim and tell spaCy to look for it from the command line:

wget [url to model]
python -m spacy init-model [lang] [output dir] --vectors-loc [file you just downloaded]

Here is the command line documentation for init-model: https://spacy.io/api/cli#init-model (this is the spaCy v2 command; in spaCy v3 it was replaced by python -m spacy init vectors).

Then load it just like en_core_web_md, passing the output directory to spacy.load(). You can use .txt, .zip or .tgz models.
