First of all, I am new to Python/NLTK, so my apologies if this question is too basic. I have a large file that I am trying to tokenize, and I get memory errors.
One solution I've read about is to read the file one line at a time, which makes sense. However, when I do that, I get the error "cannot concatenate 'str' and 'list' objects". I am not sure why that error appears, since after reading the file I check its type and it is in fact a string.
I have tried splitting the 7MB file into 4 smaller ones, and when running on those, I get: "error: failed to write data to stream".
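For context, here is a minimal sketch of the kind of split I attempted (not my exact code; the chunk size and output file names are placeholders I made up):

# rough sketch: split a large text file into smaller pieces
# CHUNK_LINES and the output naming are placeholders, not my real values
CHUNK_LINES = 50000
part, out = 0, None
with open(r"X:\MyFile.txt", "r") as src:    # raw string avoids backslash escape issues
    for i, line in enumerate(src):
        if i % CHUNK_LINES == 0:            # start a new output file every CHUNK_LINES lines
            if out:
                out.close()
            part += 1
            out = open(r"X:\MyFile_part%d.txt" % part, "w")
        out.write(line)
if out:
    out.close()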
Finally, when I try a very small sample of the file (100KB or less) with the modified code, I am able to tokenize it.
Any insights into what's happening? Thank you.
# tokenizing large file one line at a time
import nltk
filename = open("X:\MyFile.txt", "r").read()
type(filename)  # str
tokens = ''
for line in filename:
    tokens += nltk.word_tokenize(filename)
    # cannot concatenate 'str' and 'list' objects
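In case it clarifies what I am attempting, here is a minimal sketch of what I think the line-by-line version should look like, assuming word_tokenize returns a list of tokens for each line (so the tokens are collected in a list rather than concatenated onto a string):

# sketch: iterate over the file object line by line and collect tokens in a list
# word_tokenize returns a list, so list.extend avoids the str/list concatenation error
# (may require nltk.download('punkt') to have been run once)
import nltk

tokens = []
with open(r"X:\MyFile.txt", "r") as f:    # iterating the file object yields lines, not characters
    for line in f:
        tokens.extend(nltk.word_tokenize(line))
print(len(tokens))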
The following works with a small file:
import nltk
filename = open("X:\MyFile.txt", "r").read()
type(filename)  # str
tokens = nltk.word_tokenize(filename)