Given a variable number of strings, I'd like to one-hot encode them as in the following example:
s1 = 'awaken my love'
s2 = 'awaken the beast'
s3 = 'wake beast love'
# desired result - NumPy array
array([[ 1., 1., 1., 0., 0., 0.],
       [ 1., 0., 0., 1., 1., 0.],
       [ 0., 0., 1., 0., 1., 1.]])
Current code:
import numpy as np

def uniquewords(*args):
    """Return the unique words across *args as a list, in first-seen order"""
    allwords = ' '.join(args).split()
    return ' '.join(sorted(set(allwords), key=allwords.index)).split()

def encode(*args):
    """One-hot encode the given input strings"""
    unique = uniquewords(*args)
    feature_vectors = np.zeros((len(args), len(unique)))
    for vec, s in zip(feature_vectors, args):
        for num, word in enumerate(unique):
            vec[num] = word in s
    return feature_vectors
The issue is in this line:
vec[num] = word in s
This evaluates 'wake' in 'awaken my love' as True (rightly so, since it is a substring test, but not what I need) and gives the following, slightly-off result:
print(encode(s1, s2, s3))
[[ 1. 1. 1. 0. 0. 1.]
 [ 1. 0. 0. 1. 1. 1.]
 [ 0. 0. 1. 0. 1. 1.]]
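To make the mismatch concrete, here is a minimal sketch comparing the substring test with two whole-word tests. It assumes words are separated by whitespace; the variable names are only illustrative.

```python
import re

s = 'awaken my love'

# Substring test: True, because 'wake' occurs inside 'awaken'.
substring_hit = 'wake' in s

# Token test: False, because 'wake' is not one of the whitespace-separated words.
token_hit = 'wake' in s.split()

# Regex test with word boundaries: also False, since the 'a' before 'wake'
# is a word character, so there is no boundary at that position.
regex_hit = re.search(r'\bwake\b', s) is not None

print(substring_hit, token_hit, regex_hit)
```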
I've seen a solution using re but am not sure how to apply it here. How can I correct the one-liner above? (Getting rid of the nested loop would be nice too, but I'm not asking for general code editing unless it's kindly offered.)
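For reference, one possible direction is sketched below. It is not a definitive fix and assumes that a "word" is any whitespace-separated token. The names `unique_words` and `encode_tokens` are hypothetical, chosen so they don't clash with the functions above. Each string is split once into a set of tokens, so membership is tested against whole words rather than substrings, and the explicit nested loop becomes a nested comprehension.

```python
import numpy as np

def unique_words(*strings):
    """Return the distinct whitespace-separated words in first-seen order."""
    # dict keys keep insertion order (Python 3.7+), so this de-duplicates
    # while preserving the order in which words first appear.
    return list(dict.fromkeys(' '.join(strings).split()))

def encode_tokens(*strings):
    """One-hot encode each string against the shared vocabulary (whole words only)."""
    vocab = unique_words(*strings)
    # Split each string once; set membership is an exact-token test.
    token_sets = [set(s.split()) for s in strings]
    return np.array(
        [[word in tokens for word in vocab] for tokens in token_sets],
        dtype=float,
    )

s1 = 'awaken my love'
s2 = 'awaken the beast'
s3 = 'wake beast love'
print(encode_tokens(s1, s2, s3))
```

With the three example strings, the last column (for 'wake') is set only in the third row, which matches the desired array at the top of the question.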