
Given a variable number of strings, I'd like to one-hot encode them as in the following example:

s1 = 'awaken my love'
s2 = 'awaken the beast'
s3 = 'wake beast love'

# desired result - NumPy array
array([[ 1.,  1.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  1.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.,  1.]])

Current code:

def uniquewords(*args):
    """Create order-preserved string with unique words between *args"""
    allwords = ' '.join(args).split()
    return ' '.join(sorted(set(allwords), key=allwords.index)).split()

def encode(*args):
    """One-hot encode the given input strings"""
    unique = uniquewords(*args)
    feature_vectors = np.zeros((len(args), len(unique)))
    for vec, s in zip(feature_vectors, args):
        for num, word in enumerate(unique):                
            vec[num] = word in s
    return feature_vectors

The issue is in this line:

vec[num] = word in s

which, being a substring test, picks up, for instance, 'wake' in 'awaken my love' as True (rightly so for substrings, but not for my needs) and gives the following, slightly-off result:

print(encode(s1, s2, s3))
[[ 1.  1.  1.  0.  0.  1.]
 [ 1.  0.  0.  1.  1.  1.]
 [ 0.  0.  1.  0.  1.  1.]]
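
To make the failure mode concrete: in on a string tests substring containment, while in on a list of words tests whole-word membership.

>>> 'wake' in 'awaken my love'          # substring containment
True
>>> 'wake' in 'awaken my love'.split()  # whole-word membership
False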

I've seen a solution using re but am not sure how to apply here. How can I correct the one-liner above? (Getting rid of the nested loop would be nice too, but I'm not asking for general code editing unless it's kindly offered.)

Brad Solomon

4 Answers


Here's one approach -

import numpy as np
import pandas as pd

def membership(list_strings):
    # Words of each input string
    split_str = [i.split(" ") for i in list_strings]
    # Sorted unique words across all inputs
    split_str_unq = np.unique(np.concatenate(split_str))
    # One row per string: 1 where the unique word appears in it
    out = np.array([np.in1d(split_str_unq, b_i) for b_i in split_str]).astype(int)
    df_out = pd.DataFrame(out, columns=split_str_unq)
    return df_out

Sample run -

In [189]: s1 = 'awaken my love'
     ...: s2 = 'awaken the beast'
     ...: s3 = 'wake beast love'
     ...: 

In [190]: membership([s1,s2,s3])
Out[190]: 
   awaken  beast  love  my  the  wake
0       1      0     1   1    0     0
1       1      1     0   0    1     0
2       0      1     1   0    0     1

Here's another that uses np.searchsorted to get the column indices per row for setting into the output array, and is hopefully faster -

def membership_v2(list_strings):
    split_str = [i.split(" ") for i in list_strings]
    all_strings = np.concatenate(split_str)
    split_str_unq = np.unique(all_strings)
    # Column index of each word within the sorted unique words
    col = np.searchsorted(split_str_unq, all_strings)
    # Row index: each string's row number, repeated once per word in it
    row = np.repeat(np.arange(len(split_str)), [len(i) for i in split_str])
    out = np.zeros((len(split_str), col.max()+1), dtype=int)
    out[row, col] = 1
    df_out = pd.DataFrame(out, columns=split_str_unq)
    return df_out
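
Running it on the same inputs should reproduce the first version's output:

In [191]: membership_v2([s1,s2,s3])
Out[191]: 
   awaken  beast  love  my  the  wake
0       1      0     1   1    0     0
1       1      1     0   0    1     0
2       0      1     1   0    0     1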

Note that returning a dataframe is meant mostly as a better/easier representation of the output; df_out.values gives the plain NumPy array if that's what you need.

Divakar

If you do a slight refactoring so that you treat each sentence as a list of words throughout, it removes a lot of the splitting and joining you're having to do, and naturalises the behaviour of word in s a bit. However, a set is preferred for membership testing, as it can do this in O(1), and you should only construct one per argument iterated over, so your code would become:

import numpy as np
import itertools

def uniquewords(*args):
    """Create order-preserved string with unique words between *args"""
    allwords = list(itertools.chain(*args))
    return sorted(set(allwords), key=allwords.index)

def encode(*args):
    """One-hot encode the given input strings"""
    args_with_words = [arg.split() for arg in args]
    unique = uniquewords(*args_with_words)
    feature_vectors = np.zeros((len(args), len(unique)))
    for vec, s in zip(feature_vectors, args_with_words):
        s_set = set(s)
        for num, word in enumerate(unique):                
            vec[num] = word in s_set
    return feature_vectors

print(encode("awaken my love", "awaken the beast", "wake beast love"))

with the correct output of

[[ 1.  1.  1.  0.  0.  0.]
 [ 1.  0.  0.  1.  1.  0.]
 [ 0.  0.  1.  0.  1.  1.]]

Once you've done this, you might realise you don't really need membership testing at all, and you can just iterate over s, only bothering with words that need to be set to 1. This approach may be significantly faster over larger datasets.

import numpy as np
import itertools

def uniquewords(*args):
    """Dictionary of words to their indices in the matrix"""
    words = {}
    n = 0
    for word in itertools.chain(*args):
        if word not in words:
            words[word] = n
            n += 1
    return words

def encode(*args):
    """One-hot encode the given input strings"""
    args_with_words = [arg.split() for arg in args]
    unique = uniquewords(*args_with_words)
    feature_vectors = np.zeros((len(args), len(unique)))
    for vec, s in zip(feature_vectors, args_with_words):
        for word in s:                
            vec[unique[word]] = 1
    return feature_vectors

print(encode("awaken my love", "awaken the beast", "wake beast love"))
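
which again gives the correct output:

[[ 1.  1.  1.  0.  0.  0.]
 [ 1.  0.  0.  1.  1.  0.]
 [ 0.  0.  1.  0.  1.  1.]]
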
Izaak van Dongen
  • All answers were elegant, but I'm accepting this one because it's 4x faster than the numpy approach and also well explained. – Brad Solomon Aug 22 '17 at 12:38

You can use pandas to create a one-hot encoding transformation from a list of lists (e.g. a list of strings where each string is subsequently split into a list of words).

import pandas as pd

s1 = 'awaken my love'
s2 = 'awaken the beast'
s3 = 'wake beast love'

words = pd.Series([s1, s2, s3])
df = pd.melt(words.str.split().apply(pd.Series).reset_index(), 
             value_name='word', id_vars='index')
result = (
    pd.concat([df['index'], pd.get_dummies(df['word'])], axis=1)
    .groupby('index')
    .any()
).astype(float)
>>> result
       awaken  beast  love  my  the  wake
index                                    
0           1      0     1   1    0     0
1           1      1     0   0    1     0
2           0      1     1   0    0     1

>>> result.values
array([[ 1.,  0.,  1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  1.,  0.,  0.,  1.]])

Explanation

First, create a series from your list of words.

Then split the words into columns and reset the index:

>>> words.str.split().apply(pd.Series).reset_index()
# Output:
#    index       0      1      2
# 0      0  awaken     my   love
# 1      1  awaken    the  beast
# 2      2    wake  beast   love

Then melt this intermediate dataframe, which results in the following:

   index variable    word
0      0        0  awaken
1      1        0  awaken
2      2        0    wake
3      0        1      my
4      1        1     the
5      2        1   beast
6      0        2    love
7      1        2   beast
8      2        2    love

Apply get_dummies to the words and concatenate the results to their index locations. The resulting dataframe is then grouped on index, and any is used as the aggregation (all values are zero or one, so any indicates whether there is at least one instance of that word). This returns a boolean matrix, which is converted to floats. To get the NumPy array, take .values on the result.
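
For reference, the concatenated indicator frame just before the groupby looks like this:

   index  awaken  beast  love  my  the  wake
0      0       1      0     0   0    0     0
1      1       1      0     0   0    0     0
2      2       0      0     0   0    0     1
3      0       0      0     0   1    0     0
4      1       0      0     0   0    1     0
5      2       0      1     0   0    0     0
6      0       0      0     1   0    0     0
7      1       0      1     0   0    0     0
8      2       0      0     1   0    0     0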

Alexander

A set makes the in operator run in O(1) on average.

Change:

vec[num] = word in s

to a whole-word test against a set of the string's words, built once per string so each lookup stays O(1):

words = set(s.split())  # build once per string, before the inner loop
vec[num] = word in words

Final version:

def encode(*args):
    """One-hot encode the given input strings"""
    unique = uniquewords(*args)
    feature_vectors = np.zeros((len(args), len(unique)))
    for vec, s in zip(feature_vectors, args):
        words = set(s.split())  # build the set once per string
        for num, word in enumerate(unique):
            vec[num] = word in words
    return feature_vectors

Result:

[[ 1.  1.  1.  0.  0.  0.]
 [ 1.  0.  0.  1.  1.  0.]
 [ 0.  0.  1.  0.  1.  1.]]
Felipe Cruz