word2vec Source Code Analysis
Introduction
Recently, while writing a text-matching model, I needed to feed word vectors into the input and update them during training. I had previously always used gensim to generate word vectors, but with gensim the dictionary and embedding matrix are not easy to get at, so I decided to train the word vectors with TensorFlow instead. From what I have read online, though, word vectors trained with TensorFlow are generally not quite as good as gensim's; I have not read the gensim source code, so if you understand why, please leave a comment, many thanks. Since the word vectors are still updated during model training anyway, the impact should not be too large even if they are not optimal to begin with. With that, let's take a look at TensorFlow's word2vec.
Source Code
As is well known, word2vec comes in two flavors, CBOW and skip-gram: CBOW uses the context to predict the target word, while skip-gram uses the target word to predict its context. The official TensorFlow code implements skip-gram; further below I will show how to rewrite it into the CBOW form. The official code also uses negative sampling to speed up training, with the noise-contrastive estimation (NCE) loss as the loss function, described in detail later. Now let's look at the code.
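To make the difference concrete, here is a toy illustration (my own example, not from the official code) of the training pairs each architecture produces for the window [the, quick, brown, fox, jumps] with target brown and skip_window = 2:
skip-gram pairs: (brown → the), (brown → quick), (brown → fox), (brown → jumps)
CBOW pair: ([the, quick, fox, jumps] → brown)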
TensorFlow ships several versions of the word2vec source code. The core logic is much the same across versions; the differences are mostly speed optimizations. Here we walk through the basic version, word2vec_basic.py.
First, download the training data. The data is a single block of text with no line breaks; since it is English, it is tokenized by splitting on whitespace.
# Step 1: Download the data.
import collections
import math
import os
import random
import urllib.request
import zipfile
from tempfile import gettempdir

import numpy as np
import tensorflow as tf

url = 'http://mattmahoney.net/dc/'

# pylint: disable=redefined-outer-name
def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  local_filename = os.path.join(gettempdir(), filename)
  if not os.path.exists(local_filename):
    local_filename, _ = urllib.request.urlretrieve(url + filename,
                                                   local_filename)
  statinfo = os.stat(local_filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified', filename)
  else:
    print(statinfo.st_size)
    raise Exception('Failed to verify ' + local_filename +
                    '. Can you get to it with a browser?')
  return local_filename

filename = maybe_download('text8.zip', 31344016)

# Read the data into a list of strings.
def read_data(filename):
  """Extract the first file in the zip as a list of words."""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data

vocabulary = read_data(filename)
print('Data size', len(vocabulary))
Next, build the dictionary, replacing rare words with the UNK token.
# Step 2: Build the dictionary and replace rare words with UNK token.
vocabulary_size = 50000

def build_dataset(words, n_words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(n_words - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    index = dictionary.get(word, 0)
    if index == 0:  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  count[0][1] = unk_count
  reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reversed_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocabulary_size)
del vocabulary  # Hint to reduce memory.
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])
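To make the four return values concrete, here is a tiny hand-traceable run (my own toy input, not part of the official script; the toy_* names are mine):

toy_words = ['the', 'the', 'the', 'cat', 'cat', 'dog']
toy_data, toy_count, toy_dict, toy_rev = build_dataset(toy_words, 3)
# toy_count: [['UNK', 1], ('the', 3), ('cat', 2)]
# toy_dict:  {'UNK': 0, 'the': 1, 'cat': 2}
# toy_rev:   {0: 'UNK', 1: 'the', 2: 'cat'}
# toy_data:  [1, 1, 1, 2, 2, 0]  -- 'dog' falls outside the top
#                                   n_words - 1 words and maps to UNK (0)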
Next is one of the more important parts, which I will explain in several pieces. Since this is the skip-gram model, the target is used to predict the context. First look at the three parameters: batch_size is the number of training samples in one batch, and num_skips is how many context words are sampled for each target. Before saying more about num_skips, let's look at skip_window: it is the length of the context window on each side of the target. For example, the sentence "我的爱好是唱、跳、rap" ("my hobby is singing, dancing, rap") tokenizes to [我, 的, 爱好, 是, 唱, 跳, rap]. If the target is 爱好 and skip_window is 2, then the words related to the target are [我, 的, 是, 唱]; if skip_window is 1, the related words are [的, 是]. The span variable defined below is the length of the candidate window including the target, i.e. skip_window * 2 + 1. As for the batch and labels arrays, batch holds the inputs and labels holds the outputs, so the first dimension of both is batch_size.
data_index = 0

# Step 3: Function to generate a training batch for the skip-gram model.
def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]
The code below builds the training set. The logic is a bit involved, so I explain it in the comments. As an example, take [我, 的, 爱好, 是, 唱, 跳, rap] again and suppose these words have dictionary indices [1, 2, 3, 4, 5, 6, 7], with num_skips = 3 and skip_window = 2 (so span = 5). If the target is 爱好, the candidate window is [我, 的, 爱好, 是, 唱], with indices [1, 2, 3, 4, 5]. In the code below, context_words holds the window positions other than the target's, which here correspond to the words with indices [1, 2, 4, 5]; words_to_use randomly samples num_skips of those positions. Suppose the sampled positions land on the words [1, 2, 5]. We then iterate over words_to_use and pair each sampled word with the target, so the training data generated for 爱好 is batch: [3, 3, 3], labels: [1, 2, 5]. The loop for i in range(batch_size // num_skips) then moves on to the next target.
  # A sliding window (deque) of length span over the data.
  buffer = collections.deque(maxlen=span)  # pylint: disable=redefined-builtin
  # data_index is the position of the next word to read, starting from 0;
  # if index + span would run past the end of the data, restart from 0.
  if data_index + span > len(data):
    data_index = 0
  # Fill the window with data.
  buffer.extend(data[data_index:data_index + span])
  data_index += span
  # It takes batch_size // num_skips iterations to fill the batch,
  # producing num_skips examples per iteration.
  for i in range(batch_size // num_skips):
    # Positions of the context words within the window (everything
    # except the target at position skip_window).
    context_words = [w for w in range(span) if w != skip_window]
    # Randomly pick num_skips of those positions.
    words_to_use = random.sample(context_words, num_skips)
    # Pair each chosen context position with the target.
    for j, context_word in enumerate(words_to_use):
      # batch holds the inputs; since this is skip-gram, that is the target.
      batch[i * num_skips + j] = buffer[skip_window]
      # labels holds the outputs: the context word at position context_word.
      labels[i * num_skips + j, 0] = buffer[context_word]
    # If we have reached the end of the data, refill the window from the
    # start; otherwise append one element, which pushes the oldest word
    # out of the window, and advance the index by 1.
    if data_index == len(data):
      buffer.extend(data[0:span])
      data_index = span
    else:
      buffer.append(data[data_index])
      data_index += 1
  # Backtrack a little bit to avoid skipping words in the end of a batch.
  data_index = (data_index + len(data) - span) % len(data)
  return batch, labels
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
  print(batch[i], reverse_dictionary[batch[i]], '-', labels[i, 0],
        reverse_dictionary[labels[i, 0]])
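As promised earlier, here is how the batch generator could be adapted to CBOW. This is a minimal sketch of my own (the name generate_batch_cbow and the shapes are my choices, not from the official script): each example becomes (all 2 * skip_window context words → target), so batch becomes two-dimensional and labels holds the single target id.

def generate_batch_cbow(batch_size, skip_window):
  global data_index
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]
  # Each row of batch holds all context word ids; labels holds the target.
  batch = np.ndarray(shape=(batch_size, 2 * skip_window), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  buffer = collections.deque(maxlen=span)
  if data_index + span > len(data):
    data_index = 0
  buffer.extend(data[data_index:data_index + span])
  data_index += span
  for i in range(batch_size):
    # All positions in the window except the center are the input context.
    context_words = [w for w in range(span) if w != skip_window]
    batch[i] = [buffer[w] for w in context_words]
    labels[i, 0] = buffer[skip_window]
    if data_index == len(data):
      buffer.extend(data[0:span])
      data_index = span
    else:
      buffer.append(data[data_index])
      data_index += 1
  data_index = (data_index + len(data) - span) % len(data)
  return batch, labels

On the graph side, train_inputs would then have shape [batch_size, 2 * skip_window], and the looked-up embeddings would be averaged before the loss, e.g. embed = tf.reduce_mean(tf.nn.embedding_lookup(embeddings, train_inputs), axis=1), so that nce_loss still receives a [batch_size, embedding_size] input.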
The next piece of code builds the model. It is very simple, just an embedding layer, but the loss function is nce_loss, the noise-contrastive estimation loss. Why not use a cross-entropy loss? Mainly because the model uses negative sampling to speed up training: a full softmax over the whole vocabulary is expensive, whereas negative sampling amounts to maximum likelihood estimation over the positive sample and a handful of sampled negatives. I won't go deeper here; if this is unfamiliar, I suggest googling it until it is clear. In the code, embeddings is the embedding matrix, i.e. the word vectors we are after; nce_weights holds the weights of the logistic regression; and nce_biases holds the biases.
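For reference, the negative-sampling objective from the word2vec paper maximizes, for each (target $w_I$, context $w_O$) pair, with $k$ noise words $w_i$ drawn from a noise distribution $P_n(w)$:

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$

where $v$ are the input vectors (embeddings), $v'$ the output vectors (nce_weights), and $\sigma$ the sigmoid. Note that TensorFlow's nce_loss implements the closely related NCE objective, which additionally corrects the logits by the noise distribution.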
# Step 4: Build and train a skip-gram model.
batch_size = 256
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.
num_sampled = 64      # Number of negative examples to sample.

valid_size = 16      # Random set of words to evaluate similarity on.
valid_window = 100   # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)

graph = tf.Graph()

with graph.as_default():

  # Input data.
  with tf.name_scope('inputs'):
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

  # Ops and variables pinned to the CPU because of missing GPU implementation
  with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    with tf.name_scope('embeddings'):
      embeddings = tf.Variable(
          tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
      embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Construct the variables for the NCE loss
    with tf.name_scope('weights'):
      nce_weights = tf.Variable(
          tf.truncated_normal([vocabulary_size, embedding_size],
                              stddev=1.0 / math.sqrt(embedding_size)))
    with tf.name_scope('biases'):
      nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
TensorFlow provides a ready-made loss function for negative sampling, nce_loss. Its API is as follows:
def nce_loss(weights, biases, labels, inputs, num_sampled, num_classes,
             num_true=1,
             sampled_values=None,
             remove_accidental_hits=False,
             partition_strategy="mod",
             name="nce_loss")
The parameters are as follows (with N = num_classes, the vocabulary size, and K = embedding_size):
- weights.shape = (N, K)
- biases.shape = (N,)
- inputs.shape = (batch_size, K)
- labels.shape = (batch_size, num_true)
- num_true: the number of positive classes per example
- num_sampled: how many negative samples to draw
- num_classes = N
- sampled_values: the sampled negatives; if None, a default candidate sampler is used
- remove_accidental_hits: whether to discard sampled negatives that happen to coincide with a positive label
- partition_strategy: the strategy for the parallel embedding_lookup on weights. TF's embedding_lookup runs on the CPU, so locking during multi-threaded table lookups has to be considered here.
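Plugging in the hyperparameters used above, the shapes line up as follows:

- nce_weights: (50000, 128) → weights (N, K)
- nce_biases: (50000,) → biases (N,)
- embed: (256, 128) → inputs (batch_size, embedding_size)
- train_labels: (256, 1) → labels (batch_size, num_true = 1)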
The code that follows builds the loss, trains with stochastic gradient descent, and sets up the evaluation code; it is straightforward, so I won't go into detail.
  with tf.name_scope('loss'):
    loss = tf.reduce_mean(
        tf.nn.nce_loss(
            weights=nce_weights,
            biases=nce_biases,
            labels=train_labels,
            inputs=embed,
            num_sampled=num_sampled,
            num_classes=vocabulary_size))

  # Add the loss value as a scalar to summary.
  tf.summary.scalar('loss', loss)

  # Construct the SGD optimizer using a learning rate of 1.0.
  with tf.name_scope('optimizer'):
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

  # Compute the cosine similarity between minibatch examples and all
  # embeddings.
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings,
                                            valid_dataset)
  similarity = tf.matmul(
      valid_embeddings, normalized_embeddings, transpose_b=True)

  # Merge all summaries.
  merged = tf.summary.merge_all()

  # Add variable initializer.
  init = tf.global_variables_initializer()

  # Create a saver.
  saver = tf.train.Saver()
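Once the training loop has run, the vocabulary and the embedding matrix, the two things that were awkward to get out of gensim, are directly at hand. A minimal sketch, assuming a tf.Session named session has already executed the training steps (the output file names here are my own):

# Row i of the matrix is the vector for reverse_dictionary[i].
final_embeddings = session.run(normalized_embeddings)
np.save('embeddings.npy', final_embeddings)
with open('vocab.txt', 'w') as f:
  for i in range(vocabulary_size):
    f.write(reverse_dictionary[i] + '\n')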
Finally, I recommend the official TensorFlow tutorial on vector representations of words.
The official code takes a single unbroken block of text as input. For data that comes one line at a time, the data-processing part of the code needs to be modified; if you are not sure how, feel free to refer to my code.