I applied tensorflow.keras.preprocessing.text.Tokenizer.texts_to_sequences followed by np.hstack to both the training labels and the validation (test) labels.
Surprisingly and mysteriously, for the training labels, the size of the output after np.hstack is different from the size before applying it. However, the shape of the validation labels is unchanged before and after applying tensorflow.keras.preprocessing.text.Tokenizer.texts_to_sequences and np.hstack.
Here is the link to a Google Colab that easily reproduces the error.
The complete code to reproduce the error is given below (in case the link does not work):
!pip install tensorflow==2.1
# For Preprocessing the Text => To Tokenize the Text
from tensorflow.keras.preprocessing.text import Tokenizer
# If the Two Articles are of different length, pad_sequences will make the length equal
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Package for performing Numerical Operations
import numpy as np
Unique_Labels_List = ['India', 'USA', 'Australia', 'Germany', 'Bhutan', 'Nepal', 'New Zealand', 'Israel', 'Canada', 'France', 'Ireland', 'Poland', 'Egypt', 'Greece', 'China', 'Spain', 'Mexico']
Train_Labels = Unique_Labels_List[0:14]
#print('Train Labels = {}'.format(Train_Labels))
Val_Labels = Unique_Labels_List[14:]
#print('Val_Labels = {}'.format(Val_Labels))
No_Of_Train_Items = [248, 200, 200, 218, 248, 248, 249, 247, 220, 200, 200, 211, 224, 209]
No_Val_Items = [212, 200, 219]
T_L = []
for Each_Label, Item in zip(Train_Labels, No_Of_Train_Items):
    T_L.append([Each_Label] * Item)
T_L = [item for sublist in T_L for item in sublist]
V_L = []
for Each_Label, Item in zip(Val_Labels, No_Val_Items):
    V_L.append([Each_Label] * Item)
V_L = [item for sublist in V_L for item in sublist]
len(T_L)
len(V_L)
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(Unique_Labels_List)
# Since it should be a Numpy Array, we should Convert the Sequences to Numpy Array, for both Training and
# Test Labels
training_label_list = label_tokenizer.texts_to_sequences(T_L)
validation_label_list = label_tokenizer.texts_to_sequences(V_L)
training_label_seq = np.hstack(training_label_list)
validation_label_seq = np.hstack(validation_label_list)
print('Actual Number of Train Labels before np.hstack is {}'.format(len(training_label_list)))
print('Number of Train Labels after np.hstack is {}'.format(len(training_label_seq)))
print('-------------------------------------------------------------------------------------------------------')
print('Actual Number of Validation Labels before np.hstack is {}'.format(len(validation_label_list)))
print('Number of Validation Labels after np.hstack (unchanged) is {}'.format(len(validation_label_seq)))
Thank you in advance.
Best Answer
This happens because training_label_list contains inner lists with more than one value. You can verify this with sorted(training_label_list, key=lambda x: len(x), reverse=True).
The reason is that label_tokenizer handles New Zealand as follows:
>>> label_tokenizer.index_word
{1: 'india',
2: 'usa',
3: 'australia',
4: 'germany',
5: 'bhutan',
6: 'nepal',
7: 'new',
8: 'zealand',
9: 'israel',
10: 'canada',
11: 'france',
12: 'ireland',
13: 'poland',
14: 'egypt',
15: 'greece',
16: 'china',
17: 'spain',
18: 'mexico'}
Check out indices 7 and 8.
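Since the Tokenizer splits on whitespace, each 'New Zealand' label maps to the two token ids [7, 8], and np.hstack then flattens every inner list into one array, so the training sequence gains one extra element per 'New Zealand' label. A minimal sketch of the flattening, with token ids hard-coded from the index_word mapping above:

```python
import numpy as np

# Token ids taken from the mapping above: 'india' -> 1, 'new' -> 7, 'zealand' -> 8
token_lists = [[1], [7, 8], [1]]   # 'India', 'New Zealand', 'India'
flat = np.hstack(token_lists)

print(len(token_lists))   # 3 labels before flattening
print(len(flat))          # 4 token ids after np.hstack
print(flat)               # [1 7 8 1]
```

One possible workaround (my suggestion, not part of the original answer) is to remove the space from multi-word labels before fitting the tokenizer, e.g. 'New Zealand' -> 'NewZealand', so that every label maps to exactly one token id.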