Deal with relational data using libFM with blocks

Original: https://thierrysilbermann.wordpress.com/2015/09/17/deal-with-relational-data-using-libfm-with-blocks/

September 17, 2015 ThierryS

An answer to this question: [Example] Files for Block Structure

There is a quick explanation in the README doc here: libFM1.42 Manual

Here is a quick explanation in case you don’t want to read this whole blog post.

I’ll reuse the toy dataset from this previous blog post. Look at it for the meaning of each feature.

train.libfm

and test.libfm

I’ll merge them, which makes the whole process easier:

dataset.libfm

If we want to use the block structure, we first need these 5 files:

rel_user.libfm (features 0, 1 and 6-8 are user features)

In fact, the feature ids do not have to stay split like that (0-1, 6-8). We can renumber them so that they are contiguous (0-1 -> 0-1 and 6-8 -> 2-4)
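The renumbering can be sketched in a few lines of Python. This is only an illustration: compress_feature_ids is a hypothetical helper (it is not part of libFM), and the sample rows are made up rather than taken from the toy dataset.

```python
def compress_feature_ids(rows):
    """Renumber the feature ids used in `rows` to 0..n-1, keeping their order.

    `rows` is a list of libFM-style feature lists such as [(0, 1), (6, 1)].
    Returns the renumbered rows and the old-id -> new-id mapping.
    """
    old_ids = sorted({fid for row in rows for fid, _ in row})
    mapping = {old: new for new, old in enumerate(old_ids)}
    renumbered = [[(mapping[fid], val) for fid, val in row] for row in rows]
    return renumbered, mapping


# Made-up user rows that use the split id ranges 0-1 and 6-8
rows = [[(0, 1), (6, 1)], [(1, 1), (7, 1)], [(0, 1), (8, 1)]]
compressed, mapping = compress_feature_ids(rows)
print(mapping)  # {0: 0, 1: 1, 6: 2, 7: 3, 8: 4}
```

Keeping the ids contiguous inside each block means a block only needs as many columns as it has features.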

rel_product.libfm (features 2-5 and 9 are product features). The ids can be compressed in the same way (2-5 -> 0-3 and 9 -> 4)

rel_user.train (which is now the mapping: the first 3 lines correspond to the first line of rel_user.libfm | /!\ indexing is 0-based)

rel_product.train (which is now the mapping)
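To make the idea of a block plus a mapping concrete, here is a minimal sketch of how both could be derived from per-row features. build_block is a hypothetical helper and the rows are made up. It only illustrates the principle: store each distinct feature line once, and let every training row point at it by index.

```python
def build_block(feature_rows):
    """Split per-row feature strings into a block and a mapping.

    `feature_rows` holds one feature string per training row, e.g. "0:1 2:1".
    Returns (block_lines, mapping): block_lines are the distinct feature
    strings in order of first appearance, and mapping[i] is the 0-based
    index of the block line that row i refers to.
    """
    line_index = {}
    block_lines, mapping = [], []
    for features in feature_rows:
        if features not in line_index:
            line_index[features] = len(block_lines)
            block_lines.append(features)
        mapping.append(line_index[features])
    return block_lines, mapping


# Made-up user features for five ratings given by two distinct users
user_rows = ["0:1 2:1", "0:1 2:1", "0:1 2:1", "1:1 3:1", "1:1 3:1"]
block_lines, mapping = build_block(user_rows)
print(block_lines)  # ['0:1 2:1', '1:1 3:1']
print(mapping)      # [0, 0, 0, 1, 1]
```

When the block is written to disk in libFM text format, each line presumably still needs a leading target value (a dummy such as 0 would do), which would explain why an unused .y file shows up in the conversion step.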

y.train, which contains only the ratings

Almost done…

Now you need to create the .x and .xt files for the user block and the product block. For this you need the tools that come with libFM, available in /bin/ once you have compiled it.

You are forced to use the flag --ofiley even if rel_user.y will never be used. You can delete it every time.

and then
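As a rough sketch, both steps (converting a block to .x, then transposing it to .xt) could be driven from Python. The tool names (convert, transpose) and their flags (--ifile, --ofilex, --ofiley, --ofile) are the ones I know from the libFM manual, so treat them as assumptions and check them against your own build. block_commands and the bin directory default are hypothetical.

```python
def block_commands(block, bin_dir="bin"):
    """Return the two command lines that turn <block>.libfm into .x and .xt."""
    convert = [bin_dir + "/convert",
               "--ifile", block + ".libfm",
               "--ofilex", block + ".x",
               "--ofiley", block + ".y"]  # .y must be requested but is never used
    transpose = [bin_dir + "/transpose",
                 "--ifile", block + ".x",
                 "--ofile", block + ".xt"]
    return convert, transpose


for block in ("rel_user", "rel_product"):
    for cmd in block_commands(block):
        print(" ".join(cmd))
        # To actually run a command: subprocess.run(cmd, check=True)
```

The commands are only printed here, so you can inspect them before running anything.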

Now you can do the same thing for the test set. Because we merged the train and test datasets at the beginning, we only need to generate rel_user.test, rel_product.test and y.test

At this point, you will have a lot of files: (rel_user.train, rel_user.test, rel_user.x, rel_user.xt, rel_product.train, rel_product.test, rel_product.x, rel_product.xt, y.train, y.test)

And run the whole thing:
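A hedged sketch of what the final call could look like, again built as an argument list in Python. The flags -task, -train, -test, -out and --relation are the ones I know from the libFM manual. The binary path and the output filename are hypothetical, and libfm_block_command is just an illustrative helper.

```python
def libfm_block_command(relations, libfm="bin/libFM", out="output"):
    """Build a libFM regression command that uses block (relation) files.

    `relations` are the block base names, e.g. ["rel_user", "rel_product"];
    libFM is then expected to find <name>.x, <name>.xt, <name>.train and
    <name>.test next to y.train and y.test.
    """
    return [libfm, "-task", "r",
            "-train", "y.train", "-test", "y.test",
            "--relation", ",".join(relations),
            "-out", out]


cmd = libfm_block_command(["rel_user", "rel_product"])
print(" ".join(cmd))
```

With these assumptions, it prints bin/libFM -task r -train y.train -test y.test --relation rel_user,rel_product -out output.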

It’s a bit overkill for this problem but I hope you get the point.

Now a real example

For this example, I’ll use the ml-1m.zip MovieLens dataset that you can get from here (1 million ratings)

ratings.dat (sample) / Format: UserID::MovieID::Rating::Timestamp

movies.dat (sample) / Format: MovieID::Title::Genres

users.dat (sample) / Format: UserID::Gender::Age::Occupation::Zip-code
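All three files share the same '::' separated layout, so a single small parser is enough to inspect them. This is only a sketch: parse_line is a hypothetical helper, and the sample lines are illustrative lines written in the formats above.

```python
def parse_line(line, field_names):
    """Split one '::'-separated MovieLens line into a dict keyed by field name."""
    return dict(zip(field_names, line.rstrip("\n").split("::")))


RATING_FIELDS = ["UserID", "MovieID", "Rating", "Timestamp"]
MOVIE_FIELDS = ["MovieID", "Title", "Genres"]
USER_FIELDS = ["UserID", "Gender", "Age", "Occupation", "Zip-code"]

# Illustrative lines in the three formats
rating = parse_line("1::1193::5::978300760", RATING_FIELDS)
movie = parse_line("1::Toy Story (1995)::Animation|Children's|Comedy", MOVIE_FIELDS)
user = parse_line("1::F::1::10::48067", USER_FIELDS)

print(rating["Rating"])            # 5
print(movie["Genres"].split("|"))  # ['Animation', "Children's", 'Comedy']
print(user["Occupation"])          # 10
```

Note that every value comes back as a string, and that a movie can carry several genres separated by '|'.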

I’ll create 3 different models.

1. The simplest libFM files, trained without blocks. I’ll use these features: UserID, MovieID
2. Regular libFM files, trained without blocks. I’ll use these features: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)
3. libFM files trained with blocks. I’ll use the same features: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)

Models 1 and 2 can be created using the following code:

# -*- coding: utf-8 -*-
__author__ = 'Silbermann Thierry'
__license__ = 'WTFPL'

import pandas as pd


def create_libfm(w_filename, model_lvl=1):
    # Load the data ('::' is a multi-character separator, hence engine='python')
    data_ratings = pd.read_csv('ratings.dat', sep='::', engine='python',
                               names=['UserID', 'MovieID', 'Ratings', 'Timestamp'])
    # Some movie titles are not valid UTF-8, so read this file as latin-1
    data_movies = pd.read_csv('movies.dat', sep='::', engine='python',
                              encoding='latin-1',
                              names=['MovieID', 'Name', 'Genre_list'])
    data_users = pd.read_csv('users.dat', sep='::', engine='python',
                             names=['UserID', 'Gender', 'Age', 'Occupation', 'ZipCode'])

    # Transform data: attach the user and movie attributes to each rating,
    # so that row i of every column describes rating i
    data = data_ratings.drop(columns=['Timestamp'])
    data = data.merge(data_users.drop(columns=['ZipCode']), on='UserID', how='left')
    data = data.merge(data_movies.drop(columns=['Name']), on='MovieID', how='left')
    print('Data loaded')

    # Map the data: every feature gets its own contiguous range of ids
    feat = ['UserID', 'MovieID']
    if model_lvl > 1:
        # Genre_list (e.g. 'Action|Comedy') is treated as one categorical value
        feat.extend(['Gender', 'Age', 'Occupation', 'Genre_list'])
    offset = 0
    dict_array = []
    for feature_name in feat:
        uniq = sorted(data[feature_name].unique())
        dict_array.append({key: index + offset for index, key in enumerate(uniq)})
        offset += len(uniq)
    print('Mapping done')

    # Create libFM file: one line per rating, '<rating> <id>:1 <id>:1 ...'
    columns = [data[name].map(mapping) for name, mapping in zip(feat, dict_array)]
    with open(w_filename, 'w') as w:
        for row in zip(data['Ratings'], *columns):
            features = ' '.join('{0}:1'.format(feat_id) for feat_id in row[1:])
            w.write('{0} {1}\n'.format(row[0], features))


if __name__ == '__main__':
    create_libfm('model1.libfm', 1)
    create_libfm('model2.libfm', 2)

So you end up with two files, model1.libfm and model2.libfm. You just need to split each of them in two to create a training set file and a test set file, which I’ll call train_m1.libfm and test_m1.libfm (same thing for model 2: train_m2.libfm and test_m2.libfm)
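One possible way to do the split, as a minimal sketch. The 80/20 ratio, the fixed seed and the helper names split_lines and split_file are my own assumptions, not something the files require.

```python
import random


def split_lines(lines, test_fraction=0.2, seed=0):
    """Shuffle `lines` and split them into (train, test) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_file(filename, train_filename, test_filename, **kwargs):
    """Split a libFM text file into a training file and a test file."""
    with open(filename) as f:
        train, test = split_lines(f.readlines(), **kwargs)
    with open(train_filename, 'w') as f:
        f.writelines(train)
    with open(test_filename, 'w') as f:
        f.writelines(test)


# Small in-memory check of the split sizes
train, test = split_lines(['line {0}\n'.format(i) for i in range(10)])
print(len(train), len(test))  # 8 2

# Example: split_file('model1.libfm', 'train_m1.libfm', 'test_m1.libfm')
```

Using the same seed for model1.libfm and model2.libfm keeps the same ratings in the test set of both models, which makes their scores comparable.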

Then you just run libFM like this:

But I guess you already know how to do that.

Now the interesting part, using blocks.

[TODO]