Original post: https://thierrysilbermann.wordpress.com/2015/09/17/deal-with-relational-data-using-libfm-with-blocks/
An answer to this question: [Example] Files for Block Structure
There is a quick explanation in the README doc here: libFM1.42 Manual
Quick explanation in case you don’t want to read this whole blog post.
I’ll reuse the toy dataset from this previous blog post. Look at it to see what the features mean.
train.libfm
and test.libfm
And I’ll merge them, which makes the whole process easier:
dataset.libfm
So, if we want to use the block structure, we first need these 5 files:
- rel_user.libfm (features 0, 1 and 6-8 are user features)
In fact, the feature ids do not have to stay split like that (0-1, 6-8); we can renumber them densely (0-1 -> 0-1 and 6-8 -> 2-4)
- rel_product.libfm (features 2-5 and 9 are product features). In the same way, we can compress from:
to
- rel_user.train (which is now the mapping; the first 3 lines correspond to the first line of rel_user.libfm | /!\ we are using 0-based indexing)
- rel_product.train (which is now the mapping)
- y.train (which contains only the ratings)
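As a rough illustration of how these files relate to each other, here is a minimal Python sketch. The helper names (`parse_libfm_line`, `build_block`, `block_to_libfm`) and the three sample rows are made up for this example and are not the toy dataset from the post. The dummy target of 0 in the block file is also an assumption. The sketch keeps each distinct user row once, renumbers its feature ids from 0, and records for every rating the 0-based index of its user row.

```python
def parse_libfm_line(line):
    """Split 'target id:value id:value ...' into (target, {id: value})."""
    target, *pairs = line.split()
    features = {}
    for pair in pairs:
        feature_id, value = pair.split(":")
        features[int(feature_id)] = value  # keep the value as text
    return target, features


def build_block(rows, block_ids):
    """Extract one block (e.g. the user features) from parsed rows.

    Returns (unique_rows, mapping): unique_rows holds each distinct
    combination once with ids renumbered from 0, and mapping[i] is the
    0-based index into unique_rows for rating i.
    """
    # Dense renumbering, e.g. 0,1,6,7,8 -> 0,1,2,3,4
    new_id = {old: new for new, old in enumerate(sorted(block_ids))}
    unique_rows, index_of, mapping = [], {}, []
    for row in rows:
        key = tuple(sorted((new_id[f], v) for f, v in row.items() if f in new_id))
        if key not in index_of:
            index_of[key] = len(unique_rows)
            unique_rows.append(key)
        mapping.append(index_of[key])
    return unique_rows, mapping


def block_to_libfm(unique_rows):
    """Format the unique rows as libFM lines with a dummy target of 0 (assumption)."""
    return ["0 " + " ".join(f"{i}:{v}" for i, v in row) for row in unique_rows]


# Made-up rows for illustration only (not the toy dataset from the post).
dataset = [
    "5 0:1 2:1 6:1",
    "4 0:1 3:1 6:1",
    "3 1:1 2:1 7:1",
]
parsed = [parse_libfm_line(line) for line in dataset]
targets = [t for t, _ in parsed]                      # content of y.train
user_rows, user_mapping = build_block([f for _, f in parsed], {0, 1, 6, 7, 8})
rel_user_libfm = block_to_libfm(user_rows)            # content of rel_user.libfm

print(rel_user_libfm)   # one line per distinct user
print(user_mapping)     # content of rel_user.train, one index per rating
```

The same `build_block` call with the product feature ids would give the product block and its mapping.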
Almost done…
Now you need to create the .x and .xt files for the user block and the product block. For this you need the scripts that ship with libFM, available in /bin/ after you compile it.
You are forced to use the --ofiley flag even though rel_user.y will never be used. You can delete it every time.
and then
Now you can do the same thing for the test set. Because we merged the train and test datasets at the beginning, we only need to generate rel_user.test, rel_product.test and y.test.
At this point, you will have a lot of files: (rel_user.train, rel_user.test, rel_user.x, rel_user.xt, rel_product.train, rel_product.test, rel_product.x, rel_product.xt, y.train, y.test)
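The conversion step for both blocks can be scripted. The sketch below only builds and prints the command lines. It assumes the libFM tools are named `convert` and `transpose`, that they live in `./bin`, and that they take the `--ifile`, `--ofilex`, `--ofiley` and `--ofile` flags as I remember them from the libFM manual; check these against your own build. The function name `block_commands` is hypothetical.

```python
def block_commands(block, bin_dir="./bin"):
    """Return the (convert, transpose) command lines for one block.

    Tool names and flags are assumptions taken from memory of the libFM
    manual; verify them against your own libFM build.
    """
    convert = [f"{bin_dir}/convert", "--ifile", f"{block}.libfm",
               "--ofilex", f"{block}.x", "--ofiley", f"{block}.y"]
    transpose = [f"{bin_dir}/transpose", "--ifile", f"{block}.x",
                 "--ofile", f"{block}.xt"]
    return convert, transpose


for block in ("rel_user", "rel_product"):
    for cmd in block_commands(block):
        # To actually run a command: subprocess.run(cmd, check=True)
        print(" ".join(cmd))
```

The `.y` file written by the first command is the one that can be deleted afterwards.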
And run the whole thing:
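A hedged sketch of what the call could look like when launched from Python. The flag names (`-task`, `-train`, `-test`, `--relation`, `-out`) are my recollection of the libFM command line, and the binary path, output name and the helper `libfm_command` are assumptions for this example; confirm the flags against the libFM manual before relying on them.

```python
def libfm_command(blocks, libfm="./bin/libFM", out="output"):
    """Build a libFM regression run that uses block structure.

    Flag names are assumptions based on memory of the libFM manual;
    y.train and y.test are the target-only files described above.
    """
    return [libfm, "-task", "r",
            "-train", "y.train", "-test", "y.test",
            "--relation", ",".join(blocks),
            "-out", out]


cmd = libfm_command(["rel_user", "rel_product"])
# To actually run it: subprocess.run(cmd, check=True)
print(" ".join(cmd))
```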
It’s a bit overkill for this problem but I hope you get the point.
Now a real example
For this example, I’ll use the ml-1m.zip MovieLens dataset that you can get from here (1 million ratings)
ratings.dat (sample) / Format: UserID::MovieID::Rating::Timestamp
movies.dat (sample) / Format: MovieID::Title::Genres
users.dat (sample) / Format: UserID::Gender::Age::Occupation::Zip-code
I’ll create 3 different models.
- Easiest libFM files to train without block. I’ll use those features: UserID, MovieID
- Regular libFM files to train without block. I’ll use those features: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)
- libFM files to train with block. I’ll also use those features: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)
Model 1 and 2 can be created using the following code:
```python
# -*- coding: utf-8 -*-
__author__ = 'Silbermann Thierry'
__license__ = 'WTFPL'

import pandas as pd


def create_libfm(w_filename, model_lvl=1):
    # Load the data. '::' is a multi-character separator, so the python
    # engine is needed; latin-1 avoids decode errors on non-ASCII titles.
    data_ratings = pd.read_csv('ratings.dat', sep='::', engine='python', encoding='latin-1',
                               names=['UserID', 'MovieID', 'Ratings', 'Timestamp'])
    data_movies = pd.read_csv('movies.dat', sep='::', engine='python', encoding='latin-1',
                              names=['MovieID', 'Name', 'Genre_list'])
    data_users = pd.read_csv('users.dat', sep='::', engine='python', encoding='latin-1',
                             names=['UserID', 'Gender', 'Age', 'Occupation', 'ZipCode'])

    # Transform data: attach the user and movie attributes to every rating
    # row, so that each feature can be looked up per rating.
    data = data_ratings.drop(['Timestamp'], axis=1)
    data = data.merge(data_users.drop(['ZipCode'], axis=1), on='UserID', how='left')
    data = data.merge(data_movies.drop(['Name'], axis=1), on='MovieID', how='left')
    print('Data loaded')

    # Map the data: each feature gets its own contiguous range of column ids.
    # Genre_list is treated as one categorical value per distinct genre
    # combination (e.g. 'Action|Comedy'), not as one column per genre.
    feat = ['UserID', 'MovieID']
    if model_lvl > 1:
        feat.extend(['Gender', 'Age', 'Occupation', 'Genre_list'])

    offset = 0
    dict_array = []
    for feature_name in feat:
        uniq = sorted(data[feature_name].unique())
        dict_array.append({key: idx + offset for idx, key in enumerate(uniq)})
        offset += len(uniq)  # the offset is applied once, here
    print('Mapping done')

    # Create libFM file: "rating id:1 id:1 ..."
    with open(w_filename, 'w') as w:
        for row in data.itertuples(index=False):
            s = '{0}'.format(row.Ratings)
            for index_feat, feature_name in enumerate(feat):
                s += ' {0}:1'.format(dict_array[index_feat][getattr(row, feature_name)])
            w.write(s + '\n')


if __name__ == '__main__':
    create_libfm('model1.libfm', 1)
    create_libfm('model2.libfm', 2)
```
So you end up with two files, model1.libfm and model2.libfm. You just need to split each of them in two to create a training set and a test set, which I’ll call train_m1.libfm and test_m1.libfm (same thing for model 2: train_m2.libfm and test_m2.libfm).
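One way to do the split is sketched below. The post does not say how the files were split, so the random 80/20 split, the fixed seed and the function names `split_lines` and `split_libfm_file` are all assumptions for this example.

```python
import random


def split_lines(lines, test_fraction=0.2, seed=42):
    """Shuffle the lines and return (train_lines, test_lines)."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_libfm_file(path, train_path, test_path, test_fraction=0.2, seed=42):
    """Split one libFM file into a training file and a test file."""
    with open(path) as f:
        lines = f.readlines()
    train, test = split_lines(lines, test_fraction, seed)
    with open(train_path, 'w') as f:
        f.writelines(train)
    with open(test_path, 'w') as f:
        f.writelines(test)


# Usage, once model1.libfm and model2.libfm exist:
# split_libfm_file('model1.libfm', 'train_m1.libfm', 'test_m1.libfm')
# split_libfm_file('model2.libfm', 'train_m2.libfm', 'test_m2.libfm')
```

Using the same seed for both files keeps the same ratings in the test set of model 1 and model 2, since both files have one line per rating in the same order.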
Then you just run libFM like this:
But I guess you already know how to do those.
Now the interesting part, using blocks.
[TODO]