Welcome to Pyflix
Pyflix is a small package written in Python that provides an easy entry point for getting up and running in the Netflix Prize competition. It combines an efficient storage scheme with an intuitive high-level API that allows contestants to focus on the real problem, the recommendation system algorithm. To get started with Pyflix, keep reading.
Prerequisites
- Python 2.4 or higher.
- Numpy 1.0. You might be able to use the older Numarray or Numeric packages instead of Numpy with few changes, but only Numpy is currently supported. If you plan to do heavy math computations (vector/matrix operations, linear equation systems, least squares, eigenvalues, etc.), Numpy can be built to use the highly optimized and robust BLAS, LAPACK and ATLAS libraries; for more details, consult the Numpy documentation.
Download
- Get the latest Pyflix release from the list below and uncompress the archive file.
- Alternatively, you may check out the latest project source code from the Subversion repository.
Once you're done, change to the pyflix-version directory.
Build the indexed binaries
The first step is to compress the original text data into a set of indexed binary files. Assuming you have uncompressed Netflix's archive and the included training_set.tar under path/to/download/, run in the command shell:
python pyflix/setup.py -i path/to/download/ -o database
This will create binaries for the training and probe datasets under a directory named database. By default, the datasets are indexed by both movie and user id; if you don't need both, you may specify --no-movie or --no-user. This cuts down the build time and (to some extent) the memory requirements, at the expense of some functionality.
By default, the probe set entries are excluded from the training set; if you want to include them, add the --no-scrub option. Finally, if you'd like to create a binary version of the qualifying set as well, add -q or --qualifying. You can list all options by adding -h or --help.
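The default scrubbing step can be pictured as a set difference between the training entries and the probe entries. The sketch below is not Pyflix's implementation. It assumes probe.txt follows the layout of the Netflix download (a "movie_id:" header line followed by one user id per line), and the names parse_probe and scrub, as well as the sample ids, are made up for illustration.

```python
def parse_probe(lines):
    """Collect (movie_id, user_id) pairs from probe.txt-style lines.

    Assumed layout: a 'movie_id:' header line, then one user id per line.
    """
    pairs = set()
    movie_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(':'):
            # Header line: switch to a new movie.
            movie_id = int(line[:-1])
        else:
            pairs.add((movie_id, int(line)))
    return pairs


def scrub(training, probe_pairs):
    """Drop every (movie_id, user_id, rating) triple whose pair is in the probe set."""
    return [t for t in training if (t[0], t[1]) not in probe_pairs]


# Made-up ids, for illustration only.
probe = parse_probe(["1:", "101", "102", "10:", "103"])
training = [(1, 101, 4), (1, 104, 3), (10, 103, 3), (10, 105, 3)]
scrubbed = scrub(training, probe)
```

With --no-scrub, the equivalent of this filtering would simply be skipped, so the probe pairs stay in the training data.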
The procedure will take roughly between one and two hours on a modern system for the default options. Diagnostic messages and progress bars will show the progress of each stage. If everything goes well, the final output should look similar to this:
$ python pyflix/setup.py -i ../download/ -o ../bin -q
* Reading ../download/probe.txt...
[********************************************************************] 100%
* Building rating files...
[********************************************************************] 100%
4544779 training set entries rated 1
9995998 training set entries rated 2
28458811 training set entries rated 3
33288865 training set entries rated 4
22783659 training set entries rated 5
73211 probe set entries rated 1
136082 probe set entries rated 2
352436 probe set entries rated 3
462093 probe set entries rated 4
384573 probe set entries rated 5
Split files in 21 minutes and 57.0 seconds
* Sorting ../bin/training_set/1.tmp, ../bin/training_set/2.tmp, ../bin/training_set/3.tmp, ../bin/training_set/4.tmp, ../bin/training_set/5.tmp...
Sorted files in 4 minutes and 35.5 seconds
* Merging sorted files from ../bin/training_set/sorted_by_movie...
[********************************************************************] 100%
Merged movies in 22 minutes and 18.1 seconds
* Merging sorted files from ../bin/training_set/sorted_by_user ...
[********************************************************************] 100%
Merged users in 23 minutes and 51.0 seconds
* Sorting ../bin/probe_set/1.tmp, ../bin/probe_set/2.tmp, ../bin/probe_set/3.tmp, ../bin/probe_set/4.tmp, ../bin/probe_set/5.tmp...
Sorted files in 0 minutes and 2.0 seconds
* Merging sorted files from ../bin/probe_set/sorted_by_movie ...
[********************************************************************] 100%
Merged movies in 0 minutes and 21.6 seconds
* Merging sorted files from ../bin/probe_set/sorted_by_user ...
[********************************************************************] 100%
Merged users in 1 minutes and 45.8 seconds
* Building ../bin/qualifying_set/all.tmp...
[********************************************************************] 100%
2817131 qualifying set entries
* Sorting ../bin/qualifying_set/all.tmp...
Sorted files in 0 minutes and 4.2 seconds
* Merging sorted files from ../bin/qualifying_set/sorted_by_movie ...
[********************************************************************] 100%
Merged movies in 0 minutes and 39.1 seconds
* Merging sorted files from ../bin/qualifying_set/sorted_by_user ...
[********************************************************************] 100%
Merged users in 1 minutes and 39.3 seconds
* Completed successfully in 78 minutes and 12.5 seconds
The output directory will have one subdirectory for each dataset (training_set, probe_set and, optionally, qualifying_set). The files movies.dat and movies.idx hold the data indexed by movie; similarly, users.dat and users.idx hold the data indexed by user. The remaining files (with extension .tmp) are temporary and can be deleted once setup has finished.
Basic usage
Pyflix comes with full documentation under docs/api/index.html, created with the excellent Epydoc package. Below is a sample interactive session to demonstrate the dataset API:
Python 2.4.2 (#1, Mar 8 2006, 13:24:00)
[GCC 3.4.4 20050721 (Red Hat 3.4.4-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> # the main module you'll need is pyflix.datasets
... from pyflix.datasets import RatedDataset
>>> # open the training set
... tr = RatedDataset('../bin/training_set')
>>> # print the number of movies and users, respectively
... len(tr.movieIDs()), len(tr.userIDs())
(17770, 480189)
>>>
>>> # looking into a single movie
... m = tr.movie(11148) # get a RatedRecord instance for the movie #11148
>>> m.values() # get the IDs of the users that rated it
array([1470697, 2104443, 2574924, 934125, 1364974], dtype=uint32)
>>> tr.userIDs(11148) # another way to get them from the dataset
array([1470697, 2104443, 2574924, 934125, 1364974], dtype=uint32)
>>> m.ratings() # get the ratings of the users that rated it
array([3, 4, 4, 5, 5], dtype=uint8)
>>> m.counts() # get the #ratings for each rating in (1,2,3,4,5)
array([0, 0, 1, 2, 2])
>>> list(m.iterValueRatings()) # get an iterator over (user_id,rating) tuples
[(1470697, 3), (2104443, 4), (2574924, 4), (934125, 5), (1364974, 5)]
>>>
>>> # looking into a single user
... u = tr.user(2446680) # get a RatedRecord instance for the user #2446680
>>> u.values() # get the IDs of the movies he/she rated
array([2843, 6850, 9458, 3282], dtype=uint16)
>>> tr.movieIDs(2446680) # another way to get them from the dataset
array([2843, 6850, 9458, 3282], dtype=uint16)
>>> u.ratings() # get his/her ratings on these movies
array([1, 2, 2, 3], dtype=uint8)
>>> u.counts() # get the #ratings for each rating in (1,2,3,4,5)
array([1, 2, 1, 0, 0], dtype=uint16)
>>> list(u.iterValueRatings()) # get an iterator over (movie_id,rating) tuples
[(2843, 1), (6850, 2), (9458, 2), (3282, 3)]
>>>
>>> # looking into two or more movies
... m1,m2 = 5,11
>>> tr.userIDs(m1,m2) # get the IDs of the users that rated both movies
array([ 305344, 387418, 727242, 1664010, 2439493], dtype=uint32)
>>> tr.ratingsMatrixTo(m1,m2) # get their ratings for these movies as a 2D array
array([[1, 1],
       [1, 1],
       [1, 1],
       [5, 3],
       [1, 1]], dtype=uint8)
>>> # get an iterator over (user_id,ratings) tuples
... for u,r in tr.iterUserIDsRatings(m1,m2): print (u,r)
...
(305344, array([1, 1], dtype=uint8))
(387418, array([1, 1], dtype=uint8))
(727242, array([1, 1], dtype=uint8))
(1664010, array([5, 3], dtype=uint8))
(2439493, array([1, 1], dtype=uint8))
>>> # looking into two or more users
... u1,u2 = 1048577,2097163
>>> tr.movieIDs(u1,u2) # get the IDs of the movies rated by both users
array([ 5496, 12317, 15205, 16384], dtype=uint16)
>>> tr.ratingsMatrixBy(u1,u2) # get their ratings for these movies as a 2D array
array([[3, 4],
       [5, 2],
       [5, 3],
       [5, 2]], dtype=uint8)
>>> # get an iterator over (movie_id,ratings) tuples
... for m,r in tr.iterMovieIDsRatings(u1,u2): print (m,r)
...
(5496, array([3, 4], dtype=uint8))
(12317, array([5, 2], dtype=uint8))
(15205, array([5, 3], dtype=uint8))
(16384, array([5, 2], dtype=uint8))
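As a small illustration of how the values returned by counts() and ratings() relate, the sketch below derives a mean rating from the five-bin histogram and checks it against the individual ratings. It uses plain Python lists holding the values shown for movie 11148 in the session above, so it runs without Pyflix or the dataset; the helper name mean_from_counts is hypothetical and not part of the Pyflix API.

```python
def mean_from_counts(counts):
    """Mean rating from a 5-element histogram of ratings 1..5."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no ratings to average")
    # Weight each rating value (1..5) by how often it occurs.
    weighted = sum(rating * n for rating, n in zip(range(1, 6), counts))
    return float(weighted) / total


movie_counts = [0, 0, 1, 2, 2]   # values of m.counts() in the session
movie_ratings = [3, 4, 4, 5, 5]  # values of m.ratings() in the session

mean_rating = mean_from_counts(movie_counts)
same_mean = float(sum(movie_ratings)) / len(movie_ratings)
```

Working from the histogram avoids touching the individual ratings, which may matter for movies with very many ratings.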
Writing a recommendation algorithm
The typical way to write a recommendation algorithm is to extend the Algorithm class of the pyflix.algorithms module. You create an Algorithm instance by passing a RatedDataset argument as the training set. Typically you train your algorithm on the training set and store your trained model as instance attribute(s) of the Algorithm instance. You then have to override the __call__(movie_id, user_id) method; this takes a movie id and a user id and should return the predicted rating for the given movie/user pair. So the general template of an Algorithm is:
from pyflix.algorithms import Algorithm

class MyFancyAlgorithm(Algorithm):

    def __init__(self, training_set):
        super(MyFancyAlgorithm,self).__init__(training_set)
        # train a model based on training_set here

    def __call__(self, movie_id, user_id):
        # use the trained model to predict the rating for the given movie/user pair here
        raise NotImplementedError
If you put your algorithm in a module under the algorithms package (e.g. pyflix/algorithms/myalg.py), you can test it on the probe set and get its RMSE by running:
python pyflix/algorithms/run.py myalg.MyFancyAlgorithm path/to/database
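For reference, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual ratings. The sketch below shows that standard definition on a handful of values; how run.py computes it internally is not shown on this page, and the function name rmse is an assumption made for the example.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    if not predicted:
        raise ValueError("need at least one rating")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / float(len(predicted)))


# Illustrative values: five actual ratings, all predicted as their mean.
actual = [3, 4, 4, 5, 5]
predicted = [4.2] * len(actual)
score = rmse(predicted, actual)
```

A lower score is better; a perfect predictor would score 0.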
When you are happy with the result and want to run the algorithm on the qualifying set and create the predictions file, call the writePredictions method of Algorithm:
from pyflix.datasets import RatedDataset
from pyflix.algorithms.myalg import MyFancyAlgorithm
a = MyFancyAlgorithm(RatedDataset('path/to/training_set/'))
a.writePredictions('path/to/download/qualifying.txt', 'myalg_predictions.txt')
You can find three simple averaging algorithms in the pyflix.algorithms.average module.
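To give a feel for what such an averaging algorithm might look like, here is a minimal sketch of a per-movie average predictor. It is not the code shipped in pyflix.algorithms.average. So that it runs without Pyflix installed, it uses two hypothetical stand-in classes (StubDataset and StubRecord) that mimic only the movie() and ratings() calls from the session above, and MovieAverage is written as a plain class; in a real module it would presumably subclass Algorithm as in the template.

```python
class StubRecord(object):
    """Hypothetical stand-in for a rated record; only ratings() is modeled."""

    def __init__(self, ratings):
        self._ratings = list(ratings)

    def ratings(self):
        return self._ratings


class StubDataset(object):
    """Hypothetical stand-in for RatedDataset; only movie() is modeled."""

    def __init__(self, ratings_by_movie):
        self._by_movie = ratings_by_movie

    def movie(self, movie_id):
        return StubRecord(self._by_movie.get(movie_id, []))


class MovieAverage(object):
    """Predicts each movie's mean training rating, ignoring the user."""

    def __init__(self, training_set, default=3.0):
        self._training_set = training_set
        self._default = default  # fallback for movies without ratings
        self._cache = {}         # movie_id -> mean rating, filled lazily

    def __call__(self, movie_id, user_id):
        if movie_id not in self._cache:
            ratings = self._training_set.movie(movie_id).ratings()
            if len(ratings):
                self._cache[movie_id] = float(sum(ratings)) / len(ratings)
            else:
                self._cache[movie_id] = self._default
        return self._cache[movie_id]


# Ratings for movie 11148 as shown in the interactive session.
training = StubDataset({11148: [3, 4, 4, 5, 5]})
predict = MovieAverage(training)
prediction = predict(11148, 1470697)
```

Caching the mean per movie keeps repeated predictions for the same movie cheap, which helps when every probe or qualifying entry triggers a call.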
