The CBRecSys 2014 workshop aims to address this by providing a dedicated venue for papers on all aspects of content-based recommendation. To facilitate exploration of such aspects, the workshop will feature an in-workshop challenge on book recommendation. For this challenge, a large dataset will be made available, containing user profiles with book ratings and tags, as well as 2.8 million book descriptions with library metadata, user ratings, tags, and reviews from Amazon and LibraryThing. The rich textual nature of the task makes the challenge an excellent venue for revisiting questions about the relative benefits of content-based vs. collaborative filtering and of metadata vs. ratings information.

 

Task

The book recommendation challenge focuses on recommending new, interesting books to LibraryThing users based on usage data (which books they have added to their collection) and content-based information about the books available on LibraryThing.

 

Data set

The data set for the book recommendation challenge consists of two parts. To obtain the data set, please follow the instructions in the 'Obtaining the data set' section below.

 

Usage data

The first part of our challenge data set for book recommendation is a log of usage data: who added which books to their collection at what point in time? This usage data serves as the main data source for evaluating our challenge. See the 'Evaluation' section on this page for more information.

The usage data contains the following five columns of information, separated by a tab character ('\t'):

  • User ID
    An anonymized LibraryThing user name. Example: "u0000991"
  • Book ID
    A so-called LibraryThing work ID, which maps a book to one or more ISBNs, each representing a different edition of the work. A list of work IDs mapped to different ISBNs is included in the data set. Example: "77850"
  • Timestamp
    Year and month. Example: "2008-10"
  • Rating
    Rating assigned to that book by the user, if any. LibraryThing users can assign star ratings in steps of half a star, from 0.5 to 5.0 stars; the interface does not allow a rating of exactly 0 stars. In the data set, we have converted these ratings to whole numbers by multiplying them by 2, so the possible ratings range from 1 to 10, with 1 being the lowest rating and 10 the highest. A rating of 0 indicates that the user has not rated that book yet (even though they have added it to their collection).
  • Tag(s)
    Tag(s) assigned to that book by the user, if any. Multiple tags are separated by commas. Example: "Location- Cedar Chest, Hardcover, Paranormal-Magic-Fantasy, Romance"

All LibraryThing users in the data set have at least 20 books in their profile. The data set has not been filtered in any other way.

The data set has been anonymized in the following ways:

  • User names have been anonymized.
  • Timestamps have been reduced in granularity to year and month instead of more specific timestamps. Books are consumed over a longer period of time than music or movies, so this level of granularity should still be precise enough to capture shifts in user interest, while making it harder to identify individual users.
  • A small amount of random noise was introduced into the data set.

Please consult the included README file for more information.
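
As an illustration of how the usage data can be read, the following minimal Python sketch parses the five tab-separated columns described above into a list of records. The file name 'usage.tsv' and the exact formatting of the rating field are assumptions; the README mentioned above documents the authoritative layout.

    def load_usage(path="usage.tsv"):
        """Parse the tab-separated usage log into a list of dicts.

        Columns: user ID, work ID, timestamp (YYYY-MM), rating
        (0 = in collection but unrated, 1-10 otherwise), and a
        comma-separated tag string.
        """
        records = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) != 5:
                    continue  # skip any malformed lines
                user_id, work_id, month, rating, tags = fields
                records.append({
                    "user": user_id,
                    "work": work_id,
                    "month": month,
                    "rating": int(float(rating)) if rating else 0,
                    "tags": [t.strip() for t in tags.split(",")] if tags else [],
                })
        return records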

 

Book metadata

The second part of our challenge data set for book recommendation is a collection of metadata records for 2.8 million books cataloged on LibraryThing. This collection was crawled from Amazon and LibraryThing by the University of Duisburg-Essen in early 2009. From Amazon, there is formal metadata such as book title, author, publisher, publication year, library classification codes, Amazon categories, and similar product information, as well as user-generated content in the form of user ratings and reviews. From LibraryThing, there are user tags and user-provided metadata on awards, book characters and locations, and blurbs.

Please note that, while the metadata collection contains records for 2.8 million books, not all of these books occur in the filtered usage data set. A subset of 1,830,958 of these 2.8 million books has been added to at least one user's collection in our filtered data set. We provide a separate file in the data collection called 'list.all-book-IDs.txt' that lists all used book IDs.
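
If you want to restrict the metadata collection to the books that actually occur in the usage data, a minimal sketch along the following lines may help. It assumes that 'list.all-book-IDs.txt' contains one work ID per line; the included README documents the actual layout.

    def load_used_book_ids(path="list.all-book-IDs.txt"):
        """Return the set of work IDs that occur in the filtered usage data."""
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}

    # Example: keep only metadata records for books that appear in the usage data,
    # assuming `metadata` is a dict mapping work IDs to metadata records.
    # used_ids = load_used_book_ids()
    # usable_metadata = {wid: rec for wid, rec in metadata.items() if wid in used_ids}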

Please consult the included README file for more information.

 

Obtaining the data set

In order to have access to the data designated as the Amazon/LibraryThing Book Corpus, organizations that have not already done so must first fill in a data release Application Form. The signed form must be scanned and sent by email to data@list.uva.nl. On receipt of the form, you will be sent information on how to download the data.

Access to the data by an individual is to be controlled by that individual's organization. The organization may only grant access to people working under its control, i.e., its own members, consultants to the organization, or individuals providing a service to the organization. Each individual's application form must be signed by a person authorized by the organization to provide such signatures, and the organization must keep the forms of all individuals involved at its site.

 

Future use of data set

The only requirement for use of the challenge data is submitting a set of results to the book recommendation challenge at CBRecSys 2014. The data set may then be used in any future work, as long as its use is in accordance with the license agreement. We would appreciate a citation of the workshop overview paper if you use the data set in any other work.

 

Evaluation

The evaluation of the book recommendation challenge follows the familiar backtesting paradigm: for each user, a small number of randomly selected books is withheld, and the remaining data is used as training material. If a user's withheld items are predicted at the top of the ranked result list, i.e., if the algorithm is able to correctly predict the user's interest in those withheld items, then the algorithm is considered to perform well.
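
For participants who wish to reproduce this kind of backtesting on their own splits of the training material, a minimal per-user withholding sketch is given below. The number of withheld books per user and the record structure (as produced by the usage-log parser sketched earlier) are assumptions for illustration, not the official split procedure.

    import random
    from collections import defaultdict

    def backtest_split(records, n_withheld=5, seed=42):
        """Withhold a few randomly chosen books per user for testing.

        `records` is a list of dicts with at least 'user' and 'work' keys;
        everything not withheld goes into the training portion.
        """
        rng = random.Random(seed)
        by_user = defaultdict(list)
        for rec in records:
            by_user[rec["user"]].append(rec)

        train, test = [], []
        for user, recs in by_user.items():
            rng.shuffle(recs)
            test.extend(recs[:n_withheld])
            train.extend(recs[n_withheld:])
        return train, test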

To make evaluation easier for challenge participants and to reduce the possibility of overfitting, we have divided the usage data into a training and a validation set. The training material was released on May 12, 2014 and is meant to be used to train your recommendation algorithms. To obtain this training material, please follow the instructions outlined here. To aid in the learning and parameter optimization phase, we have created a 10-fold cross-validation split of the training material, resulting in 10 training and test sets, one for each of the 10 folds. We encourage (but do not require) all participants to use these 10 folds to train their systems, so that the results of the training phase are directly comparable among participants. See the accompanying README file for more details.

The validation set was released on July 1, 2014 and its aim is to produce the final comparison of the different submitted approaches. To avoid overfitting, this validation set contains users that are not included in the original training material. To allow participants to train their systems on these new users as well, we have also released a small amount of extra training material for these users near the end of the challenge. To obtain this validation material, please follow the instructions outlined here. See the accompanying README file for more details. Participants are allowed to submit up to 6 runs. Please submit your runs via e-mail to toine@hum.aau.dk before 23:59 CET on July 21, 2014.

The main evaluation metric used in the challenge is the ranking-based metric NDCG@10, as ratings information is only sparsely available in the book recommendation data set. We will be using the tried-and-tested trec_eval program to calculate the NDCG@10 scores. trec_eval requires a results file in standard TREC format (= which books an algorithm has recommended for each user) and a so-called 'qrels' file based on the test set, containing the relevance judgments (= which books were actually read by each user).
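
As a concrete illustration of the qrels side, the sketch below writes relevance judgments for a set of held-out records in the standard four-column qrels format, treating every withheld book as equally relevant. The conversion script provided with the challenge (see below) remains authoritative, and the exact trec_eval option for NDCG@10 may differ between trec_eval versions.

    def write_qrels(test_records, path="fold.qrels"):
        """Write relevance judgments in the standard qrels format:
        <user ID> 0 <work ID> <relevance>."""
        with open(path, "w", encoding="utf-8") as out:
            for rec in test_records:
                out.write("{0} 0 {1} 1\n".format(rec["user"], rec["work"]))

    # With a run file (see the sketch further below) and this qrels file in
    # place, NDCG@10 can then be computed with something like:
    #     trec_eval -m ndcg_cut.10 fold.qrels my.run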

Participants are asked to submit their runs in standard TREC format. To make evaluation using the training material easier, we provide a Python script here that converts the test files for each fold into the qrels format required by trec_eval.
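
For reference, a run file in standard TREC format has one line per recommended book, with the user ID in the query slot and the work ID in the document slot: <user ID> Q0 <work ID> <rank> <score> <run tag>. The minimal sketch below writes such a file from per-user ranked recommendation lists; the file name and run tag are placeholders.

    def write_run(recommendations, path="my.run", tag="myrun"):
        """Write ranked recommendations in the standard TREC run format:
        <user ID> Q0 <work ID> <rank> <score> <run tag>.

        `recommendations` maps each user ID to a list of (work ID, score)
        pairs, sorted from most to least relevant.
        """
        with open(path, "w", encoding="utf-8") as out:
            for user, ranked in recommendations.items():
                for rank, (work, score) in enumerate(ranked, start=1):
                    out.write("{0} Q0 {1} {2} {3:.4f} {4}\n".format(
                        user, work, rank, score, tag))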

Please e-mail toine@hum.aau.dk if you have any questions about the data or the evaluation!