[Qa1]: Should it be a summary for the expert panel, or should it be a brief description of implementation details, hyperparameter values etc. to enable others to implement the algorithms described in the paper?
[Aa1]: It should be a brief summary of the algorithm for the expert panel. The idea is to have some sort of personal, subjective evaluation with respect to non-quantifiable criteria like novelty, originality, etc.
[Qd1]: Will the additional dataset used for deciding the winners of the challenge be a ‘normal’ week for task one, or a special week with certain characteristics?
[Ad1]: The additional dataset will be similar to the training datasets for each track, i.e. for Christmas, it is identical (in structure) to the evaluation set provided, with the only exception that it contains a smaller set of users.
[Qd2]: When will the additional dataset be published?
[Ad2]: It will not be published, the participants will be given a list of users to recommend a set of movies to for each track. They will be asked to return a list of recommended movies which we will evaluate. This will be around the time of the camera-ready deadline.
[Qd3]: Is it allowed to use the xmas ratings for the oscar predictions?
[Ad3]: Yes. But it is not allowed to use any “future” information i.e. when predicting movies for xmas, tune your algorithms to not use any ratings made after the first date in the xmas-eval sets. We are aware of the fact that the training sets include future data, at least when it comes to the xmas data, but we trust you will be honest.
[Qd4]: Do we also need to remove the training data after the Oscar week in task1-oscar?
[Qd5]: Is it allowed to use information, e.g. from Wikipedia, about the Oscar nominations in 2010?
[Ad5]: No, we know that this might sound harsh, but in order to know that we’re comparing apples to apples we are imposing this limit. The idea is to find the context, not to scrape extra meta-data off of IMDb/Wikipedia.
[Qd6]: There are some inconsistencies in the datasets, e.g. user-movie pairs that appear with multiple ratings. Is this deliberate?
[Ad6]: No, it is not deliberate. This is the kind of noise you find in real-life data. The inconsistencies are probably due to improper use of the APIs of MP and FT.
[Qd7]: The password does not seem to work.
[Ad7]: It does, try again. If it fails, try to download the file again. The files have been tested on Win/Linux/OSX/Unix and do work.
[Qd8]: In the filmtipset dataset for task1 (oscar), the ratings in the test data were made between Feb 27 and March 7; Feb 27/28 do actually not belong to week 9. Shall we just ignore those?
[Ad8]: These ratings should be regarded as true positives, and thus the evaluation we perform will use the same dates as those found in the evaluation sets distributed to the participants. Note that this is for the Filmtipset week only.
[Qd9]: What day does the week start on?
[Ad9]: Monday.
[Qd10]: The ratings in task 2 are from the end of 2009 and the beginning of 2010. Do we need to remove all training data from December 2009 on, or does the “no future data” rule not hold in task 2?
[Ad10]: Yes, for the Xmas week you should only train your recommender on ratings made prior to the cut-off date for Xmas, i.e. prior to calendar week 52.
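As a minimal sketch of that cut-off rule (the record layout below is assumed for illustration, not taken from the actual challenge files), the filter is a simple date comparison; ISO calendar week 52 of 2009 begins on Monday 2009-12-21:

```python
from datetime import date

# Hypothetical rating records: (user_id, movie_id, rating, rating_date);
# the real file layout in the challenge data may differ.
ratings = [
    (1, 101, 5, date(2009, 11, 3)),
    (1, 102, 4, date(2009, 12, 24)),  # falls inside week 52 -> excluded
    (2, 103, 3, date(2009, 12, 20)),
]

# ISO weeks start on Monday, so calendar week 52 of 2009 starts on
# 2009-12-21; only ratings made strictly before that date may be used.
CUTOFF = date(2009, 12, 21)

train = [r for r in ratings if r[3] < CUTOFF]  # keeps the 1st and 3rd rating
```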
[Qd15]: The mood keywords seem to be additional keywords – so is the task here to come up with item recommendations that make use of this data?
[Ad15]: The idea is to recommend movies for users found in the testset, the evaluation should be done on movies labeled with the mood id 16.
[Qd16]: Are you going to release the datasets after the workshop?
[Ad16]: The datasets are released exclusively for the challenge.
[Qd17]: Is it compulsory to send a paper to the workshop in order to use the data?
[Ad17]: Yes, we expect the teams to contribute a paper to the workshop. If you do not contribute a paper to the workshop, you give up the right to use the datasets in any future work. Please see the Dataset page. The paper can be 4 to 8 pages.
[Qd18]: The Moviepilot Mood evaluation dataset contains movies tagged with ids other than 16?
[Ad18]: Yes, please use only the movies tagged with id 16 for the evaluation in your papers; you should however use the rest of the file (with the corresponding mood ids) in order to avoid overfitting.
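As a minimal sketch (the column layout here is assumed for illustration, not taken from the actual files), restricting the evaluation to mood 16 is a simple filter:

```python
# Hypothetical evaluation rows: (user_id, movie_id, mood_id);
# the real file may use a different layout.
eval_rows = [
    (1, 100, 16),
    (1, 101, 4),   # other mood id: useful for tuning, excluded from evaluation
    (2, 102, 16),
]

MOOD_ID = 16  # the mood the challenge evaluation is restricted to
eval_pairs = [(u, m) for (u, m, mood) in eval_rows if mood == MOOD_ID]
```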
[Qe1]: Will we send our predictions to the organizers, which will then perform the evaluation?
[Qe2]: Which measure of the given 4 will count for the decision?
[Ae2]: This will be published after submission. Since the objective is to create an algorithm that performs well, optimising it for one specific measure is not the way to go.
[Qe3]: What is the exact evaluation protocol? Do we have to rank the items in the test set, or do we have to rank all items and thus find out which ones are actually in the test set?
[Ae3]: The idea is to find the true positives that appear in the evaluation sets. How you do that is up to you. We left the ratings in the evaluation sets so that you can use that additional information while configuring your algorithms; something rated highly, 5 (ft) or 10 (mp), should perhaps be recommended with a higher probability than something rated 3 (ft) or 7 (mp).
[Qe4]: Why not use XXX as the evaluation metric instead?
[Ae4]: There are as many evaluation metrics as there are ways to implement a recommender; we have settled for these metrics as they are commonly used and well known by the majority of RecSys people. You are welcome to use any other evaluation metric in your paper in addition to the ones specified.
[Qe5]: In a system rating movies from 1 to 5 like in the ft-social one, how do we compute P@N?
[Ae5]: Treat the ratings contained in the evaluation sets as true positives, i.e. discard the rating. The rating is provided to give you more details on how relevant items are, but will not be evaluated.
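A minimal sketch of P@N under this interpretation (function and variable names are illustrative, not part of any challenge code):

```python
def precision_at_n(recommended, relevant, n):
    """P@N: fraction of the top-n recommended items that appear among
    the true positives; the rating values themselves are discarded."""
    hits = sum(1 for item in recommended[:n] if item in relevant)
    return hits / n

# relevant = items the user rated in the evaluation set (ratings ignored)
relevant = {"m1", "m3", "m7"}
recommended = ["m1", "m2", "m3", "m4", "m5"]  # ranked list from a recommender
precision_at_n(recommended, relevant, 5)  # 2 hits in the top 5 -> 0.4
```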
[Qe6]: What’s with the movies not labeled with id 16 in the evaluation set for the Moviepilot Mood track?
[Ae6]: We consider these as false positives. For the evaluation results in your papers you should only consider movies tagged with mood 16; all other movies in that set count as false positives.
[Qe7]: How is the evaluation supposed to be performed?
[Ae7]: Rank the items for each individual user, compute a score for each user, and then average the scores over all users.
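A sketch of that per-user protocol, here using P@N as the per-user score (all names and data are illustrative assumptions):

```python
def precision_at_n(recommended, relevant, n):
    return sum(1 for m in recommended[:n] if m in relevant) / n

# Hypothetical per-user data: a ranked list per user, and the items
# each user actually has in the evaluation set
rankings = {"u1": ["m1", "m2", "m3"], "u2": ["m4", "m5", "m6"]}
relevant = {"u1": {"m1", "m3"}, "u2": {"m9"}}

# Score each user separately, then average over users (macro-average)
per_user = [precision_at_n(rankings[u], relevant[u], 3) for u in rankings]
overall = sum(per_user) / len(per_user)  # (2/3 + 0/3) / 2 = 1/3
```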
[Qp]: Do you know of any relevant publications?
[Ap]: Yes, please have a look at these papers:
- Herlocker2004, Evaluating collaborative filtering recommender systems
- Said2010, How Social Relationships Affect User Similarities