Built-in datasets

Amazon Clothing

This dataset is built from the Amazon datasets provided by Julian McAuley at http://jmcauley.ucsd.edu/data/amazon/. We ensure that every item has three types of auxiliary data: text, image, and context (items appearing together).

cornac.datasets.amazon_clothing.load_context(reader: cornac.data.reader.Reader = None) → List

Load the item-item interactions

Parameters:reader (obj:cornac.data.Reader, default: None) – Reader object used to read the data.
Returns:data – Data in the form of a list of tuples (item, item, 1).
Return type:array-like

cornac.datasets.amazon_clothing.load_image()

Load the item images in the form of visual features (extracted from a pre-trained CNN)

Returns:
  • features (numpy.ndarray) – Feature matrix with shape (n, 4096), where n is the number of items.
  • item_ids (List) – List of item ids aligned with indices in features.

cornac.datasets.amazon_clothing.load_rating(reader: cornac.data.reader.Reader = None) → List

Load the user-item ratings

Parameters:reader (obj:cornac.data.Reader, default: None) – Reader object used to read the data.
Returns:data – Data in the form of a list of tuples (user, item, rating).
Return type:array-like

cornac.datasets.amazon_clothing.load_text()

Load the item text descriptions

Returns:
  • texts (List) – List of text documents, one per item.
  • ids (List) – List of item ids aligned with indices in texts.
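
Because `ids` is aligned with the indices of `texts`, an id-to-index lookup is a convenient way to join text with ratings. The sketch below uses small stand-in values shaped like the documented return values rather than the real download; the sample ids and the helper `text_for` are illustrative assumptions. The same pattern should apply to `features` and `item_ids` from `load_image()`.

```python
# Stand-in values shaped like the documented return values. In practice they
# would come from cornac.datasets.amazon_clothing.load_text() and load_rating().
texts = ["red cotton dress", "leather boots", "wool scarf"]
ids = ["i1", "i2", "i3"]
ratings = [("u1", "i2", 4.0), ("u2", "i3", 5.0), ("u1", "i9", 3.0)]

# Map each item id to its position in `texts`.
index_of = {item_id: idx for idx, item_id in enumerate(ids)}


def text_for(item_id):
    """Return the text document for an item, or None if the id is unknown."""
    idx = index_of.get(item_id)
    return None if idx is None else texts[idx]


# Keep only the ratings whose item has a text document.
ratings_with_text = [(u, i, r) for (u, i, r) in ratings if i in index_of]
```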

Amazon Office

This data is built based on the Amazon datasets provided by Julian McAuley at: http://jmcauley.ucsd.edu/data/amazon/

cornac.datasets.amazon_office.load_context(reader: cornac.data.reader.Reader = None) → List

Load the item-item interactions

Parameters:reader (obj:cornac.data.Reader, default: None) – Reader object used to read the data.
Returns:data – Data in the form of a list of tuples (item, item, 1).
Return type:array-like

cornac.datasets.amazon_office.load_rating(reader: cornac.data.reader.Reader = None) → List

Load the user-item ratings

Parameters:reader (obj:cornac.data.Reader, default: None) – Reader object used to read the data.
Returns:data – Data in the form of a list of tuples (user, item, rating).
Return type:array-like
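
The (item, item, 1) tuples returned by `load_context` can be turned into a neighbor lookup. This is a minimal sketch on stand-in tuples; treating each pair as symmetric is an assumption made here for illustration and is not stated by the documentation.

```python
from collections import defaultdict

# Stand-in tuples shaped like the documented (item, item, 1) output of
# cornac.datasets.amazon_office.load_context().
context = [("i1", "i2", 1), ("i1", "i3", 1), ("i2", "i3", 1)]

# Build a neighbor set for every item.
neighbors = defaultdict(set)
for left, right, _ in context:
    neighbors[left].add(right)
    neighbors[right].add(left)  # assumption: the relation is symmetric
```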

CiteULike

This dataset is mostly from the paper ‘Collaborative topic modeling for recommending scientific articles’ [Wang and Blei - KDD 2011]. It was further collected, named citeulike-a, and used in the paper ‘Collaborative Topic Regression with Social Regularization’ [Wang, Chen and Li - IJCAI 2013].

Link to the data: http://www.wanghao.in/CDL.htm

cornac.datasets.citeulike.load_data(reader: cornac.data.reader.Reader = None) → List

Load the implicit feedback between users and items

Parameters:reader (obj:cornac.data.Reader, default: None) – Reader object used to read the data.
Returns:data – Data in the form of a list of tuples (user, item, 1).
Return type:array-like

cornac.datasets.citeulike.load_text()

Load item texts, with the title and abstract joined together into one document per item.

Returns:
  • texts (List) – List of text documents, one per item.
  • ids (List) – List of item ids aligned with indices in texts.
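
As a rough illustration of how the implicit feedback and the item texts fit together, the sketch below counts interactions per item and looks up the document of the most frequently saved item. All values are stand-ins shaped like the documented outputs, not real CiteULike records.

```python
from collections import Counter

# Stand-in values shaped like the outputs of
# cornac.datasets.citeulike.load_data() and load_text().
feedback = [("u1", "d1", 1), ("u1", "d2", 1), ("u2", "d2", 1), ("u3", "d2", 1)]
texts = ["Topic models for text", "Matrix factorization methods"]
ids = ["d1", "d2"]

# Number of users who interacted with each item.
popularity = Counter(item for _, item, _ in feedback)

# `ids` is aligned with `texts`, so zipping them gives an id -> document map.
doc_by_id = dict(zip(ids, texts))

most_common_id, count = popularity.most_common(1)[0]
most_common_doc = doc_by_id[most_common_id]
```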

Epinions

Link to the dataset: http://www.trustlet.org/downloaded_epinions.html

cornac.datasets.epinions.load_data(reader: cornac.data.reader.Reader = None) → List

Load the rating feedback

Parameters:reader (obj:cornac.data.Reader, default: None) – Reader object used to read the data.
Returns:data – Data in the form of a list of tuples (user, item, rating).
Return type:array-like

cornac.datasets.epinions.load_trust(reader: cornac.data.reader.Reader = None) → List

Load the trust data

Parameters:reader (obj:cornac.data.Reader, default: None) – Reader object used to read the data.
Returns:data – Data in the form of a list of tuples (user, user, trust value).
Return type:array-like
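
One way the two Epinions loaders might be combined is to average the ratings given by the users someone trusts. The sketch below assumes each trust tuple links two users, as (source user, target user, value); that layout, the sample data, and the helper `trusted_mean` are assumptions for illustration only.

```python
from collections import defaultdict

# Stand-in data. The (user, user, value) layout of `trust` is an assumption.
ratings = [("u1", "i1", 5.0), ("u2", "i1", 3.0), ("u3", "i1", 1.0)]
trust = [("u0", "u1", 1), ("u0", "u2", 1)]

# For every user, collect the set of users they trust.
trusted = defaultdict(set)
for source, target, _ in trust:
    trusted[source].add(target)


def trusted_mean(user, item):
    """Mean rating of `item` among the users trusted by `user`, or None."""
    friends = trusted.get(user, set())
    values = [r for (u, i, r) in ratings if i == item and u in friends]
    return sum(values) / len(values) if values else None
```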

MovieLens

Link to the data: https://grouplens.org/datasets/movielens/

cornac.datasets.movielens.load_100k(fmt='UIR', reader=None)

Load the MovieLens 100K dataset

Parameters:
  • fmt (str, default: 'UIR') – Data format to be returned.
  • reader (obj:cornac.data.Reader, default: None) – Reader object used to read the data.
Returns:data – Data in the form of a list of tuples depending on the given data format.
Return type:array-like

cornac.datasets.movielens.load_1m(fmt='UIR', reader: cornac.data.reader.Reader = None) → List

Load the MovieLens 1M dataset

Parameters:
  • fmt (str, default: 'UIR') – Data format to be returned.
  • reader (obj:cornac.data.Reader, default: None) – Reader object used to read the data.
Returns:

data – Data in the form of a list of tuples depending on the given data format.

Return type:

array-like

cornac.datasets.movielens.load_plot()

Load the movie plots provided at http://dm.postech.ac.kr/~cartopy/ConvMF/

Returns:
  • texts (List) – List of text documents, one per item.
  • ids (List) – List of item ids aligned with indices in texts.
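
With the default `fmt='UIR'`, each tuple is assumed here to be ordered as (user, item, rating). Under that assumption, the tuples can be unpacked directly, for example to compute a mean rating per user. The sample tuples are stand-ins, not actual MovieLens records.

```python
from collections import defaultdict

# Stand-in tuples shaped like the output of
# cornac.datasets.movielens.load_100k(fmt='UIR').
# Assumption: 'UIR' means the order (user, item, rating).
data = [("196", "242", 3.0), ("186", "302", 3.0), ("196", "51", 5.0)]

rating_sum = defaultdict(float)
rating_count = defaultdict(int)
for user, item, rating in data:
    rating_sum[user] += rating
    rating_count[user] += 1

# Mean rating per user.
user_mean = {user: rating_sum[user] / rating_count[user] for user in rating_sum}
```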

Netflix

Link to the data: https://www.kaggle.com/netflix-inc/netflix-prize-data/

cornac.datasets.netflix.load_data(fmt='UIR', reader: cornac.data.reader.Reader = None) → List

Load the entire Netflix dataset:
  • Number of ratings: 100,480,507
  • Number of users: 480,189
  • Number of items: 17,770

Parameters:
  • fmt (str, default: 'UIR') – Data format to be returned.
  • reader (obj:cornac.data.Reader, default: None) – Reader object used to read the data.
Returns:

data – Data in the form of a list of tuples depending on the given data format.

Return type:

array-like

cornac.datasets.netflix.load_data_small(fmt='UIR', reader: cornac.data.reader.Reader = None) → List

Load a small subset of the Netflix dataset. We draw this subsample such that every user has at least 10 items and each item has at least 10 users.
  • Number of ratings: 607,803
  • Number of users: 10,000
  • Number of items: 5,000

Parameters:
  • fmt (str, default: 'UIR') – Data format to be returned.
  • reader (obj:cornac.data.Reader, default: None) – Reader object used to read the data.
Returns:

data – Data in the form of a list of tuples depending on the given data format.

Return type:

array-like
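
The minimum-count constraint described for the small subset (at least 10 items per user and 10 users per item) can be enforced with an iterative filter. The sketch below is one possible way to do it and is not necessarily how the subsample was actually drawn; the function name `filter_min_counts` is hypothetical, and it assumes each (user, item) pair appears at most once.

```python
from collections import Counter


def filter_min_counts(data, min_user=10, min_item=10):
    """Repeatedly drop sparse users and items until the data is stable.

    Removing an item can push a user below the threshold (and vice versa),
    so the filter is re-applied until nothing more is removed.
    """
    data = list(data)
    while True:
        user_counts = Counter(u for u, _, _ in data)
        item_counts = Counter(i for _, i, _ in data)
        kept = [
            (u, i, r)
            for (u, i, r) in data
            if user_counts[u] >= min_user and item_counts[i] >= min_item
        ]
        if len(kept) == len(data):
            return kept
        data = kept


# Toy example with thresholds of 2 instead of 10.
toy = [
    ("u1", "a", 5), ("u1", "b", 4),
    ("u2", "a", 3), ("u2", "b", 2),
    ("u3", "a", 1), ("u3", "c", 1),
]
dense = filter_min_counts(toy, min_user=2, min_item=2)
```

In the toy data, item `c` is removed first, which leaves `u3` with a single rating, so `u3` is removed in the next pass.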

Tradesy

Link to the data: http://jmcauley.ucsd.edu/data/tradesy/

This data is used in the VBPR paper. After cleaning the data, we have:
  • Number of feedback: 394,421 (410,186 is reported but there are duplicates)
  • Number of users: 19,243 (19,823 is reported due to duplicates)
  • Number of items: 165,906 (166,521 is reported due to duplicates)

cornac.datasets.tradesy.load_data(reader: cornac.data.reader.Reader = None) → List

Load the feedback observations

Parameters:reader (obj:cornac.data.Reader, default: None) – Reader object used to read the data.
Returns:data – Data in the form of a list of tuples (user, item, 1).
Return type:array-like

cornac.datasets.tradesy.load_feature()

Load the item visual features

Returns:
  • features (numpy.ndarray) – Feature matrix with shape (n, 4096), where n is the number of items.
  • item_ids (List) – List of item ids aligned with indices in features.
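
The statistics above differ from the reported ones because of duplicates. As a minimal sketch of that kind of cleaning, the snippet below removes exact duplicate tuples from stand-in feedback and recounts users and items; the actual cleaning procedure applied to Tradesy is not described here, so treat this only as an illustration.

```python
# Stand-in tuples shaped like the (user, item, 1) output of
# cornac.datasets.tradesy.load_data(), with one duplicated observation.
feedback = [("u1", "i1", 1), ("u1", "i1", 1), ("u2", "i1", 1), ("u2", "i3", 1)]

# dict.fromkeys keeps the first occurrence of each tuple and preserves order.
unique_feedback = list(dict.fromkeys(feedback))

n_feedback = len(unique_feedback)
n_users = len({user for user, _, _ in unique_feedback})
n_items = len({item for _, item, _ in unique_feedback})
```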