Data

class cornac.data.FeatureModule(features=None, ids=None, copy=False, normalized=False, **kwargs)[source]
Parameters:
  • features (numpy.ndarray or scipy.sparse.csr_matrix, default = None) – A 2d array whose row indices are aligned with the user/item ids in ids.
  • ids (List, default = None) – List of user/item ids whose indices are aligned with the rows of features. If None, the row indices of the provided features will be used as ids.
  • copy (bool, default = False) – Whether or not to make a copy of the input features array and leave it unchanged during manipulation. If False, rows of the input feature array may be swapped in place when building the module.
batch_feature(batch_ids)[source]

Return a matrix (batch of feature vectors) corresponding to provided batch_ids

build(id_map=None)[source]

Build the feature matrix. Features will be swapped if the id_map is provided

feature_dim

Return the dimensionality of the feature vectors

features

Return the whole feature matrix
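
For illustration, a minimal sketch of the typical flow (the feature values and ids below are invented, and batch ids are assumed to correspond to row indices when no id_map is given):

import numpy as np
from cornac.data import FeatureModule

# Hypothetical toy data: one 3-dimensional feature vector per id.
features = np.asarray([[1.0, 0.0, 2.0],
                       [0.5, 1.5, 0.0],
                       [3.0, 0.0, 1.0]])
ids = ['item_a', 'item_b', 'item_c']

fm = FeatureModule(features=features, ids=ids, copy=True)
fm.build()                        # no id_map: rows keep their original order
print(fm.feature_dim)             # 3
batch = fm.batch_feature([0, 2])  # feature vectors for mapped ids 0 and 2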

class cornac.data.TextModule(corpus: List[str] = None, ids: List = None, tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_vocab: int = None, max_doc_freq: Union[float, int] = 1.0, min_freq: int = 1, stop_words: Union[List, str] = None, **kwargs)[source]

Text module

Parameters:
  • corpus (List[str], default = None) – List of user/item texts whose indices are aligned with ids.
  • ids (List, default = None) – List of user/item ids whose indices are aligned with corpus. If None, the indices of the provided corpus will be used as ids.
  • tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
  • vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains the mapping from tokens to their integer ids and vice versa.
  • max_vocab (int, optional, default = None) – The maximum size of the vocabulary. If vocab is provided, this will be ignored.
  • max_doc_freq (Union[float, int] = 1.0) – Tokens with a document frequency above this threshold are excluded from the vocabulary. If float, the value represents a proportion of documents; if int, an absolute count. If vocab is not None, this will be ignored.
  • min_freq (int, default = 1) – The minimum frequency for a token to be included in the vocabulary. If vocab is not None, this will be ignored.
  • stop_words (Collection, str, default: None) – Collection of stop words to be ignored when building the Vocabulary. If str, it indicates a built-in stop-word list; currently, only english is supported.
batch_seq(batch_ids, max_length=None)[source]

Return a numpy matrix of text sequences containing token ids with size=(len(batch_ids), max_length). If max_length=None, it will be inferred based on retrieved sequences.

batch_tfidf(batch_ids)[source]

Return matrix of TF-IDF features corresponding to provided batch_ids

build(id_map=None)[source]

Build the module based on the provided list of ordered ids
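
A hedged usage sketch of TextModule (corpus and ids are invented for illustration):

from cornac.data import TextModule
from cornac.data.text import BaseTokenizer

docs = ['great sci-fi movie', 'romantic comedy', 'classic sci-fi']
ids = ['m1', 'm2', 'm3']

tm = TextModule(corpus=docs, ids=ids,
                tokenizer=BaseTokenizer(sep=' '),
                max_vocab=100, min_freq=1)
tm.build()  # no id_map: the corpus order defines the mapped ids

seqs = tm.batch_seq([0, 1], max_length=5)  # (2, 5) matrix of token ids
tfidf = tm.batch_tfidf([0, 2])             # TF-IDF features for the two ids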

class cornac.data.ImageModule(**kwargs)[source]

Image module

batch_image(batch_ids, target_size=(256, 256), color_mode='rgb', interpolation='nearest')[source]

Return batch of images corresponding to provided batch_ids

build(id_map=None)[source]

Build the module based on the provided list of ordered ids

class cornac.data.GraphModule(**kwargs)[source]

Graph module

batch(batch_ids)[source]

Return batch of vectors from the sparse adjacency matrix corresponding to provided batch_ids.

Parameters: batch_ids (array, required) – An array containing the ids of the rows to be returned from the sparse adjacency matrix.
build(id_map=None)[source]

Build the feature matrix. Features will be swapped if the id_map is provided

get_train_triplet(train_row_ids, train_col_ids)[source]

Get the training tuples

class cornac.data.TrainSet(uid_map, iid_map)[source]

Training Set

Parameters:
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
get_iid(raw_iid)[source]

Return the mapped id of an item given a raw id

get_uid(raw_uid)[source]

Return the mapped id of a user given a raw id

static idx_iter(idx_range, batch_size=1, shuffle=False)[source]

Create an iterator over batch of indices

Parameters:
  • batch_size (int, optional, default = 1) – The batch size.
  • shuffle (bool, optional) – If True, the order is randomized; if False, the original order is kept.
Returns:

iterator

Return type:

batch of indices (array of np.int)

iid_list

Return the list of mapped item ids

is_unk_item(mapped_iid)[source]

Return whether or not an item is unknown given the mapped id

is_unk_user(mapped_uid)[source]

Return whether or not a user is unknown given the mapped id

num_items

Return the number of items

num_users

Return the number of users

raw_iid_list

Return the list of raw item ids

raw_uid_list

Return the list of raw user ids

uid_list

Return the list of mapped user ids
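
A small sketch of the id-mapping API above, using hand-built maps. The raw ids are invented, and idx_iter's idx_range argument, which is not described in this reference, is assumed here to be the number of indices to iterate over:

from collections import defaultdict
from cornac.data import TrainSet

uid_map = defaultdict(int, {'alice': 0, 'bob': 1})                   # raw -> mapped user ids
iid_map = defaultdict(int, {'item_x': 0, 'item_y': 1, 'item_z': 2})  # raw -> mapped item ids

ts = TrainSet(uid_map, iid_map)
print(ts.num_users, ts.num_items)  # 2 3
print(ts.get_uid('alice'))         # 0

# Iterate over batches of indices (assumed semantics of idx_range).
for batch in TrainSet.idx_iter(idx_range=4, batch_size=2, shuffle=True):
    print(batch)                   # array of np.int indices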

class cornac.data.MatrixTrainSet(uir_tuple, max_rating, min_rating, global_mean, uid_map, iid_map)[source]

Training set contains preference matrix

Parameters:
  • uir_tuple (tuple) – Tuple of 3 numpy arrays (users, items, ratings).
  • max_rating (float) – Maximum value of the preferences.
  • min_rating (float) – Minimum value of the preferences.
  • global_mean (float) – Average value of the preferences.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
classmethod from_uir(data, global_uid_map=None, global_iid_map=None, global_ui_set=None, verbose=False)[source]

Construct a MatrixTrainSet from triplet data.

Parameters:
  • data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
  • global_uid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of users.
  • global_iid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of items.
  • global_ui_set (set, optional, default: None) – The global set of (user, item) tuples. This helps avoid duplicate observations.
  • verbose (bool, default: False) – The verbosity flag.
Returns:

train_set – MatrixTrainSet object.

Return type:

<cornac.data.MatrixTrainSet>
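
A minimal sketch of building a MatrixTrainSet from raw (user, item, rating) triplets; the ids and ratings are made up:

from cornac.data import MatrixTrainSet

data = [('u1', 'i1', 4.0),
        ('u1', 'i2', 3.0),
        ('u2', 'i1', 5.0),
        ('u2', 'i3', 2.0)]

train_set = MatrixTrainSet.from_uir(data, verbose=True)
print(train_set.num_users, train_set.num_items)  # 2 3

# Rank items by popularity, e.g. for a popularity baseline.
item_rank, item_scores = train_set.item_ppl_rank()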

item_iter(batch_size=1, shuffle=False)[source]

Create an iterator over item ids

Parameters:
  • batch_size (int, optional, default = 1) – The batch size.
  • shuffle (bool, optional) – If True, the order is randomized; if False, the original order is kept.
Returns:

iterator

Return type:

batch of item ids (array of np.int)

item_ppl_rank()[source]

Rank items by their popularity.

Returns: item_rank, item_scores – Ranking and scores for all items
Return type: array, array
uij_iter(batch_size=1, shuffle=False)[source]

Create an iterator over data yielding batch of users, positive items, and negative items

Parameters:
  • batch_size (int, optional, default = 1) – The batch size.
  • shuffle (bool, optional) – If True, the order is randomized; if False, the original order is kept.
Returns:

iterator

Return type:

batch of users (array of np.int), batch of positive items (array of np.int), batch of negative items (array of np.int)

uir_iter(batch_size=1, shuffle=False)[source]

Create an iterator over data yielding batch of users, items, and rating values

Parameters:
  • batch_size (int, optional, default = 1) – The batch size.
  • shuffle (bool, optional) – If True, the order is randomized; if False, the original order is kept.
Returns:

iterator

Return type:

batch of users (array of np.int), batch of items (array of np.int), batch of ratings (array of np.float)

user_iter(batch_size=1, shuffle=False)[source]

Create an iterator over user ids

Parameters:
  • batch_size (int, optional, default = 1) – The batch size.
  • shuffle (bool, optional) – If True, the order is randomized; if False, the original order is kept.
Returns:

iterator

Return type:

batch of user ids (array of np.int)
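
A sketch of consuming the iterators above during training, with train_set built as in the earlier from_uir example (batch sizes are arbitrary):

# Batches of (users, items, ratings), e.g. for rating prediction models.
for u_batch, i_batch, r_batch in train_set.uir_iter(batch_size=2, shuffle=True):
    pass  # u_batch, i_batch: arrays of np.int; r_batch: array of np.float

# Batches of (users, positive items, negative items), e.g. for pairwise ranking.
for u_batch, pos_batch, neg_batch in train_set.uij_iter(batch_size=2, shuffle=True):
    pass

# Plain batches of user ids and item ids.
for users in train_set.user_iter(batch_size=2):
    pass
for items in train_set.item_iter(batch_size=2):
    pass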

class cornac.data.MultimodalTrainSet(matrix, max_rating, min_rating, global_mean, uid_map, iid_map, **kwargs)[source]

Multimodal training set

Parameters:
  • matrix (scipy.sparse.csr_matrix) – Preferences in the form of scipy sparse matrix.
  • max_rating (float) – Maximum value of the preferences.
  • min_rating (float) – Minimum value of the preferences.
  • global_mean (float) – Average value of the preferences.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
class cornac.data.TestSet(user_ratings, uid_map, iid_map)[source]

Test Set

Parameters:
  • user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
classmethod from_uir(data, global_uid_map, global_iid_map, global_ui_set, verbose=False)[source]

Construct a TestSet from triplet data.

Parameters:
  • data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
  • global_uid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of users.
  • global_iid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of items.
  • global_ui_set (set) – The global set of (user, item) tuples. This helps avoid duplicate observations.
  • verbose (bool, default: False) – The verbosity flag.
Returns:

test_set – TestSet object.

Return type:

<cornac.data.TestSet>

get_iid(raw_iid)[source]

Return the mapped id of an item given a raw id

get_ratings(mapped_uid)[source]

Return a list of (item, rating) tuples for a given mapped user id

get_uid(raw_uid)[source]

Return the mapped id of a user given a raw id

users

Return a list of users

class cornac.data.MultimodalTestSet(user_ratings, uid_map, iid_map, **kwargs)[source]

Test Set

Parameters:
  • user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
class cornac.data.Reader(user_set=None, item_set=None, min_user_freq=1, min_item_freq=1, bin_threshold=None, encoding='utf-8', errors=None)[source]

Reader class for reading data in different formats.

Parameters:
  • user_set (set, default = None) – Set of users to be selected when reading data. If None, all users that appear in the data will be included.
  • item_set (set, default = None) – Set of items to be selected when reading data. If None, all items that appear in the data will be included.
  • min_user_freq (int, default = 1) – The minimum frequency of a user to be selected. If min_user_freq=1, all users that appear in the data will be included.
  • min_item_freq (int, default = 1) – The minimum frequency of an item to be selected. If min_item_freq=1, all items that appear in the data will be included.
  • bin_threshold (float, default = None) – The rating threshold to binarize rating values (turn explicit feedback to implicit feedback). For example, if bin_threshold = 3.0, all rating values >= 3.0 will be set to 1.0, and the rest (< 3.0) will be discarded.
  • encoding (str, default = utf-8) – Encoding used to decode the file.
  • errors (str, default = None) – Optional string that specifies how encoding errors are to be handled. Pass ‘strict’ to raise a ValueError exception if there is an encoding error (None has the same effect), or pass ‘ignore’ to ignore errors.
read(fpath, fmt='UIR', sep='\t', skip_lines=0, id_inline=False, parser=None)[source]

Read data and parse line by line based on provided fmt or parser.

Parameters:
  • fpath (str) – Path to the data file
  • fmt (str, default: UIR) – Line format to be parsed
  • sep (str, default: '\t') – The delimiter string.
  • skip_lines (int, default: 0) – Number of first lines to skip
  • id_inline (bool, default: False) – If True, user ids correspond to the line numbers of the file, and all ids in each line are item ids.
  • parser (function, default: None) – A function that takes a list of str tokenized by sep and returns a list of tuples, which will be joined into the final results. If None, the parser will be determined based on fmt.
Returns:

tuples – Data in the form of a list of tuples. What is inside each tuple depends on parser or fmt.

Return type:

list
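
A sketch of reading a tab-separated UIR file (the file path is hypothetical):

from cornac.data import Reader

reader = Reader(min_user_freq=2, min_item_freq=2, bin_threshold=3.0)
# Each returned tuple follows the 'UIR' format: (user, item, rating);
# with bin_threshold=3.0, kept ratings are binarized to 1.0.
triplets = reader.read('ratings.txt', fmt='UIR', sep='\t', skip_lines=1)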

Train Set

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

class cornac.data.trainset.MatrixTrainSet(uir_tuple, max_rating, min_rating, global_mean, uid_map, iid_map)[source]

Training set contains preference matrix

Parameters:
  • uir_tuple (tuple) – Tuple of 3 numpy arrays (users, items, ratings).
  • max_rating (float) – Maximum value of the preferences.
  • min_rating (float) – Minimum value of the preferences.
  • global_mean (float) – Average value of the preferences.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
classmethod from_uir(data, global_uid_map=None, global_iid_map=None, global_ui_set=None, verbose=False)[source]

Construct a MatrixTrainSet from triplet data.

Parameters:
  • data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
  • global_uid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of users.
  • global_iid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of items.
  • global_ui_set (set, optional, default: None) – The global set of (user, item) tuples. This helps avoid duplicate observations.
  • verbose (bool, default: False) – The verbosity flag.
Returns:

train_set – MatrixTrainSet object.

Return type:

<cornac.data.MatrixTrainSet>

item_iter(batch_size=1, shuffle=False)[source]

Create an iterator over item ids

Parameters:
  • batch_size (int, optional, default = 1) – The batch size.
  • shuffle (bool, optional) – If True, the order is randomized; if False, the original order is kept.
Returns:

iterator

Return type:

batch of item ids (array of np.int)

item_ppl_rank()[source]

Rank items by their popularity.

Returns: item_rank, item_scores – Ranking and scores for all items
Return type: array, array
uij_iter(batch_size=1, shuffle=False)[source]

Create an iterator over data yielding batch of users, positive items, and negative items

Parameters:
  • batch_size (int, optional, default = 1) – The batch size.
  • shuffle (bool, optional) – If True, the order is randomized; if False, the original order is kept.
Returns:

iterator

Return type:

batch of users (array of np.int), batch of positive items (array of np.int), batch of negative items (array of np.int)

uir_iter(batch_size=1, shuffle=False)[source]

Create an iterator over data yielding batch of users, items, and rating values

Parameters:
  • batch_size (int, optional, default = 1) – The batch size.
  • shuffle (bool, optional) – If True, the order is randomized; if False, the original order is kept.
Returns:

iterator

Return type:

batch of users (array of np.int), batch of items (array of np.int), batch of ratings (array of np.float)

user_iter(batch_size=1, shuffle=False)[source]

Create an iterator over user ids

Parameters:
  • batch_size (int, optional, default = 1) – The batch size.
  • shuffle (bool, optional) – If True, the order is randomized; if False, the original order is kept.
Returns:

iterator

Return type:

batch of user ids (array of np.int)

class cornac.data.trainset.MultimodalTrainSet(matrix, max_rating, min_rating, global_mean, uid_map, iid_map, **kwargs)[source]

Multimodal training set

Parameters:
  • matrix (scipy.sparse.csr_matrix) – Preferences in the form of scipy sparse matrix.
  • max_rating (float) – Maximum value of the preferences.
  • min_rating (float) – Minimum value of the preferences.
  • global_mean (float) – Average value of the preferences.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
class cornac.data.trainset.TrainSet(uid_map, iid_map)[source]

Training Set

Parameters:
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
get_iid(raw_iid)[source]

Return the mapped id of an item given a raw id

get_uid(raw_uid)[source]

Return the mapped id of a user given a raw id

static idx_iter(idx_range, batch_size=1, shuffle=False)[source]

Create an iterator over batch of indices

Parameters:
  • batch_size (int, optional, default = 1) – The batch size.
  • shuffle (bool, optional) – If True, the order is randomized; if False, the original order is kept.
Returns:

iterator

Return type:

batch of indices (array of np.int)

iid_list

Return the list of mapped item ids

is_unk_item(mapped_iid)[source]

Return whether or not an item is unknown given the mapped id

is_unk_user(mapped_uid)[source]

Return whether or not a user is unknown given the mapped id

num_items

Return the number of items

num_users

Return the number of users

raw_iid_list

Return the list of raw item ids

raw_uid_list

Return the list of raw user ids

uid_list

Return the list of mapped user ids

Test Set

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

class cornac.data.testset.MultimodalTestSet(user_ratings, uid_map, iid_map, **kwargs)[source]

Test Set

Parameters:
  • user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
class cornac.data.testset.TestSet(user_ratings, uid_map, iid_map)[source]

Test Set

Parameters:
  • user_ratings (defaultdict of list) – The dictionary containing lists of tuples of the form (item, rating). The keys are user ids.
  • uid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of users.
  • iid_map (defaultdict) – The dictionary containing mapping from original ids to mapped ids of items.
classmethod from_uir(data, global_uid_map, global_iid_map, global_ui_set, verbose=False)[source]

Construct a TestSet from triplet data.

Parameters:
  • data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
  • global_uid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of users.
  • global_iid_map (defaultdict) – The dictionary containing global mapping from original ids to mapped ids of items.
  • global_ui_set (set) – The global set of (user, item) tuples. This helps avoid duplicate observations.
  • verbose (bool, default: False) – The verbosity flag.
Returns:

test_set – TestSet object.

Return type:

<cornac.data.TestSet>

get_iid(raw_iid)[source]

Return the mapped id of an item given a raw id

get_ratings(mapped_uid)[source]

Return a list of (item, rating) tuples for a given mapped user id

get_uid(raw_uid)[source]

Return the mapped id of a user given a raw id

users

Return a list of users

Graph Module

@author: Aghiles Salah <asalah@smu.edu.sg>

class cornac.data.graph.GraphModule(**kwargs)[source]

Graph module

batch(batch_ids)[source]

Return batch of vectors from the sparse adjacency matrix corresponding to provided batch_ids.

Parameters: batch_ids (array, required) – An array containing the ids of the rows to be returned from the sparse adjacency matrix.
build(id_map=None)[source]

Build the feature matrix. Features will be swapped if the id_map is provided

get_train_triplet(train_row_ids, train_col_ids)[source]

Get the training tuples

Text Module

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

class cornac.data.text.Tokenizer[source]

Generic class for tokenizers to extend. A tokenizer typically splits text into either word tokens or character tokens.

batch_tokenize(texts: List[str]) → List[List[str]][source]

Split a corpus of multiple text documents.

Returns: tokens
Return type: List[List[str]]
tokenize(t: str) → List[str][source]

Split text into tokens.

Returns: tokens
Return type: List[str]
class cornac.data.text.BaseTokenizer(sep: str = ' ', pre_rules: List[Callable[str, str]] = None, stop_words: Union[List, str] = None)[source]

A base tokenizer that uses a provided delimiter sep to split text.

batch_tokenize(texts: List[str]) → List[List[str]][source]

Split a corpus of multiple text documents.

Returns: tokens
Return type: List[List[str]]
tokenize(t: str) → List[str][source]

Split text into tokens.

Returns: tokens
Return type: List[str]
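
A small sketch of the tokenizer API above (input strings invented; expected outputs shown as comments):

from cornac.data.text import BaseTokenizer

tok = BaseTokenizer(sep=' ')
print(tok.tokenize('a whitespace separated sentence'))
# ['a', 'whitespace', 'separated', 'sentence']
print(tok.batch_tokenize(['first document', 'second document']))
# [['first', 'document'], ['second', 'document']]
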
class cornac.data.text.Vocabulary(idx2tok: List[str], use_special_tokens: bool = False)[source]

Vocabulary contains the mapping between integer indices and tokens, and vice versa.

classmethod from_sequences(sequences: List[List[str]], max_vocab: int = None, min_freq: int = 1, use_special_tokens: bool = False) → cornac.data.text.Vocabulary[source]

Build a vocabulary from sequences (list of list of tokens).

classmethod from_tokens(tokens: List[str], max_vocab: int = None, min_freq: int = 1, use_special_tokens: bool = False) → cornac.data.text.Vocabulary[source]

Build a vocabulary from list of tokens.

classmethod load(path)[source]

Load a vocabulary from a pickle file at path.

save(path)[source]

Save idx2tok into a pickle file.

to_idx(tokens: List[str]) → List[int][source]

Convert a list of tokens to their integer indices.

to_text(indices: List[int], sep=' ') → List[str][source]

Convert a list of integer indices to their tokens.
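
A sketch of building and using a Vocabulary (tokens are invented; the exact index assignment may differ):

from cornac.data.text import Vocabulary

corpus_tokens = [['good', 'movie'], ['bad', 'movie'], ['good', 'film']]
vocab = Vocabulary.from_sequences(corpus_tokens, max_vocab=10, min_freq=1)

idx = vocab.to_idx(['good', 'movie'])  # tokens -> integer indices
text = vocab.to_text(idx)              # indices -> tokens, joined by sep=' '

vocab.save('vocab.pkl')                # hypothetical path; persists idx2tok
vocab2 = Vocabulary.load('vocab.pkl')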

class cornac.data.text.CountVectorizer(tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_doc_freq: Union[float, int] = 1.0, min_freq: int = 1, max_features: int = None, stop_words: Union[List, str] = None, binary: bool = False)[source]

Convert a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

Parameters:
  • tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
  • vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains mapping between tokens to their integer ids and vice versa.
  • max_doc_freq (Union[float, int] = 1.0) – Tokens with a document frequency above this threshold are excluded from the vocabulary. If float, the value represents a proportion of documents; if int, an absolute count. If vocab is not None, this will be ignored.
  • min_freq (int, default = 1) – The minimum frequency for a token to be included in the vocabulary. If vocab is not None, this will be ignored.
  • max_features (int, default=None) – If not None, build a vocabulary that only considers the top max_features tokens ordered by term frequency across the corpus. If vocab is not None, this will be ignored.
  • stop_words (Collection, str, default: None) – Collection of stop words to be ignored when building the Vocabulary. If str, it indicates a built-in stop-word list; currently, only english is supported.
  • binary (boolean, default=False) – If True, all non-zero counts are set to 1.
fit(raw_documents: List[str]) → cornac.data.text.CountVectorizer[source]

Build a vocabulary of all tokens in the raw documents.

Parameters: raw_documents (iterable) – An iterable which yields either str, unicode or file objects.
Returns:
Return type: self
fit_transform(raw_documents: List[str]) → (List[List[str]], scipy.sparse.csr_matrix)[source]

Build the vocabulary and return the document-term matrix.

Parameters: raw_documents (List[str]) –
Returns:
sequences: List[List[str]]
Tokenized sequences of raw_documents.
X: array, [n_samples, n_features]
Document-term matrix.
Return type: (sequences, X)
transform(raw_documents: List[str]) → (List[List[str]], scipy.sparse.csr_matrix)[source]

Transform documents to a document-term matrix.

Parameters: raw_documents (List[str]) –
Returns:
sequences: List[List[str]]
Tokenized sequences of raw_documents.
X: array, [n_samples, n_features]
Document-term matrix.
Return type: (sequences, X)
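
A sketch of the vectorizer on a toy corpus (documents invented):

from cornac.data.text import CountVectorizer

docs = ['the cat sat', 'the dog sat', 'the cat ran']
vectorizer = CountVectorizer(max_features=10, binary=False)

sequences, X = vectorizer.fit_transform(docs)
print(X.shape)       # (3, n_features); X is a scipy.sparse.csr_matrix
print(sequences[0])  # tokenized form of the first document

# Re-use the fitted vocabulary on new documents.
new_sequences, X_new = vectorizer.transform(['the cat'])
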
class cornac.data.text.TextModule(corpus: List[str] = None, ids: List = None, tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_vocab: int = None, max_doc_freq: Union[float, int] = 1.0, min_freq: int = 1, stop_words: Union[List, str] = None, **kwargs)[source]

Text module

Parameters:
  • corpus (List[str], default = None) – List of user/item texts whose indices are aligned with ids.
  • ids (List, default = None) – List of user/item ids whose indices are aligned with corpus. If None, the indices of the provided corpus will be used as ids.
  • tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
  • vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains the mapping from tokens to their integer ids and vice versa.
  • max_vocab (int, optional, default = None) – The maximum size of the vocabulary. If vocab is provided, this will be ignored.
  • max_doc_freq (Union[float, int] = 1.0) – Tokens with a document frequency above this threshold are excluded from the vocabulary. If float, the value represents a proportion of documents; if int, an absolute count. If vocab is not None, this will be ignored.
  • min_freq (int, default = 1) – The minimum frequency for a token to be included in the vocabulary. If vocab is not None, this will be ignored.
  • stop_words (Collection, str, default: None) – Collection of stop words to be ignored when building the Vocabulary. If str, it indicates a built-in stop-word list; currently, only english is supported.
batch_seq(batch_ids, max_length=None)[source]

Return a numpy matrix of text sequences containing token ids with size=(len(batch_ids), max_length). If max_length=None, it will be inferred based on retrieved sequences.

batch_tfidf(batch_ids)[source]

Return matrix of TF-IDF features corresponding to provided batch_ids

build(id_map=None)[source]

Build the module based on the provided list of ordered ids

Image Module

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

class cornac.data.image.ImageModule(**kwargs)[source]

Image module

batch_image(batch_ids, target_size=(256, 256), color_mode='rgb', interpolation='nearest')[source]

Return batch of images corresponding to provided batch_ids

build(id_map=None)[source]

Build the module based on the provided list of ordered ids

Reader

@author: Quoc-Tuan Truong <tuantq.vnu@gmail.com>

class cornac.data.reader.Reader(user_set=None, item_set=None, min_user_freq=1, min_item_freq=1, bin_threshold=None, encoding='utf-8', errors=None)[source]

Reader class for reading data in different formats.

Parameters:
  • user_set (set, default = None) – Set of users to be selected when reading data. If None, all users that appear in the data will be included.
  • item_set (set, default = None) – Set of items to be selected when reading data. If None, all items that appear in the data will be included.
  • min_user_freq (int, default = 1) – The minimum frequency of a user to be selected. If min_user_freq=1, all users that appear in the data will be included.
  • min_item_freq (int, default = 1) – The minimum frequency of an item to be selected. If min_item_freq=1, all items that appear in the data will be included.
  • bin_threshold (float, default = None) – The rating threshold to binarize rating values (turn explicit feedback to implicit feedback). For example, if bin_threshold = 3.0, all rating values >= 3.0 will be set to 1.0, and the rest (< 3.0) will be discarded.
  • encoding (str, default = utf-8) – Encoding used to decode the file.
  • errors (str, default = None) – Optional string that specifies how encoding errors are to be handled. Pass ‘strict’ to raise a ValueError exception if there is an encoding error (None has the same effect), or pass ‘ignore’ to ignore errors.
read(fpath, fmt='UIR', sep='\t', skip_lines=0, id_inline=False, parser=None)[source]

Read data and parse line by line based on provided fmt or parser.

Parameters:
  • fpath (str) – Path to the data file
  • fmt (str, default: UIR) – Line format to be parsed
  • sep (str, default: '\t') – The delimiter string.
  • skip_lines (int, default: 0) – Number of first lines to skip
  • id_inline (bool, default: False) – If True, user ids correspond to the line numbers of the file, and all ids in each line are item ids.
  • parser (function, default: None) – A function that takes a list of str tokenized by sep and returns a list of tuples, which will be joined into the final results. If None, the parser will be determined based on fmt.
Returns:

tuples – Data in the form of a list of tuples. What is inside each tuple depends on parser or fmt.

Return type:

list

cornac.data.reader.read_text(fpath, sep=None, encoding='utf-8', errors=None)[source]

Read a text file and return two lists: text documents and the corresponding ids. If sep is None, return only one list whose elements are the lines of text in the original file.

Parameters:
  • fpath (str) – Path to the data file
  • sep (str, default = None) – The delimiter string used to split id and text. Each line is assumed to contain an id followed by the corresponding text document. If None, each line will be a str in the returned list.
  • encoding (str, default = utf-8) – Encoding used to decode the file.
  • errors (str, default = None) – Optional string that specifies how encoding errors are to be handled. Pass ‘strict’ to raise a ValueError exception if there is an encoding error (None has the same effect), or pass ‘ignore’ to ignore errors.
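
A sketch of reading raw text files (paths are hypothetical; per the description above, the text list is assumed to come first and the id list second when sep is given):

from cornac.data.reader import read_text

# One document per line, no ids.
lines = read_text('descriptions.txt')

# Each line formatted as "<id>\t<text>": split into texts and ids.
texts, ids = read_text('descriptions_with_ids.txt', sep='\t')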