Data¶
-
class
cornac.data.
FeatureModality
(features=None, ids=None, normalized=False, **kwargs)[source]¶ Modality that contains features in general
Parameters: - features (numpy.ndarray or scipy.sparse.csr_matrix, default = None) – 2d NumPy array (or sparse matrix) whose row indices are aligned with the user/item ids in ids.
- ids (List, default = None) – List of user/item ids whose indices are aligned with the rows of features. If None, the row indices of the provided features will be used as ids.
-
batch_feature
(batch_ids)[source]¶ Return a matrix (batch of feature vectors) corresponding to provided batch_ids
-
build
(id_map=None, **kwargs)[source]¶ Build the feature matrix. Features will be swapped if the id_map is provided
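The reordering that build performs when an id_map is given can be pictured with a small plain-Python sketch. This is not cornac's implementation: the helper name reorder_features is hypothetical, and it assumes that id_map indices are contiguous from 0 and that every id in id_map also appears in ids.

```python
def reorder_features(features, ids, id_map):
    """Place each feature row at the index that id_map assigns to its raw id."""
    # Assumption: raw ids that are absent from id_map are simply dropped.
    row_of = {raw_id: row for raw_id, row in zip(ids, features)}
    ordered = [None] * len(id_map)
    for raw_id, idx in id_map.items():
        ordered[idx] = row_of[raw_id]
    return ordered


features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
ids = ["item_a", "item_b", "item_c"]
id_map = {"item_c": 0, "item_a": 1}  # hypothetical mapping built from training data
print(reorder_features(features, ids, id_map))
```

After the reordering, row i of the matrix belongs to the user/item whose mapped index is i, so a model can look features up by index.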
-
feature_dim
¶ Return the dimensionality of the feature vectors
-
features
¶ Return the whole feature matrix
-
class
cornac.data.
TextModality
(corpus: List[str] = None, ids: List = None, tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_vocab: int = None, max_doc_freq: Union[float, int] = 1.0, min_doc_freq: int = 1, tfidf_params: Dict = None, **kwargs)[source]¶ Text modality
Parameters: - corpus (List[str], default = None) – List of user/item texts that the indices are aligned with ids.
- ids (List, default = None) – List of user/item ids that the indices are aligned with corpus. If None, the indices of provided corpus will be used as ids.
- tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
- vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains mapping between tokens to their integer ids and vice versa.
- max_vocab (int, optional, default = None) – The maximum size of the vocabulary. If vocab is provided, this will be ignored.
- max_doc_freq (float in range [0.0, 1.0] or int, default=1.0) – When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the value represents a proportion of documents, int for absolute counts. If vocab is not None, this will be ignored.
- min_doc_freq (float in range [0.0, 1.0] or int, default=1) – When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the value represents a proportion of documents, int absolute counts. If vocab is not None, this will be ignored.
- tfidf_params (dict or None, optional, default=None) – If None, the default arguments of <cornac.data.text.TfidfVectorizer> will be used. List of parameters:
- 'binary' : boolean, default=False – If True, all non-zero counts are set to 1.
- 'norm' : 'l1', 'l2' or None, optional, default='l2' – Each output row will have unit norm, either:
* 'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
* 'l1': Sum of absolute values of vector elements is 1.
See utils.common.normalize()
- 'use_idf' : boolean, default=True – Enable inverse-document-frequency reweighting.
- 'smooth_idf' : boolean, default=True – Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
- 'sublinear_tf' : boolean, default=False – Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
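To make the effect of these options concrete, here is a minimal plain-Python sketch of TF-IDF weighting. It is not cornac's code: the function tfidf is hypothetical, and the idf formula follows the scikit-learn convention (ln((1 + n) / (1 + df)) + 1 when smoothing), which is assumed here rather than confirmed for cornac.

```python
import math
from collections import Counter


def tfidf(docs, smooth_idf=True, sublinear_tf=False, norm="l2"):
    """Return (vocab, rows) where rows[d][t] is the weight of term t in doc d."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = []
        for tok in vocab:
            count = tf[tok]
            if count == 0:
                row.append(0.0)
                continue
            weight = 1.0 + math.log(count) if sublinear_tf else float(count)
            if smooth_idf:
                idf = math.log((1 + n_docs) / (1 + df[tok])) + 1.0
            else:
                idf = math.log(n_docs / df[tok]) + 1.0
            row.append(weight * idf)
        if norm == "l2":
            scale = math.sqrt(sum(v * v for v in row))
        elif norm == "l1":
            scale = sum(abs(v) for v in row)
        else:
            scale = 1.0
        rows.append([v / scale if scale else 0.0 for v in row])
    return vocab, rows


vocab, rows = tfidf([["a", "b"], ["a"]])
print(vocab, rows)
```

With norm='l2' every non-empty row has unit length, so the dot product of two rows equals their cosine similarity.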
-
batch_seq
(batch_ids, max_length=None)[source]¶ Return a numpy matrix of text sequences containing token ids with size=(len(batch_ids), max_length).
Parameters: - batch_ids (Union[List, numpy.array], required) – An array containing the ids of rows of text sequences to be returned.
- max_length (int, optional) – Cut-off length of returned sequences. If None, it will be inferred based on retrieved sequences.
Returns: batch_sequences – Batch of sequences with zero-padding at the end.
Return type: numpy.ndarray
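The padding and cut-off behaviour described for batch_seq can be sketched in plain Python as follows. The helper pad_sequences is hypothetical and works on lists rather than NumPy arrays; it only illustrates zero-padding at the end and truncation to max_length.

```python
def pad_sequences(sequences, max_length=None):
    """Right-pad token-id sequences with zeros to a common length."""
    if max_length is None:
        # Infer the length from the longest retrieved sequence.
        max_length = max((len(seq) for seq in sequences), default=0)
    batch = []
    for seq in sequences:
        clipped = list(seq[:max_length])
        batch.append(clipped + [0] * (max_length - len(clipped)))
    return batch


print(pad_sequences([[1, 2, 3], [4]]))
print(pad_sequences([[1, 2, 3], [4]], max_length=2))
```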
-
batch_tfidf
(batch_ids, keep_sparse=False)[source]¶ Return matrix of TF-IDF features corresponding to provided batch_ids
Parameters: - batch_ids (array) – An array of ids to retrieve the corresponding features.
- keep_sparse (bool, default = False) – If True, the returned feature matrix will be a scipy.sparse.csr_matrix. Otherwise, it will be a dense matrix.
Returns: batch_tfidf – Batch of TF-IDF representations corresponding to input batch_ids.
Return type: numpy.ndarray
-
build
(id_map=None, **kwargs)[source]¶ Build the model based on provided list of ordered ids
Parameters: id_map (dict, optional) – A dictionary that holds the mapping from original ids to mapped integer indices of users/items.
Returns: text_modality – An object of type TextModality.
Return type: <cornac.data.TextModality>
-
tfidf_matrix
¶ Return tf-idf matrix.
-
class
cornac.data.
ReviewModality
(data: List[tuple] = None, group_by: str = None, tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_vocab: int = None, max_doc_freq: Union[float, int] = 1.0, min_doc_freq: int = 1, tfidf_params: Dict = None, **kwargs)[source]¶ Review modality
Parameters: - data (List[tuple], required) – A list of (user, item, review) triplets, e.g., data=[('user1', 'item1', 'review1'), ('user2', 'item2', 'review2')].
- group_by ('user', 'item', or None, required, default = None) – Group mode. Whether reviews are grouped based on users, items, or not.
- tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
- vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains mapping between tokens to their integer ids and vice versa.
- max_vocab (int, optional, default = None) – The maximum size of the vocabulary. If vocab is provided, this will be ignored.
- max_doc_freq (float in range [0.0, 1.0] or int, default=1.0) – When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the value represents a proportion of documents, int for absolute counts. If vocab is not None, this will be ignored.
- min_doc_freq (float in range [0.0, 1.0] or int, default=1) – When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the value represents a proportion of documents, int absolute counts. If vocab is not None, this will be ignored.
- tfidf_params (dict or None, optional, default=None) – If None, the default arguments of <cornac.data.text.TfidfVectorizer> will be used. List of parameters:
- 'binary' : boolean, default=False – If True, all non-zero counts are set to 1.
- 'norm' : 'l1', 'l2' or None, optional, default='l2' – Each output row will have unit norm, either:
* 'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
* 'l1': Sum of absolute values of vector elements is 1.
See utils.common.normalize()
- 'use_idf' : boolean, default=True – Enable inverse-document-frequency reweighting.
- 'smooth_idf' : boolean, default=True – Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
- 'sublinear_tf' : boolean, default=False – Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
-
class
cornac.data.
ImageModality
(**kwargs)[source]¶ Image modality
Parameters: - images (Union[List, numpy.ndarray], optional) – A list or tensor of images that the row indices are aligned with user/item in ids.
- paths (List[str], optional) – A list of paths to images stored on disk, whose row indices are aligned with user/item in ids.
-
batch_image
(batch_ids, target_size=(256, 256), color_mode='rgb', interpolation='nearest')[source]¶ Return batch of images corresponding to provided batch_ids
Parameters: - batch_ids (Union[List, numpy.array], required) – An array containing the ids of rows of images to be returned.
- target_size (tuple, optional, default: (256, 256)) – Size (width, height) of returned images to be resized.
- color_mode (str, optional, default: 'rgb') – Color mode of returned images.
- interpolation (str, optional, default: 'nearest') – Method used for interpolation when resizing images. Options are the methods supported by OpenCV.
Returns: res – Batch of images corresponding to input batch_ids.
Return type: numpy.ndarray
-
build
(id_map=None, **kwargs)[source]¶ Build the model based on provided list of ordered ids
Parameters: id_map (dict, optional) – A dictionary that holds the mapping from original ids to mapped integer indices of users/items.
Returns: image_modality – An object of type ImageModality.
Return type: <cornac.data.ImageModality>
-
class
cornac.data.
GraphModality
(**kwargs)[source]¶ Graph modality
Parameters: data (List[str], required) – A list encoding the adjacency matrix of a user or an item graph in the sparse triplet format, e.g., data=[('user1', 'user4', 1.0)].
-
batch
(batch_ids)[source]¶ Return batch of vectors from the sparse adjacency matrix corresponding to provided batch_ids.
Parameters: batch_ids (array, required) – An array containing the ids of rows to be returned from the sparse adjacency matrix.
-
build
(id_map=None, **kwargs)[source]¶ Build the feature matrix. Features will be swapped if the id_map is provided
-
classmethod
from_feature
(features, k=5, ids=None, similarity='cosine', symmetric=False, verbose=True)[source]¶ Instantiate a GraphModality with a KNN graph built using input features.
Parameters: - features (2d Numpy array, shape: [n_objects, n_features], required) – A 2d Numpy array of features, e.g., visual, textual, etc.
- k (int, optional, default: 5) – The number of nearest neighbors
- ids (array, optional, default: None) – The list of object ids or labels, which align with the rows of features. For instance, if you use textual (bag-of-words) features, then "ids" should be the same as the input to cornac.data.TextModality.
- similarity (string, optional, default: "cosine") – The similarity measure. Currently, only cosine similarity is supported.
- symmetric (bool, optional, default: False) – When True the resulting KNN-Graph is made symmetric
- verbose (bool, default: True) – The verbosity flag.
Returns: graph_modality – GraphModality object.
Return type: <cornac.data.GraphModality>
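The idea behind from_feature, linking every object to its k most similar objects under cosine similarity, can be sketched in plain Python. This is an illustration only: knn_triplets and cosine are hypothetical helpers, and using the similarity value as the edge weight is an assumption, not something this page specifies.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_triplets(features, k=5, ids=None, symmetric=False):
    """Return (source, target, weight) triplets of a k-nearest-neighbour graph."""
    ids = list(range(len(features))) if ids is None else list(ids)
    edges = {}
    for i, u in enumerate(features):
        sims = [(cosine(u, v), j) for j, v in enumerate(features) if j != i]
        # Most similar first; ties broken by position for determinism.
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        for sim, j in sims[:k]:
            edges[(ids[i], ids[j])] = sim
            if symmetric:
                # Add the reverse edge unless it already exists.
                edges.setdefault((ids[j], ids[i]), sim)
    return [(src, dst, sim) for (src, dst), sim in edges.items()]


feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(knn_triplets(feats, k=1, ids=["a", "b", "c"]))
```

The output has the same sparse triplet shape as the data argument of GraphModality, which is why such a list could in principle be fed back into the constructor.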
-
get_node_degree
(in_ids=None, out_ids=None)[source]¶ Get the “in” and “out” degree for the desired set of nodes
Parameters: - in_ids (array, required) – An array containing the ids for which to get the “in” degree.
- out_ids (array, required) – An array containing the ids for which to get the “out” degree.
Returns: Dictionary of the form {node_id: [in_degree, out_degree]}.
Return type: dict
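The shape of the returned dictionary can be illustrated by counting degrees directly from a triplet list. The helper node_degrees below is a hypothetical, simplified sketch: it counts every node and ignores the in_ids/out_ids filters.

```python
def node_degrees(triplets):
    """Map each node id to [in_degree, out_degree] from (source, target, weight) triplets."""
    degrees = {}
    for src, dst, _weight in triplets:
        degrees.setdefault(src, [0, 0])[1] += 1  # outgoing edge of src
        degrees.setdefault(dst, [0, 0])[0] += 1  # incoming edge of dst
    return degrees


graph = [("user1", "user4", 1.0), ("user1", "user2", 1.0), ("user2", "user1", 1.0)]
print(node_degrees(graph))
```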
-
get_train_triplet
(train_row_ids, train_col_ids)[source]¶ Get the subset of relations which align with the training data
Parameters: - train_row_ids (array, required) – An array containing the ids of training objects (users or items) for which to get the “out” relations.
- train_col_ids (array, required) – An array containing the ids of training objects (users or items) for which to get the "in" relations. Please refer to cornac/models/c2pf/recom_c2pf.py for a concrete usage example of this function.
Returns: A subset of the adjacency matrix, in the sparse triplet format, whose elements align with the training set as specified by "train_row_ids" and "train_col_ids".
-
matrix
¶ Return the adjacency matrix in scipy csr sparse format
-
-
class
cornac.data.
SentimentModality
(**kwargs)[source]¶ Aspect module
Parameters: data (List[tuple], required) – A triplet list of user, item, and sentiment information, where the sentiment information is itself a list of (aspect, opinion, sentiment) triplets, e.g., data=[('user1', 'item1', [('aspect1', 'opinion1', 'sentiment1')])].
-
build
(uid_map=None, iid_map=None, dok_matrix=None, **kwargs)[source]¶ Build the model based on provided list of ordered ids
-
num_aspects
¶ Return the number of aspects
-
num_opinions
¶ Return the number of opinions
-
-
class
cornac.data.
Dataset
(num_users, num_items, uid_map, iid_map, uir_tuple, timestamps=None, seed=None)[source]¶ Training set contains preference matrix
Parameters: - num_users (int, required) – Number of users.
- num_items (int, required) – Number of items.
- uid_map (OrderedDict, required) – The dictionary containing mapping from user original ids to mapped integer indices.
- iid_map (OrderedDict, required) – The dictionary containing mapping from item original ids to mapped integer indices.
- uir_tuple (tuple, required) – Tuple of 3 numpy arrays (user_indices, item_indices, rating_values).
- timestamps (numpy.array, optional, default: None) – Array of timestamps corresponding to observations in uir_tuple.
- seed (int, optional, default: None) – Random seed for reproducing data sampling.
-
timestamps
¶ Numpy array of timestamps corresponding to feedback in uir_tuple. This is only available when input data is in UIRT format.
Type: numpy.array
-
classmethod
build
(data, fmt='UIR', global_uid_map=None, global_iid_map=None, seed=None, exclude_unknowns=False)[source]¶ Constructing Dataset from given data of specific format.
Parameters: - data (array-like, required) – Data in the form of triplets (user, item, rating) for UIR format, or quadruplets (user, item, rating, timestamps) for UIRT format.
- fmt (str, default: 'UIR') – Format of the input data. Currently supported formats: 'UIR' (User, Item, Rating) and 'UIRT' (User, Item, Rating, Timestamp).
- global_uid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of users.
- global_iid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of items.
- seed (int, optional, default: None) – Random seed for reproducing data sampling.
- exclude_unknowns (bool, default: False) – Ignore unknown users and items.
Returns: res – Dataset object.
Return type: <cornac.data.Dataset>
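What build has to do with raw triplets, assigning a contiguous integer index to every user and item and storing the ratings against those indices, can be sketched as follows. build_index_maps is a hypothetical plain-Python helper, not cornac's code; it uses ordinary dicts (insertion-ordered in Python 3.7+) and lists in place of the OrderedDict and numpy arrays named above.

```python
def build_index_maps(data):
    """Turn (user, item, rating) triplets into id maps and index-based columns."""
    uid_map, iid_map = {}, {}
    user_indices, item_indices, rating_values = [], [], []
    for user, item, rating in data:
        # setdefault assigns the next free index the first time an id is seen.
        user_indices.append(uid_map.setdefault(user, len(uid_map)))
        item_indices.append(iid_map.setdefault(item, len(iid_map)))
        rating_values.append(float(rating))
    return uid_map, iid_map, (user_indices, item_indices, rating_values)


triplets = [("u1", "i1", 4.0), ("u2", "i1", 3.0), ("u1", "i2", 5.0)]
uid_map, iid_map, uir_tuple = build_index_maps(triplets)
print(uid_map, iid_map, uir_tuple)
```

The three parallel lists correspond to the uir_tuple argument of the constructor, and len(uid_map) and len(iid_map) give num_users and num_items.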
-
chrono_item_data
¶ Data organized by item sorted chronologically (timestamps required). A dictionary where keys are items, values are tuples of three chronologically sorted lists (users, ratings, timestamps) interacted with the corresponding items.
-
chrono_user_data
¶ Data organized by user sorted chronologically (timestamps required). A dictionary where keys are users, values are tuples of three chronologically sorted lists (items, ratings, timestamps) interacted by the corresponding users.
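The structure of chrono_user_data can be reproduced from raw UIRT quadruplets with a short plain-Python sketch. The function below is hypothetical and keys the result by raw user id for readability; whether the real property is keyed by raw ids or mapped indices is not stated on this page.

```python
from collections import defaultdict


def chrono_user_data(uirt):
    """Group (user, item, rating, timestamp) rows by user, sorted by timestamp."""
    per_user = defaultdict(list)
    for user, item, rating, timestamp in uirt:
        per_user[user].append((timestamp, item, rating))
    result = {}
    for user, rows in per_user.items():
        rows.sort(key=lambda row: row[0])
        items = [row[1] for row in rows]
        ratings = [row[2] for row in rows]
        timestamps = [row[0] for row in rows]
        result[user] = (items, ratings, timestamps)
    return result


data = [("u1", "i2", 4.0, 20), ("u1", "i1", 5.0, 10), ("u2", "i1", 3.0, 5)]
print(chrono_user_data(data))
```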
-
csc_matrix
¶ The user-item interaction matrix in CSC sparse format
-
csr_matrix
¶ The user-item interaction matrix in CSR sparse format
-
dok_matrix
¶ The user-item interaction matrix in DOK sparse format
-
classmethod
from_uir
(data, seed=None)[source]¶ Constructing Dataset from UIR (User, Item, Rating) triplet data.
Parameters: - data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
- seed (int, optional, default: None) – Random seed for reproducing data sampling.
Returns: res – Dataset object.
Return type: <cornac.data.Dataset>
-
classmethod
from_uirt
(data, seed=None)[source]¶ Constructing Dataset from UIRT (User, Item, Rating, Timestamp) quadruplet data.
Parameters: - data (array-like, shape: [n_examples, 4]) – Data in the form of quadruplets (user, item, rating, timestamp)
- seed (int, optional, default: None) – Random seed for reproducing data sampling.
Returns: res – Dataset object.
Return type: <cornac.data.Dataset>
-
idx_iter
(idx_range, batch_size=1, shuffle=False)[source]¶ Create an iterator over batch of indices
Returns: iterator – batch of indices (array of np.int)
-
item_data
¶ Data organized by item. A dictionary where keys are items, values are tuples of two lists (users, ratings) interacted with the corresponding items.
-
item_ids
¶ An iterator over the raw item ids
-
item_indices
¶ An iterator over the item indices
-
item_iter
(batch_size=1, shuffle=False)[source]¶ Create an iterator over item indices
Returns: iterator – batch of item indices (array of np.int)
-
matrix
¶ The user-item interaction matrix in CSR sparse format
-
total_items
¶ Total number of items, including test and validation items if they exist
-
total_users
¶ Total number of users, including test and validation users if they exist
-
uij_iter
(batch_size=1, shuffle=False, neg_sampling='uniform')[source]¶ Create an iterator over data yielding batch of users, positive items, and negative items
Returns: iterator – batch of users (array of np.int), batch of positive items (array of np.int), batch of negative items (array of np.int)
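A rough picture of what such an iterator yields, using uniform negative sampling, is given below. This is a plain-Python sketch, not cornac's implementation: uij_batches is a hypothetical function, user_items maps each user index to the set of item indices the user interacted with, and it is assumed that every user has at least one unobserved item.

```python
import random


def uij_batches(user_items, num_items, batch_size=1, shuffle=False, seed=None):
    """Yield (users, positive_items, negative_items) batches."""
    rng = random.Random(seed)
    pairs = [(u, i) for u, items in user_items.items() for i in sorted(items)]
    if shuffle:
        rng.shuffle(pairs)
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        users = [u for u, _ in chunk]
        positives = [i for _, i in chunk]
        negatives = []
        for u in users:
            # Uniform sampling: redraw until the item is unobserved for this user.
            j = rng.randrange(num_items)
            while j in user_items[u]:
                j = rng.randrange(num_items)
            negatives.append(j)
        yield users, positives, negatives


observed = {0: {0, 1}, 1: {2}}
for batch in uij_batches(observed, num_items=4, batch_size=2, seed=0):
    print(batch)
```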
-
uir_iter
(batch_size=1, shuffle=False, binary=False, num_zeros=0)[source]¶ Create an iterator over data yielding batch of users, items, and rating values
Parameters: - batch_size (int, optional, default = 1) –
- shuffle (bool, optional, default: False) – If True, the order of triplets will be randomized. If False, the default order is kept.
- binary (bool, optional, default: False) – If True, non-zero ratings will be turned into 1, otherwise, values remain unchanged.
- num_zeros (int, optional, default = 0) – Number of unobserved ratings (zeros) to be added per user. This could be used for negative sampling. By default, no values are added.
Returns: iterator – batch of users (array of np.int), batch of items (array of np.int), batch of ratings (array of np.float)
-
user_data
¶ Data organized by user. A dictionary where keys are users, values are tuples of two lists (items, ratings) interacted by the corresponding users.
-
user_ids
¶ An iterator over the raw user ids
-
user_indices
¶ An iterator over the user indices
-
class
cornac.data.
Reader
(user_set=None, item_set=None, min_user_freq=1, min_item_freq=1, bin_threshold=None, encoding='utf-8', errors=None)[source]¶ Reader class for reading data with different types of format.
Parameters: - user_set (set, default = None) – Set of users to be retained when reading data. If None, all users will be included.
- item_set (set, default = None) – Set of items to be retained when reading data. If None, all items will be included.
- min_user_freq (int, default = 1) – The minimum frequency of a user to be retained. If min_user_freq = 1, all users will be included.
- min_item_freq (int, default = 1) – The minimum frequency of an item to be retained. If min_item_freq = 1, all items will be included.
- bin_threshold (float, default = None) – The rating threshold to binarize rating values (turn explicit feedback to implicit feedback). For example, if bin_threshold = 3.0, all rating values >= 3.0 will be set to 1.0, and the rest (< 3.0) will be discarded.
- encoding (str, default = utf-8) – Encoding used to decode the file.
- errors (str, default = None) – Optional string that specifies how encoding errors are to be handled. Pass 'strict' to raise a ValueError exception if there is an encoding error (None has the same effect), or pass 'ignore' to ignore errors.
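The effect of bin_threshold and the minimum-frequency options can be sketched on an in-memory list of triplets. The helper filter_triplets is hypothetical, and the order of operations (binarize first, then filter by frequency in a single pass) is an assumption made for this illustration.

```python
from collections import Counter


def filter_triplets(triplets, min_user_freq=1, min_item_freq=1, bin_threshold=None):
    """Binarize ratings and drop infrequent users/items from (user, item, rating) triplets."""
    if bin_threshold is not None:
        # Ratings >= threshold become 1.0; the rest are discarded.
        triplets = [(u, i, 1.0) for u, i, r in triplets if r >= bin_threshold]
    user_freq = Counter(u for u, _, _ in triplets)
    item_freq = Counter(i for _, i, _ in triplets)
    return [
        (u, i, r)
        for u, i, r in triplets
        if user_freq[u] >= min_user_freq and item_freq[i] >= min_item_freq
    ]


ratings = [("u1", "i1", 5.0), ("u1", "i2", 2.0), ("u2", "i1", 3.0)]
print(filter_triplets(ratings, bin_threshold=3.0))
print(filter_triplets(ratings, min_item_freq=2))
```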
-
read
(fpath, fmt='UIR', sep='\t', skip_lines=0, id_inline=False, parser=None, **kwargs)[source]¶ Read data and parse line by line based on provided fmt or parser.
Parameters: - fpath (str) – Path to the data file.
- fmt (str, default: 'UIR') – Line format to be parsed (‘UIR’ or ‘UIRT’).
- sep (str, default: '\t') – The delimiter string.
- skip_lines (int, default: 0) – Number of first lines to skip
- id_inline (bool, default: False) – If True, user ids correspond to the line numbers of the file, and all the ids in each line are item ids.
- parser (function, default: None) – Function takes a list of str tokenized by sep and returns a list of tuples which will be joined to the final results. If None, parser will be determined based on fmt.
Returns: tuples – Data in the form of a list of tuples. What is inside each tuple depends on parser or fmt.
Return type: list
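Line-by-line parsing of the 'UIR' and 'UIRT' formats can be sketched as below. parse_lines is a hypothetical stand-in that works on any iterable of lines; it does not cover id_inline or custom parsers, and treating the timestamp as an integer is an assumption.

```python
import io


def parse_lines(lines, fmt="UIR", sep="\t", skip_lines=0):
    """Parse delimited lines into (user, item, rating[, timestamp]) tuples."""
    tuples = []
    for line_no, line in enumerate(lines):
        if line_no < skip_lines:
            continue  # e.g. a header row
        tokens = line.strip().split(sep)
        user, item, rating = tokens[0], tokens[1], float(tokens[2])
        if fmt == "UIRT":
            tuples.append((user, item, rating, int(tokens[3])))
        else:
            tuples.append((user, item, rating))
    return tuples


sample = io.StringIO("user\titem\trating\nu1\ti1\t4.0\nu2\ti3\t2.5\n")
print(parse_lines(sample, skip_lines=1))
```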
Dataset¶
-
class
cornac.data.dataset.
Dataset
(num_users, num_items, uid_map, iid_map, uir_tuple, timestamps=None, seed=None)[source]¶ Training set contains preference matrix
Parameters: - num_users (int, required) – Number of users.
- num_items (int, required) – Number of items.
- uid_map (OrderedDict, required) – The dictionary containing mapping from user original ids to mapped integer indices.
- iid_map (OrderedDict, required) – The dictionary containing mapping from item original ids to mapped integer indices.
- uir_tuple (tuple, required) – Tuple of 3 numpy arrays (user_indices, item_indices, rating_values).
- timestamps (numpy.array, optional, default: None) – Array of timestamps corresponding to observations in uir_tuple.
- seed (int, optional, default: None) – Random seed for reproducing data sampling.
-
timestamps
¶ Numpy array of timestamps corresponding to feedback in uir_tuple. This is only available when input data is in UIRT format.
Type: numpy.array
-
classmethod
build
(data, fmt='UIR', global_uid_map=None, global_iid_map=None, seed=None, exclude_unknowns=False)[source]¶ Constructing Dataset from given data of specific format.
Parameters: - data (array-like, required) – Data in the form of triplets (user, item, rating) for UIR format, or quadruplets (user, item, rating, timestamps) for UIRT format.
- fmt (str, default: 'UIR') – Format of the input data. Currently supported formats: 'UIR' (User, Item, Rating) and 'UIRT' (User, Item, Rating, Timestamp).
- global_uid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of users.
- global_iid_map (defaultdict, optional, default: None) – The dictionary containing global mapping from original ids to mapped ids of items.
- seed (int, optional, default: None) – Random seed for reproducing data sampling.
- exclude_unknowns (bool, default: False) – Ignore unknown users and items.
Returns: res – Dataset object.
Return type: <cornac.data.Dataset>
-
chrono_item_data
¶ Data organized by item sorted chronologically (timestamps required). A dictionary where keys are items, values are tuples of three chronologically sorted lists (users, ratings, timestamps) interacted with the corresponding items.
-
chrono_user_data
¶ Data organized by user sorted chronologically (timestamps required). A dictionary where keys are users, values are tuples of three chronologically sorted lists (items, ratings, timestamps) interacted by the corresponding users.
-
csc_matrix
¶ The user-item interaction matrix in CSC sparse format
-
csr_matrix
¶ The user-item interaction matrix in CSR sparse format
-
dok_matrix
¶ The user-item interaction matrix in DOK sparse format
-
classmethod
from_uir
(data, seed=None)[source]¶ Constructing Dataset from UIR (User, Item, Rating) triplet data.
Parameters: - data (array-like, shape: [n_examples, 3]) – Data in the form of triplets (user, item, rating)
- seed (int, optional, default: None) – Random seed for reproducing data sampling.
Returns: res – Dataset object.
Return type: <cornac.data.Dataset>
-
classmethod
from_uirt
(data, seed=None)[source]¶ Constructing Dataset from UIRT (User, Item, Rating, Timestamp) quadruplet data.
Parameters: - data (array-like, shape: [n_examples, 4]) – Data in the form of quadruplets (user, item, rating, timestamp)
- seed (int, optional, default: None) – Random seed for reproducing data sampling.
Returns: res – Dataset object.
Return type: <cornac.data.Dataset>
-
idx_iter
(idx_range, batch_size=1, shuffle=False)[source]¶ Create an iterator over batch of indices
Returns: iterator – batch of indices (array of np.int)
-
item_data
¶ Data organized by item. A dictionary where keys are items, values are tuples of two lists (users, ratings) interacted with the corresponding items.
-
item_ids
¶ An iterator over the raw item ids
-
item_indices
¶ An iterator over the item indices
-
item_iter
(batch_size=1, shuffle=False)[source]¶ Create an iterator over item indices
Returns: iterator – batch of item indices (array of np.int)
-
matrix
¶ The user-item interaction matrix in CSR sparse format
-
total_items
¶ Total number of items, including test and validation items if they exist
-
total_users
¶ Total number of users, including test and validation users if they exist
-
uij_iter
(batch_size=1, shuffle=False, neg_sampling='uniform')[source]¶ Create an iterator over data yielding batch of users, positive items, and negative items
Returns: iterator – batch of users (array of np.int), batch of positive items (array of np.int), batch of negative items (array of np.int)
-
uir_iter
(batch_size=1, shuffle=False, binary=False, num_zeros=0)[source]¶ Create an iterator over data yielding batch of users, items, and rating values
Parameters: - batch_size (int, optional, default = 1) –
- shuffle (bool, optional, default: False) – If True, the order of triplets will be randomized. If False, the default order is kept.
- binary (bool, optional, default: False) – If True, non-zero ratings will be turned into 1, otherwise, values remain unchanged.
- num_zeros (int, optional, default = 0) – Number of unobserved ratings (zeros) to be added per user. This could be used for negative sampling. By default, no values are added.
Returns: iterator – batch of users (array of np.int), batch of items (array of np.int), batch of ratings (array of np.float)
-
user_data
¶ Data organized by user. A dictionary where keys are users, values are tuples of two lists (items, ratings) interacted by the corresponding users.
-
user_ids
¶ An iterator over the raw user ids
-
user_indices
¶ An iterator over the user indices
Modality¶
-
class
cornac.data.modality.
FeatureModality
(features=None, ids=None, normalized=False, **kwargs)[source]¶ Modality that contains features in general
Parameters: - features (numpy.ndarray or scipy.sparse.csr_matrix, default = None) – 2d NumPy array (or sparse matrix) whose row indices are aligned with the user/item ids in ids.
- ids (List, default = None) – List of user/item ids whose indices are aligned with the rows of features. If None, the row indices of the provided features will be used as ids.
-
batch_feature
(batch_ids)[source]¶ Return a matrix (batch of feature vectors) corresponding to provided batch_ids
-
build
(id_map=None, **kwargs)[source]¶ Build the feature matrix. Features will be swapped if the id_map is provided
-
feature_dim
¶ Return the dimensionality of the feature vectors
-
features
¶ Return the whole feature matrix
Graph Modality¶
-
class
cornac.data.graph.
GraphModality
(**kwargs)[source]¶ Graph modality
Parameters: data (List[str], required) – A list encoding the adjacency matrix of a user or an item graph in the sparse triplet format, e.g., data=[('user1', 'user4', 1.0)].
-
batch
(batch_ids)[source]¶ Return batch of vectors from the sparse adjacency matrix corresponding to provided batch_ids.
Parameters: batch_ids (array, required) – An array containing the ids of rows to be returned from the sparse adjacency matrix.
-
build
(id_map=None, **kwargs)[source]¶ Build the feature matrix. Features will be swapped if the id_map is provided
-
classmethod
from_feature
(features, k=5, ids=None, similarity='cosine', symmetric=False, verbose=True)[source]¶ Instantiate a GraphModality with a KNN graph built using input features.
Parameters: - features (2d Numpy array, shape: [n_objects, n_features], required) – A 2d Numpy array of features, e.g., visual, textual, etc.
- k (int, optional, default: 5) – The number of nearest neighbors
- ids (array, optional, default: None) – The list of object ids or labels, which align with the rows of features. For instance, if you use textual (bag-of-words) features, then "ids" should be the same as the input to cornac.data.TextModality.
- similarity (string, optional, default: "cosine") – The similarity measure. Currently, only cosine similarity is supported.
- symmetric (bool, optional, default: False) – When True the resulting KNN-Graph is made symmetric
- verbose (bool, default: True) – The verbosity flag.
Returns: graph_modality – GraphModality object.
Return type: <cornac.data.GraphModality>
-
get_node_degree
(in_ids=None, out_ids=None)[source]¶ Get the “in” and “out” degree for the desired set of nodes
Parameters: - in_ids (array, required) – An array containing the ids for which to get the “in” degree.
- out_ids (array, required) – An array containing the ids for which to get the “out” degree.
Returns: Dictionary of the form {node_id: [in_degree, out_degree]}.
Return type: dict
-
get_train_triplet
(train_row_ids, train_col_ids)[source]¶ Get the subset of relations which align with the training data
Parameters: - train_row_ids (array, required) – An array containing the ids of training objects (users or items) for which to get the “out” relations.
- train_col_ids (array, required) – An array containing the ids of training objects (users or items) for which to get the "in" relations. Please refer to cornac/models/c2pf/recom_c2pf.py for a concrete usage example of this function.
Returns: A subset of the adjacency matrix, in the sparse triplet format, whose elements align with the training set as specified by "train_row_ids" and "train_col_ids".
-
matrix
¶ Return the adjacency matrix in scipy csr sparse format
-
Text Modality¶
-
class
cornac.data.text.
Tokenizer
[source]¶ Generic class for other subclasses to extend from. This typically either splits text into word tokens or character tokens.
-
class
cornac.data.text.
BaseTokenizer
(sep: str = ' ', pre_rules: List[Callable[str, str]] = None, stop_words: Union[List, str] = None)[source]¶ A base tokenizer that uses the provided delimiter sep to split text.
Parameters: - sep (str, optional, default: ' ') – Separator string used to split text into tokens.
- pre_rules (List[Callable[[str], str]], optional) – List of callable lambda functions to apply on text before tokenization.
- stop_words (Union[List, str], optional) – List of stop-words to be ignored during tokenization, or key of built-in stop-word lists (e.g., english).
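How the three parameters interact can be sketched with a small stand-alone function. tokenize is hypothetical and only mirrors the description above: pre-rules run on the raw text first, the text is then split on sep, and stop words are removed. Built-in stop-word keys such as 'english' are not handled in this sketch.

```python
def tokenize(text, sep=" ", pre_rules=None, stop_words=None):
    """Apply pre-rules, split on sep, and drop empty tokens and stop words."""
    for rule in pre_rules or []:
        text = rule(text)
    stop = set(stop_words or [])
    return [tok for tok in text.split(sep) if tok and tok not in stop]


print(tokenize("The Cat  sat", pre_rules=[str.lower], stop_words=["the"]))
```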
-
class
cornac.data.text.
Vocabulary
(idx2tok: List[str], use_special_tokens: bool = False)[source]¶ Vocabulary basically contains mapping between numbers and tokens and vice versa.
-
classmethod
from_sequences
(sequences: List[List[str]], max_vocab: int = None, min_freq: int = 1, use_special_tokens: bool = False) → cornac.data.text.Vocabulary[source]¶ Build a vocabulary from sequences (list of list of tokens).
Parameters: - sequences (List[List[str]], required) – Corpus of multiple lists of string tokens.
- max_vocab (int, optional) – Limit for size of the vocabulary. If specified, tokens will be ranked based on counts and gathered top-down until reach max_vocab.
- min_freq (int, optional, default: 1) – Cut-off threshold for tokens based on their counts.
- use_special_tokens (bool, optional, default: False) – If True, vocabulary will include SPECIAL_TOKENS.
-
classmethod
from_tokens
(tokens: List[str], max_vocab: int = None, min_freq: int = 1, use_special_tokens: bool = False) → cornac.data.text.Vocabulary[source]¶ Build a vocabulary from list of tokens.
Parameters: - tokens (List[str], required) – List of string tokens.
- max_vocab (int, optional) – Limit for size of the vocabulary. If specified, tokens will be ranked based on counts and gathered top-down until reach max_vocab.
- min_freq (int, optional, default: 1) – Cut-off threshold for tokens based on their counts.
- use_special_tokens (bool, optional, default: False) – If True, vocabulary will include SPECIAL_TOKENS.
-
save
(path)[source]¶ Save idx2tok into a pickle file.
Parameters: path (str, required) – Path to store the dictionary on disk.
-
to_idx
(tokens: List[str]) → List[int][source]¶ Convert a list of tokens to their integer indices.
Parameters: tokens (List[str], required) – List of string tokens. Returns: indices – List of integer indices corresponding to input tokens. Return type: List[int]
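The interplay of max_vocab, min_freq, and to_idx can be sketched with a count-based vocabulary in plain Python. build_vocab is a hypothetical helper, not the Vocabulary class itself: it omits special tokens, and breaking count ties alphabetically is an assumption made to keep the result deterministic.

```python
from collections import Counter


def build_vocab(tokens, max_vocab=None, min_freq=1):
    """Rank tokens by count and return (idx2tok, tok2idx)."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    idx2tok = [tok for tok, count in ranked if count >= min_freq]
    if max_vocab is not None:
        idx2tok = idx2tok[:max_vocab]  # keep the most frequent tokens only
    tok2idx = {tok: idx for idx, tok in enumerate(idx2tok)}
    return idx2tok, tok2idx


idx2tok, tok2idx = build_vocab("a b a c b a".split(), max_vocab=2)
print(idx2tok, [tok2idx[t] for t in ["b", "a"]])
```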
-
class
cornac.data.text.
CountVectorizer
(tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_doc_freq: Union[float, int] = 1.0, min_doc_freq: int = 1, max_features: int = None, binary: bool = False)[source]¶ Convert a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
Parameters: - tokenizer (Tokenizer, optional, default=None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
- vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains mapping between tokens to their integer ids and vice versa.
- max_doc_freq (float in range [0.0, 1.0] or int, default=1.0) – When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the value represents a proportion of documents, int for absolute counts. If vocab is not None, this will be ignored.
- min_doc_freq (float in range [0.0, 1.0] or int, default=1) – When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the value represents a proportion of documents, int for absolute counts. If vocab is not None, this will be ignored.
- max_features (int or None, optional, default=None) – If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus. If vocab is not None, this will be ignored.
- binary (boolean, default=False) – If True, all non-zero counts are set to 1.
Reference: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L790
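The document-frequency thresholds and the binary flag can be sketched in plain Python. This is an illustration of the parameter semantics described above, not the library's code: count_matrix is a hypothetical helper, it takes already tokenized documents, and it returns dense lists instead of a scipy.sparse.csr_matrix.

```python
from collections import Counter
from typing import List, Tuple, Union


def count_matrix(docs: List[List[str]],
                 max_doc_freq: Union[float, int] = 1.0,
                 min_doc_freq: int = 1,
                 binary: bool = False) -> Tuple[List[str], List[List[int]]]:
    """Hypothetical helper: filter terms by document frequency, then count."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # A float threshold is a proportion of documents; an int is an absolute count.
    max_df = max_doc_freq * n_docs if isinstance(max_doc_freq, float) else max_doc_freq
    # Terms strictly above max_df or strictly below min_doc_freq are ignored.
    vocab = sorted(t for t, df in doc_freq.items() if min_doc_freq <= df <= max_df)
    index = {t: j for j, t in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for term in doc:
            if term in index:
                row[index[term]] += 1
        rows.append([min(v, 1) for v in row] if binary else row)
    return vocab, rows


docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["the", "bird"]]
vocab, X = count_matrix(docs, max_doc_freq=0.7, min_doc_freq=2)
```

With these thresholds, "the" is dropped as a corpus-specific stop word (it appears in all three documents) and the terms seen in a single document are cut off, leaving only "sat".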
-
fit
(raw_documents: List[str]) → cornac.data.text.CountVectorizer[source]¶ Build a vocabulary of all tokens in the raw documents.
Parameters: raw_documents (iterable) – An iterable which yields either str, unicode or file objects. Returns: count_vectorizer – An object of type CountVectorizer. Return type: <cornac.data.text.CountVectorizer>
-
fit_transform
(raw_documents: List[str]) -> (typing.List[typing.List[str]], <class 'scipy.sparse.csr.csr_matrix'>)[source]¶ Build the vocabulary and return the document-term matrix.
Parameters: raw_documents (List[str]) – Returns: - sequences: List[List[str]]
- Tokenized sequences of raw_documents.
- X: array, [n_samples, n_features]
- Document-term matrix.
Return type: (sequences, X)
-
transform
(raw_documents: List[str]) -> (typing.List[typing.List[str]], <class 'scipy.sparse.csr.csr_matrix'>)[source]¶ Transform documents to document-term matrix.
Parameters: raw_documents (List[str]) – Returns: - sequences: List[List[str]]
- Tokenized sequences of raw_documents.
- X: array, [n_samples, n_features]
- Document-term matrix.
Return type: (sequences, X)
-
class
cornac.data.text.
TextModality
(corpus: List[str] = None, ids: List = None, tokenizer: cornac.data.text.Tokenizer = None, vocab: cornac.data.text.Vocabulary = None, max_vocab: int = None, max_doc_freq: Union[float, int] = 1.0, min_doc_freq: int = 1, tfidf_params: Dict = None, **kwargs)[source]¶ Text modality
Parameters: - corpus (List[str], default = None) – List of user/item texts that the indices are aligned with ids.
- ids (List, default = None) – List of user/item ids that the indices are aligned with corpus. If None, the indices of provided corpus will be used as ids.
- tokenizer (Tokenizer, optional, default = None) – Tokenizer for text splitting. If None, the BaseTokenizer will be used.
- vocab (Vocabulary, optional, default = None) – Vocabulary of tokens. It contains mapping between tokens to their integer ids and vice versa.
- max_vocab (int, optional, default = None) – The maximum size of the vocabulary. If vocab is provided, this will be ignored.
- max_doc_freq (float in range [0.0, 1.0] or int, default=1.0) – When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the value represents a proportion of documents, int for absolute counts. If vocab is not None, this will be ignored.
- min_doc_freq (float in range [0.0, 1.0] or int, default=1) – When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the value represents a proportion of documents, int for absolute counts. If vocab is not None, this will be ignored.
- tfidf_params (dict or None, optional, default=None) – If None, the default arguments of <cornac.data.text.TfidfVectorizer> will be used. List of parameters:
  - 'binary' : boolean, default=False. If True, all non-zero counts are set to 1.
  - 'norm' : 'l1', 'l2' or None, optional, default='l2'. Each output row will have unit norm, either:
    * 'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
    * 'l1': Sum of absolute values of vector elements is 1.
    See utils.common.normalize().
  - 'use_idf' : boolean, default=True. Enable inverse-document-frequency reweighting.
  - 'smooth_idf' : boolean, default=True. Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
  - 'sublinear_tf' : boolean, default=False. Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
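How the tfidf_params options combine can be sketched as follows. This is a minimal illustration, not cornac's vectorizer: tfidf_rows is a hypothetical helper working on dense count rows, and the idf formula ln((n + s) / (df + s)) + 1 is assumed from scikit-learn's convention, which the description of smooth_idf resembles.

```python
import math
from typing import List, Optional


def tfidf_rows(counts: List[List[int]],
               binary: bool = False,
               norm: Optional[str] = "l2",
               use_idf: bool = True,
               smooth_idf: bool = True,
               sublinear_tf: bool = False) -> List[List[float]]:
    """Hypothetical helper: TF-IDF weights for a dense count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: number of rows with a non-zero count per term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothing acts like one extra document containing every term once.
    s = 1 if smooth_idf else 0
    # Assumed idf formula; without smoothing, df == 0 would divide by zero.
    idf = [math.log((n_docs + s) / (d + s)) + 1.0 if use_idf else 1.0 for d in df]
    out = []
    for row in counts:
        tf = [float(min(c, 1)) if binary else float(c) for c in row]
        if sublinear_tf:
            # Replace tf with 1 + log(tf) for non-zero entries.
            tf = [1.0 + math.log(v) if v > 0 else 0.0 for v in tf]
        vec = [t * w for t, w in zip(tf, idf)]
        if norm == "l2":
            scale = math.sqrt(sum(v * v for v in vec))
        elif norm == "l1":
            scale = sum(abs(v) for v in vec)
        else:
            scale = 1.0
        out.append([v / scale for v in vec] if scale > 0 else vec)
    return out


weights = tfidf_rows([[1, 0], [1, 2]])
```

After l2 normalization, the dot product of two rows equals their cosine similarity, which is why 'l2' is a convenient default for similarity-based models.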
-
batch_seq
(batch_ids, max_length=None)[source]¶ Return a numpy matrix of text sequences containing token ids with size=(len(batch_ids), max_length).
Parameters: - batch_ids (Union[List, numpy.array], required) – An array containing the ids of rows of text sequences to be returned.
- max_length (int, optional) – Cut-off length of returned sequences. If None, it will be inferred based on retrieved sequences.
Returns: batch_sequences – Batch of sequences with zero-padding at the end.
Return type: numpy.ndarray
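The padding and cut-off behavior can be sketched in plain Python. pad_sequences is a hypothetical helper, not part of cornac; it assumes that an omitted max_length is inferred as the length of the longest retrieved sequence, and it returns nested lists rather than a numpy.ndarray.

```python
from typing import List, Optional


def pad_sequences(seqs: List[List[int]],
                  max_length: Optional[int] = None) -> List[List[int]]:
    """Hypothetical helper: truncate to max_length and zero-pad at the end."""
    if max_length is None:
        # Assumption: infer the length from the longest sequence.
        max_length = max((len(s) for s in seqs), default=0)
    padded = []
    for s in seqs:
        cut = s[:max_length]
        padded.append(cut + [0] * (max_length - len(cut)))
    return padded


batch = pad_sequences([[3, 1, 4], [1]])
```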
-
batch_tfidf
(batch_ids, keep_sparse=False)[source]¶ Return matrix of TF-IDF features corresponding to provided batch_ids
Parameters: - batch_ids (array) – An array of ids to retrieve the corresponding features.
- keep_sparse (bool, default = False) – If True, the return feature matrix will be a scipy.sparse.csr_matrix. Otherwise, it will be a dense matrix.
Returns: batch_tfidf – Batch of TF-IDF representations corresponding to input batch_ids.
Return type: numpy.ndarray
-
build
(id_map=None, **kwargs)[source]¶ Build the model based on provided list of ordered ids
Parameters: id_map (dict, optional) – A dictionary that holds the mapping from original ids to mapped integer indices of users/items. Returns: text_modality – An object of type TextModality. Return type: <cornac.data.TextModality>
-
tfidf_matrix
¶ Return tf-idf matrix.
Image Modality¶
-
class
cornac.data.image.
ImageModality
(**kwargs)[source]¶ Image modality
Parameters: - images (Union[List, numpy.ndarray], optional) – A list or tensor of images that the row indices are aligned with user/item in ids.
- paths (List[str], optional) – A list of paths to images stored on disk, whose row indices are aligned with user/item in ids.
-
batch_image
(batch_ids, target_size=(256, 256), color_mode='rgb', interpolation='nearest')[source]¶ Return batch of images corresponding to provided batch_ids
Parameters: - batch_ids (Union[List, numpy.array], required) – An array containing the ids of rows of images to be returned.
- target_size (tuple, optional, default: (256, 256)) – Size (width, height) of returned images to be resized.
- color_mode (str, optional, default: 'rgb') – Color mode of returned images.
- interpolation (str, optional, default: 'nearest') – Method used for interpolation when resizing images. Options are OpenCV supported methods.
Returns: res – Batch of images corresponding to input batch_ids.
Return type: numpy.ndarray
-
build
(id_map=None, **kwargs)[source]¶ Build the model based on provided list of ordered ids
Parameters: id_map (dict, optional) – A dictionary that holds the mapping from original ids to mapped integer indices of users/items. Returns: image_modality – An object of type ImageModality. Return type: <cornac.data.ImageModality>
Sentiment Modality¶
-
class
cornac.data.sentiment.
SentimentModality
(**kwargs)[source]¶ Aspect module
Parameters: data (List[tuple], required) – A list of (user, item, sentiment information) triplets, where the sentiment information is itself a list of (aspect, opinion, sentiment) triplets, e.g., data=[('user1', 'item1', [('aspect1', 'opinion1', 'sentiment1')])].
-
build
(uid_map=None, iid_map=None, dok_matrix=None, **kwargs)[source]¶ Build the model based on provided list of ordered ids
-
num_aspects
¶ Return the number of aspects
-
num_opinions
¶ Return the number of opinions
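The nested triplet format expected for data can be illustrated with a short sketch. It does not use cornac; it assumes that the aspect and opinion counts refer to distinct values across all triplets, and the identifiers in the sample are placeholders.

```python
# Each entry: (user, item, [(aspect, opinion, sentiment), ...])
data = [
    ("user1", "item1", [("aspect1", "opinion1", "sentiment1")]),
    ("user2", "item1", [("aspect1", "opinion2", "sentiment2"),
                        ("aspect2", "opinion1", "sentiment1")]),
]

# Collect the distinct aspects and opinions over all (user, item) pairs.
aspects = {a for _, _, triplets in data for a, _, _ in triplets}
opinions = {o for _, _, triplets in data for _, o, _ in triplets}

num_aspects = len(aspects)
num_opinions = len(opinions)
```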
Reader¶
-
class
cornac.data.reader.
Reader
(user_set=None, item_set=None, min_user_freq=1, min_item_freq=1, bin_threshold=None, encoding='utf-8', errors=None)[source]¶ Reader class for reading data with different types of format.
Parameters: - user_set (set, default = None) – Set of users to be retained when reading data. If None, all users will be included.
- item_set (set, default = None) – Set of items to be retained when reading data. If None, all items will be included.
- min_user_freq (int, default = 1) – The minimum frequency of a user to be retained. If min_user_freq = 1, all users will be included.
- min_item_freq (int, default = 1) – The minimum frequency of an item to be retained. If min_item_freq = 1, all items will be included.
- bin_threshold (float, default = None) – The rating threshold to binarize rating values (turn explicit feedback to implicit feedback). For example, if bin_threshold = 3.0, all rating values >= 3.0 will be set to 1.0, and the rest (< 3.0) will be discarded.
- encoding (str, default = utf-8) – Encoding used to decode the file.
- errors (str, default = None) – Optional string that specifies how encoding errors are to be handled. Pass 'strict' to raise a ValueError exception if there is an encoding error (None has the same effect), or pass 'ignore' to ignore errors.
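The effect of bin_threshold and the minimum-frequency filters can be sketched on a list of (user, item, rating) tuples. filter_ratings is a hypothetical helper, not the Reader's code, and applying binarization before the frequency filters is an assumption about ordering.

```python
from collections import Counter
from typing import List, Optional, Tuple

Rating = Tuple[str, str, float]


def filter_ratings(tuples: List[Rating],
                   min_user_freq: int = 1,
                   min_item_freq: int = 1,
                   bin_threshold: Optional[float] = None) -> List[Rating]:
    """Hypothetical helper: binarize ratings, then drop rare users/items."""
    if bin_threshold is not None:
        # Ratings >= threshold become 1.0; the rest are discarded.
        tuples = [(u, i, 1.0) for u, i, r in tuples if r >= bin_threshold]
    user_freq = Counter(u for u, _, _ in tuples)
    item_freq = Counter(i for _, i, _ in tuples)
    return [(u, i, r) for u, i, r in tuples
            if user_freq[u] >= min_user_freq and item_freq[i] >= min_item_freq]


ratings = [("u1", "i1", 4.0), ("u1", "i2", 2.0), ("u2", "i1", 5.0)]
implicit = filter_ratings(ratings, bin_threshold=3.0)
```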
-
read
(fpath, fmt='UIR', sep='\t', skip_lines=0, id_inline=False, parser=None, **kwargs)[source]¶ Read data and parse line by line based on provided fmt or parser.
Parameters: - fpath (str) – Path to the data file.
- fmt (str, default: 'UIR') – Line format to be parsed (‘UIR’ or ‘UIRT’).
- sep (str, default: '\t') – The delimiter string.
- skip_lines (int, default: 0) – Number of first lines to skip
- id_inline (bool, default: False) – If True, user ids correspond to the line numbers of the file, and all the ids in each line are item ids.
- parser (function, default: None) – Function takes a list of str tokenized by sep and returns a list of tuples which will be joined to the final results. If None, parser will be determined based on fmt.
Returns: tuples – Data in the form of a list of tuples. What is inside each tuple depends on parser or fmt.
Return type:
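Line-by-line parsing of the 'UIR' format can be sketched as follows. parse_uir is a hypothetical stand-in for the built-in parser, operating on lines already read into memory; skipping lines with fewer than three fields is an assumption, not documented behavior.

```python
from typing import List, Tuple


def parse_uir(lines: List[str], sep: str = "\t",
              skip_lines: int = 0) -> List[Tuple[str, str, float]]:
    """Hypothetical helper: turn 'user<sep>item<sep>rating' lines into tuples."""
    tuples = []
    for line in lines[skip_lines:]:
        tokens = line.strip().split(sep)
        if len(tokens) < 3:
            continue  # assumption: silently skip malformed lines
        tuples.append((tokens[0], tokens[1], float(tokens[2])))
    return tuples


lines = ["user\titem\trating", "u1\ti1\t4.0", "u2\ti1\t3.5"]
uir = parse_uir(lines, skip_lines=1)  # skip the header row
```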
-
cornac.data.reader.
read_text
(fpath, sep=None, encoding='utf-8', errors=None)[source]¶ Read text file and return two lists of text documents and corresponding ids. If sep is None, return only one list whose elements are the lines of text in the original file.
Parameters: - fpath (str) – Path to the data file
- sep (str, default = None) – The delimiter string used to split id and text. Each line is assumed to contain an id followed by the corresponding text document. If None, each line will be a str in the returned list.
- encoding (str, default = utf-8) – Encoding used to decode the file.
- errors (str, default = None) – Optional string that specifies how encoding errors are to be handled. Pass 'strict' to raise a ValueError exception if there is an encoding error (None has the same effect), or pass 'ignore' to ignore errors.
Returns: texts, ids (optional) – Return list of text strings with corresponding indices (if sep is not None).
Return type:
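The two return shapes can be sketched on in-memory lines. split_text_lines is a hypothetical helper, not the library function; splitting only on the first occurrence of sep is an assumption that lets the text itself contain the delimiter.

```python
from typing import List, Optional, Tuple, Union


def split_text_lines(lines: List[str], sep: Optional[str] = None
                     ) -> Union[List[str], Tuple[List[str], List[str]]]:
    """Hypothetical helper: return texts, or (texts, ids) when sep is given."""
    if sep is None:
        return [line.rstrip("\n") for line in lines]
    texts, ids = [], []
    for line in lines:
        # Assumption: the id comes first and is split off only once.
        doc_id, text = line.rstrip("\n").split(sep, 1)
        ids.append(doc_id)
        texts.append(text)
    return texts, ids


texts, ids = split_text_lines(["i1\tgreat phone\n", "i2\tslow\tbut cheap\n"], sep="\t")
```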