RFoT Code Base
Subpackages#
Submodules#
RFoT.RFoT module#
Tensor decomposition is a powerful unsupervised machine learning method that enables the modeling of multi-dimensional data, including malware data. We introduce a novel ensemble semi-supervised classification algorithm, named Random Forest of Tensors (RFoT), that utilizes tensor decomposition to extract complex and multi-faceted latent patterns from data. Our hybrid model combines multi-dimensional analysis with clustering to capture the sample groupings in the latent components, whose combinations distinguish malware from benign-ware. The patterns that tensor decomposition extracts from a given dataset depend on the configuration of the tensor, such as the selection of dimensions, entries, and rank. To capture the unique perspectives of different tensor configurations, we employ the “wisdom of crowds” philosophy and use the decisions made by the majority of a randomly generated ensemble of tensors with varying dimensions, entries, and ranks.
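The majority decision described above can be illustrated with a minimal sketch. It is not the RFoT implementation: the function name `majority_vote`, the input layout (one list of votes per sample, one vote per ensemble member), and the rule that ties lead to abstention are all assumptions made for illustration.

```python
from collections import Counter

ABSTAIN = -1  # label used for "unknown" / abstaining, as in the RFoT docs


def majority_vote(votes_per_sample):
    """Combine per-sample votes from an ensemble into one prediction each.

    votes_per_sample: list of lists; each inner list holds the class votes
    that the ensemble members cast for one sample. Votes equal to -1 are
    ignored. Ties and samples without votes abstain (an assumption).
    """
    predictions = []
    for votes in votes_per_sample:
        counts = Counter(v for v in votes if v != ABSTAIN)
        if not counts:
            predictions.append(ABSTAIN)  # no member voted for this sample
            continue
        ranked = counts.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            predictions.append(ABSTAIN)  # tie between the top classes
        else:
            predictions.append(ranked[0][0])
    return predictions


votes = [[1, 1, 0], [-1, -1, -1], [0, 1], [0, 0, -1, 1]]
y_pred = majority_vote(votes)
print(y_pred)  # [1, -1, -1, 0]
```

Abstaining on ties keeps the sketch conservative; a real ensemble could break ties differently.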
As the tensor decomposition backend, RFoT offers two CPD algorithms. First, the RFoT package includes a Python implementation of the CP-ALS algorithm that was originally introduced in the MATLAB Tensor Toolbox [BK06, BK08, BK+15]. The CP-ALS backend can also decompose the random tensors in parallel. Second, RFoT can be used with the Python implementation of the CP-APR algorithm with GPU capability [ErenMooreSkau+22]. The CP-APR backend allows decomposing the random tensor configurations both in an embarrassingly parallel fashion on a single GPU and in multi-GPU parallel execution.
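For reference, both backends fit a CP (CANDECOMP/PARAFAC) model, which represents a \(d\)-dimensional tensor as a sum of \(R\) rank-one components. The notation below is the standard textbook form, not taken from the RFoT source; \(R\) corresponds to the tensor rank controlled by the rank, min_rank, and max_rank parameters:

\[ \mathcal{X} \approx \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r^{(1)} \circ \mathbf{a}_r^{(2)} \circ \cdots \circ \mathbf{a}_r^{(d)} \]

Here \(\circ\) is the vector outer product, \(\lambda_r\) is the weight of component \(r\), and \(\mathbf{a}_r^{(k)}\) is the \(r\)-th latent factor of dimension \(k\).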
- class RFoT.RFoT.RFoT(max_depth=1, min_rank=2, max_rank=20, min_dimensions=3, max_dimensions=3, min_cluster_search=2, max_cluster_search=12, component_purity_tol=-1, cluster_purity_tol=0.9, n_estimators=80, rank='random', clustering='ms', decomp='cp_als', zero_tol=1e-08, dont_bin=[], bin_scale=1.0, bin_entry=False, bin_max_map={'bin': 1000, 'max': 1000000}, tol=0.0001, n_iters=50, verbose=True, decomp_verbose=False, fixsigns=True, random_state=42, n_jobs=1, n_gpus=1, gpu_id=0)[source]
Bases:
object
Initialize the RFoT.RFoT class.
- Parameters
max_depth (int, optional) --
Maximum number of times to run RFoT. The default is 1.
Note
If max_depth=1, data is fit with RFoT once. Otherwise, when max_depth is more than 1, each subsequent fit of the data with RFoT works on the abstaining predictions from the prior fit.
min_rank (int, optional) --
Minimum tensor rank R to be randomly sampled. The default is 2.
Note
Should be more than 1. min_rank should be less than max_rank. Only used when rank="random".
max_rank (int, optional) --
Maximum tensor rank R to be randomly sampled. The default is 20.
Note
max_rank should be more than min_rank. Only used when rank="random".
min_dimensions (int, optional) --
When randomly sampling tensor configurations, minimum number of dimensions a tensor should have within the ensemble of random tensor configurations. The default is 3.
max_dimensions (int, optional) --
When randomly sampling tensor configurations, maximum number of dimensions a tensor should have within the ensemble of random tensor configurations. The default is 3.
min_cluster_search (int, optional) --
When searching for the number of clusters via likelihood in GMM, minimum number of clusters to try. The default is 2.
max_cluster_search (int, optional) --
When searching for the number of clusters via likelihood in GMM, maximum number of clusters to try. The default is 12.
component_purity_tol (float or int, optional) --
The purity score threshold for the latent factors. The default is -1.
The purity score is calculated based on the known instances in the component.
If the purity score of the latent factor is lower than the threshold component_purity_tol, the component is discarded and is not used in obtaining clusters.
Note
When component_purity_tol=-1 (the default), component uniformity is not used in deciding whether to discard the components, and only cluster_purity_tol is used. Either component_purity_tol or cluster_purity_tol must be more than 0.
cluster_purity_tol (float, optional) --
The purity score threshold for the clusters. The default is 0.9. The purity score is calculated based on the known instances in the cluster.
If the purity score of the cluster is lower than the threshold cluster_purity_tol, the cluster is discarded and is not used in the semi-supervised class voting of the unknown samples in the same cluster.
Note
When cluster_purity_tol=-1, cluster uniformity is not used in deciding whether to discard the clusters, and only component_purity_tol is used. Either component_purity_tol or cluster_purity_tol must be more than 0.
n_estimators (int, optional) --
Number of random tensor configurations in the ensemble. The default is 80.
Caution
Depending on the hyper-parameter configuration and the number of features in the dataset, the ensemble may contain fewer random tensor configurations than specified in n_estimators.
rank (int or string, optional) --
Method for assigning rank for each random tensor to be decomposed. The default is "random".
When rank="random", the rank for decomposition is sampled randomly from the range (min_rank, max_rank). All the tensors in the ensemble can also be decomposed with the same rank (example: rank=2).
clustering (string, optional) --
Clustering method to be used for capturing the patterns from the latent factors. The default is "ms".
decomp (string, optional) --
Tensor decomposition backend/algorithm to be used. The default is "cp_als".
Note
GPU is used when decomp="cp_apr_gpu". decomp="debug" allows serial computation where any error or warning would be raised to the user level.
zero_tol (float, optional) --
Samples whose values are close to zero, where closeness is defined by zero_tol, are removed from the latent factor. The default is 1e-08.
dont_bin (list, optional) --
List of column (feature) indices whose values should not be binned. The default is list().
bin_scale (float, optional) --
When using a given column (feature) as a tensor dimension, the feature values are binned to create feature value to tensor dimension mapping. This allows a feature value to be represented by an index in the tensor dimension for that feature. The default is 1.0.
When bin_scale=1.0, the size of the dimension that represents the given feature will be equal to the number of unique values in that column (feature).
See also
See Pandas Cut for value binning.
bin_entry (bool, optional) --
If bin_entry=True, the features that are used as tensor entry are also binned. The default is False.
bin_max_map (dict, optional) --
bin_max_map prevents any dimension of any of the tensors in the ensemble from becoming too large. The default is bin_max_map={"max": 10 ** 6, "bin": 10 ** 3}.
Specifically, bin_max_map["bin"] is used to determine the size of the dimension when: \(bin\_scale \cdot |f_i| > bin\_max\_map["max"]\)
tol (float, optional) --
CP-ALS hyper-parameter. The default is 1e-4.
n_iters (int, optional) --
Maximum number of iterations (epochs) to run the tensor decomposition algorithm. The default is 50.
verbose (bool, optional) --
If verbose=True, progress of the method is displayed. The default is True.
decomp_verbose (bool, optional) --
If decomp_verbose=True, progress of the tensor decomposition backend is displayed for each random tensor. The default is False.
fixsigns (bool, optional) --
CP-ALS hyper-parameter. The default is True.
random_state (int, optional) --
Random seed. The default is 42.
n_jobs (int, optional) --
Number of parallel tensor decompositions to perform when decomposing the random tensors from the ensemble. The default is 1.
n_gpus (int, optional) --
Number of GPUs. The default is 1.
Note
Only used when decomp="cp_apr_gpu". When n_gpus is more than 1 and n_jobs is more than 1, multi-GPU parallel execution is performed. For example, n_gpus=2 and n_jobs=2 will use 2 GPUs, and 1 job will be run on each GPU in parallel.
gpu_id (int, optional) --
GPU device ID when using GPU. The default is 0.
Note
Only used when decomp="cp_apr_gpu". Not considered when n_gpus is more than 1.
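The interaction between bin_scale and bin_max_map described above can be sketched as follows. This is an illustration under assumptions, not RFoT's code: the helper name `dimension_size` is hypothetical, \(|f_i|\) is taken to be the number of unique values of feature \(i\), and the dimension size below the cap is assumed to be the scaled count truncated to an integer (the documentation only states the result for bin_scale=1.0).

```python
def dimension_size(n_unique, bin_scale=1.0, bin_max_map=None):
    """Return the assumed size of the tensor dimension for one feature.

    n_unique: number of unique values in the feature column, |f_i|.
    bin_scale: scaling factor applied to |f_i|.
    bin_max_map: dict with keys "max" (cap trigger) and "bin" (capped size).
    """
    if bin_max_map is None:
        bin_max_map = {"bin": 1000, "max": 1000000}  # documented default

    scaled = bin_scale * n_unique
    if scaled > bin_max_map["max"]:
        # Dimension would be too large: fall back to the capped bin count.
        return bin_max_map["bin"]
    # Assumption: otherwise the size is the scaled count, at least 1.
    return max(1, int(scaled))


print(dimension_size(150))             # 150
print(dimension_size(5_000_000))       # 1000
print(dimension_size(200, bin_scale=0.5))  # 100
```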
- Raises
Exception -- Invalid parameter selection.
- Return type
None.
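To make the role of cluster_purity_tol more concrete, here is a minimal sketch of a purity check followed by a class vote within one cluster. The exact purity formula used by RFoT is not given in this page, so the definition below (fraction of known samples that belong to the most common known class) is an assumption, and the names `purity` and `vote_for_cluster` are hypothetical.

```python
from collections import Counter


def purity(known_labels):
    """Fraction of known labels in the most common class (assumed definition)."""
    if not known_labels:
        return None  # purity is undefined without known samples
    counts = Counter(known_labels)
    return counts.most_common(1)[0][1] / len(known_labels)


def vote_for_cluster(labels, cluster_purity_tol=0.9):
    """Return the class voted for the unknown samples, or None if discarded.

    labels: labels of all samples in one cluster, with -1 for unknown samples.
    """
    known = [label for label in labels if label != -1]
    score = purity(known)
    if score is None or score < cluster_purity_tol:
        return None  # cluster is discarded and casts no vote
    return Counter(known).most_common(1)[0][0]


print(vote_for_cluster([1, 1, 1, -1, -1]))                   # 1
print(vote_for_cluster([1, 0, 1, -1]))                       # None
print(vote_for_cluster([1, 0, 1, -1], cluster_purity_tol=0.6))  # 1
```

In this sketch a cluster without any known samples never votes, since there is nothing to measure purity against.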
- get_params()[source]
Returns the parameters of the RFoT object.
- Returns
Parameters and data stored in the RFoT object.
- Return type
dict
- predict(X: numpy.array, y: numpy.ndarray)[source]
Semi-supervised prediction of the unknown samples (with labels -1) based on the known samples.
Important
Use -1 for the unknown samples.
In the returned y_pred, samples with -1 predictions are said to be abstaining predictions (i.e., the model says "we do not know what the label for that sample is"). The returned y_pred includes both known and unknown samples, where the labels of unknown samples may have changed from the original y.
Example Usage

    from RFoT import RFoT
    from sklearn import datasets
    from sklearn.metrics import f1_score
    import numpy as np

    # load the dataset
    iris = datasets.load_iris()
    X = iris["data"]
    # np.int was removed in NumPy 1.24; the builtin int is used instead
    y = (iris["target"] == 2).astype(int)
    y_true = y.copy()
    y_experiment = y_true.copy()

    # label roughly 30% of the samples as unknown
    rng = np.random.RandomState(42)
    random_unlabeled_points = rng.rand(y_experiment.shape[0]) < 0.3
    y_experiment[random_unlabeled_points] = -1

    # predict with RFoT
    model = RFoT(
        bin_scale=1,
        max_dimensions=3,
        component_purity_tol=1.0,
        min_rank=2,
        max_rank=3,
        n_estimators=50,
        bin_entry=True,
        clustering="ms",
        max_depth=2,
        n_jobs=50,
    )
    y_pred = model.predict(X, y_experiment)

    # look at results
    unknown_indices = np.argwhere(y_experiment == -1).flatten()
    did_predict_indices = np.argwhere(y_pred[unknown_indices] != -1).flatten()
    abstaining_count = len(np.argwhere(y_pred == -1))
    f1 = f1_score(
        y_true[unknown_indices][did_predict_indices],
        y_pred[unknown_indices][did_predict_indices],
        average="weighted",
    )

    print("------------------------")
    print("Num. of Abstaining", abstaining_count)
    print("Percent Abstaining", (abstaining_count / len(unknown_indices)) * 100, "%")
    print("F1=", f1)
Example Usage

    # y is the vector of known and unknown labels passed to RFoT
    # y_pred is the prediction returned by RFoT
    # y_true is the ground truth
    import numpy as np
    from sklearn.metrics import f1_score

    unknown_indices = np.argwhere(y == -1).flatten()
    did_predict_indices = np.argwhere(y_pred[unknown_indices] != -1).flatten()
    abstaining_count = len(np.argwhere(y_pred == -1))
    f1 = f1_score(
        y_true[unknown_indices][did_predict_indices],
        y_pred[unknown_indices][did_predict_indices],
        average="weighted",
    )

    print("Num. of Abstaining", abstaining_count)
    print("Percent Abstaining", (abstaining_count / len(unknown_indices)) * 100, "%")
    print("F1=", f1)
- Parameters
X (np.array) -- Features matrix X where columns are the m features and rows are the n samples.
y (np.ndarray) -- Vector of size n with the label for each sample. Unknown samples have the labels -1.
- Returns
y_pred -- Predictions made over the original y. Known samples are kept as is. Unknown samples that are no longer labeled as -1 received a prediction. Samples that are still -1 are the abstaining predictions.
- Return type
np.ndarray
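The max_depth behavior, where each additional fit works on the samples that are still -1 after the prior fit, can be pictured with the sketch below. It is a conceptual illustration only: `fit_with_depth` and `toy_fit_once` are hypothetical names, the toy single-pass rule stands in for a full RFoT fit, and treating earlier predictions as known labels in later passes is an assumption.

```python
def fit_with_depth(y, fit_once, max_depth=1):
    """Repeat a single-pass labeling step on the remaining abstentions.

    y: list of labels with -1 for unknown samples.
    fit_once: callable that takes labels and returns new labels
              (a stand-in for one full fit).
    """
    y_current = list(y)
    for _ in range(max_depth):
        if -1 not in y_current:
            break  # nothing left to predict
        y_next = fit_once(y_current)
        # Keep every label that is already set; only fill in abstentions.
        y_current = [
            new if old == -1 else old for old, new in zip(y_current, y_next)
        ]
    return y_current


def toy_fit_once(y):
    """Toy stand-in: copy the left neighbor's label if it is known."""
    out = list(y)
    for i in range(1, len(y)):
        if y[i] == -1 and y[i - 1] != -1:
            out[i] = y[i - 1]
    return out


y = [1, -1, -1, 0, -1]
print(fit_with_depth(y, toy_fit_once, max_depth=1))  # [1, 1, -1, 0, 0]
print(fit_with_depth(y, toy_fit_once, max_depth=2))  # [1, 1, 1, 0, 0]
```

With max_depth=1 one sample still abstains; the second pass resolves it because the first pass produced a new neighbor label to build on.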
- set_params(**parameters)[source]
Sets the parameters of the RFoT object.
- Parameters
**parameters (dict) -- Dictionary of parameters where keys are the variable names.
- Returns
RFoT object.
- Return type
object
RFoT.version module#
RFoT version.