RFoT Code Base#

Subpackages#

Submodules#

RFoT.RFoT module#

Tensor decomposition is a powerful unsupervised Machine Learning method that enables the modeling of multi-dimensional data, including malware data. We introduce a novel ensemble semi-supervised classification algorithm, named Random Forest of Tensors (RFoT), that utilizes tensor decomposition to extract the complex and multi-faceted latent patterns from data. Our hybrid model leverages the strength of multi-dimensional analysis combined with clustering to capture the sample groupings in the latent components, whose combinations distinguish malware and benign-ware. The patterns extracted from a given dataset with tensor decomposition depend on the configuration of the tensor, such as the dimension, entry, and rank selection. To capture the unique perspectives of different tensor configurations, we employ the “wisdom of crowds” philosophy and make use of decisions made by the majority of a randomly generated ensemble of tensors with varying dimensions, entries, and ranks.
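The majority decision described above can be illustrated with a minimal sketch. This is not the RFoT implementation: the function name majority_vote, the vote layout, and the choice to abstain on ties are assumptions made only for illustration. The value -1 follows the convention this documentation uses for unknown or abstaining samples.

```python
from collections import Counter

ABSTAIN = -1  # label this documentation uses for unknown / abstaining samples


def majority_vote(votes_per_sample):
    """Combine per-estimator votes into one label per sample.

    votes_per_sample[i] holds the labels that individual ensemble members
    proposed for sample i. ABSTAIN votes are ignored.
    Assumption (not from the RFoT docs): ties and empty vote lists abstain.
    """
    combined = []
    for votes in votes_per_sample:
        counts = Counter(v for v in votes if v != ABSTAIN)
        if not counts:
            combined.append(ABSTAIN)
            continue
        ranked = counts.most_common(2)
        # Abstain when the two most frequent labels received equal support.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            combined.append(ABSTAIN)
        else:
            combined.append(ranked[0][0])
    return combined


votes = [
    [1, 1, 0],        # two members vote 1, one votes 0
    [ABSTAIN, 0, 0],  # one member abstains, two vote 0
    [],               # no member produced a vote for this sample
    [0, 1],           # tie between the two classes
]
print(majority_vote(votes))
```

With more ensemble members, a sample is more likely to receive at least one vote, which is one reason an ensemble of varied tensor configurations can reduce abstentions.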

As the tensor decomposition backend, RFoT offers two CPD algorithms. First, the RFoT package includes a Python implementation of the CP-ALS algorithm that was originally introduced in the MATLAB Tensor Toolbox [BK06, BK08, BK+15]. The CP-ALS backend can also be used to decompose each random tensor in parallel. Second, RFoT can be used with the Python implementation of the CP-APR algorithm with GPU capability [ErenMooreSkau+22]. The CP-APR backend allows decomposing each random tensor configuration both in an embarrassingly parallel fashion on a single GPU and in multi-GPU parallel execution.

class RFoT.RFoT.RFoT(max_depth=1, min_rank=2, max_rank=20, min_dimensions=3, max_dimensions=3, min_cluster_search=2, max_cluster_search=12, component_purity_tol=-1, cluster_purity_tol=0.9, n_estimators=80, rank='random', clustering='ms', decomp='cp_als', zero_tol=1e-08, dont_bin=[], bin_scale=1.0, bin_entry=False, bin_max_map={'bin': 1000, 'max': 1000000}, tol=0.0001, n_iters=50, verbose=True, decomp_verbose=False, fixsigns=True, random_state=42, n_jobs=1, n_gpus=1, gpu_id=0)[source]

Bases: object

Initialize the RFoT.RFoT class.

Parameters

max_depth (int, optional) --
Maximum number of times to run RFoT. The default is 1.
Note
- If max_depth=1, data is fit with RFoT once.
- Otherwise, when max_depth is more than 1, each subsequent fit of the data with RFoT works on the abstaining predictions from the prior fit.
min_rank (int, optional) --
Minimum tensor rank R to be randomly sampled. The default is 2.
Note
- Should be more than 1. min_rank should be less than max_rank.
- Only used when rank="random".

max_rank (int, optional) --

Maximum tensor rank R to be randomly sampled. The default is 20.

Note

max_rank should be more than min_rank.
Only used when rank="random".

min_dimensions (int, optional) --

When randomly sampling tensor configurations, minimum number of dimensions a tensor should have within the ensemble of random tensor configurations. The default is 3.

max_dimensions (int, optional) --

When randomly sampling tensor configurations, maximum number of dimensions a tensor should have within the ensemble of random tensor configurations. The default is 3.

min_cluster_search (int, optional) --

When searching for the number of clusters via likelihood in GMM, minimum number of clusters to try. The default is 2.

max_cluster_search (int, optional) --

When searching for the number of clusters via likelihood in GMM, maximum number of clusters to try. The default is 12.

component_purity_tol (float or int, optional) --

The purity score threshold for the latent factors. The default is -1.

This threshold is calculated based on the known instances in the component.

If the purity score of the latent factor is lower than the threshold component_purity_tol, the component is discarded and is not used in obtaining clusters.

Note

By default component_purity_tol=-1.
When component_purity_tol=-1, component uniformity is not used in deciding whether to discard the components, and only cluster_purity_tol is used.
Either component_purity_tol or cluster_purity_tol must be more than 0.

cluster_purity_tol (float, optional) --

The purity score threshold for the clusters. The default is 0.9. This threshold is calculated based on the known instances in the cluster.

If the purity score of the cluster is lower than the threshold cluster_purity_tol, the cluster is discarded and is not used in the semi-supervised class voting of the unknown samples in the same cluster.

Note

When cluster_purity_tol=-1, cluster uniformity is not used in deciding whether to discard the clusters, and only component_purity_tol is used.
Either component_purity_tol or cluster_purity_tol must be more than 0.

n_estimators (int, optional) --

Number of random tensor configurations in the ensemble. The default is 80.

Caution

Depending on the hyper-parameter configuration and the number of features in the dataset, the ensemble may contain fewer random tensor configurations than the number specified in n_estimators.

rank (int or string, optional) --

Method for assigning the rank of each random tensor to be decomposed. The default is "random".

When rank="random", the rank for decomposition is sampled randomly from the range (min_rank, max_rank).

All the tensors in the ensemble can also be decomposed with the same rank (example: rank=2).

clustering (string, optional) --

Clustering method to be used for capturing the patterns from the latent factors. The default is "ms".

Options

clustering="ms" (Mean Shift)
clustering="component"
clustering="gmm" (Gaussian Mixture Model)

decomp (string, optional) --

Tensor decomposition backend/algorithm to be used. The default is "cp_als".

Options

decomp="cp_als" (Alternating least squares for CANDECOMP/PARAFAC Decomposition)
decomp="cp_apr" (CANDECOMP/PARAFAC Alternating Poisson Regression)
decomp="cp_apr_gpu" (CP-APR with GPU)
decomp="debug"

Note

GPU is used when decomp="cp_apr_gpu".
decomp="debug" allows serial computation where any error or warning would be raised to the user level.

zero_tol (float, optional) --

Samples that are close to zero, where closeness is defined by zero_tol, are removed from the latent factor. The default is 1e-08.

dont_bin (list, optional) --

List of column (feature) indices whose values should not be binned. The default is list().

bin_scale (float, optional) --

When using a given column (feature) as a tensor dimension, the feature values are binned to create feature value to tensor dimension mapping. This allows a feature value to be represented by an index in the tensor dimension for that feature. The default is 1.0.

When bin_scale=1.0, the size of the dimension that represents the given feature will be equal to the number of unique values in that column (feature).

See also

See Pandas Cut for value binning.

bin_entry (bool, optional) --

If bin_entry=True, the features that are used as tensor entry are also binned. The default is False.

bin_max_map (dict, optional) --

bin_max_map prevents any dimension of any tensor in the ensemble from becoming too large. The default is bin_max_map={"max": 10 ** 6, "bin": 10 ** 3}.

Specifically, bin_max_map["bin"] is used to determine the size of the dimension when:

\(bin\_scale \cdot |f_i| > bin\_max\_map["max"]\)
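The rule above can be sketched as a small helper, reading \(|f_i|\) as the number of unique values in feature \(i\). The function name dimension_size is hypothetical, and the truncation to an integer with a floor of 1 in the uncapped branch is an assumption; only the capped branch and the bin_scale=1.0 behaviour come from the parameter descriptions here.

```python
def dimension_size(n_unique, bin_scale=1.0, bin_max_map=None):
    """Return an illustrative tensor dimension size for one feature.

    n_unique is the number of unique values in the feature column.
    """
    if bin_max_map is None:
        bin_max_map = {"max": 10 ** 6, "bin": 10 ** 3}

    scaled = bin_scale * n_unique
    if scaled > bin_max_map["max"]:
        # Documented rule: fall back to the fixed number of bins.
        return bin_max_map["bin"]
    # Assumption: the scaled size is truncated to an integer, at least 1.
    return max(1, int(scaled))


print(dimension_size(150))                 # bin_scale=1.0 keeps all unique values
print(dimension_size(200, bin_scale=0.5))  # half as many bins as unique values
print(dimension_size(2_000_000))           # exceeds "max", so "bin" is used
```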

tol (float, optional) --

CP-ALS hyper-parameter. The default is 1e-4.

n_iters (int, optional) --

Maximum number of iterations (epoch) to run the tensor decomposition algorithm. The default is 50.

verbose (bool, optional) --

If verbose=True, progress of the method is displayed. The default is True.

decomp_verbose (bool, optional) --

If decomp_verbose=True, progress of the tensor decomposition backend is displayed for each random tensor. The default is False.

fixsigns (bool, optional) --

CP-ALS hyper-parameter. The default is True.

random_state (int, optional) --

Random seed. The default is 42.

n_jobs (int, optional) --

Number of parallel tensor decompositions to perform when decomposing the random tensors from the ensemble. The default is 1.

n_gpus (int, optional) --

Number of GPUs. The default is 1.

Note

Only used when decomp="cp_apr_gpu".
When n_gpus is more than 1, and when n_jobs is more than one, multi-GPU parallel execution is performed. For example, n_gpus=2 and n_jobs=2 will use 2 GPUs, and 1 job will be run on each GPU in parallel.

gpu_id (int, optional) --

GPU device ID when using GPU. The default is 0.

Note

Only used when decomp="cp_apr_gpu".
Not considered when n_gpus is more than 1.

Raises: Exception -- Invalid parameter selection.
Return type: None.
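The purity thresholds component_purity_tol and cluster_purity_tol can be illustrated with a short sketch. It assumes the common definition of purity as the share of the majority class among the known (non -1) labels; the exact formula used inside RFoT is not stated here, so the helpers purity and keep_group are hypothetical, as is the choice to score a group with no known labels as 0.0.

```python
from collections import Counter

UNKNOWN = -1  # label used for unknown samples


def purity(labels):
    """Fraction of known labels that belong to the majority class."""
    known = [label for label in labels if label != UNKNOWN]
    if not known:
        return 0.0  # assumption: nothing known means nothing to trust
    majority_count = Counter(known).most_common(1)[0][1]
    return majority_count / len(known)


def keep_group(labels, purity_tol=0.9):
    """Keep a cluster or component unless its purity is below purity_tol.

    purity_tol=-1 disables the check, mirroring the Notes above.
    """
    if purity_tol == -1:
        return True
    return purity(labels) >= purity_tol


mixed = [1, 1, 1, 0, UNKNOWN, UNKNOWN]  # 3 of 4 known labels agree
clean = [1, 1, 1, 1, UNKNOWN]           # all known labels agree

print(purity(mixed), keep_group(mixed))
print(purity(clean), keep_group(clean))
```

Under this reading, only the unknown samples in groups that pass the threshold would take part in class voting.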

get_params()[source]

Returns the parameters of the RFoT object.

Returns: Parameters and data stored in the RFoT object.
Return type: dict

predict(X: numpy.array, y: numpy.ndarray)[source]

Semi-supervised prediction of the unknown samples (with labels -1) based on the known samples.

Important

Use -1 for the unknown samples.
In the returned y_pred, samples with -1 predictions are said to be abstaining predictions (i.e. the model says "we do not know what the label for that sample is").
Returned y_pred includes both known and unknown samples, where the labels of unknown samples may have changed from the original y.

Example Usage

from RFoT import RFoT
from sklearn import datasets
from sklearn.metrics import f1_score
import numpy as np

# load the dataset
iris = datasets.load_iris()
X = iris["data"]
y = (iris["target"] == 2).astype(int)

y_true = y.copy()
y_experiment = y_true.copy()

# label roughly 30% of the samples as unknown
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(y_experiment.shape[0]) < 0.3
y_experiment[random_unlabeled_points] = -1

# predict with RFoT
model = RFoT(
        bin_scale=1,
        max_dimensions=3,
        component_purity_tol=1.0,
        min_rank=2,
        max_rank=3,
        n_estimators=50,
        bin_entry=True,
        clustering="ms",
        max_depth=2,
        n_jobs=50,
)
y_pred = model.predict(X, y_experiment)

# look at results
unknown_indices = np.argwhere(y_experiment == -1).flatten()
did_predict_indices = np.argwhere(y_pred[unknown_indices] != -1).flatten()
abstaining_count = len(np.argwhere(y_pred == -1))
f1 = f1_score(
    y_true[unknown_indices][did_predict_indices],
    y_pred[unknown_indices][did_predict_indices],
    average="weighted",
)

print("------------------------")
print("Num. of Abstaining", abstaining_count)
print("Percent Abstaining", (abstaining_count / len(unknown_indices)) * 100, "%")
print("F1=", f1)

Example Usage

# y is the vector of known and unknown labels passed to RFoT
# y_pred is the prediction returned by RFoT
# y_true is the ground truth

import numpy as np
from sklearn.metrics import f1_score

unknown_indices = np.argwhere(y == -1).flatten()
did_predict_indices = np.argwhere(y_pred[unknown_indices] != -1).flatten()
abstaining_count = len(np.argwhere(y_pred == -1))

f1 = f1_score(
    y_true[unknown_indices][did_predict_indices],
    y_pred[unknown_indices][did_predict_indices],
    average="weighted",
)

print("Num. of Abstaining", abstaining_count)
print("Percent Abstaining", (abstaining_count / len(unknown_indices)) * 100, "%")
print("F1=", f1)

Parameters

X (np.array) -- Features matrix X where columns are the m features and rows are the n samples.
y (np.ndarray) -- Vector of size n with the label for each sample. Unknown samples have the labels -1.

Returns

y_pred -- Predictions made over the original y. Known samples are kept as is. Unknown samples that are no longer labeled as -1 received a prediction. Samples that are still -1 are the abstaining predictions.

Return type

np.ndarray
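The abstaining predictions in y_pred are what max_depth acts on: with max_depth greater than 1, each later fit works on the samples that are still -1. The loop below is only a sketch of that idea. The names fit_repeatedly and toy_fit_once are hypothetical, and toy_fit_once is a stand-in that resolves one abstaining sample per pass; it does not perform tensor decomposition or call RFoT.

```python
ABSTAIN = -1  # label for unknown / abstaining samples


def fit_repeatedly(y, fit_once, max_depth=1):
    """Run fit_once up to max_depth times, feeding predictions forward."""
    y_pred = list(y)
    for _ in range(max_depth):
        if ABSTAIN not in y_pred:
            break  # nothing left to predict
        y_pred = fit_once(y_pred)
    return y_pred


def toy_fit_once(labels):
    """Stand-in for a single pass: label only the first abstaining sample
    by copying the label of the sample before it (0 if it is the first)."""
    out = list(labels)
    for i, label in enumerate(out):
        if label == ABSTAIN:
            out[i] = out[i - 1] if i > 0 else 0
            break
    return out


y = [1, ABSTAIN, 0, ABSTAIN]
print(fit_repeatedly(y, toy_fit_once, max_depth=1))  # one sample still abstains
print(fit_repeatedly(y, toy_fit_once, max_depth=2))  # second pass resolves it
```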

set_params(**parameters)[source]

Sets the parameters of the RFoT object.

Parameters: **parameters (dict) -- Dictionary of parameters where keys are the variable names.
Returns: RFoT object.
Return type: object

RFoT.version module#

RFoT version.

RFoT v0.0.1 documentation
