Binary Classification with Splines

 

Background

Problem 1. Spline approximation for one factor (one independent variable)

Simplified Problem Statement

Mathematical Problem Statement

Problem dimension and solving time

Solution in Run-File Environment

Solution in MATLAB Environment

Problem 2. Sum-of-splines approximation for a set of factors. A spline is built for each factor, and their sum best fits the dependent variable

Simplified Problem Statement

Mathematical Problem Statement

Problem dimension and solving time

Solution in Run-File Environment

Solution in MATLAB Environment

Problem 3. Sum of splines approximation with 4-fold Cross Validation

Simplified Problem Statement

Mathematical Problem Statement

Problem dimension and solving time

Solution in Run-File Environment

Solution in MATLAB Environment

 

Background

This case study demonstrates a binary classifier based on approximating multidimensional data (with several independent variables) by a sum of splines, using the PSG function spline_sum.

The PSG function Maximum Likelihood for Logistic Regression, logexp_sum, is maximized to find the spline variables that provide the best approximation of the data in the case of one factor (see Problem 1) and of a set of factors (see Problem 2). An estimated spline may "overfit" the in-sample data, which may result in poor out-of-sample performance. Cross-validation is used to check for overfitting (see Problem 3). To prepare data for cross-validation we use the PSG matrix operation Crossvalidation(N,Matrix), which splits the input Matrix of Scenarios into N pairs of complementary sub-matrices.
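For orientation, the model behind this setup can be sketched in standard logistic-regression notation. This is a generic textbook form, not the PSG formulation itself: the symbols s_i, x^j, y_j, I and J are introduced here for illustration, labels are assumed to be coded as 0/1, and the 1/J normalization is an assumption about how the objective is scaled.

```latex
% Score: sum of one-dimensional splines, one per factor (I factors)
\[
s(x) = \sum_{i=1}^{I} s_i(x_i), \qquad
P(y = 1 \mid x) = \frac{1}{1 + e^{-s(x)}}
\]
% Average log-likelihood over J scenarios (x^j, y_j), with y_j in {0, 1}
\[
\frac{1}{J} \sum_{j=1}^{J} \Bigl[\, y_j \, s(x^j) - \ln\bigl(1 + e^{s(x^j)}\bigr) \Bigr]
\]
```

Maximizing the second expression over the spline coefficients gives the fitted classifier; an observation is assigned to class 1 when the estimated probability exceeds a chosen threshold such as 0.5.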

 

Problem 1

Spline approximation for one factor (one independent variable).

 

Simplified Problem Statement

 

Maximize logexp_sum(spline_sum) (maximize Logarithms Exponents Sum applied to Spline Sum)

 Calculate:

logexp_sum(spline_sum) (function Logarithms Exponents Sum applied to Spline Sum)

L(spline_sum) (function L applied to Spline Sum)

 

where

logexp_sum = Logarithms Exponents Sum

spline_sum = Spline Sum calculates spline values depending upon regression variables for every scenario

L = Linear Loss for Spline Sum
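The one-factor objective can be illustrated with a minimal Python sketch. It does not call PSG: the helper names (spline_value, mean_log_likelihood), the piecewise-linear truncated-power spline, the 0/1 label coding, and the averaging over scenarios are all assumptions made for illustration, and the data are made up.

```python
import math

def spline_value(x, knots, coefs):
    """Piecewise-linear spline in truncated-power form (illustrative choice):
    s(x) = coefs[0] + coefs[1]*x + sum_k coefs[k+2] * max(x - knots[k], 0)."""
    value = coefs[0] + coefs[1] * x
    for knot, c in zip(knots, coefs[2:]):
        value += c * max(x - knot, 0.0)
    return value

def log1p_exp(s):
    """Numerically stable ln(1 + exp(s))."""
    return max(s, 0.0) + math.log1p(math.exp(-abs(s)))

def mean_log_likelihood(xs, ys, knots, coefs):
    """Average logistic log-likelihood, labels y in {0, 1}:
    (1/J) * sum_j [ y_j * s_j - ln(1 + exp(s_j)) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        s = spline_value(x, knots, coefs)
        total += y * s - log1p_exp(s)
    return total / len(xs)

# Toy data: one factor, four scenarios (hypothetical values).
xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 0, 1, 1]
knots = [0.5]

# All-zero coefficients predict probability 0.5 for every scenario.
zero_fit = mean_log_likelihood(xs, ys, knots, [0.0, 0.0, 0.0])
# A spline that rises after the knot separates the classes better.
better_fit = mean_log_likelihood(xs, ys, knots, [-2.0, 0.0, 20.0])
print(zero_fit, better_fit)
```

With all coefficients at zero, the averaged objective equals ln(0.5), about -0.6931. Under the assumption that the reported objective is averaged over scenarios in the same way, this value is a useful baseline when reading the results.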

 

Mathematical Problem Statement

 

Formal Problem Statement

 

Problem dimension and solving time

 

Number of Variables: 20
Number of Scenarios: 4000
Objective Value: -0.6890
Solving Time (sec): 0.11

 

Solution in Run-File Environment

 

Description (Run-File)

 

Input Files to run CS:

Problem Statement (.txt file)
DATA (.zip file)

 

Output Files:

Output DATA (.zip file)

 

Solution in MATLAB Environment

 

Solved with PSG MATLAB function tbpsg_run (General (Text) Format of PSG in MATLAB):

 

Description (tbpsg_run)

 

Input Files to run CS:

MATLAB code (.txt file)
Data (.zip file with .m and .mat files)

 

Problem 2

Sum-of-splines approximation for a set of factors. A spline is built for each factor, and their sum best fits the dependent variable.

 

Simplified Problem Statement

 

Maximize logexp_sum(spline_sum) (maximize Logarithms Exponents Sum applied to Spline Sum)

 Calculate:

logexp_sum(spline_sum) (function Logarithms Exponents Sum applied to Spline Sum)

logistic(spline_sum) (function Logistic applied to Spline Sum)

L(spline_sum) (function L applied to Spline Sum)

 

where

 

logexp_sum = Logarithms Exponents Sum

spline_sum = Spline Sum calculates spline values depending upon regression variables for every scenario

logistic = calculates values of the logistic function of the spline regression for every scenario

L = Linear Loss for Spline Sum
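For several factors, the score of a scenario is the sum of one spline per factor, and the logistic function turns that score into a class probability. The following Python sketch illustrates the idea only; it does not reproduce PSG's spline_sum or logistic functions, and the helper names, the piecewise-linear spline form, and the numbers are hypothetical.

```python
import math

def spline_value(x, knots, coefs):
    """Piecewise-linear spline in truncated-power form (illustrative choice)."""
    value = coefs[0] + coefs[1] * x
    for knot, c in zip(knots, coefs[2:]):
        value += c * max(x - knot, 0.0)
    return value

def spline_sum(row, knots_per_factor, coefs_per_factor):
    """Score of one scenario: sum of per-factor spline values."""
    return sum(
        spline_value(x, knots, coefs)
        for x, knots, coefs in zip(row, knots_per_factor, coefs_per_factor)
    )

def logistic(s):
    """Numerically stable 1 / (1 + exp(-s))."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

# One scenario with two factors (made-up values and coefficients).
row = [0.2, 0.8]
knots_per_factor = [[0.5], [0.5]]
coefs_per_factor = [[0.0, 1.0, 0.0], [-1.0, 0.0, 4.0]]

score = spline_sum(row, knots_per_factor, coefs_per_factor)
prob = logistic(score)
predicted_class = 1 if prob > 0.5 else 0
print(score, prob, predicted_class)
```

Each factor contributes its own coefficient vector, so the number of decision variables grows with the number of factors and with the number of pieces per spline.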

 

 

Mathematical Problem Statement

 

Formal Problem Statement

 

Problem dimension and solving time

 

Number of Variables: 286
Number of Scenarios: 4000
Objective Value: -0.6781
Solving Time (sec): 19.77

 

Solution in Run-File Environment

 

Description (Run-File)

 

Input Files to run CS:

Problem Statement (.txt file)
DATA (.zip file)

 

Output Files:

Output DATA (.zip file)

 

Solution in MATLAB Environment

 

Solved with PSG MATLAB function tbpsg_run (General (Text) Format of PSG in MATLAB):

 

Description (tbpsg_run)

 

Input Files to run CS:

MATLAB code (.txt file)
Data (.zip file with .m and .mat files)

 

Problem 3

Sum-of-splines approximation with 4-fold cross-validation (4 in-sample and 4 out-of-sample data sets).

 

Simplified Problem Statement

 

4-fold crossvalidation

Maximize logexp_sum(spline_sum) (maximize Logarithms Exponents Sum applied to Spline Sum)

Calculate:

logistic(spline_sum) (function Logistic applied to Spline Sum on the in-sample data)

logistic(spline_sum) (function Logistic applied to Spline Sum on the out-of-sample data)

logexp_sum(spline_sum) (function Logarithms Exponents Sum applied to Spline Sum on the in-sample data)

logexp_sum(spline_sum) (function Logarithms Exponents Sum applied to Spline Sum on the out-of-sample data)

 

where

 

crossvalidation(N,Matrix) = matrix operation that splits the input Matrix into N pairs of complementary sub-matrices

logexp_sum = Logarithms Exponents Sum

spline_sum = Spline Sum calculates spline values depending upon regression variables for every scenario

logistic = calculates values of the logistic function of the spline regression for every scenario
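The splitting step can be sketched in Python as follows. This is not the PSG crossvalidation operation: the function name crossvalidation_pairs is hypothetical, and assigning rows to folds in an interleaved fashion is an assumption, since the row assignment rule is not stated here.

```python
def crossvalidation_pairs(n_folds, rows):
    """Split rows into n_folds (in-sample, out-of-sample) pairs.

    Each pair is complementary: the out-of-sample part holds one fold and
    the in-sample part holds all remaining rows. Fold membership is
    interleaved (row i goes to fold i % n_folds), an illustrative choice.
    """
    pairs = []
    for fold in range(n_folds):
        out_sample = [r for i, r in enumerate(rows) if i % n_folds == fold]
        in_sample = [r for i, r in enumerate(rows) if i % n_folds != fold]
        pairs.append((in_sample, out_sample))
    return pairs

# Twelve scenarios identified by their row index (toy example).
rows = list(range(12))
pairs = crossvalidation_pairs(4, rows)

for in_sample, out_sample in pairs:
    # A model would be fitted on in_sample and evaluated on both parts.
    print(len(in_sample), len(out_sample))
```

Comparing the objective on the in-sample and out-of-sample parts of each pair indicates overfitting: a large gap suggests the splines follow noise in the fitting data.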

 

 

Mathematical Problem Statement

 

Formal Problem Statement

 

Problem dimension and solving time

 

For one problem in Cross-validation:

 

Number of Variables: 286
Number of Scenarios: 4000
Objective Value: -0.6769
Solving Time (sec): 7.48

 

Solution in Run-File Environment

 

Description (Run-File)

 

Input Files to run CS:

Problem Statement (.txt file)
DATA (.zip file)

 

Output Files:

Output DATA (.zip file)

 

Solution in MATLAB Environment

 

Solved with PSG MATLAB function tbpsg_run (General (Text) Format of PSG in MATLAB):

 

Description (tbpsg_run)

 

Input Files to run CS:

MATLAB code (.txt file)
Data (.zip file with .m and .mat files)