Project 2
Regression and Classification with the Ames Housing Data
This project uses the Ames housing data, recently made available on Kaggle.
Data science often involves modelling and prediction based on a dataset. This project explores techniques such as regression and classification. The Python packages used include:
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- imbalanced-learn (imblearn)
1. Estimating the value of homes from fixed characteristics
1.1 Overview of dataset using pandas .describe()
house_res.describe(include='all')
1.2 Fixed characteristics as observed from the data description file:
1.3 Feature engineering
A good practice before any modelling or classification is to examine your dataset for salient features that can be aggregated, and to factorise or binarise (one-hot encode) qualitative data. A brief feature engineering workflow may be as follows (a sketch is shown after the list):
- Sum columns that can be aggregated
- Binarise columns
- Drop columns with low variance
- Get dummies for categorical columns
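A minimal sketch of such a workflow, assuming the DataFrame is named house_res1 as in the later examples (the specific columns aggregated and binarised here are illustrative choices, not the project's exact steps):

# aggregate related porch-area columns into a single total (illustrative)
house_res1['TotalPorchSF'] = (house_res1['OpenPorchSF'] + house_res1['EnclosedPorch']
                              + house_res1['3SsnPorch'] + house_res1['ScreenPorch'])

# binarise: 1 if the house has a pool, 0 otherwise
house_res1['HasPool'] = (house_res1['PoolArea'] > 0).astype(int)

# one-hot encode the remaining categorical columns
house_res1 = pd.get_dummies(house_res1, drop_first=True)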
A useful feature engineering step is to remove quantitative columns with near-zero variance: such a column carries little information and has minimal impact on your prediction or classification. Here is sample Python code to identify columns with near-zero variance.
# near zero variance
def nearZeroVariance(X, freqCut=95 / 5, uniqueCut=10):
    '''
    Determine predictors with near zero or zero variance.

    Inputs:
        X: pandas data frame
        freqCut: the cutoff for the ratio of the most common value to the
                 second most common value
        uniqueCut: the cutoff for the percentage of distinct values out of
                   the number of total samples

    Returns a tuple containing two lists of column names: (zeroVar, nzVar)
    '''
    colNames = X.columns.values.tolist()
    freqRatio = dict()
    uniquePct = dict()

    for names in colNames:
        # value counts sorted from most to least frequent
        counts = (
            X[names]
            .value_counts()
            .sort_values(ascending=False)
            .values
        )
        # a column with a single unique value has zero variance
        if len(counts) == 1:
            freqRatio[names] = -1
            uniquePct[names] = (len(counts) / len(X[names])) * 100
            continue
        freqRatio[names] = counts[0] / counts[1]
        uniquePct[names] = (len(counts) / len(X[names])) * 100

    zeroVar = list()
    nzVar = list()
    for k in uniquePct.keys():
        if freqRatio[k] == -1:
            zeroVar.append(k)
        if uniquePct[k] < uniqueCut and freqRatio[k] > freqCut:
            nzVar.append(k)

    return zeroVar, nzVar
zerovartest = house_res1.loc[:,['LotFrontage','LotArea','MasVnrArea', 'TotalBsmtSF','GrLivArea',\
'GarageArea','WoodDeckSF', 'PoolArea', 'MiscVal']]
nearZeroVariance(zerovartest)
([], ['MasVnrArea', 'PoolArea', 'MiscVal'])
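Columns flagged this way can then be dropped before modelling; a minimal sketch (whether to drop every flagged column is a judgement call for your own analysis):

# drop the zero- and near-zero-variance columns identified above
zeroVar, nzVar = nearZeroVariance(zerovartest)
house_res1 = house_res1.drop(columns=zeroVar + nzVar)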
2. Preparing your dataset for prediction/classification
2.1 Train-test split and normalisation
Before running your model, it is important to split your data into train and test sets. It is also important to normalise data columns where necessary. In our case, we can use scikit-learn's train_test_split to split the dataset and its StandardScaler to normalise the features.
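A minimal sketch of this step, assuming X holds the engineered features and y the target (variable names and the split proportion are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

scaler = StandardScaler()
Xs_train = scaler.fit_transform(X_train)  # fit the scaler on the training data only
Xs_test = scaler.transform(X_test)        # apply the same scaling to the test data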
# Lasso regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

lasso = Lasso(alpha=optimal_lasso.alpha_)
lasso_scores = cross_val_score(lasso, Xs, y, cv=10)

print(lasso_scores)
print(np.mean(lasso_scores))
[ 0.88388639 0.8378792 0.83228373 0.73855977 0.7982018 0.7768981
0.82231355 0.77187934 0.51414779 0.83227006]
0.780831973138
3. Model for regression
3.1 Lasso regression to predict house prices
Lasso regression applies L1 regularisation, shrinking the coefficients of columns that are not informative for predicting the variable of interest towards zero. This is particularly useful for datasets with a large number of (often one-hot encoded) qualitative features.
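The optimal_lasso object referenced above is assumed to come from a cross-validated search over regularisation strengths; a minimal sketch using scikit-learn's LassoCV (settings are illustrative):

from sklearn.linear_model import LassoCV

# search candidate alphas with 10-fold cross-validation
optimal_lasso = LassoCV(n_alphas=200, cv=10)
optimal_lasso.fit(Xs, y)
print(optimal_lasso.alpha_)  # the alpha reused in Lasso(alpha=...) above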
lasso.fit(Xs, y)
Lasso(alpha=1098.3164314643716, copy_X=True, fit_intercept=True,
max_iter=1000, normalize=False, positive=False, precompute=False,
random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
# top 20 features after lasso
lasso_coefs = pd.DataFrame({'variable':X.columns,
'coef':lasso.coef_,
'abs_coef':np.abs(lasso.coef_)})
lasso_coefs.sort_values('abs_coef', inplace=True, ascending=False)
lasso_coefs.head(20)
|    | abs_coef     | coef          | variable             |
|----|--------------|---------------|----------------------|
| 3  | 27801.501124 | 27801.501124  | GrLivArea            |
| 52 | 13227.923817 | 13227.923817  | Neighborhood_NridgHt |
| 93 | 13146.851630 | 13146.851630  | GarageCars_3         |
| 2  | 9522.574031  | 9522.574031   | TotalBsmtSF          |
| 58 | 7605.301512  | 7605.301512   | Neighborhood_StoneBr |
| 51 | 7503.234427  | 7503.234427   | Neighborhood_NoRidge |
| 57 | 6852.984978  | 6852.984978   | Neighborhood_Somerst |
| 7  | 5878.558372  | -5878.558372  | BsmtFullBath_0       |
| 19 | 5098.838853  | -5098.838853  | MSSubClass_90        |
| 82 | 4719.729393  | 4719.729393   | TotRmsAbvGrd_10      |
| 53 | 4426.366451  | -4426.366451  | Neighborhood_OldTown |
| 4  | 4359.232517  | 4359.232517   | GarageArea           |
| 72 | 4212.180965  | -4212.180965  | BedroomAbvGr_5       |
| 43 | 4058.672481  | -4058.672481  | Neighborhood_Edwards |
| 8  | 3902.417073  | -3902.417073  | fullbath_2           |
| 21 | 3650.970712  | -3650.970712  | MSSubClass_160       |
| 5  | 3590.483741  | 3590.483741   | WoodDeckSF           |
| 1  | 3211.026312  | 3211.026312   | LotArea              |
| 41 | 2830.346250  | 2830.346250   | Neighborhood_CollgCr |
| 9  | 2691.396233  | -2691.396233  | halfbath_0           |
4. Classify records into abnormal or normal sale
4.1 Caveat: Imbalanced dataset
In some cases, your dataset may be imbalanced. This has significant implications for model building: in classification, over-representation of one class can bias the classifier towards the majority class. To mitigate this, it is advisable to resample the data so that the classes are better balanced.
For this dataset, we will explore two resampling techniques for imbalanced data:
- SMOTE (Synthetic Minority Over-sampling Technique)
- A combination of SMOTE oversampling and Tomek link undersampling
4.1.1 SMOTE
Synthetic Minority Over-sampling Technique (a sketch follows this list):
- For each minority-class point, compute its nearest minority-class neighbours
- Draw a line from the point to each of those neighbours
- Synthetically generate new minority points along those lines
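A minimal sketch using imbalanced-learn's SMOTE, assuming the training split from earlier (X_train and y_train are assumed names; older imbalanced-learn versions use fit_sample instead of fit_resample):

from imblearn.over_sampling import SMOTE

# oversample the minority class on the training data only
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)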
4.1.2 Combination of SMOTE and Tomek links
Tomek link undersampling (a sketch follows this list):
- A pair of samples forms a Tomek link if they are each other's nearest neighbours and belong to different classes
- The majority-class sample of each Tomek link is then removed (undersampling)
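A minimal sketch using imbalanced-learn's SMOTETomek, which chains SMOTE oversampling with Tomek link cleaning (again assuming X_train and y_train):

from imblearn.combine import SMOTETomek

# oversample with SMOTE, then remove Tomek links
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X_train, y_train)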
4.2 Logistic Regression Classification
We next use logistic regression to classify each sale as normal or abnormal, tuning the regularisation settings with a grid search.

# grid search over the regularisation penalty and strength
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param = {'penalty': ['l1', 'l2'],
         'C': list(np.linspace(0.1, 1, num=10))}

clf = GridSearchCV(LogisticRegression(), param, cv=5)
clf.fit(X_res, y_res)
GridSearchCV(cv=5, error_score='raise',
estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False),
fit_params={}, iid=True, n_jobs=1,
param_grid={'penalty': ['l1', 'l2'], 'C': [0.10000000000000001, 0.20000000000000001, 0.30000000000000004, 0.40000000000000002, 0.5, 0.59999999999999998, 0.70000000000000007, 0.80000000000000004, 0.90000000000000002, 1.0]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
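The fitted grid search exposes the chosen hyperparameters and can score the held-out test set used in the reports below (a sketch; X_test is an assumed name for the untouched test features):

print(clf.best_params_)       # best penalty and C found by the grid search
y_pred = clf.predict(X_test)  # predictions evaluated in the classification reports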
4.3 Logistic Regression after SMOTE
Classification report
from imblearn.metrics import classification_report_imbalanced

print(classification_report_imbalanced(y_test, y_pred, target_names=['normal', 'abnormal']))
pre rec spe f1 geo iba sup
normal 0.94 0.79 0.40 0.86 0.35 0.13 444
abnormal 0.13 0.40 0.79 0.20 0.35 0.11 35
avg / total 0.88 0.76 0.43 0.81 0.35 0.13 479
4.4 Logistic Regression after SMOTE + Tomek
Classification report
print(classification_report_imbalanced(y_test, y_pred, target_names=['normal', 'abnormal']))
pre rec spe f1 geo iba sup
normal 0.95 0.77 0.49 0.85 0.37 0.15 444
abnormal 0.14 0.49 0.77 0.22 0.37 0.12 35
avg / total 0.89 0.75 0.51 0.80 0.37 0.14 479