Project 2

Regression and Classification with the Ames Housing Data


This project uses the Ames housing data made available on Kaggle.

Data Science often involves modelling and prediction based on a dataset. In this project, techniques such as regression and classification are explored. Python packages used for this dataset include:

  1. numpy
  2. pandas
  3. matplotlib
  4. seaborn
  5. scikit-learn
  6. imbalanced-learn (imblearn)

1. Estimating the value of homes from fixed characteristics

1.1 Overview of dataset using pandas .describe()

house_res.describe(include='all')

1.2 Fixed characteristics as observed from the data description file:

1.3 Feature engineering

A good practice before any modelling or classification is to examine your dataset for salient features that can be aggregated, and to factorise or binarise (one-hot encode) qualitative data. A brief feature engineering workflow might look as follows (a sketch is given after the list):

  1. Sum columns that can be aggregated
  2. Binarise columns
  3. Drop columns with low variance
  4. Get dummies for categorical columns
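
A minimal sketch of these steps, using illustrative column names from the Ames data dictionary (the exact columns to aggregate or binarise depend on your own inspection of the dataset):

import pandas as pd

# 1. sum columns that measure related quantities (e.g. the porch areas)
house_res1['TotalPorchSF'] = (house_res1['OpenPorchSF'] + house_res1['EnclosedPorch']
                              + house_res1['3SsnPorch'] + house_res1['ScreenPorch'])

# 2. binarise sparse quantitative columns
house_res1['HasPool'] = (house_res1['PoolArea'] > 0).astype(int)

# 4. one-hot encode categorical columns
#    (step 3, dropping low-variance columns, is covered in the next subsection)
house_res1 = pd.get_dummies(house_res1, columns=['Neighborhood', 'MSSubClass'], drop_first=True)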

A useful feature engineering step is to remove quantitative columns with near-zero variance, since such columns carry little information for prediction or classification. Here is a sample Python function that flags columns with zero or near-zero variance.

# near zero variance

def nearZeroVariance(X, freqCut=95.0 / 5, uniqueCut=10):
    '''
    Determine predictors with near zero or zero variance.
    Inputs:
    X: pandas data frame
    freqCut: the cutoff for the ratio of the most common value to the second most common value
    uniqueCut: the cutoff for the percentage of distinct values out of the number of total samples
    Returns a tuple containing two lists of column names: (zeroVar, nzVar)
    '''

    colNames = X.columns.values.tolist()
    freqRatio = dict()
    uniquePct = dict()

    for names in colNames:
        # value counts for the column, most frequent value first
        counts = (
            X[names]
            .value_counts()
            .sort_values(ascending=False)
            .values
        )

        if len(counts) == 1:
            # a single unique value means zero variance
            freqRatio[names] = -1
            uniquePct[names] = (float(len(counts)) / len(X[names])) * 100
            continue

        # ratio of the most common value's frequency to the second most common's
        freqRatio[names] = float(counts[0]) / counts[1]
        # percentage of distinct values out of the total number of samples
        uniquePct[names] = (float(len(counts)) / len(X[names])) * 100

    zeroVar = list()
    nzVar = list()
    for k in uniquePct.keys():
        if freqRatio[k] == -1:
            zeroVar.append(k)

        if uniquePct[k] < uniqueCut and freqRatio[k] > freqCut:
            nzVar.append(k)

    return (zeroVar, nzVar)
zerovartest = house_res1.loc[:, ['LotFrontage', 'LotArea', 'MasVnrArea', 'TotalBsmtSF', 'GrLivArea',
                                 'GarageArea', 'WoodDeckSF', 'PoolArea', 'MiscVal']]
nearZeroVariance(zerovartest)
([], ['MasVnrArea', 'PoolArea', 'MiscVal'])
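
Based on this output, the flagged columns can then be dropped (or binarised first, as in step 2 above); a minimal sketch:

# drop the columns flagged as having near zero variance
house_res1 = house_res1.drop(['MasVnrArea', 'PoolArea', 'MiscVal'], axis=1)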

2. Preparing your dataset for prediction/classification

2.1 Train-test split and normalisation

Before running your model, it is important to split your data into training and test sets, and to normalise data columns where necessary. In our case, we can use scikit-learn's train_test_split function to split the dataset and its StandardScaler to standardise the features.
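
A minimal sketch of this step, assuming the feature matrix X and target y produced by the feature engineering above (note that train_test_split lives in sklearn.model_selection in current scikit-learn releases):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
Xs_train = scaler.fit_transform(X_train)   # fit the scaler on the training data only
Xs_test = scaler.transform(X_test)         # apply the same transformation to the test data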

# Lasso regression cross-validated on the standardised features Xs (see section 3)
lasso = Lasso(alpha=optimal_lasso.alpha_)

lasso_scores = cross_val_score(lasso, Xs, y, cv=10)

print(lasso_scores)
print(np.mean(lasso_scores))
[ 0.88388639  0.8378792   0.83228373  0.73855977  0.7982018   0.7768981
  0.82231355  0.77187934  0.51414779  0.83227006]
0.780831973138

3. Model for regression

3.1 Lasso regression to predict house prices

Lasso regression applies regularisation that shrinks the coefficients of columns that are not informative for predicting the variable of interest, driving many of them to zero. This is particularly useful for datasets with a large number of features, such as the many dummy columns created from qualitative variables.
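
The alpha used below (optimal_lasso.alpha_, also referenced in the cross-validation cell above) is typically chosen by cross-validation; a minimal sketch using scikit-learn's LassoCV on the standardised features Xs and target y:

from sklearn.linear_model import LassoCV

# search over a grid of regularisation strengths; the best one is stored in .alpha_
optimal_lasso = LassoCV(n_alphas=100, cv=10)
optimal_lasso.fit(Xs, y)
print(optimal_lasso.alpha_)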

lasso.fit(Xs, y)
Lasso(alpha=1098.3164314643716, copy_X=True, fit_intercept=True,
   max_iter=1000, normalize=False, positive=False, precompute=False,
   random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
# top 20 features after lasso
lasso_coefs = pd.DataFrame({'variable':X.columns,
                            'coef':lasso.coef_,
                            'abs_coef':np.abs(lasso.coef_)})

lasso_coefs.sort_values('abs_coef', inplace=True, ascending=False)

lasso_coefs.head(20)
abs_coef coef variable
3 27801.501124 27801.501124 GrLivArea
52 13227.923817 13227.923817 Neighborhood_NridgHt
93 13146.851630 13146.851630 GarageCars_3
2 9522.574031 9522.574031 TotalBsmtSF
58 7605.301512 7605.301512 Neighborhood_StoneBr
51 7503.234427 7503.234427 Neighborhood_NoRidge
57 6852.984978 6852.984978 Neighborhood_Somerst
7 5878.558372 -5878.558372 BsmtFullBath_0
19 5098.838853 -5098.838853 MSSubClass_90
82 4719.729393 4719.729393 TotRmsAbvGrd_10
53 4426.366451 -4426.366451 Neighborhood_OldTown
4 4359.232517 4359.232517 GarageArea
72 4212.180965 -4212.180965 BedroomAbvGr_5
43 4058.672481 -4058.672481 Neighborhood_Edwards
8 3902.417073 -3902.417073 fullbath_2
21 3650.970712 -3650.970712 MSSubClass_160
5 3590.483741 3590.483741 WoodDeckSF
1 3211.026312 3211.026312 LotArea
41 2830.346250 2830.346250 Neighborhood_CollgCr
9 2691.396233 -2691.396233 halfbath_0

4. Classify records into abnormal or normal sale

4.1 Caveat: Imbalanced dataset

In some cases, your dataset may be imbalanced. This has large implications for model building: in classification, an over-representation of one class can skew the classifier towards the majority class. To mitigate this problem, it is advisable to apply resampling techniques that balance out the classes.

For this dataset, we will explore two resampling techniques for imbalanced data:

  1. SMOTE - Synthetic Minority Over-sampling Technique
  2. SMOTE + Tomek links - SMOTE oversampling combined with Tomek-link undersampling

4.1.1 SMOTE

Synthetic Minority Over-sampling Technique (a usage sketch follows the list):

  • for each minority-class point, find its nearest minority-class neighbours
  • pick one of those neighbours at random and draw a line segment to it
  • synthesise a new minority-class point at a random position along that segment
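
A minimal usage sketch with imbalanced-learn, assuming X_train and y_train from the earlier split (older imblearn releases name this method fit_sample rather than fit_resample):

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)   # oversample the minority class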

4.1.2 Combination: SMOTE and Tomek links

Tomek link undersampling (a usage sketch follows the list):

  • a pair of samples forms a Tomek link if the two points are each other's nearest neighbours and belong to different classes
  • the majority-class sample of each Tomek link is then removed (undersampling)
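
A minimal usage sketch of the combined resampler, under the same assumptions as the SMOTE example above:

from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X_train, y_train)   # oversample with SMOTE, then remove Tomek links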

4.2 Logistic Regression Classification

We will next use logistic regression, tuned with a grid search, to classify each record as a normal or abnormal housing sale.

param = {'penalty': ['l1', 'l2'],
         'C': list(np.linspace(0.1, 1, num=10))}

# 5-fold grid search over penalty type and regularisation strength,
# fitted on the resampled training data
clf = GridSearchCV(LogisticRegression(), param, cv=5)
clf.fit(X_res, y_res)
GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.10000000000000001, 0.20000000000000001, 0.30000000000000004, 0.40000000000000002, 0.5, 0.59999999999999998, 0.70000000000000007, 0.80000000000000004, 0.90000000000000002, 1.0]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
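
The fitted grid search can then report the best hyper-parameters and generate the predictions used in the reports below; a minimal sketch, assuming X_test comes from the earlier train-test split:

print(clf.best_params_)

# GridSearchCV refits the best estimator on the full (resampled) training data,
# so predict() delegates to that refitted model
y_pred = clf.predict(X_test)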

4.3 Logistic Regression after SMOTE

Classification report. Column abbreviations: pre = precision, rec = recall, spe = specificity, f1 = F1 score, geo = geometric mean, iba = index of balanced accuracy, sup = support.

from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred, target_names=['normal', 'abnormal']))
                   pre       rec       spe        f1       geo       iba       sup

     normal       0.94      0.79      0.40      0.86      0.35      0.13       444
   abnormal       0.13      0.40      0.79      0.20      0.35      0.11        35

avg / total       0.88      0.76      0.43      0.81      0.35      0.13       479

4.4 Logistic Regression after SMOTE + TOMEK

Classification report

print(classification_report_imbalanced(y_test, y_pred, target_names=['normal', 'abnormal']))
                   pre       rec       spe        f1       geo       iba       sup

     normal       0.95      0.77      0.49      0.85      0.37      0.15       444
   abnormal       0.14      0.49      0.77      0.22      0.37      0.12        35

avg / total       0.89      0.75      0.51      0.80      0.37      0.14       479