Getting Started with TensorFlow the Easy Way (Part 4)

Shahebaz Mohammad · Analytics Vidhya · Dec 12, 2018

This is Part 4, "Implementing a Classification Example in TensorFlow", in a series of articles on how to get started with TensorFlow.

Source: https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623

Classification challenges are quite exciting to solve with machine learning and deep learning techniques. There are various business use cases for classification algorithms, from churn prediction to moderating text reviews for spam/ham.

In this part, we will code out a TensorFlow linear classification example, analyze our predictions, and validate how well our model performs on unseen data. The regular TensorFlow functions will be skipped in this part since we have already covered them in the previous posts. If you feel unsure about concepts such as variables and placeholders in TensorFlow, feel free to look at the previous parts of this series.

Part 1: TensorFlow Installation and Setup, Syntax, and Graphs

Part 2: Variables and Placeholders in TensorFlow

Part 3: Implementing a Regression Example in TensorFlow

The Problem Statement

Given various features of an individual, predict which income class they belong to (>50K or <=50K). The dataset can be downloaded from here.

The data consists of demographic and employment columns such as age, workclass, education, occupation, and hours_per_week, along with the target column income_bracket.

Let us start by loading our dataset using pandas.

import pandas as pd
import tensorflow as tf

Datasets from UCI usually do not ship with a header row, so let us name the columns using the information provided on the data page.

census = pd.read_csv("adult.csv", header=None)
census.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                  'marital_status', 'occupation', 'relationship', 'race', 'gender',
                  'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
                  'income_bracket']

Check out the head of the dataframe to see our tabular data.
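For a quick sanity check, pandas can show the first few rows (a minimal sketch):

# Peek at the first five rows to confirm the column names lined up correctly
census.head()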

Our target variable in this case is a string, i.e. >50K or <=50K. Unfortunately, TensorFlow won't understand strings as labels, so we use the pandas .apply() method with a custom function that converts them to 0s and 1s.

def label_fix(label):
    # The raw UCI values carry a leading space, hence ' <=50K'
    if label == ' <=50K':
        return 0
    else:
        return 1

census['income_bracket'] = census['income_bracket'].apply(label_fix)
census['income_bracket'].unique()

The output should contain only 0s and 1s. Now we are set to code out the TensorFlow pipeline, starting by splitting our data into train and test sets.

from sklearn.model_selection import train_test_split

x_data = census.drop('income_bracket', axis=1)
y_labels = census['income_bracket']
X_train, X_test, y_train, y_test = train_test_split(x_data, y_labels,
                                                    test_size=0.3, random_state=101)

If you have followed the previous parts, you must have noticed that we haven't used any strings as features before. How do you think we can pass categorical features into a TensorFlow model? There are actually two methods:

  1. Vocabulary List
  2. Hash Bucket

Vocabulary List

In the vocabulary list method, we pass the column name along with all the unique labels that exist for that column. This is feasible for encoding columns with only a few (say, 2 to 4) unique values.

Hash Bucket

In this method, a hash of each unique value is computed and used in place of the label. This is convenient for high-cardinality columns, where passing a list of all unique values is impractical.

For gender, let us use the vocabulary list method:

# NOTE: like the income labels, the raw gender values carry a leading space
# (' Female', ' Male'); strip your strings first or include the space here
gender = tf.feature_column.categorical_column_with_vocabulary_list("gender", ["Female", "Male"])

And for high-cardinality columns, use hash buckets:

occupation = tf.feature_column.categorical_column_with_hash_bucket("occupation", hash_bucket_size=1000)
marital_status = tf.feature_column.categorical_column_with_hash_bucket("marital_status", hash_bucket_size=1000)
relationship = tf.feature_column.categorical_column_with_hash_bucket("relationship", hash_bucket_size=1000)
education = tf.feature_column.categorical_column_with_hash_bucket("education", hash_bucket_size=1000)
workclass = tf.feature_column.categorical_column_with_hash_bucket("workclass", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket("native_country", hash_bucket_size=1000)

The continuous numeric columns are easy and follow the same pattern, using tf.feature_column.numeric_column:

age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")

We are all set. Now, wrap all these feature columns in a single list:

feat_cols = [gender, occupation, marital_status, relationship, education, workclass,
             native_country, age, education_num, capital_gain, capital_loss,
             hours_per_week]

Now that the data splits and feature columns are ready, let us define the training input function and train the model.

input_func = tf.estimator.inputs.pandas_input_fn(x=X_train, y=y_train, batch_size=100,
                                                 num_epochs=None, shuffle=True)

model = tf.estimator.LinearClassifier(feature_columns=feat_cols)
model.train(input_fn=input_func, steps=5000)

On successful training, you should see the loss reported for each step.
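If you just want aggregate metrics on the held-out data, the estimator can also evaluate itself (a minimal sketch; eval_input_func is my own name for a one-epoch, unshuffled input function over the test split):

eval_input_func = tf.estimator.inputs.pandas_input_fn(x=X_test, y=y_test, batch_size=100,
                                                      num_epochs=1, shuffle=False)

# Returns a dict of metrics such as accuracy, auc and average_loss
results = model.evaluate(input_fn=eval_input_func)
print(results)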

Let us build pred_fn, an input function that feeds our test dataset for predictions, with shuffle=False.

Important note: While training the model, it is up to you whether to use shuffle=True; it is usually good to shuffle during training. But while making predictions, make sure you set shuffle=False, since with a random prediction order you could never line up predictions with the true labels to validate or measure your results.

pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test, batch_size=len(X_test),
                                              shuffle=False)
predictions = list(model.predict(input_fn=pred_fn))

# Each prediction is a dict; 'class_ids' holds the predicted 0/1 label
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])

The predictions are ready. The next step is to find out how our TensorFlow model has performed.

from sklearn.metrics import classification_report

print(classification_report(y_test, final_preds))
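If you also want the raw counts behind these scores, a confusion matrix is a quick complement (a minimal sketch with scikit-learn):

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, final_preds))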

Woah! A great F1-score for a vanilla baseline model. Now, let us have a look at the AUC metric for our model.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(y_test, final_preds)
roc_auc = auc(fpr, tpr)
print("ROC AUC Score: {}".format(roc_auc))

plt.figure()
plt.plot(fpr, tpr, color='green', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
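A side note: roc_curve on the hard 0/1 predictions yields only a single operating point, so the curve above is really two straight segments. For a smoother curve, you can feed it the positive-class probabilities that the estimator's prediction dicts also carry (a minimal sketch, reusing the predictions list from above):

# Each prediction dict from LinearClassifier.predict() includes a
# 'probabilities' array with one score per class; take the positive class
probs = [pred['probabilities'][1] for pred in predictions]

fpr, tpr, thresholds = roc_curve(y_test, probs)
print("ROC AUC from probabilities: {}".format(auc(fpr, tpr)))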

Either way, we score an acceptable AUC of around 0.738. To tweak the score, we can either train for more steps or do some feature engineering, for example (a pandas sketch follows the list):

  1. Create categories from the age column
  2. Calculate the average hours worked per occupation
  3. Calculate the average capital gain per education level, and so on
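Here is a hedged pandas sketch of those ideas; the bin edges and new column names (age_bucket, avg_hours_occupation, avg_gain_education) are hypothetical choices of mine, not from the original post:

# 1. Bucket the continuous age column into coarse categories (hypothetical bins)
census['age_bucket'] = pd.cut(census['age'], bins=[0, 25, 45, 65, 100],
                              labels=['young', 'adult', 'senior', 'retired'])

# 2. Average hours worked per week within each occupation
census['avg_hours_occupation'] = census.groupby('occupation')['hours_per_week'].transform('mean')

# 3. Average capital gain within each education level
census['avg_gain_education'] = census.groupby('education')['capital_gain'].transform('mean')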

I leave the rest of the tweaking to you. That's all for this series. Congratulations on building your first classification model with vanilla TensorFlow. Great job!

Do clap, comment, and share your thoughts in the comments below. Follow Analytics Vidhya and stay tuned for upcoming exciting posts.
