# Project 3: Web Scraping and Analysis
For this project, we explore web scraping to obtain data from websites. Specifically, we scrape job search websites for job listings together with their descriptions and salary information. We then use the scraped data to predict salaries and to classify jobs according to their descriptions. Classifying jobs by description involves analysing textual information, and we can make use of scikit-learn’s text-processing tools to help with the analysis.
Packages used:
- BeautifulSoup
- pandas
- urllib3
- re
- pickle
- matplotlib
- seaborn
- scikit-learn
- imblearn
## 1. Getting the links from the job site for scraping using BeautifulSoup and urllib3
```python
import urllib3
from bs4 import BeautifulSoup

# Build the list of search-result pages to scrape
url = 'https://jobscentral.com.sg/jobsearch?q=data%20science&pg='
urllist = [url + str(i) for i in range(1, 15)]

# Collect the unique job-listing links from each results page
http = urllib3.PoolManager()
links = []
for link in urllist:
    response = http.request('GET', link)
    soup = BeautifulSoup(response.data, "html.parser")
    for val in soup.find_all('a', attrs={"class": "job-title"}, href=True):
        linkstr = 'https://jobscentral.com.sg' + val['href']
        if linkstr not in links:
            links.append(linkstr)
```
### 1.2 Append results into a pandas DataFrame
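The collection code for this step isn’t reproduced here. Below is a minimal sketch, assuming each job link is fetched the same way as above and that later sections use a DataFrame named `all_jobsraw` with a `full_description` column; the real field extraction and column set may differ.

```python
import urllib3
import pandas as pd
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
rows = []
for link in links:
    response = http.request('GET', link)
    soup = BeautifulSoup(response.data, "html.parser")
    title = soup.find('h1')  # hypothetical: the real title element may differ
    rows.append({
        'url': link,
        'title': title.get_text(strip=True) if title else None,
        'full_description': soup.get_text(' ', strip=True),
    })

all_jobsraw = pd.DataFrame(rows)
```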
## 2. Analysing factors that affect salary
### 2.1 Generate meaningful features from text using term frequency–inverse document frequency (tf-idf)
> Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query.
>
> Source: http://www.tfidf.com/
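In its simplest form, the tf-idf weight of a term t in a document d is tf(t, d) × log(N / df(t)), where tf(t, d) is the number of times t appears in d, N is the number of documents in the corpus, and df(t) is the number of documents containing t.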
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Bigrams and trigrams, English stop words removed, capped at 2,000 features
vect = TfidfVectorizer(stop_words='english', ngram_range=(2, 3), max_features=2000)
textvec = vect.fit_transform(X['full_description'])
```
Example of features after tf-idf
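To make this concrete, here is a tiny hypothetical corpus (toy data, not the scraped listings) showing the kind of bigram/trigram feature names the vectoriser produces:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["machine learning engineer role", "data science team lead role"]
toy_vect = TfidfVectorizer(ngram_range=(2, 3))
toy_vect.fit(toy)
print(toy_vect.get_feature_names())
# ['data science', 'data science team', 'engineer role', 'lead role', ...]
```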
### 2.2 Principal Component Analysis (PCA)
tf-idf often produces a large number of features. We can use dimensionality-reduction techniques such as Principal Component Analysis (PCA) to reduce the number of features needed for prediction or classification.
```python
from sklearn.decomposition import PCA

# Project the tf-idf features onto 15 principal components
pca = PCA(n_components=15)
pca.fit(X1)
```
The explained variance ratio shows how much of the variance in the dataset is accounted for by each principal component.
```python
print(pca.explained_variance_ratio_)
print(sum(pca.explained_variance_ratio_))
```
```
[0.17547903 0.1269256  0.08849329 0.05611374 0.03669752 0.02350112
 0.01906956 0.01444719 0.01286519 0.01149481 0.01019614 0.0093229
 0.00823705 0.00807786 0.00669269]
0.607613709219762
```
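Together, the 15 components retain about 61% of the variance in the tf-idf features.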
### 2.3 Linear regression to predict salary (with lasso regularisation)
```python
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Fit a lasso-regularised linear regression on the principal components
lasso = linear_model.Lasso()
lasso.fit(X_train, y_train)
```
```
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
```
```python
# Cross-validated R^2 scores on the training set
cross_val_score(lasso, X_train, y_train)
```
```
array([0.33293736, 0.35060433, 0.27800117])
```
```python
y_pred = lasso.predict(X_test)
# The coefficients
print('Coefficients: \n', lasso.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
```
```
Coefficients:
 [ 1240.71748822  -462.59215024  2215.14691749  -225.41913544
 -2109.89030805  -327.22110887  -976.31850488   608.75956347
  -173.8406491   -525.20852812   625.66194878 -2623.4846474
 -3067.27596946 -1788.03523016  3521.19561575]
Mean squared error: 2702897.93
```
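A mean squared error of 2,702,897.93 corresponds to a root mean squared error of roughly 1,644 in the salary units used, and the cross-validated R² scores of about 0.3 indicate that the principal components explain only a modest share of the variation in salary.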
## 3. Classifying data science and non-data science jobs
### 3.1 Generate value counts of job categories
```python
all_jobsraw.is_category.value_counts()
```
```
engineer      818
analyst       688
leadership    624
dont_care     375
scientist     361
intern         54
database        1
Name: is_category, dtype: int64
```
Binarise job categories into data science or non-data science.
```python
all_jobsraw['is_datasci'] = all_jobsraw['is_category'].map(lambda x: 1 if x == 'scientist' else 0)
```
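Note that the labels are heavily imbalanced: only 361 of the 2,921 listings are data science jobs. This is where the imblearn package listed above comes in, as sketched in section 3.3 below.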
### 3.2 Generate tf-idf matrix from job description
```python
# df_text is the sparse tf-idf matrix of the job descriptions
df_text1 = pd.DataFrame(data=df_text.todense(), columns=vect.get_feature_names())
df_text1.head(2)
```
### 3.3 Build logistic regression and random forest classifiers
#### Logistic regression
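The model-fitting code isn’t reproduced in the original. Below is a minimal sketch, assuming a 70/30 train/test split (the report’s support of 877 out of 2,921 rows is consistent with a 0.3 test fraction) and, since imblearn appears in the package list, SMOTE oversampling of the minority class on the training set; the actual split, resampling, and hyperparameters may differ.

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for evaluation (877 of 2,921)
X_train, X_test, y_train, y_test = train_test_split(
    df_text1, all_jobsraw['is_datasci'], test_size=0.3, random_state=42)

# Oversample the minority (science) class on the training data only
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

logreg = LogisticRegression()
logreg.fit(X_train_res, y_train_res)
y_pred = logreg.predict(X_test)
```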
```python
from sklearn.metrics import classification_report

target_names = ['non-science', 'science']
print(classification_report(y_test, y_pred, target_names=target_names))
```
```
             precision    recall  f1-score   support

non-science       0.96      0.94      0.95       770
    science       0.60      0.71      0.65       107

avg / total       0.92      0.91      0.91       877
```
#### Random forest classifier
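A corresponding sketch for the random forest, with assumed hyperparameters and the same resampled training data:

```python
from sklearn.ensemble import RandomForestClassifier

# n_estimators is an assumption; the original setting is not shown
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_res, y_train_res)
y_pred = rf.predict(X_test)
```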
```python
target_names = ['non-science', 'science']
print(classification_report(y_test, y_pred, target_names=target_names))
```
```
             precision    recall  f1-score   support

non-science       0.95      0.94      0.95       770
    science       0.60      0.65      0.62       107

avg / total       0.91      0.90      0.91       877
```
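Both classifiers perform similarly overall; at the same precision on the science class (0.60), logistic regression achieves slightly higher recall (0.71 against 0.65).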