Project Proposal

Introduction/Background

As internet use has grown, so has the number of fraudulent and phishing websites, where the site owner or attacker attempts to obtain sensitive information from users for personal gain. This has motivated many researchers to apply machine learning to automatically identify scam websites based on their characteristics and features. In our project, we will take features such as a website’s URL length, the types of symbols the URL contains, and the number of MX servers, and apply machine learning techniques to predict whether a website is fraudulent.
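To make the URL-based features concrete, here is a minimal sketch of how the length and symbol counts might be computed for one URL. The function name `extract_url_features`, the `SYMBOLS` list, and the example URL are illustrative assumptions, not the feature definitions of the dataset we will use. The number of MX servers is left out because it requires a DNS lookup.

```python
from urllib.parse import urlparse

# Hypothetical symbol set; the real dataset may count a different list of characters.
SYMBOLS = ".-_/?=@&"


def extract_url_features(url: str) -> dict:
    """Return simple lexical features for a single URL."""
    parsed = urlparse(url)
    features = {
        "url_length": len(url),
        "domain_length": len(parsed.netloc),
    }
    # One count per symbol, e.g. how many dots or '@' signs the URL contains.
    for symbol in SYMBOLS:
        features[f"count_{symbol}"] = url.count(symbol)
    return features


# Made-up URL used only for illustration.
example = extract_url_features("http://example.com/login?user=a@b")
```

Each URL becomes one row of numeric features that later steps can consume.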

Problem Definition

We often receive an email or come across a website that grabs our attention. But how do we know whether we can trust its information? With the rising use of the internet, there has been an unfortunate increase in fraudulent sites that trick users into disclosing sensitive information or that steal their money. Although an average young adult can often tell that something is a scam because it seems “too good to be true”, many fraudulent websites now closely resemble legitimate sites. This is especially problematic for the elderly and for young children, who may not know what is normal on the internet. Consequently, in our project, we will use machine learning techniques to identify suspicious websites and protect users from fraud and its negative consequences.

Methods

Detecting scam websites accurately will involve several features, such as the types of characters found in the URL and the number of MX servers. However, many of these features will not be significant for detection; some will add only marginal discriminative value. Principal component analysis (PCA) will likely be used to reduce the dimensionality of the data by replacing the original features with a smaller set of components. This will simplify visualization and analysis, including clustering.
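As a rough illustration of the dimensionality reduction step, the sketch below runs PCA through a singular value decomposition in NumPy. The toy matrix is made up, and the helper name `pca` and the choice of two components are assumptions for the example. In practice the features would probably need to be standardized first, since PCA is sensitive to scale.

```python
import numpy as np


def pca(X: np.ndarray, n_components: int):
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = singular_values**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return X_centered @ components.T, ratio[:n_components]


# Toy feature matrix (made-up values): the third column nearly duplicates the first.
X = np.array([
    [10.0, 1.0, 10.1],
    [20.0, 3.0, 19.9],
    [30.0, 2.0, 30.2],
    [40.0, 5.0, 39.8],
    [50.0, 4.0, 50.0],
])
X_reduced, explained_ratio = pca(X, n_components=2)
```

Because two of the three columns carry almost the same information, the first component captures most of the variance here. The explained-variance ratios are one way to decide how many components to keep.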

Several forms of clustering can be used to describe the reduced data. Since each has its own strengths, it will be hard to choose one before examining the data. If the clusters are easily distinguished by distance, anything more complex than k-means may be excessive. Because the dataset we are using contains labels, both external and internal cluster evaluations can provide insight into the quality of the clusters.
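The sketch below shows the simplest case: a plain k-means (Lloyd's algorithm) on made-up 2-D points, followed by purity as one example of an external evaluation that uses the labels. Seeding the centroids with the first k points is a simplification for reproducibility, and the data, labels, and function names are assumptions for illustration only.

```python
import math
from collections import Counter


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


def purity(assignments, labels):
    """External evaluation: share of points that match their cluster's majority label."""
    matched = 0
    for c in set(assignments):
        cluster_labels = [l for a, l in zip(assignments, labels) if a == c]
        matched += Counter(cluster_labels).most_common(1)[0][1]
    return matched / len(labels)


# Made-up points forming two well-separated groups; labels are hypothetical.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.1, 7.9), (7.8, 8.2)]
labels = [0, 1, 0, 0, 1, 1]
assignments, centroids = kmeans(points, k=2)
score = purity(assignments, labels)
```

On data this cleanly separated, k-means recovers the two groups exactly. Real feature data is unlikely to be this tidy, which is where the comparison with other clustering methods would matter.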

Potential Results and Discussion

Analyzing the data will likely require clustering, so clustering evaluation measures such as Beta-CV, the Silhouette Coefficient, and the Davies-Bouldin index can indicate how effective the clustering is. Since our model will also be trained with supervised learning on the labeled data, we will report external evaluation measures such as precision, accuracy, and recall.
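For reference, two of the measures named above can be computed as in the following sketch: accuracy, precision, and recall from true and predicted labels, and the mean Silhouette Coefficient from points and cluster assignments. The function names and all data are made up for illustration. The silhouette helper assumes at least two clusters and no duplicate points.

```python
import math


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for a binary labeling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def silhouette_coefficient(points, assignments):
    """Mean silhouette over all points; assumes at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to each cluster, excluding p itself.
        mean_dist = {}
        for c in set(assignments):
            others = [q for j, q in enumerate(points) if assignments[j] == c and j != i]
            if others:
                mean_dist[c] = sum(math.dist(p, q) for q in others) / len(others)
        own = assignments[i]
        if own not in mean_dist:
            scores.append(0.0)  # singleton cluster: silhouette is defined as 0
            continue
        a = mean_dist[own]  # cohesion: mean distance within the own cluster
        b = min(d for c, d in mean_dist.items() if c != own)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical labels: 1 = fraudulent, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)

# Made-up points with their cluster assignments.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.1, 7.9), (7.8, 8.2)]
clusters = [0, 0, 0, 1, 1, 1]
silhouette = silhouette_coefficient(points, clusters)
```

In this toy case one fraudulent site is missed, so precision stays at 1.0 while recall drops to 2/3. Reporting both matters when missed scams and false alarms have different costs.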

References

Abbasi, A., & Chen, H. (2009). A comparison of fraud cues and classification methods for fake escrow website detection. Information Technology and Management, 10, 83–101. https://doi.org/10.1007/s10799-009-0059-0

Abbasi, A., Zhang, Z., Zimbra, D., Chen, H., & Nunamaker, J. F. (2010). Detecting Fake Websites: The Contribution of Statistical Learning Theory. MIS Quarterly, 34(3), 435–461. https://doi.org/10.2307/25750686

Afanasyeva, O., Shiyan, V., & Goncharova, M. (2021). Cyber Fraud as a Relevant Internet of Things Security Threat. Proceedings of the International Scientific and Practical Conference on Computer and Information Security - Volume 1: INFSEC, 122–126. doi:10.5220/0010619600003170

Vrbančič, G., Fister, I., & Podgorelec, V. (2020). Datasets for phishing websites detection. Data in Brief, 33, 106438. doi:10.1016/j.dib.2020.106438

Proposed Timeline

Tasks	Contact	Due Date
Project Proposal
Writing up requirements	Abdulrahman, Hassan	6/16
Project timeline and contribution table	Seo Hyun	6/15
Set up GitHub Pages	Seo Hyun	6/16
Presentation and Video Recording	Abdulrahman	6/16
Midterm Report
Dataset final search + Overview	Hassan	6/23
Transfer Content to Github pages	Seo Hyun	7/7
▼ Data Preprocessing
Select features	Abdulrahman	6/30
Format editing, data collecting, etc.	Hassan	6/23
▼ Implement ML Model
Clustering	Abdulrahman, Hassan	7/4
Clustering Evaluation	Seo Hyun	7/4
▼ Write Up Report
Introduction/Background, Problem definition, Data Collection	Hassan	7/6
Methods, Results and Discussion	Abdulrahman	7/6
Create Visualizations	Seo Hyun	7/7
Final Report
Transfer Content to Github Pages	Seo Hyun, Hassan	7/25
Final 7 min Video	Abdulrahman	7/25
▼ Implement ML Model
Clustering	Abdulrahman, Hassan	7/14
Clustering Evaluation	Seo Hyun	7/14
Second Model	Abdulrahman, Hassan	7/21
Second Model Evaluation	Seo Hyun	7/21
▼ Write Up Report
Introduction/Background, Problem definition, Data Collection	Abdulrahman	7/25
Methods, Results and Discussion	Hassan	7/25
Create Visualizations	Seo Hyun	7/23

Contribution Table

Task	Owner	Completed?
Project Proposal
Writing up requirements	Abdulrahman, Hassan
Project timeline and contribution table	Seo Hyun
Set Up Github Pages	Seo Hyun
Google Slides	Abdulrahman
Final Video	Abdulrahman

*To be updated with detailed tasks from timeline in the future

Presentation

Video

Link to Video