Project Proposal
Introduction/Background
In the internet era, fraudulent and phishing websites have become increasingly common; the site owner or attacker attempts to obtain sensitive information from users for financial gain. This has motivated many to apply machine learning to automatically identify scam websites based on a website’s characteristics and features. In our project, we will take features such as a website’s URL length, the types of symbols the URL contains, and the number of MX servers, and apply machine learning techniques to predict whether a website is fraudulent.
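As a rough illustration of the kind of URL-based features mentioned above, the sketch below computes a URL's length and counts a few symbols using only the Python standard library. The function name `url_features`, the feature names, and the choice of symbols are assumptions made for this example and are not taken from the dataset. The number of MX servers is left out because it requires a DNS lookup.

```python
from urllib.parse import urlparse

# Hypothetical symbol set and feature names, chosen only for illustration.
SYMBOLS = {
    "dots": ".",
    "hyphens": "-",
    "slashes": "/",
    "question_marks": "?",
    "equals_signs": "=",
    "at_signs": "@",
}

def url_features(url: str) -> dict:
    """Return simple lexical features computed from a URL string."""
    parsed = urlparse(url)
    features = {
        "url_length": len(url),
        "domain_length": len(parsed.netloc),
    }
    # Count how often each symbol of interest appears in the full URL.
    for name, symbol in SYMBOLS.items():
        features[f"num_{name}"] = url.count(symbol)
    return features

example = url_features("http://example.com/login?user=a@b.com")
for name, value in example.items():
    print(name, value)
```

Each URL becomes one row of numeric features, which is the form the later dimensionality reduction and clustering steps would expect.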
Problem Definition
We often receive an email or come across a website that grabs our attention. But how do we know whether we can trust it? With the rising use of the internet, there has been an unfortunate increase in fraudulent sites that trick users into disclosing sensitive information or that steal their money. Although an average young adult can often tell that something is a scam because it seems “too good to be true”, many fraudulent websites now closely resemble legitimate sites. This is especially problematic for the elderly and for children, who may be less familiar with how legitimate websites normally look and behave. Consequently, in our project we will use machine learning techniques to identify suspicious websites and help protect users from fraud and its negative consequences.
Methods
Accurately detecting scam websites will involve several features, such as the types of characters found in the URL and the number of MX servers. However, many of these features will not be significant for detecting scam websites; the additional discriminative information they provide will be marginal. Principal component analysis (PCA) will likely be used to reduce the dimensionality of the data by combining correlated features into a smaller set of components. This will simplify visualization and analysis, including clustering.
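The dimensionality reduction described above can be sketched as follows. This is a minimal PCA via singular value decomposition in NumPy, run on synthetic data that stands in for the real feature matrix; the helper name `pca`, the array shapes, and the choice of two components are assumptions for illustration only.

```python
import numpy as np

def pca(X: np.ndarray, n_components: int):
    """Project X (samples x features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = S**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    components = Vt[:n_components]
    return X_centered @ components.T, ratio[:n_components]

# Synthetic stand-in for the feature matrix (not real data):
# six observed features driven by two underlying factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 6))

X_reduced, ratio = pca(X, n_components=2)
print("reduced shape:", X_reduced.shape)
print("variance explained by kept components:", ratio.sum())
```

On real data the features would usually be standardized first, since counts and lengths are on different scales. The explained-variance ratio gives one way to decide how many components to keep.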
Several forms of clustering can be used to describe the reduced data. Since each type has its own strengths, it will be hard to choose among them before examining the data. If the clusters can easily be distinguished by distance, anything more complex than k-means may be excessive. Because the dataset we are using contains labels, both external and internal cluster evaluations can provide insight into the quality of the clusters.
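To make the k-means case concrete, here is a minimal sketch of Lloyd's algorithm in NumPy, followed by purity as one example of an external evaluation against known labels. The two synthetic blobs, the function names `kmeans` and `purity`, and the parameter values are assumptions for illustration, not results or settings from the project.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def purity(cluster_labels, true_labels):
    """External evaluation: fraction of points that match their cluster's majority class."""
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()
    return total / len(true_labels)

# Two well-separated synthetic groups standing in for the two classes.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(100, 2))
group_b = rng.normal(loc=[5, 5], scale=0.5, size=(100, 2))
X = np.vstack([group_a, group_b])
y = np.array([0] * 100 + [1] * 100)

labels, centroids = kmeans(X, k=2)
print("purity:", purity(labels, y))
```

When groups are this easy to separate by distance, k-means recovers them with little effort; on the real features the separation may be much weaker, which is where other clustering methods could become worthwhile.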
Potential Results and Discussion
Analyzing the data will likely require clustering. Therefore, internal clustering evaluation measures such as BetaCV, the Silhouette Coefficient, and the Davies-Bouldin index may indicate the effectiveness of the clustering. Because the dataset is labeled, we will also report external evaluation measures such as precision, accuracy, and recall.
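Two of the measures named above can be sketched in a few lines of NumPy: the Silhouette Coefficient as an internal measure, and accuracy, precision, and recall as external ones. The toy points, toy labels, and function names below are assumptions for illustration only; scikit-learn offers equivalents such as `silhouette_score`, `davies_bouldin_score`, `precision_score`, and `recall_score`.

```python
import numpy as np

def silhouette_coefficient(X, labels):
    """Mean silhouette over all points; assumes at least two clusters."""
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for a binary labeling."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    accuracy = np.mean(y_true == y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy points forming two tight, distant clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
clusters = np.array([0, 0, 1, 1])
silhouette = silhouette_coefficient(X, clusters)
print("silhouette:", silhouette)

# Toy labels: 1 = fraudulent, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print("accuracy, precision, recall:", accuracy, precision, recall)
```

A silhouette near 1 indicates compact, well-separated clusters, while precision and recall show how fraudulent sites specifically are handled, which accuracy alone can hide when classes are imbalanced.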
References
Abbasi, A., & Chen, H. (2009). A comparison of fraud cues and classification methods for fake escrow website detection. Information Technology and Management, 10, 83–101. https://doi.org/10.1007/s10799-009-0059-0
Abbasi, A., Zhang, Z., Zimbra, D., Chen, H., & Nunamaker, J. F. (2010). Detecting Fake Websites: The Contribution of Statistical Learning Theory. MIS Quarterly, 34(3), 435–461. https://doi.org/10.2307/25750686
Afanasyeva, O., Shiyan, V., & Goncharova, M. (2021). Cyber Fraud as a Relevant Internet of Things Security Threat. Proceedings of the International Scientific and Practical Conference on Computer and Information Security - Volume 1: INFSEC, 122–126. https://doi.org/10.5220/0010619600003170
Vrbančič, G., Fister, I., & Podgorelec, V. (2020). Datasets for phishing websites detection. Data in Brief, 33, 106438. https://doi.org/10.1016/j.dib.2020.106438
Proposed Timeline
Tasks | Contact | Due Date |
---|---|---|
Project Proposal | ||
Writing up requirements | Abdulrahman, Hassan | 6/16 |
Project timeline and contribution table | Seo Hyun | 6/15 |
Setup Github pages | Seo Hyun | 6/16 |
Presentation and Video Recording | Abdulrahman | 6/16 |
Midterm Report | ||
Dataset final search + Overview | Hassan | 6/23 |
Transfer Content to Github pages | Seo Hyun | 7/7 |
▼ Data Preprocessing | ||
Select features | Abdulrahman | 6/30 |
Format editing, data collecting, etc. | Hassan | 6/23 |
▼ Implement ML Model | ||
Clustering | Abdulrahman, Hassan | 7/4 |
Clustering Evaluation | Seo Hyun | 7/4 |
▼ Write Up Report | ||
Introduction/Background, Problem definition, Data Collection | Hassan | 7/6 |
Methods, Results and Discussion | Abdulrahman | 7/6 |
Create Visualizations | Seo Hyun | 7/7 |
Final Report | ||
Transfer Content to Github Pages | Seo Hyun, Hassan | 7/25 |
Final 7 min Video | Abdulrahman | 7/25 |
▼ Implement ML Model | ||
Clustering | Abdulrahman, Hassan | 7/14 |
Clustering Evaluation | Seo Hyun | 7/14 |
Second Model | Abdulrahman, Hassan | 7/21 |
Second Model Evaluation | Seo Hyun | 7/21 |
▼ Write Up Report | ||
Introduction/Background, Problem definition, Data Collection | Abdulrahman | 7/25 |
Methods, Results and Discussion | Hassan | 7/25 |
Create Visualizations | Seo Hyun | 7/23 |
Contribution Table
Task | Owner | Completed? |
---|---|---|
Project Proposal | ||
Writing up requirements | Abdulrahman, Hassan | |
Project timeline and contribution table | Seo Hyun | |
Set Up Github Pages | Seo Hyun | |
Google Slides | Abdulrahman | |
Final Video | Abdulrahman | |
*To be updated with detailed tasks from timeline in the future
Presentation
Video