Natural language processing and Machine learning based phishing website detection system

Yazhmozhi V. M. (Lead / Corresponding author), B. Janet

Research output: Contribution to conference › Paper › peer-review

4 Citations (Scopus)


With the rapid growth of the internet, most users have shifted from traditional shopping, banking, and similar activities to their online equivalents. This shift has given rise to many cybercrimes, including phishing. Attackers try to extract sensitive personal details such as user IDs, passwords, and debit or credit card information by disguising themselves as reliable websites. Determining whether the Uniform Resource Locator (URL) of a website is legitimate or phishing is difficult because phishing exploits users' vulnerabilities. Although many products are available for detecting phishing websites, they rely only on heuristic approaches and blacklists, and so cannot prevent phishing effectively. This paper proposes a system that detects phishing websites in real time. It evaluates five classification algorithms with two feature sets, one based on natural language processing and one on word vectors, to identify which combination performs best. After comparing the accuracy of naive Bayes, logistic regression, support vector machine, decision tree, and random forest classifiers on the different features, the random forest algorithm with features based on natural language processing was found to perform best, with an accuracy of 97.99%.
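The abstract does not list the individual features, so the following is only a minimal sketch of what lexical, NLP-style features computed from a URL might look like. The keyword list, the feature names, and the functions `tokenize_url` and `extract_features` are assumptions made for illustration; they are not the authors' feature set.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list for illustration only; not taken from the paper.
SUSPICIOUS_WORDS = {"login", "verify", "secure", "account", "update", "banking"}


def tokenize_url(url):
    """Split a URL into lowercase alphanumeric word tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def extract_features(url):
    """Return a small dict of lexical features for a single URL."""
    host = urlparse(url).netloc
    tokens = tokenize_url(url)
    return {
        "url_length": len(url),
        "token_count": len(tokens),
        "longest_token": max((len(t) for t in tokens), default=0),
        "digit_count": sum(ch.isdigit() for ch in url),
        "dot_count_in_host": host.count("."),
        "has_at_symbol": int("@" in url),
        "has_hyphen_in_host": int("-" in host),
        "suspicious_word_count": sum(t in SUSPICIOUS_WORDS for t in tokens),
    }


if __name__ == "__main__":
    print(extract_features("http://secure-login.example.com/verify?id=123"))
```

A feature dictionary of this kind could be turned into a numeric vector and passed to any of the classifiers named in the abstract, such as a random forest, for training and comparison.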
Original language: English
Number of pages: 5
Publication status: Published - 12 Mar 2020
Event: 2019 Third International conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC) - India, Palladam, India
Duration: 12 Dec 2019 to 14 Dec 2019


Conference: 2019 Third International conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC)


  • Phishing
  • Natural language processing
  • Machine learning
  • Word vectors
  • Classification
  • Cyber security
