A Comparative Study of Web Pages Classification Methods Applied to Health Consumer Web Pages

Abstract

The Internet is growing at an exponential rate and now offers information on almost any topic. However, the sheer number of web pages makes it difficult for a user to find the target information efficiently. An efficient method for classifying this huge amount of data is therefore essential if web pages are to be exploited to their full potential. In the domain of automatic web page classification, many approaches have been tried using different machine learning algorithms, including Support Vector Machines (SVM), Naïve Bayes, Decision Trees, K-Nearest Neighbor (K-NN) and Neural Networks. However, there is a lack of comparison between these algorithms aimed at finding a better framework for the classification and analysis of health-related web pages. In this study, we compare two commonly used supervised machine learning algorithms, SVM and Naïve Bayes, for classifying web pages that provide drug-related information for patients, for example side effects, patient actions and follow-up information. We use the Unified Medical Language System (UMLS) to annotate health-related concepts in web pages, and we train SVM and Naïve Bayes classifiers in the General Architecture for Text Engineering (GATE) to distinguish health-related from non-health-related web pages. The evaluation was performed using K-fold cross-validation with four runs on a data set of fifty web pages. The results show that SVM performed better than Naïve Bayes at classifying health-related and non-health-related pages in terms of precision, recall and F-measure.
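
The evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' GATE pipeline: the function names, the "health" label string and the choice of four folds are assumptions made for illustration, and the abstract does not state how the fifty pages were split. The sketch shows only how a data set can be divided into folds and how precision, recall and F-measure are computed from gold and predicted labels.

```python
from typing import Iterator, List, Sequence, Tuple


def k_fold_indices(n_items: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation.

    Items are dealt round-robin into k folds; each fold serves once as the
    test set while the remaining folds form the training set.
    """
    if k < 2 or k > n_items:
        raise ValueError("k must be between 2 and n_items")
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def precision_recall_f1(
    gold: Sequence[str],
    predicted: Sequence[str],
    positive: str = "health",  # hypothetical label name, not from the paper
) -> Tuple[float, float, float]:
    """Compute precision, recall and F-measure for the positive class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Fifty pages split into four folds (four is an assumed value of K).
    for train_idx, test_idx in k_fold_indices(50, 4):
        print(len(train_idx), len(test_idx))

    # Toy labels, invented purely to exercise the metric function.
    gold = ["health", "health", "health", "other", "other"]
    pred = ["health", "health", "other", "health", "other"]
    print(precision_recall_f1(gold, pred))
```

In a real comparison, each classifier would be trained on the training indices of every fold and scored on the held-out indices, with the per-fold scores averaged before the two classifiers are compared.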
