PERFORMANCE ANALYSIS OF THE SUPPORT VECTOR MACHINE ALGORITHM FOR AUTOMATED TEXT-BASED DETECTION AND CLASSIFICATION OF SPAM EMAIL
Keywords:
SVM, TF-IDF, Spam Email, Text Classification, Machine Learning
Abstract
This study analyzes the performance of the Support Vector Machine (SVM) algorithm for automated, text-based detection and classification of spam email using TF-IDF features. The dataset consists of 5,572 emails that were cleaned and text-normalized. The data were divided into training and test sets with a train-test split, and hyperparameters were then tuned with Grid Search Cross-Validation. The initial SVM model achieved an accuracy of 98.30%, and tuning selected C=1 with a linear kernel as the best configuration. The final evaluation also gives an accuracy of 98.30%, with precision of 0.98 for ham and 0.99 for spam, recall of 1.00 for ham and 0.88 for spam, and F1-scores of 0.99 and 0.93, respectively. The confusion matrix shows 965 ham emails classified correctly and 1 misclassified, and 131 spam emails detected correctly and 18 misclassified. These results indicate that a TF-IDF-based SVM identifies spam email with high accuracy, although the spam recall of 0.88 shows that missed spam is the main remaining source of error. Future work is recommended to integrate word embeddings and ensemble models to improve performance further.
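The reported accuracy, precision, recall, and F1 values can be checked against the confusion matrix counts in the abstract. The following is a minimal sketch, assuming a binary setting in which the 1 misclassified ham email was predicted as spam and the 18 misclassified spam emails were predicted as ham. The function and variable names are illustrative and are not taken from the study's code.

```python
# Confusion matrix counts as reported in the abstract.
HAM_CORRECT = 965   # ham predicted as ham
HAM_WRONG = 1       # ham predicted as spam (assumed direction of the error)
SPAM_CORRECT = 131  # spam predicted as spam
SPAM_WRONG = 18     # spam predicted as ham (assumed direction of the error)


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one class from its raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = HAM_CORRECT + HAM_WRONG + SPAM_CORRECT + SPAM_WRONG
accuracy = (HAM_CORRECT + SPAM_CORRECT) / total

# For the ham class, false positives are spam emails predicted as ham.
ham = precision_recall_f1(tp=HAM_CORRECT, fp=SPAM_WRONG, fn=HAM_WRONG)
# For the spam class, false positives are ham emails predicted as spam.
spam = precision_recall_f1(tp=SPAM_CORRECT, fp=HAM_WRONG, fn=SPAM_WRONG)

print(f"test emails: {total}")
print(f"accuracy:    {accuracy:.2%}")
print("ham  (precision, recall, F1):", tuple(round(v, 2) for v in ham))
print("spam (precision, recall, F1):", tuple(round(v, 2) for v in spam))
```

Under these assumptions the counts sum to 1,115 test emails, and the rounded values agree with the figures reported in the abstract.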




