Abstract:
Researchers have collected Twitter data to study a wide range of topics, including natural
disasters. Existing research has developed a social network sensor that separates natural disaster
information from direct eyewitnesses, natural disaster information from non-eyewitnesses, and
non-natural-disaster information. It can be used as a tool for early warning or monitoring when
natural disasters occur. The main
component of the social network sensor is tweet text classification. As in text classification
research in general, the challenge lies in the feature extraction method that converts Twitter text
into structured data. The commonly used strategy is a vector space representation, which can
produce high-dimensional data. This research focuses on a feature extraction method that
addresses the high-dimensionality problem. We propose a hybrid approach of word2vec-based and
lexicon-based feature extraction to produce new features. The experimental results show that the
proposed method uses fewer features and improves classification performance, reaching an average
AUC value of 0.84 with 150 features. This value was obtained using only the word2vec-based
method. Ultimately, this research shows that the lexicon-based features did not contribute to
improving the performance of social network sensor predictions for natural disasters.
Keywords: feature extraction, natural disaster, text classification, word2vec, lexicon
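
The hybrid feature extraction summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how word vectors are aggregated per tweet or which lexicons are used, so the averaging step, the toy 3-dimensional embeddings (standing in for a trained word2vec model), the two lexicons, and all function names below are assumptions made for illustration only.

```python
from typing import Dict, List, Set

# Hypothetical toy embeddings standing in for a trained word2vec model.
# A real model would use many more dimensions; 3 keeps the example readable.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "flood": [0.9, 0.1, 0.0],
    "water": [0.8, 0.2, 0.1],
    "street": [0.1, 0.7, 0.2],
    "movie": [0.0, 0.1, 0.9],
}
EMBEDDING_DIM = 3

# Hypothetical lexicons; the actual word lists are not given in the abstract.
DISASTER_LEXICON: Set[str] = {"flood", "earthquake", "landslide"}
EYEWITNESS_LEXICON: Set[str] = {"i", "my", "we", "here"}


def mean_embedding(tokens: List[str]) -> List[float]:
    """Average the vectors of known tokens (assumed aggregation step)."""
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        # No known token: fall back to a zero vector of the same size.
        return [0.0] * EMBEDDING_DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def lexicon_features(tokens: List[str]) -> List[float]:
    """Count how many tokens fall into each lexicon."""
    lexicons = [DISASTER_LEXICON, EYEWITNESS_LEXICON]
    return [float(sum(t in lexicon for t in tokens)) for lexicon in lexicons]


def hybrid_features(text: str) -> List[float]:
    """Concatenate embedding-based and lexicon-based features for one tweet."""
    tokens = text.lower().split()
    return mean_embedding(tokens) + lexicon_features(tokens)
```

With this layout the feature count is fixed at the embedding dimension plus the number of lexicons, regardless of vocabulary size, which is the property that keeps the representation low-dimensional compared with a vector space model.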