Papers by Ayu Purwarianti
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
A. Background: Which part of the signal is a speech artifact? The artifact has unknown characteristics and is difficult to remove; speech artifact contamination makes the EEG data impossible to process. B. Overview: Lip EMG is used to monitor speech artifacts and vertical EOG to monitor eye artifacts (electrode placement is given). Research pipeline: data collection (picture-naming experiment), preprocessing, speech artifact removal (previous methods: SAR-ICA and BSS-CCA; proposed method: tensor-based), evaluation (grand-average correlation with lip EMG during 0-1350 ms), and eye artifact removal. Related works: SAR-ICA (Porcaro et al., 2015) and BSS-CCA (Vos et al., 2010); both proposed speech artifact removal methods based on matrix decomposition, with SAR-ICA outperforming BSS-CCA according to Porcaro et al., 2015.

Vulnerability Detection in PHP Web Application Using Lexical Analysis Approach with Machine Learning
2018 5th International Conference on Data and Software Engineering (ICoDSE), 2018
Security is an important aspect and continues to be a challenging topic, especially for web applications. Today, 78.9% of websites use PHP as their programming language. Because PHP is so popular, web applications written in it tend to have many vulnerabilities, and these are reflected in their source code. Static analysis is a method that can be used to detect vulnerabilities in source code; however, it usually requires an additional method involving expert knowledge. In this paper, we propose a vulnerability detection technique using lexical analysis with machine learning as the classification method. We focus on using native PHP tokens and the Abstract Syntax Tree (AST) as features, manipulating them to obtain the best feature set. We pruned the AST to discard unusable nodes and subtrees, then extracted the node-type tokens with the Breadth First Search (BFS) algorithm. Moreover, unusable PHP tokens are filtered out and the remaining tokens are combined with each other to enrich the features, which are weighted using TF-IDF. These features are used for machine learning classification to find the best features between AST tokens and PHP tokens. The classification methods we used were Gaussian Naïve Bayes (GNB), Support Vector Machine (SVM), and Decision Tree. As a result, we obtained the highest recall score of 92% with PHP tokens as features and Gaussian Naïve Bayes as the classification method.
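The AST-pruning-and-BFS step can be sketched as below. This is a minimal illustration, assuming a dict-based tree representation; the node-type names and the toy AST are hypothetical, not the paper's actual PHP AST format (which would come from a real PHP parser).

```python
from collections import deque

def bfs_node_types(root):
    """Collect node-type tokens from a (pruned) AST in breadth-first order."""
    tokens = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        tokens.append(node["type"])
        queue.extend(node.get("children", []))
    return tokens

# Toy AST for an echo-a-request-parameter statement (illustrative only).
ast = {
    "type": "stmt_echo",
    "children": [
        {"type": "expr_dim", "children": [
            {"type": "var_superglobal", "children": []},
            {"type": "scalar_string", "children": []},
        ]},
    ],
}
print(bfs_node_types(ast))
```

The resulting token sequence is what would then be weighted with TF-IDF and fed to the classifier.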

We propose an automated estimation scheme to analyze question classification in Indonesian multi-closed-domain question answering systems. The goal is to provide a good question classification system even when using only the available language resources. Our strategy is to build patterns and rules to extract important words and to use the results as features for learning-based question classification. The scenarios designed for the automated learning estimation are: (i) analyzing questions, to represent the key information needed to answer user questions using target focus and target identification; (ii) classifying the question type, constructing a taxonomy of questions coded into the system to determine the expected answer type through question-processing patterns and rules. The proposed method is evaluated using datasets collected from various Indonesian websites. Test results show that the classification process using the proposed method is ve...
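The pattern-and-rule step for determining the expected answer type can be sketched as follows. The rules here are a small hypothetical subset based on common Indonesian interrogatives; the paper's actual patterns and taxonomy are not reproduced.

```python
import re

# Hypothetical question-word rules mapping Indonesian interrogatives to
# expected answer types; illustrative only, not the paper's rule set.
RULES = [
    (re.compile(r"\bsiapa\b", re.IGNORECASE), "PERSON"),       # "who"
    (re.compile(r"\bdi ?mana\b", re.IGNORECASE), "LOCATION"),  # "where"
    (re.compile(r"\bkapan\b", re.IGNORECASE), "TIME"),         # "when"
    (re.compile(r"\bberapa\b", re.IGNORECASE), "QUANTITY"),    # "how many/much"
]

def classify_question(question):
    """Return the expected answer type of the first matching rule."""
    for pattern, answer_type in RULES:
        if pattern.search(question):
            return answer_type
    return "OTHER"

print(classify_question("Siapa presiden pertama Indonesia?"))
```

In a learning-based setup, such rule outputs would be one feature among others rather than the final label.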

Deep Learning for Dengue Fever Event Detection Using Online News
2020 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), 2020
Dengue fever is currently a hyperendemic infectious disease in Indonesia. The ability to detect dengue fever events early is essential for a timely and effective response to prevent outbreaks. This paper presents dengue fever event detection using online news. A previous study conducted event detection from sentences using word-frequency bursts to detect an ongoing event. However, news articles do not only report the event itself (i.e., a dengue fever case) but also general information about the disease. This paper focuses on detecting dengue fever events in online news and reports an assessment of different deep learning models. Using k-fold cross-validation, a convolutional neural network (CNN) achieved the best performance (on average, test accuracy: 80.019%, precision: 78.561%, recall: 77.747%, and F1-score: 77.234%).
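The k-fold cross-validation used for evaluation can be sketched in plain Python. The fold count and sample count here are illustrative, not the paper's actual split.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds; each fold serves
    once as the test set while the remaining samples form the training set."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((train, test))
        start += size
    return splits

# 10 samples, 5 folds: each test fold holds 2 indices.
for train, test in k_fold_indices(10, 5):
    print(test)
```

Reported metrics are then averaged over the k test folds, as in the figures quoted above.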

2020 7th International Conference on Advance Informatics: Concepts, Theory and Applications (ICAICTA), 2020
Compared to English, the amount of labeled data for Indonesian text classification tasks is very small. Recently developed multilingual language models have shown their ability to create effective multilingual representations. This paper investigates the effect of combining English and Indonesian data when building Indonesian text classifiers (e.g., sentiment analysis and hate speech detection) using multilingual language models. Using the feature-based approach, we observe performance across various data sizes and amounts of added English data. The experiments showed that adding English data improves performance, especially when the amount of Indonesian data is small. Using the fine-tuning approach, we further showed the effectiveness of utilizing English data to build Indonesian text classification models.
This paper describes the implementation of out-of-vocabulary (OOV) word recognition in an Indonesian speech recognition application. OOV word recognition is important because the problem cannot be solved simply by enlarging the dictionary. To implement OOV word recognition, phoneme-to-word transduction is performed. Words are classified by consulting the language model and the phoneme-change probabilities to determine which parts are OOV words. This paper also evaluates several kinds of dictionaries used in the speech recognition system. Modifying the dictionary of the Indonesian speech recognizer yielded an improvement of about 4%, while the OOV detection accuracy was about 77%.

Inveo, a management information system for emissions inventory e-administration in Indonesia
2017 2nd International conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE), 2017
An emissions inventory is a collection of data accounting for the amount of pollutants released into the atmosphere, and it can be useful for monitoring air pollution in an area. Many countries have already developed systems to manage emissions inventories, but Indonesia has not. We have developed Inveo, a management information system for emissions inventory in Indonesia. As an e-administration system, Inveo aims to facilitate the process of collecting and managing regional emissions inventory data while also providing data and information to the public. The requirement analysis in this paper is based on surveys and on analyzing existing systems and research. Point, area, and road emissions inventory sources are modeled as points. Inveo can manage several kinds of emissions inventory data, such as emissions, locations, roads, surveys, facilities, fuels, map legend bounds, and user accounts. The system provides data and information to the public, visualized by emission map, gra...

Our Indonesian-English Cross Language Question Answering (CLQA) system is divided into 4 components: a question analyzer, a keyword translator, a passage retriever, and an answer finder. The Indonesian question is input to the question analyzer, which yields an Indonesian keyword list, the Indonesian question focus, and the question class. We determine the question class using an SVM implemented in Weka [15]. Because Indonesian is a resource-poor language, we use a bigram frequency feature as an additional feature for question classification. The Indonesian keywords are translated into English using an Indonesian-English bilingual dictionary. The English translations are composed into a boolean query to retrieve relevant passages, and we select the passages with the 3 highest IDF scores. In the answer finder, the answer is located using an SVM method for text chunking implemented in Yamcha [4]. Unlike other Indonesian-English CLQA systems [1,14], we do not tag the named entities in the targ...

Experiments on coreference resolution for Indonesian language with lexical and shallow syntactic features
2017 5th International Conference on Information and Communication Technology (ICoIC7), 2017
We built an Indonesian coreference resolution system that resolves not only pronouns referring to proper nouns, but also proper nouns to proper nouns and pronouns to pronouns. The differences from the existing Indonesian coreference resolution work lie in the problem scope and the features. We conducted experiments using various lexical and shallow syntactic features, such as an appositive feature, a nearest-candidate feature, a direct-sentence feature, previous- and next-word features, and a first-person lexical feature. We also modified the method for building the training set by selecting negative examples through cross-pairing every markable that appears between an antecedent and its anaphor. We compared this against two existing methods for building the training set, running experiments with the C4.5 algorithm. Using 200 news sentences, the best experiment achieved a 71.6% F-measure.

Monitoring Gadget Usage Behavior Among Adolescents Using Machine Learning
This study has a long-term goal, namely to reduce the negative impact of gadget use among adolescents. By giving teenagers awareness and the ability to control their gadget use, adolescents are expected to become more productive and to act as intelligent users of information technology. As an economic product, the developed software can be marketed to various educational institutions, such as junior high schools or universities, where parents or schools will receive a monitoring report on adolescents' gadget use. The methods used in this study include artificial intelligence techniques (machine learning) for various models classifying the user's text/speech/video/type; User Centered Design techniques for application development; and several social-humanities techniques such as desk studies, focus group discussions, and surveys/questionnaires/interviews. The result of the first year of research to date is software to monitor user behavior on the g...

2018 International Conference on Asian Language Processing (IALP), 2018
Research on Indonesian named entity (NE) taggers has been conducted for years. However, most of it did not use deep learning and instead employed traditional machine learning algorithms such as association rules, support vector machines, random forests, naïve Bayes, etc. In that research, word lists serving as gazetteers or clue words were provided to enhance accuracy. Here, we employ deep learning in our Indonesian NE tagger. We use long short-term memory (LSTM) as the topology, since it is the state of the art for NE tagging. By using LSTM, we do not need a word list to enhance accuracy. We investigate two main things. The first is the output layer of the network: Softmax vs. conditional random field (CRF). The second is the use of a part-of-speech (POS) tag embedding input layer. Using 8400 sentences as training data and 97 sentences as evaluation data, we find that using POS tag embeddings as an additional input improves the performance of our Indonesian NE tagger. As for the comparison between Softmax and CRF, we find that both architectures have weaknesses in classifying an NE tag.

Comparative Study of Covid-19 Tweets Sentiment Classification Methods
2021 9th International Conference on Information and Communication Technology (ICoICT), 2021
Covid-19 is a disease caused by a virus that has become a pandemic in many countries around the world. The disease affects not only public health but also other aspects of life. People tend to write comments on social media about things happening during the pandemic, one such platform being Twitter. Sentiment analysis on Twitter data is not an easy task due to the characteristics of tweet text, which is user-generated content. Therefore, in this paper, a sentiment analysis study is carried out on Twitter data using three schemes: the vector space model (Bag of Words and TF-IDF) with a Support Vector Machine, word embeddings (word2vec and GloVe) with Long Short-Term Memory, and BERT (Bidirectional Encoder Representations from Transformers). Based on the conducted experiments, BERT achieved the best performance of the three schemes, reaching 0.85 (weighted F1-score) and 0.83 (macro F1-score) for the classification of three sentiment classes on Kaggle competition data (Coronavirus tweets NLP – Text Classification).

International Journal on Electrical Engineering and Informatics, 2020
Studies on human-machine interaction systems show positive results in system development accuracy. However, there are problems, especially with certain input modalities such as speech, gesture, face detection, and skeleton tracking. These problems include how to design an interface that lets a machine contextualize an ongoing conversation, how to activate the system using various modalities, choosing the right multimodal fusion methods, machine understanding of human intentions, and methods for developing knowledge. This study developed a method for a human-machine interaction system. It involved several stages, including a multimodal activation system; methods for recognizing the speech, gesture, face detection, and skeleton tracking modalities; multimodal fusion strategies; understanding human intent and Indonesian dialogue systems; and methods for developing machine knowledge and the right response. The research contributes to an easier and more natural human-machine inte...

Development of sentiment classification system for Indonesian public policy tweet
2015 International Conference on Computer, Control, Informatics and its Applications (IC3INA), 2015
We propose a sentiment classification system for Indonesian public policy tweets. The system consists of two subsystems: relevant tweet classification and tweet sentiment classification. Using Indonesian public policy tweets, we conducted experiments to measure the performance of each subsystem and of their combination. The purpose of the experiments was to find the best feature and algorithm for each subsystem. We emphasize a clustering technique for relevant tweet classification and a supervised learning algorithm for sentiment classification. The best setting for the clustering technique uses the K-means algorithm with a 2-gram feature. The best setting for tweet sentiment classification uses the maximum entropy algorithm with a 1-gram feature, achieving an accuracy of 71.62%.
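The n-gram features mentioned above can be sketched as follows; the example tweet is illustrative, not from the paper's dataset.

```python
def word_ngrams(tokens, n):
    """Extract word n-gram features, e.g. the 2-grams used in the
    K-means relevance-clustering step described above."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tweet = "tarif baru transportasi umum".split()
print(word_ngrams(tweet, 2))
```

With n=1 the same function yields the unigram features used in the sentiment classification step.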

Utterance disfluency handling in Indonesian-English machine translation
2016 International Conference On Advanced Informatics: Concepts, Theory And Application (ICAICTA), 2016
We propose a hybrid technique for handling utterance disfluency in Indonesian-English machine translation. The handling is done as a preprocessing step in the machine translation system. In the preprocessing, we classify utterance disfluencies using a combination of statistical and rule-based techniques. We use 4 types of disfluency: filler (for filled pauses and discourse markers), rough copy, non-copy, and stutter. For the statistical part, we employ lexical features and the CRF algorithm. In the disfluency classification experiment, we compared the statistical-only method against the hybrid method; the results show that the hybrid method achieves higher performance. The disfluency classification result is then used in machine translation and improves the BLEU score of Indonesian-English machine translation.
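The rule-based half of such a hybrid can be sketched as a token tagger. The filler lexicon and the single-repetition stutter rule below are illustrative assumptions, far simpler than the paper's actual rules and CRF component.

```python
# Hypothetical filler lexicon (Indonesian filled pauses / discourse
# markers); the paper's actual word lists are not reproduced here.
FILLERS = {"eh", "em", "anu", "gitu"}

def tag_disfluency(tokens):
    """Label each token as FILLER, STUTTER (immediate repetition of the
    previous token), or FLUENT."""
    tags = []
    for i, tok in enumerate(tokens):
        if tok.lower() in FILLERS:
            tags.append("FILLER")
        elif i > 0 and tok.lower() == tokens[i - 1].lower():
            tags.append("STUTTER")
        else:
            tags.append("FLUENT")
    return tags

print(tag_disfluency("saya saya mau eh pergi".split()))
```

Tokens tagged as disfluent would be dropped before the sentence is passed to the translation system.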

Indonesian-Japanese term extraction from bilingual corpora using machine learning
2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2015
As the bilateral relationship between Indonesia and Japan strengthens, the need for consistent term usage in both languages becomes important. In this paper, a new method for Indonesian-Japanese term extraction is presented. In general, it works in 3 steps: (1) n-gram extraction for each language, (2) n-gram cross-pairing between the two languages, and (3) classification. The method is designed to handle term extraction from both parallel and comparable corpora. To use it, we first have to build a classification model using machine learning. We consider 4 types of features: dictionary-based features, cognate-based features, combined features, and a statistical feature. The first three are linguistic features. Dictionary-based features consider word-pair existence in a predefined dictionary, cognate-based features consider morpheme-level similarity, combined features consider dictionary-based and cognate-based features together, and the statistical feature is used in case the first three fail. The only statistical feature we use is context-heterogeneity similarity, which considers the variety of words that can precede or follow a term. As the learning algorithm, we use an SVM (Support Vector Machine). In the experiments, we compared several scenarios: only linguistic features, only the statistical feature, or both combined. The classification model was built from parallel corpora, since plenty of term pairs can be extracted from them. The training data consisted of 5,000 term pairs. The best result was achieved using only linguistic features and without the preprocessing step, with accuracy up to 90.98% and recall of 92.14%. Testing on comparable corpora was also done with 37,392 term pairs, of which 94 were equivalent translations and 37,298 were not. Evaluation on this test set gave a precision of 98.63%, but a low recall of 24.47%.
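Steps (1) and (2) of the pipeline, n-gram extraction and cross-pairing, can be sketched as below; the example terms are illustrative, and in the real method each pair would then be scored by the SVM classifier rather than kept wholesale.

```python
from itertools import product

def extract_ngrams(tokens, max_n):
    """Step (1): all n-grams up to length max_n from one language's text."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def cross_pair(id_tokens, ja_tokens, max_n=2):
    """Step (2): pair every Indonesian n-gram with every Japanese n-gram;
    step (3), classification, decides which pairs are true term pairs."""
    return list(product(extract_ngrams(id_tokens, max_n),
                        extract_ngrams(ja_tokens, max_n)))

pairs = cross_pair(["ekonomi", "digital"], ["デジタル", "経済"])
print(len(pairs))  # 3 Indonesian n-grams x 3 Japanese n-grams = 9
```

The extreme class imbalance in the comparable-corpus test set (94 positives out of 37,392 pairs) follows directly from this exhaustive pairing.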

Comparison on the rule based method and statistical based method on emotion classification for Indonesian Twitter text
2015 International Conference on Information Technology Systems and Innovation (ICITSI), 2015
In this study, we conducted experiments on emotion classification of Indonesian Twitter text. For these experiments, we built a corpus of 7622 labeled tweets taken from 69 Twitter accounts, manually labeled by 5 native speakers. We used 6 basic emotion labels (anger, disgust, fear, joy, sadness, surprise) and added one neutral emotion class. We compared a rule-based method with a statistical method. In the rule-based method, we employed the existing Synesketch algorithm with two types of emotion word lists: a manually written list and a translated WordNet-Affect list. In the statistical method, we employed the SVM (Support Vector Machine) algorithm with unigram features and the Information Gain and Minimum Frequency feature selection algorithms. Besides the purely statistical method, we also employed the manually built emotion word list in the SVM-based classification. In the text preprocessing, we compared several methods such as normalization, emoticon conversion, stop-word removal, number removal, and one-character token removal. The experimental results showed that the statistical method's accuracy of 71.740% is higher than the rule-based method's 63.172%. To enhance accuracy, we employed SMOTE to handle the imbalanced data and achieved the best result with an F-measure of 83.203%. In another experiment, we combined the purely statistical method with the rule-based method by adding the manual word list to the classification features; the F-measure for this experiment reached only 81.592%.
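The SMOTE step used to handle the class imbalance can be sketched on numeric feature vectors. This is a simplified, stdlib-only illustration of the interpolation idea behind SMOTE (Chawla et al.), not the paper's implementation; the sample points and parameters are made up.

```python
import random

def smote_oversample(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # other minority samples, nearest (squared distance) first
        others = sorted((s for s in minority if s is not x),
                        key=lambda s: sum((a - b) ** 2 for a, b in zip(x, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (3.0, 3.1)]
print(smote_oversample(minority, 2))
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class stays inside its original feature region.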

Filled Pause Detection in Indonesian Spontaneous Speech
Communications in Computer and Information Science, 2016
Detecting filled pauses in spontaneous speech recognition is very important, since most speech is spontaneous and the most frequent such phenomenon in Indonesian spontaneous speech is the filled pause. This paper discusses the detection of filled pauses in Indonesian spontaneous speech by utilizing acoustic features of the speech signal. Detection was conducted with statistical methods using the Naive Bayes, Classification Tree, and Multilayer Perceptron algorithms. To build the model, speech data were collected from an entertainment program; word parts in the data were labeled and their features extracted, including formant and pitch stability, energy drop, and duration. Half an hour of sentences containing 295 filled-pause and 2082 non-filled-pause words was used as training data. Using 25 sentences as testing data, Naive Bayes gave the best detection correctness: 74.35% on a closed data set and 71.43% on an open data set.
Indonesian medical sentence transformation for question generation
2015 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob), 2015
We present a novel sentence transformation scheme for an Indonesian medical question generation (ImeQG) system that makes effective use of documents for information navigation. In the proposed ImeQG method, we run a general dependency analysis procedure to extract verbs and relevant phrases, and generate natural sentences by applying transformation rules. For this purpose, we defined some P-A templates based on a statistical measure. An experimental evaluation of the proposed method showed 79.00% precision, 87.80% recall, and 81.50% F1.
Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions - ACL '07, 2007
We propose a novel method to expand a small existing translation dictionary into a large one using a pivot language. Our method depends on the assumption that a pivot language can be found for a given language pair such that there is both a large translation dictionary from the source language to the pivot language and a large translation dictionary from the pivot language to the destination language. Experiments expanding an Indonesian-Japanese dictionary using English as the pivot language show that the proposed method can improve the performance of a real CLIR system.
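The core composition through the pivot language can be sketched as follows. The dictionary entries are illustrative, and the crucial filtering of wrong candidates introduced by pivot-word polysemy (which the paper's method addresses) is omitted here.

```python
def expand_via_pivot(src_to_pivot, pivot_to_dst):
    """Compose two large dictionaries through a pivot language to propose
    candidate entries for a small source-to-destination dictionary."""
    expanded = {}
    for src_word, pivot_words in src_to_pivot.items():
        candidates = []
        for p in pivot_words:
            candidates.extend(pivot_to_dst.get(p, []))
        if candidates:
            expanded[src_word] = sorted(set(candidates))
    return expanded

# Toy Indonesian->English and English->Japanese dictionaries.
id_to_en = {"buku": ["book"], "cepat": ["fast", "quick"]}
en_to_ja = {"book": ["本"], "fast": ["速い"], "quick": ["速い", "素早い"]}
print(expand_via_pivot(id_to_en, en_to_ja))
```

Note how "cepat" already collects candidates from two pivot words; with a polysemous pivot word, some candidates would be wrong, which is why a filtering step is needed in practice.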