Papers by Dimitar Kazakov

NLP Text Classification for COVID-19 Automatic Detection from Radiology Report in Indonesian Language

2022 5th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI)

NLP Analysis of COVID-19 Radiology Reports in Indonesian using IndoBERT

2022 4th International Conference on Biomedical Engineering (IBIOMED)

Predicting User Preferences with XGBoost Learning to Rank Method

2020 3rd International Seminar on Research of Information Technology and Intelligent Systems (ISRITI)

Learning user preferences has become very important as personalization systems grow rapidly in the current era. Offering special, personalised services can be an added value for companies seeking to maintain customer loyalty. Building a personalized recommender requires a good machine learning model of individual preferences. Each user can then be presented with a list of items sorted by a score learned from those preferences, so that the first few items shown are the ones the user likes most. The Learning to Rank algorithms from Information Retrieval can be borrowed to solve this problem. In this paper, we present an implementation of user preference learning using the XGBoost Learning to Rank method in the movie domain. We evaluate three different Learning to Rank approaches according to their Normalized Discounted Cumulative Gain (NDCG) scores and conclude that, in our case study, the pairwise approach appears to be the best solution for producing a personalized list of recommendations.
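Below is a minimal sketch of the pairwise Learning to Rank setup described in the abstract, using the XGBRanker interface from the xgboost library and scikit-learn's NDCG metric. The synthetic users, movies, feature count and hyperparameters are illustrative assumptions, not the paper's actual experimental setup.

```python
# A sketch of pairwise Learning to Rank with XGBoost on synthetic preference data.
import numpy as np
import xgboost as xgb
from sklearn.metrics import ndcg_score

rng = np.random.default_rng(0)

# 20 users ("queries"), 15 candidate movies each, 8 item/user features (all illustrative).
n_users, n_items, n_feats = 20, 15, 8
X = rng.normal(size=(n_users * n_items, n_feats))
y = rng.integers(0, 5, size=n_users * n_items)   # graded relevance (0-4 "stars")
group = [n_items] * n_users                      # number of items per user, in order

ranker = xgb.XGBRanker(
    objective="rank:pairwise",   # the pairwise approach highlighted in the abstract
    n_estimators=200,
    learning_rate=0.1,
)
ranker.fit(X, y, group=group)

# Score the first user's candidate list and evaluate the ranking with NDCG.
scores = ranker.predict(X[:n_items])
print("NDCG:", ndcg_score([y[:n_items]], [scores]))
```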

Perspectives Generation via Multi-Head Attention Mechanism and Common-Sense Knowledge

International Journal on Cybernetics & Informatics

Consideration of multiple viewpoints on a contentious issue is critical for avoiding bias and assisting in the formulation of rational decisions. We observe that the current model imposes a constraint on diversity: the conventional attention mechanism is biased toward a single semantic aspect of the claim, whereas the claim may contain multiple semantic aspects. Additionally, disregarding common-sense knowledge may result in generating perspectives that violate known facts about the world. The proposed approach is divided into two stages: the first stage considers multiple semantic aspects, which results in more diverse generated perspectives; the second stage improves the quality of the generated perspectives by incorporating common-sense knowledge. We train the model at each stage using reinforcement learning and automated metric scores. The experimental results demonstrate the effectiveness of our proposed model in generating a broader range of perspectives on a contentious issue.

An XGBoost Model for Age Prediction from COVID-19 Blood Test

2021 4th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), 2021

COVID-19 was declared a pandemic by the World Health Organization (WHO) in March 2020. Many studies have found that specific age groups have a higher risk of contracting the disease. The gold-standard test for the disease is a condition-specific test based on Reverse-Transcriptase Polymerase Chain Reaction (RT-PCR). We have previously shown that the results of a standard suite of non-specific blood tests can be used to indicate the presence of a COVID-19 infection with high likelihood. We continue our research in this area with a study of the connection between patients' routine blood test results and their age. Predicting a person's age from blood chemistry is not new in health science; most often, such results are used to detect signs of diseases associated with ageing and to develop new medications. The experiment described here shows that the XGBoost algorithm can be used to predict patients' age from their routine blood tests. The performance evaluation is very satisfactory, with R² > 0.80 and a normalized RMSE below 0.1.
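A minimal sketch of the age-regression experiment described above, using XGBoost and the same metrics (R² and normalized RMSE). The synthetic data and feature count stand in for the study's real blood-test panel, and the hyperparameters are illustrative assumptions.

```python
# Sketch: predict age from blood-test-like features with XGBoost regression.
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(42)
n, d = 1000, 12                      # patients x blood-test features (illustrative)
X = rng.normal(size=(n, d))
age = 40 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=5, size=n)  # synthetic ages

X_tr, X_te, y_tr, y_te = train_test_split(X, age, test_size=0.2, random_state=0)
model = XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=4)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
rmse = mean_squared_error(y_te, pred) ** 0.5
print("R^2:", r2_score(y_te, pred))
print("normalized RMSE:", rmse / (y_te.max() - y_te.min()))   # RMSE scaled by the age range
```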

User Preference Modelling from Movie Database

2020 International Conference on ICT for Smart Society (ICISS), 2020

The film industry has become one of the largest economic sectors worldwide and has a significant impact on the global economy. Many movies are produced each year, yet some still fail to break even. Players in this industry need to think carefully before spending large sums to make a great and commercially successful movie, and attempts should be made to estimate a movie's likely performance before it is released. On the audience side, ratings provide an indicator of whether a movie is good, and the industry can use these rating systems as one factor when evaluating past movie performance in order to make better movies in future productions. This study aims to predict movie ratings and to find the features that correlate most strongly with them. We performed an experiment on an IMDb dataset containing metadata for 5,043 movies using four machine learning models: Linear Regression (LR), Decision Tree (DT), Random Forest (RF) and K-Nearest Neighbours (KNN). In this experiment, Random Forest achieved the best performance of the four models.
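A minimal sketch of the four-model comparison described in the abstract, using scikit-learn. The file name "movie_metadata.csv" is a hypothetical local copy of the IMDb dataset, and the column names listed are assumptions; substitute the actual feature set used in the study.

```python
# Sketch: compare LR, DT, RF and KNN for predicting IMDb scores.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("movie_metadata.csv")                           # hypothetical local file
features = ["duration", "budget", "gross", "num_voted_users"]    # assumed column names
df = df.dropna(subset=features + ["imdb_score"])
X, y = df[features], df["imdb_score"]

models = {
    "LR": LinearRegression(),
    "DT": DecisionTreeRegressor(random_state=0),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean cross-validated R^2 = {r2:.3f}")
```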

A Novel Model for Enhancing Fact-Checking

Lecture Notes in Networks and Systems, 2021

Fact-checking is the task of capturing the relation between a claim and its evidence (premise) in order to decide whether the claim is true. Detecting the factuality of a claim, as in fake news detection, based only on news knowledge, e.g., the evidence text, is generally inadequate, since fake news is intentionally written to mislead readers. Most previous models for this task rely on the claim and evidence arguments alone as input, and the systems sometimes fail to detect the relation, particularly for ambiguous information. This study aims to improve fact-checking by incorporating a warrant as a bridge between the claim and the evidence that explains why the evidence supports the claim: if a warrant links the claim to the evidence, the relation is supporting; if not, it is either irrelevant or attacking, so warrants apply only to supporting the claim. To address the semantic gap between a claim-evidence pair, a model is developed that detects the relation based on warrants extracted from existing structured data. For warrant selection, knowledge-based and style-based prediction models are merged to capture more of the information needed to infer which warrant best bridges the claim and the evidence. Picking a reasonable warrant can help alleviate the evidence-ambiguity problem when the proper relation cannot otherwise be detected. Experimental results show that incorporating the best warrant into the fact-checking model improves its performance.

Warrant Generation through Deep Learning

Natural Language Computing, 2021

The warrant element of the Toulmin model is critical for fact-checking and for assessing the strength of an argument. As implicit information, warrants justify arguments and explain why the evidence supports the claim. Despite the critical role warrants play in argument comprehension, most existing work aims to select the best warrant from existing structured data, and labelled data is scarce; this presents a fact-checking challenge, particularly when the evidence is insufficient or the conclusion cannot be inferred or generated well from the evidence. Additionally, deep learning methods for false-information detection face a significant bottleneck because they require large amounts of labelled training data, while manually annotating data is a time-consuming and laborious process. We therefore examine the extent to which warrants can be retrieved or reconfigured using unstructured data obtained from their premises.

Equation Discovery for Macroeconomic Modelling

Proceedings of the International Conference on Agents and Artificial Intelligence, 2009

This article describes a machine-learning-based approach to acquiring empirical forecasting models. The approach makes use of the LAGRAMGE equation discovery tool to define a potentially very wide range of equations to be considered for the model. Importantly, the equations can vary in the number of terms and the types of functors linking the variables. The parameters of each competing equation are fitted automatically, allowing the tool to compare the models. Analysts using the tool can exercise their judgement twice: once when defining the equation syntax, thereby restricting the search to a space known to contain several types of theoretically motivated models, and again when choosing among the list of best-fitting models, as these can be structurally very different while providing similar fits to the data. Here we describe experiments with macroeconomic data from the Euro area for the period 1971-2007, in which the parameters of hundreds of thousands of structurally different equations are fitted and the equations compared to produce the best models for the individual cases considered. The results show the approach is able to produce complex non-linear models with several equations showing high fidelity.
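The sketch below illustrates the equation-discovery idea in miniature: enumerate structurally different candidate equations, fit each one's free parameters, and rank the fits. It is not LAGRAMGE and does not use its grammar formalism; the candidate forms and the synthetic series are illustrative assumptions.

```python
# Sketch: compare structurally different equations after fitting their parameters.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)
x = np.linspace(0.1, 5, 200)
y = 2.0 * x + 0.5 * np.sin(3 * x) + rng.normal(scale=0.1, size=x.size)  # "observed" series

candidates = {
    "y = a*x + b":            lambda x, a, b: a * x + b,
    "y = a*x^2 + b*x + c":    lambda x, a, b, c: a * x**2 + b * x + c,
    "y = a*x + b*sin(c*x)":   lambda x, a, b, c: a * x + b * np.sin(c * x),
    "y = a*log(x) + b":       lambda x, a, b: a * np.log(x) + b,
}

results = []
for name, f in candidates.items():
    try:
        params, _ = curve_fit(f, x, y, maxfev=10000)
        sse = np.sum((f(x, *params) - y) ** 2)      # goodness of fit for ranking
        results.append((sse, name, params))
    except RuntimeError:
        pass                                         # skip structures that fail to converge

for sse, name, params in sorted(results):
    print(f"SSE={sse:8.3f}  {name}  params={np.round(params, 3)}")
```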

Integrating Time Series with Social Media Data in an Ontology for the Modelling of Extreme Financial Events
This article describes a novel dataset aiming to provide insight into the relationship between stock market prices and news on social media, such as Twitter. While several financial companies advertise that they use Twitter data in their decision processes, it has been hard to demonstrate whether online postings can genuinely affect market prices. By focussing on an extreme financial event that unfolded over several days and had dramatic and lasting consequences, we have aimed to provide data for a case study that could address this question. The dataset contains the stock market prices of Volkswagen, Ford and the S&P500 index for the period immediately preceding and following the discovery that Volkswagen had found a way to manipulate in its favour the results of pollution tests for their diesel engines. We also include a large number of relevant tweets from this period, alongside key phrases extracted from each message, with the intention of providing material for subsequent sentiment analysis. All data is represented as an ontology in order to facilitate its handling and to allow the integration of other relevant information, such as the link between a subsidiary company and its holding company, or the names of senior management and their links to other companies.
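A minimal sketch of how a price observation and a tweet might be expressed as RDF triples with the rdflib library. The namespace, property names and literal values are placeholders invented for illustration, not the dataset's actual ontology schema.

```python
# Sketch: represent one price observation and one tweet as linked RDF triples.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

EX = Namespace("http://example.org/finance#")    # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# A stock-price observation for Volkswagen on a given day (placeholder values).
obs = EX["vw_price_observation_1"]
g.add((obs, RDF.type, EX.PriceObservation))
g.add((obs, EX.company, EX.Volkswagen))
g.add((obs, EX.closingPrice, Literal(100.0, datatype=XSD.decimal)))
g.add((obs, EX.date, Literal("2015-09-18", datatype=XSD.date)))

# A tweet linked to the same company, with an extracted key phrase.
tweet = EX["tweet_1"]
g.add((tweet, RDF.type, EX.Tweet))
g.add((tweet, EX.mentions, EX.Volkswagen))
g.add((tweet, EX.keyPhrase, Literal("emissions scandal")))

print(g.serialize(format="turtle"))
```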

Lecture Notes in Computer Science, 1999

There is a history of research focussed on the learning of shift-reduce parsers from syntactically annotated corpora by means of machine learning techniques based on logic. The presence of lexical semantic tags in the treebank has proved useful for the learning of semantic constraints limiting the amount of nondeterminism in the parsers. The granularity of the semantic tags used is of direct importance to that task. The combination of the system Lapis with the lexical resource WordNet described here makes it possible to learn parsers while shifting the responsibility for the choice of semantic tags from the corpus annotator to the learning system. The method is tested on an original corpus, also described herein.

Lecture Notes in Computer Science, 2003

The paper surveys some of the mechanisms that have been demonstrated to be relevant for evolving communication systems in software simulations or robotic experiments. In each case, precursors or parallels with work in the study of artificial life and adaptive behaviour are discussed.

Lecture Notes in Computer Science, 2001

2006 IEEE International Conference on Evolutionary Computation

This paper reports the results of a study of a specific type of concurrency in the Ant Colony System (ACS) algorithm. Studies of Cellular Automata (CA) have shown that the update mechanism used can have a dramatic influence on the dynamics of the CA. ACS is usually implemented with a sequential update mechanism. A new method for controlling the concurrency in a nature-inspired algorithm is introduced. Comprehensive tests on a wide range of problem instances are reported. The study found that concurrency levels had no statistically significant effect on ACS performance. This result is interesting because it contradicts what has been observed in another form of nature-inspired algorithm, namely CAs.
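The toy sketch below contrasts the two update modes at issue: a sequential update, where each ant sees the pheromone left by the ants that moved before it, and a concurrent (synchronous) update, where all ants decide on the same pheromone snapshot. The problem size, move rule and parameters are illustrative assumptions, not the paper's ACS implementation or benchmark instances.

```python
# Sketch: sequential vs. synchronous local pheromone updates on a toy edge set.
import numpy as np

rng = np.random.default_rng(0)
n_edges, n_ants = 10, 5
rho, tau0 = 0.1, 1.0                        # local evaporation rate and reset level
tau_init = rng.uniform(0.5, 1.5, n_edges)   # toy pheromone levels on 10 edges

def choose_edge(tau):
    """Pick an edge with probability proportional to its pheromone level."""
    return rng.choice(len(tau), p=tau / tau.sum())

def sequential_step(tau):
    """Sequential update: each ant sees the pheromone left by earlier ants."""
    tau = tau.copy()
    for _ in range(n_ants):
        e = choose_edge(tau)
        tau[e] = (1 - rho) * tau[e] + rho * tau0   # ACS-style local update
    return tau

def synchronous_step(tau):
    """Concurrent update: all ants decide on the same snapshot, then update."""
    old, tau = tau.copy(), tau.copy()
    for _ in range(n_ants):
        e = choose_edge(old)
        tau[e] = (1 - rho) * tau[e] + rho * tau0
    return tau

print("sequential :", np.round(sequential_step(tau_init), 3))
print("synchronous:", np.round(synchronous_step(tau_init), 3))
```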

The effects of variable stationarity in a financial time-series on Artificial Neural Networks

2011 IEEE Symposium on Computational Intelligence for Financial Engineering and Economics (CIFEr), 2011

This study investigates the characteristic of non-stationarity in a financial time-series and its effect on the learning process of Artificial Neural Networks (ANN). It is motivated by previous work which showed that non-stationarity is not static within a financial time-series but quite variable in nature. Initially, unit-root tests were performed to isolate segments that were stationary or non-stationary at a predetermined significance level, and various tests based on forecasting accuracy were then conducted. The hypothesis of this research is that when the de-trended/original observations from the time series are used, the trend/level-stationary segments should produce lower error measures, and when the series are differenced, the difference-stationary (non-stationary) segments should have lower error. The results to date reveal that the effects of variable stationarity on learning with ANNs are a function of the forecasting time-horizon, the strength of the linear time trend, the sample size and the persistence of the stationary process.
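A minimal sketch of the segment-wise unit-root screening described above, using the Augmented Dickey-Fuller test from statsmodels to label each segment before any forecasting experiment. The synthetic series, segment length and significance level are illustrative assumptions.

```python
# Sketch: label fixed-length segments of a series by their (non-)stationarity.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(7)
prices = np.cumsum(rng.normal(size=2000)) + 100    # synthetic price-like series
segment_len, alpha = 250, 0.05

for start in range(0, len(prices) - segment_len + 1, segment_len):
    segment = prices[start:start + segment_len]
    p_level = adfuller(segment)[1]                 # H0: unit root in levels
    p_diff = adfuller(np.diff(segment))[1]         # H0: unit root after differencing
    label = ("level/trend stationary" if p_level < alpha else
             "difference stationary" if p_diff < alpha else
             "non-stationary")
    print(f"segment {start:4d}-{start + segment_len:4d}: {label}")
```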

Stochastic Simulation of Inherited Kinship-Driven Altruism

Lecture Notes in Computer Science, 2003

We present an implementation of a method for finding counterexamples to universally quantified inductive conjectures in first-order logic. Our method uses the proof by consistency strategy to guide a search for a counterexample and a standard first-order theorem prover to perform a concurrent check for inconsistency. We explain briefly the theory behind the method, describe our implementation, and evaluate results achieved on a variety of incorrect conjectures from various sources. Some work in progress is also presented: we are applying the method to the verification of cryptographic security protocols. In this context, a counterexample to a security property can indicate an attack on the protocol, and our method extracts the trace of messages exchanged in order to effect the attack. This application demonstrates the advantages of the method, in that quite complex side conditions decide whether a particular sequence of messages is possible. Using a theorem prover provides a natural way of dealing with this. Some early results are presented and we discuss future work.

Modeling the behavior of the stock market with an Artificial Immune System

IEEE Congress on Evolutionary Computation, 2010

This study analyzes the effectiveness of an Artificial Immune System (AIS) in modeling and predicting the movements of the stock market. To aid in this research, the AIS models are compared with a k-Nearest Neighbors (kNN) algorithm, an artificial neural network (ANN) and a benchmark market portfolio using simulated trading results. The analysis shows that the AIS produced overall accuracy of 67% over a 20-year test period and that the increased complexity of the model was warranted by statistically significant superior results when compared to the simpler instance-based approach of kNN. The accuracy results were comparable to those obtained by training the ANN, and the trading results outperformed the market benchmark, providing evidence that the stock market had a degree of predictability over the period 1989-2008. In general, the practice of using the natural immune system to inspire a learning algorithm has been established as a viable alternative for modeling the stock market with a supervised learning approach.
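The sketch below shows only the kNN comparison model mentioned in the abstract (not the AIS itself), predicting next-day market direction from a window of lagged returns. The synthetic returns, feature window and neighbour count are illustrative assumptions, not the study's setup.

```python
# Sketch: kNN baseline for predicting next-day market direction from lagged returns.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(3)
returns = rng.normal(scale=0.01, size=3000)        # synthetic daily returns

window = 5
X = np.array([returns[i:i + window] for i in range(len(returns) - window)])
y = (returns[window:] > 0).astype(int)             # 1 = up day, 0 = down day

split = int(0.8 * len(X))                          # chronological train/test split
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X[:split], y[:split])
print("directional accuracy:", accuracy_score(y[split:], knn.predict(X[split:])))
```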

Probabilistic Instruction Cache Analysis Using Bayesian Networks

2011 IEEE 17th International Conference on Embedded and Real-Time Computing Systems and Applications, 2011

Current approaches to instruction cache analysis for determining worst-case execution time rely on building a mathematical model of the cache that tracks its contents at all points in the program. This requires perfect knowledge of the functional behaviour of the cache and may result in extreme complexity and pessimism if many alternative paths through code sections are possible. To overcome these issues, this paper proposes a new hybrid approach in which information obtained from program traces is used to automate the construction of a model of how the cache is used. The resulting model involves the learning of a Bayesian network that predicts which instructions result in cache misses as a function of previously taken paths. The model can then be utilised to predict cache misses for previously unseen inputs and paths. The accuracy of this learned model is assessed against real benchmarks and an established statistical approach to illustrate its benefits.

Index Terms: instruction cache; worst-case execution time (WCET); Bayesian network
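A minimal sketch of the trace-driven idea above: learn, from observed program runs, the probability that a given instruction misses in the cache conditioned on the path taken so far. A single conditional probability table estimated by counting stands in for the full learned Bayesian network, and the trace generator is a hypothetical stand-in for real program traces.

```python
# Sketch: estimate P(miss | path taken) from synthetic program traces.
from collections import defaultdict
import random

random.seed(0)

def run_trace():
    """Hypothetical program run: two branch outcomes and a hit/miss observation."""
    b1, b2 = random.randint(0, 1), random.randint(0, 1)
    p_miss = 0.9 if (b1, b2) == (1, 0) else 0.1      # pretend path (1, 0) evicts the block
    return (b1, b2), int(random.random() < p_miss)

counts = defaultdict(lambda: [0, 0])                 # path -> [hit count, miss count]
for _ in range(10000):
    path, miss = run_trace()
    counts[path][miss] += 1

for path, (hits, misses) in sorted(counts.items()):
    p_miss = misses / (hits + misses)
    print(f"P(miss | path={path}) = {p_miss:.2f}")
```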

Lecture Notes in Computer Science, 2010

The use of technical indicators to derive stock trading signals is a foundation of financial technical analysis. Many of these indicators have several parameters, which creates a difficult optimization problem given the highly non-linear and non-stationary nature of a financial time-series. This study investigates a popular financial indicator, Bollinger Bands, and the fine-tuning of its parameters via particle swarm optimization under four different fitness functions: profitability, Sharpe ratio, Sortino ratio and accuracy. The experimental results show that the parameters optimized through PSO using the profitability fitness function produced superior out-of-sample trading results, including transaction costs, when compared to the default parameters.
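A minimal sketch of the Bollinger Band signal whose two parameters (rolling window and band width k) are tuned in the study. PSO itself is omitted here; a plain grid over candidate parameter values stands in for it, and the synthetic prices, trading rule and profitability measure are illustrative assumptions.

```python
# Sketch: score Bollinger Band parameters by the profitability of a simple band-reversion rule.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
prices = pd.Series(100 + np.cumsum(rng.normal(scale=1.0, size=1000)))   # synthetic prices

def bollinger_profit(prices, window, k):
    """Buy below the lower band, sell above the upper band, sum next-day returns."""
    mid = prices.rolling(window).mean()
    std = prices.rolling(window).std()
    upper, lower = mid + k * std, mid - k * std
    position = pd.Series(np.where(prices < lower, 1, np.where(prices > upper, -1, 0)),
                         index=prices.index)
    next_day_ret = prices.pct_change().shift(-1)
    return (position * next_day_ret).sum()

best = max((bollinger_profit(prices, w, k), w, k)
           for w in (10, 20, 30) for k in (1.5, 2.0, 2.5))
print("best profitability %.4f with window=%d, k=%.1f" % best)
```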