We decided to use embeddings to represent the carcinogenesis dataset in an efficient form. This was done using the PyKEEN library, which offers a wide range of embedding models. Furthermore, each model can be configured with parameters such as the number of training epochs or the dimension of the generated embeddings.

To make predictions from these embeddings, we first tried typical machine learning algorithms such as random forests, logistic regression, and nearest-neighbour classifiers (kNN). In doing so, we encountered the problem that many of the learning problems have a highly imbalanced ratio of positive to negative (included to excluded) instances.

For learning problems with an extremely high proportion of negative (excluded) instances, the classifiers labelled all instances as negative, since these algorithms mostly optimize accuracy rather than the F1 score.
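This effect is easy to reproduce with toy numbers (the 95/5 class split below is illustrative, not our actual data): a classifier that predicts "negative" for every instance still scores high accuracy, while its F1 score collapses to zero.

```python
# Toy illustration of why accuracy is misleading on imbalanced data.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 95 negative and 5 positive instances; the classifier predicts all negative.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.95 -- looks good
print(f1(y_true, y_pred))        # 0.0  -- useless for the positive class
```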

To overcome this problem, we tried to balance the training data before training. Since undersampling with very few positive instances leads to a very small training set, we decided to oversample instead. The oversampling algorithm we used is the SMOTE implementation from the sklearn extension imbalanced-learn (https://github.com/scikit-learn-contrib/imbalanced-learn). In simple terms, SMOTE generates new synthetic data points for the minority class, each of which lies on the line segment between two existing data points of that class. Using this technique together with a linear SVM, we were able to at least partially mitigate the bias towards the negative class.
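The interpolation idea behind SMOTE can be shown with a small self-contained sketch. This is a simplified re-implementation for illustration only, not the imbalanced-learn code we actually used; the function name, the sample points, and the parameter values are made up:

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Generate synthetic minority points by interpolating between a point
    and one of its k nearest neighbours (simplified SMOTE idea only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by squared Euclidean distance, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        n = rng.choice(neighbours)
        t = rng.random()  # random position on the segment between x and n
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, n)))
    return synthetic

# Four minority points at the corners of the unit square; every synthetic
# point lies on a segment between two of them, so coordinates stay in [0, 1].
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority, n_new=4)
```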

Using this, we achieved F1 scores ranging from \<lower_bound> up to \<higher_bound> for the given test learning problems. We split the data into training and test sets at a ratio of \<ratio>.
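Since \<ratio> is a placeholder, the sketch below assumes an 80/20 split purely for illustration. It shows a stratified split, which keeps the class ratio similar in both parts; this matters for imbalanced learning problems like the ones described above (the function name and numbers are hypothetical):

```python
import random

def stratified_split(y, train_frac=0.8, seed=0):
    """Return train/test index lists where each class contributes
    train_frac of its instances to the training set."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in set(y):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return train_idx, test_idx

# 90 negatives and 10 positives: the 80/20 split yields
# 72 + 8 training indices and 18 + 2 test indices.
y = [0] * 90 + [1] * 10
train, test = stratified_split(y)
```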