Deep Learning and Integration of Semantic Knowledge

Researchers: Sören Auer, Daniel Kudenko, Mark Musen, Maria-Esther Vidal

Existing Projects: ScienceGraph, KnowGraphs

Knowledge Graphs have gained increasing popularity in science and technology over the last decade. They enable a versatile and evolving semantic representation of knowledge at the crossroads of various (a) levels of information structuring (unstructured, semi-structured, structured), (b) levels of abstraction (conceptual vs. operational), (c) knowledge representation formalisms (graphs, facts, entity-relationship, logic), and (d) technology ecosystems. However, knowledge graphs are still relatively simple semantic structures, mainly representing assemblies of factual statements arranged in entity descriptions, possibly enriched by class hierarchies and corresponding property definitions.
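As a small illustration, the following sketch builds such a simple knowledge graph with the Python library rdflib (an assumption; the namespace and entities are invented for illustration): a few factual statements forming an entity description, enriched by a class hierarchy and a corresponding property definition.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")  # invented namespace for illustration
g = Graph()

# Entity description: factual statements about a single entity
g.add((EX.MarieCurie, RDF.type, EX.Scientist))
g.add((EX.MarieCurie, EX.discovered, EX.Polonium))
g.add((EX.MarieCurie, EX.birthYear, Literal(1867)))

# Enrichment: a class hierarchy and a corresponding property definition
g.add((EX.Scientist, RDFS.subClassOf, EX.Person))
g.add((EX.discovered, RDF.type, RDF.Property))
g.add((EX.discovered, RDFS.domain, EX.Scientist))

print(g.serialize(format="turtle"))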

We aim to advance the concept of knowledge graphs towards cognitive knowledge graphs, whose constituents are more complex elements, such as ideas, theories, approaches, and claims, as they are conveyed, for example, in scholarly contributions or OMICS data structures for personalized medicine. For cognitive knowledge graph management, we aim to tightly interleave three aspects: semantic representations (semantic intelligence), machine learning (machine intelligence), and crowd and expert sourcing (human intelligence).

PhD student(s): Salomon Kabongo Kabenamualu, Can Aykul

Unsupervised Representation Learning

Researchers: Bodo Rosenhahn, David Suter

Existing Projects:

Our research focuses on representation learning, which subsumes a set of techniques for representing and structuring data in an unsupervised manner. It is mainly used to automatically generate features for further processing and is an implicitly learned alternative to manual feature engineering.

A learned representation can be used to compress input data onto a low-dimensional manifold for data visualization. Another use case is to feed a low-dimensional representation into a subsequent machine learning method such as a support vector machine or a random forest.
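A minimal sketch of this second use case, assuming scikit-learn and its bundled digits data set: an unsupervised PCA projection provides the low-dimensional representation, which a support vector machine then classifies.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# 64-dimensional digit images are projected onto 16 principal components,
# and the compressed representation is fed to an SVM classifier.
X, y = load_digits(return_X_y=True)
pipeline = make_pipeline(PCA(n_components=16), SVC())
print("mean CV accuracy:", cross_val_score(pipeline, X, y, cv=5).mean())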

Several methods for subspace projection and sparse data representation exist, ranging from principal component analysis (PCA), independent component analysis (ICA), locally linear embedding (LLE), and vector quantization (VQ) to dictionary learning. More recent approaches are based on autoencoders, variational autoencoders, or invertible neural networks, which form the basis of our research:

In the future lab we will focus on representation learning and the integration of special conditions, such as specific priors. We investigate structuring autoencoders, which allow structuring the latent space during representation learning [1], normalizing flows for anomaly detection [2], and sparse feature selection based on disentangled representations [3]. Based on our past experience with mixed integer linear programming (MILP) frameworks [4], we will also formulate variants of sparse SVMs as MILPs, similar to [5]. Here we are mainly interested in expressing specific priors through additional constraints.
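To illustrate how such a prior can be expressed as an explicit constraint, the following sketch formulates a sparsity-constrained linear SVM as a MILP with the PuLP modelling library; the toy data, the big-M bound, and the sparsity budget are assumptions for illustration, and this is not the formulation of [5].

import numpy as np
import pulp

# Toy data (assumption): 40 points in 10 dimensions, labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = np.where(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=40) > 0, 1.0, -1.0)
X, y = X.tolist(), y.tolist()
n, d = len(X), len(X[0])
k, M = 3, 10.0  # sparsity budget and big-M bound on |w_j| (assumptions)

prob = pulp.LpProblem("sparse_svm", pulp.LpMinimize)
w = [pulp.LpVariable(f"w{j}", -M, M) for j in range(d)]
b = pulp.LpVariable("b")
xi = [pulp.LpVariable(f"xi{i}", lowBound=0) for i in range(n)]  # hinge-loss slacks
z = [pulp.LpVariable(f"z{j}", cat="Binary") for j in range(d)]  # indicator: feature j is used

prob += pulp.lpSum(xi)  # minimize total hinge loss
for i in range(n):      # margin constraints y_i (w . x_i + b) >= 1 - xi_i
    prob += y[i] * (pulp.lpSum(w[j] * X[i][j] for j in range(d)) + b) >= 1 - xi[i]
for j in range(d):      # w_j may be non-zero only if z_j = 1
    prob += w[j] <= M * z[j]
    prob += w[j] >= -M * z[j]
prob += pulp.lpSum(z) <= k  # the sparsity prior as an explicit constraint

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("selected features:", [j for j in range(d) if pulp.value(z[j]) > 0.5])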

PhD student(s): Mariia Khan

[1] Marco Rudolph, Bastian Wandt, Bodo Rosenhahn. Structuring Autoencoders. Third International Workshop on “Robust Subspace Learning and Applications in Computer Vision” (ICCV), August 2019.

[2] Marco Rudolph, Tom Wehrbein, Bodo Rosenhahn, Bastian Wandt. Fully Convolutional Cross-Scale-Flows for Image-based Defect Detection. Winter Conference on Applications of Computer Vision (WACV), IEEE, Hawaii, USA, January 2022.

[3] Maren Awiszus, Hanno Ackermann, Bodo Rosenhahn. Learning Disentangled Representations via Independent Subspaces. Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), October 2019.

[4] Andrea Hornakova*, Roberto Henschel*, Bodo Rosenhahn, Paul Swoboda (* equal contribution). Lifted Disjoint Paths with Application in Multiple Object Tracking. Proceedings of the 37th International Conference on Machine Learning (ICML), July 2020.

[5] Tanveer, M. Robust and Sparse Linear Programming Twin Support Vector Machines. Cognitive Computation 7, 137–149 (2015). https://doi.org/10.1007/s12559-014-9278-8

Probabilistic Methods, Spatial Data

Researchers: Wei Wu

Existing Projects: CampaNeo, smashHit

Data are growing explosively in today's information society, and big data have been driving data mining research in both academia and industry. Data similarity (or distance) computation is a fundamental research topic that underpins many high-level, similarity-based applications in machine learning and data mining, e.g., classification, clustering, regression, retrieval, and visualization. However, computing similarity exactly in large-scale data analytics is daunting because of the “3V” characteristics (volume, velocity, and variety). Given the ever-growing availability and awareness of large-scale data sets in many important application domains, e.g., bioinformatics, transportation, epidemiology, and public safety, it is imperative to develop efficient yet accurate similarity estimation algorithms for large-scale data analytics.

One powerful solution is data hashing, which applies a family of hash functions to transform data objects into sequences of hash codes so that similar objects are mapped to the same hash code with higher probability than dissimilar ones. Consequently, hashing techniques, as a vital building block for large-scale data analytics, can efficiently and, in many cases, unbiasedly approximate the similarity between data objects, thus benefiting many important data mining and machine learning tasks that rely on similarity measures, such as information retrieval, classification, clustering, and visualization (a minimal MinHash sketch follows the publication list below). So far, Dr. Wu has published the following top-tier papers:

[1] Wu W, Li B, Chen L, et al. Canonical consistent weighted sampling for real-value weighted min-hash//Proceedings of the 16th IEEE International Conference on Data Mining. Barcelona, Spain, 2016: 1287-1292

[2] Wu W, Li B, Chen L, et al. K-ary tree hashing for fast graph classification. IEEE Transactions on Knowledge and Data Engineering, 2017, 30(5): 936-949

[3] Wu W, Li B, Chen L, et al. Consistent weighted sampling made more practical//Proceedings of the 26th International Conference on World Wide Web. Perth, Australia, 2017: 1035-1043

[4] Wu W, Li B, Chen L, et al. Efficient attributed network embedding via recursive randomized hashing//Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, Sweden, 2018: 2861-2867

[5] Wu W, Li B, Chen L, et al. Improved consistent weighted sampling revisited. IEEE Transactions on Knowledge and Data Engineering, 2018, 31(12): 2332-2345

[6] Wu W, Li B, Chen L, et al. A Review for Weighted MinHash Algorithms. IEEE Transactions on Knowledge and Data Engineering, 2020

[7] Wu W, Li B, Luo C, et al. Hashing-accelerated graph neural networks for link prediction//Proceedings of the 30th Web Conference. Ljubljana, Slovenia, 2021: 2910-2920
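For illustration, the sketch below is a minimal pure-Python MinHash (an assumption for exposition; it is none of the weighted or consistent sampling schemes from the papers above). It maps token sets to short signatures whose agreement rate estimates the Jaccard similarity of the original sets.

import random

def minhash_signature(tokens, num_hashes=128, seed=0):
    """One minimum per random affine hash function summarizes the set."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # large prime for simple universal hashing
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(num_hashes)]
    return [min((a * hash(t) + b) % p for t in tokens) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

A = set("hashing maps similar objects to the same code with high probability".split())
B = set("hashing maps similar items to identical codes with high probability".split())
print("true Jaccard:", round(len(A & B) / len(A | B), 2))
print("MinHash estimate:", round(estimated_jaccard(minhash_signature(A), minhash_signature(B)), 2))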

Information Extraction and Web Mining

Researchers: Niloy Ganguly, Wolfgang Nejdl

Existing Projects: SoBigData, Cleopatra, Oscar

In the area of information extraction in the medical domain, we focus mainly on the summarization of biomedical literature and its evaluation, developing medical summarization models that help mitigate the information overload associated with biomedical literature search.

Such models provide concise descriptions of relevant or top-ranked articles. We also investigate techniques that evaluate and improve the factual consistency of summaries with respect to their source documents. Secondly, we focus on improving the lookup of clinical trials. Clinical trials are crucial for the practice of evidence-based medicine, and different stakeholders, such as trial volunteers, investigators, and meta-analysis researchers, often need to search for trials. We propose an automated method to retrieve relevant trials based on the overlap of UMLS concepts between a user query and the clinical trials. In another line of work, we study deep generative models, which face several difficulties due to the unique characteristics of molecular graphs; here, we propose a novel autoencoder for molecular graphs whose encoder and decoder are specially designed to account for these characteristics.
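A minimal, hypothetical sketch of the concept-overlap idea follows; the concept identifiers and trial IDs are invented placeholders, and in practice the UMLS concepts would be extracted from the query and trial texts by a concept recognizer.

# Concept sets for a query and some clinical trials (identifiers are invented placeholders).
query_concepts = {"CUI_breast_cancer", "CUI_chemotherapy", "CUI_postmenopausal"}
trial_concepts = {
    "trial_A": {"CUI_breast_cancer", "CUI_chemotherapy", "CUI_her2"},
    "trial_B": {"CUI_diabetes", "CUI_metformin"},
    "trial_C": {"CUI_breast_cancer", "CUI_radiotherapy"},
}

def overlap(query, trial):
    """Jaccard overlap between the query's and the trial's UMLS concept sets."""
    return len(query & trial) / len(query | trial)

ranking = sorted(trial_concepts, key=lambda t: overlap(query_concepts, trial_concepts[t]), reverse=True)
for trial_id in ranking:
    print(trial_id, round(overlap(query_concepts, trial_concepts[trial_id]), 2))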

In the area of web mining in the medical domain, we study how COVID-19 has fueled pro- and anti-vaxxers expressing their support and concerns regarding vaccines on social media platforms. Understanding this online discourse is crucial for policy makers, and the goal of this work is to improve that understanding through the lens of Twitter discourse data, detecting and investigating specific user groups who posted about vaccines in pre-COVID and COVID times. Secondly, we have looked into online medical forums, which have become a predominant platform for answering the health-related information needs of consumers. It is necessary to automatically classify medical queries based on a consumer's intention so that these questions are directed to the right set of medical experts. Here, we develop a novel medical knowledge-aware BERT-based model (MedBERT) that utilizes domain-specific side information obtained from popular medical knowledge bases. Thirdly, we have explored the privacy practices followed by healthcare institutes across the globe, where we wish to evaluate the appropriateness and legal compliance of these data practices with the laws of the land.
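As a rough baseline for the intent-classification step, the sketch below fine-tunes nothing and simply wires up a plain BERT classifier with the Hugging Face transformers library; the intent labels are invented, and the knowledge-aware side information that distinguishes MedBERT is deliberately omitted.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["treatment", "diagnosis", "side-effects", "other"]  # hypothetical intent classes

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(labels))

query = "Can I take ibuprofen together with my blood pressure medication?"
inputs = tokenizer(query, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
# The classification head is randomly initialized here; predictions only become meaningful
# after fine-tuning on labelled consumer health queries.
print(labels[int(logits.argmax(dim=-1))])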

PhD student(s): Soumyadeep Roy, Gunjan Balde and Abhilash Nandi

Robust and Reliable Machine Learning

Researchers: Niloy Ganguly, Marius Lindauer, Wolfgang Nejdl

Existing Projects:

Privacy Preserving Data Mining and Data Protection

Researchers: Megha Khosla, Wolfgang Nejdl

Existing Projects: ZL-Gesundheit

The success of deep learning (DL) has led to its adoption in several fields, including vision, recommender systems, natural language processing, and medicine. While DL has led to state-of-the-art improvements in various tasks, these systems are usually data hungry and require large amounts of data during training. This poses serious privacy concerns, as the data used could contain sensitive personal information and can be misused or leaked through various vulnerabilities. Our focus is twofold: we work towards exposing the vulnerabilities of deep learning systems on graph-structured data to privacy leaks, and we develop mitigating techniques to ensure privacy-preserving learning under differential privacy guarantees.
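The sketch below illustrates the simplest flavour of such a vulnerability: a confidence-thresholding membership inference baseline on tabular toy data (an assumption for illustration; the attack studied in [1] below targets graph neural networks).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Train a target model on half of the data; the other half plays the role of non-members.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_member, X_nonmember, y_member, y_nonmember = train_test_split(X, y, test_size=0.5, random_state=0)
target = RandomForestClassifier(random_state=0).fit(X_member, y_member)

# The attacker only sees prediction confidences and guesses "member" when confidence is high.
conf_member = target.predict_proba(X_member).max(axis=1)
conf_nonmember = target.predict_proba(X_nonmember).max(axis=1)
scores = np.concatenate([conf_member, conf_nonmember])
is_member = np.concatenate([np.ones(len(conf_member)), np.zeros(len(conf_nonmember))])
print("membership inference AUC:", round(roc_auc_score(is_member, scores), 2))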

[1] Olatunji, I. E., Nejdl, W., and Khosla, M. (2021). Membership inference attack on graph neural networks. In IEEE International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications 2021.

[2] Olatunji, I. E., Funke, T., and Khosla, M. (2021). Releasing Graph Neural Networks with Differential Privacy Guarantees. arXiv preprint arXiv:2109.08907, 2021.

[3] Olatunji, I. E., Rauch, J., Katzensteiner, M., & Khosla, M. (2021). A Review of Anonymization for Healthcare Data. arXiv preprint arXiv:2104.06523, 2021.

Interpretability of Artificial Intelligence Algorithms

Researchers: Avishek Anand, Wolfgang Nejdl

Existing Projects: Interpreting Neural Rankers, Simple-ML

Predictive models are pervasive, with usage in search engines, recommender systems, and the health, legal, and financial domains. For the most part, however, they are used as black boxes that output a prediction, score, or ranking without revealing, partially or even completely, how different features influence the model prediction. When such an algorithm prioritizes information to predict, classify, or rank, algorithmic transparency becomes an important feature for restricting discrimination and enhancing explainability-based trust in the system.

Consequently, we often end up with accurate yet non-interpretable models. We have been working both on models that are interpretable by design and on approaches that explain, in a post-hoc manner, the rationale behind a prediction of an already trained model. Specifically, we have proposed different interpretability approaches to audit ranking models in the context of Web search.

We have been studying the problem of interpretability for text-based ranking models by trying to unearth the query intent as understood by complex retrieval models. In [1], we proposed a model-agnostic approach that locally approximates a complex ranker with a simple ranking model in the term space. In [3], we ponder the question of what makes a good reference input distribution for neural rankers. We also have a simple research prototype for explaining neural rankers, called EXS [4].
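A minimal sketch of the general idea of locally approximating a black-box ranker with a simple term-space model follows; the data and the stand-in scoring function are invented for illustration, and this is not the method of [1].

import random
import numpy as np
from sklearn.linear_model import Ridge

def black_box_score(query_terms, doc_terms):
    """Stand-in for a complex neural ranker: any callable returning a relevance score."""
    return sum(1.0 for t in doc_terms if t in query_terms) + 0.01 * len(doc_terms)

def explain_locally(query_terms, doc_terms, n_samples=200, seed=0):
    """Fit a simple linear model over term-presence features to mimic the ranker locally."""
    rng = random.Random(seed)
    vocab = sorted(set(doc_terms))
    X, y = [], []
    for _ in range(n_samples):
        kept = [t for t in vocab if rng.random() > 0.5]       # random term-dropout perturbation
        X.append([1.0 if t in kept else 0.0 for t in vocab])  # term-presence features
        y.append(black_box_score(query_terms, kept))          # black-box score of the perturbed doc
    surrogate = Ridge(alpha=1.0).fit(np.array(X), np.array(y))
    return sorted(zip(vocab, surrogate.coef_), key=lambda pair: -pair[1])

print(explain_locally({"neural", "ranking"}, ["neural", "ranking", "models", "are", "often", "opaque"]))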

Recently, we investigated the difference between human and machine understanding of images using post-hoc interpretability approaches [2]. In particular, we seek to answer the following questions: Which (well-performing) complex ML models are closer to humans in their use of features to make accurate predictions? How does task difficulty affect the feature selection capability of machines in comparison to humans? Are humans consistently better at selecting features that make image recognition more accurate?

Publications

[1] Model agnostic interpretability of rankers via intent modelling. Jaspreet Singh and Avishek Anand. In Conference on Fairness, Accountability, and Transparency (FAT), 2020.

[2] Dissonance Between Human and Machine Understanding. Zijian Zhang, Jaspreet Singh, Ujwal Gadiraju, Avishek Anand. In CSCW 2019.

[3] A study on the Interpretability of Neural Retrieval Models using DeepSHAP. Zeon Trevor Fernando, Jaspreet Singh, Avishek Anand. In SIGIR 2019.

[4] EXS: Explainable Search Using Local Model Agnostic Interpretability. Jaspreet Singh and Avishek Anand. In WSDM 2019.

[5] Posthoc Interpretability of Learning to Rank Models using Secondary Training Data. Jaspreet Singh and Avishek Anand. In Workshop on ExplainAble Recommendation and Search (Co-located with SIGIR’ 18).

[6] Finding Interpretable Concept Spaces in Node Embeddings using Knowledge Bases. Maximilian Idahl, Megha Khosla and Avishek Anand. In Workshop on Advances in Interpretable Machine Learning and Artificial Intelligence & eXplainable Knowledge Discovery in Data (co-located with ECML-PKDD 2019).

Fairness and Responsibility in Artificial Intelligence

Researchers: Markus Luczak-Roesch, Bodo Rosenhahn, David Suter, Maria-Esther Vidal, Cameron Pierson

Existing Projects: BIAS, NoBIAS

The rapid and increasing development of machine learning in healthcare applications (ML-HCAs) requires ethical examination to assess the impact of novel medical devices and methods on patients and society. It is imperative that such ethical examinations elucidate the associated ethical considerations, whether known or new. As medical technology advances, so must the concurrent ethical examination of its use and scope, such as the nature of the system's application, the data underwriting said system, and the impacts on patients, society, and healthcare. Such ethical examination is imperative to avoid embedding or amplifying biases in machine learning tools used in healthcare.

The development of AI in medicine ought to be interdisciplinary and/or by co-design. Therefore, implementing an ethical framework evaluation with a research team provides the benefit of auditing (cf. van Wynsberghe & Robbins, 2014) by the investigators of this study, while also promoting the identification and management of ethical considerations in situ within the research group. Such implementation would promote the ethical development of ML-HCAs. The proposed framework, however, has yet to be independently evaluated. Thus, this study aims to evaluate Char and colleagues' (2020) pipeline framework within the context of a research group seeking to develop machine learning techniques that identify biomarkers of breast cancer patients to predict patient response to chemotherapy treatment.

Char, D. S., Abràmoff, M. D., & Feudtner, C. (2020). Identifying ethical considerations for machine learning healthcare applications. The American Journal of Bioethics, 20(11), 7-17. https://doi.org/10.1080/15265161.2020.1819469

van Wynsberghe, A., & Robbins, S. (2014). Ethicist as designer: A pragmatic approach to ethics in the lab. Science and Engineering Ethics, 20, 947-961. https://doi.org/10.1007/s11948-013-9498-4

Machine Learning/AI for Precision Medicine and Health Care

Researchers: Sören Auer, Niloy Ganguly, Thomas Illig, Megha Khosla, Michael Marschollek, Wolfgang Nejdl, David Suter, Maria-Esther Vidal, Cameron Pierson

Existing Projects: BacData, Big Data for Cochlea implants, BigMedilytics, PRESENt

Knowledge graphs (KGs) have gained momentum as expressive data structures to represent the convergence of data and knowledge spread across heterogeneous data sources. In particular, the rich body of biomedical data already available in KGs demonstrates the feasibility of integrating and representing vast amounts of biomedical data and knowledge as symbolic and subsymbolic statements.

We investigate hybrid approaches that combine machine learning methods with KGs to enhance predictive models and interpretability. As a result, we expect a paradigm shift in knowledge management towards explainable AI.

We apply our techniques in the context of breast cancer. Personalized therapies will be derived from the predicted evolution of a patient's disease based on the patient's profile. The description of patient profiles, integrated with available knowledge about treatments, will support a better understanding and explanation of disease evolution and therapy effectiveness.
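A deliberately simplistic, hypothetical sketch of the hybrid idea: symbolic statements from a patient-centric KG are encoded as features for a predictive model whose coefficients can be traced back to those statements. All entities, facts, and labels below are invented.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented KG-derived facts per patient and an invented therapy-response label.
patient_facts = {
    "patient_1": {"biomarker_X_positive", "treatment_A"},
    "patient_2": {"biomarker_X_negative", "treatment_B"},
    "patient_3": {"biomarker_X_positive", "treatment_B"},
    "patient_4": {"biomarker_X_negative", "treatment_A"},
}
responded = {"patient_1": 1, "patient_2": 0, "patient_3": 1, "patient_4": 0}

features = sorted({f for facts in patient_facts.values() for f in facts})
X = np.array([[1.0 if f in patient_facts[p] else 0.0 for f in features] for p in patient_facts])
y = np.array([responded[p] for p in patient_facts])

model = LogisticRegression().fit(X, y)
# Each coefficient refers to a symbolic KG statement, which keeps the prediction inspectable.
for feature, coef in zip(features, model.coef_[0]):
    print(f"{feature}: {coef:+.2f}")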

PhD student(s): Can Aykul, Jonas Wallat