Safe AI – Can Artificial Intelligence Be Safe and Trustworthy?
Researchers from the Department of AI Safety and Transparency have launched a new project led by the Head of the Department, Dr. Sebastian Cygert. The goal of the project is to develop new methods for ensuring the safety, transparency, and reliable evaluation of large AI models, which are increasingly used in the public sector, industry, and services.

Artificial intelligence is becoming one of the key drivers of digital transformation, entering more and more areas of economic, social, and public life. Systems based on large language models (LLMs) are being deployed in industry, public administration, professional services, education, and healthcare. The rapid development of these technologies opens new possibilities for automation, data analysis, and service personalization, but it also creates significant challenges related to the safety, transparency, and reliability of AI models. These issues are at the core of the SAFEAI project carried out at NASK.
Transparency of Training Data
One of the major challenges is the lack of transparency regarding training data. Many models are developed in processes where the origin, nature, and legal compliance of the data remain undisclosed. This makes it difficult to assess risks related to privacy and copyright, and to rule out that training included information that should have been excluded, such as licensed, test, or legally ambiguous data. Within the SAFEAI project, we are developing methods for detecting data leaks that make it possible to determine whether specific content was used in model training in violation of established rules.
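To give a rough sense of how a check for training data leakage can work when the training set itself is not available, the sketch below uses a publicly known heuristic often called Min-K% Prob: text a model has seen during training tends to contain fewer very surprising tokens. This is an illustration only and not the SAFEAI method. The per-token log-probabilities are assumed to come from the model under test, and the names `min_k_score` and `looks_memorized` as well as the threshold value are hypothetical.

```python
import math


def min_k_score(token_logprobs, k=0.2):
    """Average the lowest k fraction of per-token log-probabilities.

    A higher (less negative) score means even the least likely tokens
    were fairly predictable to the model, which hints at prior exposure.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def looks_memorized(token_logprobs, threshold=-2.0, k=0.2):
    # The threshold is a placeholder (assumption); in practice it would be
    # calibrated on texts known to be inside and outside the training data.
    return min_k_score(token_logprobs, k) > threshold


# Made-up log-probabilities, for illustration only.
likely_seen = [-0.1, -0.3, -0.2, -0.5, -0.4]
likely_unseen = [-0.2, -4.1, -0.3, -6.0, -3.2]
print(looks_memorized(likely_seen), looks_memorized(likely_unseen))
```

A score of this kind is statistical evidence rather than proof, so it is best read alongside other signals.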
Safety of Generated Content
Another area of risk concerns the safety of generated outputs. Generative models can produce harmful, dangerous, or illegal content. In the project, we are developing methods for model steering as well as so-called guard models: specialized systems that monitor and control the output of generative models to keep them from producing undesirable content. The goal is to increase the models' resilience to attempts at bypassing safeguards and to improve the safety of their use in real-world applications.
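The role of a guard model can be pictured as a filter wrapped around the generator. The sketch below is a minimal illustration under stated assumptions, not the project's implementation: `generate` and `is_unsafe` are hypothetical callables standing in for a generative model and a trained guard classifier, and the keyword check used here is only a toy placeholder. Screening the prompt as well as the output is an added design choice.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
    refusal: str = REFUSAL,
) -> str:
    """Return the model's answer only if the guard accepts it."""
    # Screen the request first (assumption: the guard can also judge prompts).
    if is_unsafe(prompt):
        return refusal
    draft = generate(prompt)
    # Screen the generated text before it reaches the user.
    if is_unsafe(draft):
        return refusal
    return draft


# Toy stand-ins so the sketch runs on its own.
def toy_generate(prompt: str) -> str:
    return f"Echo: {prompt}"


def toy_guard(text: str) -> bool:
    return "explosive" in text.lower()


print(guarded_generate("hello", toy_generate, toy_guard))
```

Keeping the guard separate from the generator means it can be updated or audited without retraining the underlying model.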
Reliable Evaluation of AI Models
The third pillar of the project is reliable and contamination-resistant evaluation of AI models. We are developing systematic audit methods that help detect hidden manipulations and identify areas where a model behaves unpredictably or contrary to expectations. In parallel, we are creating testing tools that reflect realistic usage conditions. Many existing benchmarks have lost credibility because models were exposed to the test data beforehand, which artificially inflates their scores.
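One common way to screen a benchmark for contamination, when a reference corpus is available for inspection, is to look for long word sequences that a test item shares with it. The sketch below illustrates that idea only and is not the project's tooling. The function names, the whitespace tokenization, and the default window of 8 words are assumptions.

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_text, n=8):
    """Fraction of the item's n-grams that also occur in the corpus text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item shorter than n words: nothing to compare.
        return 0.0
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)


# Made-up example texts, for illustration only.
corpus = "quiz dump: what is the capital of france answer paris"
print(overlap_ratio("What is the capital of France", corpus, n=3))
print(overlap_ratio("How many moons does Mars have", corpus, n=3))
```

A high ratio flags an item for manual review. Exact matching of this kind misses paraphrased or translated copies, so a low ratio does not show that an item is clean.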
Project Results
The SAFEAI project is carried out in active collaboration with international experts and research institutions. Members of the team participate in initiatives such as the Astra Fellowship and the MARS Programme (Mentorship for Alignment Researchers), enabling the exchange of knowledge with researchers from leading institutions working on AI model safety.
At the same time, the project aims to develop new methods and tools that will be released to the scientific community as open source, supporting the development of safe and transparent artificial intelligence. The research results will be submitted for publication at top-tier CORE A/A* conferences, such as NeurIPS, ICLR, ICML, and ACL.