Polish Large Language Model (PLLuM)

Innovative Polish large language model for the public and the private sector

Challenge

The PLLuM (Polish Large Language Model) project is an initiative to create an open and accountable Polish language model. We aim to support innovation in the public and business sectors by developing tools such as an intelligent assistant for administration. We are collecting and compiling a comprehensive data set in Polish, in accordance with the guidelines of the National Data Center of Excellence. The model will be released under an open-source license and made accessible through programming (API) and graphical (GUI) interfaces, which will allow practical use in public administration, for example in the form of a prototype intelligent assistant. We are making sure that our model is secure and free of harmful or inaccurate content, which is crucial for its use in the public sector.

There are many ways to support our initiative. At this time, we especially encourage you to contact us about donating text data to train the model. Please email zilliat@nask.pl with specific information about the type and size of the data set.

What we did

The PLLuM project is a unique combination of the strengths of leading Polish scientific institutions, bringing together experts from various fields to create a groundbreaking language model. The consortium includes Wrocław University of Technology (project leader), NASK – National Research Institute, the Information Processing Center – National Research Institute (OPI PIB), the Institute of Computer Science Foundations of the Polish Academy of Sciences, the University of Lodz and the Institute of Slavic Studies of the Polish Academy of Sciences. This scientific cooperation brings together diverse competencies and passions, creating a solid foundation for the development of AI in Poland.

The project is being implemented in 2024. During these 12 months:

  1. We will develop an implementation plan for building a large, open base language model for Polish. A collection of language resources will be created, including data from government public information.
  2. We will build a corpus of language data necessary for base training and tuning of our model. The work will include the creation of a data management system, including a corpus of language data that meets the parameters required for effective training.
  3. We will conduct base training of a large language model for Polish and tune it for a series of tasks. The result will be a trained neural network.
  4. We will develop the dialogue model through reinforcement learning, using user responses. The result will be an improved model and information system.
  5. We will build an output correction module for a large dialogue language model to improve the quality of its responses. The work will conclude with the creation of an information system.