KATI – One lid for many data pots

Just in time for the Hannover Messe 2021, Digital Science and Fraunhofer INT announced a collaborative project to make the KATI-system available to interested parties outside of the Fraunhofer-Gesellschaft. This provided the basis for marketing KATI – a milestone in the development of KATI and the history of the institute.

What are the advantages of expanding KATI to include the Dimensions data provided by Digital Science? What technical challenges had to be overcome in order to realize these advantages? It all began a year earlier, in March 2020, with an initial conversation between Christian Herzog and Mario Diwersy from Digital Science and those responsible at Fraunhofer INT, sounding out the possibility of cooperation. It quickly became clear that both parties saw a great opportunity, so while the formalities were still being discussed, technical work began on collecting the comprehensive dataset.

The structure of the data

The Dimensions database contains bibliographic data on more than 120 million publications, including details of the title, authors, journal and year of publication of a paper. In this respect, relatively few adjustments were needed to the data model developed at Fraunhofer INT, which is based on the use of a so-called graph database or RDF store. RDF stands for Resource Description Framework, a model in which ‘resources’ are linked to one another by relations. In KATI, for example, a ‘resource’ could be a paper or an author. These can be linked together by the relationship ‘written by’. This creates something called ‘triples’, each comprising two resources and a relationship:

        Paper A – written by – Author X

This type of data preparation has proven to be very powerful and forms the basis for many options and features that the KATI-system offers its users.
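To make the idea of triples more concrete, the following is a minimal sketch in Python. It is not the KATI implementation: the identifiers (`paper:A`, `author:X`), the predicate name `written_by` and the helper `match` are assumptions chosen purely for illustration, and a real RDF store would be queried with a language such as SPARQL rather than a list comprehension.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """One statement: two resources joined by a relationship."""
    subject: str
    predicate: str
    obj: str


# Hypothetical example data in the spirit of "Paper A - written by - Author X".
triples = [
    Triple("paper:A", "written_by", "author:X"),
    Triple("paper:A", "written_by", "author:Y"),
    Triple("paper:B", "written_by", "author:X"),
]


def match(store: list[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> list[Triple]:
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Which papers were written by Author X?
papers_by_x = [t.subject for t in match(triples, predicate="written_by", obj="author:X")]

# Who wrote Paper A?
authors_of_a = [t.obj for t in match(triples, subject="paper:A", predicate="written_by")]
```

Because every fact has the same three-part shape, the same small query function answers questions in either direction, which hints at why this representation lends itself to many different analyses.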

Despite these fundamental similarities, the Dimensions data contains some updates and extensions that made a few adjustments necessary in the KATI engine room.

One is the fact that Dimensions describes institutions using a unique identifier. This facilitates a whole range of analyses that can be carried out with KATI.

Another important difference is that Dimensions uses not just one classification system, but several. These include:

  • Fields of Research, based on an Australian–New Zealand classification system and comprising two hierarchical levels. They correspond to a classical division into scientific disciplines and subdisciplines.
  • The Sustainable Development Goals, which were put into effect on January 1, 2016, by the UN.
  • Various other systems that primarily come from the field of medicine.

Digital Science automatically allocates each publication to the various classes and uses classification algorithms specifically trained for this. As such, this is an article-based classification that is not based on the assignment of a particular journal to one or more classes, allowing a more detailed and precise allocation of publications to the various classes.

The structure of the system

To be able to take this slightly different structure of the data into account, the KATI-system itself also had to be adjusted. This initially affected the data model on which the system is based. Subsequently, the so-called transformer was modified. This program is responsible for processing the raw data into the triples discussed above, which are then loaded into the corresponding databases in the next step. Then, these triples are made searchable using an appropriate search engine. The KATI Lab has programmed a ‘pump’ for this purpose, which ensures that the searchable data is copied from the actual graph database into the search index.
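The pipeline described above (raw data, transformer, graph database, pump, search index) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the KATI code: the record fields (`id`, `title`, `year`, `authors`), the predicate names and the functions `transform` and `pump` are hypothetical, a plain list stands in for the graph database, and a dictionary-based inverted index stands in for the search engine.

```python
from collections import defaultdict


def transform(record: dict) -> list[tuple[str, str, str]]:
    """Transformer step: turn one raw bibliographic record into triples."""
    paper = f"paper:{record['id']}"
    triples = [
        (paper, "has_title", record["title"]),
        (paper, "published_in_year", str(record["year"])),
    ]
    for author in record["authors"]:
        triples.append((paper, "written_by", f"author:{author}"))
    return triples


def pump(graph: list[tuple[str, str, str]],
         searchable: tuple[str, ...] = ("has_title",)) -> dict[str, set[str]]:
    """Pump step: copy searchable text from the graph into a search index.

    The index maps each lower-cased word to the set of resources it occurs in.
    """
    index: dict[str, set[str]] = defaultdict(set)
    for subject, predicate, obj in graph:
        if predicate in searchable:
            for word in obj.lower().split():
                index[word].add(subject)
    return index


# Hypothetical raw records, invented for this example.
raw_records = [
    {"id": "1", "title": "Graph databases in foresight", "year": 2020, "authors": ["X"]},
    {"id": "2", "title": "Patent data and foresight", "year": 2021, "authors": ["X", "Y"]},
]

# Load step: here the "graph database" is simply a list of triples.
graph = [triple for record in raw_records for triple in transform(record)]

# Make the data searchable and run a one-word query.
index = pump(graph)
hits = sorted(index["foresight"])
```

Keeping the transformer and the pump as separate steps means a change in the raw data format only touches `transform`, while a change in what should be searchable only touches `pump`.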

All of these adjustments initially took place in the back end, the engine room, so to speak. But adjustments also had to be made to the front end, the actual user interface. This affected all components of the KATI system: the search interface, the resource page, and the analysis page with its various dashboards.

For example, the design of the filters on the search page had to be adapted to account for the fact that you can now filter the results according to several category systems and that these are also partly arranged hierarchically.
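A filter of this kind, working across several category systems of which some are hierarchical, might look like the sketch below. The data layout, the system keys (`for`, `sdg`), the codes and the function `filter_by_category` are assumptions made for illustration. In particular, the sketch assumes that in a hierarchical system a subclass code begins with the code of its parent class, so that selecting a parent also selects its children; flat systems are matched exactly.

```python
def filter_by_category(papers: list[dict], system: str, selected: str,
                       hierarchical: bool = True) -> list[dict]:
    """Keep papers that carry the selected class in the given category system.

    For hierarchical systems a paper also matches if it carries a subclass,
    which is assumed here to be encoded as a code starting with the parent code.
    """
    def fits(code: str) -> bool:
        return code.startswith(selected) if hierarchical else code == selected

    return [
        paper for paper in papers
        if any(fits(code) for code in paper["categories"].get(system, []))
    ]


# Hypothetical search results, each classified in one or more systems.
results = [
    {"id": "1", "categories": {"for": ["0101"], "sdg": ["7"]}},
    {"id": "2", "categories": {"for": ["0801", "0102"]}},
    {"id": "3", "categories": {"for": ["0802"], "sdg": ["13"]}},
]

# Selecting the top-level class "01" also returns papers in "0101" and "0102".
top_level = [p["id"] for p in filter_by_category(results, "for", "01")]

# Selecting a subclass narrows the result.
sub_level = [p["id"] for p in filter_by_category(results, "for", "0801")]

# A flat system is matched exactly, so "1" does not match "13".
flat = [p["id"] for p in filter_by_category(results, "sdg", "13", hierarchical=False)]
```

Filters on different systems can be combined simply by applying the function repeatedly to the previous result.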

Naturally, the biggest changes were in the analysis part of the KATI-system, as all visualizations used on the dashboards required adjustments. Behind each of these visualizations there are one or more database queries which are responsible for providing the necessary data. All of these queries had to be modified to correspond to the changed data structure. This comprised more than 30 visualizations for which more than 30 database queries had to be created or adjusted.

The KATI Lab team used this opportunity to completely redesign the entire user interface. This meant revising both the structure of the code, to make it easier to maintain and expand in the future, and the look of the interface. Important elements such as the filters for the search results or the workspace were redesigned and given more functionality. Other important improvements concerned the dashboard design, which now offers, for example, the option of influencing the appearance of the visualizations.

By including the Dimensions data, the KATI team has now created a new version of the system which can be made available to interested parties from outside the Fraunhofer-Gesellschaft, paving the way for commercial marketing. In Digital Science, we have a strong, expert partner at our side for this. As part of the process, the group was able to demonstrate that both the data model and the overall concept of the KATI-system are designed so flexibly that they can be expanded to include further data pools. We are currently working on including patent data to open up an important information source for technology foresight at Fraunhofer INT. The institute is thus very well positioned for further developments in the area of data-driven foresight.