Exploiting multimedia content often relies on the correct identification of entities in text and images. A major difficulty in understanding multimedia content lies in its ambiguity with regard to the actual user needs, for instance when identifying an entity from a given textual mention or when matching a visual perception to a need expressed through language.

The project proposes to tackle the problem of analyzing ambiguous visual and textual content by learning and combining their representations and by taking into account existing knowledge about entities. We aim not only at disambiguating one modality by appropriately using the other, but also at jointly disambiguating both by representing them in a common space. The full set of contributions proposed in the project will then be used to solve a new task that we define, namely Multimedia Question Answering (MQA)¹. This task requires relying on three different sources of information: a textual question must be answered with regard to visual data as well as a knowledge base (KB) containing millions of unique entities and associated texts.
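To make this setup concrete, here is a minimal sketch of what one MQA instance and one KB entry could look like; the field names and types are illustrative assumptions, not the actual format used in the project.

```python
from dataclasses import dataclass

@dataclass
class KBEntity:
    """One entry of the knowledge base (illustrative fields)."""
    entity_id: str    # e.g. a Wikidata QID
    name: str
    text: str         # associated text, e.g. a Wikipedia abstract

@dataclass
class MQAInstance:
    """One MQA example, combining the three sources of information."""
    question: str     # textual question, e.g. "Who is Paris looking at?"
    image_path: str   # the visual data the question is about
    answer: str       # gold answer, e.g. "Aphrodite"

# The KB itself is a large collection of such entries
# (millions of them in the setting described above).
knowledge_base: dict[str, KBEntity] = {}
```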

[Figure: a painting by Rubens, captioned with the question "Who is Paris looking at?"]

Considering the image above, to answer the question "Who is Paris looking at?", one needs to understand the question and disambiguate the term "Paris" as referring to Greek mythology. The painting by Rubens helps with this disambiguation, but even a fine-grained visual analysis does not provide the answer directly. However, using a KB and accessing the entity related to the Greek hero or to the painting makes it possible to answer "Aphrodite".

In a simpler form, the MQA task is actually commonly performed in everyday life, through a decomposed process:

  • While watching a film or a TV series, one can wonder "In which movie did I already see this actress?". Answering usually requires first determining the actress's name from the credits of the film, then accessing a knowledge base such as IMDb or Wikipedia to obtain the list of films she previously played in. In an even simpler form, such a scenario corresponds to industrial needs:
  • In the context of maintenance or technical support, one may have to determine the reference of a particular product in order to access the information required to perform a technical operation. The reference can easily be obtained from a simple picture (other means are not always available); accessing the required information can then be posed as a QA problem.

While usually feasible, the weakness of current approaches to these use cases is that they often rely on "disconnected" technologies. Beyond the theoretical motivation to provide a more unified approach, the project is fundamentally interested in better understanding the links that exist between language, vision and knowledge.
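As an illustration of such a decomposed, "disconnected" pipeline, here is a minimal sketch in which each step stands in for an independent component (entity recognition, KB lookup, textual QA). The function names, the toy KB, and the hard-coded outputs are illustrative assumptions built around the Rubens example above, not actual project components.

```python
# Toy KB: entity id -> associated text. A real KB would contain
# millions of entities, as described above.
KB: dict[str, str] = {
    "Paris_(mythology)": (
        "Paris, a Trojan prince, awarded the golden apple to "
        "Aphrodite in the Judgement of Paris."
    ),
}

def recognize_entity(image_path: str) -> str:
    """Step 1 (stand-in): a dedicated vision model would identify
    the entity depicted in the image; here it is hard-coded."""
    return "Paris_(mythology)"

def retrieve_text(entity_id: str) -> str:
    """Step 2: look up the text associated with the entity in the KB."""
    return KB[entity_id]

def answer_question(question: str, context: str) -> str:
    """Step 3 (stand-in): a textual QA model would extract the answer
    span from the context; here it is hard-coded."""
    return "Aphrodite"

def decomposed_mqa(question: str, image_path: str) -> str:
    # Each step is an independent technology: errors propagate
    # unchecked from one step to the next, and no joint text/image
    # representation is ever learned.
    entity_id = recognize_entity(image_path)
    context = retrieve_text(entity_id)
    return answer_question(question, context)

print(decomposed_mqa("Who is Paris looking at?", "rubens_painting.jpg"))
# -> Aphrodite
```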

¹At the beginning of the project, a quite similar task was proposed by Shah et al. (2019) and named Knowledge-aware Visual Question Answering (KVQA). It nevertheless focused on knowledge about persons only. During the first year of the project, we thus renamed the task Knowledge-based Visual Question Answering about named Entities (KVQAE) and proposed the ViQuAE benchmark and a baseline model to address it, covering more than 900 different types of entities (code).