Chatbot Ruth von Fischer

Project details:

1. Subject of the Project

This chatbot focuses on the Swiss artist Ruth von Fischer (1911–2009), providing biographical and artistic insights in a dialog format.

2. Technical Foundation

The chatbot is built on a Retrieval-Augmented Generation (RAG) framework using:

  • LangChain and OpenAI GPT-3.5-Turbo
  • Flask for the backend, deployed on DigitalOcean
  • Chroma vector database for semantic retrieval (chunk size: 505, overlap: 20)
  • Embeddings: text-embedding-ada-002
  • Gunicorn WSGI server (production-grade interface between Flask and the web)
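
The chunking parameters above can be illustrated with a small sliding-window splitter. This is only a sketch: it assumes the chunk size and overlap are measured in characters, which the list above does not state, and the function name split_into_chunks is hypothetical. LangChain ships its own text splitters, and the project may well use one of those instead.

```python
def split_into_chunks(text: str, chunk_size: int = 505, overlap: int = 20) -> list[str]:
    """Split text into overlapping windows.

    Assumption: sizes are counted in characters, not tokens.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts 485 characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1000))
pieces = split_into_chunks(sample)
print([len(p) for p in pieces])
```

Consecutive chunks share their last and first 20 characters, so a sentence cut at a chunk boundary still appears partly in both neighbors.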

3. Multilingual Dialog Capabilities

  • The chatbot supports multilingual dialog in German, French, and English.
  • The language is detected from the user's initial question using detect_langs and constrained to these three options.
  • Subsequent interactions continue in the same language.
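
A minimal sketch of the language constraint described above, assuming detect_langs is the function of that name from the langdetect package (it returns candidates with .lang and .prob attributes). The helper names and the fallback to German when no supported language is found are assumptions, not details from the project.

```python
from types import SimpleNamespace

SUPPORTED = ("de", "fr", "en")


def pick_supported_language(candidates, default="de"):
    """Return the most probable supported language among the candidates.

    `candidates` should look like the result of detect_langs():
    objects exposing `.lang` (language code) and `.prob` (probability).
    """
    supported = [c for c in candidates if c.lang in SUPPORTED]
    if not supported:
        return default  # assumption: fall back to German
    return max(supported, key=lambda c: c.prob).lang


def detect_dialog_language(question, default="de"):
    """Detect the language of the first question, restricted to SUPPORTED."""
    # Imported lazily so the selection logic can be used without langdetect installed.
    from langdetect import detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    try:
        return pick_supported_language(detect_langs(question), default)
    except LangDetectException:
        return default


# Stand-in candidates, shaped like detect_langs() output.
mixed = [SimpleNamespace(lang="nl", prob=0.6), SimpleNamespace(lang="de", prob=0.4)]
print(pick_supported_language(mixed))
```

Here Dutch is the top raw guess, but since it is not one of the three supported languages, the best supported candidate (German) is chosen.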

4. Natural Language Processing Pipeline

Named Entity Recognition (NER) and preprocessing are handled with spaCy models:

  • German: de_core_news_md
  • French: fr_core_news_md
  • Multilingual: xx_ent_wiki_sm

5. Hybrid Information Retrieval

Each question triggers a retrieval step combining:

  • Temporal markers (years): 40% weight
  • Named Entities (NER): 40% weight
  • Cosine similarity between question and sources: 20% weight

A dynamic chunk count based on question complexity was evaluated (e.g., 3 for factual questions, up to 15 for analytical ones), but empirical testing favored a fixed count of the 4 most relevant chunks.

Text chunks with a cosine similarity below 0.78 (78%) are discarded.
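
The weighting, threshold, and fixed chunk count can be sketched as follows. The sketch makes several assumptions: the year and entity scores are computed as the fraction of the question's years or entities that also occur in a chunk, the similarity threshold is applied before ranking, and all names (Candidate, hybrid_score, select_chunks) are hypothetical. The text above only fixes the weights, the threshold, and the count.

```python
from dataclasses import dataclass

WEIGHTS = {"years": 0.4, "entities": 0.4, "similarity": 0.2}
MIN_SIMILARITY = 0.78  # chunks below this cosine similarity are discarded
TOP_K = 4              # fixed number of chunks passed on to the model


@dataclass
class Candidate:
    text: str
    similarity: float                  # cosine similarity to the question
    years: frozenset = frozenset()     # years mentioned in the chunk
    entities: frozenset = frozenset()  # named entities found in the chunk


def overlap_ratio(wanted, found):
    """Fraction of the question's items that also occur in the chunk (assumed scoring)."""
    if not wanted:
        return 0.0
    return len(set(wanted) & set(found)) / len(wanted)


def hybrid_score(c, q_years, q_entities):
    return (WEIGHTS["years"] * overlap_ratio(q_years, c.years)
            + WEIGHTS["entities"] * overlap_ratio(q_entities, c.entities)
            + WEIGHTS["similarity"] * c.similarity)


def select_chunks(candidates, q_years, q_entities):
    kept = [c for c in candidates if c.similarity >= MIN_SIMILARITY]
    kept.sort(key=lambda c: hybrid_score(c, q_years, q_entities), reverse=True)
    return kept[:TOP_K]


candidates = [
    Candidate("A", 0.80, frozenset({"1935"})),
    Candidate("B", 0.95),
    Candidate("C", 0.70, frozenset({"1935"}), frozenset({"Bern"})),
]
selected = select_chunks(candidates, {"1935"}, {"Bern"})
print([c.text for c in selected])
```

In this toy run, chunk C matches both the year and the entity but is dropped by the similarity threshold, and chunk A outranks the more similar chunk B because it mentions the year from the question.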

6. Three-step Pipeline

Each question is processed in a structured three-step pipeline:

  1. Input reformulation: The user’s question is rewritten into a neutral German formulation (e.g., "Ruth von Fischer..." instead of "you..."). This transformation enables consistent semantic retrieval, as all source texts are in German. This step ends with a dynamically constructed prompt that is passed to the language model.
  2. Information retrieval: The neutralized question is used to query the Chroma vector database. Relevance scoring is based on a weighted hybrid model: temporal indicators, named entities, and cosine similarity.
  3. Response generation: The retrieved context and the original user question are used to generate a first-person, dialog-style answer (e.g., "I was born..." instead of "Ruth von Fischer was born..."). This step also culminates in a prompt, constructed on the fly and enriched with example-based guidance tailored to the detected language.

Both prompts (steps 1 and 3) end with explicit JSON formatting instructions to ensure structured model output.
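
A rough sketch of how the two prompts and the JSON handling might fit together. The prompt wording, the JSON field names (reformulated_question, answer), the style examples, and all function names are hypothetical; the actual prompts are not reproduced here. The call to the language model itself is left out.

```python
import json

# Hypothetical per-language style examples for the answer prompt.
STYLE_EXAMPLES = {
    "de": 'Beispiel: "Ich wurde ... geboren."',
    "fr": 'Exemple : "Je suis née ..."',
    "en": 'Example: "I was born ..."',
}


def json_instruction(field):
    """Explicit formatting instruction that closes each prompt."""
    return 'Respond only with a JSON object of the form {"%s": "<text>"}.' % field


def build_reformulation_prompt(question, previous_answer=""):
    """Step 1: ask for a neutral German version of the user's question."""
    parts = [
        "Rewrite the question as a neutral German question about Ruth von Fischer.",
        "Replace direct address (you) with her name.",
    ]
    if previous_answer:
        parts.append(f"Previous answer: {previous_answer}")
    parts.append(f"Question: {question}")
    parts.append(json_instruction("reformulated_question"))
    return "\n".join(parts)


def build_answer_prompt(question, chunks, language):
    """Step 3: ask for a first-person answer grounded in the retrieved chunks."""
    parts = [
        f"Answer in the first person as Ruth von Fischer, in language '{language}'.",
        STYLE_EXAMPLES.get(language, STYLE_EXAMPLES["en"]),
        "Use only the context below.",
        "Context:",
        *chunks,
        f"Question: {question}",
        json_instruction("answer"),
    ]
    return "\n".join(parts)


def parse_json_field(raw, field):
    """Extract one field from the model reply, tolerating text around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])[field]


reply = 'Here you go: {"answer": "Ich wurde 1911 geboren."}'
print(parse_json_field(reply, "answer"))
```

Requesting JSON in both steps lets the surrounding code pick out exactly one field, even if the model adds extra words around the object.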

7. Dialog Logic and Personalization

  • Each new question is interpreted in the context of the previous answer.
  • Internally, all operations are performed in German to align with source texts.
  • The chatbot is not anonymous: users address Ruth von Fischer in the second person, and answers are phrased in the first person, simulating direct communication with her.

8. Rule-based Enhancements

The chatbot also includes deterministic rule-based systems for:

  • Expanding standalone first names and nicknames to full names based on a curated person database
  • Extracting Swiss cantons from location mentions
  • Translating cantons into English and French
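
These rules could look roughly like the sketch below. All table contents are illustrative: the person entry is a made-up placeholder rather than a record from the curated database, and the interpretation of canton extraction as a lookup from place names to cantons is an assumption.

```python
import re

# Hypothetical placeholder entry; the real person database is not shown here.
PERSONS = {"Anni": "Anna Beispiel"}

# Illustrative subset of a place-to-canton lookup (German canton names).
PLACE_TO_CANTON = {"Thun": "Bern", "Zermatt": "Wallis", "Davos": "Graubünden"}

CANTON_TRANSLATIONS = {
    "Bern": {"en": "Bern", "fr": "Berne"},
    "Wallis": {"en": "Valais", "fr": "Valais"},
    "Graubünden": {"en": "Grisons", "fr": "Grisons"},
    "Genf": {"en": "Geneva", "fr": "Genève"},
}


def expand_names(text):
    """Replace a standalone first name or nickname with the full name."""
    for short, full in PERSONS.items():
        surname = full.split()[-1]
        # Skip occurrences that are already followed by the surname.
        pattern = rf"\b{re.escape(short)}\b(?!\s+{re.escape(surname)}\b)"
        text = re.sub(pattern, full, text)
    return text


def extract_cantons(text):
    """Return the cantons of all known places or cantons mentioned in the text."""
    found = []
    lookup = {**PLACE_TO_CANTON, **{c: c for c in CANTON_TRANSLATIONS}}
    for place, canton in lookup.items():
        if re.search(rf"\b{re.escape(place)}\b", text) and canton not in found:
            found.append(canton)
    return found


def translate_canton(canton, language):
    """Translate a German canton name; unknown names or languages pass through."""
    return CANTON_TRANSLATIONS.get(canton, {}).get(language, canton)


print(expand_names("Anni malte in Thun."))
print([translate_canton(c, "fr") for c in extract_cantons("Anni malte in Thun.")])
```

Because these steps are plain lookups, they behave deterministically and can be checked independently of the language model.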

9. Guard Rails

To ensure robustness, safety, and factual integrity, the following guard rails have been implemented:

  • Problematic or harmful language is detected and mitigated in accordance with Responsible AI principles
  • Off-topic or irrelevant questions are first reinterpreted to relate to Ruth von Fischer where possible; if no meaningful interpretation is found, the system gently prompts the user to ask a new question about Ruth and her life
  • A post-verification step ensures that each generated answer is grounded in the underlying knowledge base (factual consistency check)

10. Developer Utilities

An Easter egg mode allows developers to experiment with different retrieval chunk counts and weighting parameters (years/NER/similarity), supporting evaluation and fine-tuning.

11. Future Directions

Planned improvements include:

  • Adding support for more languages
  • Expanding the knowledge base with additional textual sources on Ruth von Fischer
  • Rendering an image if the answer refers to one
  • Resolving pronouns in the source data

12. Imprint

Author: Dominik Heeb

Email: dominik.heeb@trigonella.ch

Phone: +41 76 562 73 54