[WIP] The RAG Triad - Metrics to evaluate Advanced RAG
Recording, Slides: https://docs.google.com/presentation/d/14zE7RMad5OAp3IknnjqGdfdmKwiJObo_Wp6ikSwPxCo/edit?usp=sharing
Code: https://github.com/OrionStar25/Build-and-Evaluate-RAGs
References:
- https://huggingface.co/learn/cookbook/en/rag_evaluation
- https://python.langchain.com/v0.1/docs/integrations/chat/google_vertex_ai_palm/
- https://cloud.google.com/vertex-ai/docs/tutorials/jupyter-notebooks#vertex-ai-workbench
start with:
- Steps in evaluating RAG
- Build RAG
- Build an evaluation dataset
- manually create dataset
- use an llm to create dataset
- use another llm to filter out relevant questions
- Use another LLM as a judge to critique the evaluation set
There are lots of things to tweak in the RAG pipeline, but there should be a proper way to evaluate the impact of each change.
- dataset used is huggingface documentation
- text + source
- synthetic evaluation dataset creation
- input: randomised context from knowledge base
- text+metadata
- chunk documents into equal-sized chunks with overlap for continuity
- output:
- possible question from the context, phrased as a user would ask it
- force the LLM not to mention anything about using a context to generate this question (the user never had the context)
- answer to the given question, supported by the context
- use LLM A (Mixtral-8x7B-Instruct) to create this set.
- The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts (8 experts).
- The Mixtral-8x7B outperforms Llama 2 70B on most benchmarks.
generate at least 200 QA pairs, since about half of these will get filtered out and the evaluation set should contain at least ~100 questions (a minimal generation sketch follows below).
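A minimal sketch of the chunking and QA-generation step, assuming a LangChain text splitter and the Hugging Face `InferenceClient`; the prompt wording, the parsing heuristic, and the `raw_docs` variable are illustrative assumptions, not the exact code from the talk:

```python
# Sketch: chunk the knowledge base, then generate synthetic QA pairs with LLM A.
import random

from huggingface_hub import InferenceClient
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

QA_PROMPT = """Write one factoid question and its answer based on the context below.
The question must be answerable from the context, but must NOT mention the context
or phrases like "according to the passage". Use exactly this format:

Factoid question: (your question)
Answer: (your answer)

Context: {context}
Output:"""

# Equal-sized chunks with overlap for continuity.
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
# `raw_docs` is assumed to be a list of (text, source) tuples from the knowledge base.
docs = splitter.split_documents(
    [Document(page_content=text, metadata={"source": src}) for text, src in raw_docs]
)

client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")  # LLM A
qa_pairs = []
for doc in random.sample(docs, 200):  # randomised contexts; assumes >= 200 chunks
    output = client.text_generation(QA_PROMPT.format(context=doc.page_content), max_new_tokens=512)
    # Simple parsing heuristic based on the requested output format.
    question = output.split("Factoid question:")[-1].split("Answer:")[0].strip()
    answer_text = output.split("Answer:")[-1].strip()
    qa_pairs.append({
        "question": question,
        "answer": answer_text,
        "context": doc.page_content,
        "source": doc.metadata["source"],
    })
```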
- critique agent:
- rate each question on several criteria
- evaluation criteria:
- groundedness of question within context
- relevance of the question with respect to the expected task
- whether the question can stand alone, without any context
- give a score from 1 to 5 and remove all questions with a low score on any criterion (see the sketch after this list)
- use same LLM A to critique
- input: the generated question, plus its context for the groundedness check
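A rough sketch of the critique agent; the prompt wording, the score-parsing heuristic, and the threshold of 4 are assumptions:

```python
# Sketch: critique agent that rates each generated question on a 1-5 scale and
# filters out low-scoring ones. Uses the same LLM A as the generation step.
from huggingface_hub import InferenceClient

critic = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

CRITIQUE_PROMPTS = {
    "groundedness": "Rate 1-5 how well the question can be answered from the context.\n"
                    "Question: {question}\nContext: {context}\nTotal rating:",
    "relevance": "Rate 1-5 how useful this question is to a developer reading the Hugging Face docs.\n"
                 "Question: {question}\nTotal rating:",
    "standalone": "Rate 1-5 how understandable this question is without any extra context.\n"
                  "Question: {question}\nTotal rating:",
}

def rate(prompt: str) -> int:
    """Ask the critique LLM for a 1-5 rating and parse the first digit it returns."""
    out = critic.text_generation(prompt, max_new_tokens=32)
    digits = [int(ch) for ch in out if ch.isdigit()]
    return digits[0] if digits else 1  # fall back to the lowest score if parsing fails

filtered = []
for qa in qa_pairs:
    scores = {
        name: rate(template.format(question=qa["question"], context=qa["context"]))
        for name, template in CRITIQUE_PROMPTS.items()
    }
    if all(score >= 4 for score in scores.values()):  # keep only strong questions
        filtered.append(qa)
```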
- Build RAG
- split documents using recursive split
- embed documents
- model used: thenlper/gte-small
- retrieve relevant chunks - like a search engine
- FAISS index stores embeddings for quick retrievals of similar chunks
- use retrieved contexts + query to formulate answer
- rerank retrieved context
- retrieve 30 docs
- rerank and output best 7
- model used: colbert-ir/colbertv2.0
- reevaluates and reorders the documents retrieved by the initial search based on their relevance to the query
- answer using llm
- LLM B (zephyr-7b-beta)
- a fine-tuned version of mistralai/Mistral-7B-v0.1, trained to act as a helpful assistant.
- end-to-end, the retrieve → rerank → answer flow looks like the sketch below
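A hedged end-to-end sketch, assuming LangChain's FAISS wrapper for dense retrieval, RAGatouille for the ColBERTv2 reranker, and `InferenceClient` for the zephyr reader; the prompt, the exact `rerank` return format, and the `docs` variable (the chunked documents from earlier) are assumptions:

```python
# Sketch: embed chunks into a FAISS index, retrieve 30 candidates, rerank to the
# best 7 with ColBERTv2, then answer with zephyr-7b-beta.
from huggingface_hub import InferenceClient
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from ragatouille import RAGPretrainedModel

embeddings = HuggingFaceEmbeddings(model_name="thenlper/gte-small")
vectorstore = FAISS.from_documents(docs, embeddings)  # `docs` = chunked documents from earlier

reranker = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
reader = InferenceClient("HuggingFaceH4/zephyr-7b-beta")  # LLM B

def answer(question: str) -> str:
    # 1. retrieve 30 candidate chunks, like a search engine
    candidates = [d.page_content for d in vectorstore.similarity_search(question, k=30)]
    # 2. rerank and keep the best 7 (assumes rerank returns dicts with a "content" key)
    best = reranker.rerank(question, candidates, k=7)
    context = "\n\n".join(doc["content"] for doc in best)
    # 3. answer with the reader LLM, grounded in the reranked context
    prompt = (
        "Using the context below, give a concise answer to the question.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return reader.text_generation(prompt, max_new_tokens=512)
```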
- Evaluate RAG
- use answer correctness as a metric
- use a scoring rubric to ground the evaluation
- use LLM C (ChatVertexAI / PaLM 2) as the judge (see the judge sketch below)
- should also use other metrics such as context relevance and faithfulness (groundedness): https://docs.ragas.io/en/latest/concepts/metrics/index.html
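A hedged sketch of the answer-correctness judge; the `langchain_google_vertexai.ChatVertexAI` class exists, but the model name, rubric wording, and score format here are illustrative assumptions:

```python
# Sketch: LLM-as-judge for answer correctness, scored against a 1-5 rubric.
from langchain_google_vertexai import ChatVertexAI

JUDGE_PROMPT = """You are grading an answer produced by a RAG system.
Rubric:
5 - fully correct and complete compared to the reference answer
3 - partially correct, missing or adding minor details
1 - incorrect or unrelated to the reference answer

Question: {question}
Reference answer: {reference}
Generated answer: {generated}

Reply with "Score: <1-5>" followed by a one-sentence justification."""

judge = ChatVertexAI(model_name="chat-bison")  # a PaLM 2 chat model on Vertex AI (LLM C)

def correctness(question: str, reference: str, generated: str) -> str:
    response = judge.invoke(
        JUDGE_PROMPT.format(question=question, reference=reference, generated=generated)
    )
    return response.content  # e.g. "Score: 4 - ..."
```

Looping this judge over the ~100 filtered questions and averaging the scores gives a single number that can be compared across pipeline tweaks (chunk size, reranker on or off, reader model, and so on).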
The RAG Triad (a ragas-based sketch follows this list):
- answer relevance
- is the answer relevant to the question
- context relevance
- is the retrieved context relevant to the question asked
- groundedness
- is the answer grounded in the context retrieved
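The three triad metrics map onto metrics in the ragas library linked above. A sketch, with the caveat that ragas column names and defaults vary across releases (and ragas calls its own judge LLM under the hood), so treat the exact API below as an assumption:

```python
# Sketch: scoring the RAG triad with ragas over the filtered synthetic eval set.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

rows = []
for qa in filtered:  # critique-filtered QA pairs from earlier
    retrieved = [d.page_content for d in vectorstore.similarity_search(qa["question"], k=7)]
    rows.append({
        "question": qa["question"],
        "contexts": retrieved,               # what the pipeline actually retrieved
        "answer": answer(qa["question"]),    # what the pipeline actually answered
        "ground_truth": qa["answer"],        # reference answer from the synthetic set
    })

eval_data = Dataset.from_dict({key: [row[key] for row in rows] for key in rows[0]})
result = evaluate(
    eval_data,
    metrics=[answer_relevancy, context_precision, faithfulness],  # the triad
)
print(result)  # one aggregate score per metric
```

Here answer_relevancy covers answer relevance, context_precision approximates context relevance, and faithfulness is the groundedness check.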