Author: Mushfiqur Rahman Talha
Organisation: Greentech Apps Foundation (GTAF)
Elasticsearch Search Optimisation for Islamic Digital Libraries
Elasticsearch search optimisation is a critical challenge for Islamic digital platforms. On these platforms search is not just a technical feature; it is a matter of trust, learning accuracy, and scholarly integrity. When users search Hadith or other religious texts, even small ranking errors can significantly affect understanding.
In this article, we share how the R&D team at Greentech Apps Foundation systematically evaluated and optimised Elasticsearch search for Hadith retrieval, comparing lexical, semantic (vector), and hybrid approaches. This work is based on our recent empirical research and experimentation.
Why Islamic Text Search Is Different
Islamic texts—particularly Hadith—pose unique challenges for search systems:
- Short documents (often 50–80 words)
- Formulaic narration chains
- Highly specialised vocabulary
- Strong expectation of precision
Most search deployments keep the default BM25 configuration, which is tuned for web or news content rather than religious corpora. Our goal was to determine whether systematic optimisation could significantly improve retrieval quality.
Experiment Overview
We evaluated three major search paradigms:
1. Lexical Search (Term-Based)
- BM25
- Divergence from Randomness (DFR)
- Divergence from Independence (DFI)
- Information-Based (IB)
- Language Models (Dirichlet & Jelinek–Mercer)
2. Vector Search (Semantic Retrieval)
- Multiple pre-trained embedding models
- Elasticsearch HNSW (Approximate Nearest Neighbour) indexing
- Extensive parameter tuning
3. Hybrid Search
- Lexical + vector fusion
- Linear score combination
- Reciprocal Rank Fusion (RRF)
Evaluation used 28 real user queries, with relevance judgements cross-checked against Sunnah.com, and measured using nDCG@10 and MAP@10.
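For readers who want to reproduce this kind of evaluation, the two metrics can be sketched in a few lines of Python. This is a minimal illustration, not our evaluation harness: it assumes graded judgements with linear gain and a log2 discount for nDCG, binary relevance (any grade above zero) for average precision, and normalisation of AP@10 by min(10, number of relevant documents). The document ids are hypothetical.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids: List[str], judgements: Dict[str, float], k: int = 10) -> float:
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    gains = [judgements.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(judgements.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def average_precision_at_k(ranked_ids: List[str], judgements: Dict[str, float], k: int = 10) -> float:
    """AP@k with binary relevance, normalised by min(k, number of relevant documents)."""
    relevant = {doc_id for doc_id, grade in judgements.items() if grade > 0}
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / min(k, len(relevant))


def mean_over_queries(per_query_scores: List[float]) -> float:
    """MAP@k (or mean nDCG@k) is the mean of the per-query scores."""
    return sum(per_query_scores) / len(per_query_scores) if per_query_scores else 0.0


# Hypothetical single-query example: two relevant Hadith, one ranked third.
ranking = ["hadith_a", "hadith_b", "hadith_c"]
qrels = {"hadith_a": 1.0, "hadith_c": 1.0}
print(ndcg_at_k(ranking, qrels), average_precision_at_k(ranking, qrels))
```

Averaging the per-query values over all 28 queries with `mean_over_queries` gives the corpus-level nDCG@10 and MAP@10.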
Key Findings (TL;DR)
✅ Lexical Search Winner: LM-Dirichlet
Language Modelling with Dirichlet smoothing outperformed BM25:
- nDCG@10 = 0.701
- Optimal smoothing parameter: µ ≈ 24–28
- +5.6% improvement over BM25
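This is a surprisingly low µ value, far below Elasticsearch's default of 2000. It shows that Hadith documents benefit from trusting document-level statistics more than collection-level smoothing: with documents of only 50–80 words, a low µ keeps the smoothing prior from swamping each document's own term counts.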
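To make the effect of µ concrete, here is a small sketch of the textbook Dirichlet-smoothed estimate, p(w|d) = (tf + µ·p(w|C)) / (|d| + µ), together with index settings that select the similarity in Elasticsearch. The similarity name `hadith_lm`, the field name `text`, and the choice of µ = 25 (a point inside the 24–28 range above) are assumptions for illustration. Lucene's actual scoring formula is a rank-oriented variant of this estimate, so the helper functions illustrate the idea rather than reproduce Elasticsearch scores.

```python
def dirichlet_prob(tf: int, doc_len: int, collection_prob: float, mu: float) -> float:
    """Textbook Dirichlet-smoothed term probability p(w|d)."""
    return (tf + mu * collection_prob) / (doc_len + mu)


def document_weight(doc_len: int, mu: float) -> float:
    """Share of the estimate that comes from the document itself: |d| / (|d| + mu)."""
    return doc_len / (doc_len + mu)


def lm_dirichlet_index_body(mu: float = 25.0) -> dict:
    """Index settings and mapping using the LMDirichlet similarity.

    'hadith_lm' and 'text' are hypothetical names chosen for this sketch.
    """
    return {
        "settings": {
            "index": {
                "similarity": {
                    "hadith_lm": {"type": "LMDirichlet", "mu": mu}
                }
            }
        },
        "mappings": {
            "properties": {
                "text": {"type": "text", "similarity": "hadith_lm"}
            }
        },
    }


# A 65-word Hadith (midpoint of the 50-80 word range) under low and default mu.
print(document_weight(65, 25.0))    # roughly 0.72
print(document_weight(65, 2000.0))  # roughly 0.03
```

With µ = 25 a 65-word document supplies about 72% of the estimate, whereas with the default µ = 2000 it supplies about 3%, which is why the default behaves so differently on short texts.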
✅ Vector Search Winner: embeddinggemma-300m
Among all embedding models tested, Google’s embeddinggemma-300m performed best:
- nDCG@10 = 0.7225
- Outperformed all lexical methods
- Even default HNSW settings beat BM25
⚠️ Counter-Intuitive HNSW Result
Best HNSW configuration:
- m = 5
- ef_construction = 100
Lower graph connectivity than the Elasticsearch default (m = 16) produced better semantic precision, likely because Hadith texts are short and semantically dense.
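As an illustration of where these parameters live, the sketch below builds a `dense_vector` mapping with explicit HNSW options and a matching kNN search body as plain Python dictionaries. The field names `text` and `embedding`, the 768-dimension vector size, cosine similarity, and the `num_candidates` value are assumptions made for this example, not settings reported in our experiments.

```python
from typing import List


def hadith_vector_mapping(dims: int = 768, m: int = 5, ef_construction: int = 100) -> dict:
    """Mapping with an HNSW-indexed dense_vector field.

    dims=768 and the field names are assumptions; m and ef_construction
    are the best-performing values reported above.
    """
    return {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",
                "index_options": {
                    "type": "hnsw",
                    "m": m,
                    "ef_construction": ef_construction,
                },
            },
        }
    }


def knn_search_body(query_vector: List[float], k: int = 10, num_candidates: int = 100) -> dict:
    """Approximate kNN search body; num_candidates=100 is an illustrative value."""
    return {
        "knn": {
            "field": "embedding",
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }


mapping = hadith_vector_mapping()
print(mapping["properties"]["embedding"]["index_options"])
```

Because `m` and `ef_construction` are fixed when the graph is built, comparing configurations means reindexing the corpus for each setting.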
🚀 Best Overall: Hybrid Search
Combining lexical and vector signals gave the best results, close to those of a production system:
| Method | nDCG@10 |
|---|---|
| BM25 (baseline) | 0.664 |
| Vector (optimised) | 0.7225 |
| Hybrid (linear, α = 0.5) | 0.7407 |
This was only 0.04% below Sunnah.com’s production system, achieved entirely with open tooling and transparent ranking logic.
Why Hybrid Search Works Best
Lexical and semantic retrieval capture different notions of relevance:
- Lexical search excels at exact phrasing and terminology
- Vector search captures paraphrases and conceptual similarity
Balanced fusion (50/50 weighting) proved optimal for our corpus, suggesting that both signals contribute comparably to Islamic knowledge retrieval.
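A linear score combination of this kind can be sketched as follows. Two details are assumptions of the sketch rather than facts from our experiments: scores are min-max normalised per query before mixing (raw lexical and cosine scores live on different scales), and α is taken to weight the lexical side. The document ids and scores are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max_normalise(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one result list's scores to [0, 1]; a constant list maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def linear_fusion(
    lexical: Dict[str, float],
    vector: Dict[str, float],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Fused score = alpha * lexical + (1 - alpha) * vector, on normalised scores.

    Documents missing from one list contribute 0.0 from that side.
    """
    lex = min_max_normalise(lexical)
    vec = min_max_normalise(vector)
    fused = {
        doc_id: alpha * lex.get(doc_id, 0.0) + (1 - alpha) * vec.get(doc_id, 0.0)
        for doc_id in set(lex) | set(vec)
    }
    # Sort by descending score, breaking ties by id for a deterministic order.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical top results for one query from each retriever.
lexical_scores = {"h1": 12.0, "h2": 8.0, "h3": 4.0}
vector_scores = {"h2": 0.9, "h3": 0.7, "h4": 0.5}
fused_ranking = linear_fusion(lexical_scores, vector_scores, alpha=0.5)
print([doc_id for doc_id, _ in fused_ranking])
```

In this toy example `h2` rises to the top because it scores reasonably well in both lists, while `h1`, which is the best lexical match but absent from the vector results, drops to second.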
Why Elasticsearch Search Optimisation Matters for Islamic Content
Elasticsearch search optimisation is especially important for Islamic knowledge platforms because users expect precision, contextual relevance, and doctrinal accuracy. Unlike general web search, even small ranking errors in Hadith or Qur’anic content can lead to misunderstanding or misuse of religious information.
Optimised search systems help ensure that authoritative sources appear first, related narrations are discoverable, and semantically relevant content is surfaced even when users phrase queries differently. By combining lexical accuracy with semantic understanding, search platforms can better support scholars, students, and everyday learners alike. This is why Greentech Apps Foundation prioritises rigorous evaluation and optimisation of its search infrastructure as part of its broader mission to make authentic Islamic knowledge accessible through technology.
Practical Takeaways for Developers
If you’re building search for Islamic or other specialised text corpora:
- Do not rely on BM25 defaults
- Try LM-Dirichlet with low µ values
- Evaluate embedding models empirically—benchmarks don’t always transfer
- Tune HNSW parameters (lower m can be better)
- Use hybrid fusion wherever possible
These improvements require minimal architectural changes, yet in our experiments the hybrid configuration raised nDCG@10 from 0.664 (BM25 baseline) to 0.7407, a relative gain of about 11.6%.
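For teams that prefer not to calibrate scores at all, Reciprocal Rank Fusion, the rank-based option listed in the experiment overview, is a simple starting point for hybrid fusion. The sketch below is a generic implementation; the constant k = 60 is a commonly used default and an assumption here, not a value tuned in our experiments, and the document ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by id for a deterministic order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical ranked ids for one query from the lexical and vector retrievers.
lexical_ranking = ["h1", "h2", "h3"]
vector_ranking = ["h2", "h3", "h4"]
rrf_ranking = reciprocal_rank_fusion([lexical_ranking, vector_ranking])
print([doc_id for doc_id, _ in rrf_ranking])
```

RRF uses only rank positions, so it needs no score normalisation, but it also cannot express a tunable lexical/vector weighting the way a linear combination can.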
What’s Next for GTAF Search R&D
Future work includes:
- Query-dependent fusion weights
- Learning-to-Rank integration
- Multilingual (Arabic ↔ English) retrieval
- Domain-specific embedding fine-tuning
- User-centric evaluation with scholars and students
Final Thoughts
This work demonstrates that religious text search deserves the same level of rigour as academic or enterprise search systems. With systematic evaluation and domain-aware tuning, open technologies like Elasticsearch can deliver production-grade retrieval quality for Islamic knowledge platforms.
📄 You can read the full white paper here:
👉 https://www.researchgate.net/publication/400058904_Elasticsearch_Search_Experimentation_and_Implementation
