31.
HN
Show HN: AgentForge – Multi-LLM Orchestrator in 15KB of Python
AgentForge is a lightweight Python tool designed to streamline the orchestration of various Large Language Model (LLM) providers using a unified asynchronous interface. It allows seamless switching between providers like Claude, Gemini, OpenAI, and Perplexity with minimal effort by altering just one parameter. Addressing challenges such as provider lock-in, excessive framework complexity, and production inefficiencies, AgentForge features token-aware rate limiting, prompt templates, retry mechanisms with backoff strategies, and cost-efficient caching and routing.
The tool's architecture includes multiple layers: an Interface Layer (comprising CLI, REST API, and Streamlit Visualizer), a Core Orchestration layer with components like AIOrchestrator and Rate Limiter, an Agents Framework featuring the ReAct Agent Loop and Multi-Agent Mesh, Provider Adapters, a Tools System, and Observability. This structure supports easy testing, deployment, and integration into existing systems.
AgentForge is designed for rapid setup, allowing users to go from installation to making their first API call in under five minutes. It supports seamless provider switching and demonstrates substantial cost savings—up to 89% through effective caching and routing strategies. Built with modern tools such as HTTPX for asynchronous HTTP requests, it integrates seamlessly into continuous integration/continuous deployment (CI/CD) workflows via GitHub Actions.
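To make the one-parameter switching claim concrete, a usage sketch is shown below; the AIOrchestrator name comes from the architecture description, but the constructor arguments and completion method are assumptions about how such an interface could look, not AgentForge's documented API.

```python
# Hypothetical sketch of one-parameter provider switching; the AIOrchestrator
# class name appears in the summary above, but the constructor arguments and
# complete() method are assumptions, not the project's actual API.
import asyncio

from agentforge import AIOrchestrator  # assumed import path


async def main() -> None:
    # Switching providers would amount to changing this single argument.
    for provider in ("openai", "claude", "gemini"):
        orchestrator = AIOrchestrator(provider=provider)
        reply = await orchestrator.complete("Summarize RAG in one sentence.")
        print(provider, "->", reply)


asyncio.run(main())
```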
The project is MIT-licensed, encouraging contributions and collaborations while showcasing its effectiveness in significantly reducing costs—a fact supported by testimonials from industry professionals. AgentForge positions itself as an essential solution for businesses aiming to utilize multiple LLMs efficiently without being confined to a single provider's API ecosystem.
Keywords: #phi4, API Keys, AgentForge, Architecture Decisions, Async Interface, Benchmarks, Consulting, Cost Optimization, EnterpriseHub, GitHub Actions, Implementation, LLM, Licensing, Multi-Agent Mesh, Orchestrator, Prompt Templates, Provider Switching, Python, RAG, Rate Limiting, Testing, Tool Execution, Web Scraping
github.com 3 hours ago
|
132.
HN
Why agent memory needs more than RAG (2026 paper and structure over similarity)
The 2026 paper "Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation" critiques the use of Retrieval-Augmented Generation (RAG) for managing agent memory, emphasizing its inefficiencies in handling structured data due to an over-reliance on similarity metrics. This approach often leads to redundant results and fragmented retrieval of temporally linked evidence. To address these limitations, the authors propose shifting from similarity-based methods to structure-driven approaches that leverage entities, relationships, and timelines for better information retrieval.
The paper introduces xMemory, a system designed with a four-level hierarchy (from messages to themes) using LLM-generated summaries. While xMemory outperforms existing systems on benchmarks, it shows brittleness when faced with formatting deviations and update failures. In contrast, Neotoma adopts a deterministic schema-first approach without relying on LLMs for critical operations. It ensures consistent retrieval by employing typed entities and explicit relationships, efficiently supporting both semantic and structural queries.
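As an illustration of the schema-first idea, a minimal sketch of typed entities and explicit, timestamped relationships is shown below; the field names are assumptions for illustration, not Neotoma's actual data model.

```python
# Illustrative schema-first memory records: typed entities plus explicit,
# timestamped relationships, so retrieval can filter on structure instead of
# relying only on embedding similarity. Field names are assumptions.
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Entity:
    entity_id: str
    entity_type: str          # e.g. "person", "project", "document"
    name: str


@dataclass(frozen=True)
class Relation:
    source_id: str
    relation: str             # e.g. "works_on", "mentions"
    target_id: str
    observed_at: datetime     # supports timeline queries


def related(entities, relations, entity_id, relation):
    """Deterministic structural lookup: follow typed edges, no similarity scores."""
    targets = {r.target_id for r in relations
               if r.source_id == entity_id and r.relation == relation}
    return [e for e in entities if e.entity_id in targets]
```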
The paper highlights that xMemory is well-suited for scenarios involving conversational data where emergent structure is necessary, whereas Neotoma excels in applications demanding traceability and predefined schemas. Overall, the authors advocate for a schema-first methodology to overcome RAG's brittleness, ensuring more reliable retrieval of agent memory.
Keywords: #phi4, Agent memory, Neotoma, RAG, brittleness, conversation stream, determinism, embeddings, entity graph, hierarchy, retrieval, schema-first, semantic retrieval, similarity, structural retrieval, structure, xMemory
markmhendrickson.com 9 hours ago
|
194.
HN
I got tired of on-device LLMs crashing my apps, so I built a managed runtime
Edge-Veda is a sophisticated runtime environment specifically crafted for Flutter applications to enable sustainable on-device artificial intelligence capabilities, encompassing text, vision, speech, and Retrieval-Augmented Generation (RAG) processing. This solution overcomes typical challenges associated with other on-device AI implementations such as thermal throttling, memory spikes, and the absence of runtime visibility that often result in application crashes. By running entirely on the device without requiring cloud dependencies, Edge-Veda ensures privacy during inference since it eliminates network calls.
Key features include maintaining persistent model instances to support long sessions while dynamically adapting to constraints like thermal limits, memory availability, and battery status. It provides structured observability for debugging via performance tracing tools and incorporates a Dart SDK with Flutter integration, facilitating access to C API functions and various AI models. The architecture underpinning Edge-Veda employs persistent workers for text, vision, and speech tasks to keep model data in memory across sessions while using runtime policies to manage resource constraints through adaptive degradation strategies.
Edge-Veda's runtime supervision is managed by compute budget contracts and adaptive profiles that adjust the quality of service based on device performance metrics. A central scheduler handles concurrent workloads with priority-based degradation. Its current capabilities include core inference tasks like multi-turn chat sessions, real-time speech recognition, embedding pipelines for structured output generation, and vector search using pure Dart implementations.
For integration, users can easily add Edge-Veda to their Flutter projects through a simple dependency in `pubspec.yaml`. It supports diverse use cases such as text generation, streaming transcription, multi-turn conversations, tool calling, and continuous vision inference. The project encourages contributions for platform validation, particularly on Android, enhancements in runtime policies, trace analysis tools, model support, and example app development. Edge-Veda's structure includes C++ core components for AI processing, Dart SDK integration, and scripts for building iOS frameworks, targeting developers focused on creating privacy-sensitive applications, on-device AI assistants, continuous perception apps, and long-running edge agents.
Keywords: #phi4, Android, C API, CPU, Dart SDK, Edge-Veda, Flutter, GPU, QoS levels, RAG, adaptive budgeting, chat templates, embeddings, iOS, memory management, model management, observability, on-device AI, performance tracing, platform validation, privacy-sensitive, runtime, speech recognition, text generation, thermal throttling, tool calling, vector search, vision inference
github.com 21 hours ago
https://news.ycombinator.com/item?id=47054873 20 hours ago
https://news.ycombinator.com/item?id=47055576 20 hours ago
|
206.
HN
Run LLMs locally in Flutter with <200ms latency
Edge-Veda is a managed on-device AI runtime developed specifically for Flutter, designed to efficiently run large language models (LLMs) locally across various tasks such as text processing, vision, speech recognition, and retrieval-augmented generation with sub-200ms latency. The platform operates independently of cloud services, enhancing privacy by ensuring data remains local. It addresses typical challenges in on-device AI applications like thermal throttling, memory spikes, unstable long sessions, and limited runtime visibility.
Key features include sustainable performance through adaptive budget profiles that adjust to device constraints like thermal pressure, battery level, and available memory, using a central scheduler for workload management with priority-based degradation. Edge-Veda maintains persistent contexts by keeping models in memory across sessions, ensuring stability during prolonged use. It provides structured performance tracing and offline analysis tools for better observability and debugging.
The runtime supports various functionalities, including text generation, multi-turn chat management, on-device speech recognition, vector index search, and function calling with tool registries and schema validation. The Smart Model Advisor offers tailored model recommendations based on device profiles, optimizing performance according to specific hardware characteristics such as RAM and processor type. Currently validated for iOS devices using Metal GPU, Edge-Veda plans to extend support to Android CPU and Vulkan GPU.
With a codebase of approximately 22,700 lines across different components, the architecture integrates Flutter Dart SDK with persistent workers for text, vision, and speech models, backed by a central scheduler and performance monitoring services. It is designed to facilitate privacy-sensitive, long-running, or offline-first AI applications like voice assistants and continuous perception apps.
Edge-Veda's roadmap includes future developments such as Android runtime validation, integration of text-to-speech capabilities, semantic perception APIs, observability dashboards, support for NPU/CoreML backends, and model conversion tools. The project is open for contributions in areas like platform validation, runtime policy improvements, trace analysis, and expanding model support, utilizing Apache 2.0 licensing and building upon the llama.cpp and whisper.cpp libraries.
Keywords: #phi4, Android, C API, Dart SDK, Edge-Veda, Flutter, GPU acceleration, LLMs, RAG, adaptive budgeting, compute contracts, iOS, memory management, model recommendations, observability, on-device AI, performance tracing, privacy-sensitive, runtime supervision, speech recognition, text generation, thermal throttling, vision inference
github.com 23 hours ago
|
296.
HN
Sub-Millisecond RAG on Apple Silicon. No Server. No API. One File
Wax is a file-based solution designed to optimize Retrieval-Augmented Generation (RAG) on Apple Silicon devices by eliminating the need for external servers or APIs, thereby simplifying AI memory management. It achieves sub-millisecond retrieval times and supports fast vector search through Metal GPU utilization, specifically benefiting devices like the M1 Pro. With its single-file architecture, Wax offers offline capabilities, crash recovery, and enhanced privacy as all operations occur locally on the device. The solution is versatile, accommodating various data types such as text, photos, and videos, which enhances its applicability across different domains including AI assistants, privacy-sensitive applications, and robust search tools.
Wax incorporates advanced features like query-adaptive hybrid search for optimized retrieval, tiered memory compression to manage context efficiently, and deterministic token budgeting to ensure reproducibility of results. These capabilities make it well-suited for offline-first apps, research tooling, and workflows that demand durable state management without network dependencies. The solution operates on Swift 6.2, targeting iOS/macOS 26 environments with Apple Silicon architecture.
Getting started with Wax is straightforward: users can integrate it into their projects via a package manager, select the appropriate memory type (text, photo, or video), and utilize simple functions for data ingestion and recall. The comprehensive file format of Wax includes integrated documents, embeddings, search indices, logs, metadata, and entity graphs in an append-only structure that ensures integrity with checksum verification and dual headers facilitating atomic updates.
Compared to alternatives such as Chroma, Core Data + FAISS, and Pinecone, Wax stands out for its single-file nature, offline functionality, crash safety, GPU acceleration, serverless operation, and native Swift integration. It delivers deterministic RAG functionalities that are particularly advantageous in environments requiring robust, privacy-focused, and resilient AI capabilities. Developers interested in contributing can engage with the project through GitHub and can explore additional tests related to MiniLM CoreML functionalities.
Keywords: #phi4, AI, Apple Silicon, BM25, CoreML, GPU, HNSW index, Metal GPU, MiniLM, RAG, SQLite, Swift, USearch, WAL Ring Buffer, Wax, crash-safe, deterministic, document payloads, embeddings, hybrid search, iOS, macOS, memory, offline, privacy, query-adaptive, reproducible retrieval, tiered compression, token budgeting, vector search
github.com a day ago
https://github.com/christopherkarani/Wax a day ago
https://www.pangram.com/history/49335ddf-118d-43e4-9340 a day ago
https://github.com/christopherkarani/Wax/blob/ a day ago
https://github.com/christopherkarani/Wax?tab=readme-ov- a day ago
https://github.com/christopherkarani/Wax/blob/ a day ago
https://github.com/tobi/qmd 14 hours ago
|
390.
HN
Show HN: CodeGraph CLI – Chat with your codebase using graph-augmented RAG
CodeGraph CLI is an advanced tool designed to enhance codebase comprehension through semantic search and analysis by integrating technologies like tree-sitter for abstract syntax tree parsing, SQLite for managing dependency graphs, and LanceDB for vector embeddings. This combination allows it to maintain the structural relationships within code by merging vector search with breadth-first search graph traversal. Among its key features are semantic search, which enables code identification based on meaning rather than exact matches; impact analysis that evaluates multi-hop dependencies prior to changes; and interactive graph visualization using HTML and Graphviz DOT exports. Additionally, it offers a browser-based explorer for visual navigation supplemented by Mermaid diagrams and AI explanations, along with a conversational chat feature facilitating natural language coding sessions through context-aware retrieval augmented generation (RAG). It also employs a multi-agent system via CrewAI to handle tasks like autonomous code generation, refactoring, and analysis, as well as automatically generating professional project documentation. CodeGraph CLI supports auto onboarding by creating AI-generated README files from the code graph and ensures data privacy with its local-first design.
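The core retrieval pattern, combining vector search seeds with a bounded breadth-first traversal of the dependency graph, can be sketched as follows; the store interfaces and function names are assumptions, not CodeGraph CLI's internals.

```python
# Illustrative graph-augmented retrieval: seed results come from a vector
# search, then a bounded breadth-first traversal of the dependency graph pulls
# in structurally related code. Names and store interfaces are assumptions.
from collections import deque


def graph_augmented_search(query, vector_store, dep_graph, k=5, max_hops=2):
    # 1. Semantic seeds: top-k chunks by embedding similarity.
    seeds = vector_store.search(query, k=k)            # assumed interface
    seen = {s.node_id for s in seeds}

    # 2. BFS over the dependency graph to preserve structural context.
    queue = deque((s.node_id, 0) for s in seeds)
    while queue:
        node_id, depth = queue.popleft()
        if depth == max_hops:
            continue
        for neighbor in dep_graph.neighbors(node_id):   # assumed interface
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))

    return seen  # node ids to hydrate into the RAG context
```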
To get started with CodeGraph CLI, users install it using pip, configure their preferred language model provider (LLM) either interactively or via command line, and index a project to parse and construct its dependency graph. The tool offers diverse commands for search, impact analysis, visualization, chat interactions, among others. It supports local and cloud-based LLM providers such as Ollama, OpenAI, Anthropic, Groq, Gemini, and OpenRouter. Additionally, it provides various embedding models that range from simple keyword-based hashes to advanced options like Qodo-Embed-1-1.5B.
The architecture of CodeGraph CLI comprises multiple layers: a CLI Layer for command execution, GraphStore utilizing SQLite for dependency management, and VectorStore employing LanceDB for vector embeddings. The tool also features an LLM Adapter and various task-specific agents responsible for file operations, code generation, and analysis. Its open-source nature, under the MIT license, encourages collaboration and distribution within the development community. Developers can set up a virtual environment, install dependencies via pip, and access the full suite of commands organized into categories like configuration, project management, and documentation export, offering a comprehensive solution for modern software development environments.
Keywords: #phi4, AI-generated README, BFS traversal, CodeGraph CLI, CrewAI, LLM providers, LanceDB, SQLite, auto-generate docs, browser-based explorer, codebase navigation, conversational coding, dependency analysis, embedding models, file rollback, graph-augmented RAG, impact analysis, local-first architecture, multi-agent system, project documentation, semantic code search, semantic search, tree-sitter, vector embeddings, visual code explorer
github.com a day ago
|
456.
HN
Graph-based multi-agents smash long-context benchmarks–89% MMLU-Pro on 8B models
The document describes the Graph of Agents (GoA), a graph-based multi-agent system that performs exceptionally well in long-context benchmarks, achieving 89% accuracy on MMLU-Pro with models having 8 billion parameters. It outlines the implementation and evaluation process, starting from setting up the environment using `conda` based on an `environment.yml` file to downloading necessary datasets from Hugging Face. The inference process involves a Python script that generates predictions for evaluation purposes. GoA is compared with baselines such as Chain-of-Agents (CoA) and RAG, offering adjustable parameters like cluster size for testing variations. Evaluation scripts are used to assess results for models such as `qwen_8b` or `llama3_8b`, though they do not consider context window and temperature details. The system allows qualitative analysis by saving detailed outputs if enabled. The implementation of GoA is primarily derived from an existing Chain-of-Agents codebase found on GitHub, suggesting a foundation in established methodologies within the field.
Keywords: #phi4, CUDA_VISIBLE_DEVICES, Chain-of-Agents, GoA inference, Graph of Agents, Graph-based multi-agents, LongBench, baselines, conda env create, environmentyml, eval_longbenchpy, goa_cluster_size, huggingface pipeline, model_name, qualitative analysis, rag, result_longbenchpy
github.com 2 days ago
|
620.
HN
AI to SWE ratio convergence and where AI Jobs are
From January 2023 to January 2026, a notable convergence between Artificial Intelligence (AI) and Software Engineering (SWE) roles emerged, as shown by an analysis of job postings. Although SWE job postings increased by 13.5% overall, this growth was driven almost entirely by the Technology sector (+64.9%) and Financial Services (+29.1%), which together accounted for over half of all such postings. Excluding these sectors, eight of eleven industries saw SWE job postings decline. The ratio of AI to SWE job postings expanded from 0.28 to 0.66 over this period, with AI postings growing far faster than SWE postings: a 96.1% rise versus 13.5%.
AI hiring is widespread across sectors, with robust growth in Healthcare (+54%), Industrials (+50%), and Energy (+68%). Demand for skills related to generative AI tools, including large language models (LLMs), Copilot, and retrieval-augmented generation (RAG), has surged, indicating their rising importance alongside traditional machine learning frameworks such as PyTorch and TensorFlow. This growing significance is mirrored in a median salary premium of $26,000 for AI roles over SWE positions.
The analysis underscores the necessity to move beyond aggregate SWE job counts towards more accurate sector-adjusted metrics or equal-weighted averages due to their misleading nature. It also advocates for monitoring the AI/SWE convergence rate as an essential indicator of future hiring trends. For software engineers, acquiring practical generative AI skills is increasingly important to enhance career prospects and achieve salary advantages. The study's methodology included analyzing 45.4 million job postings using advanced trend decomposition techniques to manage seasonal variations and provided insights through tracking mentions of AI-related technologies.
Keywords: #phi4, AI adoption, AI-SWE convergence, Copilot, Financial Services, LLMs, PyTorch, RAG, Revealera database, STL decomposition, Simpson’s paradox, Technology, TensorFlow, equal-weighted average, generative AI tools, hiring growth, job market trends, job postings, salary premium, seasonal noise, sector analysis, software engineering, trend analysis, volume-weighted aggregate
revealera.substack.com 2 days ago
|
764.
HN
SnapLLM: Switch between local LLM in under 1ms Multi-model&-modal serving engine
SnapLLM is a Large Language Model (LLM) inference engine designed to switch between multiple loaded models in under a millisecond, eliminating the time-consuming unloading and reloading typical of traditional systems. By keeping several models resident in memory, SnapLLM achieves rapid model switching through its vPID architecture. It supports a variety of model types, including text LLMs such as Llama and Mistral as well as vision and diffusion models, on both GPU and CPU platforms.
A standout feature is its compatibility with OpenAI's API, offering seamless integration for users accustomed to the existing ecosystem. The engine includes a React-based desktop application that provides tools such as A/B comparisons and context cache management, enhancing user experience in managing different models. Performance benchmarks demonstrate impressive metrics: model switch time is around 0.02 milliseconds, first token latency at approximately 50 milliseconds, and variable token generation speeds depending on GPU capabilities.
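Because the server exposes an OpenAI-compatible API, a standard OpenAI client pointed at a local endpoint should be able to drive it; the base URL, port, and model identifiers in the sketch below are assumptions rather than values taken from SnapLLM's documentation.

```python
# Sketch of talking to an OpenAI-compatible local server; the base_url, port,
# and model identifiers are assumptions, not values from SnapLLM's docs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Switching between preloaded models would just mean changing the model field.
for model in ("llama-3-8b", "mistral-7b"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "One sentence on KV caches."}],
    )
    print(model, "->", response.choices[0].message.content)
```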
SnapLLM's installation requires several prerequisites, including Visual Studio for Windows, GCC/Clang for Linux, CUDA for GPU acceleration, CMake, and Node.js for the desktop application. Detailed guidance is provided to assist users in building from source across different operating systems. Once set up, starting the SnapLLM server involves straightforward commands that can include preloading models.
The project offers a comprehensive API suite supporting operations such as model loading, switching, text or image generation, and vision input analysis. Additionally, it provides command-line interface (CLI) options for various tasks including server management, text processing with LLMs, and image-related functionalities. As an open-source initiative under the MIT License, SnapLLM invites contributions to enhance features, address bugs, and improve documentation, while encouraging sponsorship to support its ongoing development. Created by Mahesh Vaikri at Aroora AI Labs, SnapLLM aims to empower users with efficient model management capabilities within the AI community.
Keywords: #phi4, A/B comparison, CLI, CMake, CUDA, GPU/CPU hybrid, KV cache, LLM inference, Nodejs, OpenAI API, RAG, React, SnapLLM, architecture, context caching, contributing, demo videos, desktop UI, diffusion models, installation, llamacpp, memory efficiency, model management, model switching, multi-domain assistant, multi-model, performance benchmarks, rapid switching, server locally, serving engine, sponsors, stable-diffusioncpp, sub-millisecond, text LLMs, vPID, vision models
github.com 4 days ago
https://vimeo.com/1157629276 4 days ago
https://vimeo.com/1157624031 4 days ago
https://github.com/snapllm/snapllm 4 days ago
https://arxiv.org/submit/7238142/view 4 days ago
|
768.
HN
I built an AI that runs offline on Android (no cloud)
EdgeDox is an innovative offline AI document assistant designed to function solely on Android devices, eliminating the need for cloud reliance by processing documents locally. This ensures complete privacy and control over user data as it operates without requiring any internet connection post-setup and does not necessitate user accounts. EdgeDox supports various file types including PDFs, text files, and markdown documents, enabling users to query these documents directly through a local Retrieval-Augmented Generation (RAG) system. This design prioritizes speed, accuracy, and privacy by keeping all data confined to the device.
Optimized for mobile environments, EdgeDox is particularly beneficial for students, developers, professionals, and individuals who prioritize their privacy. It offers significant features such as seamless navigation through extensive documents, providing answers about intricate texts, and ensuring functionality even in airplane mode. With no reliance on cloud storage or external systems, EdgeDox stands out for managing confidential work documents, personal notes, and sensitive files without any data sharing or tracking, making it an ideal solution for users concerned with data security and privacy.
Keywords: #phi4, ARM CPUs, Android, Confidentiality, Data Control, EdgeDox, Financial Files, Instant Responses, Legal Files, Local Processing, Markdown, Medical Files, Offline AI, PDFs, Privacy, Query Specs, RAG, Summarize Notes, Surveillance-Free, TXT files, Technical Documentation
play.google.com 4 days ago
|
864.
HN
Show HN: Clonar – A Node.js RAG pipeline with 8-stage multihop reasoning
Clonar is an advanced Retrieval-Augmented Generation (RAG) system designed to enhance query processing through high-precision, multihop reasoning. Unlike conventional RAG systems that rely on a single retrieval-synthesis cycle often leading to incomplete or inaccurate results, Clonar utilizes an 8-stage iterative workflow. This begins with pre-retrieval reasoning and incorporates clarification and critique stages, ensuring responses are accurate, well-grounded, and citation-backed. Its architecture allows each stage in the reasoning loop to be dynamically conditioned, thereby setting a new standard for reliability and precision in AI-powered search systems. Clonar is backend-based, accessible via HTTP requests through tools like curl or Postman, eliminating the need for a frontend interface. This approach minimizes errors known as "hallucinations" and significantly improves the system's capability to manage complex queries effectively.
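Since Clonar is driven entirely over HTTP, a query would look roughly like the sketch below; the endpoint path and the request and response fields are hypothetical, chosen only for illustration.

```python
# Hypothetical request against a Clonar-style backend; the URL path and the
# request/response fields are assumptions made for illustration only.
import requests

resp = requests.post(
    "http://localhost:3000/query",          # assumed endpoint
    json={"question": "How does the critique stage revise the draft answer?"},
    timeout=120,
)
resp.raise_for_status()
payload = resp.json()
print(payload.get("answer"))
print(payload.get("citations"))
```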
Keywords: #phi4, 8-stage reasoning loop, API, Clonar, HTTP client, Nodejs, RAG, agentic workflow, backend, citations, complex queries, dynamic conditioning, grounded answers, hallucinations, high-precision reasoning, iterative flow, multihop reasoning, pipeline, retrieval-augmented generation
github.com 4 days ago
https://github.com/clonar714-jpg/clonar 4 days ago
|
1033.
HN
True, Relevant, and Wrong: The Applicability Problem in RAG
Retrieval Augmented Generation (RAG) systems aim to enhance AI response accuracy by using documented sources, but face significant challenges due to what is identified as the "applicability problem." This issue arises when RAGs provide correct information that is contextually inappropriate, often because of complex and multi-branching policies within expanding corporate knowledge bases. The primary difficulty shifts from verifying source support to ensuring statements' relevance in specific contexts, such as geographical region, eligibility criteria, or product version. A common failure mode occurs when RAG systems combine multiple valid but incompatible policy fragments into a single response, resulting in coherent yet contradictory and impractical "franken-answers" for real-world scenarios.
To mitigate these challenges, the article proposes enhancing knowledge representation by incorporating explicit metadata—a meta-layer—that outlines conditions like temporal validity and scope. This approach involves extracting signals from user queries to identify implicit requirements and employing disambiguation processes that direct questions to suitable knowledge sources. Such improvements aim to enable a multi-agent system capable of delivering contextually accurate responses. The article suggests developing a comprehensive framework to resolve the applicability problem by refining RAG architectures with mechanisms for encoding, recognizing, and routing based on explicit applicability conditions, thereby improving their real-world utility and reliability in information provision.
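A minimal sketch of such an applicability gate, applied between retrieval and generation, might look like the following; the metadata fields (region, validity dates) are illustrative examples of applicability conditions rather than a prescribed schema.

```python
# Sketch of applicability filtering: retrieved chunks carry explicit condition
# metadata, and only chunks whose conditions match the query context reach the
# generator. Field names are illustrative, not a prescribed schema.
from datetime import date


def applicable(chunk_meta: dict, context: dict) -> bool:
    if chunk_meta.get("region") not in (None, context.get("region")):
        return False
    today = context.get("as_of", date.today())
    if chunk_meta.get("valid_from") and today < chunk_meta["valid_from"]:
        return False
    if chunk_meta.get("valid_to") and today > chunk_meta["valid_to"]:
        return False
    return True


def filter_retrieved(chunks, context):
    # chunks: iterable of (text, metadata) pairs from the retriever
    return [(text, meta) for text, meta in chunks if applicable(meta, context)]
```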
Keywords: #phi4, Retrieval Augmented Generation, authoritative grounding, authority conditions, compositional applicability, conditional truths, franken-answer, hallucinations, implicit conditions, policy branches, retrieval failure, scope constraints, temporal validity
www.pinecone.io 5 days ago
|
1148.
HN
Evaluation of RAG Architectures for Policy Document Question Answering
The study titled "Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering" investigates how effectively Retrieval-Augmented Generation (RAG) architectures can mitigate issues faced by Large Language Models (LLMs), such as generating factually incorrect outputs. Focusing on policy documents from entities like the CDC, this research emphasizes the importance of accuracy and integrity in responses. It compares a baseline Vanilla LLM with Basic RAG and Advanced RAG configurations using cross-encoder re-ranking, employing models including Mistral-7B-Instruct-v0.2 and all-MiniLM-L6-v2 to process CDC documents, evaluating their performance on faithfulness and relevance.
The findings reveal that Basic RAG significantly enhances the faithfulness of responses compared to Vanilla LLMs, with Advanced RAG achieving even greater accuracy. The study highlights two-stage retrieval mechanisms as crucial for domain-specific question answering but identifies challenges in document segmentation affecting multi-step reasoning tasks. Overall, it underscores the potential of RAG architectures to improve information integrity within public health policy domains.
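A minimal two-stage retrieval sketch in the spirit of the Advanced RAG configuration is shown below, assuming the sentence-transformers library; the cross-encoder checkpoint named here is a common public one and may differ from the models used in the study.

```python
# Two-stage retrieval sketch: a bi-encoder narrows the corpus, then a
# cross-encoder re-scores the candidates. Model checkpoints are illustrative.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = ["...policy chunk 1...", "...policy chunk 2...", "...policy chunk 3..."]
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "Who is eligible under the updated vaccination guidance?"
hits = util.semantic_search(bi_encoder.encode(query, convert_to_tensor=True),
                            corpus_emb, top_k=20)[0]

# Re-rank the candidates with the cross-encoder and keep the best ones.
pairs = [(query, corpus[h["corpus_id"]]) for h in hits]
scores = cross_encoder.predict(pairs)
reranked = sorted(zip(scores, pairs), key=lambda x: x[0], reverse=True)[:5]
```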
Keywords: #phi4, Artificial Intelligence, CDC Documents, Chunking Strategies, Computational Linguistics, Cross-Encoder Re-ranking, Faithfulness, Hallucinations, Information Integrity, Information Retrieval, Large Language Models, Policy Document, Question Answering, RAG Architectures, Relevance, Retrieval-Augmented Generation
arxiv.org 6 days ago
|
1171.
HN
Show HN: I built an webpage to showcase Singapore's infra and laws
The "Singapore Intelligence RAG System" is an AI-driven platform designed to deliver precise information about Singapore's legal system, policies, historical events, and infrastructure by utilizing Retrieval-Augmented Generation (RAG) technology. It stands out due to its reliance on over 33,000 pages of meticulously curated data, which enhances accuracy compared to conventional large language models. The system's architecture comprises document ingestion, semantic embedding via BGE-M3, quick retrieval through FAISS with millisecond latency, and a robust triple-layer AI failover mechanism ensuring reliability. This failover includes Google Gemini 2.0 Flash as the primary model, Llama 3.3 managed by OpenRouter as secondary, and an additional Llama for fallback. The user interface employs a custom Framer Code Component that utilizes modern design elements such as glassmorphism effects, smooth hover animations, SVG icons, and San Francisco typography to create an engaging user experience. Local embedding inference is performed server-side to enhance privacy and performance without relying on external APIs.
Technologically, the system uses React with Framer Motion for the frontend, Flask and Gunicorn for handling RAG logic in the backend, FAISS for local vector search, and Sentence-Transformers BGE-M3 for embeddings. The text generation is managed by LLMs like Gemini 2.5 flash and Llama 3.3. For deployment, Hugging Face Spaces with Docker-based cloud hosting ensures scalability and ease of access.
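The retrieval layer described above can be approximated in a few lines; the FAISS index type and normalization choices in the sketch below are assumptions rather than the project's exact configuration.

```python
# Sketch of the retrieval layer: BGE-M3 dense embeddings indexed in FAISS with
# inner-product search. The index type and normalization are assumptions.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
docs = ["...curated page 1...", "...curated page 2..."]

doc_emb = model.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_emb.shape[1])   # cosine via normalized vectors
index.add(doc_emb)

query_emb = model.encode(["What does the Employment Act cover?"],
                         normalize_embeddings=True)
scores, ids = index.search(query_emb, 2)
print([docs[i] for i in ids[0]])
```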
Setting up the platform requires installing specific Python packages such as Flask, FAISS CPU, Sentence-Transformers on the backend server, followed by running the necessary scripts post repository cloning for local development.
Keywords: #phi4, AI, BGE-M3, Docker, FAISS, Flask, Framer Motion, Google Gemini, Gunicorn, Hugging Face Spaces, LLMs, RAG, React, Singapore, backend, deployment, embeddings, frontend, glassmorphism, infrastructure, interactive UI, laws, legal system, policies, sentence-transformers, tech stack, triple-failover, vectorization, webpage
github.com 6 days ago
|
1209.
HN
Show HN: DocForge – Multi-Agent RAG That Fact-Checks Its Own Answers
DocForge is an advanced Multi-Agent Retrieval-Augmented Generation (RAG) system designed to provide precise, verified responses through a sophisticated multi-agent architecture. It features a routing agent that classifies queries by complexity to optimize search queries, a retrieval agent that adapts the number of documents fetched based on query requirements and implements retry logic, and an analysis agent that synthesizes coherent answers from multiple sources using chain-of-thought reasoning. Additionally, a validation agent ensures factual accuracy by cross-referencing claims with source documents. The system incorporates an intelligent workflow that uses confidence-based mechanisms to speed up responses for high-confidence queries while employing an automatic retry strategy for validation failures. This setup leverages Redis caching for efficient query handling and is supported by a robust FastAPI REST API designed for querying, complete with error management and latency monitoring.
For deployment, DocForge requires Python 3.11+ and keys from either OpenRouter or Google Gemini APIs, allowing configuration via environment variables for various services like LLM providers, Pinecone vector stores, and Redis caching. The system supports a comprehensive ETL pipeline to process PDF documents into manageable chunks with in-memory embedding cache to enhance efficiency by reducing redundant API calls. Its architecture begins with user query routing, followed by document retrieval from Pinecone, answer synthesis, confidence checking, validation, and result caching or retrying based on the derived confidence level.
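An outline of that confidence-gated flow is sketched below; the agent objects and the confidence threshold are placeholders for illustration, not DocForge's actual code.

```python
# Outline of a confidence-gated validation loop: high-confidence answers skip
# straight to caching, low-confidence ones go through validation and retries.
# The agent objects and the 0.85 threshold are placeholders for illustration.
def answer(query, router, retriever, analyst, validator, cache, max_retries=2):
    if (hit := cache.get(query)) is not None:
        return hit

    plan = router.route(query)                      # classify / rewrite query
    draft = None
    for attempt in range(max_retries + 1):
        docs = retriever.fetch(plan, attempt=attempt)
        draft, confidence = analyst.synthesize(query, docs)

        if confidence >= 0.85 or validator.check(draft, docs):
            cache.set(query, draft)
            return draft
    return draft  # best effort after exhausting retries
```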
Users can interact with DocForge through scripts for PDF ingestion and interactive Q&A testing. Future plans include expanding support to additional document formats like DOCX, TXT, MD, HTML; introducing streaming responses and conversation history; enhancing multi-turn chat capabilities; enabling multi-tenancy; developing a frontend UI; offering Docker containerization; and providing deployment guides for cloud platforms. The system utilizes tools such as LangGraph, LangChain, Pinecone, OpenAI, Google Gemini, and OpenRouter, under the MIT License developed by Toheed Asghar with contributions from AI assistance via Claude Opus 4 and Cursor IDE.
Keywords: #phi4, Adaptive Retrieval, Automatic Retry, Chain-of-Thought Reasoning, Confidence-based Validation, DocForge, Dual LLM Provider, ETL Pipeline, Fact-Checking, FastAPI, Google Gemini, LangGraph, Latency Monitoring, Multi-Agent RAG, OpenAI GPT, PDF Ingestion, Pinecone, Query Routing, Redis Caching, Retrieval-Augmented Generation, Token Usage Tracking, Vector Store
github.com 6 days ago
|
1410.
HN
Epstein Smart Search – AI RAG search pipeline, File explorer, Image gallery
Epstein Smart Search is an AI-powered search engine built over records released by the U.S. Department of Justice, using a Retrieval Augmented Generation (RAG) pipeline alongside vector embeddings to enable extensive searches through court documents, flight logs, depositions, and evidence files related to the Epstein case. The tool is designed to continuously incorporate new records, enhancing its thoroughness in search capabilities; at present, however, the search feature has been disabled. Users are encouraged to specify their queries clearly for optimal results. The system provides several hybrid search options that allow users to choose how many top documents are returned (Top K: 10, 20, 40, 60, 80, 100). Accessing these files requires users to verify they are at least 18 years old. Sample searches include inquiries about events at Zorro Ranch, connections between figures like Bill Clinton and Donald Trump with Epstein, and mentions of A-list celebrities within the documents.
Keywords: #phi4, AI RAG, Associations, Bill Clinton, Celebrities, Court Documents, Depositions, Documents, Donald Trump, Epstein, Evidence Files, File Explorer, Flight Logs, Hybrid Search, Image Gallery, Query, Smart Search, US Department of Justice, Vector Embeddings, Zorro Ranch
search.epstein.ninja 7 days ago
|
1419.
HN
RAG and Data Boundaries in Multi-Tenant Systems
In multi-tenant systems, Retrieval-Augmented Generation (RAG) presents significant security challenges due to its broad data retrieval approach followed by filtering, which risks accessing unauthorized information. To address these concerns, it is crucial to establish explicit modeling of layered access controls that maintain consistent boundaries across tenants. Arty proposes a solution where access rules act as a preliminary gate before any data retrieval occurs. This ensures that only documents eligible within the specified tenant scope, role visibility, and policy constraints are considered in similarity searches. By consuming pre-approved context rather than relying on post-retrieval security measures, accidental exposure of sensitive information is minimized. The strategy emphasizes creating clear data boundaries over solely depending on the AI's capabilities to enforce security. Arty encourages further discussion on effectively managing these trade-offs within production environments, highlighting the importance of balancing data access control with operational needs in multi-tenant architectures.
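A sketch of the access-rules-as-gate idea is shown below, assuming a vector store that accepts a metadata filter at query time; the method name and filter syntax vary by store and are assumptions here.

```python
# Sketch of access-gated retrieval: the tenant/role scope is resolved first and
# passed to the vector store as a hard filter, so out-of-scope documents never
# enter the similarity search. The filter syntax is an assumption.
def scoped_search(vector_store, query, user, k=5):
    allowed_collections = user.visible_collections()   # from the access model
    if not allowed_collections:
        return []

    return vector_store.search(
        query,
        k=k,
        filter={
            "tenant_id": user.tenant_id,
            "collection": {"$in": sorted(allowed_collections)},
        },
    )
```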
Keywords: #phi4, RAG, accidental exposure, branch-level rules, data boundaries, data model, layered access, multi-tenant systems, parent-level policies, policy constraints, role visibility, roles, security perspective, similarity search, tenant scope
news.ycombinator.com 7 days ago
|
1442.
HN
Private RAG and marketplace to sell your knowledge to AI agents
The service provides enterprises with an integrated solution for managing private Retrieval-Augmented Generation (RAG) systems and marketplaces through a single platform, which includes a unified API and operational model. This design eliminates the complexities typically introduced by adding separate solutions, offering streamlined operations. By centralizing these functions, businesses can effectively sell their knowledge to AI agents while maintaining control over enterprise operations, thereby enhancing efficiency without increasing complexity.
Keywords: #phi4, AI agents, API surface, Private RAG, bolt-on, complexity, distribution, enterprise operations, knowledge, marketplace, operational model, platform, retrieval
ragora.app 8 days ago
|
1490.
HN
Show HN: ClearDemand – Cross-case search and drafting for injury firms
ClearDemand is a platform specifically developed to enhance the accuracy of legal drafting within personal injury firms by addressing common issues associated with handling unstructured medical records and other case files. The tool leverages advanced technologies such as Optical Character Recognition (OCR) and Retrieval-Augmented Generation (RAG) to automate the summarization process, ensuring that the generated drafts include citations verified against original sources. One of its standout features is grounded generation, which provides source-verified drafting, alongside cross-case search capabilities that help attorneys identify similar fact patterns in other cases, thus improving efficiency and consistency. Additionally, ClearDemand offers style matching functions to align the document's tone with firm-specific preferences. Personal Injury attorneys have the opportunity to evaluate the tool through a 14-day trial period where they can test its effectiveness on scanned PDF documents. Feedback is invited specifically concerning the citation user interface (UI), underscoring the platform’s commitment to continuous improvement based on user input. Key features of ClearDemand include automated ingestion and OCR for case files, source-verified drafting with grounded generation, cross-case search functionality, style matching tailored to firm-specific tone preferences, and the availability of a 14-day trial period.
Keywords: #phi4, 14-day trial, AI Demand Letters, AI tone, ClearDemand, LLMs, OCR, Personal Injury Attorneys, RAG, accuracy, citation UI, cross-case search, demand letters, grounded generation, hallucination problem, legal drafting, medical evidence, personal injury firms, source-verified drafts, style matching, unstructured medical records
cleardemand.io 8 days ago
|
1506.
HN
MiRAGE: Open-source framework for multimodal RAG evaluation
MiRAGE is an open-source framework designed for evaluating multimodal Retrieval-Augmented Generation (RAG) systems, focusing on creating datasets from complex documents that contain visual elements such as charts, tables, and diagrams within PDFs. This addresses the inadequacies of traditional RAG benchmarks, which predominantly use text-only data. The evaluation process in MiRAGE is divided into three primary steps: Ingest, Generate, and Verify. During the Ingest phase, vision models are employed to interpret and segment visual elements from documents. In the Generate phase, a set of agents formulates multi-hop questions based on the processed content. Finally, in the Verify stage, an adversarial "Verifier Agent" cross-references generated answers with original data to ensure accuracy, which the authors report raises dataset reliability from 74% to 97%. The authors highlight challenges such as "Visual Grounding," a notable difficulty in multimodal RAG evaluation, and invite feedback on this. Resources for further exploration include an arXiv paper detailing their methodology and instructions for installation via pip.
Keywords: #phi4, MiRAGE, PDFs, RAG, adversarial verifier, agents, benchmarks, charts, datasets, diagrams, enterprise RAG, evaluation, fact-checking, framework, multi-hop questions, multimodal, open-source, self-verification, semantically chunk, synthetic data, tables, vision models, visual grounding
news.ycombinator.com 8 days ago
|
1533.
HN
Show HN: AppControl – A Modern Windows Task Manager with History
The document outlines a range of executable files developed by different companies to perform specific functions on Windows systems, enhancing overall usability and performance in various technological domains. **AppleMobileDeviceHelper.exe** is designed to facilitate synchronization, backups, and content transfers between Apple devices and Windows computers using iTunes or mobile device support software. The **AppleTV.exe** application allows access to Apple's streaming platform on Windows, enabling users to stream movies and TV shows via the Apple TV app. For wireless display capabilities, **IntelWiDiVAD64.exe** is part of Intel WiDi technology that streams content from devices like laptops to external displays including TVs or projectors.
In terms of remote support, **apple-scc.exe** is a component of Bomgar’s software, enabling IT professionals to remotely troubleshoot and manage end-user systems. The **AMD_Chipset_Software.exe** focuses on installing necessary drivers and utilities that enhance communication between the operating system and AMD chipsets, thereby improving performance and stability for AMD hardware users. For accurate time synchronization of Apple services running on Windows, **AppleTimeSrv.exe** operates in the background.
Security features are managed by **MicrosoftSecurityApp.exe**, which oversees Microsoft Defender's antivirus functionalities, including virus protection and threat detection. The integration of progressive web apps (PWAs) into the Firefox browser is facilitated by **Firefoxpwa-connector.exe**, allowing users to install and use these apps directly from their browsers. **IntelCpHeciSvc.exe** improves communication between Windows OS and Intel’s integrated graphics hardware, optimizing performance through Intel(R) pGFX.
For service integration, **AppleOutlookDAVConfig64.exe** helps integrate Apple services like iCloud with Microsoft Outlook for syncing calendar and contact data on Windows systems. Lastly, **NVIDIA ChatRTX.exe** enables a local AI chatbot application on PCs equipped with NVIDIA RTX GPUs, utilizing advanced technologies to allow users’ personal files to interact with a GPT-based language model for personalized query responses. Collectively, these executables enhance device management, streaming, remote support, system optimization, security, and service integration across various platforms.
Keywords: #phi4, AMD Chipset Software, AppControl, Apple, Apple TV, Bomgar Remote Support, Boot Camp, CalDAV, CardDAV, Firefox PWA, GPT-based AI, HECI, Intel Graphics, Intel WiDi, Microsoft Defender, Mobile Device Support, NIM microservices, NVIDIA ChatRTX, RAG, Task Manager, TensorRT-LLM, Windows, iCloud Outlook Integration, iTunes
www.appcontrol.com 8 days ago
|
1557.
HN
Show HN: Logarete – Historical thinkers debate each other via RAG
Logarete is an innovative platform founded by an ex-astrophysicist who shifted from academia to entrepreneurship, motivated by exploring humanity's purpose in a technologically advanced era. The name "Logarete," derived from Greek words for reason (Logos) and excellence (Arete), encapsulates its mission to elevate personal potential through logical dialogue. Designed to promote intellectual growth, Logarete facilitates connections between users and historical thinkers, encouraging meaningful conversations and self-exploration. Its founder envisions the platform as an "operating system for humanity's intellect," providing guidance reminiscent of timeless mentors during crucial life moments. Inspired by Socratic philosophy, Logarete aims to cultivate a reflective and inspired way of living, fostering deeper understanding and personal development through thoughtful engagement with historical wisdom.
Keywords: #phi4, Arete, Astronomer, Astrophysicist, Connection, Conversations, Debate, Excellence, Founder's Note, Great thinkers, Greek words, Historical thinkers, Humanity, Intellect, Logarete, Logos, Operating system, Quasars, RAG, Reason, Schools, Society, Socrates, Studies, Symposium, Technology, Virtue
logarete.com 8 days ago
|
1562.
HN
The State of Agentic Graph RAG
Retrieval-Augmented Generation (RAG) with vector-based methods effectively supports applications involving private data by embedding and retrieving document chunks, but it struggles with complex reasoning tasks due to its reliance on semantic similarity rather than evidential relevance. The limitations of vector RAG include challenges in handling global questions without aggregation, multi-hop questions that lack inter-document connections, and logic and direction issues stemming from an asymmetric relationship focus. Graph RAG offers solutions by using explicit entity relations within a graph structure composed of nodes (entities) and edges (relationships), facilitating more nuanced retrievals based on connections rather than just similarity. This involves indexing to extract entities and relationships, retrieval through relevant subgraphs, and generation providing structured context for models.
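A compact, generic sketch of this retrieve-by-connection pattern is shown below, assuming a networkx entity graph; it illustrates the general idea rather than any of the specific systems discussed.

```python
# Generic graph-RAG retrieval sketch: vector search finds seed entities, then a
# bounded hop expansion over the entity graph collects connected evidence that
# pure similarity search would miss. Uses networkx for the graph structure.
import networkx as nx


def retrieve_subgraph(graph: nx.DiGraph, seed_entities, hops=2):
    frontier, selected = set(seed_entities), set(seed_entities)
    for _ in range(hops):
        nxt = set()
        for node in frontier:
            nxt.update(graph.successors(node))
            nxt.update(graph.predecessors(node))
        frontier = nxt - selected
        selected |= nxt
    return graph.subgraph(selected)


def to_context(subgraph: nx.DiGraph) -> str:
    # Serialize edges as (head, relation, tail) lines for the generator prompt.
    lines = [f"{u} -[{d.get('relation', 'related_to')}]-> {v}"
             for u, v, d in subgraph.edges(data=True)]
    return "\n".join(lines)
```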
Foundational papers such as Microsoft's "From Local to Global" detail entity graphs and clustering for global queries, while HippoRAG employs Open Information Extraction and Personalized PageRank for schemaless triple retrieval. LightRAG emphasizes throughput with specific and cluster-level retrievals. Agentic Graph RAG introduces an iterative, policy-driven process involving exploration, decision-making, and self-correction, with planning decomposing tasks into sub-objectives using working memory. Hybrid retrieval combines vector footholds and graph-structured movement adjusted based on feedback.
LogicRAG critiques static structures by proposing query-specific reasoning graphs that dynamically adapt without costly offline graph building, efficiently decomposing queries to solve subproblems while pruning redundant elements. Graph RAG faces challenges such as entity resolution balancing over-merging (contamination) and under-merging (fragmentation), structural debt from inaccurate extraction leading to misinformation, and summary drift in community summaries causing loss of evidence grounding.
Future directions emphasize treating retrieval as a reasoning process with retrievers possessing memory and checkpoints. The goal is to develop trustworthy systems capable of robust identity handling, reliable extraction processes, accurate summaries maintenance, and agent-based recognition for additional evidence requirements. Agentic Graph RAG aims to transform search from an autocomplete function into investigative behavior, ensuring it supports complex and nuanced inquiries effectively.
Keywords: #phi4, Agentic Graph RAG, Global questions, Hybrid Retrieval, Logic and direction, LogicRAG, Personalized PageRank, Plan-on-Graph, Retrieval-Augmented Generation, Think-on-Graph, embeddings, entity resolution, evidential relevance, multi-hop questions, semantic similarity, structural debt, summary drift, vector RAG
localoptimumai.substack.com 8 days ago
|
1591.
HN
Show HN: EverSwarm – Autonomous Recursive Growth Engine (ARGE) for RAG Swarms
Mike Nathan introduces EverSwarm, an advanced ecosystem designed to enhance Retrieval-Augmented Generation (RAG) through agentic swarms. The platform integrates a unique blueprint that combines EverSwarm RAG with Multi-Agent Orchestration/Multi-Agent Business Automation (MoA/MoBA), incorporating elements like orchestrator/judge and MCP-based coordination along with hybrid retrieval methods. Its primary goal is to bridge the gap between AI technologies and business owners, ensuring equitable outcomes via the Autonomous Recursive Growth Engine (ARGE). This initiative aspires to develop a sovereign intelligence stack tailored for the hybrid compute economy, emphasizing effective multi-agent orchestration and managing RAG drift efficiently. Mike Nathan invites community feedback on these innovative areas to refine and improve the platform further.
Keywords: #phi4, AI, ARGE, EverSwarm, MCP-based coordination, MoA/MoBA, RAG Swarms, RAG drift management, business owners, hybrid compute economy, hybrid retrieval, multi-agent orchestration, orchestrator/judge, sovereign intelligence stack
news.ycombinator.com 8 days ago
|
1593.
HN
"Sci-Fi with a Touch of Madness"
The text provides an insightful overview of various themes within the AI industry, highlighting innovations, challenges, and ethical considerations across different domains. It begins by examining a potential decacorn status for Harvey through rumored funding, cautioning against premature confirmation of such financial achievements.
A significant focus is placed on OpenClaw's triumph as a leading agent framework in spite of initial skepticism towards open-source models, which traditionally lag behind closed-source alternatives. This success supports The Agent Labs Thesis and underscores the viability of open-source approaches exemplified by companies like Ramp and Stripe.
The AI industry segment discusses OpenAI’s Codex (GPT‑5.3‑Codex), marketed for application development, with its rapid adoption marked by increased downloads and engagement. However, it faces practical challenges, including UI issues and ecosystem tensions that complicate integration.
Claude Opus 4.6 emerges as a potent AI agent, utilizing Recursive Language Models (RLMs) to handle tasks requiring extensive contextual understanding through programmatic context pools. OpenAI’s Codex is also noted for its widespread distribution across platforms like Cursor and GitHub, although engineers encounter challenges such as interface labeling problems.
The narrative on RLM developments highlights their role in managing complex, long-context tasks with enhanced capabilities demonstrated by open-weights versions. Furthermore, innovations in Mixture-of-Experts (MoE) models introduce efficient communication patterns like Head Parallelism aimed at optimizing performance.
Open Model Pipeline discussions revolve around rumored advancements such as GLM‑5 and Kimi K2.5 developments while expressing skepticism about current MoE architectures’ efficacy.
The practical application of agent frameworks necessitates robust harnesses for effective implementation, with a focus on rigorous testing environments essential for offline research and full-stack coding agents. Subreddit highlights point out Opus 4.6’s impressive UI design capabilities, alongside ethical concerns regarding its profit-maximizing behavior without constraints, illustrating the potential dangers when AI lacks ethical guidelines.
Gemini AI tools receive mixed feedback from users who report issues like inadequate prompt handling and inferior image generation compared to GPT-4o, indicating a perceived decline in model quality post-update. Users’ dissatisfaction leads some to cancel subscriptions or explore alternatives from OpenAI and Anthropic.
Model competitions reveal Opus 4.6's high leaderboard ranking despite user criticisms about its tendency to overthink and output limitations. Codex 5.3 is lauded for backend task efficiency, emphasizing ongoing improvements and challenges in AI tools compared across various performance metrics.
Architectural advancements include techniques like Wasserstein memory compression that aim to significantly reduce RAM usage, alongside new datasets and numerical methods enhancing GPU kernel performance, focusing on improving model efficiency and stability.
Benchmarking discussions introduce Veritas as a notable improvement over existing benchmarks, prompting calls for clearer baseline definitions. Tools such as agentrial are highlighted for their role in refining regression testing processes within AI development.
Security concerns address risks including KYC requirements, data leaks, and prompt safety, emphasizing the need for robust measures to mitigate these challenges across AI platforms. Overall, the document encapsulates ongoing debates in AI ethics, user satisfaction, technical performance, and security, reflecting a dynamic landscape of innovation and scrutiny.
Keywords: #phi4, AI Industry, Agent Framework, Alignment Problem, Claude Opus 4.6, Codex, Decacorn, Docker, Ethics, GLM 5, GPT-5.3-Codex, GPU optimization, Gemini AI, Lightning Pod, Local Llama, Madness, MoE, Neural Networks, Offline AI, OpenClaw, Opus 4.6, Privacy-first, Profit Maximization, Qwen3-Coder-Next, RAG, RLMs, Sci-Fi, Sparsity, Super Bowl, Transformers, UI Design, Vending Bench, Vision-Language Models, Winograd transforms, Zero-day Vulnerabilities, benchmarks, platform risk, regression testing, security risks
www.latent.space 8 days ago
|
1675.
HN
A open source pageindex implementation
The "pageindex-open" package offers an open-source solution that indexes PDF documents into a tree structure to enhance information retrieval by maintaining document hierarchy and providing structured context for relevance. Unlike traditional Retrieval-Augmented Generation (RAG) systems, which rely on embedding similarities, this approach enables precise answers by preserving the hierarchical nature of documents and using top-K retrieval to combine multiple relevant sections. It minimizes storage requirements through text-on-demand functionality and stores a persistent cache in Markdown format. The package provides a clean Python API with functions like `build_index()`, `query()`, and `load_index()` for developers, ensuring ease of integration into large document question-answering workflows, particularly in structured environments such as finance and legal sectors. Its design allows for easy updates or additions without needing to rebuild the index, thus enhancing its reusability and maintenance efficiency, making it an effective tool for scalable document management tasks.
Keywords: #phi4, AI reasoning, Markdown, PDFs, Python API, RAG, build_index, cache, document QA workflows, embeddings, finance, hierarchical, legal, litellm client, load_index, model provider, open source, pageindex, production-ready, query, relevance, structured documents, tree structure
pypi.org 9 days ago
|
1718.
HN
Multi-scale RAG indexing: why different queries need different chunk sizes
The blog explores how Retrieval-Augmented Generation (RAG) systems can be optimized by varying chunk sizes for improved retrieval performance. Traditional methods utilize a fixed chunk size to balance details with context, but this approach often fails for diverse queries due to its one-size-fits-all nature. Research involving oracle experiments on datasets like QMSum, NarrativeQA, and a custom Seinfeld dataset reveals that different queries require different chunk sizes for optimal results. An "oracle" model selecting the best chunk size per query achieves much higher recall than any fixed size.
To circumvent the need for retraining models or complex preprocessing, the authors propose multi-scale indexing with Reciprocal Rank Fusion (RRF). This method involves creating several indices of a corpus at various chunk sizes and combining retrieval results during inference. Each retrieved chunk votes for its parent document, with RRF aggregating these votes to rank documents effectively.
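A small sketch of that fusion step, assuming each chunk-size index returns a ranked list of (chunk_id, parent_doc_id) pairs; the smoothing constant k = 60 is the value commonly used for RRF, not one stated in the post.

```python
# Document-level Reciprocal Rank Fusion over several chunk-size indices.
from collections import defaultdict

def fuse_multiscale(ranked_lists, k=60, top_n=10):
    """ranked_lists: one ranked result list per chunk-size index."""
    doc_scores = defaultdict(float)
    for results in ranked_lists:                        # one list per chunk size
        for rank, (chunk_id, parent_doc) in enumerate(results, start=1):
            doc_scores[parent_doc] += 1.0 / (k + rank)  # each chunk "votes" for its parent doc
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Example: results from 100-, 200-, and 500-token indices for one query.
small  = [("c1", "docA"), ("c7", "docB"), ("c9", "docC")]
medium = [("m2", "docB"), ("m5", "docA")]
large  = [("l1", "docB")]
print(fuse_multiscale([small, medium, large]))          # docB accumulates the most votes
```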
This innovative approach outperforms traditional single-chunk-size indexing on multiple benchmarks without additional retraining or preprocessing. It presents a simple, model-agnostic solution that uses multiple document representations simultaneously, deferring the decision of chunk size until inference when more query context is available. This method highlights the critical role of dynamic chunk size selection in enhancing RAG systems' retrieval performance and encourages further research and application across different contexts.
Keywords: #phi4, Multi-scale RAG, Reciprocal Rank Fusion (RRF), aggregation, benchmarks, chunk sizes, corpus, embeddings, indexing, oracle experiments, queries, retrieval, sliding-window
www.ai21.com 9 days ago
|
1752.
HN
Show HN: I built a RAG search engine over the Epstein court documents
The "Epstein Documents RAG" is a sophisticated search engine designed for efficiently querying over 4.1 million vectors generated from Epstein court documents, depositions, and related evidence. It enables rapid vector lookups and incorporates Retrieval-Augmented Generation (RAG) technology combined with Large Language Model (LLM) responses to facilitate comprehensive searches. This tool assists users in conducting detailed investigations by enhancing search capabilities through advanced data retrieval methods. For additional information or support regarding the system, users are directed to contact the developer at findhiddensecrets@gmail.com.
Keywords: #phi4, Ask, Contact, Epstein court documents, LLM answer, RAG search engine, Search, Show HN, court documents, depositions, evidence, fast lookup, findhiddensecrets, vector lookup, vectors
jefilesearch.com 9 days ago
|
1829.
HN
Show HN: EasyMemory – Local-First Memory Layer for Chatbots and Agents
EasyMemory is an open-source Python library developed to provide a local-first memory solution for chatbots and agent-based systems, eliminating reliance on cloud services. The library employs a modular approach that includes automatic conversation persistence and hybrid retrieval methods such as embeddings, keyword search, and graph-style links. It supports various file formats like PDF, TXT, DOCX, and Markdown, enhancing its versatility. Additionally, EasyMemory offers optional integrations with platforms like Slack, Notion, and Google Drive, and incorporates an MCP server to connect both local and remote large language models. By enabling experimentation with different memory patterns locally, EasyMemory encourages feedback and allows comparisons with other memory management techniques such as RAG and long-term context strategies. This initiative aims to provide a flexible foundation for developing advanced agent-based systems without external dependencies, further details of which can be accessed in its GitHub repository.
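This is not EasyMemory's actual API; it is only a minimal illustration of the hybrid-retrieval idea the summary describes, blending embedding similarity with keyword overlap using an arbitrary weight.

```python
# Illustrative hybrid scoring: cosine similarity of embeddings blended with
# simple keyword overlap. The weight alpha is arbitrary.
import numpy as np

def hybrid_score(query_vec, doc_vec, query_terms, doc_terms, alpha=0.7):
    cos = float(np.dot(query_vec, doc_vec) /
                (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec) + 1e-9))
    overlap = len(set(query_terms) & set(doc_terms)) / max(len(set(query_terms)), 1)
    return alpha * cos + (1 - alpha) * overlap   # weighted blend of the two signals
```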
Keywords: #phi4, DOCX support, EasyMemory, Google Drive integration, LLMs, MCP server, Markdown support, Notion integration, PDF support, Python library, RAG, Slack integration, TXT support, agent memory, agents, chatbots, cloud dependency, conversation persistence, embeddings, graph-style links, hybrid retrieval, keyword search, local-first memory, long-term context management
news.ycombinator.com 10 days ago
|
1851.
HN
Show HN: The biggest achievement of my life so far
The "Explore Singapore" project is an open-source intelligence engine utilizing Retrieval-Augmented Generation (RAG) technology to deliver precise information from Singapore's public policy documents, legal statutes, and historical archives. Developed by a dedicated coder, this tool aims to enhance the accuracy of language models by exclusively sourcing data from government documents. The system significantly aids Python developers interested in RAG technology by providing access to accurate legal insights without the need for manual PDF searches. It surpasses traditional Large Language Models (LLMs) by offering exact citations and direct links to specific law sections.
The project boasts a robust triple-failover backend with models such as Google Gemini 2.0 Flash, Llama 3.3 via OpenRouter, and Groq serving as backups to ensure reliability. Its frontend is designed using React and Framer Motion, featuring a minimalist style enriched by interactive elements like real-time blur effects.
The technical framework includes PyPDF2 for PDF parsing, Hugging Face BGE-M3 embeddings, FAISS for vector similarity search, Flask for backend services, and Docker-based deployment on Hugging Face Spaces. The document ingestion process involved transforming over 33,000 pages into vectors swiftly using Google Colab. Despite its advancements, the project faces challenges in optimizing ranking strategies to avoid irrelevant document retrieval. Users are encouraged to provide feedback to improve accuracy and functionality, with further exploration available through the GitHub repository.
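A rough sketch of that ingestion path under a typical setup: BGE-M3 embeddings loaded through sentence-transformers and indexed with FAISS. The model identifier, normalization choice, and index type are assumptions rather than the project's exact configuration.

```python
# Embed page text with BGE-M3 and index the vectors in FAISS (typical setup).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")    # one common way to load BGE-M3
pages = ["Section 377A ...", "Penal Code Chapter 224 ..."]   # parsed PDF pages (e.g. via PyPDF2)

vecs = model.encode(pages, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(vecs.shape[1])      # inner product equals cosine on normalized vectors
index.add(vecs)

q = model.encode(["penalties for theft"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(q, 3)              # top-3 most similar pages
```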
Keywords: #phi4, AI agents, Arcee AI, BGE-M3, Docker-based cloud hosting, FAISS, Flask, Framer, Google Gemini, Groq, LLM systems, LangChain, PDFs, PyPDF2, Python developers, RAG, React, Singapore, domain-specific search, embeddings, historical archives, intelligence engine, interactive UI, laws, legal statutes, local embedding inference, open-source, public policy, retrieval-augmented generation, triple-failover backend, vector database
github.com 10 days ago
https://adityaprasad-sudo.github.io/Explore-Singapore/ 9 days ago
|
1900.
HN
You don't need RAG in 2026
By 2026, advancements in language model capabilities and infrastructure improvements render Retrieval-Augmented Generation (RAG) largely unnecessary for many applications. Modern models like Gemini 2.0 and Claude Sonnet 4 have expanded context windows that can handle large documents directly, eliminating the need for chunking and retrieval processes previously essential due to smaller context sizes. For typical RAG use cases involving small corpora, such as internal documentation or knowledge bases, content fits within a single prompt, simplifying implementation by avoiding complex pipelines. Although longer contexts may increase costs and latency, these tradeoffs are minimal compared to the engineering overhead of maintaining a full RAG system.
In scenarios requiring search over large datasets, existing infrastructures like Elasticsearch provide robust solutions for relevance ranking and filtering without needing separate vector databases. These systems can be enhanced with language models for semantic understanding, offering most benefits of vector search without additional infrastructure. Vector search should be viewed as an enhancement to current database capabilities rather than a standalone requirement, as databases such as PostgreSQL and Elasticsearch now support vector similarity searches natively.
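As a concrete illustration of that point, a vector similarity query can be issued from Python against an existing PostgreSQL table with pgvector; the table and column names below are hypothetical, and `<=>` is pgvector's cosine-distance operator.

```python
# Querying an existing Postgres table with pgvector instead of a separate vector DB.
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, title
        FROM docs
        ORDER BY embedding <=> %s::vector   -- cosine distance, smaller is closer
        LIMIT 5
        """,
        ("[0.12,-0.03,0.88]",),             # query embedding as a vector literal (hypothetical)
    )
    print(cur.fetchall())
```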
Dedicated vector infrastructure is only necessary in specific cases, including multimodal searches (e.g., images, audio), large-scale recommendation systems, cross-lingual search, or high-volume cost optimization. For most applications, leveraging existing tools and larger context windows provides a simpler, more efficient solution.
Keywords: #phi4, Claude Sonnet 4, Elasticsearch, Gemini 2.0, HNSW, IVFFlat, Llama 4 Scout, Pinecone, Qdrant, RAG, Retrieval-Augmented Generation, Solr, Weaviate, approximate nearest neighbor (ANN), context window, cross-lingual search, internal docs, knowledge base, language model, multimodal search, pgvector, recommendation systems, semantic retrieval, vector database, vector embeddings
ryanlineng.substack.com 10 days ago
|
1934.
HN
Show HN: AI agent forgets user preferences every session. This fixes it
Pref0 is an innovative tool designed to enhance the consistency of AI agents in remembering and applying user preferences across sessions. By extracting structured preferences from user interactions, it ensures that corrections made by users are retained and utilized effectively over time. For instance, if a customer support agent learns to escalate billing issues based on user feedback, pref0 captures this preference with an initial confidence level that increases as the user reinforces it in future interactions. This results in automatic correct routing of similar issues without needing further input.
The system maintains structured profiles for users, teams, or organizations, which are accessed by AI agents before generating responses. Pref0 features a minimal API with endpoints to track conversation history and retrieve learned preferences. It prioritizes explicit corrections over implied ones and supports hierarchical preference settings, allowing user-specific preferences to override team or organizational defaults. Additionally, confidence levels can decay over time to prevent outdated preferences from persisting.
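A hypothetical sketch of how such an API might be called; the endpoint paths, payload fields, and base URL below are illustrative guesses rather than pref0's documented routes.

```python
# Hypothetical pref0 calls: record a correction, then fetch learned preferences.
import requests

BASE = "https://api.pref0.com"            # assumed base URL
headers = {"Authorization": "Bearer <API_KEY>"}

# Record a correction the user made during a support conversation.
requests.post(f"{BASE}/track", headers=headers, json={
    "user_id": "u_123",
    "messages": [{"role": "user", "content": "Always escalate billing issues to a human."}],
})

# Fetch learned preferences before the agent generates its next reply.
prefs = requests.get(f"{BASE}/preferences", headers=headers,
                     params={"user_id": "u_123"}).json()
```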
Pref0 is versatile in its integration capabilities, compatible with platforms like LangChain, CrewAI, Vercel AI SDK, or through raw API calls, and offers a free tier for users. Unlike traditional memory solutions that focus on storing interactions, pref0 emphasizes learning user desires, thereby complementing existing systems by ensuring preferences are remembered and applied consistently.
Keywords: #phi4, AI agents, API endpoints, CrewAI, LangChain, RAG, Tailwind, Vercel AI SDK, confidence, conversation history, corrections, customer support agent, explicit corrections, feedback, hierarchical preferences, memory layers, profiles, session, structured preferences, user preferences
www.pref0.com 10 days ago
|
1993.
HN
I Built a Movie Recommendation Agent to Solve Movie Nights with My Wife
The blog post details the development of "movieagent.io," a multi-user movie recommendation system designed to cater to differing tastes between the author and his wife by facilitating efficient movie selection. The system comprises two main components: a primary movie agent that orchestrates conversation flow, and a search agent responsible for executing specific searches using embeddings. Initially, users are engaged with categorical questions to establish mood preferences, followed by "duels" where they choose between pairs of movies, providing clear preference signals. These inputs guide the search agent in conducting embedding searches within a database containing approximately 70,000 movies from TMDB, refining results based on user feedback and specific movie anchors.
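The following sketch is not the project's code; it only illustrates how duel winners could serve as anchors by averaging their embeddings and ranking candidates by cosine similarity to that anchor vector.

```python
# Rank candidate movies against an "anchor" built from the user's duel winners.
import numpy as np

def rank_by_anchor(winner_vecs, candidate_vecs, candidate_titles, top_n=5):
    anchor = np.mean(winner_vecs, axis=0)
    anchor /= np.linalg.norm(anchor) + 1e-9
    cands = candidate_vecs / (np.linalg.norm(candidate_vecs, axis=1, keepdims=True) + 1e-9)
    sims = cands @ anchor                        # cosine similarity to the preference anchor
    order = np.argsort(-sims)[:top_n]
    return [(candidate_titles[i], float(sims[i])) for i in order]
```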
The author addresses challenges such as language model knowledge cutoffs and the necessity for diverse recommendations by enhancing data with generated descriptions that encapsulate each movie's essence. To maintain performance and cost efficiency, the system avoids a monolithic architecture. Evaluation involved using synthetic personas from another project, with results manually inspected and rated through an LLM judge. Future enhancements include updating the database to automatically incorporate new movies, ensuring the system remains current and relevant.
Keywords: #phi4, Agent, Automated Judge, Categorical Questions, Conversation Design, Data Framework, Duel Question, Embeddings Search, Evaluation, Keyword Search, LLMs, Movie Recommendation, Multi-user System, Persona Simulation, RAG, Semantic IDs, Vector Math
rokn.io 11 days ago
|
2029.
HN
Show HN: I built a RAG engine to search Singaporean laws
A student-developer created "Explore Singapore," an advanced search engine designed to access over 20,000 pages of Singaporean laws and government acts using Retrieval-Augmented Generation (RAG). Initially, Version 1 faced challenges with hallucinations and limited query depth. To address these issues, the developer introduced several enhancements in Version 2: a Personality Fix through Dynamic System Instructions ensured consistent tone across models; a Deep Search Fix via Multi-Query Retrieval broke down queries into sub-intents for more thorough results; and a Hallucination Fix using Cross-Encoder Re-Ranking filtered out irrelevant documents before processing. The system's tech stack includes BGE-M3 embeddings, FAISS vector database, and a Python backend with custom failover logic, while the frontend features an Apple-inspired minimalist design utilizing React and Framer Motion for interactivity. Emphasizing reliability, it incorporates "Triple-AI Failover" and local embedding inference to boost performance and privacy. The developer invites feedback on this improved system, accessible through a live demo or GitHub repository.
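A sketch of the cross-encoder re-ranking stage using the sentence-transformers `CrossEncoder` API; the checkpoint shown is a common public re-ranker and an assumption, not necessarily the model this project ships.

```python
# Re-rank retrieved passages with a cross-encoder before they reach the LLM.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed checkpoint

query = "What is the penalty for littering in Singapore?"
candidates = ["Environmental Public Health Act ...", "Road Traffic Act ...", "Penal Code ..."]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
# Low-scoring documents can be dropped here, which is the "hallucination fix"
# the summary describes.
```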
Keywords: #phi4, AI technology, BGE-M3, Cross-Encoder Re-Ranking, Docker-based hosting, Dynamic System Instructions, Embeddings, FAISS, Flask, Framer Motion, Gemini 2.0 Flash, Glassmorphism, Hugging Face Spaces, Legal search engine, Llama 3.3, Local embedding inference, Multi-Query Retrieval, Python, RAG engine, React, Semantic embeddings, Singaporean laws, Triple Failover, Vector DB
github.com 11 days ago
|
2113.
HN
Why RAG Failed Us for SRE and How We Built Dynamic Memory Retrieval Instead
The article explains that Retrieval‑Augmented Generation (RAG) was inadequate for Site Reliability Engineering (SRE) tasks and presents Dynamic Memory Retrieval (DMR) as the solution powering DrDroid AI. DMR enables the agent to retrieve current, precise data from production environments that evolve gradually, leveraging over 80 Systems of Record (SoRs) such as monitoring tools (Grafana, Prometheus), APMs (Datadog, NewRelic), cloud platforms (AWS, Azure, GCP), Kubernetes, error monitoring (Sentry, Rollbar), CI/CD pipelines (ArgoCD, Jenkins), source‑code repositories (GitHub, GitLab), collaboration platforms (Slack), ticketing systems (Jira), on‑call services (PagerDuty), databases (MongoDB, Postgres), analytics platforms (Posthog, Metabase), documentation tools (Notion, Confluence), and custom APIs. DrDroid first extracts “Entities of Interest” (EoIs) from each SoR—for instance, Grafana dashboards, panels, and alerts, or Kubernetes namespaces, deployments, and pods—to build a detailed base record that maps specific use cases and references such as a “payment module” to the corresponding Grafana panel; these EoIs are then indexed to make the information queryable and enable accurate, up‑to‑date production queries.
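Purely for illustration, an EoI base record of the kind described might look like the dataclass below; the field names are assumptions, not DrDroid's schema.

```python
# Illustrative "Entity of Interest" record mapping a business concept to a
# concrete Grafana panel. Field names are assumptions, not DrDroid's schema.
from dataclasses import dataclass, field

@dataclass
class EntityOfInterest:
    source: str                  # which System of Record, e.g. "grafana"
    kind: str                    # e.g. "dashboard", "panel", "alert"
    name: str                    # human-readable name
    reference: str               # stable ID/URL used to query the SoR at runtime
    use_cases: list[str] = field(default_factory=list)   # e.g. ["payment module"]

eoi = EntityOfInterest("grafana", "panel", "Payments p99 latency",
                       "dash-42/panel-7", ["payment module"])
```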
Keywords: #gpt-oss:20b, AI Agent, APM, DMR, DrDroid, Grafana, Infrastructure, Logs, Metrics, Monitoring, Production, RAG, SRE, SoR, Traces, dashboards, panels
drdroid.io 12 days ago
|
2139.
HN
Ask HN: Do you use LLM memory features?
The author finds the AI assistant’s built‑in memory opaque and unreliable, so they now store essential context in Markdown files and reference those files on demand. This approach gives complete visibility, eliminates hidden recalls, simplifies debugging, and ensures predictable token usage; it requires manual maintenance, but the author still finds it more dependable. They invite the community to share whether they rely on system memory or manage context explicitly (e.g., through files or RAG), and which methods work best.
Keywords: #gpt-oss:20b, AI assistants, Ask HN, LLM memory, RAG, built-in memory, context, debugging, explicitly reference, files, manual maintenance, md files, memory, opaque, reliable, token usage, unreliable, visibility
news.ycombinator.com 12 days ago
|
2239.
HN
Show HN: HyperAgency (H9y.ai) – Open-Source Agentic AI Operating System
HyperAgency (H9y.ai) is an open-source, self-hosted agentic AI operating system designed to enable organizations to deploy autonomous, self-improving AI agents capable of performing a wide range of tasks. The platform supports persistent memory, coordinated intelligence, human governance, omni-channel integration, and a decentralized architecture, along with a Web3 marketplace for the exchange and monetization of agentic workflows. It features a modular design that allows for the deployment of 20+ ready-to-deploy agent archetypes, which are composable, versionable, and portable, supporting functionalities such as chat, RAG, image generation, and web automation. These agents can interface with multiple communication channels and systems, leveraging any compatible LLM or model from various providers, thus avoiding vendor lock-in. The platform also includes tools for real-time observability, privacy-first data handling, and support for distributed networks, with deployment options that include self-hosting or cloud environments. Users can customize their setup using Docker Compose profiles—*try*, *h9y*, and *all*—configured via the `.env` file, and the project is inspired by Hal Casteel and William McKinley, with a licensing model that includes Apache-2.0-NC, AGPL-3.0, and a Commercial License. A paid pilot program is available for early participants to engage in real-world deployment of agentic systems, offering hands-on experience in building autonomous AI workflows and shaping the future of autonomous software companies.
Keywords: #qwen3:14b, A2A, AGPL-30, AI, Agent, Agentic AI, Agentic Deals, Apache-20-NC, Archetypes, Automation, Avatar, Bridges, Builders, Capabilities, Clone, Cloud, Cloud Access, Code, Collaboration, Commercial License, Communication, Composable, Control, Coordinated Intelligence, Curl, Data Ownership, Debug, Decentralized, Demo, Digital, Docker, Docker Compose, Early Testing, Ecosystem, Env Files, Evolution, Extensible, Full, Gen-Certs, Git, Governance, Health, Horizontal Scale, Human Governance, HyperAgency, HyperAgent, ImageGen, Infrastructure, Innovators, Integration, Isolated, Isolated Data, Langflow, Licensing, Local Setup, Logs, MCP, Maptrix, Marketplace, Memory, MetaAgent, Metrics, Model, Monetize, Monitoring, N8n, Network, Node-RED, Nodes, Notebook, Observability, Omni-Channel, Open-Source, Organization, Ownership, Performance, Persistent Agency, Persistent Memory, Pilot, Pre-Configured, Privacy, Privacy-First, Providers, Publish, RAG, Real-Time, STT, Secure Peer-to-Peer, Secure Storage, Self-Host, Setup Hosts, Share, Storage, Submodule, System, System Health, TLS, TTS, Team, Trace, Transform, Trust, Vault, Verify, Visibility, Web App, Web3, Web3 Marketplace, Workflow, XMPP Server, env, hosts, localhost
github.com 13 days ago
|
2291.
HN
AI-assisted cloud intrusion achieves admin access in 8 minutes
Sysdig’s Threat Research Team uncovered an AI-driven breach that gained administrative control of a target AWS environment in under ten minutes by exploiting misconfigured IAM credentials and publicly exposed S3 buckets containing AI data. Attackers exfiltrated those credentials, used the compromised IAM user’s Lambda read/write rights to inject malicious code, and leveraged Amazon Bedrock’s large-language-model (LLM) capabilities to auto-generate additional code for privilege escalation and lateral movement across 19 distinct AWS principals, while commandeering GPU instances (p4d.24xlarge) for model training and expansion. The CI/Escape chain included a Terraform-deployed, unauthenticated Lambda URL acting as a Bedrock backdoor; exhaustive enumeration of Secrets Manager, SSM parameters, CloudWatch logs, and IAM Access Analyzer findings; and widespread access to Bedrock models (Claude, Llama, Cohere, DeepSeek, etc.) across multiple regions, with evidence such as Serbian-commented scripts, hallucinated AWS account IDs, and absent GitHub links. Gaps highlighted include unmonitored Lambda update activity, unrestricted Bedrock invocation, and insufficient S3 bucket protection. Sysdig recommends tightening least-privilege IAM policies; restricting UpdateFunctionCode, UpdateFunctionConfiguration, and PassRole permissions; securing S3 access; enabling continuous Bedrock call logging; enforcing AWS Notable Events and Behavioral Analytics to detect lateral movement and excessive enumeration; and implementing early runtime detection and rapid response to counter evolving AI-assisted attacks.
Keywords: #gpt-oss:20b-cloud, AWS, Bedrock, CloudTrail, Credential theft, GPU, IAM, LLMs, Lambda, Privilege escalation, RAG, S3, Sysdig
www.sysdig.com 13 days ago
|
2311.
HN
Synthesizing scientific literature with retrieval-augmented language models
OpenScholar is a retrieval‑augmented QA system that integrates a 45‑million‑paper scientific data store (OSDS) with a bi‑encoder candidate retrieval step, a cross‑encoder reranker, optional Semantic Scholar API and web‑search augmentations, and a generator that produces answers and structured citations; it iteratively refines responses through a self‑feedback loop that annotates drafts, offers critique, and generates new query suggestions until all claims are sourced, while training employs synthetic datasets derived from the same inference pipeline using Llama 3.1‑7/8 B models filtered by pairwise and rubric‐based quality controls, blended with general instruction data to fine‑tune a Llama‑3.1‑8B‑Instruct capable of 3 k‑token outputs at controlled temperature and vLLM acceleration. The ScholarQA Bench, constructed via PhD‑level expert annotation of 100 CS questions (each averaging 4.4 essential answer components and 4.4 source quotes), assesses model performance across a range of multi‑paper reasoning tasks, as evidenced by inter‑annotator Pearson correlations of 79.3 % with the general criterion and 59.5 % without it; complementary datasets (Scholar‑Bio, Scholar‑Neuro, Scholar‑Multi) extend this paradigm to biomedicine, neuroscience, and cross‑disciplinary fields, each instance requiring approx. 56 min of manual answer sourcing. Evaluation involves a weighted overlap metric (60 % correctness, 40 % general criteria like length, expertise, citation quality, excerpt usage) with final scoring by GPT‑4o Turbo, citation F1 calculated from recall and precision of referenced passages without gold answers, and content‑quality rubrics (relevance, depth, organization, flow, usefulness) adjudicated by Prometheus v2 at >80 % agreement with human raters. The study reports that OpenScholar surpasses previous proprietary pipelines and even exceeds expert performance in five domains, positioning ScholarQA Bench as distinct from single‑paper QA benchmarks (SciFact, QASA, Multi‑XScience) and the KIWI dataset by offering reproducible, automated scoring for complex multi‑paper literature‑review tasks.
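The citation F1 mentioned above combines precision and recall over the passages an answer cites; below is a minimal sketch, with the set of relevant passages taken as given (how that set is derived follows the paper's pipeline, not this code).

```python
# Standard F1 over cited vs. relevant passages.
def citation_f1(cited: set[str], relevant: set[str]) -> float:
    if not cited or not relevant:
        return 0.0
    precision = len(cited & relevant) / len(cited)
    recall = len(cited & relevant) / len(relevant)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(citation_f1({"p1", "p3", "p9"}, {"p1", "p2", "p3"}))  # 0.666...
```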
Keywords: #gpt-oss:20b-cloud, LLM, OpenScholar, RAG, Scholar-CS, SciFact, Self-feedback, benchmark, bi-encoder, citation, cross-encoder, inference, retrieval pipeline
www.nature.com 13 days ago
https://www.nature.com/articles/d41586-026-00347-9 13 days ago
https://archive.ph/rF0Kg 13 days ago
|
2314.
HN
Agentic search vs. embedding-based search vs. truth layers
Three AI retrieval strategies are examined (embedding-based search, agentic search, and a proposed truth layer), along with their trade-offs in privacy, freshness, and structure. Embedding-based retrieval indexes pre-computed vectors for fast similarity queries but lacks provenance or structured persistence. Agentic search dynamically fetches data from diverse tools, offering contextual richness but suffering from incomplete recall, non-deterministic results, and no persistent canonical inventory, which prevents audit trails, versioning, and cross-tool reuse. The truth layer, a persistent, canonically identified state store, addresses these gaps by deterministically merging observations through explicit rules, tracking provenance and audit information, supporting immutable audit trails and rollback, and furnishing cross-platform, reproducible queries. The author implements this via Neotoma, a structured memory layer built to replace ad-hoc retrieval with verifiable, traceable, and consistent data handling, and illustrates the motivation through experiences with Cursor as an agentic workflow tool that, while intuitive, falters on large, incomplete datasets, motivating a deterministic truth layer for reliable, audit-ready data retrieval and state management.
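A minimal sketch of the truth-layer idea, not Neotoma's API: observations from different tools are merged by an explicit, deterministic rule (here, latest timestamp wins, an assumed rule) while provenance is recorded alongside the resulting state.

```python
# Deterministic merge of observations into canonical state, with provenance.
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    entity_id: str
    field: str
    value: str
    source: str        # which tool reported it
    observed_at: str   # ISO timestamp, used by the merge rule

def merge(observations):
    state, provenance = {}, {}
    for obs in sorted(observations, key=lambda o: o.observed_at):
        key = (obs.entity_id, obs.field)
        state[key] = obs.value                       # later observation wins, deterministically
        provenance.setdefault(key, []).append(obs.source)
    return state, provenance
```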
Keywords: #gpt-oss:20b-cloud, RAG, agentic search, canonical entities, cross-platform, embedding-based, on-demand, provenance, retrieval, semantic similarity, session-scoped, structured store, traceability, truth layer, vector DB, versioning
markmhendrickson.com 13 days ago
|
2317.
HN
Chunk size is query-dependent: a simple multi-scale approach to RAG retrieval
A study on retrieval-augmented generation demonstrates that the optimal chunk size for indexing depends on each individual query, with performance varying widely across datasets and queries; oracle experiments that pick the best chunk size per query consistently outperform any fixed chunk size by 20–40 % in document-level recall@K. To exploit this without per-query model retraining, the authors propose multi-scale indexing: separate indices are built at several sliding-window sizes (e.g., 100, 200, 500 tokens), and at inference time the retrieval lists from all indices are consolidated with Reciprocal Rank Fusion (RRF), which replaces raw similarity values with rank-based scores and aggregates votes from chunks to their parent documents. This yields 1–3 % absolute recall gains across most benchmarks and 1–37 % improvements on specific datasets (such as a 36.7 % boost on TRECCOVID with E5-small), while incurring only the cost of storing multiple chunk representations. The result is a query-aware, multi-scale retrieval strategy that offers a low-cost, model-agnostic approximation of oracle performance without additional training.
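For reference, the rank-based RRF scoring referred to above can be written as follows, where a document accumulates votes from its retrieved chunks across all chunk-size indices; the smoothing constant k (commonly 60) is not specified in the summary.

```latex
% r_i(c): rank of chunk c in the result list R_i of chunk-size index i;
% k: smoothing constant (commonly 60); the sum runs over chunks of d retrieved by each index.
\mathrm{RRF}(d) \;=\; \sum_{i \in \mathcal{I}} \; \sum_{c \in d \,\cap\, R_i} \frac{1}{k + r_i(c)}
```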
Keywords: #gpt-oss:20b-cloud, Chunk size, RAG, RRF, benchmarks, code, context, embeddings, inference, multi-scale, oracle, performance, query-dependent, retrieval, tokens, vector
www.ai21.com 13 days ago
|
2348.
HN
Agentic AI for PHP Developers
This hands‑on series equips intermediate PHP developers with the Claude‑PHP‑Agent framework to build robust, production‑grade AI agents. It begins by introducing core agentic AI concepts—distinguishing agents from raw LLM calls, teaching control‑loop patterns (React, Plan‑Execute, Reflection, Streaming), and outlining a JSON‑schema‑validated tool system—before progressing through essential production readiness techniques such as retry logic, logging, and monitoring. Practical modules cover short‑term and long‑term conversation memory, stateful sessions, efficient retrieval‑augmented generation with chunking and citation, and plan‑execute decomposition for task orchestration. Advanced chapters delve into reflection loops for self‑review, hierarchical and adaptive agent architectures, guardrail design, observability instrumentation, evaluation harnesses, performance optimization via caching and batching, and asynchronous concurrent execution using AMPHP. The curriculum, spanning 35–50 hours with individual chapters lasting 60–120 minutes, culminates in a capstone platform that integrates tools, memory, RAG, planning, orchestration, safety, and monitoring. Prerequisites include PHP 8.4+, Composer, Redis, relational database support, an Anthropic API key, and optionally Docker.
Keywords: #gpt-oss:20b-cloud, Agentic AI, Async, Composer, Docker, JSON schema, LLM APIs, Memory Management, PHP, PlanExecuteLoop, RAG, ReAct, ReactLoop, ReflectionLoop, StreamingLoop, claude-php-agent
codewithphp.com 13 days ago
|
2663.
HN
Agentic search (glob/grep/read) works better than RAG and vector DB
The linked x.com post argues that agentic search methods such as glob, grep, and read work better than Retrieval-Augmented Generation (RAG) and vector-database retrieval; the post itself is only viewable with JavaScript enabled in a supported browser.
Keywords: #gpt-oss:20b-cloud, Agentic search, Help Center, JavaScript, RAG, browser, disabled, enable, glob, grep, read, supported browsers, vector DB
twitter.com 14 days ago
|
2673.
HN
Embedded Vector and Graph Database in Pure Go
sqvect is a pure-Go vector and graph database that stores everything in a single, zero-configuration SQLite file. It combines semantic HNSW vector search with FTS5 keyword matching, fused through Reciprocal Rank Fusion for hybrid retrieval, and offers built-in RAG tables for documents, chat sessions, and messages. A biomimetic Hindsight memory system captures world, bank, opinion, and observation data with retain-recall-observe operations powered by four parallel TEMPR (Temporal, Entity, Memory, Priming) strategies and fusable similarity routing. Row-level security is enforced via ACL attributes, while graph relationships are persisted in directed weighted edge tables that support PageRank and community detection. Memory efficiency comes from SQ8 quantization, cutting RAM usage by ~75 % (≈1.2 GB for 1 M 128-dim vectors with HNSW, 1.0 GB for IVF), and performance is further boosted by WAL mode, connection pooling, and zero-config concurrent access, achieving ~580 inserts/s and ~720 QPS for HNSW, and ~14,500 inserts/s with ~1,230 QPS for IVF, under 128-dim workloads on an Apple M2 Pro. The fully type-safe Go API delivers IntelliSense support, 93 % test coverage, and a CI/CD pipeline that outputs Codecov and Go Report Card badges. It is well suited to local-first or edge RAG applications, personal knowledge bases, and small-to-medium AI agents that need fast vector retrieval, built-in graph processing, safe multi-tenant access, and hybrid keyword/semantic search, but not to >100 M vectors, sub-10 ms latency demands, or non-Go environments.
Keywords: #gpt-oss:20b-cloud, ACL, AI, Edge, Go, HNSW, RAG, SQLite, Search, database, graph, memory, vector
github.com 14 days ago
|