474.
HN
Show HN: A beautiful webpage I made
The "Singapore Intelligence RAG System" is a sophisticated AI-driven platform designed to deliver reliable information regarding Singapore’s legal framework, policies, historical occurrences, and infrastructure developments. It employs Retrieval-Augmented Generation (RAG) technology, leveraging over 33,000 pages of meticulously curated data specific to Singapore. This approach mitigates the generation of inaccurate facts, distinguishing it from other language models.
The system's architecture features a high-performance RAG pipeline that uses BGE-M3 for vectorization and FAISS for fast retrieval. A "Triple-Failover" scheme targets 99.9% uptime by falling back from Google Gemini 2.0 Flash to Llama 3.3 70B via OpenRouter and then to a second Llama 3.3 70B instance via Groq.
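Because the summary does not show the project's actual client code, the following is only a minimal Python sketch of how such a three-tier failover might be wired up; the provider wrapper functions are hypothetical placeholders.

```python
# Conceptual sketch of the "Triple-Failover" generation step: try each
# provider in order and fall back on failure. The provider functions are
# hypothetical stand-ins, not the project's real client code.
from typing import Callable, List

def call_gemini(prompt: str) -> str:
    raise NotImplementedError("wrap the Google Gemini 2.0 Flash API here")

def call_openrouter_llama(prompt: str) -> str:
    raise NotImplementedError("wrap Llama 3.3 70B via OpenRouter here")

def call_groq_llama(prompt: str) -> str:
    raise NotImplementedError("wrap Llama 3.3 70B via Groq here")

PROVIDERS: List[Callable[[str], str]] = [
    call_gemini,            # primary
    call_openrouter_llama,  # secondary
    call_groq_llama,        # tertiary
]

def generate_with_failover(prompt: str) -> str:
    last_error = None
    for provider in PROVIDERS:
        try:
            return provider(prompt)   # first provider that answers wins
        except Exception as exc:      # timeouts, rate limits, outages, ...
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

The first provider that returns a response wins; any exception simply advances the loop to the next tier.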
An interactive user interface developed with React and Framer Motion enhances the user experience through a "Liquid-Glass" design that includes real-time blur effects, spring physics, minimalist design elements, and smooth animations on hover. The embedding model operates locally within the application to boost privacy and performance efficiency.
The technology stack encompasses Flask and Gunicorn for backend operations, FAISS (CPU) as a vector database, Sentence-Transformers BGE-M3 for embeddings, and LLMs including Gemini 2.5 Flash and Llama 3.3. Deployment is achieved through Hugging Face Spaces with Docker-based hosting.
Installation requires Python packages such as Flask, Flask-CORS, and FAISS. Users can clone the repository to begin setup and must configure the backend server before running any server-side scripts. The project aims to provide an interactive, accurate resource for exploring Singapore's legal and historical context while maintaining reliability and user engagement through its failover architecture and interface design.
Keywords: #phi4, AI, BGE-M3, Backend, Deployment, Docker, Embeddings, FAISS, Flask, Framer Motion, Frontend, Glassmorphism, Google Gemini, Gunicorn, Historical, Hugging Face Spaces, Infrastructure, Installation, Intelligence, Legal, Llama, Local Inference, RAG System, React, Retrieval-Augmented Generation, Singapore, Tech Stack, Vector DB
github.com 2 days ago
|
475.
HN
Generating vector embeddings for semantic search locally
The article explores generating vector embeddings for local semantic search by converting text into numeric vectors that capture meaning, enabling efficient similarity searches in databases. It outlines how items like books or products can be represented as rows with a vector column derived from their attributes using a function \( F \). When a user submits a query, the query is passed through the same function to produce a comparison vector, so results can be ranked by similarity.
Key components of the function \( F \) include a machine learning model (e.g., nomic-embed-text-v2-moe), an inference engine such as llama.cpp, and the underlying hardware. The article details setting up a local environment for these tasks, using uv for Python dependency management and llama.cpp as the inference engine.
A practical example provided involves installing necessary dependencies on Ubuntu, downloading models in GGUF format, and managing network access during testing to generate embeddings locally with the nomic-embed-text-v2-moe model. This process uses cosine similarity for comparing vectors to retrieve similar items based on user queries stored in environment variables.
The article acknowledges limitations, such as potential mismatches between models, inference engines, or hardware compatibility issues. While it demonstrates a brute-force method using full-table scans for nearest neighbor searches, the text notes that more efficient probabilistic indexing methods like IVF and HNSW are available for real-world applications. It also highlights vector databases and libraries as tools for efficiently storing and searching embeddings without generating them directly.
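To make the brute-force variant concrete, here is a small NumPy sketch (not from the article) that scans a full table of pre-computed embeddings with cosine similarity; the dimensions and random values stand in for vectors that would come from a model such as nomic-embed-text-v2-moe.

```python
import numpy as np

def cosine_similarity(query: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of `table`."""
    query = query / np.linalg.norm(query)
    table = table / np.linalg.norm(table, axis=1, keepdims=True)
    return table @ query

# Toy "table": one embedding row per item. Real values would come from the
# embedding model; random placeholders are used here.
item_embeddings = np.random.rand(10_000, 768).astype(np.float32)
query_embedding = np.random.rand(768).astype(np.float32)

# Full-table scan: score every row, then take the top-k most similar items.
scores = cosine_similarity(query_embedding, item_embeddings)
top_k = np.argsort(-scores)[:5]
print(top_k, scores[top_k])
```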
Keywords: #phi4, ANN indexing, GGUF format, Llama, cosine similarity, dataset, embeddings creation, hardware, inference engine, machine learning, model, semantic search, vector databases, vector embeddings
theconsensus.dev 2 days ago
|
647.
HN
Show HN: Built a webpage to show Singaporean infra and laws
The "Explore Singapore" project is a webpage developed using an AI-driven platform known as the Singapore Intelligence RAG System, designed to provide comprehensive information about Singapore’s infrastructure and legal framework. The system utilizes Retrieval-Augmented Generation (RAG) technology to deliver accurate insights into the country's laws, policies, historical events, and critical infrastructure. A notable feature of this project is its "Triple-AI Failover Backend," which ensures reliability by employing a three-tiered AI inference setup: Google Gemini 2.0 Flash as primary, Llama 3.3 via OpenRouter as secondary, and Groq as tertiary.
The user interface employs the Liquid-Glass interactive design, leveraging React and Framer Motion to create engaging frontend experiences characterized by real-time backdrop blurs and smooth expansion animations. Additionally, the system enhances privacy and performance through local embedding inference, processing over 33,000 document pages into semantic embeddings using BGE-M3 models. These vectors are efficiently retrieved via FAISS for quick lookups, supported by a "Triple-Failover" logic to maintain high uptime.
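A rough sketch of that embed-and-retrieve step is shown below, assuming the publicly available `BAAI/bge-m3` checkpoint through sentence-transformers and a simple flat FAISS index; the project's actual chunking and index type are not specified in the summary.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Local embedding inference with BGE-M3 (downloads the model on first use).
model = SentenceTransformer("BAAI/bge-m3")

documents = [
    "Example passage about Singapore transit law.",
    "Example passage about airport infrastructure plans.",
]

# Encode documents and normalise so inner product equals cosine similarity.
doc_vecs = model.encode(documents, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(np.asarray(doc_vecs, dtype=np.float32))

# Encode a query the same way and fetch the closest passages.
query_vec = model.encode(["Which law covers the MRT?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype=np.float32), 2)
print(ids, scores)
```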
Technologically, the project uses React and Framer Motion on the frontend, with Flask and Gunicorn powering the backend. It relies on FAISS as its vector database (CPU version) and utilizes Sentence-Transformers BGE-M3 for embeddings. Large language models such as Gemini 2.5 Flash and Llama 3.3 are integrated into the system, which is deployed using Hugging Face Spaces with Docker.
For local installation, backend prerequisites such as Flask, flask-cors, and google-generativeai must be installed before running the Python scripts; the repository can be cloned to get started. As the author's first open-source project, "Explore Singapore" aims to gather user feedback to drive future improvements.
Keywords: #phi4, AI, Docker, FAISS, Flask, Framer Motion, Google Gemini, Gunicorn, Hugging Face Spaces, Llama, RAG System, React, Retrieval-Augmented Generation, Singapore, backend, deployment, embeddings, frontend, historical events, infrastructure, laws, legal system, local setup, policies, vectorization, webpage
github.com 3 days ago
|
1158.
HN
MetalChat – Llama Inference for Apple Silicon
MetalChat is a C++ framework and command-line tool for Metal-accelerated inference of Meta Llama models on Apple Silicon. The project is in active development and warns that its API and CLI may change unexpectedly. It can be installed via Homebrew or built locally with Conan and incorporated into CMake projects through an automatically exported target. The framework is open source under the GPLv3 license; installation guidance and usage instructions are available in the getting-started guide and the issues tab on GitHub.
Keywords: #phi4, Apple Silicon, C++ framework, CMake build system, Conan package, GPLv3 license, Homebrew package manager, Llama inference, Meta Llama models, Metal-accelerated, MetalChat, active development, command line interpreter, known issues
github.com 6 days ago
|
1207.
HN
Distributed Llama
Distributed Llama connects multiple home devices into a powerful cluster, using distributed computing to speed up language model inference via tensor parallelism and high-speed Ethernet synchronization. Compatible with Linux, macOS, and Windows, it is optimized for ARM and x86_64 AVX2 CPUs and supports models like Qwen 3 MoE on Vulkan (as of September 2025) as well as various Llama models.
The setup requires a root node, using Python 3 and a C++ compiler, to load and distribute models across worker nodes, which independently handle their portions of the neural network without further configuration. Clusters can include up to \(2^n\) nodes; RAM usage is distributed among devices, with the root node requiring slightly more due to its additional responsibilities.
Key commands include `dllama inference`, `dllama chat`, `dllama worker`, and `dllama-api`, offering customization options such as model path, tokenizer configuration, precision settings, sequence length, threading, host binding address, and port. The project encourages community contributions via merge requests or issues for broader discussions, with guidelines focusing on minimal changes, cross-platform compatibility, and English documentation; everything is distributed under the MIT license.
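As a purely conceptual illustration of the tensor-parallelism idea (not Distributed Llama's actual C++ implementation), the following toy NumPy sketch shards a weight matrix column-wise across workers and gathers the partial results, the step that the real system synchronizes over Ethernet.

```python
import numpy as np

# Toy tensor parallelism: shard a weight matrix column-wise across N "workers".
n_workers = 4
hidden, out_dim = 512, 1024
weights = np.random.rand(hidden, out_dim)
shards = np.array_split(weights, n_workers, axis=1)  # one column block per worker

x = np.random.rand(1, hidden)  # activations held by the root node

# Each worker multiplies the same activations by its own shard...
partial_outputs = [x @ shard for shard in shards]

# ...and the root node gathers the pieces (in the real system this gather is
# the high-speed network synchronization step).
y_parallel = np.concatenate(partial_outputs, axis=1)
assert np.allclose(y_parallel, x @ weights)
```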
Keywords: #phi4, API server, ARM, CLI chat, CPU, Distributed Llama, Ethernet, Linux, MIT license, Qwen 3 MoE models, RAM usage, Vulkan, Windows, architecture, benchmark, cluster, devices, f32 buffer-float-type, inference, macOS, merge request, q40, quantizations, root node, synchronization, tensor parallelism, worker nodes, x86_64 AVX2 CPUs
github.com 6 days ago
|
1687.
HN
Why the hell is this showing up
The "Singapore Intelligence RAG System" is an advanced AI platform engineered to deliver accurate information regarding Singapore's legal framework, policies, historical incidents, and infrastructure. It leverages Retrieval-Augmented Generation (RAG) technology and a meticulously curated database of over 33,000 pages to minimize errors often found in other large language models. The system's architecture includes several critical components: data ingestion, vectorization using BGE-M3 for semantic embeddings, retrieval through FAISS for efficient lookups, and generation with a triple-failover mechanism ensuring high availability. Notable features of the platform are its Triple-AI Failover Backend, which enhances reliability, an interactive "Liquid-Glass" UI crafted with Framer Code Component, and local embedding inference to boost privacy and performance. On the technical side, the frontend is built using React and Framer Motion, while the backend integrates Flask, Gunicorn, FAISS (CPU), Sentence-Transformers BGE-M3, and various LLMs such as Gemini 2.5 Flash and Llama 3.3. The system is deployed via Hugging Face Spaces with Docker-based cloud hosting. Installation requires specific Python packages for backend setup, emphasizing local processing of embedding models to maintain performance and privacy standards.
Keywords: #phi4, AI, API, BGE-M3, Backend, Deployment, Docker, Embeddings, FAISS, Flask, Framer Motion, Frontend, Glassmorphism, Google Gemini, Gunicorn, Hugging Face Spaces, Infrastructure, Legal System, Llama, Local Setup, Prerequisites, RAG, React, Singapore, Vectorization
github.com 9 days ago
|
2165.
HN
10 months since the Llama-4 release: what happened to Meta AI?
Meta AI's apparent stagnation since the Llama-4 launch is underscored by the fact that, ten months on, the API remains waitlist-only, reflecting a dearth of subsequent product releases or substantive development.
Keywords: #gpt-oss:20b, 10 months, API, Llama, Llama-4, Meta, Meta AI, disappointment, release, since, still, waitlist-only, what happened
news.ycombinator.com 12 days ago
https://github.com/facebookresearch/sam-3d-objects 12 days ago
https://github.com/facebookresearch/sam3 12 days ago
|
2171.
HN
Study: Meta AI model can reproduce almost half of Harry Potter book
A study by Stanford, Cornell, and West Virginia University evaluated five open‑weight language models—three Meta Llamas, one Microsoft, and one EleutherAI—on the Books3 corpus, which contains many still‑copyrighted works. The researchers found that all models can readily generate 50‑token excerpts from *Harry Potter and the Sorcerer’s Stone*, with Meta’s Llama 3.1 70B reproducing the text most easily, underscoring that verbatim copying is a widespread issue that could strengthen plaintiffs’ claims in AI‑copyright litigation while offering data useful to defendants.
Keywords: #gpt-oss:20b, AI, Book, Books3, Copyright, EleutherAI, GPT-4, Harry Potter, LLMs, Llama, Meta, Microsoft, Model, Open-weight, OpenAI, Plaintiffs
arstechnica.com 12 days ago
|
2265.
HN
Training language models on TPUs shouldn't be scary
The author built an open-source training pipeline for EAGLE, a speculative-decoding language model that uses hidden states from a verifier LLM (Llama 3.1 8B) to predict several tokens at once. Although the drafter has only 450 M parameters, training is compute-heavy: three epochs on a single H100 GPU take roughly four days, prompting a move to Google Cloud TPU-v6e chips via the TRC program.
The switch required stripping all `.cuda()` calls, adopting `torch_xla[tpu]`, enabling bfloat16 precision, and moving from GPU-centric Fully-Sharded Data Parallelism to PyTorch XLA's SPMD system on a 4- or 64-chip grid. The 32 GB of HBM per chip (versus 80 GB of VRAM on the H100) forced more aggressive manual sharding instead of the XLA FSDP wrapper, with SPMD initialization needed to avoid race conditions.
Repeated recompilations during token generation, caused by dynamic-size input tensors and revealed by XLA IR debugging, were eliminated by padding sequences to consistent 128- or 2048-token multiples and keeping batch shapes fixed, cutting iteration times from minutes to seconds. Inter-chip communication bottlenecks were mitigated by replacing `dist.all_reduce` on the Gloo backend with `xm.all_reduce`, which keeps reductions on the TPU interconnect and raised core duty cycles to about 77 % and tensor-core utilisation to roughly 24 %. Further transformer optimisations (disabling unnecessary mask recomputation, pre-computing linear-index tensors, and removing costly `aten::nonzero` and `aten::_local_scalar_dense` calls) lifted throughput from 2.4 to 5.2 iterations per second. Roof-line profiling showed the workload is memory-bandwidth bound (about 50 % HBM utilisation, about 22 % of peak FLOPs); further mitigations such as duplicating large vocabulary matrices to avoid all-to-all traffic and refining the sharding reduce communication overhead, positioning the system for near-optimal TPU utilisation in future large-scale runs.
For comparison, GPU experiments on an H100 using Tensor Cores deliver 67 TFLOPs at FP32; an XLA-optimised `torch.autocast` bfloat16 run yields 2.17 it/s, and adding Torch Dynamo with the `openxla` backend lifts this to 2.38 it/s, though indiscriminately compiling every function can drop speed to 2.15 it/s, since XLA's own fusion makes `@torch.compile` unevenly beneficial. Benchmarks indicate a TPU outperforms a 4-GPU H100 node on next-token prediction at TTT = 1, and a "Training-Time Test" shows flex-attention outperforming SDPA on larger batches and longer sequences, using less memory and running faster, especially with dynamic sequence lengths; scaling to 4-GPU nodes, however, suffers poor parallel efficiency at small loads, and TPUs (v6e-4) match 4-GPU DDP performance but lack customised attention kernels, which limits peak efficiency and increases memory-bandwidth demands for smaller models.
Further roof-line analysis pinpoints memory-bound loop-fused attention kernels and compute-bound convolution-fused MLP layers, suggesting that replacing the current attention routine with an XLA-optimised flash-attention kernel and removing `u32[]/s32[]` dependencies could close the compute gap. Training on TPUs has now reached parity with multi-GPU setups, and future work aims to train larger drafters to exploit more parallelism, improve weight shuffling, and further reduce bottlenecks.
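The recompilation fix is easy to picture: pad every batch to the next multiple of a fixed bucket size so XLA only ever sees a small, stable set of tensor shapes. The sketch below is not taken from the post; the 128-token multiple matches the value mentioned above, and the helper name is illustrative.

```python
import torch
import torch.nn.functional as F

def pad_to_bucket(input_ids: torch.Tensor, multiple: int = 128, pad_id: int = 0) -> torch.Tensor:
    """Pad a (batch, seq_len) tensor so seq_len becomes a multiple of `multiple`.

    Keeping shapes in a small, fixed set prevents XLA from recompiling the
    graph for every new sequence length.
    """
    seq_len = input_ids.size(1)
    target = ((seq_len + multiple - 1) // multiple) * multiple
    return F.pad(input_ids, (0, target - seq_len), value=pad_id)

# Example: a batch of 300-token sequences is padded up to 384 (3 * 128), so
# every length between 257 and 384 shares one compiled graph.
batch = torch.randint(0, 32000, (8, 300))
padded = pad_to_bucket(batch)
print(padded.shape)  # torch.Size([8, 384])
```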
Keywords: #gpt-oss:20b-cloud, AMP, BF16, CUDA, EAGLE, FSDP, GPU, HBM, Llama, PyTorch, SPMD, TPU, TensorCore, compute, dataset, epochs, training
dogac.dev 13 days ago
|
2437.
HN
Mappa – Fine-tune ANY multi-agent LLM systems end-to-end with AI coaches
Mappa is a framework for fine-tuning multi-agent large-language-model systems end-to-end by attaching an external "coach" LLM (such as Gemini) that monitors every agent's actions and tool outputs during training and assigns dense, per-step scores. This addresses the credit-assignment problem of conventional reinforcement-learning setups that rely on a single terminal reward: when a run fails, the coach can blame the precise agent responsible. In practice, agents are trained through API calls to the coach, and once training concludes they run locally offline. The authors report significant performance boosts, noting a 17-percentage-point improvement on the AIME math competition and a 38-percent F1 gain on Kaggle-style data-science tasks. Training demands 2 to 8 quadruple instances of 80-GB GPUs depending on the model size; the implementation is distributed under an MIT license, and Mappa remains agnostic to the choice of agents, tasks, or coach models.
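A loose sketch of the dense per-step scoring idea follows; Mappa's real interfaces are not described in the summary, so `score_step` and the data layout are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Step:
    agent: str        # which agent acted
    action: str       # what it did (tool call, message, ...)
    observation: str  # what came back

def score_step(task: str, step: Step) -> float:
    """Placeholder for an API call to the coach LLM (e.g. Gemini) that
    returns a dense reward for this single step."""
    return 0.0

def per_agent_returns(task: str, trajectory: List[Step]) -> Dict[str, float]:
    """Credit assignment: each agent accumulates only its own step scores,
    instead of all agents sharing one terminal reward."""
    returns: Dict[str, float] = {}
    for step in trajectory:
        returns[step.agent] = returns.get(step.agent, 0.0) + score_step(task, step)
    return returns
```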
Keywords: #gpt-oss:20b-cloud, API, Fine-tune, GPU, Gemini, Kaggle-style, LLM, LLaMA, MIT, Mappa, Qwen, RL, multi-agent, offline, tasks
news.ycombinator.com 14 days ago
|