Scraper Spider


2026-03-11 15:25
rtx 3090 stories from the last 14 days
324.  HN AI on a Budget: Recompiling Llama.cpp for Qwen3.5 Inference on an HP Z440
The whitepaper "AI on a Budget" examines the feasibility of running large language models (LLMs) like Qwen3.5 locally using cost-effective hardware, specifically an HP Z440 workstation with dual NVIDIA RTX 3060 GPUs. The research demonstrates that high-performance AI inference can be achieved without exorbitant investments by optimizing both software and hardware configurations. Key findings include significant performance improvements through the use of architecture-specific compilation flags for Intel's Xeon E5-1620 v3 CPU, resulting in a custom backend outperforming mainstream solutions like LM Studio with 70 tokens per second on the Qwen3.5 model. The study emphasizes cost considerations by highlighting the inefficiencies of GUIs such as Electron-based interfaces, which waste VRAM and degrade performance compared to bare-metal implementations. Optimization techniques that leverage instruction sets like AVX2 and FMA3 further enhance CPU-side operations with the integration of Intel oneAPI Math Kernel Library. Additionally, the efficiency of MoE models over dense architectures is noted due to their reduced memory bandwidth requirements and faster inference speeds. Effective context management strategies are crucial in avoiding out-of-memory errors on systems with limited VRAM by using quantization flags and adjusting generation parameters. While a dual-RTX 3060 setup provides excellent value, upgrading to a single RTX 3090 could alleviate PCIe bottlenecks, offering further performance gains albeit at a higher cost. The Qwen3.5 series' capability to enable advanced AI applications within budget constraints underscores its practical utility for developers and critical fields like defense and energy. Overall, the paper concludes that strategic optimizations can make high-performance LLM inference accessible on constrained budgets, challenging the perception that advanced AI capabilities are limited by hardware costs. Keywords: #phi4, CUDA optimizations, DDR4 RAM, Debian 13, Electron framework, HP Z440, LLM inference, MoE architecture, NVIDIA RTX 3060, PCIe Gen3, Qwen35, context window, ik_llamacpp, tokens per second
    jeanbaptistefleury.neocities.org   a day ago
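The paper's build itself is not reproduced here; as a rough illustration of the context-management and quantization settings it discusses, the following sketch uses the llama-cpp-python bindings. The model filename and parameter values are assumptions for illustration, not figures from the paper.

```python
# Sketch: loading a quantized model with a capped context window so the
# KV cache fits in limited VRAM, via the llama-cpp-python bindings.
# Model path and parameter values are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.5-q4_k_m.gguf",  # hypothetical quantized GGUF file
    n_ctx=8192,        # cap the context window to bound KV-cache memory
    n_gpu_layers=-1,   # offload all layers to the GPU(s)
)

out = llm("Explain PCIe Gen3 bottlenecks in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```

Shrinking `n_ctx` and choosing a smaller quantization are the two main levers when a model almost, but not quite, fits in VRAM.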
525.  HN Show HN: Run autoresearch on a gaming PC (Windows and RTX GPUs fork)
This repository is a fork of "karpathy/autoresearch" aimed at turning gaming PCs into autonomous AI research machines, with native Windows support and NVIDIA GPUs of at least 10 GB VRAM. Its primary goal is to run overnight experiments on a simplified GPT model setup called nanochat. Key features include autonomous experimentation within a fixed five-minute runtime per experiment and compatibility with consumer-grade NVIDIA GPUs from the Ampere (RTX series), Ada, and Blackwell generations. AI agents manage experiments by modifying a single file (`train.py`), with context kept in `program.md`. The fixed time budget makes results comparable across runs on the same machine, though not across platforms, since the same wall-clock budget buys different amounts of compute on different hardware (see the sketch below). The repository explicitly supports NVIDIA GPUs with 10 GB of VRAM or more, excluding laptop GPUs and lower-capacity variants to limit performance variability, and uses PyTorch SDPA attention and eager execution with autotuning based on hardware profiles. A quick start requires Python 3.10+ and the uv project manager: prepare the dataset and install dependencies via `uv`, then launch experiments with the same tool, which also supports a smoke test for validation. The project takes a minimalist approach to dependencies, relying only on PyTorch and a few small packages, so experiments remain self-contained and suited to consumer-grade hardware, under an MIT license. Keywords: #phi4, AI agent, AdamW, CUDA, Claude/Codex, GPT model, Muon, NVIDIA GPUs, PyTorch, RTX, SDPA attention, TinyStories, Windows, autoresearch, autotune, batch size, eager execution, experiments, gaming PC, karpathy/autoresearch, platform support, uv project manager, validation bits per byte
    github.com   2 days ago
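A minimal sketch of the fixed wall-clock budget idea: train until the budget expires rather than for a fixed number of steps, so runs on the same machine are directly comparable. The function names below are illustrative, not taken from the repository.

```python
# Time-budgeted experiment loop: every run gets exactly the same
# wall-clock allowance, so step counts (not durations) vary by change.
import time

BUDGET_SECONDS = 5 * 60  # each experiment gets exactly five minutes

def run_experiment(train_step, evaluate):
    """train_step does one optimizer step; evaluate returns a metric,
    e.g. validation bits per byte (both hypothetical callables)."""
    deadline = time.monotonic() + BUDGET_SECONDS
    steps = 0
    while time.monotonic() < deadline:
        train_step()
        steps += 1
    return steps, evaluate()
```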
835.  HN Show HN: Luna Agent – Custom AI agent in ~2300 lines of Python, no frameworks
Luna Agent is a custom-built AI agent by Fabio Nonato de Paula, written in roughly 2,300 lines of Python without existing agent frameworks, as part of a homelab project. Built to address limitations found in other frameworks he evaluated, it stands out for its efficient design and minimal codebase. It keeps persistent memory in SQLite with full-text search, is configured through JSON files, applies safety measures to native operations, and provides session isolation through a Discord interface. It also supports long contexts and structured JSON logging, and runs entirely on capable local hardware without cloud APIs. Emphasizing flexibility, Luna Agent exposes configurable extension points, such as a planned AI firewall, detailed in its DESIGN.md file. The source code is public on GitHub, accompanied by a technical blog post covering its design choices and motivations. Keywords: #phi4, AI agent, Discord interface, FTS5, GitHub, JSON logging, LLM traffic, Luna Agent, MCP tool integration, Python, Qwen3-Coder-Next, RTX 3090, SQLite, architectural decisions, conversation compression, design philosophy, embeddings, filtering proxy, frameworks, homelab project, llama-server, tests
    nonatofabio.github.io   3 days ago
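The summary mentions persistent memory in SQLite with FTS5 search; here is a minimal sketch of that pattern using only Python's standard library. The schema and function names are illustrative assumptions, not Luna Agent's actual code.

```python
# Persistent agent memory with full-text search via SQLite FTS5.
import sqlite3

db = sqlite3.connect("memory.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS memories USING fts5(content)")

def remember(text: str) -> None:
    db.execute("INSERT INTO memories(content) VALUES (?)", (text,))
    db.commit()

def recall(query: str, limit: int = 5) -> list[str]:
    # FTS5 MATCH performs ranked full-text search over stored memories
    rows = db.execute(
        "SELECT content FROM memories WHERE memories MATCH ? "
        "ORDER BY rank LIMIT ?", (query, limit))
    return [r[0] for r in rows]

remember("User prefers terse answers and runs llama-server on an RTX 3090.")
print(recall("RTX 3090"))
```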
1529.  HN Show HN: ContextCache – Cache tool schema KV states, skip 99% of prefill tokens
ContextCache is open-source middleware that speeds up large language model (LLM) interactions by caching tool schemas as key-value (KV) states, skipping redundant prefill work on each request. It targets an inefficiency of conventional LLM requests, in which static tool definitions are re-prefilled with every user query. The result is a large latency reduction, from 5,625 ms to 193 ms when managing 50 tools, with no loss of response quality or accuracy. Offering both CPU and GPU deployment options, ContextCache performs well even on systems without powerful GPUs, scales to 100+ tools, and provides independent caches for multiple tenants with least-recently-used (LRU) eviction. Released under CC BY 4.0, it ships with documentation, a demo app, benchmarks, and integration guides. It operates in two modes: Route-only Mode, which routes queries without an LLM (~500 ms latency), and Full Pipeline Mode, which handles the complete flow from query routing to execution and synthesis using external LLMs such as Ollama or Claude. Additional features include compatibility with various LLM providers via OpenAI's API, secure server-side credential storage, a web-based admin UI, and content-addressed caching that lets tenants share storage efficiently. Overall, ContextCache suits scenarios that demand fast, resource-efficient handling of LLM requests, with flexible deployment and no accuracy trade-off. Keywords: #phi4, API keys, CPU orchestrator, Claude, ContextCache, GPU, KV cache, LLM requests, OpenAI, Qwen3-8B, RTX 3090 Ti, content-addressed caching, enterprise features, llamacpp, multi-tenant, parameter extraction, persistent storage, server-side credentials, speedup, synthesis, tool routing, tool schemas, zero degradation
    github.com   7 days ago
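A minimal sketch of the caching idea described above: key cached schema state by a content hash so identical tool sets share one entry across tenants, and evict least-recently-used entries. The class and the opaque "KV state" placeholder are illustrative assumptions; real entries would hold engine-specific prefill state.

```python
# Content-addressed cache with LRU eviction for tool-schema KV states.
import hashlib
import json
from collections import OrderedDict

class SchemaKVCache:
    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._cache: OrderedDict[str, object] = OrderedDict()

    def _key(self, tool_schemas: list[dict]) -> str:
        # Canonical JSON -> SHA-256 gives a content-addressed key, so
        # identical schemas hash to the same entry regardless of tenant.
        blob = json.dumps(tool_schemas, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def get_or_build(self, tool_schemas: list[dict], build_kv_state):
        key = self._key(tool_schemas)
        if key in self._cache:
            self._cache.move_to_end(key)   # mark as recently used
            return self._cache[key]        # skip re-prefilling the schemas
        state = build_kv_state(tool_schemas)  # expensive prefill, once
        self._cache[key] = state
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)   # evict least recently used
        return state
```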
3088.  HN Sparky – useful 'living' OpenClaw bot
Sparky is a "living" robot built on OpenClaw that combines personality design, a voice user interface, and computer-workflow integration. It uses an NVIDIA RTX 3090 for face detection and voice processing, and AI tool-calling to drive Emacs, SolveIt, tmux, and macOS. The project reflects its creator's interest in synthesizing diverse ideas into a functional, engaging robotic companion. A video demonstration shows Sparky's capabilities, notably multi-host networking and fluent interaction across workspaces, underscoring its potential as an interactive assistant in computational environments. Keywords: #phi4, AI tool-calling, NVIDIA RTX 3090, OpenClaw, SolveIt, Sparky, computer workflows, echo cancellation, emacs, face detection, macOS, personality design, robot buddy, tmux, voice UI, voice activity detection, wake word detection, workspace affordances
    alexisgallagher.com   13 days ago
   https://github.com/algal/sparky   13 days ago
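The keywords mention voice activity detection; a minimal sketch of such a gate, using the py-webrtcvad package, is shown below. Frame sizes and the audio source are illustrative assumptions, not Sparky's actual pipeline.

```python
# Voice-activity gate: pass only speech frames downstream (e.g. to a
# wake-word detector), dropping silence to save compute.
import webrtcvad

vad = webrtcvad.Vad(2)          # aggressiveness 0 (lenient) to 3 (strict)
SAMPLE_RATE = 16000
FRAME_MS = 30                   # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

def speech_frames(pcm_stream):
    """Yield only the frames that contain speech (hypothetical stream
    object with a read(n_bytes) method)."""
    while frame := pcm_stream.read(FRAME_BYTES):
        if len(frame) < FRAME_BYTES:
            break
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```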
3089.  HN Show HN: Provision Stateless GPU Compute with Claude Code's Remote Control
Claude Code's Remote Control, powered by the Terradev MCP Server, lets users provision and manage stateless GPU compute through natural language. GPUs can be provisioned across cloud providers such as AWS, GCP, and Azure from a local environment, while API keys stay stored securely on the user's own machine. Key functionality includes real-time, cost-optimized provisioning of GPU types such as NVIDIA H100 and A100, creation of NUMA-aware Kubernetes clusters with GPU nodes, model deployment to serverless platforms such as InferX or HuggingFace Spaces, inference-endpoint management, and cost optimization. Users can view, stop, start, and terminate instances and analyze cost trends with integrated tools. Setup requires installing the Terradev CLI and MCP Server, configuring API keys locally, and integrating with Claude Code. The tool supports a broad range of GPUs and cloud providers, putting comprehensive GPU cloud management behind conversational commands and streamlining otherwise complex cloud operations. Keywords: #phi4, API Keys, Cloud Providers, Cost Optimization, GPU Compute, GPU Instances, HuggingFace Spaces, Inference Deployment, Kubernetes Clusters, Multi-Cloud, Remote Control, Stateless Provisioning, Terradev MCP
    github.com   13 days ago
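For orientation, here is a hypothetical sketch of how an MCP server can expose a provisioning tool to Claude Code, using the official Python MCP SDK. The tool name, parameters, and body are illustrative; Terradev's actual tool surface may differ.

```python
# Hypothetical MCP server exposing one GPU-provisioning tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("gpu-provisioner")

@mcp.tool()
def provision_gpu(gpu_type: str, provider: str, count: int = 1) -> str:
    """Provision `count` instances of `gpu_type` (e.g. H100) on `provider`."""
    # A real implementation would call the provider's API using locally
    # stored credentials; this stub just echoes the request.
    return f"requested {count}x {gpu_type} on {provider}"

if __name__ == "__main__":
    mcp.run()  # serve over stdio so Claude Code can connect to it
```

Once registered in Claude Code's MCP configuration, a conversational request like "provision two H100s on AWS" resolves to a call of this tool with structured arguments.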