354.
HN
Tell HN: Tips for (mostly) free agentic coding setup
Agentic coding is changing how software gets written, but premium subscriptions put the strongest tools out of reach for many developers. Several strategies keep the cost close to zero. One is to use OpenAI- or Anthropic-compatible APIs through open-source (OSS) adapters, especially when a provider offers free inference. Another is OpenRouter's free models, whose model IDs end in `:free`; the trade-offs are that your data may be retained and rate limits apply until you top up roughly $10, while promotional periods occasionally make specific models usable at no additional cost.
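As a concrete sketch of the OpenRouter route, the snippet below calls a `:free` model through OpenRouter's OpenAI-compatible endpoint using the standard `openai` client; the model ID shown is a placeholder, since the set of free models changes over time.

```python
# Minimal sketch of using an OpenRouter ":free" model via the OpenAI-compatible
# API. The model ID is illustrative -- check OpenRouter's model list for IDs
# that actually end in ":free" at the moment you run this.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

resp = client.chat.completions.create(
    model="example-provider/example-model:free",  # hypothetical free-tier ID
    messages=[{"role": "user", "content": "Refactor this function to be iterative."}],
)
print(resp.choices[0].message.content)
```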
OpenCode stands out as a capable agentic harness, and several large language model (LLM) providers offer free-tier inference APIs that plug into it; the caveat, again, is that user data will be stored with those services. For a fully local setup, a machine with roughly 6-8 GB of VRAM and 32 GB of system RAM can run ~30B-parameter Mixture-of-Experts (MoE) models on its own hardware, and the GLM-4.7-Flash model is particularly well suited to this kind of environment in simpler harnesses like OpenCode.
These cost-effective options come with trade-offs in data privacy and inference quality that are worth setting expectations around. For instance, the free Kimi 2.5 offered through OpenCode differs from its paid counterpart, so not everything is available without a fee, and small open models run locally should not be expected to match the performance of full-size cloud models. Within those limits, the tools described here can still produce impressive results and make agentic coding accessible at minimal expense.
Keywords: #phi4, APIs, Agentic coding, Anthropic, Claude Code, GLM-47-Flash, Kimi 25, MoE models, OSS adapters, OpenAI, OpenRouter, RAM, VRAM, data collection, inference, inference quality, models, promotional periods, rate limits
news.ycombinator.com a day ago
|
626.
HN
Simple CUDA-checkpoint wrapper to freeze and restore GPU processes quickly
`gpusched` is a sophisticated tool crafted for optimizing GPU process management through rapid freezing and restoration using NVIDIA's cuda-checkpoint technology. It efficiently offloads GPU virtual memory to host RAM, allowing the GPU to be reallocated without sacrificing quick recovery times. This utility offers notable advantages in performance by facilitating freezes and thaws approximately 25 to 30 times faster than re-loading models from scratch—taking around 600 milliseconds for freezing and about 400 milliseconds for thawing tasks. Installation is straightforward with a script accessible on GitHub, contingent upon a Linux environment and NVIDIA drivers version 580 or higher.
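For a sense of the primitive involved, the sketch below drives NVIDIA's `cuda-checkpoint` binary directly from Python; it is not `gpusched`'s SDK, and it assumes the utility is on PATH with the `--toggle`/`--pid` flags documented in NVIDIA's repository.

```python
# Conceptual freeze/thaw using NVIDIA's cuda-checkpoint utility (the primitive
# gpusched builds on), not gpusched's own SDK. Assumes root privileges and a
# driver new enough to support CUDA checkpointing.
import subprocess

def toggle_cuda_state(pid: int) -> None:
    """Freeze a running CUDA process, or thaw a frozen one (the call toggles)."""
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)], check=True)

# toggle_cuda_state(12345)   # freeze: device memory is offloaded to host RAM
# ... run another GPU job while the original process sleeps ...
# toggle_cuda_state(12345)   # thaw: pages are restored and the process resumes
```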
The tool includes both a Command Line Interface (CLI) for comprehensive process management—including starting daemons, running processes, checking statuses, logging outputs, and more—and an interactive terminal UI known as `gpusched dashboard`. Additionally, it integrates seamlessly into Python applications through its SDK without requiring external dependencies. Functionality extends to multi-GPU setups by enabling efficient checkpointing and restoration across GPUs.
Despite its strengths, the tool is limited to single-machine operations, lacking coordination capabilities for multi-node environments. It also necessitates root permissions due to cuda-checkpoint dependencies, and snapshots cannot be transferred between different GPU architectures. Future development ideas focus on enhancing functionality with disk-backed snapshots for persistent and limitless frozen models, introducing an HTTP API for remote management, and deploying policy-based eviction mechanisms to streamline resource optimization.
Licensed under Apache 2.0, `gpusched` stands out as a pivotal solution in improving the efficiency of managing large language models (LLMs), capitalizing on rapid checkpointing techniques to minimize downtime in GPU utilization cycles.
Keywords: #phi4, CLI, CUDA, GPU, Linux, NVIDIA, Python SDK, VRAM, benchmarks, checkpoint, daemon, development, freeze, future exploration, gpusched, host RAM, limitations, process manager, restore, systemd
github.com 3 days ago
|
823.
HN
How to Vulkan in 2026
The document "How to Vulkan in 2026" serves as an advanced guide to developing a modern Vulkan graphics application using version 1.3, targeting developers already familiar with C/C++ and real-time graphics. It highlights significant evolutions within Vulkan over the past decade, introducing features such as dynamic rendering, buffer device address, descriptor indexing, and enhanced synchronization mechanisms, aiming to streamline efficient code writing by minimizing abstraction layers.
Key steps in setting up a Vulkan application include creating a Vulkan instance using SDL for platform-specific tasks, selecting appropriate physical devices with necessary queue families, and managing memory through the Vulkan Memory Allocator (VMA). The document describes creating a Vulkan-capable window, establishing a swapchain to render images across various devices, configuring depth testing via dedicated attachments, loading mesh data using tinyobjloader, and employing parallelism strategies like double buffering for optimal CPU-GPU task execution.
The guide emphasizes crucial tools like RenderDoc for debugging and SDL for managing platform-specific complexities. It covers efficient memory management by using `VMA_MEMORY_USAGE_AUTO`, ensuring high performance through simultaneous CPU preparation of frames while the GPU processes others. Buffers storing shader data, such as transformation matrices, leverage Vulkan 1.3's features to simplify access without descriptors.
Texture handling involves loading textures in KTX format for direct GPU memory upload, optimizing image tiling with layout transitions and copying commands. Synchronization between CPU and GPU is managed using fences, semaphores, and pipeline barriers to prevent resource conflicts. Command buffers are recorded into command pools before submission to the GPU queue, while shaders are written in Slang and compiled into SPIR-V format for Vulkan compatibility.
The document further details constructing a Vulkan graphics pipeline, including creating shader modules from SPIR-V code and setting up vertex input configurations, shader stages, viewport states, depth/stencil settings, and blending options. It describes a render loop where command buffers handle synchronization with fences and semaphores to coordinate CPU/GPU tasks efficiently.
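The frames-in-flight pattern at the heart of that loop comes down to a small amount of bookkeeping; the Python sketch below is a language-agnostic outline of it, with the `gpu_*` callables standing in for the real Vulkan calls (`vkWaitForFences`, `vkResetFences`, `vkQueueSubmit`, `vkQueuePresentKHR`).

```python
# Sketch of the frames-in-flight render loop: the CPU records frame N while the
# GPU still works on frame N-1, and a per-frame fence keeps the CPU from reusing
# resources the GPU has not released. The gpu_* callables are placeholders for
# the corresponding Vulkan calls.
MAX_FRAMES_IN_FLIGHT = 2

def render_loop(frames, gpu_wait_fence, gpu_reset_fence, gpu_submit, gpu_present):
    current = 0
    while True:
        frame = frames[current]            # per-frame command buffer, fence, semaphores
        gpu_wait_fence(frame.fence)        # block until the GPU has released this slot
        gpu_reset_fence(frame.fence)
        frame.record_commands()            # CPU work here overlaps GPU work on the other slot
        gpu_submit(frame.command_buffer, signal_fence=frame.fence)
        gpu_present(frame.render_finished_semaphore)
        current = (current + 1) % MAX_FRAMES_IN_FLIGHT
```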
Additionally, the guide outlines managing system events through SDL for platform-independent event handling, including application close, mouse interactions for object manipulation, key presses for toggling model instances, and window resizing necessitating swapchain recreation. This ensures responsive rendering in alignment with user interactions and application state changes.
Keywords: #phi4, C++20, CMake, GPU, KTX-Software, RenderDoc, SDL, SPIR-V, Slang, VMA, VRAM, VkShaderModuleCreateInfo, Vulkan, Vulkan SDK, anisotropic filtering, buffer device address, command buffers, depth attachment, descriptor indexing, descriptor sets, dynamic rendering, fence, frames in flight, glm, graphics application, image memory barrier, interactivity, interleaved attributes, logical device, multithreading, optimal tiling, phong lighting, physical devices, pipeline barriers, pipeline layout, queue families, render loop, resource allocation, shader data buffers, shaders, state management, swapchain, synchronization, texture loading, tinyobjloader, validation layers, vertex data, vkQueuePresentKHR, window resizing
www.howtovulkan.com 4 days ago
|
1040.
HN
Training LLMs on 1080 Tis without shadow weights
Project PRIMAL is an innovative research initiative focused on optimizing the training of Large Language Models (LLMs) using a novel approach known as the 4-bit Prime-Harmonic Training Engine. This project targets consumer-grade GPUs, specifically the GTX 1080 Ti, to address the high VRAM usage associated with traditional Quantization-Aware Training by eliminating shadow weights, thereby reducing memory requirements significantly.
Central to this initiative are key innovations like the Prime Harmonic Grid, which uses a custom Look-Up Table (LUT) based on prime reciprocals for precision optimization around zero, a region where LLM weights predominantly cluster. Additionally, the project introduces the Poltergeist Method, employing a "Decoupled Flipping" technique to minimize stochastic thrashing during training by utilizing an int8 buffer to cast gradient votes and updating weights only upon achieving consensus across micro-batches.
These methods have proven effective in benchmarks, demonstrating the GTX 1080 Ti's efficient utilization by fully saturating VRAM for models with 0.1 billion parameters at batch sizes up to 64 while maintaining high throughput during training. Project PRIMAL is available as open-source software under the MIT license and requires a Pascal or newer NVIDIA GPU, along with CUDA 11.8+ and Python 3.10+, to set up and run.
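The following NumPy sketch illustrates the gradient-vote idea described above as the summary presents it; the grid levels, vote scaling, and consensus threshold are illustrative assumptions rather than the project's actual values.

```python
# Rough NumPy sketch (not the project's code) of "decoupled flipping" as
# described above: each micro-batch casts signed gradient votes into an int8
# buffer, and a quantized weight only moves to a neighbouring grid level once
# the votes reach a consensus threshold. Grid levels and threshold are made up.
import numpy as np

GRID = np.array([-0.5, -0.25, -0.125, 0.0, 0.125, 0.25, 0.5])  # toy 4-bit-style levels
THRESHOLD = 4                                                   # votes needed to flip

def vote_and_flip(weight_idx, grad, votes):
    """weight_idx: int array of indices into GRID; grad: gradients; votes: int8 buffer."""
    votes += np.sign(grad).astype(np.int8)          # accumulate +/-1 votes per weight
    flip_up = votes <= -THRESHOLD                    # consistently negative grad -> raise weight
    flip_down = votes >= THRESHOLD                   # consistently positive grad -> lower weight
    delta = flip_up.astype(np.int64) - flip_down.astype(np.int64)
    weight_idx = np.clip(weight_idx + delta, 0, len(GRID) - 1)
    votes[flip_up | flip_down] = 0                   # reset votes only where a flip happened
    return weight_idx, votes                         # dequantized weights are GRID[weight_idx]
```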
Keywords: #phi4, Batch Size, CUDA, Decoupled Flipping, Discrete Optimization Loop, GTX 1080 Ti, LLMs, Look-Up Table, NVIDIA GPU, Prime Harmonic Grid, Python, Quantization-Aware Training, Shadow Weights, Stochastic Thrashing, Throughput, VRAM
github.com 5 days ago
https://github.com/batteryphil/Primal-Discrete-LLM-Trai 5 days ago
|
1216.
HN
Show HN: 3D and World Models for Consistent AI Filmmaking
"Show HN: 3D and World Models for Consistent AI Filmmaking" introduces ArtCraft, an innovative tool that integrates artificial intelligence into the filmmaking process, aiming to enhance creativity and democratize film production by overcoming traditional industry constraints like nepotism and limited autonomy. The author emphasizes ArtCraft's role as a transformative force similar to digital audio workstations in music, providing filmmakers with intuitive 2D and 3D control surfaces for seamless image-to-image and image-to-video workflows, free from complex node graphs. This tool supports drag-and-drop functionality across creative canvases, facilitating rapid prototyping, editing, and compositing. ArtCraft leverages third-party compute providers to integrate existing models such as WorldLabs' Marble Gaussian Splats without mandatory payments, aligning with a "fair source" model that allows open-source access while planning for future offline capabilities and potentially portable OSS cloud solutions for AI tools. The author envisions expanding its features through further integrations with compute providers, developing a native client using Bevy, and incorporating local models to solidify ArtCraft's position as an indispensable tool for creative professionals in the filmmaking industry.
Keywords: #phi4, 3D compositing, 3D models, AI filmmaking, ArtCraft, Bevy, Blender, Cockroach DB, ControlNet, Figma, Gimp, I2I, I2V, IDE, Marble Gaussian Splats, UX/UI, VRAM, World Models, WorldLabs, cloud service, compute providers, creative autonomy, film school, local models, node graphs, photons-on-glass, prototyping, rotoscoping, text-to-image
getartcraft.com 6 days ago
|
1684.
HN
Emulator Bugs: Sega CD, Part 2
The blog post explores technical challenges encountered while emulating Sega CD games, focusing on "Snatcher" and "Batman Returns." In "Snatcher," an initial sprite display issue stemmed from an integer overflow caused by misaligned sprite table addresses in the emulator's Genesis code; a custom Rust build profile made the debugging that pinpointed it much faster. Issues with VDP DMA reads interacting with word RAM were then addressed by adding a cycle delay to better replicate the hardware's behavior, which, together with further adjustments, fixed the remaining bugs in "Snatcher."
In "Batman Returns," initial graphical glitches were traced back to incorrect handling of TAS instructions executed by the sub CPU. These instructions failed because the emulator incorrectly handled bus locking, but fixing this resolved the visual issues and revealed a game freeze caused by a divide-by-zero exception on the sub CPU during gameplay. The underlying problem was found in how the zero flag was set for DIVS and DIVU instructions within the emulator, leading to erroneous branching behavior. Rectifying this error eliminated the freezing issue, completing the debugging process for "Batman Returns." Looking ahead, the author plans to tackle similar complexities with "Silpheed's" word RAM handoff code.
Overall, these bugs underscore the intricate challenges of emulating hardware-specific behaviors and optimizations used in Sega CD games, emphasizing the need for precise emulation techniques to ensure accurate game performance.
Keywords: #phi4, Batman Returns bug, Genesis code, Sega CD, TAS instruction, VDP DMA, VRAM, affine transformations, divide by zero exception, emulator bugs, integer overflow, sprite display issues, word RAM, write-through cache
jsgroth.dev 9 days ago
|
1709.
HN
Game Boy Snake: A Complete Implementation in Assembly
The article walks through implementing the classic Snake game for the Game Boy in RGBDS assembly, showcasing fundamental development techniques within the platform's 8-bit constraints (a 4 MHz CPU and 8 KB of RAM). A state machine manages three primary screens: the Title Screen, the Play Screen, and the Game Over Screen. The snake is represented as a circular array (ring buffer) of coordinates held in two parallel arrays for X and Y positions, an occupancy grid efficiently tracks what each cell contains, and dirty-flag rendering during VBlank prevents screen tearing.
The game is configured using constants that define playfield dimensions (20x18), a maximum snake length (64 segments via power-of-two wrapping), and frame speed. Numerical constants manage directions and states for streamlined control. Initialization involves disabling interrupts, setting hardware parameters, seeding the random number generator, and preparing initial graphics in VRAM.
In the main loop, the game operates safely between VBlank periods to process input, update states, and handle screen transitions based on current game state comparisons. Background tiles are preferred over sprites for rendering due to limitations (40 sprite maximum) that align better with grid-based movement. The snake's position is updated through direction inputs, with wall collision checks, occupancy grid updates, and flagged rendering changes.
Collision detection employs an efficient single-byte lookup method using the occupancy grid. Input handling uses edge detection on joypad readings to prevent reverse movements, while random number generation utilizes a linear feedback shift register combined with a hardware timer for food spawning in unoccupied cells. Text is rendered via a blitter mapping ASCII characters to tile indices.
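The ring buffer and occupancy grid described above reduce each move to a handful of operations; the Python sketch below mirrors that structure (the original is RGBDS assembly), using the article's 20x18 playfield and 64-segment cap so length wrapping is a single bitwise AND.

```python
# Python sketch of the article's data structures: parallel X/Y coordinate arrays
# used as a ring buffer, and a flat occupancy grid that makes collision checks
# a single lookup. Sizes match the write-up (20x18 field, 64-segment cap).
FIELD_W, FIELD_H = 20, 18
MAX_LEN = 64                          # power of two, so wrapping is a bitwise AND
MASK = MAX_LEN - 1

xs = [FIELD_W // 2] * MAX_LEN         # parallel coordinate arrays, as on the Game Boy
ys = [FIELD_H // 2] * MAX_LEN
grid = bytearray(FIELD_W * FIELD_H)   # 0 = empty, 1 = snake, 2 = food
head, tail = 0, 0
grid[ys[head] * FIELD_W + xs[head]] = 1

def step(dx, dy):
    global head, tail
    new_x, new_y = xs[head] + dx, ys[head] + dy
    if not (0 <= new_x < FIELD_W and 0 <= new_y < FIELD_H):
        return "wall"
    cell = grid[new_y * FIELD_W + new_x]          # one lookup decides everything
    if cell == 1:
        return "self"
    if cell != 2:                                 # no food eaten: free the old tail cell
        grid[ys[tail] * FIELD_W + xs[tail]] = 0
        tail = (tail + 1) & MASK
    head = (head + 1) & MASK                      # ring-buffer advance via bit mask
    xs[head], ys[head] = new_x, new_y
    grid[new_y * FIELD_W + new_x] = 1
    return "ate" if cell == 2 else "moved"
```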
The memory layout efficiently organizes game data within limited RAM, ensuring smooth gameplay. This Snake implementation serves as an educational resource for Game Boy programming techniques and provides a functional gaming experience, with potential optimizations like hardware interrupts and sound effects suggested. The complete source code can be accessed on GitHub.
Keywords: #phi4, Assembly, Collision Detection, Development, Edge Detection, Game Boy, Game Over Screen, Input Handling, Memory Layout, Play Screen, Polling, RGBDS, Random Number Generation, Ring Buffer, Sprites, State Machine, Text Blitting, Title Screen, VBlank, VRAM, WRAM
www.4rknova.com 9 days ago
|
1736.
HN
Skipping the ColecoVision's Boot Screen
The article explores techniques for bypassing the boot screen delay on the ColecoVision gaming console, which is recognized for its graphics and sound capabilities as well as its simple CPU design. It highlights how developers typically face a mandatory twelve-second display of the system logo and copyright message during startup, viewed as an inconvenience. To circumvent this delay, some games like Activision’s H.E.R.O. use a "test cartridge" mode by altering the first two bytes of the cartridge header to $55 $AA, signaling the console to skip regular initialization procedures and execute the main program directly.
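A minimal way to apply the trick described above is to patch an existing ROM image; the helper below (illustrative, not from the article) writes the $55 $AA signature over the first two header bytes.

```python
# Small helper that stamps a ColecoVision ROM with the $55 $AA "test cartridge"
# signature described above, so the BIOS jumps straight to the game instead of
# showing the twelve-second logo screen. Assumes the ROM image starts at the
# cartridge header.
from pathlib import Path

def make_silent_start(rom_path: str, out_path: str) -> None:
    data = bytearray(Path(rom_path).read_bytes())
    data[0:2] = bytes([0x55, 0xAA])       # normal carts carry $AA $55 here instead
    Path(out_path).write_bytes(data)

# make_silent_start("game.rom", "game_silent.rom")
```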
The article provides comprehensive instructions for creating a silent-start cartridge, noting that it involves specific steps such as setting memory locations and initializing certain routines. It underscores the need for manual configuration in sound, graphics, and controller setup, diverging from standard production startup requirements. Compatibility with different BIOS versions is addressed, ensuring consistent RAM usage across them through adherence to documented calls.
By adopting these modifications, developers can enhance user experience by removing unnecessary delays during game startup on the ColecoVision. The author also shares a modified version of an existing project that successfully implements this silent-start technique, illustrating its practical application.
Keywords: #phi4, BIOS, CPU, ColEm, ColecoVision, I/O initialization, RAM, ROM usage, SN76489, TMS9918A, VRAM, ZEsarUX, cartridge architecture, compatibility concerns, controllers, delay loop, graphics, jump tables, retrocoding, shooting-gallery project, silent start, system logo screen, test cartridges
bumbershootsoft.wordpress.com 9 days ago
|
1957.
HN
Show HN: I built a <400ms latency voice agent that runs on a 4gb vram GTX 1650
The AXIOM Voice Agent is an innovative open-source platform developed by a first-year computer science engineering student, designed as a production-grade, fully offline voice agent tailored for robotics labs. It achieves sub-400ms latency on laptops with modest hardware specifications and has gained rapid adoption within 12 hours of its release. The platform features real-time embeddings using JSON RAG, hierarchical agentic RAG combining knowledge graphs and vector search, and optimized Whisper models to minimize errors in speech recognition. Additionally, it fine-tunes datasets for training the Llama 3.2 3B model and implements phonetic correctors to enhance text-to-speech quality.
AXIOM supports semantic search with SetFit, experiments with large language models (LLMs) like Llama and Kokoro, and optimizes frontend performance using three.js for interactive 3D visualization. The project emphasizes privacy, local control, and edge AI capabilities, offering real-time speech processing, intelligent intent classification, RAG-powered responses, and multi-turn conversation management. Its architecture includes innovative features such as glued interactions, zero-copy inference, a 3D holographic UI, and dual corrector pipelines.
Licensed under Apache 2.0, AXIOM encourages community contributions while providing comprehensive documentation for setup, development, and deployment. It integrates with systems like WiredBrain RAG to enhance its functionality as a voice interface layer in robotics applications. The project supports over 100 concurrent users with sub-2-second latency and includes extensive resources such as template responses, knowledge facts, and project ideas.
AXIOM's security roadmap plans to migrate from .pkl to .safetensors format by Q1 2026 to mitigate risks, recommending isolated environments until then. The platform builds on open-source foundations like Sherpa-ONNX and SetFit, contributing significantly to the robotics and AI community. For further inquiries or contributions, contact details for Shubham Dev from Jaypee University of Information Technology are provided.
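For the .pkl-to-.safetensors migration the roadmap mentions, a hedged sketch of the conversion is below; it assumes the pickle holds a plain dict of PyTorch tensors (such as a state_dict), since arbitrary pickled Python objects cannot be represented in safetensors.

```python
# Hedged sketch of the roadmap item above: converting a pickled tensor dict to
# the safetensors format. Only works when the pickle really is a dict of
# tensors; other pickled objects need their own serialization path.
import pickle
import torch
from safetensors.torch import save_file

def pkl_state_dict_to_safetensors(pkl_path: str, out_path: str) -> None:
    with open(pkl_path, "rb") as f:          # loading a pickle still executes code:
        state = pickle.load(f)               # do this only inside an isolated environment
    tensors = {k: v.contiguous() for k, v in state.items() if torch.is_tensor(v)}
    save_file(tensors, out_path)             # safetensors stores raw tensor data only
```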
Keywords: #phi4, 3D models, 3D visualization, Apache 20 license, FIFO history management, FIFO interactions, FastAPI, GPU acceleration, GTX 1650, JSON RAG, Kokoro TTS, Ollama LLM, PostgreSQL, Python, RAG-powered responses, SQLite database, Semantic RAG, SetFit, Sherpa ONNX, Voice agent, WebGL carousel, WebSocket communication, context-aware dialogue, conversational intelligence, dual corrector pipeline, edge AI, fine-tuned dataset, hierarchical agentic RAG, holographic UI, intent classification, intent recognition, interaction DB logs, interactive UI, knowledge graph, llama 32, local control, local inference, minimal safe correction, minimal safe correctors, multi-turn conversation, parakeet TDT, pgvector, phonetic conversion, phonetic correctors, production-grade voice agent, real-time embeddings, robotics, semantic search, silero VAD, sub-400ms latency, template-based responses, threejs, vector search, voice capture, whisper models, zero-copy inference
github.com 11 days ago
|
2093.
HN
How Virtual Textures Really Work
Virtual texturing treats an enormous texture as a single continuous address space and streams only the pages required for the current view, thereby decoupling visible detail from physical GPU memory. The system comprises three layers: a virtual address space, a 2-D page-table texture that maps virtual pages to residency status and physical atlas coordinates, and a physical texture atlas that holds the resident pages. During rendering, a shader performs mip-level selection based on screen-space derivatives, calculates the virtual page coordinates, fetches the corresponding page-table entry, translates it to an atlas coordinate if resident, and samples the physical texture; a fallback color is returned when a page is missing.
To manage residency, a low-resolution feedback pass records the pages and mip levels actually accessed each frame, packing each sample into a compact 32-bit entry in a feedback buffer. A CPU-side page manager decodes the feedback, keeps a small LRU cache of resident pages in the atlas, evicts the least-recently used pages when necessary, and asynchronously streams new pages from disk or secondary storage, pinning a minimal set of low-LOD pages to avoid gaps. This closed loop converges to an optimal working set that degrades gracefully by falling back to lower-resolution pages when demand exceeds cache capacity.
The technique, pioneered in early console titles like Crash Bandicoot and refined in id Tech 5's MegaTexture, lets artists paint unique detail over vast scenes without tiling artifacts; its core concepts survive in modern engines through hardware-accelerated sparse textures and virtual geometry (e.g., Nanite), and they also enable scientific visualization of enormous datasets by adapting resolution to what is actually visible. Performance limits are driven more by bandwidth and perceptual resolution than by raw data size, so virtual texturing's strength lies in keeping only what is observable in memory.
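The address translation step is compact enough to sketch outside a shader; the Python below mirrors it on the CPU, with the page size, table layout, and fallback color as illustrative assumptions.

```python
# CPU-side sketch of the virtual-to-physical lookup the article describes (the
# real version runs in a fragment shader). Constants are illustrative.
VIRTUAL_TEXELS = 1 << 20     # virtual texture edge length in texels (mip 0)
PAGE_SIZE = 128              # texels per page edge
FALLBACK = (255, 0, 255)     # colour returned while a page is still streaming in

def sample_virtual(u, v, mip, page_table, atlas):
    """u, v in [0, 1); page_table[(mip, px, py)] -> (ax, ay) atlas page or None."""
    size = VIRTUAL_TEXELS >> mip                    # virtual resolution at this mip level
    x, y = int(u * size), int(v * size)
    px, py = x // PAGE_SIZE, y // PAGE_SIZE         # which virtual page the sample hits
    entry = page_table.get((mip, px, py))
    if entry is None:                               # not resident: fall back
        return FALLBACK
    ax, ay = entry                                  # physical page coordinates in the atlas
    tx = ax * PAGE_SIZE + (x % PAGE_SIZE)           # in-page offset carries straight over
    ty = ay * PAGE_SIZE + (y % PAGE_SIZE)
    return atlas[ty][tx]
```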
Keywords: #gpt-oss:20b, GPU, LOD, VRAM, atlas, feedback, mip, page, page table, residency, sparse textures, streaming, virtual textures
www.shlom.dev 12 days ago
https://crabernews.com/posts/50946 12 days ago
https://en.wikipedia.org/wiki/Lenna 11 days ago
https://mortenhannemose.github.io/lena/ 11 days ago
|
2197.
HN
Ask HN: Anyone Using a Mac Studio for Local AI/LLM?
The user is inquiring about the practical experience of using a Mac Studio equipped with either an M3 Ultra or M4 Pro chip for running large language models (LLMs) locally. They are particularly interested in the advantages of shared VRAM, which could enable the handling of larger models than would otherwise be possible on such hardware. However, they are also aware that this configuration may result in slower token generation times, and they are seeking insights into how this trade-off affects overall performance and usability in real-world scenarios.
Keywords: #qwen3:14b, AI, Hardware, LLM, Local LLM, M3 Ultra, M4 Pro, Mac Studio, Model Size, Performance, Shared Memory, Token Generation, VRAM
news.ycombinator.com 12 days ago
https://old.reddit.com/r/LocalLLaMA/search?q=mac+s 12 days ago
https://news.ycombinator.com/item?id=46319657 12 days ago
https://www.perplexity.ai/hub/blog/introducing-mod 11 days ago
|
2328.
HN
Show HN: LocalCoder – Tell it your hardware, get the exact local AI model to run
LocalCoder streamlines local AI-model configuration across diverse hardware (Apple Silicon, NVIDIA GPUs, CPUs): describe your machine and it returns the model choice, quantization level, token-rate estimate, and context-window size best suited to running Qwen3-Coder. It supplies ready-to-run Ollama shell commands, with optional llama.cpp instructions and Visual Studio Code/Cursor IDE setup for users who want fine-grained control over VRAM, threads, or context. The free tier recommends the single best model, whereas a $9 upgrade unlocks alternate models, expanded llama.cpp options, and deeper IDE integration. The recommendation engine draws from curated matrices of Hacker News benchmarks, Unsloth documentation, and llama.cpp data, all without performing server-side inference. Quantization guidance clarifies that Q8 offers the highest fidelity, Q4 balances speed and quality, and Q2 trades precision for speed. Because Qwen3-Coder uses a Mixture-of-Experts architecture (variants range up to 480 billion total parameters, and the smaller releases activate only about 3 billion parameters per token), a suitably quantized variant can operate comfortably within limited memory; an example VS Code/Cursor extension configuration connects to a local Ollama server on port 11434.
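To make the quantization trade-off concrete, here is a back-of-the-envelope helper (not LocalCoder's engine) for estimating weight memory at different quantization levels; the bits-per-weight figures are rough approximations that include quantization metadata.

```python
# Rough weight-memory estimate for a model at a given quantization level,
# ignoring KV cache and runtime overhead. Bits-per-weight values are
# approximations, not LocalCoder's internal numbers.
BITS = {"Q8": 8.5, "Q4": 4.5, "Q2": 2.6}   # approx. bits/weight incl. quant metadata

def approx_weight_gb(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * BITS[quant] / 8 / 1e9

# e.g. a 30B-parameter MoE checkpoint at Q4 needs on the order of
# approx_weight_gb(30, "Q4") ~= 16.9 GB for the weights alone; because only a
# few billion parameters are active per token, generation stays fast even on
# modest hardware once the weights fit.
```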
Keywords: #gpt-oss:20b-cloud, Apple Silicon, CPU, GPU, IDE, LocalCoder, MoE, NVIDIA, Ollama, Qwen3-Coder, Unsloth, VRAM, VS Code, context window, llamacpp, quantization
localcoder.xyz 13 days ago
|
2360.
HN
We built a serverless GPU inference platform with predictable latency
The team constructed a serverless GPU‑inference platform focused on predictable latency and strict cost control for production AI tasks. They tackled challenges such as mitigating GPU cold starts and orchestrating queue scheduling, ensuring efficient multi‑tenant VRAM isolation without waste, choosing between model‑level and container‑level loading strategies, and routing traffic between batch and real‑time inference. Additional issues addressed included managing burst traffic without long‑term GPU reservations and balancing cost predictability with autoscaling behavior. Their documentation lists both failures and successes in the architecture, and they invite discussion on GPU scheduling, inference optimization, and workload isolation.
Keywords: #gpt-oss:20b-cloud, AI workloads, GPU, VRAM, batch, burst workloads, cold starts, container loading, cost control, inference, latency, model loading, multi-tenant isolation, queue scheduling, real-time inference, serverless
news.ycombinator.com 13 days ago
|
2586.
HN
How Virtual Textures Work
Virtual texturing, pioneered by Crash Bandicoot's use of "virtual memory pages" for level sections and later refined in id Tech 5's MegaTexture, replaces monolithic texture loads with a sparse, page-based system that streams only the texels needed for the current view, thereby decoupling performance from total GPU memory.
The approach relies on three GPU-centric components: an addressing shader that computes the mip level and virtual page coordinates from screen-space derivatives; a lightweight page table storing residency flags and atlas indices; and a physical texture atlas holding the resident pages. During rendering, the shader looks up the page table, maps the virtual page to its physical location in the atlas, and fetches the texel; missing pages are substituted with a fallback color.
A dedicated low-resolution feedback pass records the page indices and mip levels actually accessed, packing this data into a buffer that the CPU-side page manager consumes to decide which pages to load, evict, or pin. This closed-loop system lets the working set converge to a state where only the visible terrain is resident, dramatically reducing bandwidth and enabling ultra-high-resolution detail that would otherwise exceed memory limits.
Modern hardware now exposes sparse textures, giving engines direct address translation and page tables, but engines still need custom feedback and eviction policies to maintain cross-platform determinism. While classic real-time games can still rely on traditional textures, virtual texturing shines in data-heavy scenarios (open-world slices, volumetric scientific datasets, and contemporary engines like Unreal's Nanite) that demand scalable, efficient use of GPU resources without tile repetition or excessive bandwidth.
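The feedback-and-eviction half of that loop can be sketched in a few lines; the Python below packs page coordinates into 32-bit words the way a feedback pass might, then has a CPU-side manager decode them and maintain a small LRU of resident pages. The bit layout and cache policy are illustrative assumptions.

```python
# Illustrative feedback decoding and LRU residency management for virtual
# textures. Bit layout and eviction policy are assumptions, not the article's.
from collections import OrderedDict

MIP_BITS, X_BITS, Y_BITS = 4, 12, 12      # 4 + 12 + 12 bits used, 4 bits spare

def pack(mip, px, py):
    return (mip << (X_BITS + Y_BITS)) | (px << Y_BITS) | py

def unpack(word):
    mip = (word >> (X_BITS + Y_BITS)) & ((1 << MIP_BITS) - 1)
    px = (word >> Y_BITS) & ((1 << X_BITS) - 1)
    py = word & ((1 << Y_BITS) - 1)
    return mip, px, py

class PageCache:
    """CPU-side residency manager: decode feedback words, keep an LRU of pages."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()      # (mip, px, py) -> atlas slot, in LRU order

    def consume_feedback(self, feedback_words, load_page, evict_page):
        for word in feedback_words:        # everything sampled this frame
            key = unpack(word)
            if key in self.resident:
                self.resident.move_to_end(key)   # mark as recently used
                continue
            if len(self.resident) >= self.capacity:
                victim, slot = self.resident.popitem(last=False)   # least recently used
                evict_page(victim, slot)
            self.resident[key] = load_page(key)   # stream the page in (ideally async)
```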
Keywords: #gpt-oss:20b-cloud, Atlas, Cache, Feedback, GPU, LOD, Memory, Page Table, Residency, Resolution, Shader, Sparse Textures, Streaming, UV, VRAM, Virtual Textures
www.shlom.dev 14 days ago
|