274.
HN
Show HN: Trained YOLOX from scratch to avoid Ultralytics (aircraft detection)
The author developed SkySpottr, an AR app that overlays information about nearby aircraft, and trained YOLOX models for it because Ultralytics' YOLOv8 carries AGPL-3.0 licensing restrictions. Development began with training a model from scratch on an RTX 3090 using the COCO2017 dataset, focused on aircraft detection. Various configurations ("nano," "tiny," "small," and a custom "nanoish" model) were tested, with adjustments aimed at detecting small objects such as distant aircraft. Challenges during this phase included channel mismatches in configuration files and difficulty detecting high-altitude planes, which occupy only a few pixels on screen.
To improve small-object detection, the author increased the input resolution and used mosaic and mixup augmentation. For efficient deployment on iPhones, the models were quantized and converted to CoreML. Integrating YOLOX with Apple's Vision framework posed its own challenges, notably a memory leak that was resolved by optimizing buffer handling.
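As a rough illustration of that deployment step, the sketch below traces a YOLOX model, converts it with coremltools, and applies 8-bit weight quantization. The "yolox-s" experiment name, the 640x640 input size, and the output filename are assumptions for illustration, not the author's exact pipeline, and a real export would load trained weights first.

```python
# Hedged sketch of a YOLOX -> CoreML export with 8-bit weight quantization.
# Assumes the upstream YOLOX repo and coremltools are installed; names, sizes,
# and paths are illustrative only.
import torch
import coremltools as ct
from coremltools.models.neural_network import quantization_utils
from yolox.exp import get_exp

exp = get_exp(exp_file=None, exp_name="yolox-s")   # assumed config name
model = exp.get_model().eval()
# (A real export would load a trained checkpoint into `model` here.)
model.head.decode_in_inference = False  # export raw head outputs; decode on-device

example = torch.zeros(1, 3, 640, 640)   # assumed input resolution
traced = torch.jit.trace(model, example)

# Convert to the classic "neuralnetwork" format so the older weight-quantization
# utility below can be applied to the result.
mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(name="image", shape=example.shape)],
    convert_to="neuralnetwork",
)

# 8-bit weight quantization shrinks the on-device model with little accuracy loss.
mlmodel_int8 = quantization_utils.quantize_weights(mlmodel, nbits=8)
mlmodel_int8.save("yolox_s_int8.mlmodel")
```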
Further improvements involved retraining the model with negative samples to reduce false positives such as trees or clouds being mistaken for aircraft. The author also incorporated self-sourced images from real-world app usage, pseudo-labeled with a more accurate YOLO26-X model. This improved detection accuracy in the challenging real-world case of a ground-level camera pointed at the sky, where the initial COCO-only training fell short.
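For the negative-sample step, one low-tech way to do it with a COCO-format dataset is to register aircraft-free images that have no annotation entries; whether the training loader keeps or filters such images depends on its configuration. The sketch below is a generic illustration with made-up paths, not the author's tooling.

```python
# Hedged sketch: append negative (aircraft-free) images to a COCO-format
# annotation file. Paths and file names are illustrative.
import json
from pathlib import Path
from PIL import Image

ann_path = Path("annotations/instances_train.json")
coco = json.loads(ann_path.read_text())
next_id = max(img["id"] for img in coco["images"]) + 1

for img_file in sorted(Path("negatives").glob("*.jpg")):
    width, height = Image.open(img_file).size
    # An image id that never appears in coco["annotations"] is pure background;
    # COCO-style training loops treat any detection on it as a false positive.
    coco["images"].append(
        {"id": next_id, "file_name": img_file.name, "width": width, "height": height}
    )
    next_id += 1

out_path = ann_path.with_name("instances_train_with_negatives.json")
out_path.write_text(json.dumps(coco))
```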
Ultimately, YOLOX-Small models were successfully integrated into SkySpottr, demonstrating efficient performance on an iPhone. The project not only achieved its technical goals but also provided valuable insights into object detection, particularly the advantages of self-sourcing data and developing custom solutions beyond pre-packaged offerings like those from Ultralytics.
Keywords: #phi4, AGPL-3.0, AR app, COCO2017 dataset, CoreML, INT8 quantization, MIT license, SkySpottr, Ultralytics, YOLOX, YOLOv8, aircraft detection, debugging, false positives, iOS deployment, inference time, memory leak, model accuracy, negative samples, neural networks, object detection, real-world conditions, self-sourced images, training models
austinsnerdythings.com a day ago
|
790.
HN
Show HN: Trained YOLOX from scratch to avoid Ultralytics (iOS aircraft detect)
The author developed an AR app named SkySpottr, designed to overlay aircraft information by integrating device location, orientation, and ADS-B data. Initially utilizing YOLOv8 for object detection, they encountered licensing issues under AGPL-3.0 with Ultralytics, prompting a switch to training MIT-licensed YOLOX models from scratch. The author trained various configurations (Nano, Tiny, Small, Nanoish) on an RTX 3090 using the COCO2017 dataset and faced challenges such as channel mismatch errors, which were mitigated by increasing input resolution and adjusting convolution types with guidance from AI tools.
The author achieved high detection rates with the Small and Nanoish models but struggled with integrating YOLOX into iOS's CoreML due to preprocessing differences. To enhance performance, they implemented INT8 quantization, reducing model size while maintaining accuracy. Real-world tests revealed issues with false positives from non-aircraft objects and detecting distant aircraft, which were addressed by incorporating negative samples in the training dataset and using YOLO26-X for pseudo-labeling additional self-sourced images.
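One frequent source of such preprocessing mismatches is that recent (non-legacy) YOLOX checkpoints expect raw 0-255 BGR pixels with no mean/std normalization, whereas many CoreML conversions default to rescaled RGB. The sketch below shows how that intent could be expressed with coremltools' image input; the names and 640x640 size are assumptions, not the author's actual settings.

```python
# Hedged sketch: declare a CoreML image input that passes raw 0-255 BGR pixels
# through unchanged, matching YOLOX's default (non-legacy) preprocessing.
import coremltools as ct

image_input = ct.ImageType(
    name="image",                # assumed input name
    shape=(1, 3, 640, 640),      # assumed training resolution
    scale=1.0,                   # keep the 0-255 range; no /255 rescaling
    bias=[0.0, 0.0, 0.0],        # no mean subtraction
    color_layout=ct.colorlayout.BGR,
)
# A traced PyTorch model would then be converted with, e.g.:
# mlmodel = ct.convert(traced_model, inputs=[image_input])
```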
After retraining, SkySpottr showed improved accuracy with fewer false positives, benefiting from an enriched dataset of real-world images. The author concluded that developing their own model was beneficial for avoiding licensing issues and gaining deeper insights into object detection models. SkySpottr is now available on the App Store and continues to improve as more training data is collected.
Keywords: #phi4, ADS-B data, AGPL-3.0, AR app, COCO2017 dataset, CoreML, INT8 quantization, MIT license, SkySpottr, Ultralytics, YOLOX, YOLOv8, aircraft detection, false positives, iOS deployment, inference time, memory leak, model accuracy, neural networks, object detection, self-sourced images, training models
austinsnerdythings.com 4 days ago
|
1555.
HN
A Ralph Loop for Reading: Beating GPT 5.2 with a 4k Context Window (and 4 GPUs)
The text outlines an innovative approach for conducting deep financial research using a home server equipped with four RTX 3090 GPUs, leveraging the library Laconic. This method circumvents costly API services by employing limited-context models like qwen3:4b to handle complex tasks efficiently. The "Ralph Loop" strategy effectively manages context windows through graph theory principles, allowing atomic facts stored in a JSON notebook to streamline data processing without overwhelming model memory. By decomposing queries into key elements and refining results iteratively, the system autonomously synthesizes factual information.
The efficacy of this approach is demonstrated by qwen3:4b correctly identifying the 2024 Chemistry Nobel Prize laureates, a task the bare 4k-context model could not complete without Laconic's framework, showcasing how small models can outperform larger ones through careful context management. The system underpins eh-trade.ca, enabling research across 11,000 stocks at minimal cost and effort and yielding promising results for momentum-based stock strategies.
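Laconic's internals are not shown in the write-up, but the loop it describes can be sketched generically: keep a small JSON notebook of atomic facts and, on each pass, hand the 4k-context model only the question plus the notebook. The snippet below assumes a local Ollama server serving qwen3:4b; the endpoint, prompt format, and stopping rule are illustrative, not Laconic's actual API.

```python
# Hedged sketch of a Ralph-loop-style reading loop with a JSON fact notebook.
# Assumes Ollama is running locally with the qwen3:4b model pulled.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"   # assumed local endpoint
notebook = []  # short, atomic fact strings

def ask(prompt: str) -> str:
    r = requests.post(
        OLLAMA_URL,
        json={"model": "qwen3:4b", "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"]

question = "Who won the 2024 Nobel Prize in Chemistry?"
for _ in range(10):
    prompt = (
        f"Question: {question}\n"
        f"Known facts: {json.dumps(notebook)}\n"
        "If the facts are sufficient, reply 'ANSWER: <answer>'. "
        "Otherwise reply 'FACT: <one new atomic fact to research next>'."
    )
    reply = ask(prompt).strip()
    if reply.startswith("ANSWER:"):
        print(reply)
        break
    # In the real system each new fact would be grounded in retrieved documents
    # before being stored; here the model's output is appended directly.
    notebook.append(reply.removeprefix("FACT:").strip())
```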
Keywords: #phi4, API Subscription, Bash Script, Context Window, Financial Research, GPT 5.2, GPUs, Git, Graph Theory, JSON, LLM, Momentum Strategies, Notebook, RTX 3090, Ralph Loop, Stocks, Strategy, VRAM
stevehanov.ca 8 days ago
|
1598.
HN
HeartMuLa: Open-source music foundation model achieving commercial-grade quality
HeartMuLa is an open-source AI music generation model designed to produce professional-quality songs complete with lyrics through a hierarchical Transformer architecture and HeartCodec (12.5Hz). It is freely accessible under the Apache 2.0 license, making it ideal for both personal and commercial use without incurring any fees. Often compared to Suno, a closed-source alternative, HeartMuLa offers comparable quality while providing advantages such as local deployment and enhanced privacy control without requiring subscriptions. The model necessitates approximately 24GB of VRAM for optimal performance, with recommended GPUs including the RTX 3090, RTX 4090, or A100. For users lacking sufficient VRAM, cloud GPU services like RunPod or Vast.ai present viable alternatives to meet the resource demands.
Keywords: #phi4, A100, AI model, Apache 2.0 license, GPU memory, HeartCodec, HeartMuLa, RTX 3090, RTX 4090, RunPod, VRAM, Vast.ai, cloud services, hierarchical Transformer architecture, lyrics, music generation, open-source, privacy control, professional-quality
heart-mula.com 8 days ago
|
1682.
HN
Ace-Step 1.5 prompt tips: how I get more controllable music output
ACE-Step 1.5 is an open-source music generation model designed for high-quality music creation on consumer hardware. It generates a song in under two seconds on high-end GPUs such as the A100 and within ten seconds on an RTX 3090, using a hybrid architecture in which a Language Model (LM) acts as a planner, translating user inputs into detailed song blueprints that direct a Diffusion Transformer (DiT). The model supports diverse generation styles, languages, and editing capabilities with modest VRAM requirements.
Key features of ACE-Step 1.5 include ultra-fast music synthesis, flexible audio durations, batch processing, and extensive stylistic control across over a thousand instruments. It provides advanced functionalities such as cover generation, vocal-to-BGM conversion, metadata manipulation, and multi-language lyric support. Access to the model is facilitated through Python on CUDA GPU platforms, with launch scripts tailored for various systems, including a portable package option for Windows users. Depending on VRAM availability, different LM models are recommended to balance performance and quality.
The developers at ACE Studio and StepFun emphasize responsible use of ACE-Step 1.5, highlighting potential risks such as copyright infringement and cultural insensitivity. Users are encouraged to ensure originality and adherence to legal standards. Comprehensive documentation and multilingual support are available on GitHub, ensuring robust user assistance and guidance.
Keywords: #phi4, ACE-Step, CUDA, DiT models, Diffusion Transformer, GPU VRAM, GitHub Pages, Gradio UI, Hugging Face, LM models, Language Model, LoRA training, REST API, benchmarking, copyright infringement, cultural diversity, editing capabilities, evaluation metrics, hybrid architecture, licensing, modelScope, multi-language lyrics, music generation, open-source, reinforcement learning, stylistic control
github.com 9 days ago
https://github.com/ace-step/ACE-Step-1.5 9 days ago
http://rochus-keller.ch/Diverses/Ace-Step-v1.5_demo1.mp 8 days ago
http://rochus-keller.ch/Diverses/Ace-Step-v1.5_demo2.mp 8 days ago
https://rochus-keller.ch/?p=1428 8 days ago
https://mordenstar.com/blog/dutyfree-shop 8 days ago
https://mordenstar.com/blog/screwdriver-sonata 8 days ago
|
1714.
HN
Writing an LLM from scratch, part 32a – Interventions: training a baseline model
In this segment of the series, the author outlines their approach to developing a baseline model for training language models from scratch using an RTX 3090 and later transitioning to cloud-based training on an 8x A100 machine for faster experimentation. Key interventions considered include dropout settings, learning rates, precision adjustments, batch sizes, bias adjustment in weight matrices, gradient clipping, and learning rate optimization. The author strategically narrows their focus by excluding extended data and multiple epochs, opting instead to maintain a consistent random seed for reproducibility while removing periodic validation to streamline training. A notable issue encountered was loss spikes attributed to exploding gradients, prompting the implementation of gradient clipping as an initial intervention to assess its impact on performance. Following these refinements, the model exhibited slight improvements in test dataset loss compared to prior iterations. The author's ongoing strategy involves testing various interventions to identify effective techniques for optimizing language model training from scratch.
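The gradient-clipping intervention itself is a one-line change in a PyTorch training loop; the toy model, optimizer, and data below are stand-ins for illustration, not the series' actual code.

```python
# Hedged sketch of gradient clipping in a PyTorch training step.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                                    # toy stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(10)]

for inputs, targets in batches:
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    # Rescale gradients so their global L2 norm is at most 1.0, suppressing
    # the loss spikes caused by exploding gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```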
Keywords: #phi4, Hugging Face Hub, LLM, attention weight biases, baseline model, batch size, cloud training, dropout, exploding gradients, gradient clipping, interventions, precision, random seed, training
www.gilesthomas.com 9 days ago
|
2135.
HN
Ace-Step 1.5: Pushing the Boundaries of Open-Source Music Generation
ACE‑Step v1.5 is an open‑source music foundation model that achieves commercial‑grade generation on consumer GPUs, producing a full 4‑minute track in roughly 2 seconds on an A100 and under 10 seconds on an RTX 3090 while consuming less than 4 GB of VRAM, making it executable locally without cloud dependence. Its hybrid design couples a language‑model planner that crafts detailed song blueprints (metadata, lyrics, and captions via Chain‑of‑Thought reasoning) with a Diffusion Transformer (DiT) that renders the audio, enabling precise stylistic control, cover generation, repainting, vocal‑to‑background‑music conversion, and multilingual prompt compliance in over 50 languages. Users can personalize the model by training a lightweight LoRA on a few tracks, imprinting their own style, and ACE‑Step incorporates intrinsic alignment via internal reinforcement learning to avoid external reward models or human‑feedback bias. Performance comparisons in the project's Table 1 place ACE‑Step in the top tier across multiple quality metrics (Alignment, Lyric, Coherence, Memory, etc.), frequently scoring highest or second‑highest against commercial and open‑source peers such as Suno‑v5, Mureka‑V7.6, and MinMax‑2.0. Its speed advantage (10–120x faster than rivals that need 2–4 minutes or more), combined with an intuitive interface featuring collapsible lyric previews and example captions, positions it as a versatile, high‑quality tool for musicians, producers, and content creators.
Keywords: #gpt-oss:20b, A100, ACE-Step, Align, AudioBox, CE, CU, Chain-of-Thought, Cla, Coh, DiT, Diffusion, Editing, Generation Speed, LM, LoRA, Mem, Model, Mus, Music Generation, Open-Source, PC, PQ, Prompt, RTX 3090, Reinforcement Learning, SongEval, VRAM
ace-step.github.io 12 days ago
|
2316.
HN
Ace-Step 1.5
ACE‑Step 1.5 is an open‑source, lightweight music foundation model that achieves near‑commercial audio quality on consumer GPUs, generating a full song in under 2 seconds on an A100 (0.5–10 s depending on settings) and under 10 seconds on an RTX 3090 while requiring less than 4 GB of VRAM for local use. Its hybrid architecture first uses a language model to encode song structure, metadata, lyrics, and captions into a chain‑of‑thought plan, then hands these outputs to a Diffusion Transformer (DiT) for synthesis, with intrinsic reinforcement learning ensuring bias‑free alignment and strict prompt adherence.
The system supports audio lengths from 10 s to 10 min (up to 600 s), over 50 languages, cover generation, track repainting, vocal‑to‑BGM conversion, and personalization via LoRA fine‑tuning from only a handful of songs; an hour‑long, one‑click LoRA training run on an RTX 3090 suffices for eight songs. It also enables batch and multi‑track generation of up to eight songs simultaneously, offers more than 1,000 instrument/style options with fine‑tuned timbre control, and provides versatile controls such as reference audio, track separation, metadata editing (duration, BPM, key, time signature), simple‑mode drafts, query rewriting, audio understanding (extracting BPM, key, captions), auto‑LRC generation, and quality scoring.
Deployment is streamlined: a portable Windows package bundles a Python 3.11 environment, CUDA 12.8, CPython or MPS support, and launch scripts (`start_gradio_ui.bat`, `start_api_server.bat`) that auto‑detect the runtime, install the `uv` package manager, manage Git updates, set the language, download the model from HuggingFace or ModelScope (or fall back to an auto‑chosen source), and allow custom environment variables (`LANGUAGE`, `DOWNLOAD_SOURCE`, `CHECK_UPDATE`, `CONFIG_PATH`, `LM_MODEL_PATH`, `INIT_LLM`). The API runs at `http://localhost:8001`, and the Gradio UI can be launched with comprehensive command‑line flags (`--port`, `--server-name`, `--share`, `--language`, `--init_service`, `--init_llm`, `--config_path`, `--lm_model_path`, `--offload_to_cpu`, `--download-source`, `--enable-api`, `--api-key`, `--auth-username`, `--auth-password`), with defaults managed via a `.env` file, facilitating both unauthenticated and authenticated local or public deployment.
Optional LLM loading is controlled by the `ACESTEP_INIT_LLM` environment variable (`auto`, `true/1/yes`, or `false/0/no`) and `ACESTEP_LM_MODEL_PATH` for specifying the model, while the GPU optimization stack (offloading, quantization, batching limits) remains active, allowing Intel integrated GPUs (e.g., Ultra 9 285H) to run with quantized inference (nanovllm unsupported) and anticipating support for Intel discrete GPUs. The first run automatically downloads a strictly versioned checkpoint (e.g., `acestep‑5Hz‑lm‑0.6B`, `‑1.7B`, `‑4B`) and a suite of DiT variants (e.g., `acestep‑v15‑base`, `‑sft`, `‑turbo`, `‑turbo‑rl`) that are auto‑selected based on available VRAM (≤6 GB: DiT only; 6–12 GB: 0.6 B LM; 12–16 GB: 1.7 B LM; ≥16 GB: 4 B LM), providing a scalable quality‑versus‑memory trade‑off. Developers can streamline dependencies with `uv add` and `uv sync --upgrade`, and the included config snippets illustrate how to set LLM policies and serving modes, offering an automated, GPU‑aware, and highly configurable pipeline for text, audio, and multimodal generation.
Keywords: #gpt-oss:20b-cloud, ACE-Step, Audio Generation, Batch Generation, CUDA 12.8, Diffusion Transformer, Gradio, LoRA Training, Multi-Track Generation, Python 3.11, Query Rewriting, REST API, Vocal2BGM
github.com 13 days ago
|