Altman Plans Trillions in AI Infrastructure; DeepSeek Challenges US AI Dominance
Key Takeaways
- OpenAI CEO Sam Altman plans trillions in AI infrastructure, including the Stargate venture.
- DeepSeek's DeepSeek-R1 reportedly matches US AI quality at lower cost and, per its developers, outperforms OpenAI's o1 in benchmarks.
- Startups like Positron AI aim to cut AI inference costs, challenging Nvidia's dominance.
- Nous Research: Open-source AI models use 1.5-4x more tokens than closed-source models.
- Meta releases DINOv3, a self-supervised computer vision model with commercial license.
Top Stories
OpenAI CEO Sam Altman plans trillions in AI infrastructure.
On August 15, 2025, OpenAI CEO Sam Altman announced plans to invest trillions of dollars in AI infrastructure, including the Stargate venture. Altman also discussed the rocky rollout of GPT-5 and anticipates OpenAI eventually going public.
DeepSeek's DeepSeek-R1 model challenges US AI dominance.
DeepSeek claims its new AI model, DeepSeek-R1, matches the quality of its American competitors while being cheaper to develop, and offers it at no cost. Developed by Chinese researchers, DeepSeek-R1 is reported to outperform OpenAI's o1 model on benchmarks.
New AI startups aim to cut AI inference costs.
Startups like Positron AI, Groq, Cerebras Systems, and SambaNova Systems are aiming to reduce AI inference costs. The goal is to make powerful AI tools more accessible to freelancers and small businesses, potentially disrupting Nvidia's market dominance.
Nous Research study on open-source AI model token usage.
A study by Nous Research found that open-source AI models use 1.5 to 4 times more tokens than closed-source models for the same tasks. This could make them more expensive despite lower per-token costs.
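To see why higher token usage can outweigh a cheaper per-token rate, here is a minimal back-of-the-envelope comparison in Python. The prices and token counts are hypothetical placeholders, not figures from the study:

```python
# Hypothetical prices and token counts, for illustration only.
closed_price_per_1k = 0.010   # assumed $/1K tokens, closed model
open_price_per_1k = 0.004     # assumed $/1K tokens, open model
tokens_closed = 2_000         # tokens a closed model spends on one task

for overhead in (1.5, 2.0, 4.0):  # the study's reported 1.5-4x range
    tokens_open = tokens_closed * overhead
    cost_closed = tokens_closed / 1_000 * closed_price_per_1k
    cost_open = tokens_open / 1_000 * open_price_per_1k
    print(f"{overhead}x tokens: closed ${cost_closed:.4f} vs open ${cost_open:.4f}")
```

With these assumed prices, the open model is cheaper at 1.5x token usage but more expensive at 4x, which is exactly the effect the study highlights.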
Meta releases DINOv3, a self-supervised computer vision model.
Meta released DINOv3, a self-supervised computer vision model, achieving state-of-the-art performance across diverse domains. The model is designed for various vision tasks, including image classification, semantic segmentation, and object tracking, and is released under a commercial license.
AI Breakthroughs
stepfun-ai releases NextStep-1, a 14B-parameter image generation model.
stepfun-ai released NextStep-1, a 14B-parameter autoregressive model that generates images from continuous tokens. In the accompanying paper, the NextStep Team reports state-of-the-art performance.
Researchers introduce CATE-B, an open-source co-pilot system.
Researchers introduced CATE-B, an open-source co-pilot system using large language models (LLMs) to guide users through treatment effect estimation. CATE-B assists in constructing structural causal models and aims to facilitate the adoption of causal inference methods.
Survey on Large Model Empowered Embodied AI published.
A survey titled 'Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning' was published on arXiv on August 14, 2025. The survey focuses on the role of large models in enhancing embodied AI, particularly in decision-making and learning.
Researchers introduce JRDB-Reasoning, a benchmark for visual reasoning.
Researchers introduced JRDB-Reasoning, a difficulty-graded benchmark for visual reasoning in robotics. The benchmark extends the JRDB dataset with annotations for human-object interaction and geometric relationships.
Researchers propose Forgery Guided Learning (FGL) strategy.
Researchers propose a Forgery Guided Learning (FGL) strategy and Dual Perception Network (DPNet) to improve deepfake cross-domain detection. This approach shows strong generalization across various scenarios and effectively handles unknown forgery challenges.
Researchers introduce Self-Search Reinforcement Learning (SSRL).
Researchers introduce Self-Search Reinforcement Learning (SSRL), leveraging large language models (LLMs) as efficient simulators for agentic search tasks in reinforcement learning (RL).
Researchers introduce HumanSense, a benchmark for MLLMs.
Researchers introduce HumanSense, a benchmark for evaluating human-centered perception and interaction capabilities of Multimodal Large Language Models (MLLMs). The study reveals that leading MLLMs have considerable room for improvement in advanced interaction-oriented tasks.
Researchers propose HM-Talker for generating talking heads.
Researchers from Cornell University propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. Separately, M2DAO-Talker, another audio-driven talking head generation method, was published on arXiv on August 14, 2025.
Researchers propose MAGUS, a unified multi-agent framework.
Researchers propose MAGUS, a unified multi-agent framework for multimodal understanding and generation. MAGUS enables any-to-any capabilities across text, image, audio, and video and outperforms strong baselines.
Paper on integrating Reinforcement Learning with visual generative models.
A paper titled 'Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances' was submitted to arXiv on August 14, 2025. The paper discusses the integration of reinforcement learning with visual generative models, focusing on enhancing controllability, consistency, and human alignment.
Researchers investigate text dominance in MLLMs.
Researchers have discovered that multimodal large language models (MLLMs) heavily rely on text for inference, underutilizing other modalities. This phenomenon, known as text dominance, has been systematically investigated across various data modalities.
Researchers introduce HPMI, a retraining-free backdoor attack.
Researchers propose Head-wise Pruning and Malicious Injection (HPMI), a novel retraining-free backdoor attack on transformers. HPMI works by pruning the least important head and injecting a pre-trained malicious head to establish the backdoor.
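As a rough illustration of the head-swap mechanics described above, the sketch below scores attention heads by a simple weight-norm proxy, prunes the weakest one, and writes pre-trained malicious weights into its slot. The scoring rule, weight shapes, and malicious-head training are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

# Minimal sketch of the HPMI head-swap idea; importance scoring and the
# training of the malicious head are simplified stand-ins.
rng = np.random.default_rng(0)
num_heads, d_model, d_head = 8, 512, 64

# Per-head output projection weights of one attention layer.
W_o = rng.normal(size=(num_heads, d_head, d_model))

# 1. Score head importance (here: a weight-norm proxy) and pick the weakest.
importance = np.linalg.norm(W_o, axis=(1, 2))
victim = int(np.argmin(importance))

# 2. "Prune" the victim head and inject pre-trained malicious weights in its
#    place, leaving all other heads untouched -- no retraining of the host model.
malicious_head = rng.normal(size=(d_head, d_model))  # stands in for trained weights
W_o[victim] = malicious_head
print(f"replaced head {victim} (lowest importance {importance[victim]:.2f})")
```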
Researchers introduce MS-GRPO for post-training LLMs.
Researchers introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training Large Language Models (LLMs) as sequential decision-making agents. MS-GRPO improves smaller models without relying on large, computationally expensive models.
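The digest does not detail MS-GRPO's multi-step machinery, but the group-relative baseline that GRPO-style methods share is easy to sketch: sample several rollouts per prompt and normalize each reward against the group, so no separate value model is needed. The multi-step credit assignment itself is not reproduced here:

```python
import numpy as np

# Group-relative advantage at the core of GRPO-style post-training.
def group_relative_advantages(rewards):
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()        # group mean replaces a learned value model
    scale = rewards.std() + 1e-8     # normalize for stable policy updates
    return (rewards - baseline) / scale

# Four rollouts of the same prompt with episode-level rewards:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # positive for winners
```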
Researchers introduce NSegment+ for semantic segmentation.
Researchers introduce NSegment+, a novel augmentation framework that decouples image and label transformations to address implicit label noise in semantic segmentation. NSegment+ achieves mIoU gains on various datasets.
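A toy version of the decoupling idea, under heavy simplification: rather than applying one shared transform to the (image, mask) pair, perturb the label mask independently so training tolerates the boundary noise real annotations contain. NSegment+'s actual transforms are more elaborate than this pixel shift:

```python
import numpy as np

def decoupled_augment(image, mask, max_shift=2, rng=np.random.default_rng()):
    # Shared augmentation would transform image and mask identically; here the
    # mask alone gets a small random translation, simulating annotation jitter.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    noisy_mask = np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return image, noisy_mask

image = np.zeros((64, 64, 3))
mask = np.zeros((64, 64), dtype=np.int64)
mask[16:48, 16:48] = 1                 # a square "object" label
_, jittered = decoupled_augment(image, mask)
```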
Researchers propose TriFlowSR for image super-resolution.
Researchers propose TriFlowSR, a novel framework for Ultra-High-Definition Reference-Based Landmark Image Super-Resolution. They also introduce Landmark-4K, the first RefSR dataset for UHD landmark scenarios.
Paper on Sparse Point Cloud Data Processing for human action recognition.
Researchers Maimunatu Tunau, Vincent Gbouna Zakka, and Zhuangzhuang Dai submitted a paper on arXiv evaluating data processing methods for mmWave radar sensors. The paper assesses recognition accuracy and computational cost using the MiliPoint dataset.
Researchers introduce Cross-Prompt Encoder (XPE) for LLMs.
Researchers introduce the Cross-Prompt Encoder (XPE), a method to improve performance on low-performing languages in large language models. Experiments on the SIB-200 benchmark show XPE is most effective for low-performing languages.
Research framework based on "Turtle Soup" game for LLMs.
A paper introduces a research framework based on the "Turtle Soup" game to investigate the imaginative reasoning capacity of Large Language Models (LLMs). The framework includes TurtleSoup-Bench, a bilingual benchmark with 800 puzzles, and Mosaic-Agent, an agent to assess LLMs.
Researchers introduce GenOM, an LLM-based ontology alignment framework.
Yiping Song, Jiaoyan Chen, and Renate A. Schmidt submitted a paper introducing GenOM, a large language model (LLM)-based ontology alignment framework. Experiments on the OAEI Bio-ML track show GenOM achieves competitive performance.
ICE framework for in-place prompting in dLLMs.
Xiangqi Jin et al. submitted a paper introducing ICE (In-Place Chain-of-Thought Prompting with Early Exit), a novel framework for in-place prompting in diffusion large language models (dLLMs). Experiments showed ICE achieved up to 17.29% accuracy improvement with 4.12x speedup on GSM8K.
STRIDE-QA dataset for spatiotemporal reasoning in urban driving.
A new visual question answering (VQA) dataset, STRIDE-QA, for spatiotemporal reasoning in urban driving scenes was submitted to arXiv on August 14, 2025. The dataset contains 16 million QA pairs over 285K frames.
FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference.
A new paper introduces FreeKV, an algorithm-system co-optimization framework designed to enhance KV retrieval efficiency for large language models while maintaining accuracy. FreeKV employs speculative retrieval and hybrid KV layouts, achieving up to a 13x speedup compared to existing methods.
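FreeKV's speculative retrieval and hybrid layouts are not spelled out here, but the baseline such systems accelerate, attending only to the top-k cached entries ranked by query-key similarity, can be sketched in a few lines. This is a simplified single-query version with assumed shapes, not FreeKV's algorithm:

```python
import numpy as np

def topk_kv_attention(q, K, V, k=4):
    scores = K @ q                              # (cache_len,) similarity scores
    idx = np.argpartition(scores, -k)[-k:]      # indices of the k best keys
    sub = scores[idx] / np.sqrt(q.shape[0])     # scaled scores over the subset
    w = np.exp(sub - sub.max()); w /= w.sum()   # softmax over retrieved entries
    return w @ V[idx]                           # attention output from top-k only

rng = np.random.default_rng(1)
d, cache_len = 64, 1024
q = rng.normal(size=d)
K, V = rng.normal(size=(cache_len, d)), rng.normal(size=(cache_len, d))
out = topk_kv_attention(q, K, V, k=32)
```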
MIRRAMS: Learning Robust Tabular Models under Unseen Missingness Shifts.
A new paper introduces MIRRAMS, a deep learning framework designed to address the challenge of missing values in tabular data. MIRRAMS uses mutual information-based conditions to extract label-relevant information, promoting robustness against distributional shifts.
Researchers propose VSRM to promote efficient reasoning.
Researchers propose a novel rule-based verifiable stepwise reward mechanism (VSRM) to promote efficient reasoning in large reasoning models (LRMs). Experiments on AIME24 and AIME25 show substantial output length reduction while maintaining original reasoning performance.
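In the spirit of a rule-based stepwise reward, a minimal sketch: each reasoning step earns reward only if a cheap verifiable check passes, and a length penalty discourages padded traces. The paper's actual rules and reward shaping are not reproduced here:

```python
# Toy stepwise reward: verifiable per-step signal minus a length penalty.
def stepwise_reward(steps, verify, length_penalty=0.05):
    total = 0.0
    for step in steps:
        total += 1.0 if verify(step) else -1.0   # rule-based, verifiable check
    return total - length_penalty * len(steps)   # shorter traces score higher

# Toy verifier: a "step" is a string equation we can actually evaluate.
verify = lambda s: eval(s.split("=")[0]) == float(s.split("=")[1])
print(stepwise_reward(["2*3=6", "6+1=7"], verify))   # 2.0 - 0.1 = 1.9
```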
Survey of Theories and Debates on Realising Emotion in AI.
A paper discusses the concept of Artificial Emotion (AE) in Artificial Intelligence (AI) and its potential advantages. The authors review current manifestations of AE in machine learning systems and examine emotion-modulated architectures.
We-Math 2.0: A Versatile MathBook System for Visual Mathematical Reasoning.
Runqi Qiao and colleagues submitted a paper introducing We-Math 2.0, a system designed to enhance the mathematical reasoning abilities of Multimodal Large Language Models (MLLMs). The system integrates a structured mathematical knowledge system with 491 knowledge points and 1,819 fundamental principles.
ToolACE-R: Model-aware Iterative Training and Adaptive Refinement for Tool Learning.
A new version (v2) of the paper 'ToolACE-R: Model-aware Iterative Training and Adaptive Refinement for Tool Learning' was submitted to arXiv on August 14, 2025. ToolACE-R uses model-aware iterative training and adaptive refinement to improve tool invocation and performance.
Method for text-driven image generation.
A paper proposes a text-driven image generation method that integrates text-image contrastive constraints with structural guidance mechanisms to improve semantic alignment and structural consistency. Experiments on the COCO-2014 dataset confirm the method's superior performance.
Real-World AI
Mobile-friendly deep learning for plant disease detection.
Researchers have developed a mobile-friendly deep learning solution for detecting plant diseases across 101 classes of 33 crops. EfficientNet-B1 achieved 94.7% classification accuracy, suitable for mobile deployment.
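One plausible way to reproduce the reported setup with standard tooling is to fine-tune torchvision's ImageNet-pretrained EfficientNet-B1 with a 101-class head and export it for mobile. The sketch below assumes PyTorch/torchvision; the file name and export route are illustrative, not the authors' pipeline:

```python
import torch
from torch import nn
from torchvision import models

# ImageNet-pretrained EfficientNet-B1 with the head swapped for the
# article's 101 plant-disease classes (training loop omitted).
model = models.efficientnet_b1(weights=models.EfficientNet_B1_Weights.DEFAULT)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 101)
model.eval()

# Mobile-friendly export via TorchScript tracing (one plausible route).
example = torch.randn(1, 3, 240, 240)          # B1's native input resolution
scripted = torch.jit.trace(model, example)
scripted.save("plant_disease_b1.pt")           # illustrative file name
```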
DxDirector-7B, an LLM for clinical diagnosis, is introduced.
A research paper introduces DxDirector-7B, a Large Language Model (LLM) designed to reverse the physician-AI relationship in clinical diagnosis. Evaluations show DxDirector-7B achieves superior diagnostic accuracy and reduces physician workload.
UI-Venus, a UI agent using screenshots, is introduced.
A technical report introduced UI-Venus, a UI agent that takes screenshots as input and is built on a multimodal large language model. Based on Qwen2.5-VL and trained with reinforcement fine-tuning (RFT), UI-Venus achieves SOTA performance on UI grounding and navigation tasks.
Self-supervised method for temporal super-resolution of energy data.
Researchers propose a self-supervised method using Generative Adversarial Transformers (GATs) for temporal super-resolution of energy data. This method can be trained without high-resolution data and reduces RMSE by 9%.
ReviewRL: Towards Automated Scientific Review with RL.
A paper introduces ReviewRL, a reinforcement learning framework for generating scientific paper reviews. The framework uses an ArXiv-MCP retrieval-augmented context generation pipeline, supervised fine-tuning, and a reinforcement learning procedure.