Junhua Mao

Tony

OpenAI GPT-5 System Card

Dec 19, 2025

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram(+474 more)

Abstract: This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.

Pedestrian Crossing Action Recognition and Trajectory Prediction with 3D Human Keypoints

Jun 01, 2023

Jiachen Li, Xinwei Shi, Feiyu Chen, Jonathan Stroud, Zhishuai Zhang, Tian Lan, Junhua Mao, Jeonhyung Kang, Khaled S. Refaat, Weilong Yang(+2 more)

Abstract: Accurate understanding and prediction of human behaviors are critical prerequisites for autonomous vehicles, especially in highly dynamic and interactive scenarios such as intersections in dense urban areas. In this work, we aim at identifying crossing pedestrians and predicting their future trajectories. To achieve these goals, we not only need the context information of road geometry and other traffic participants but also need fine-grained information of the human pose, motion and activity, which can be inferred from human keypoints. In this paper, we propose a novel multi-task learning framework for pedestrian crossing action recognition and trajectory prediction, which utilizes 3D human keypoints extracted from raw sensor data to capture rich information on human pose and activity. Moreover, we propose to apply two auxiliary tasks and contrastive learning to enable auxiliary supervisions to improve the learned keypoints representation, which further enhances the performance of major tasks. We validate our approach on a large-scale in-house dataset, as well as a public benchmark dataset, and show that our approach achieves state-of-the-art performance on a wide range of evaluation metrics. The effectiveness of each model component is validated in a detailed ablation study.

* ICRA 2023

Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving

Dec 22, 2021

Jingxiao Zheng, Xinwei Shi, Alexander Gorban, Junhua Mao, Yang Song, Charles R. Qi, Ting Liu, Visesh Chari, Andre Cornman, Yin Zhou(+2 more)

Abstract: 3D human pose estimation (HPE) in autonomous vehicles (AV) differs from other use cases in many factors, including the 3D resolution and range of data, absence of dense depth maps, failure modes for LiDAR, relative location between the camera and LiDAR, and a high bar for estimation accuracy. Data collected for other use cases (such as virtual reality, gaming, and animation) may therefore not be usable for AV applications. This necessitates the collection and annotation of a large amount of 3D data for HPE in AV, which is time-consuming and expensive. In this paper, we propose one of the first approaches to alleviate this problem in the AV setting. Specifically, we propose a multi-modal approach which uses 2D labels on RGB images as weak supervision to perform 3D HPE. The proposed multi-modal architecture incorporates LiDAR and camera inputs with an auxiliary segmentation branch. On the Waymo Open Dataset, our approach achieves a 22% relative improvement over a camera-only 2D HPE baseline, and a 6% improvement over a LiDAR-only model. Finally, careful ablation studies and parts-based analysis illustrate the advantages of each of our contributions.

STINet: Spatio-Temporal-Interactive Network for Pedestrian Detection and Trajectory Prediction

May 08, 2020

Zhishuai Zhang, Jiyang Gao, Junhua Mao, Yukai Liu, Dragomir Anguelov, Congcong Li

Abstract: Detecting pedestrians and predicting future trajectories for them are critical tasks for numerous applications, such as autonomous driving. Previous methods either treat the detection and prediction as separate tasks or simply add a trajectory regression head on top of a detector. In this work, we present a novel end-to-end two-stage network: Spatio-Temporal-Interactive Network (STINet). In addition to 3D geometry modeling of pedestrians, we model the temporal information for each of the pedestrians. To do so, our method predicts both current and past locations in the first stage, so that each pedestrian can be linked across frames and the comprehensive spatio-temporal information can be captured in the second stage. Also, we model the interaction among objects with an interaction graph, to gather the information among the neighboring objects. Comprehensive experiments on the Lyft Dataset and the recently released large-scale Waymo Open Dataset for both object detection and future trajectory prediction validate the effectiveness of the proposed method. For the Waymo Open Dataset, we achieve a bird's-eye-view (BEV) detection AP of 80.73 and trajectory prediction average displacement error (ADE) of 33.67 cm for pedestrians, which establish the state-of-the-art for both tasks.

* CVPR 2020
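
The STINet abstract reports trajectory quality as average displacement error (ADE). As a rough illustration, assuming the common definition of ADE as the mean Euclidean distance between predicted and ground-truth waypoints at matching time steps (the exact evaluation protocol is not stated here), a minimal Python sketch with made-up waypoints might look like this:

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matched predicted and true waypoints.

    Assumes both trajectories are sequences of (x, y) points sampled at the
    same time steps; this is an illustrative definition, not the paper's code.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Hypothetical waypoints in metres; not taken from any dataset.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.4)]

ade_m = average_displacement_error(predicted, ground_truth)
print(f"ADE: {ade_m * 100:.2f} cm")
```

Reporting the value in centimetres, as the abstract does, is only a unit conversion of the same average.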

Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images

Nov 24, 2016

Junhua Mao, Jiajing Xu, Yushi Jing, Alan Yuille

Abstract: In this paper, we focus on training and evaluating effective word embeddings with both text and visual information. More specifically, we introduce a large-scale dataset with 300 million sentences describing over 40 million images crawled and downloaded from publicly available Pins (i.e. an image with sentence descriptions uploaded by users) on Pinterest. This dataset is more than 200 times larger than MS COCO, the standard large-scale image dataset with sentence descriptions. In addition, we construct an evaluation dataset to directly assess the effectiveness of word embeddings in terms of finding semantically similar or related words and phrases. The word/phrase pairs in this evaluation dataset are collected from the click data with millions of users in an image search system, and thus contain rich semantic relationships. Based on these datasets, we propose and compare several Recurrent Neural Network (RNN) based multimodal (text and image) models. Experiments show that our model benefits from incorporating the visual information into the word embeddings, and a weight sharing strategy is crucial for learning such multimodal embeddings. The project page is: http://www.stat.ucla.edu/~junhua.mao/multimodal_embedding.html

* Appears in NIPS 2016. The datasets introduced in this work will be gradually released on the project page

Attention Correctness in Neural Image Captioning

Nov 23, 2016

Chenxi Liu, Junhua Mao, Fei Sha, Alan Yuille

Abstract: Attention mechanisms have recently been introduced in deep learning for various tasks in natural language processing and computer vision. But despite their popularity, the "correctness" of the implicitly-learned attention maps has only been assessed qualitatively by visualization of several examples. In this paper we focus on evaluating and improving the correctness of attention in neural image captioning models. Specifically, we propose a quantitative evaluation metric for the consistency between the generated attention maps and human annotations, using recently released datasets with alignment between regions in images and entities in captions. We then propose novel models with different levels of explicit supervision for learning attention maps during training. The supervision can be strong when alignment between regions and caption entities is available, or weak when only object segments and categories are provided. We show on the popular Flickr30k and COCO datasets that introducing supervision of attention maps during training solidly improves both attention correctness and caption quality, showing the promise of making machine perception more human-like.

* To appear in AAAI-17. See http://www.cs.jhu.edu/~cxliu/ for supplementary material

CNN-RNN: A Unified Framework for Multi-label Image Classification

Apr 15, 2016

Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, Wei Xu

Abstract: While deep convolutional neural networks (CNNs) have shown great success in single-label image classification, it is important to note that real world images generally contain multiple labels, which could correspond to different objects, scenes, actions and attributes in an image. Traditional approaches to multi-label image classification learn independent classifiers for each category and employ ranking or thresholding on the classification results. These techniques, although working well, fail to explicitly exploit the label dependencies in an image. In this paper, we utilize recurrent neural networks (RNNs) to address this problem. Combined with CNNs, the proposed CNN-RNN framework learns a joint image-label embedding to characterize the semantic label dependency as well as the image-label relevance, and it can be trained end-to-end from scratch to integrate both information in a unified framework. Experimental results on public benchmark datasets demonstrate that the proposed architecture achieves better performance than the state-of-the-art multi-label classification model.

* CVPR 2016
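
The traditional baseline mentioned in the CNN-RNN abstract, independent per-label scores followed by thresholding or ranking, can be sketched as follows. The label names, scores, threshold, and `k` are hypothetical values chosen for illustration; this is the baseline the abstract contrasts against, not the CNN-RNN model itself.

```python
def threshold_labels(scores, threshold=0.5):
    """Keep every label whose independent classifier score meets the threshold."""
    return [label for label, score in scores.items() if score >= threshold]


def top_k_labels(scores, k=2):
    """Rank labels by score and keep the k highest, regardless of absolute value."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical per-label scores from independent classifiers for one image.
scores = {"person": 0.91, "dog": 0.62, "frisbee": 0.48, "car": 0.05}

by_threshold = threshold_labels(scores, threshold=0.5)
by_ranking = top_k_labels(scores, k=3)
print(by_threshold)
print(by_ranking)
```

Each label is decided on its own score, so nothing in this baseline can use the fact that some labels tend to co-occur; that is the label dependency the abstract says such approaches fail to exploit.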

Generation and Comprehension of Unambiguous Object Descriptions

Apr 11, 2016

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, Kevin Murphy

Abstract: We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene. Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate, our task allows for easy objective evaluation. We also present a new large-scale dataset for referring expressions, based on MS-COCO. We have released the dataset and a toolbox for visualization and evaluation; see https://github.com/mjhucla/Google_Refexp_toolbox

* We have released the Google Refexp dataset together with a toolbox for visualization and evaluation, see https://github.com/mjhucla/Google_Refexp_toolbox. Camera ready version for CVPR 2016

Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering

Nov 02, 2015

Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, Wei Xu

Abstract: In this paper, we present the mQA model, which is able to answer questions about the content of an image. The answer can be a sentence, a phrase or a single word. Our model contains four components: a Long Short-Term Memory (LSTM) to extract the question representation, a Convolutional Neural Network (CNN) to extract the visual representation, an LSTM for storing the linguistic context in an answer, and a fusing component to combine the information from the first three components and generate the answer. We construct a Freestyle Multilingual Image Question Answering (FM-IQA) dataset to train and evaluate our mQA model. It contains over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations. The quality of the generated answers of our mQA model on this dataset is evaluated by human judges through a Turing Test. Specifically, we mix the answers provided by humans and our model. The human judges need to distinguish our model from the human. They will also provide a score (i.e. 0, 1, 2, the larger the better) indicating the quality of the answer. We propose strategies to monitor the quality of this evaluation process. The experiments show that in 64.7% of cases, the human judges cannot distinguish our model from humans. The average score is 1.454 (1.918 for human). The details of this work, including the FM-IQA dataset, can be found on the project page: http://idl.baidu.com/FM-IQA.html

* Dataset released on the project page, see http://idl.baidu.com/FM-IQA.html ; NIPS 2015 camera ready version

Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images

Oct 02, 2015

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille

Abstract: In this paper, we address the task of learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions. Using linguistic context and visual features, our method is able to efficiently hypothesize the semantic meaning of new words and add them to its word dictionary so that they can be used to describe images which contain these novel concepts. Our method has an image captioning module based on m-RNN with several improvements. In particular, we propose a transposed weight sharing scheme, which not only improves performance on image captioning, but also makes the model more suitable for the novel concept learning task. We propose methods to prevent overfitting the new concepts. In addition, three novel concept datasets are constructed for this new task. In the experiments, we show that our method effectively learns novel visual concepts from a few examples without disturbing the previously learned concepts. The project page is http://www.stat.ucla.edu/~junhua.mao/projects/child_learning.html

* ICCV 2015 camera ready version. We add many more novel visual concepts in the NVC dataset and have released it, see http://www.stat.ucla.edu/~junhua.mao/projects/child_learning.html
