Current text-to-image editing models often encounter challenges with smoothly manipulating multiple attributes using a single instruction. Taking inspiration from the Chain-of-Thought prompting technique utilized in language models, we present an innovative concept known as Chain-of-Instruct Editing (CoIE), which enhances the capabilities of these models through step-by-step editing using a series of instructions. In particular, in the context of face manipulation, we leverage the contextual learning abilities of a pretrained Large Language Model (LLM), such as GPT-4, to generate a sequence of instructions from the original input, utilizing a purpose-designed 1-shot template. To further improve the precision of each editing step, we conduct fine-tuning on the editing models using our self-constructed instruction-guided face editing dataset, Instruct-CelebA. And additionally, we incorporate a super-resolution module to mitigate the adverse effects of editability and quality degradation. Experimental results across various challenging cases confirm the significant boost in multi-attribute facial image manipulation using chain-of-instruct editing. This is evident in enhanced editing success rates, measured by CLIPSim and Coverage metrics, improved by 17.86% and 85.45% respectively, and heightened controllability indicated by Preserve L1 and Quality metrics, improved by 11.58% and 4.93% respectively.
Face clustering tasks can learn hierarchical semantic information from large-scale data, which has the potential to help facilitate face recognition. However, there are few works on this problem. This paper explores it by proposing a joint optimization task of label classification and supervised contrastive clustering to introduce the cluster knowledge to the traditional face recognition task in two ways. We first extend ArcFace with a cluster-guided angular margin to adjust the within-class feature distribution according to the hard level of face clustering. Secondly, we propose a supervised contrastive clustering approach to pull the features to the cluster center and propose the cluster-aligning procedure to align the cluster center and the learnable class center in the classifier for joint training. Finally, extensive qualitative and quantitative experiments on popular facial benchmarks demonstrate the effectiveness of our paradigm and its superiority over the existing approaches to face recognition.