3/31/2023

Posted by Jiahui Yu, Senior Research Scientist, and Jing Yu Koh, Research Software Engineer, Google Research

In recent years, natural language processing models have dramatically improved their ability to learn general-purpose representations, which has resulted in significant performance gains for a wide range of natural language generation and natural language understanding tasks. In large part, this has been accomplished through pre-training language models on extensive unlabeled text corpora. This pre-training formulation does not make assumptions about input signal modality, which can be language, vision, or audio, among others. Several recent papers have exploited this formulation to dramatically improve image generation results through pre-quantizing images into discrete integer codes (represented as natural numbers), and modeling them autoregressively (i.e., predicting sequences one token at a time). In these approaches, a convolutional neural network (CNN) is trained to encode an image into discrete tokens, each corresponding to a small patch of the image. A second-stage CNN or Transformer is then trained to model the distribution of the encoded latent variables. The second stage can also be applied to autoregressively generate an image after training. But while such models have achieved strong performance for image generation, few studies have evaluated the learned representation on downstream discriminative tasks (such as image classification).

In "Vector-Quantized Image Modeling with Improved VQGAN", we propose a two-stage model that reconceives traditional image quantization techniques to yield improved performance on image generation and image understanding tasks. In the first stage, an image quantization model, called VQGAN, encodes an image into lower-dimensional discrete latent codes. Then a Transformer model is trained to model the quantized latent codes of an image. This approach, which we call Vector-quantized Image Modeling (VIM), can be used for both image generation and unsupervised image representation learning. We describe multiple improvements to the image quantizer and show that training a stronger image quantizer is a key component for improving both image generation and image understanding.

Vector-Quantized Image Modeling with ViT-VQGAN

One recent, commonly used model that quantizes images into integer tokens is the Vector-quantized Variational AutoEncoder (VQVAE), a CNN-based auto-encoder whose latent space is a matrix of discrete learnable variables, trained end-to-end. VQGAN is an improved version of this approach that introduces an adversarial loss to promote high-quality reconstruction. VQGAN uses transformer-like elements in the form of non-local attention blocks, which allows it to capture distant interactions using fewer layers. In our work, we propose taking this approach one step further by replacing both the CNN encoder and decoder with ViT. In addition, we introduce a linear projection from the output of the encoder to a low-dimensional latent variable space for lookup of the integer tokens. Specifically, we reduced the encoder output from a 768-dimension vector to a 32- or 8-dimension vector per code, which we found encourages the decoder to better utilize the token outputs, improving model capacity and efficiency.

Overview of the proposed ViT-VQGAN (left) and VIM (right), which, when working together, are capable of both image generation and image understanding. In the first stage, ViT-VQGAN converts images into discrete integers, which the autoregressive Transformer (Stage 2) then learns to model. Finally, the Stage 1 decoder is applied to these tokens to enable generation of high quality images from scratch.

With our trained ViT-VQGAN, images are encoded into discrete tokens represented by integers, each of which encompasses an 8x8 patch of the input image. Using these tokens, we train a decoder-only Transformer to predict a sequence of image tokens autoregressively. This two-stage model, VIM, is able to perform unconditioned image generation by simply sampling token-by-token from the output softmax distribution of the Transformer model. VIM is also capable of performing class-conditioned generation, such as synthesizing a specific image of a given class (e.g., a dog or a cat).
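The Stage 1 token lookup described in the post (project the encoder output to a low-dimensional space, then take the index of the nearest codebook entry) can be sketched in NumPy. The codebook size, the 256x256 input resolution, and the random weights below are illustrative assumptions, not values from the paper; only the 768-to-32 projection and the 8x8 patch size come from the text.

```python
import numpy as np

# Sizes loosely following the post: the ViT encoder emits 768-d vectors,
# linearly projected to 32-d "factorized" codes before the nearest-neighbor
# codebook lookup. VOCAB and the 256x256 resolution are assumptions.
ENC_DIM, CODE_DIM, VOCAB = 768, 32, 8192
N_PATCHES = (256 // 8) ** 2   # one token per 8x8 patch -> 32x32 = 1024 tokens

rng = np.random.default_rng(0)
proj = rng.normal(size=(ENC_DIM, CODE_DIM)) / np.sqrt(ENC_DIM)  # linear projection
codebook = rng.normal(size=(VOCAB, CODE_DIM))                   # learned in practice

def quantize(enc_out):
    """Map encoder outputs (n, ENC_DIM) to integer token ids (n,)."""
    z = enc_out @ proj  # (n, CODE_DIM) low-dimensional latents
    # Squared L2 distance to every codebook entry, computed without
    # materializing the full (n, VOCAB, CODE_DIM) broadcast.
    d = (z**2).sum(1, keepdims=True) - 2 * z @ codebook.T + (codebook**2).sum(1)
    return d.argmin(axis=1)

tokens = quantize(rng.normal(size=(N_PATCHES, ENC_DIM)))
print(tokens.shape)  # (1024,)
```

Searching in the 32-dimensional projected space rather than the raw 768-dimensional encoder space is the point of the low-dimensional lookup the post describes: the distance computation touches far fewer values per code.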
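Unconditioned generation in VIM samples image tokens one at a time from the Transformer's output softmax. A minimal sketch of that loop, with a random stand-in for the Stage 2 decoder-only Transformer and an assumed vocabulary size:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8192  # illustrative codebook size, not the paper's value

def next_token_logits(prefix):
    """Stand-in for the Stage 2 decoder-only Transformer, which would
    return logits over the token vocabulary given the tokens so far."""
    return rng.normal(size=VOCAB)

def sample_tokens(seq_len):
    """Unconditioned generation: sample token-by-token from the softmax."""
    tokens = []
    for _ in range(seq_len):
        logits = next_token_logits(tokens)
        p = np.exp(logits - logits.max())
        p /= p.sum()                         # softmax over the vocabulary
        tokens.append(int(rng.choice(VOCAB, p=p)))
    return tokens  # the Stage 1 decoder would render these into pixels

seq = sample_tokens(16)  # short sequence for illustration
print(len(seq))  # 16
```

In the real model each sampled token is appended to the Transformer's input before predicting the next one, and the finished token sequence is handed to the Stage 1 ViT-VQGAN decoder to produce the image.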