Collection of large model daily reports from January 5th to January 7th

News · 5 months ago · AIWindVane

Collection of large model daily reports from January 5th to January 7th: with the explosion of large models (LLMs), can we combine different models to achieve a 1+1>2 effect? How do language models perceive time? Learn about “time vectors”. The New Moore Era: 2024 LLM conjectures.

With the explosion of large models (LLMs), can we combine different models to achieve a 1+1>2 effect?



Today’s large language models (LLMs) are all-round performers: they handle commonsense and factual reasoning, draw on world knowledge, and generate coherent text. Building on these basic capabilities, researchers have made a series of efforts to improve the models, fine-tuning them for domain-specific skills such as code generation, copy editing, and solving math problems. But these domain-specific models raise thorny issues: a model may excel at standard code generation yet struggle with general logical reasoning, and vice versa. To address the training-cost and data challenges of retraining, Google proposed and studied a practical setting for model composition in which (i) researchers have access to one or more augmented models and an anchor model, (ii) the weights of neither model may be modified, and (iii) only a small amount of data representing the combined skills of the given models is available.
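The setting above (frozen anchor and augmented models, with only a small bridge trained on combined-skill data) can be sketched as follows. This is a minimal toy illustration using made-up linear "models" and a learned scalar gate; Google's actual method composes intermediate representations rather than final outputs, so treat every name and shape here as an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two frozen "models": fixed random maps standing in for an anchor LLM
# and a domain-augmented LLM. Their weights are never updated.
W_anchor = rng.normal(size=(4, 8))
W_augment = rng.normal(size=(4, 8))

def anchor(x):
    return np.tanh(x @ W_anchor)      # frozen anchor model

def augmented(x):
    return np.tanh(x @ W_augment)     # frozen augmented model

# The only trainable parameters: a small gate that decides, per example,
# how to mix the two frozen models' representations. In the real setting
# this bridge would be fit on the small combined-skill dataset.
w_gate = np.zeros(16)

def combine(x):
    a, b = anchor(x), augmented(x)
    h = np.concatenate([a, b], axis=-1)
    g = 1.0 / (1.0 + np.exp(-(h @ w_gate)))   # scalar gate in (0, 1)
    return g[:, None] * a + (1 - g[:, None]) * b
```

Because the gate output stays in (0, 1), the combined representation is always a convex mixture of the two frozen models, so neither model's behavior is destroyed by the composition.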

Reaching through the screen becomes reality: Audio2Photoreal generates realistic expressions and movements from dialogue



When you chat with a friend through a cold phone screen, you have to guess their tone of voice; as they speak, their expressions and even gestures may appear in your mind. A video call would obviously be best, but in practice you cannot make one at any time. Now imagine that chatting with a remote friend meant facing not cold on-screen text or an expressionless avatar, but a realistic, dynamic, expressive digital human, one that perfectly reproduces your friend’s smile, gaze, and even subtle body movements. Would that not feel closer and warmer? It truly embodies the saying, “I will crawl along the network cable to find you.” This is not science fiction, but a technology that can now be realized.

Google Image Generation AI Masters Multimodal Instructions



We have all seen how important instruction fine-tuning is when working with large language models (LLMs). Applied properly, instruction fine-tuning lets an LLM help us with a wide variety of tasks, turning it into a poet, a programmer, a playwright, a research assistant, or even an investment manager. Now that large models have entered the multimodal era, is instruction fine-tuning still effective? For example, can we control image generation through multimodal instructions? Unlike language generation, image generation involves multiple modalities from the start; can a model effectively grasp this multimodal complexity? To address the problem, Google DeepMind and Google Research propose a multimodal instruction method for image generation, which interleaves information from different modalities to express the conditions for generating an image.
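The key idea, interleaving conditions from different modalities into one instruction, can be sketched as below. The marker tokens and the list-of-pairs format are hypothetical stand-ins invented for this example, not the paper's actual representation, and the embeddings are placeholder numbers:

```python
# Hypothetical sketch: flatten a multimodal instruction into one interleaved
# condition sequence that an image generator could attend over in order.

def build_condition(parts):
    """parts: list of (modality, payload) pairs, e.g. ("text", "...") or
    ("image", image_embedding). Returns a flat sequence with modality
    markers so text and image conditions keep their interleaved order."""
    seq = []
    for modality, payload in parts:
        seq.append(f"<{modality}>")    # marker telling the model what follows
        seq.append(payload)
        seq.append(f"</{modality}>")
    return seq

instruction = build_condition([
    ("text", "render the scene in the style of"),
    ("image", [0.12, -0.53, 0.88]),    # stand-in for a style-image embedding
    ("text", "with the subject from"),
    ("image", [0.07, 0.44, -0.19]),    # stand-in for a subject-image embedding
])
```

The point of keeping the interleaved order, rather than concatenating all text and then all images, is that the instruction's meaning depends on which image each text fragment refers to.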

What’s new in RAG, the technique that makes up for large models’ shortcomings? This review lays it out clearly



The team of researcher Wang Haofen at Tongji University and the team of Professor Xiong Yun at Fudan University have released a survey of Retrieval-Augmented Generation (RAG) that comprehensively organizes the field, from core paradigms and key technologies to future development trends. The work draws a clear blueprint of RAG technology for researchers and points out directions for future research and exploration. At the same time, it serves as a reference for developers, helping them identify the strengths and weaknesses of different techniques and showing how to use them most effectively across diverse application scenarios.
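The core RAG loop the survey organizes, retrieve relevant documents, then condition generation on them, can be sketched in a few lines. This is a minimal illustration: the bag-of-words similarity and prompt template below are toy stand-ins for a real embedding model and a real LLM call:

```python
import math
from collections import Counter

# Minimal RAG sketch: rank documents against the query, then prepend the
# top hits as context in the prompt that would be sent to the LLM.

def embed(text):
    # Toy "embedding": a bag-of-words term-count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "RAG augments generation with retrieved documents.",
    "Video generation must model the time dimension.",
    "Weight interpolation can adapt models to new time periods.",
]
prompt = build_prompt("How does RAG help generation?", docs)
```

A production system would swap in a dense embedding model and a vector index for `embed` and `retrieve`, but the paradigm, retrieve then generate, is exactly this shape.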

No text annotation needed: TF-T2V cuts the cost of mass-producing AI video! Jointly created by Huazhong University of Science and Technology, Alibaba, and others



In just the past two years, with the release of large-scale image-text datasets such as LAION-5B, image generation methods with astonishing results, including Stable Diffusion, DALL-E 2, ControlNet, and Composer, have emerged one after another, and the field of image generation is booming. Compared to image generation, however, video generation still faces huge challenges. First, video generation must process higher-dimensional data and handle the temporal modeling problems introduced by the extra time dimension, so more video-text pair data is needed to drive the learning of temporal dynamics. But accurate temporal annotation of video is very expensive, which limits the scale of video-text datasets: the existing WebVid10M dataset contains 10.7M video-text pairs, far short of the LAION-5B image dataset in scale, seriously restricting the scaling of video generation models. To address these problems, a joint research team from Huazhong University of Science and Technology, Alibaba Group, Zhejiang University, and Ant Group recently released the TF-T2V video generation solution.

How do language models perceive time? Learn about “time vectors”



How exactly does a language model perceive time? How can we use a language model’s perception of time to better control output or even understand our brains? A recent study from the University of Washington and the Allen Institute for Artificial Intelligence provides some insights. Their experimental results show that temporal changes are encoded to a certain extent in the weight space of fine-tuned models, and weight interpolation can help customize language models to adapt to new time periods.
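The weight-interpolation finding above can be sketched concretely: a time vector is the difference between weights fine-tuned on one time period and the pretrained weights, and interpolating between two time vectors can target the period in between. The tiny two-number "weights" below are toy values standing in for full model parameters:

```python
import numpy as np

# Toy weights: each dict stands in for a full model's parameter tensors.
pretrained = {"w": np.array([0.0, 0.0])}
ft_2012 = {"w": np.array([1.0, 0.2])}   # fine-tuned on 2012-era text (toy values)
ft_2016 = {"w": np.array([0.4, 1.0])}   # fine-tuned on 2016-era text (toy values)

def time_vector(finetuned, base):
    # A time vector = fine-tuned weights minus pretrained weights.
    return {k: finetuned[k] - base[k] for k in base}

def interpolate(base, tv_a, tv_b, alpha):
    # alpha=0 recovers period A, alpha=1 recovers period B; values in
    # between aim at the intervening years without any new training.
    return {k: base[k] + (1 - alpha) * tv_a[k] + alpha * tv_b[k] for k in base}

tau_2012 = time_vector(ft_2012, pretrained)
tau_2016 = time_vector(ft_2016, pretrained)
model_2014 = interpolate(pretrained, tau_2012, tau_2016, alpha=0.5)
```

The appeal of the technique is that adapting to a new time period becomes cheap arithmetic on existing checkpoints instead of a fresh fine-tuning run.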

The New Moore Era: 2024 LLM Conjectures



What are the most important trends of the next 5-10 years? The birth of ChatGPT offered one answer to this question, and sent a clear signal to the future digital ecosystem: AI will be the core of future technological innovation and business-model change. As for how LLMs will change in 2024, no one has a definitive answer. The only near-certainty is that the “New Moore’s Law” holds: model capability improves by one to two generations every 1-2 years, the cost of model training falls to 1/4 of its previous value every 18 months, and the cost of inference falls to 1/10 of its previous value every 18 months.
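The figures above imply a simple geometric decay, which the following sketch computes. The decay rates are the article's conjecture, not measured data, and the function name and starting cost are illustrative:

```python
# "New Moore's Law" arithmetic: cost multiplies by a fixed factor
# every 18 months, so after t months it has decayed geometrically.

def projected_cost(initial_cost, months, factor_per_18_months):
    """Cost after `months`, if it multiplies by the factor every 18 months."""
    return initial_cost * factor_per_18_months ** (months / 18)

# Starting from an arbitrary cost of 100, after 3 years (two 18-month steps):
train_after_3y = projected_cost(100.0, 36, 1 / 4)    # training: x1/4 per step
infer_after_3y = projected_cost(100.0, 36, 1 / 10)   # inference: x1/10 per step
```

Two 18-month steps compound the factor twice, so under the conjecture training cost drops to 1/16 and inference cost to 1/100 of the original over three years.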

Mass General Brigham: ChatGPT reaches 71.7% accuracy in clinical decision-making!



Mass General Brigham, one of the largest non-profit medical institutions in the United States, released a research paper on applying ChatGPT to clinical decision-making. According to the hospital, from proposing diagnoses and recommending diagnostic tests through to final diagnosis and care-management decisions, ChatGPT achieved an overall accuracy of 71.7%, a surprising performance across the entire clinical decision-making workflow. Notably, compared with initial diagnosis, ChatGPT showed its highest accuracy, 76.9%, on the final diagnosis task.
