2 Sources
[1]
World of Tech This Week - Mistral's Pixtral, Replit Agent, Hailuo Minimax Hollywood Video Model, Carpentopod and more
Seven years ago, a team of research scientists at Google Research and Google Brain released a paper titled "Attention Is All You Need". In it, they proposed a "simple network architecture" called the Transformer, which went on to become the pivotal, legendary architecture that now powers every large language model on the planet. Such is the nature of technology: one research paper is enough to change the world. What kind of advancements are waiting for us in the future? Will there be another paper like this one? More importantly, can we afford to miss it? That is why we come to you weekly with all the latest updates from the world of technology, right from the edge of tomorrow. Let's dive into what happened this week.

Pixtral 12B v0.1 is Mistral's first multi-modal open-source LLM

Mistral is a French AI start-up known for its highly popular series of LLMs, mainly because they are open source. Anyone can download and start using them, and the releases were made so easy to grab via torrents that Mistral earned a lot of fame among open-source and LLM enthusiasts alike. A while back it was Meta that launched a huge open-source model, Llama 3, with roughly 400 billion parameters. Competing against it, Mistral has now come up with its first multi-modal LLM, "Pixtral 12B v0.1", a comparatively lighter model with 12 billion parameters. In LLM-speak, the parameter count is a rough measure of a model's size and capacity. In Pixtral's case, it is clear that the text side is built on the Nemo 12B model that Mistral already offers for download. Since the news is hot off the press, many GPU-equipped enthusiasts are still downloading the 28GB torrent and developers have yet to host it on capable servers, so the jury is still out on Pixtral's capabilities. But this much is clear: the model can understand not only text but images as well, which means it is only a matter of time until someone makes it generate images too. For the geeks, the specifications are as follows:

1. Text backbone: Mistral Nemo 12B
2. Vision adapter: 400M parameters
3. Uses GeLU (for the vision adapter) and 2D RoPE (for the vision encoder)
4. Larger vocabulary: 131,072 tokens
5. Three new special tokens: 'img', 'img_break', 'img_end'
6. Image size: 1024 x 1024 pixels
7. Patch size: 16 x 16 pixels
8. Tokenizer support in mistral_common

We will update you with more details next week, once developers and enthusiasts start benchmarking this new piece of code. Surely, the LLM world will be happy to have more competition!
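To put those numbers in rough perspective, here is a minimal back-of-the-envelope sketch of how many tokens a single image might occupy. It assumes one token per 16 x 16 patch plus one row-terminator token per patch row; that layout is inferred from the special tokens listed above, not from official Mistral documentation.

```python
# Unofficial, back-of-the-envelope estimate of Pixtral's per-image token budget.
# Assumes one token per 16x16 patch plus one terminator token per patch row
# ('img_break' for each row, with 'img_end' closing the image) -- an assumption
# based on the published special tokens, not on Mistral's documentation.
IMAGE_SIZE = 1024   # pixels per side (from the published specs)
PATCH_SIZE = 16     # pixels per patch side (from the published specs)

def image_token_count(width: int = IMAGE_SIZE, height: int = IMAGE_SIZE) -> int:
    cols = width // PATCH_SIZE            # patches per row
    rows = height // PATCH_SIZE           # patch rows
    patch_tokens = rows * cols            # one token per patch
    terminator_tokens = rows              # one row terminator per patch row
    return patch_tokens + terminator_tokens

if __name__ == "__main__":
    # 64 * 64 patches + 64 terminators = 4160 tokens for a full 1024x1024 image
    print(image_token_count())
```

Even as a rough guess, a single full-resolution image would occupy a few thousand tokens of context, which is why images get their own dedicated special tokens rather than being squeezed through the text vocabulary.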
Develop apps with zero coding knowledge with "Replit Agent"

While "Cursor AI" was all the buzz a few weeks back, Replit, the online IDE with support for more than 50 programming languages, has launched its own AI coding assistant called "Replit Agent". Announced by CEO Amjad Masad, the Agent, already available as an early-access feature, helps at every step of building a program from scratch. The beauty is that, thanks to the power of LLMs, it makes coding so simple that all you need to do is write prompts in natural language in a text box. The coder can ask for anything they want, and the Agent will just do it. In a demo video, Amjad asks the Agent to "make an app with local landmarks", to which he wants to add a search bar. Once the prompt is typed, there is a live view of what the Agent is doing in real time as it writes code and uses the same repositories a developer usually would. Nothing is obscured, because whatever the Agent does is fully visible to the coder in real time. In the chat window, the coder can give the Agent feedback, and it makes changes on the fly until all the requirements are fulfilled. Essentially, the Replit Agent performs the same actions a developer would, just much faster, and it can even go as far as letting someone with no coding knowledge simply "ask" the Agent to develop a program for them. Such a wild feature is now a reality in Replit, which is already well known among developers and coding enthusiasts. "Any time you have an idea, it doesn't have to be a fleeting idea, you can just make it happen and the code is all yours, it's there..." says Amjad in his intro video, explaining the inspiration behind the feature. In just two minutes of the demo video, he adds a complex search feature to his "landmarks near me" app and deploys it to production. Ever heard of such a rapid time-to-market scenario? Well, that's what Replit Agent is for, and it's already getting rave reviews from developers and enthusiasts who want to build prototypes in an instant. Have you tried "Cursor" or "Replit Agent" yet?

Hailuo "Minimax" AI - Hollywood production without a crew

Once again, the Chinese have changed the game. A month back, "Kling AI" made waves in the GenAI video space, becoming an extremely formidable competitor to US-based players like Luma Labs and Runway. Now it's the turn of "Hailuo Minimax" to dominate the GenAI ecosystem this week. What's so special? First of all, it's an entirely text-to-video model: you type just the prompts, and with no image guidance at all, the video is generated. And the videos it is generating are easily top-tier in the current landscape, mainly because there is no artificial slow-motion effect, which is currently the pitfall of almost every other video model. Check out their demo video for the kind of results that are possible with this model. Since it is free to try, a lot of users have been experimenting and sharing some insane results. As one thread shows, the videos generated by this model have a particular cinematic feel, suggesting that a lot of films were used to train it. In another post, the output is simply stunning: the colors, the movement, everything, when prompted right, seems to be spot on. Imagine creating your own movie with this AI model, with nothing but a screenplay in your hands. That is exactly what the Hailuo Minimax video model is going for. And this is just the beginning; after more updates, we are sure this will be one of the best video models to have come out of China. Hot on the heels of this update, we also have Adobe Firefly, which just came out with its own video model featuring safe, "copyright-free" generation. We will have more on this next week.

AI that can do lip-reading and make captions

Symphonic Labs is an AI startup based in San Francisco and Waterloo (Canada) that works on tools for multi-modal speech understanding. They have released a new tool on a website named "readtheirlips", which is pretty self-explanatory.
It's a very simple yet extremely powerful tool: all you need to do is upload a video you want to decipher, and it will transcribe captions, not with convoluted audio-processing algorithms but with complex vision algorithms that read lips using motion-capture-like tech. Imagine a scenario where you have footage taken from afar, or footage with badly distorted audio. It's nearly impossible to make captions for such videos, as many subtitling specialists know, having faced these situations in countless documentaries. This is the exact problem the Symphonic Labs tool intends to solve, with lip-reading. Using a model trained specifically on motion data of faces, the tool works out the movements of the eyes, cheekbones, and mouth to reconstruct captions for that particular video sequence. Of course, many responses voiced concern about artificial intelligence entering the domain of surveillance, in which an all-powerful state would be able to monitor the speech and expressions of all its citizens. But guess what masks are for? Not every tool deserves such scrutiny, but when it comes to the latest technology, we always face detractors who would like to delay it as much as possible and maintain the status quo. History has taught us that technological advances are inevitable, and a lip-reading solution like this might find widespread use in the forensics work of future investigative agencies.

A Walking Table?

This week's robotics update is a walking table called "Carpentopod" by designer Giliam de Carpentier. Yes, it's a real thing, and a great personal project of Giliam's, detailed on his blog. According to the designer, this pet project has been in the works for many years, going back to 2008 when he started generating various walking mechanisms in software. Since then, he has been building up his electronics skills and his woodworking skills. The result is a wireless walking wooden coffee table that can be controlled with a joystick. It features a very interesting leg-linkage mechanism that lets the table actually move its legs. The six legs on each side were designed and solved in C++ with a kinematics solver, and the linkage was refined by making candidate legs compete with each other across various parameters. Both six-legged sections are driven by motors and use simple electronics. "Is it honestly very useful? Maybe not. But is it fun to make it bring me a drink? Very much so," says Giliam, who has been flooded with requests to buy one. However, he is not making them on demand; instead, he has open-sourced his designs on his blog and documented the entire build process. Such a cool indie project, sixteen years in the making, is a sight to behold indeed.
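For the curious, here is a toy sketch of what that "legs competing with each other" search could look like. This is a generic evolutionary loop written for illustration only; it is not Giliam de Carpentier's actual C++ solver, and the parameter names and scoring function are invented placeholders.

```python
# Toy evolutionary search over leg-linkage parameters (illustration only).
# A real version would simulate the linkage's foot path with a kinematics
# solver and reward a long, flat, level stride; the fitness function below
# is a stand-in just to keep the example runnable.
import random

NUM_PARAMS = 8      # hypothetical bar lengths / pivot offsets
POP_SIZE = 50
GENERATIONS = 200

def fitness(params):
    # Placeholder score: reward parameter sets whose values sum near a target.
    return -abs(sum(params) - 4.0)

def mutate(params, scale=0.05):
    # Small random tweaks to every linkage parameter.
    return [p + random.gauss(0, scale) for p in params]

population = [[random.uniform(0.1, 1.0) for _ in range(NUM_PARAMS)]
              for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    # Rank the candidate "legs" against each other and keep the better half.
    population.sort(key=fitness, reverse=True)
    survivors = population[: POP_SIZE // 2]
    # Refill the population with mutated copies of the survivors.
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(POP_SIZE - len(survivors))]

best = max(population, key=fitness)
print("best linkage parameters:", [round(p, 3) for p in best])
```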
Apple Intelligence gains Vision

Finally, the much-celebrated update of the week: Apple Intelligence has gained new skills in vision processing, which means Apple's own implementation of LLMs has gone multi-modal. Apart from its integration with ChatGPT, Apple has introduced its own "private on-device models" for a more personalized AI experience, branding the package "Apple Intelligence"; it's an extension of Siri that is gaining a lot of new capabilities. With Visual Intelligence, the new iPhone Camera Control button acts as a trigger: just point the camera at anything and ask Siri for information about it. One of the funnier demos Apple showcased was of an iPhone user walking up to a dog, taking a photo, and then asking Siri what kind of dog it is, to which Siri answers. The multi-modal capabilities also extend to notifications and emails, with summaries generated in the notification cards where previously only truncated text was visible. Another feature Apple added is the ability to search through piles of videos and photos for specific moments using natural-language queries. This means Apple Intelligence can now understand text, audio, photos, and video, making it truly powerful in the era of on-device LLMs. Aiding it is the newly spruced-up A18 Pro chipset, with an upgraded Neural Engine compared to previous generations. With this update, Apple is moving into the AI era at full speed, with Google playing catch-up via its Gemini integrations. It will be very interesting to see how Google responds and, more importantly, how pure-play AI companies like OpenAI take it. This concludes the wrap-up of all the latest from the world of technology this week, from Team FoneArena. There is always a lot more happening in this dynamic space, so don't forget to come back next week!
[2]
This Week in AI: OpenAI's new Strawberry model may be smart, yet sluggish
Hiya, folks, welcome to TechCrunch's regular AI newsletter. If you want this in your inbox every Wednesday, sign up here.

This week in AI, OpenAI's next major product announcement is imminent, if a piece in The Information is to be believed. The Information reported on Tuesday that OpenAI plans to release Strawberry, an AI model that can effectively fact-check itself, in the next two weeks. Strawberry will be a stand-alone product but will be integrated with ChatGPT, OpenAI's AI-powered chatbot platform.

Strawberry is reportedly better at programming and math problems than other top-end generative AI models (including OpenAI's own GPT-4o), and it avoids some of the reasoning pitfalls that normally trip up those same models. But the improvements come at a cost: Strawberry is said to be slow -- quite slow. Sources tell The Information that the model takes 10-20 seconds to answer a single question.

Granted, OpenAI will likely position Strawberry as a model for mission-critical tasks where accuracy is paramount. This could resonate with businesses, many of which have grown frustrated with the limitations of today's generative AI tech. A survey this week by HR specialist Peninsula found that inaccuracies are a key concern for 41% of firms exploring generative AI, and Gartner predicts that a third of all generative AI projects will be abandoned by the end of the year due to adoption blockers.

But while some companies might not mind chatbot lag time, I think the average person will. Hallucinatory tendencies aside, today's models are fast -- incredibly fast. We've grown accustomed to this; the speed makes interactions feel more natural, in fact. If Strawberry's "processing time" is indeed an order of magnitude longer than that of existing models, it'll be challenging to avoid the perception that Strawberry is a step backward in some aspect. That's assuming the best-case scenario: that Strawberry answers questions consistently correctly. If it's still error-prone, as the reporting suggests, the lengthy wait times will be even tougher to swallow.

OpenAI is no doubt feeling the pressure to deliver as it burns through billions on AI training and staffing. Its investors and potential new backers hope to see a return sooner rather than later, one imagines. But rushing to put out an unpolished model such as Strawberry -- and considering charging substantially more for it -- seems ill-advised. The wiser move would be to let the tech mature a bit. As the generative AI race grows fiercer, though, perhaps OpenAI doesn't have that luxury.

News

Apple rolls out visual search: The Camera Control, the new button on the iPhone 16 and 16 Plus, can launch what Apple calls "visual intelligence" -- basically a reverse image search combined with some text recognition. The company is partnering with third parties, including Google, to power search results.

Apple punts on AI: Devin writes about how many of Apple's generative AI features are pretty basic when it comes down to it -- contrary to what the company's bombastic marketing would have you believe.

Audible trains AI for audiobooks: Audible, Amazon's audiobook business, said that it'll use AI trained on professional narrators' voices to generate new audiobook recordings. Narrators will be compensated for any audiobooks created using their AI voices on a title-by-title, royalty-sharing basis.
Musk denies Tesla-xAI deal: Elon Musk has pushed back against a Wall Street Journal report that one of his companies, Tesla, has discussed sharing revenue with another of his companies, xAI, so that it can use the latter's generative AI models.

Bing gets deepfake-scrubbing tools: Microsoft says it's collaborating with StopNCII -- an organization that allows victims of revenge porn to create a digital fingerprint of explicit images, real or not -- to help remove nonconsensual porn from Bing search results.

Google's Ask Photos launches: Google's AI-powered search feature Ask Photos began rolling out to select Google Photos users in the U.S. late last week. Ask Photos allows you to ask complex queries like "Show the best photo from each of the National Parks I visited," "What did we order last time at this restaurant?," and "Where did we camp last August?"

U.S. and EU sign AI treaty: At a summit this past week, the U.S., U.K., and EU signed up to a treaty on AI safety laid out by the Council of Europe (COE), an international standards and human rights organization. The COE describes the treaty as "the first-ever international legally binding treaty aimed at ensuring that the use of AI systems is fully consistent with human rights, democracy and the rule of law."

Research paper of the week

Every biological process depends on interactions between proteins, which occur when proteins bind together. "Binder" proteins -- proteins that bind to specific target molecules -- have applications in drug development, disease diagnosis, and more. But creating binder proteins is often a laborious and costly undertaking -- and comes with a risk of failure.

In search of an AI-powered solution, Google's AI lab DeepMind developed AlphaProteo, a model that predicts proteins to bind to target molecules. Given a few parameters, AlphaProteo can output a candidate protein that binds to a molecule at a specified binding site. In tests with seven target molecules, AlphaProteo generated protein binders with 3x to 300x better "binding affinity" (i.e., molecule-binding strength) than previous binder-finding methods managed to create. Moreover, AlphaProteo became the first model to successfully develop a binder for a protein associated with cancer and complications arising from diabetes (VEGF-A). DeepMind admits, however, that AlphaProteo failed on an eighth testing attempt -- and that strong binding is usually only the first step in creating proteins that might be useful for practical applications.

Model of the week

There's a new, highly capable generative AI model in town -- and anyone can download, fine-tune, and run it. The Allen Institute for AI (AI2), together with startup Contextual AI, developed a text-generating English-language model called OLMoE, which has a 7-billion-parameter mixture-of-experts (MoE) architecture. ("Parameters" roughly correspond to a model's problem-solving skills, and models with more parameters generally -- but not always -- perform better than those with fewer parameters.)

MoEs break down data processing tasks into subtasks and then delegate them to smaller, specialized "expert" models. They aren't new. But what makes OLMoE noteworthy -- besides the fact that it's openly licensed -- is the fact that it outperforms many models in its class, including Meta's Llama 2, Google's Gemma 2, and Mistral's Mistral 7B, on a range of applications and benchmarks. Several variants of OLMoE, along with the data and code used to create them, are available on GitHub.
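To make the routing idea concrete, here is a minimal sketch of a mixture-of-experts layer. It is a generic illustration, not OLMoE's actual implementation: a small router scores each token, only the top-k experts process it, and their outputs are blended using the router's weights.

```python
# Minimal mixture-of-experts routing sketch (generic illustration, not OLMoE's code).
# A router scores each token, the top-k experts process it, and the expert
# outputs are blended using the (renormalized) router probabilities.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d_model, n_experts))               # router weights
experts = [rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
           for _ in range(n_experts)]                          # toy expert weights

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (n_tokens, d_model) -> (n_tokens, d_model)"""
    logits = x @ router_w                                      # (n_tokens, n_experts)
    logits -= logits.max(axis=-1, keepdims=True)               # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t, (tok, p) in enumerate(zip(x, probs)):
        chosen = np.argsort(p)[-top_k:]                        # indices of the top-k experts
        weights = p[chosen] / p[chosen].sum()                  # renormalize their scores
        for e, w in zip(chosen, weights):
            out[t] += w * (tok @ experts[e])                   # weighted expert outputs
    return out

tokens = rng.normal(size=(4, d_model))
print(moe_layer(tokens).shape)  # (4, 16): each token only touched 2 of the 8 experts
```

The appeal is that only a fraction of the total parameters are active for any given token, so a mixture-of-experts model like OLMoE can be much cheaper to run per token than a dense model with the same total parameter count.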
Grab bag

This week was Apple week. The company held an event on Monday where it announced new iPhones, Apple Watch models, and AirPods. Here's a rundown in case you weren't able to tune in.

Apple Intelligence, Apple's suite of AI-powered services, predictably got airtime. Apple reaffirmed that ChatGPT would be integrated with the experience in several key ways. But curiously, there wasn't any mention of AI partnerships beyond the previously announced OpenAI deal -- despite Apple lightly telegraphing such partnerships earlier this summer. In June at WWDC 2024, SVP Craig Federighi confirmed Apple's plans to work with additional third-party models, including Google's Gemini, in the future. "Nothing to announce right now," he said, "but that's our general direction." It's been radio silence since.

Perhaps the necessary paperwork is taking longer to hammer out than expected -- or there's been a technical setback. Or maybe Apple's possible investment in OpenAI rubbed some would-be model partners the wrong way. Whatever the case may be, it seems that ChatGPT will be the only third-party model in Apple Intelligence for the foreseeable future. Sorry, Gemini fans.
A comprehensive look at the latest developments in AI, including new language models from Mistral and OpenAI, advancements in video generation, and the emergence of innovative AI agents and tools.

This week, the AI landscape saw significant advancements with the introduction of Mistral's Pixtral, a new multimodal AI model. Pixtral stands out for its ability to process both text and images, marking a notable step forward for open-source AI; enthusiasts are already downloading it to benchmark tasks such as visual question answering and image captioning [1].

OpenAI made waves with reports of its new language model, codenamed "Strawberry." While the model is said to exhibit enhanced capabilities in reasoning and fact-checking, early reports suggest it may be considerably slower than its predecessors. This trade-off between accuracy and speed highlights an ongoing challenge in AI development, where improvements in one area may come at the cost of another [2].

The tech world also saw the launch of Replit Agent, an AI-powered coding assistant designed to streamline the software development process. The tool builds, edits, and deploys applications directly from natural-language prompts, potentially transforming the way developers work [1].

In video generation, Hailuo Minimax, a new text-to-video model from China, demonstrated cinematic-quality output from text prompts alone, without the artificial slow-motion effect that hampers many competing models, while Adobe announced its own Firefly video model. These innovations signal a significant leap forward in AI-driven content creation, potentially revolutionizing industries from entertainment to advertising [1].

Outside of AI models, Carpentopod, a joystick-controlled walking wooden coffee table by designer Giliam de Carpentier, showed off an evolved leg-linkage mechanism, with the designs open-sourced on his blog [1].

As these advancements continue to emerge, they raise important questions about the future of work, creativity, and human-AI interaction. The development of more sophisticated language models and multimodal AI systems suggests a future where AI could play an even more significant role in many aspects of our lives, from content creation to scientific research. However, challenges such as the speed-accuracy trade-off seen in OpenAI's Strawberry model remind us that the path of AI development is not always straightforward, and there are still hurdles to overcome in creating truly efficient and versatile AI systems.
Summarized by Navi