© 2025 TheOutpost.AI All rights reserved
Curated by THEOUTPOST
On Fri, 4 Oct, 4:03 PM UTC
4 Sources
[1]
TikTok parent company ByteDance has a tool that's scraping the web 25 times faster than OpenAI
TikTok parent company ByteDance is amassing huge volumes of web data way faster than the other major web crawlers.
ByteDance may be planning to release its own LLM, and is aggressively using its web crawler, "Bytespider," to scrape up data to train its models, Fortune reported. Bytespider showed up on the scene in April, and since then, its rate of consumption has put web scrapers from OpenAI, Google, Meta, and Anthropic to shame. Sam Crowther, CEO of Kasada, a company that specializes in bot management, told the outlet that Bytespider's scraping rate is 25 times that of OpenAI's GPTbot and 3,000 times that of ClaudeBot, which is Anthropic's web crawler for its Claude LLM. Crowther also said that Kasada's data has shown "huge spikes in scraping activity" from Bytespider in the last six weeks.
As Bytespider voraciously consumes the web, the U.S. government is trying to limit the Chinese government's potential access to American user data. In April, President Biden signed a bill that would ban TikTok unless ByteDance sold it within the year. Given ByteDance's ticking clock for selling TikTok, the sense of urgency fits the massive rate of its web crawling activity. Whether that activity is for an LLM, a better algorithm, or something else, we don't know.
What ByteDance plans to do with all of its newly mined data remains to be seen. But TikTok has launched several AI-powered features for the platform. In May, it announced a suite of tools for advertisers to create AI-generated ads, and AI-generated avatars for brands and creators. TikTok is also rumored to be working on an internal search engine with results powered by AI, possibly using ChatGPT.
[2]
TikTok's parent launched a web scraper that's gobbling up the world's online data 25-times faster than OpenAI
ByteDance looks like it's eager to make up for lost time when it comes to scraping the web for data needed to train its generative AI models. The China-based parent company of video app TikTok released its own web crawler or scraper bot, dubbed Bytespider, sometime in April, according to research from Kasada, a company that specializes in bot management for companies with online data. The existence of the bot was also confirmed by Dark Visitors, which monitors scraper bots.
ByteDance's bot has quickly become one of the most, if not the single most, aggressive scrapers on the internet, the research shows. It's scraping data at a rate that's many multiples of other major companies, such as Google, Meta, Amazon, OpenAI, and Anthropic, which use their own scraper bots to help create and improve their large language or multimodal models, known as LLMs or LMMs.
Sam Crowther, the CEO of Kasada, said that since Bytespider showed up, it has been scraping data at about 25 times the rate of GPTbot, which scrapes data for OpenAI's ChatGPT platform and underlying models. Bytespider has been scraping at 3,000 times the rate of ClaudeBot, from Anthropic, which operates the Claude platform.
As the months have gone by, Bytespider has become even more aggressive, according to Kasada. Data shows huge spikes in scraping activity from Bytespider over each of the last six weeks. Representatives of TikTok and ByteDance did not respond to emails seeking comment.
ByteDance's aggressive scraping comes despite the possibility of TikTok being banned in the U.S. in the coming months. President Joe Biden has signed legislation that requires ByteDance to sell TikTok, due to national security concerns, or shut it down.
The Bytespider bot, much like those of OpenAI and Anthropic, does not respect robots.txt, the research shows. Robots.txt is a plain-text file that publishers can place on a website that, while not legally binding in any way, is supposed to signal to scraper bots that they cannot take that website's data.
Web scraping goes back decades, mainly by search engines to gather links to web pages. But the rise of generative AI tools has added a new dimension and made the practice a prime source of lawsuits and controversy. People and organizations whose work has been scraped argue their copyright is being infringed in the process. All of the models that underlie generative AI tools were trained on massive amounts of online data, effectively everything available on the web, particularly written information. Tech companies use scraper bots to essentially copy it all for free and put it into their datasets.
"It's like they're trying desperately to catch up," Crowther said of the aggressive scraping being done by Bytespider.
Just last year, ByteDance was reportedly so far behind in the generative AI race that it was using OpenAI to help build ByteDance's own LLM, which is against OpenAI's terms of service. Earlier this year, ByteDance released a chat-based LLM called Doubao, but work on that model would have been completed prior to the accumulation of more recent training data scraped by Bytespider. It's "clear" that ByteDance is at work on a new LLM, according to one person familiar with the company.
As for what ByteDance plans to do with a new LLM, a person familiar with the company's ambitions said one goal has to do with the search function for TikTok. Last week, TikTok released an update to its current search function focused on keywords for ads, basically allowing advertisers to search in real time for words that are trending on TikTok. It allows marketers to build an ad with relevant keywords that would ostensibly help the ad show up on the screens of more users.
A new AI model with data on more recent internet trends and topics could expand and improve TikTok's search environment further, according to the person familiar with the company's ambitions. "Given the audience and the amount of use, TikTok with a search environment that is a completely biddable space with keywords and topics, that would be very interesting to a lot of people spending a ton of money with Google right now," the person said.
[3]
TikTok's owner is scraping the web 25 times faster than OpenAI
As ByteDance develops artificial intelligence models to compete in China, the bot it uses to scrape data to train those models is reportedly spiking in activity.
The TikTok owner launched its own web scraper, Bytespider, in April, and it's now scraping data multiple times faster than bots from other companies, Fortune reported, citing research from Kasada, a bot management company, and Dark Visitors, a monitor of scraper bots. Companies developing AI models, such as Google (GOOGL) and Meta (META), use scraper bots to gather data to train and improve the large language models (LLMs) and multimodal models that power the companies' AI services.
Bytespider is scraping web data about 25 times faster than OpenAI's web scraper, GPTbot, Sam Crowther, CEO of Kasada, told Fortune. Compared with Anthropic's ClaudeBot, Bytespider is 3,000 times faster. Like OpenAI's and Anthropic's bots, Bytespider ignores instructions from robots.txt, a non-legally binding plain-text file that tells web scrapers which data they can and cannot access on a website, Fortune reported.
According to Kasada's data, Bytespider has had spikes in scraping activity in the last six weeks. "It's like they're trying desperately to catch up," Crowther told Fortune. ByteDance did not immediately respond to a request for comment.
The China-based company released its AI-powered chatbot, Doubao, last August, and it's proving to be a tough competitor to homegrown rival Baidu's (BIDU) Ernie Bot. In May, ByteDance launched a series of Doubao LLMs for enterprises, which cost less than models from the company's Chinese competitors. Now, ByteDance is planning to build a new AI model using chips from China's Huawei, Reuters reported, citing three unnamed people familiar with the matter. However, a spokesperson for ByteDance previously told Quartz the company is not developing a new AI model.
The company has also designed two AI chips with Taiwan Semiconductor Manufacturing Company (TSM) that ByteDance plans to mass produce by 2026, The Information reported, citing unnamed people familiar with the matter. By producing its own chips, the company could become less dependent on Nvidia's (NVDA) pricey graphics processing units, or GPUs, which are subject to U.S. export controls, people told The Information.
[4]
With TikTok Ban Looming Large, Parent ByteDance's Web Scraping Bot Draws Attention For Being More Aggressive Than The One ChatGPT Uses: Report
ByteDance, the company behind TikTok, has introduced a powerful web scraper named "Bytespider." Launched in April, Bytespider is recognized as one of the most aggressive data collectors online, significantly outpacing other major tech firms in data collection speed.
What Happened: Research conducted by Kasada, a bot management company, and Dark Visitors, a group monitoring scraper bots, confirmed Bytespider's activity. According to Kasada CEO Sam Crowther, Bytespider collects data 25 times faster than GPTbot, utilized by OpenAI for ChatGPT, and 3,000 times faster than ClaudeBot from Anthropic, Fortune reported on Friday.
Despite the looming threat of a U.S. ban on TikTok, ByteDance continues its aggressive data collection strategy. President Joe Biden has demanded the sale or shutdown of TikTok due to national security concerns. Bytespider's disregard for robots.txt, a voluntary convention that tells scrapers which parts of a website to avoid, adds to the controversy.
The increase in web scraping is linked to ByteDance's development of a new large language model (LLM) to improve TikTok's search capabilities. A recent update to TikTok's search function allows real-time keyword searches for ads, potentially enhancing ad visibility. ByteDance has yet to respond to Benzinga's queries.
Why It Matters: The aggressive web scraping by ByteDance follows a trend among major tech companies. In June, OpenAI and Anthropic were reported to have ignored web scraping rules, bypassing the robots.txt protocol to gather free data for AI model training. This practice has sparked controversy, highlighting the tension between AI development and data privacy.
In August, NVIDIA faced scrutiny for scraping videos from platforms like YouTube to train its AI models. This revelation raised concerns about content creators' rights and the ethical implications of using publicly available data without explicit consent. Similarly, in September, Microsoft-owned LinkedIn was criticized for using user data for AI training without updating its terms of service, particularly affecting users in the U.S.
This story was generated using Benzinga Neuro and edited by Pooja Rajkumari.
ByteDance, TikTok's parent company, has launched a web scraper called Bytespider that is collecting data at rates far exceeding those of major tech companies, raising questions about its AI ambitions and data privacy concerns.
ByteDance, the parent company of TikTok, has entered the web scraping arena with a powerful new tool called Bytespider. Launched in April 2024, this web crawler has quickly become one of the most aggressive data collectors on the internet, outpacing major tech companies in its ability to gather online information [1].
According to research by Kasada, a bot management company, Bytespider is scraping at about 25 times the rate of OpenAI's GPTbot and 3,000 times the rate of Anthropic's ClaudeBot [2].
Sam Crowther, CEO of Kasada, reported significant spikes in Bytespider's scraping activity over the past six weeks, indicating an intensification of ByteDance's data collection efforts [2].
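The two multiples Kasada reported also imply how GPTbot and ClaudeBot compare with each other. The sketch below normalizes the figures to ClaudeBot = 1. It assumes both multiples were measured over the same traffic and time window, which the reporting does not state, and the function name `relative_rates` is hypothetical.

```python
from fractions import Fraction

# Multiples reported by Kasada via Fortune: Bytespider's scraping rate
# relative to each of the other two crawlers.
BYTESPIDER_VS_GPTBOT = 25
BYTESPIDER_VS_CLAUDEBOT = 3000


def relative_rates(vs_gptbot: int, vs_claudebot: int) -> dict[str, Fraction]:
    """Express each crawler's rate as a multiple of ClaudeBot's rate.

    Assumption: both multiples describe the same measurement window,
    so they can be divided by one another.
    """
    bytespider = Fraction(vs_claudebot)      # Bytespider = 3000 x ClaudeBot
    gptbot = bytespider / vs_gptbot          # GPTbot = Bytespider / 25
    return {
        "ClaudeBot": Fraction(1),
        "GPTbot": gptbot,
        "Bytespider": bytespider,
    }


rates = relative_rates(BYTESPIDER_VS_GPTBOT, BYTESPIDER_VS_CLAUDEBOT)
for name, rate in rates.items():
    print(f"{name}: {rate}x ClaudeBot")
```

Under that assumption, the reported figures would put GPTbot at roughly 120 times ClaudeBot's rate, so the 3,000-times figure reflects ClaudeBot's comparatively low volume as much as Bytespider's high one.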
Like some of its counterparts from other tech giants, Bytespider does not respect robots.txt, a voluntary, non-legally binding file that signals which parts of a website should not be scraped. This aggressive approach has raised concerns about data privacy and the ethical implications of mass data collection [3].
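To make the robots.txt mechanism concrete, here is a minimal sketch of how a well-behaved crawler would consult the file before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URLs, the `SomeOtherBot` name, and the `is_allowed` helper are illustrative assumptions, not rules taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve at the site root.
# It asks two AI crawlers to stay out entirely and asks everyone else
# to avoid only /private/.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("Bytespider", "GPTBot", "SomeOtherBot"):
    for url in ("https://example.com/articles/1",
                "https://example.com/private/x"):
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing here blocks a request. The check only tells a crawler what the publisher has asked, so a bot that skips the check, as the research says Bytespider does, can fetch the pages anyway.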
The introduction of Bytespider aligns with ByteDance's efforts to catch up in the AI race. The company has already released an AI-powered chatbot called Doubao in China, which is competing with Baidu's Ernie Bot. ByteDance is also reported to be planning a new AI model, potentially using chips from China's Huawei [3].
One possible use for the vast amount of data being collected is to enhance TikTok's search functionality. The platform recently updated its search feature to allow advertisers to track trending keywords in real time. A more advanced AI model could further improve TikTok's search capabilities, potentially challenging Google's dominance in the digital advertising space [1].
ByteDance's aggressive data collection comes at a time when TikTok faces significant regulatory challenges in the United States. President Joe Biden has signed legislation requiring ByteDance to sell TikTok or shut it down, citing national security concerns. This situation adds complexity to ByteDance's AI development efforts and raises questions about the future of its data collection practices [4].
ByteDance's actions reflect a broader trend in the tech industry, where companies are racing to collect vast amounts of data to train and improve their AI models. This practice has sparked debates about copyright infringement, content creators' rights, and the ethical use of publicly available information for AI training purposes [4].