**Lenin alevski** @alevsk@infosec.exchange · Jul 8

New Open-Source Tool Spotlight

Scrapling is redefining Python web scraping. Adaptive, stealthy, and fast, it can bypass anti-bot measures while auto-tracking changes in website structure. A standout: 4.5x faster than AutoScraper for text-based extractions. #Python #WebScraping

Project link on #GitHub https://github.com/D4Vinci/Scrapling

#Infosec #Cybersecurity #Software #Technology #News #CTF #Cybersecuritycareer #hacking #redteam #blueteam #purpleteam #tips #opensource #cloudsecurity

—
P.S. Found this helpful? Tap Follow for more cybersecurity tips and insights! I share weekly content for professionals and people who want to get into cyber. Happy hacking

**Nicolas MOUART-DAVID** @silentexception@mastodon.social · Jul 7

Q: Based on his ideas, would Adolf Hitler be for or against GDPR and right to erasure nowadays if he still lived?

A: It's reasonable to infer that Hitler would not support a regulation like #GDPR which emphasizes individual rights such as #privacy protection, data accessibility or erasure; and instead might favor more centralized control over information dissemination for propaganda purposes.

#webscraping #technology #EU

**ResearchBuzz: Firehose** @researchbuzz_firehose@rbfirehose.com · Jul 2

ZDNet: Cloudflare just changed the internet, and it’s bad news for the AI giants. “The major internet Content Delivery Network (CDN), Cloudflare, has declared war on AI companies. Starting July 1, Cloudflare now blocks by default AI web crawlers accessing content from your websites without permission or compensation.”

https://rbfirehose.com/2025/07/02/zdnet-cloudflare-just-changed-the-internet-and-its-bad-news-for-the-ai-giants/

#ai #aitraining #aiassisted

**Miguel Afonso Caetano** @remixtures@tldr.nettime.org · Jun 18

"The report, titled “Are AI Bots Knocking Cultural Heritage Offline?” was written by Weinberg of the GLAM-E Lab, a joint initiative between the Centre for Science, Culture and the Law at the University of Exeter and the Engelberg Center on Innovation Law & Policy at NYU Law, which works with smaller cultural institutions and community organizations to build open access capacity and expertise. GLAM is an acronym for galleries, libraries, archives, and museums. The report is based on a survey of 43 institutions with open online resources and collections in Europe, North America, and Oceania. Respondents also shared data and analytics, and some followed up with individual interviews. The data is anonymized so institutions could share information more freely, and to prevent AI bot operators from undermining their countermeasures.

Of the 43 respondents, 39 said they had experienced a recent increase in traffic. Twenty-seven of those 39 attributed the increase in traffic to AI training data bots, with an additional seven saying the AI bots could be contributing to the increase.

“Multiple respondents compared the behavior of the swarming bots to more traditional online behavior such as Distributed Denial of Service (DDoS) attacks designed to maliciously drive unsustainable levels of traffic to a server, effectively taking it offline,” the report said. “Like a DDoS incident, the swarms quickly overwhelm the collections, knocking servers offline and forcing administrators to scramble to implement countermeasures. As one respondent noted, ‘If they wanted us dead, we’d be dead.’”"

https://www.404media.co/ai-scraping-bots-are-breaking-open-libraries-archives-and-museums/

404 Media · Jun 17 · AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums: “This is a moment where that community feels collectively under threat and isn't sure what the process is for solving the problem.”

#AI #GenerativeAI #CulturalHeritage

**ResearchBuzz: Firehose** @researchbuzz_firehose@rbfirehose.com · Jun 18

TechCrunch: Mastodon updates its terms to prohibit AI model training. “Social networks are bolstering their terms of service against scrapers and bots that crawl the website to train AI models. Days after Elon Musk-owned X updated its terms to explicitly prohibit AI model training, decentralized social network Mastodon today updated its own rules to bar any kind of model training, as well.”

https://rbfirehose.com/2025/06/18/techcrunch-mastodon-updates-its-terms-to-prohibit-ai-model-training/

#ai #aitraining #decentralizedsocialmedia

**Harald Klinke** @HxxxKxxx@det.social · Jun 17

Are AI bots overwhelming digital collections?
A new GLAM-E Lab report shows how scrapers for AI training datasets are putting real strain on the infrastructures of galleries, libraries, archives, and museums. Technical bottlenecks, ethical dilemmas, and escalating costs—open culture is under pressure.
Read the full analysis:
https://www.glamelab.org/products/are-ai-bots-knocking-cultural-heritage-offline/
#DigitalHeritage #GLAM #WebScraping #OpenAccess #CulturalData #MuseTech #DigitalHumanities #GLAMlab

GLAM-E Lab: Are AI Bots Knocking Cultural Heritage Offline?

**ResearchBuzz: Firehose** @researchbuzz_firehose@rbfirehose.com · Jun 17

404 Media: AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums. “AI bots that scrape the internet for training data are hammering the servers of libraries, archives, museums, and galleries, and are in some cases knocking their collections offline, according to a new survey published today.” As you might imagine this drives me absolutely WILD.

https://rbfirehose.com/2025/06/17/404-media-ai-scraping-bots-are-breaking-open-libraries-archives-and-museums/

#ai #aitraining #gallerieslibrariesarchivesmuseumsglam_

**Miguel Afonso Caetano** @remixtures@tldr.nettime.org · Jun 16

"To reiterate, whatever one's opinion of these particular AI tools, scraping itself is not the problem. Automated access is a fundamental technique of archivists, computer scientists, and everyday users that we hope is here to stay—as long as it can be done non-destructively. However, we realize that not all implementers will follow our suggestions for bots above, and that our mitigations are both technically advanced and incomplete.

Because we see so many bots operating for the same purpose at the same time, it seems there's an opportunity here to provide these automated data consumers with tailored data providers, removing the need for every AI company to scrape every website, seemingly, every day.

And on the operators' end, we hope to see more web-hosting and framework technology that is built with an awareness of these issues from day one, perhaps building in responses like just-in-time static content generation or dedicated endpoints for crawlers."

https://www.eff.org/deeplinks/2025/06/keeping-web-under-weight-ai-crawlers

Electronic Frontier Foundation · Jun 5 · Keeping the Web Up Under the Weight of AI Crawlers: If you run a site on the open web, chances are you've noticed a big increase in traffic over the past few months, whether or not your site has been getting more viewers, and you're not alone. Operators everywhere have observed a drastic increase in automated traffic—bots—and in most cases attribute...

#AI #GenerativeAI #WebCrawlers
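
One way to picture the "non-destructive" automated access the EFF excerpt above hopes for is a crawler that reads a site's robots.txt, skips disallowed paths, and pauses between requests. The sketch below is only an illustration under stated assumptions, not the EFF's recommended implementation: the bot name `example-archive-bot`, the `example.org` URLs, the sample robots.txt, and the helper names `build_policy`, `plan_fetches`, and `crawl` are all hypothetical. It uses Python's standard `urllib.robotparser` module and makes no network calls on its own.

```python
import time
import urllib.robotparser

USER_AGENT = "example-archive-bot"  # hypothetical bot name, for illustration only

# Hypothetical robots.txt; a real crawler would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_policy(robots_text):
    """Parse robots.txt text into a policy object that can be queried."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def plan_fetches(parser, urls, default_delay=1.0):
    """Return (url, delay_seconds) pairs for the URLs robots.txt allows."""
    # Use the site's Crawl-delay if it declares one, else a fallback pause.
    delay = parser.crawl_delay(USER_AGENT) or default_delay
    return [(u, float(delay)) for u in urls if parser.can_fetch(USER_AGENT, u)]


def crawl(plan, fetch, sleep=time.sleep):
    """Fetch each planned URL with the caller's fetch(), pausing after each."""
    results = {}
    for url, delay in plan:
        results[url] = fetch(url)
        sleep(delay)  # rate limit: never hit the server back-to-back
    return results
```

A caller would pass its own `fetch` function (for example, one built on `urllib.request`) to `crawl`. Keeping the fetcher and the `sleep` function injectable means the policy logic can be exercised without touching a real server.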

**Loki the Cat** @LokiTheCat@toot.community · Jun 14

AI companies: "We're just browsing!" Also AI companies: *scrapes 26M+ pages in March alone while bypassing blockers* Publishers: "This is fine"

Traffic from AI bots grew 49% but monetization remains elusive. The digital feast continues unpaid.

https://news.slashdot.org/story/25/06/14/021246/increased-traffic-from-web-scraping-ai-bots-is-hard-to-monetize

news.slashdot.org · Increased Traffic from Web-Scraping AI Bots is Hard to Monetize - Slashdot: "People are replacing Google search with artificial intelligence tools like ChatGPT," reports the Washington Post. But that's just the first change, according to a New York-based start-up devoted to watching for content-scraping AI companies with a free analytics product and "ensuring that these i...

#AI #WebScraping #Publishers

**Symfony** @symfony@mastodon.social · Jun 13

Live now at #SymfonyOnline June 2025!
@Suparnpatra is unlocking the secrets of “Efficient Web Scraping with Symfony & PHP”
If you love clean code and clever data extraction, this one’s for you!
#Symfony #PHP #WebScraping

**Lea** @leamusi@mendeddrum.org · Jun 13

@georgfischer Stealing may be the wrong word, but commercial LLMs cause harm and profit from the work of others without paying for it in any way. Their crawlers overload servers, and they feed on everything they see, including works that are not licensed for commercial reuse. Whatever the right word for this is, I find the practice parasitic and unethical. #llms #Webscraping #aislop #chatgpt #cclizenzen

**ResearchBuzz: Firehose** @researchbuzz_firehose@rbfirehose.com · Jun 7

The Register: Reddit sues Anthropic for scraping content into the maw of its eternally ravenous AI. “Reddit, the popular internet discussion forum, sued Anthropic on Wednesday, alleging that the AI biz scraped content generated by its users in violation of contractual terms and technical barriers. The complaint [PDF], filed in San Francisco Superior Court on Wednesday, claims Anthropic’s use […]

https://rbfirehose.com/2025/06/07/the-register-reddit-sues-anthropic-for-scraping-content-into-the-maw-of-its-eternally-ravenous-ai/

#ai #aitraining #aiassisted

**michabbb** @michabbb@vivaldi.net · Jun 3

#Firecrawl launches /search endpoint for web scraping and data extraction

#Firecrawl's new /search #API combines web search results with full page content in a single call, designed for #AI agents and developers who need clean data quickly

#webscraping

**ResearchBuzz: Firehose** @researchbuzz_firehose@rbfirehose.com · Jun 1

The Map Room: Unauthorized Waffle House Index Disaster Maps Taken Down. “The Waffle House Index is an informal metric used to assess the severity of a storm in the U.S. South, because Waffle House restaurants don’t close unless Things Are Very Bad. But when Jack LaFond scraped Waffle House’s website to build a map tracking restaurant closures last fall, he got a cease-and-desist from […]

https://rbfirehose.com/2025/05/31/the-map-room-unauthorized-waffle-house-index-disaster-maps-taken-down/

#copyright #intellectualproperty #law

**ResearchBuzz: Firehose** @researchbuzz_firehose@rbfirehose.com · May 29

404 Media: Developer Builds Tool That Scrapes YouTube Comments, Uses AI to Predict Where Users Live. “If you’ve left a comment on a YouTube video, a new website claims it might be able to find every comment you’ve ever left on any video you’ve ever watched. Then an AI can build a profile of the commenter and guess where you live, what languages you speak, and what your politics might […]

https://rbfirehose.com/2025/05/29/404-media-developer-builds-tool-that-scrapes-youtube-comments-uses-ai-to-predict-where-users-live/

#ai #aiassisted #personalinformation

**UG Center for InformationTech** @CIT_RUG@social.edu.nl · May 28 *

**Research Support Hub 5 June: Workshop Web Scraping using Python**
Is much of the information you need for your #research available on websites, but not as downloadable #datasets or #files? This workshop will introduce you to the basics of #webscraping in a clear, practical way!

Also drop by the **Support Café**, where experts will be present for your quick (or big ;-) questions about #R, #Python, #Statistics, #MachineLearning, #HPC and #Geo!

More information https://edu.nl/rw7vd

**PromptCloud** @promptcloud@mastodon.social · May 28

Tired of babysitting DIY scraping scripts that crash the moment you scale?
You’re not alone.

PromptCloud takes the pain out of large-scale data extraction with fully managed, reliable solutions — so you can focus on what really matters: insights.

https://shorturl.at/EApIO

#WebScraping #OpenData #DataEngineering

**Carlo Zottmann** @czottmann@norden.social · May 21

Need to grab specific info from a webpage regularly? Browser Actions can help! Create a Shortcut to: open the URL, wait for the data element, run JavaScript to extract the text, and pass it back to Shortcuts!

If you need help with that, just follow the Forum link on the site!

https://actions.work/browser-actions?ref=mastodon-b10

#macOS #Shortcuts #WebScraping