#webscraping


New Open-Source Tool Spotlight 🚨🚨🚨

Scrapling is redefining Python web scraping. Adaptive, stealthy, and fast, it can bypass anti-bot measures while auto-tracking changes in website structure. A standout: 4.5x faster than AutoScraper for text-based extractions. #Python #WebScraping

🔗 Project link on #GitHub 👉 github.com/D4Vinci/Scrapling

#Infosec #Cybersecurity #Software #Technology #News #CTF #Cybersecuritycareer #hacking #redteam #blueteam #purpleteam #tips #opensource #cloudsecurity

✨
🔐 P.S. Found this helpful? Tap Follow for more cybersecurity tips and insights! I share weekly content for professionals and people who want to get into cyber. Happy hacking 💻🏴‍☠️

Q: Based on his ideas, would Adolf Hitler be for or against GDPR and right to erasure nowadays if he still lived?

A: It's reasonable to infer that Hitler would not support a regulation like #GDPR which emphasizes individual rights such as #privacy protection, data accessibility or erasure; and instead might favor more centralized control over information dissemination for propaganda purposes.

ZDNet: Cloudflare just changed the internet, and it’s bad news for the AI giants. “The major internet Content Delivery Network (CDN), Cloudflare, has declared war on AI companies. Starting July 1, Cloudflare now blocks by default AI web crawlers accessing content from your websites without permission or compensation.”

https://rbfirehose.com/2025/07/02/zdnet-cloudflare-just-changed-the-internet-and-its-bad-news-for-the-ai-giants/


"The report, titled “Are AI Bots Knocking Cultural Heritage Offline?” was written by Weinberg of the GLAM-E Lab, a joint initiative between the Centre for Science, Culture and the Law at the University of Exeter and the Engelberg Center on Innovation Law & Policy at NYU Law, which works with smaller cultural institutions and community organizations to build open access capacity and expertise. GLAM is an acronym for galleries, libraries, archives, and museums. The report is based on a survey of 43 institutions with open online resources and collections in Europe, North America, and Oceania. Respondents also shared data and analytics, and some followed up with individual interviews. The data is anonymized so institutions could share information more freely, and to prevent AI bot operators from undermining their countermeasures.

Of the 43 respondents, 39 said they had experienced a recent increase in traffic. Twenty-seven of those 39 attributed the increase in traffic to AI training data bots, with an additional seven saying the AI bots could be contributing to the increase.

“Multiple respondents compared the behavior of the swarming bots to more traditional online behavior such as Distributed Denial of Service (DDoS) attacks designed to maliciously drive unsustainable levels of traffic to a server, effectively taking it offline,” the report said. “Like a DDoS incident, the swarms quickly overwhelm the collections, knocking servers offline and forcing administrators to scramble to implement countermeasures. As one respondent noted, ‘If they wanted us dead, we’d be dead.’”"

404media.co/ai-scraping-bots-a

404 Media · AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums
“This is a moment where that community feels collectively under threat and isn't sure what the process is for solving the problem.”

TechCrunch: Mastodon updates its terms to prohibit AI model training. “Social networks are bolstering their terms of service against scrapers and bots that crawl the website to train AI models. Days after Elon Musk-owned X updated its terms to explicitly prohibit AI model training, decentralized social network Mastodon today updated its own rules to bar any kind of model training, as well.”

https://rbfirehose.com/2025/06/18/techcrunch-mastodon-updates-its-terms-to-prohibit-ai-model-training/


Are AI bots overwhelming digital collections?
A new GLAM-E Lab report shows how scrapers for AI training datasets are putting real strain on the infrastructures of galleries, libraries, archives, and museums. Technical bottlenecks, ethical dilemmas, and escalating costs—open culture is under pressure.
Read the full analysis:
glamelab.org/products/are-ai-b
#DigitalHeritage #GLAM #WebScraping #OpenAccess #CulturalData #MuseTech #DigitalHumanities #GLAMlab

GLAM-E Lab · Are AI Bots Knocking Cultural Heritage Offline?

404 Media: AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums. “AI bots that scrape the internet for training data are hammering the servers of libraries, archives, museums, and galleries, and are in some cases knocking their collections offline, according to a new survey published today.” As you might imagine this drives me absolutely WILD.

https://rbfirehose.com/2025/06/17/404-media-ai-scraping-bots-are-breaking-open-libraries-archives-and-museums/


"To reiterate, whatever one's opinion of these particular AI tools, scraping itself is not the problem. Automated access is a fundamental technique of archivists, computer scientists, and everyday users that we hope is here to stay—as long as it can be done non-destructively. However, we realize that not all implementers will follow our suggestions for bots above, and that our mitigations are both technically advanced and incomplete.

Because we see so many bots operating for the same purpose at the same time, it seems there's an opportunity here to provide these automated data consumers with tailored data providers, removing the need for every AI company to scrape every website, seemingly, every day.

And on the operators' end, we hope to see more web-hosting and framework technology that is built with an awareness of these issues from day one, perhaps building in responses like just-in-time static content generation or dedicated endpoints for crawlers."

eff.org/deeplinks/2025/06/keep

Electronic Frontier Foundation · Keeping the Web Up Under the Weight of AI Crawlers
If you run a site on the open web, chances are you've noticed a big increase in traffic over the past few months, whether or not your site has been getting more viewers, and you're not alone. Operators everywhere have observed a drastic increase in automated traffic—bots—and in most cases attribute...
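As an illustration of what "non-destructive" automated access in the EFF excerpt above could look like on the bot side, here is a minimal sketch that consults a site's robots.txt rules before fetching and honours a Crawl-delay if one is given. The EFF's actual suggestions are not reproduced in this feed, so this is one common politeness convention, not their list. The robots.txt content, the `ExampleBot` user agent, the example.org URLs, and the helper names `build_policy` and `plan_fetch` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(policy, user_agent, url, default_delay=1.0):
    """Return (allowed, delay_seconds) for one URL.

    The delay is the site's Crawl-delay when present, otherwise default_delay
    (an assumed fallback, not a standard value).
    """
    allowed = policy.can_fetch(user_agent, url)
    delay = policy.crawl_delay(user_agent)
    return allowed, float(delay) if delay is not None else default_delay


if __name__ == "__main__":
    policy = build_policy(ROBOTS_TXT)
    for url in ("https://example.org/collections/1", "https://example.org/private/x"):
        print(url, plan_fetch(policy, "ExampleBot", url))
```

A crawler built on this would skip any URL where `allowed` is false and sleep for `delay_seconds` between requests to the same host.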

@georgfischer Stealing may be the wrong word, but commercial LLMs do harm and profit from the work of others without paying for it in any way. Their crawlers overload servers, and they feed on everything they see, including works that are not licensed for commercial reuse. Whatever the right word for this is, I find this practice parasitic and unethical. #llms #Webscraping #aislop #chatgpt #cclizenzen

The Register: Reddit sues Anthropic for scraping content into the maw of its eternally ravenous AI. “Reddit, the popular internet discussion forum, sued Anthropic on Wednesday, alleging that the AI biz scraped content generated by its users in violation of contractual terms and technical barriers. The complaint [PDF], filed in San Francisco Superior Court on Wednesday, claims Anthropic’s use […]

https://rbfirehose.com/2025/06/07/the-register-reddit-sues-anthropic-for-scraping-content-into-the-maw-of-its-eternally-ravenous-ai/


The Map Room: Unauthorized Waffle House Index Disaster Maps Taken Down. “The Waffle House Index is an informal metric used to assess the severity of a storm in the U.S. South, because Waffle House restaurants don’t close unless Things Are Very Bad. But when Jack LaFond scraped Waffle House’s website to build a map tracking restaurant closures last fall, he got a cease-and-desist from […]

https://rbfirehose.com/2025/05/31/the-map-room-unauthorized-waffle-house-index-disaster-maps-taken-down/


404 Media: Developer Builds Tool That Scrapes YouTube Comments, Uses AI to Predict Where Users Live. “If you’ve left a comment on a YouTube video, a new website claims it might be able to find every comment you’ve ever left on any video you’ve ever watched. Then an AI can build a profile of the commenter and guess where you live, what languages you speak, and what your politics might […]

https://rbfirehose.com/2025/05/29/404-media-developer-builds-tool-that-scrapes-youtube-comments-uses-ai-to-predict-where-users-live/


📣 Research Support Hub 5 June: Workshop Web Scraping using Python 🐍
Is much of the information you need for your #research available on websites, but not as downloadable #datasets or #files? This workshop will introduce you to the basics of #webscraping in a clear, practical way!

Also drop by the Support Café, where experts will be on hand for your quick (or big ;-) questions about #R, #Python, #Statistics, #MachineLearning, #HPC and #Geo!

ℹ️ More information 👉🏼edu.nl/rw7vd


2/

Scraping (as in Web Scraping) is the act of extracting data from HTML web-pages where the data is NOT machine-legible.

If the data, even in an HTML web-page, is in a machine-legible format, then it is NOT scraping.

...

And, getting data in JSON (key-value pairs) is definitely NOT scraping — as JSON's purpose is to communicate data in a machine-legible manner.

CC: @404mediaco
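The distinction drawn in this thread can be sketched in a few lines of Python, assuming a made-up page and a made-up JSON body. In the first case the price has to be dug out of presentation markup (scraping, in the thread's sense). In the second the data is already structured and is simply decoded. The markup, the `price` class name, and the helper names are hypothetical.

```python
import json
from html.parser import HTMLParser

# Hypothetical inputs: the same fact, once as presentation HTML, once as JSON.
HTML_PAGE = '<div><h1>Menu</h1><p>Waffle: <span class="price">4.50</span></p></div>'
JSON_BODY = '{"item": "Waffle", "price": 4.50}'


class PriceScraper(HTMLParser):
    """Collects the text inside <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())


def scrape_prices(html):
    """Scraping: recover data from markup that was written for human readers."""
    parser = PriceScraper()
    parser.feed(html)
    parser.close()
    return parser.prices


def read_price(json_text):
    """Not scraping: the payload is already machine-legible, so just decode it."""
    return json.loads(json_text)["price"]


if __name__ == "__main__":
    print(scrape_prices(HTML_PAGE))
    print(read_price(JSON_BODY))
```

Note that the scraper depends on the page's layout (a class name that can change at any time), while the JSON reader depends only on a key name the provider chose to expose.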