Lawsuit Accuses Anna's Archive of Hacking WorldCat, Stealing 2.2 TB Data

ancuuiqter@lemmy.world · edit-2 9 months ago

EinatYahav@lemmy.today · 9 months ago

tear down every paywall

body_by_make@lemmy.dbzer0.com · edit-2 9 months ago

Yes, let only the rich control your thoughts.

I won’t be surprised if this gets downvoted here. I’m as much of a pirate as anyone, but news needs to be paid for, or only people who can afford to control the news without income will control the news.

ShepherdPie@midwest.social · edit-2 9 months ago

Not saying you’re right or wrong, but paid news has been the model for quite a while now, and that has resulted in 24-hour talking heads on TV, paid stories, clickbait, and people resorting to word of mouth on places like Facebook for all their news. It’s not as if the current trajectory is any better than your hypothetical one.

RogueBanana@lemmy.zip · edit-2 9 months ago

Just going by Anna’s blog post, their business model seems to be trading information, i.e. sharing the full database of hundreds of millions of records in exchange for their members’ own records, so the list keeps growing as more members join. Although I don’t see why they need a monopoly on said information, given any other library would still continue working with them for their free streamlined process. There could be more to it, but it feels like they are wasting resources on this instead of putting them into things that actually matter.

Edit: also, I don’t think they scraped or have information about the members, like the location of each book, simply just the metadata, so it really seems harmless to me.

EveryMuffinIsNowEncrypted@lemmy.blahaj.zone · 9 months ago

Sigh… Whelp, time to go download a shit-ton of stuff before yet another friendly port goes down…

Snot Flickerman@lemmy.blahaj.zone · 9 months ago

https://annas-blog.org/worldcat-scrape.html

Relevant blog post. AA knew the risks in this, and this is sort of expected.

Darkassassin07@lemmy.ca · edit-2 9 months ago

Gotta wonder what their plan is. The lawsuit was an obvious outcome, and they haven’t exactly made much effort to make their actions appear legal.

I don’t see AA winning this one. Data’s out there though; no taking that back. Maybe they’ve just accepted the consequences… A martyr as it were.

BarrierWithAshes@kbin.social · 9 months ago

AA’s based outta Kazakhstan though. Lotta good a lawsuit filed in Ohio’s gonna do. At most I could see American ISPs implementing a DNS-level block against the site.

Darkassassin07@lemmy.ca · 9 months ago

Oh. Lol, get fucked WorldCat.

ancuuiqter@lemmy.world · 9 months ago

Would you be able to share where you learned that Anna’s Archive is based in Kazakhstan?

BarrierWithAshes@kbin.social · 9 months ago

I remember reading it on the site but I cannot find it now. I know for a fact she is based in Kazakhstan. So says her Wikipedia page.

ancuuiqter@lemmy.world · edit-2 9 months ago

Maybe you’re thinking of Sci-Hub and its founder, Alexandra Asanovna Elbakyan?

I could not find a location on Anna’s Archive’s wiki page.

BarrierWithAshes@kbin.social · 9 months ago

Yeah, I guess I am. Coulda sworn they were based in Kazakhstan. If they’re in any Five Eyes country they should gtfo. Too much copyright crap here.

xiao@sh.itjust.works · 9 months ago

I hope AA is going to be fine; they’ve saved me literally hundreds of US dollars…

MotoAsh@lemmy.world · edit-2 9 months ago

I mean… it’ll all come down to how they accessed the data. If they had a public portal and no EULA, they can push rocks. If the data wasn’t public or the ‘thieves’ had to use non-standard channels, or otherwise violated an EULA, they’re likely screwed. Especially if they had to go through abnormal channels.

I know their data can be accessed publicly, but I’m pretty sure it’s under license. You cannot just use any old thing found in public… That’s the biggest reason the AI models are technically theft: they weren’t licensed to commercially profit off of 99.99% of the things their LLMs are trained on, but the law and politicians are WAY behind the times. Commercial data they’d normally have to pay for is suddenly magically OK when laundered through an LLM…

Snot Flickerman@lemmy.blahaj.zone · 9 months ago

https://annas-blog.org/worldcat-scrape.html

WorldCat

That is when we set our sights on the largest book database in the world: WorldCat. This is a proprietary database by the non-profit OCLC, which aggregates metadata records from libraries all over the world, in exchange for giving those libraries access to the full dataset, and having them show up in end-users’ search results.

Even though OCLC is a non-profit, their business model requires protecting their database. Well, we’re sorry to say, friends at OCLC, we’re giving it all away. :-)

Over the past year, we’ve meticulously scraped all WorldCat records. At first, we hit a lucky break. WorldCat was just rolling out their complete website redesign (in Aug 2022). This included a substantial overhaul of their backend systems, introducing many security flaws. We immediately seized the opportunity, and were able scrape hundreds of millions (!) of records in mere days.

After that, security flaws were slowly fixed one by one, until the final one we found was patched about a month ago. By that time we had pretty much all records, and were only going for slightly higher quality records. So we felt it is time to release!

MotoAsh@lemmy.world · edit-2 9 months ago

Yea OK they’re fucked. I really really doubt they’ll be able to claim the data is solely comprised of the open works saved within that database. The only way they’d be able to get away with it is if they’ve meticulously harvested the data such that they only ever retrieved the open works or public domain works.

Anything not in that list or otherwise made available solely via their nonprofit efforts is going to be ammo in the lawsuit. Ammo that will hit its target.

Dkarma@lemmy.world · 9 months ago

“AI models are technically theft: they weren’t licensed to commercially profit off of 99.99%”

This is simply a lie. There is no license like what you describe. You never need a license to view or learn from something given away completely free on the internet. You guys keep pretending there’s a law that says otherwise. There is not, or you’d post it.

Copyright does not cover viewing or experiencing a piece.

MotoAsh@lemmy.world · edit-2 9 months ago

Notice how I said “commercially profit” too. Read all the words next time.

Also, LLMs do not “learn” anything, you idiot. That’s the entire point. They mathematically blend things. They DO NOT learn and create.

BearOfaTime@lemm.ee · 9 months ago

Honest question: if you connect to say an FTP server, and there’s no dialog claiming a EULA, would you be bound by one?

I don’t know how they got the data, but the whole EULA thing would rely on there being proof Anna agreed to one, right? That seems a bit tricky. As for “unauthorized access”, if a path is available, and Anna used it, again with no warnings, where’s the legal line?

Having been in civil court a few times, judges will ask people “do you have a document proving there was an agreement?”, over any circumstance that could be misconstrued, or is a verbal claim.

No doc, verbal claim is dismissed unless other party admits to the verbal claim in court, to the judge.

Just seems to me EULAs are terribly hard to enforce.

Again, I’m more thinking out loud. I have no idea how these cases tend to proceed.

FigMcLargeHuge@sh.itjust.works · 9 months ago

That is going to depend on what type of access the FTP server allows. If it’s anonymous, then I would argue that no, you cannot be bound by a EULA if no dialog is presented. But the article mentions “In addition to harvesting data from WorldCat.org, the defendants are also accused of obtaining and using credentials of a member library to access WorldCat Discovery Services.” Now it’s just my speculation, but if they used someone else’s ID to scrape the data, then WorldCat can just produce any documents that ID agreed to, and it will apply here. Sounds like they done goofed.

MotoAsh@lemmy.world · edit-2 9 months ago

I think that would depend on how intentional the open port was.

If it’s something there and advertised, even if mentioned in one place in some archaic document, they’d probably be fine just for accessing it.

Though that would only absolve them of acquisition issues. If they’re using someone else’s work for profit, there is almost certainly enough room for the lawsuit.

Only a select few licenses even allow for open and unrestricted commercial use. Especially if the data itself is the licensed thing, since valuable data is far easier to convert than something like source code.

Snot Flickerman@lemmy.blahaj.zone · 9 months ago

You are generally required to put up unauthorized access warnings.

Similar to how you have to post “no trespassing” signs if you don’t want to be trespassed.

WarmApplePieShrek@lemmy.dbzer0.com · 9 months ago

That’s not true. Trespass works like that because big corporations don’t get trespassed much, but they lobbied for copyright to be automatic.

ancuuiqter@lemmy.world · 9 months ago

Here are the court filings if anyone would like to read them:

https://archive.org/details/gov.uscourts.ohsd.287709/

The following is a link to the docket (which the above link draws from), so people can follow the progress of the lawsuit:

https://www.courtlistener.com/docket/68157923/oclc-online-computer-library-center-inc-v-annas-archive/

ancuuiqter@lemmy.world · 9 months ago

As to how Anna’s Archive accomplished their data scraping, this is what OCLC is claiming (see pages 62-63):

These attacks were accomplished with bots (automated software applications) that “scraped” and harvested data from WorldCat.org and other WorldCat®-based research sites and that called or pinged the server directly. These bots were initially masked to appear as legitimate search engine bots from Bing or Google.

To scrape or harvest the data on WorldCat.org, the bots searched WorldCat.org results, running a script based on OCN for individual JavaScript Object Notation, or “JSON,” records. As a result, WorldCat® data including freely accessible and enriched data, such as OCNs, were scraped from individual results on WorldCat.org.

The bots also harvested data from WorldCat.org by pretending to be an internet browser, directly calling or “pinging” OCLC’s servers, and bypassing the search, or user interface, of WorldCat.org. More robust WorldCat® data was harvested directly from OCLC’s servers, including enriched data not available through the WorldCat.org user interface.

Finally, WorldCat® data was harvested from a member’s website incorporating WorldCat® Discovery Services, a subscription-based variation of WorldCat.org that is available only to a member’s patrons. Again, the hacker pinged OCLC’s servers to harvest WorldCat® records directly from the servers. To do this through WorldCat® Discovery Services/FirstSearch, the hacker obtained and used the member’s credentials to authenticate the requests to the server as a member library.

From WorldCat® Discovery Services, hackers harvested 2 million richer WorldCat® records that included data not available in WorldCat.org. This hacking method resulted in the harvesting of some of OCLC’s most proprietary fields of WorldCat® data.

These hacking attacks materially affected OCLC’s production systems and servers, requiring around-the-clock efforts from November 2022 to March 2023 to attempt to limit service outages and maintain the production systems’ performance for customers. To respond to these ongoing attacks, OCLC spent over 1.4 million dollars on its systems’ infrastructure and devoted nearly 10,000 employee hours to the same.

Despite OCLC’s best efforts, OCLC’s customers experienced many significant disruptions in paid services during the aforementioned period as a result of the attacks on WorldCat.org, requiring OCLC to create system workarounds to ensure services functioned.

During this time, customers threatened and likely did cancel their products and services with OCLC due to these disruptions.

Because OCLC had to combat these persistent hacking attacks, OCLC was forced to divert existing personnel and resources from OCLC’s other products and services. As a result, OCLC’s development and improvements to other products and services were delayed and limited.

OCLC has devoted, at various times, ten or more employees to respond to and mitigate the harm from these attacks from October 2022 to present.

conciselyverbose@kbin.social · 9 months ago

None of this is “hacking”

isles@lemmy.world · 9 months ago

the hacker obtained and used the member’s credentials to authenticate the requests to the server as a member library.

Hacking is the act of breaking into a computer system without authorization or exceeding authorized access.

This part could be hacking. Not that I care; I think this is frivolous.

requiring around-the-clock efforts from November 2022 to March 2023 to attempt to limit service outages and maintain the production systems’ performance for customers.

Doesn’t major hosting require 24/7 monitoring anyway? Like they should have been doing this for more than just 11/22 to 3/23.

ancuuiqter@lemmy.world · edit-2 9 months ago

Regarding the operating location(s) of Anna’s Archive, OCLC is alleging the following (pages 7-9):

C. Defendants Rely on Sophisticated Technology and Online Practices to Conceal their Identities.

Defendants understand that their pirate library enterprise and related activities, here, hacking and harvesting OCLC’s WorldCat® records, are illegal. Defendants admit that they are engaging in and facilitating mass copyright infringement, stating, “[w]e deliberately violate the copyright law in most countries.” In another blog post, Defendants noted that their activities could lead to arrest and “decades of prison time.” Defendants have also recognized that their hacking and distribution of OCLC’s data is improper, acknowledging that WorldCat® is a “proprietary database,” that OCLC’s “business model requires protecting their database,” and that Defendants are “giving it all away. :-).”

Because Defendants understand their actions infringe on copyright laws, amongst others, Defendants go to great lengths to remain anonymous to ensure both that Anna’s Archive’s domains are not taken down and to avoid the legal consequences of their actions, including civil lawsuits where parties like OCLC seek to vindicate their rights, as well as criminal and regulatory enforcement actions undertaken by government entities. None of Anna’s Archive’s domains or its online blog provide a business address, business contact, or other contact information that would be found on a legitimate entity’s website.

Defendants have explained in a blog post that they are “being very careful not to leave any trace [of their online activities], and having strong operational security.” For instance, Anna’s Archive utilizes a VPN with “[a]ctual court-tested no-log policies with long track records of protecting privacy.” Each of the Anna’s Archive domains are registered using foreign hosts, registrars, and registrants in order to conceal the identity of the site operators. Additionally, Defendants rely on multiple proxy servers to maintain anonymity. Defendants also use a free version of Cloudflare, a top-level hosting provider, so that they do not have to provide any payment or other identifying information. Defendants selected Cloudflare because they claim Cloudflare has resisted requests to take down websites for copyright infringement. The individuals behind Anna’s Archive also use usernames as pseudonyms to mask their identities online.

Through the work of a cyber security and digital forensic investigation firm, OCLC was able to identify one of the individuals behind Anna’s Archive by name and locate a United States address, Defendant Maria Dolores Anasztasia Matienzo. However, the physical address and contact information of Anna’s Archive and the identities and contact information of the John Does remain unknown. It is highly likely that Anna’s Archive is a non-domestic, foreign entity, based on the findings from OCLC’s investigator, as set forth below.

OCLC explained the above in their Motion To Serve Defendant Anna’s Archive By Email, as justification for why they seek “permission to serve Anna’s Archive by alternative means, here, email, pursuant to Federal Rule of Civil Procedure 4(h)(2) and (f)(3).”

Nougat@kbin.social · edit-2 9 months ago

You seem like someone who might be interested in !OriginalDocuments.

ancuuiqter@lemmy.world · edit-2 9 months ago

The official Anna’s Archive Reddit account, AnnaArchivist, has responded to an r/Annas_Archive post linking the same Torrent Freak article:

Thanks! We’re not making any public statements about this lawsuit but rest assured we’re fine.

Lawsuit Accuses Anna's Archive of Hacking WorldCat, Stealing 2.2 TB Data * TorrentFreak