Make illegally trained LLMs public domain as punishment

🃏Joker@sh.itjust.works · 14 hours ago

Make illegally trained LLMs public domain as punishment

HexesofVexes@lemmy.world · edit-2 12 hours ago

I mean, that fits if we really are following the spirit of copyright, since no-one at OpenAI or other companies developed matrix and vector multiplication (operations existing in the public domain because Platonism is a thing).

Edit: oh my, I guess the consensus is that stealing the work of mathematicians is ok (or more, classifying our constructions as discoveries).

catloaf@lemm.ee · 8 hours ago

Simple operations like vector multiplication are not works for the purposes of copyright law. If you invented an entirely new form of math, complete with novel formulae, you could conceivably assert patent rights and/or copyright over it, especially if you published a textbook. It would be more relevant, however, to discuss complex algorithms, such as for data compression. Those can certainly be patented. And, when implemented as a computer program, can certainly be copyrighted.

But if you’re just defining one simple operation, yeah, you’re unlikely to be able to assert any rights over it.

HexesofVexes@lemmy.world · 7 hours ago

Ehh no, you genuinely can’t patent any form of mathematics.

Mathematics falls under “exists in nature” (if you are a Platonist) or “abstract ideas” (gets even clear-thinking Constructivists). So they’re excluded from patents and copyrights no matter how complex the system.

Textbooks usually belong to the publisher (academics commonly have to pirate their own papers), so that’s usually a bust.

You might be able to patent an algorithm associated with a branch of mathematics, but that’s trickier than you think. Blank slate (clean-room) development can, and does, happen (see Compaq’s reimplementation of IBM’s BIOS). You’re banking on it not being reverse-engineerable (spoiler, don’t take that bet if you’ve published your proofs!).

sugar_in_your_tea@sh.itjust.works · 12 hours ago

You can’t patent math, though you can copyright a specific explanation of math concepts.

If OpenAI (or any AI company) is including copyrighted works in their solution, that’s a copyright violation and should be treated as such. But if they’re merely using the information from a copyrighted work but not violating the copyright itself, they’re fine.

HexesofVexes@lemmy.world · 10 hours ago

That’s rather the irony - mathematics takes a great deal of work and creativity. You can’t copyright mathematical work; but, put a set of lines together and shade in the polygons created and suddenly it becomes copyrightable. Somehow one is a creative work whose author requires protection, and the other is volunteered for involuntary public service.

The reason mathematics cannot be copyrighted: because it’s a “discovery”, rather than a “creation” (very much a point of view, and far from irrefutable fact). In mathematics, one should be aware that the concept and its explanation (proof) are much the same thing.

All in all, the argument is either mathematical work should fall under copyright (an abhorrent idea), or copyright should be abolished as it rarely (if ever) does much good.

sugar_in_your_tea@sh.itjust.works · 9 hours ago

The point of copyright is to protect creators from having their work stolen.

If an indie artist creates a work, a larger company could copy that work and distribute it as their own, and since that larger company has deeper pockets than the indie artist, they can flood the market before the original artist has a chance to profit from it. Or, if a larger org creates an expensive work (game, movie, etc.), anyone could redistribute it without paying the original creator.

Without copyright, we’d get way more paywalls, invasive DRM, etc. We get a taste of that today, but it can get way worse.

So we definitely need copyright, it just needs to be a lot shorter (say, 10-20 years).

HexesofVexes@lemmy.world · 7 hours ago

I mean, the point was definitely stated as protecting creators. We’ve seen some solid David vs. Goliath stories of artists taking down people who steal their work.

However, this isn’t the reality for the majority of copyright. A lot of it just ties up works to companies owned by speculative shareholders (think of The Lord of the Rings).

Limits to duration would definitely help this, and we’d be on the same page there. However, I do still wonder if it shouldn’t be shorter for certain things (e.g. medical treatments or manufacturing), with the option of a public domain buyout to cover (reasonable, non-inflated) research costs.

sugar_in_your_tea@sh.itjust.works · 7 hours ago

medical treatments or manufacturing

Both would be patents, no? Those are 20 years in the US, whereas copyright is 70 years after the author’s death (or 95-120 years for “work for hire” works). Both should be reduced, but they protect very different things and need different considerations.

j4k3@lemmy.world · 12 hours ago

What is this perspective?

HexesofVexes@lemmy.world · 10 hours ago

Oh, that copyright is bollocks. If you follow its intent, you should be including academics, and that state of affairs would be abhorrent (we’d stagnate).

j4k3@lemmy.world · 9 hours ago

I see the issue more like this: thought policing is the inevitable outcome of calling training copyright infringement, because there is no difference between a person who recalls information and talks about it with others and the intended use of published information for training. If training an AI with all the knowledge a person learns in a similar manner is somehow wrong, then the inevitable long-term way this plays out is a Minority Report-like dystopia. It sets the precedent for prosecution of people for their thoughts or intentions and not their actions. This kind of thought policing existed in the darkest depths of the medieval era, or even into more recent eras of witch hunts or McCarthyism. Perhaps we are on the brink of another such dark era.

As far as I am aware:

Copyright is intended to protect someone from another person copying their work for financial gain, or to be much more specific: copying work for direct gain using any form of complex social hierarchy such as awards, reputation, or monetary gain.
What copyright does not protect is the dissemination of knowledge as it relates to publicly published works.
One has the choice to remain the sole proprietor of one’s knowledge, but to publish publicly is to relinquish ownership of the information contained within.
Principally, copyright protects that you were the first to write it, and the way in which you wrote it, but it does nothing to protect the knowledge contained within. If a person recalls that knowledge, they are not required to state a citation when speaking aloud, or in some way making use of that knowledge.
Copyright also has a scope of intent, and that primarily involves competitive works from one’s peers and excludes the scope of general knowledge and usefulness to society at large.

I’m not trying to mock you, or say you are right or wrong. Quite frankly, I don’t think in these terms, or care about the kinds of people who do. I’m heavily abstracted and intuitively driven to understand. I believe everything that is not intuitive is simply not fully understood yet. However naïve that may be is irrelevant here. I’m of the bias that those with something to gain often lack objective thinking and show a measure of envy when unexpected changes occur in society. I’m not accusing you, but only sharing the most minor of biases I am aware of while trying to say I want to understand. I would like to know if there is anything in the framework I just laid out that is overlooked. I would like to better understand why you find this issue upsetting. I’m one of the most flawed and openly human people on Lemmy. Look at my history if in doubt. I have no skin in this game, just curiosity.

HexesofVexes@lemmy.world · 8 hours ago

My view is that of a scholar - one who does devote a large part of their life to freely creating and disseminating knowledge. I do indeed hold a strong bias here, one I’m happy to admit.

Much of the time, when I’ve run across copyright, it is rarely (if ever, come to think of it) in the name of the author (a common requirement of journals being the giving up of ownership of one’s work). It normally falls to a company; one usually driven by shareholder value with little (if any) concern for the author’s rights. This tends to be the rule rather than the exception, and I’d argue that copyright in its current incarnation merely provides a legal avenue to steal the work of another, or hold to ransom their works from future generations. This contradicts the first point, and also the second (paywalled papers); indeed the lack of availability of academic works (created for free, or with public funding) is, I believe, a key driver of inequality in this world.

One can withhold or even selectively share knowledge, and history will never know what that has cost us.

In terms of AI training, I wouldn’t say it is copyright infringement even in spirit, and I say this as one whose works are vomited out verbatim by LLMs when questioned about the field. The comparison with speaking is an interesting one, for we generally do try and attribute ideas if we hold the speaker in esteem, or feel their name will enhance our point. An AI, however, is not speaking of their own volition, but is instead acting in the interest of the company hosting them (and so would fall under the professional label rather than the personal). This might contradict your final point, if one assumes AI progresses as a subscription product (which looks likely).

I think your framework has merit, mostly because it is built on ideals (and we need more such thinking in the world); however, it does not quite match the observed data. Though, it does suggest the rules a better incarnation of copyright could adhere to.

More so, I think no-one has an issue with training publicly available models - it’s the ones under copyright themselves people are leery of.

j4k3@lemmy.world · 5 hours ago

I wholeheartedly agree about proprietary models. My perspective is as someone who saw the initial momentum of AI and only runs models on my own hardware. What you are seeing with your work is not possible from a base model in practice. There are too many holes that need to align in the Swiss cheese to make that possible, especially with softmax (sampling) settings for general use. Even with deterministic softmax settings this doesn’t happen. I’ve even tried overtraining with a fine-tune, and it won’t reproduce verbatim. What you are seeing is only possible with an agentic RAG architecture. RAG is retrieval-augmented generation: the model’s output is augmented with text retrieved from a database. The common open source libraries are LangChain and ChromaDB for the agent and database. The agent is just a group of models running at the same time, with a central model capable of function calling in the model loader code.
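For readers unfamiliar with the retrieval step described above, here is a minimal sketch of the idea, not how LangChain or ChromaDB actually work. It assumes a toy word-count similarity instead of learned embeddings, and the names `retrieve`, `build_prompt`, and `PASSAGES` are hypothetical. The point it illustrates: stored text is looked up and pasted into the prompt verbatim, so the model can quote it without having memorized it.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a document database.
PASSAGES = [
    "A page table maps virtual addresses to physical addresses.",
    "The scheduler decides which runnable process gets the CPU next.",
    "A spinlock busy-waits until the lock becomes available.",
]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, passages, k=1):
    """Return the k stored passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, passages, k=1):
    """Paste the retrieved passages, verbatim, ahead of the question."""
    context = "\n".join(retrieve(query, passages, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Calling `build_prompt("How does the scheduler pick a process?", PASSAGES)` yields a prompt containing the scheduler passage word for word, which is what the language model would then see. A real setup would swap the word counts for embedding vectors and the list for a vector database.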

I can coax stuff out of a base model that is not supposed to be there, but it is so extreme and unreliable that it is not at all something useful. If I give a model something like 10k tokens (words/fragments) of lead-in then I can start a sentence of the reply and the model might get a sentence or two correct before it goes off on some tangent. Those kinds of paths through the tensor layers are like walking on a knife edge. There is absolutely no way to get that kind of result at random or without extreme efforts. The first few words of a model’s reply are very important too, and with open source models I can control every aspect. Indeed, I run models from a text editor interface where I see and control every aspect of generation.
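The “softmax settings” and the knife-edge intuition can be sketched roughly as follows. This is an illustration under simplifying assumptions, not a measurement of any real model: the logits and the 0.9 per-token probability are made up, tokens are treated as independent, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Deterministic decoding: always take the highest-scoring token index."""
    return max(range(len(logits)), key=lambda i: logits[i])

def verbatim_probability(per_token_prob, length):
    """Chance of sampling one exact sequence, assuming independent tokens
    that each have the same probability (a simplifying assumption)."""
    return per_token_prob ** length
```

Under these assumptions, `verbatim_probability(0.9, 100)` comes out below 0.0001: even when every token is 90% likely, sampling a specific 100-token passage exactly is rare. Whether real models behave this way on text they saw often in training is a separate empirical question that this sketch does not settle.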

I tried to create a RAG for learning Operating Systems: Principles and Practice, Computer Systems: A Programmer’s Perspective, and Linux Kernel Development as the next step in learning CS on my own. I learned a lot of the limits of present AI systems. They have a lot of promise, but progress mostly involves peripheral model loader code more than it does with the base model IMO.

I don’t know the answer to the stagnation and corruption of academia in so many areas. I figure there must be a group somewhere that has figured out civilization is going to collapse soon so why bother.