It’s all made from our data, anyway, so it should be ours to use as we want
It’s not punishment. LLMs do not belong to them; they belong to all of humanity. Tear down the enclosing fences.
This is our common heritage, not OpenAI’s private property
“Given they were trained on our data, it makes sense that it should be public commons – that way we all benefit from the processing of our data”
I wonder how many people besides the author of this article are upset solely about the profit-from-copyright-infringement aspect of automated plagiarism and bullshit generation, and thus would be satisfied by the models being made more widely available.
The inherent plagiarism aspect of LLMs seems far more offensive to me than the copyright infringement, but both of those problems pale in comparison to the effects on humanity of masses of people relying on bullshit generators with outputs that are convincingly-plausible-yet-totally-wrong (and/or subtly wrong) far more often than anyone notices.
I liked the author’s earlier very-unlikely-to-be-met-demand activism last year better:
…which at least yielded the amusingly misleading headline OpenAI ordered to delete ChatGPT over false death claims (it’s technically true - a court didn’t order it, but a guy who goes by the name “That One Privacy Guy” while blogging on LinkedIn did).
A similar argument can be made about nationalizing corporations which break various laws, betray public trust, etc etc.
I’m not commenting on the virtues of such an approach, but I think it is fair to say that it is unrealistic, especially for countries like the US which fetishize profit at any cost.
Yes, mining companies should all be nationalised for digging up the country’s ground and putting carbon in the country’s air.
You must be fun at parties.
this comment doesn’t make any sense
You must be new here.
So banks will be public domain when they’re bailed out with taxpayer funds, too, right?
They should be, but currently it depends on the type of bailout, I suppose.
For instance, if a bank completely fails and goes under, the FDIC usually is named Receiver of the bank’s assets, and now effectively owns the bank.
At the same time, if a bank goes under, that means they owe more than they own, so “ownership” of that entity is basically worthless. In those cases, a bailout of the customers does nothing for the owners, because the owners still get wiped out.
The GM bailout in 2009 also involved wiping out all the shareholders, the government taking ownership of the new company, and the government spinning off the newly issued stock.
AIG required the company basically issue new stock to dilute owners down to 20% of the company, while the government owned the other 80%, and the government made a big profit when they exited that transaction and sold the stock off to the public.
So it’s not super unusual. Government can take ownership of companies as a condition of a bailout. What we generally don’t necessarily want is the government owning a company long term, because there’s some conflict of interest between its role as regulator and its interest as a shareholder.
With banks this is also true if they do not have enough liquid assets to meet the legal requirements. So the bank might not be able to count all bank accounts as assets but the FDIC is. Also they can then restructure the bank and force creditors to take a haircut.
This is why investment banks should be separate from banks that have consumer accounts that are insured by the government.
Then you can just let the investment bank fail. This was the whole premise of Glass-Steagall, which was repealed under Clinton…
Banks are redundant, so is the stock market. These institutions do not need to, and should not be private. They are level playing fields in the economy, not participants trying to tilt the board for taking over the game.
Public domain wouldn’t be the right term for banks being publicly owned. At least for the normal usage of Public Domain in copyright. You can copy text and data, you can’t copy a company with unique customers and physical property.
Oh good point. I’m not actually sure what the phrase would be… Publicly owned?
I mean, that sometimes did happen.
Germany propped up the Commerzbank after 2007 by essentially buying a large part of it, and managed to sell several tranches with a healthy profit.
Same is true for Lufthansa during COVID.
No, “the banks” wouldn’t be what the AI would be trained on, it would be the private info of individuals the banks do business with.
Imaginary property has always been a tricky concept, but the law always ends up just protecting the large corporations at the expense of the people who actually create things. I assume the end result here will be large corporations getting royalties from AI model usage or measures put in place to prevent generating content infringing on their imaginary properties and everyone else can get fucked.
It’s like what happened with Spotify. The artists and the labels were unhappy with the copyright infringement of music happening with Napster, Limewire, Kazaa, etc. They wanted the music model to be the same “buy an album from a record store” model that they knew and had worked for decades. But, users liked digital music and not having to buy a whole album for just one song, etc.
Spotify’s solution was easy: cut the record labels in. Let them invest and then any profits Spotify generated were shared with them. This made the record labels happy because they got money from their investment, even though their “buy an album” business model was now gone. It was ok for big artists because they had the power to negotiate with the labels and get something out of the deal. But, it absolutely screwed the small artists because now Spotify gives them essentially nothing.
I just hope that the law that nothing created by an LLM is copyrightable proves to be enough of a speed bump to slow things down.
Bandcamp still runs on this model though, and quite well.
It’s also one of the few places that have lossless audio files available for download. I’m a big fan of Bandcamp. I like having all my music local.
Same. I refuse to use spotify, i’ve got 400gb of mp3s and winamp
It won’t really do anything though. The model itself is whatever. The training tools, data and resulting generations of weights are where the meat is. Unless you can prove they are using unlicensed data from those three pieces, open sourcing it is kind of moot.
What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.
They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.
Legislation that prohibits publicly-viewable information from being analyzed without permission from the copyright holder would have some pretty dramatic and dire unintended consequences.
Not really. The same way you can’t sell live and public performance music for profit and not get sued. Case law right there, and the fact it’s performance vs publicly published doesn’t matter. How the owner and originator classifies or licenses it is the defining classification. It’s going to be years before anyone sees this get a ruling in court though.
That’s not what’s going on here, though. The LLM model doesn’t contain the actual copyrighted data, it’s the result of analyzing the copyrighted data.
An analogous example would be a site like TV Tropes. TV Tropes doesn’t contain the works that it’s discussing, it just contains information about those works.
No, the model does retain the original works in a lossy compression. This is evidenced by the fact that you can get a model to reproduce sections of its training data
You’re probably thinking of situations where overfitting occurred. Those situations are rare, and are considered to be errors in training. Much effort has been put into eliminating that from modern AI training, and it has been successfully done by all the major players.
This is an old no-longer-applicable objection, along the lines of “AI can’t do fingers right”. And even at the time, it was only very specific bits of training data that got inadvertently overfit, not all of it. You couldn’t retrieve arbitrary examples of training data.
Did you not read my original comment before responding?
You said:
What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.
But the point is that it doesn’t matter if the data is licensed or not. Lack of licensing doesn’t stop you from analyzing data once that data is visible to you. Do you think TV Tropes licensed any of the works of fiction that they have pages about?
They pulled a very public and out in the open data heist and got away with it.
They did not. No data was “heisted.” Data was analyzed. The product of that analysis does not contain the data itself, and so is not a violation of copyright.
You’re thinking of licensing as a person putting something online WITH a license.
The terminology in this case is whether or not it was LICENSED by the commercial entity using and selling its derivative. That is the default. The burden is on the commercial entity to prove they were the original creator of said content. It is by default plagiarism otherwise, and this is also the default.
Here’s an example: I write a story and post it online, and it is specific to a toothbrush and toilet scrubber falling in love, and then having dish scrubber pads as children. I say the two main characters are called Dennis and Fran, and their children are called Denise and Francesca. Then somebody goes to prompt OpenAI for a similar and it kicks out the exact same story with the same names, I would win that case based on it clearly being beyond a doubt plagiarism.
Unless you as OpenAI can prove these are all completely random-which they aren’t because it’s trained on my data-then I would be deemed the original creator of that story, and any sales of that data I would be entitled to.
Proving that is a different thing, but that’s what the laws say should happen. If they didn’t contact me to license that story, it’s still plagiarism. Same with music, movies…etc.
The product of that analysis does not contain the data itself, and so is not a violation of copyright.
That’s your opinion, not the opinion of a court or legislature. LLM products are directly derived from and dependent upon the training data, so it is positively considered a derivative work. However, whether it’s considered sufficiently transformative, or whether it passes the fair use test, has not to my knowledge been determined in court. (Note that I am assuming US law here.)
The courts have yet to come to a conclusion, the lawsuits are still ongoing. I think it’s unlikely they’ll conclude that the models contain the data, however, because it’s objectively not true.
The clearest demonstration I can think of to illustrate this is the old Stable Diffusion 1.5 model. It was trained on the LAION 5B dataset, which (as the “5B” indicates) contained 5 billion images. The resulting model was 1.83 gigabytes. So if it’s compressing images and storing them inside the model it’d somehow need to fit ~2.7 images per byte. This is, simply, impossible.
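The arithmetic in that comment can be checked with a quick sketch. The figures (5 billion images, a 1.83 GB model file) are taken from the comment as quoted and are not independently verified here; decimal gigabytes are an assumption.

```python
# Figures as quoted in the comment above; not independently verified.
NUM_IMAGES = 5_000_000_000   # the "5B" images in the dataset
MODEL_BYTES = 1.83e9         # 1.83 GB, assuming decimal gigabytes

# How many training images would have to fit in each byte of the model
images_per_byte = NUM_IMAGES / MODEL_BYTES
# The same ratio expressed as a storage budget per image
bits_per_image = 8 / images_per_byte

print(f"{images_per_byte:.2f} images per byte")  # 2.73 images per byte
print(f"{bits_per_image:.2f} bits per image")    # 2.93 bits per image
```

That is an average budget of roughly 3 bits per image. It says nothing either way about whether a handful of heavily duplicated images could be memorized at the expense of the rest.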
They pulled a very pubic and out in the open data heist
Oh no, not the pubes! Get those curlies outta here!
Best correction ever. Fixed. ♥️
It’s already illegal in some form. Via piracy of the works and regurgitating protected data.
The issue is mega Corp with many rich investors vs everyone else. If this were some university student their life would probably be ruined like with what happened to Aaron Swartz.
The US justice system is different for different people.
If we can’t train on unlicensed data, there is no open-source scene. Even worse, AI stays but it becomes a monopoly in the hands of the few who can pay for the data.
Most of that data is owned and aggregated by entities such as record labels, Hollywood, Instagram, reddit, Getty, etc.
The field would still remain hyper competitive for artists and other trades that are affected by AI. It would only cause all the new AI based tools to be behind expensive censored subscription models owned by either Microsoft or Google.
I think forcing all models trained on unlicensed data to be open source is a great idea but actually rooting for civil lawsuits which essentially entail a huge broadening of copyright laws is simply foolhardy imo.
Unlicensed from the POV of the trainer, meaning they didn’t contact or license content from someone who didn’t approve. If it’s posted under Creative Commons, that’s fine. If it’s otherwise posted that it’s not open in any other way and not for corporate use, then they need to contact the owner and license it.
They won’t need to, they will get it from Getty. All these websites have a ToS that makes it very clear they can do whatever they want with what you upload. The courts will simply never side with the small-time photographer who makes $50 a month with his stock photos hosted on someone else’s website. The laws will be in favor of data brokers and the handful of big AI companies.
Anyone self hosting will simply not get a call. Journalists will keep the same salary while the newspaper’s owner gets a fat bonus. Even Reddit already sold its data for 60 million and none of that went anywhere but spez’s coke fund.
Two things:
- Getty is not expressly licensed as “free to use”, and by default is not licensed for commercial anything. That’s how they are a business that is still alive.
- You’re talking about Generative AI junk and not LLMs which this discussion and the original post is about. They are not the same thing.
Reddit and newspapers selling their data preemptively has to do with LLMs. Can you clarify what scenario you are aiming for? It sounds like you want the courts to rule that AI companies need to ask each individual redditor if they can use his comments for training. I don’t see this happening personally.
Getty gives itself the right to license all photos uploaded and already trained a generative model on those btw.
EULA and TOS agreements stop Reddit and similar sites from being sued. They changed them before they were selling the data and barely gave notice about it (see the exodus from reddit pt2), but if you keep using the service, you agree to both, and they can get away with it because they own the platform.
Anyone who has their content on a platform of the like that got the rug pulled out from under them with silent amendments being made to allow that is unfortunately fucked.
Any other platforms that didn’t explicitly state this was happening is not in scope to just allow these training tools to grab and train. What we know is that OpenAI at the very least was training on public sites that didn’t explicitly allow this. Personal blogs, Wikipedia…etc.
But wouldn’t that mean making it open source, then it not functioning properly without the data while open, would prove that it is using a huge amount of unlicensed data?
Probably not “burden of proof in a court of law” prove though.
Making it open source doesn’t change how it works. It doesn’t need the data after it’s been trained. Most of these AIs are just figuring out patterns to look for in the new data it comes across.
So you’re saying the data wouldn’t exist anywhere in the source code, but it would still be able to answer questions based on the data it has previously seen?
That is how LLMs work: they don’t store the data as data, but as weight values.
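A toy analogy for “weights, not data”, offered as a sketch only: it fits a straight line to made-up points and then throws the points away. This is ordinary least squares, not an LLM, and the data, the names `model` and `predict`, and the numbers are all hypothetical.

```python
import random

random.seed(0)

# Hypothetical "training data": 1,000 noisy samples of y = 3x + 2.
xs = [random.uniform(0.0, 10.0) for _ in range(1000)]
ys = [3.0 * x + 2.0 + random.gauss(0.0, 0.5) for x in xs]

# Closed-form least-squares fit of y = w*x + b.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
var_x = sum((x - mean_x) ** 2 for x in xs)
w = cov_xy / var_x
b = mean_y - w * mean_x

model = (w, b)  # everything kept after training: two numbers
del xs, ys      # the training points are not needed to use the model


def predict(x):
    """Apply the fitted weights to a new input."""
    return model[0] * x + model[1]


print(predict(4.0))  # close to 14, since 3*4 + 2 = 14
```

The two weights capture the overall pattern, and the individual points cannot be read back out of them. A model with billions of weights is a different situation, and whether it can reproduce specific training examples is the point disputed elsewhere in this thread.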
So then why, if it were all open sourced, including the weights, would the AI be worthless? Surely having an identical but open source version, that would strip profitability from the original paid product.
It wouldn’t be. It would still work. It just wouldn’t be exclusively available to the group that created it-any competitive advantage is lost.
But all of this ignores the real issue - you’re not really punishing the use of unauthorized data. Those who owned that data are still harmed by this.
It does discourage the use of unauthorised data. If stealing doesn’t give you competitive advantage, it’s not really worth the risk and cost of stealing it in the first place.
in civil matters, the burden of proof is actually usually just preponderance of evidence and not beyond a reasonable doubt. in other words to win a lawsuit, you only need to have more compelling evidence than the other person.
But you still have to have EVIDENCE. Not derivative evidence. The output of a model could be argued to be hearsay because it’s not direct evidence of originating content, it’s derivative.
You’d have to have somebody backtrack generations of model data to even find snippets of something that defines copyright material, or a human actually saying “Yes, we definitely trained on unlicensed data”.
so like I am not making any comment on anything but the legal system here. but it’s absolutely the case that you can win a lawsuit on purely circumstantial evidence if the defense is unable to produce a compelling alternative set of circumstances which can lead to the same outcome.
It could also contain non-public domain data, and you can’t declare someone else’s intellectual property as public domain just like that, otherwise a malicious actor could just train a model with a bunch of misappropriated data, get caught (intentionally or not) and then force all that data into public domain.
Laws are never simple.
It wouldn’t contain any public-domain data though. That’s the thing with LLMs, once they’re trained on data the data is gone and just added to the series of weights in the model somewhere. If it ingested something private like your tax data, it couldn’t re-create your tax data on command, that data is now gone, but if it’s seen enough private tax data it could give something that looked a lot like a tax return to someone with an untrained eye. But, a tax accountant would easily see flaws in it.
Forcing a bunch of neural weights into the public domain doesn’t make the data they were trained on also public domain, in fact it doesn’t even reveal what they were trained on.
LOL no. The weights encode the training data and it’s trivially easy to make AI generators spit out bits of their training data.
paper?
No, training data.
No, he’s challenging the assertion that it’s “trivially easy” to make AIs output their training data.
Older AIs have occasionally regurgitated bits of training data as a result of overfitting, which is a flaw in training that modern AI training techniques have made great strides in eliminating. It’s no longer a particularly common problem, and even if it were it only applies to those specific bits of training data that were overfit on, not on all of the training data in general.
I thought he meant LLMs shot out bits of paper like some ticker-tape parade.
How easy are we talking about here? Also, making the model public domain doesn’t mean making the output public domain. The output of an LLM should still abide by copyright laws, as they should be.
So what you’re saying is that there’s no way to make it legal and it simply needs to be deleted entirely.
I agree.
There’s no need to “make it legal”, things are legal by default until a law is passed to make them illegal. Or a court precedent is set that establishes that an existing law applies to the new thing under discussion.
Training an AI doesn’t involve copying the training data, the AI model doesn’t literally “contain” the stuff it’s trained on. So it’s not likely that existing copyright law makes it illegal to do without permission.
By this logic, you can copy a copyrighted image as long as you decrease the resolution, because the new image does not contain all the information in the original one.
In the case of Stable Diffusion, they used 5 billion images to train a model 1.83 gigabytes in size. So if you reduce a copyrighted image to 3 bits (not bytes - bits), then yeah, I think you’re probably pretty safe.
More like reduce it to a handful of vectors that get merged with other vectors.
Right, like I did. They’re safeguarding Disney and other places like that now. It’s just the little guys who get screwed.
Delete them. Wipe their databases. Make the companies start from scratch with new, ethically acquired training data.
Mmm yes so all that electricity is pure waste
Genuine question, does anyone know how much of the electricity is used for training the model vs using it to generate responses?
Not specifically, but training is pretty fucking expensive to do, while generating is kinda easy. The OpenAI models are massive, training them cost a lot. Though they also have a lot of traffic. But unless they stop training new models, I don’t think generating answers will ever catch up to training.
For perspective, all of the data centers in the US combined use 4% of total electric load.
Yes!
The environmental cost of training is a bit of a meme. The details are spread around, but basically, Alibaba trained a GPT-4 level-ish model on a relatively small number of GPUs… probably on par with a steel mill running for a long time, a comparative drop in the bucket compared to industrial processes. OpenAI is extremely inefficient, probably because they don’t have much pressure to optimize GPU usage.
Inference cost is more of a concern with crazy stuff like o3, but this could dramatically change if (hopefully when) bitnet models come to fruition.
Still, I 100% agree with this. Closed LLM weights should be public domain, as many good models already are.
With current kWh/token it’s 100x of a regular google search query. That’s where the environmental meme came from. Also, Nvidia plans to manufacture enough chips to require global electricity production to increase by 20-30%.
Doesn’t OpenAI just have the same efficiency issue as computing in general due to hardware from older nodes?
What are bitnet models and what does that change in a nutshell?
What are bitnet models and what does that change in a nutshell?
Read the pitch here: https://github.com/ridgerchu/matmulfreellm
Basically, using ternary weights, all inference-time matrix multiplication can be replaced with much simpler matrix addition. This is theoretically more efficient on GPUs, and astronomically more efficient on dedicated hardware (as adders take up a fraction of the space as multipliers in silicon). This would be particularly fantastic for, say, local inference on smartphones or laptop ASICs.
The catch is no one has (publicly) risked a couple of million dollars to test it with a large model, as (so far) training it isn’t more efficient than “regular” LLMs.
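The “multiplication becomes addition” point can be sketched in a few lines. This is an illustration of the idea only, not the code from the linked repository; `ternary_matvec` and `dense_matvec` are hypothetical names, and real ternary models are understood to involve scaling and quantization steps that are left out here.

```python
import random


def ternary_matvec(weights, x):
    """Matrix-vector product for weights in {-1, 0, +1}, using only add/subtract."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi   # +1 weight: add the input
            elif w == -1:
                acc -= xi   # -1 weight: subtract the input
            # a 0 weight contributes nothing, so it is skipped
        out.append(acc)
    return out


def dense_matvec(weights, x):
    """Reference version that uses ordinary multiplication."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


random.seed(0)
W = [[random.choice((-1, 0, 1)) for _ in range(8)] for _ in range(4)]
x = [random.uniform(-1.0, 1.0) for _ in range(8)]

fast = ternary_matvec(W, x)
reference = dense_matvec(W, x)
print(all(abs(p - q) < 1e-9 for p, q in zip(fast, reference)))  # True
```

Both functions give the same result, but the ternary version never multiplies, which is why adder-only hardware becomes an option.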
Doesn’t OpenAI just have the same efficiency issue as computing in general due to hardware from older nodes?
No one really knows, because they’re so closed and opaque!
But it appears that their models perform relatively poorly for their “size.” Qwen is nearly matching GPT-4 in some metrics, yet is probably an order of magnitude smaller, while Google/Claude and some Chinese models are also pulling ahead.
Only if they were trained on public material.
Doesn’t seem like this helps out all the writers / artists that the LLM stole from.
Are you threatening me with a good time?
First of all, whether these LLMs are “illegally trained” is still a matter before the courts. When an LLM is trained it doesn’t literally copy the training data, so it’s unclear whether copyright is even relevant.
Secondly, I don’t think that making these models “public domain” would have the negative effects that people angry about AI think it would. When a company is running a closed model internally, like ChatGPT for example, the model is never available for download in the first place. It doesn’t matter if it’s public domain or not because you can’t get a copy of it. When a company releases an open-weight model for public use, on the other hand, they usually encumber them with some sort of license that makes them harder for competitors to monetize or build on. Making those public-domain would greatly increase their utility. It might make future releases less likely, but in the meantime it’ll greatly enhance AI development.
The LLM does reproduce copyrighted data though.
How?
*it can produce data identical to data that has been copyrighted before
Your data is worthless. Only Linux type zealots (conspiracy theorists) harp on that. Ever copied a meme and shared it elsewhere?
Negative reputation troll.
Not only that, but copyright applies to copying, not reading, which is what it’s doing.
I mean, if we really are following the spirit of copyright, since no-one at open AI or other companies developed matrix and vector multiplication (operations existing in the public domain because Platonism is a thing).
Edit: oh my, I guess the consensus is that stealing the work of mathematicians is ok (or more, classifying our constructions as discoveries).
Simple operations like vector multiplication are not works for the purposes of copyright law. If you invented an entirely new form of math, complete with novel formulae, you could conceivably assert patent rights and/or copyright over it, especially if you published a textbook. It would be more relevant, however, to discuss complex algorithms, such as for data compression. Those can certainly be patented. And, when implemented as a computer program, can certainly be copyrighted.
But if you’re just defining one simple operation, yeah, you’re unlikely to be able to assert any rights over it.
Ehh no, you genuinely can’t patent any form of mathematics.
Mathematics falls under “exists in nature” (if you are a Platonist) or “abstract ideas” (gets even clear-thinking Constructivists). So they’re excluded from patents and copyrights no matter how complex the system.
Textbooks usually belong to the publisher (academics commonly have to pirate their own papers), so that’s usually a bust.
You might be able to patent an algorithm associated with a branch of mathematics, but that’s trickier than you think. Blank slate development can, and does, happen (see Compaq’s reimplementation of IBM’s BIOS). You’re banking on it not being reverse-engineerable (spoiler, don’t take that bet if you’ve published your proofs!).
You can’t patent math, though you can copyright a specific explanation of math concepts.
If OpenAI (or any AI company) is including copyrighted works in their solution, that’s a copyright violation and should be treated as such. But if they’re merely using the information from a copyrighted work but not violating the copyright itself, they’re fine.
That’s rather the irony - mathematics takes a great deal of work and creativity. You can’t copyright mathematical work; but, put a set of lines together and shade in the polygons created and suddenly it becomes copyrightable. Somehow one is a creative work whose author requires protection, and the other is volunteered for involuntary public service.
The reason mathematics cannot be copyrighted: because it’s a “discovery”, rather than a “creation” (very much a point of view, and far from irrefutable fact). In mathematics, one should be aware that the concept and its explanation (proof) are much the same thing.
All in all, the argument is either mathematical work should fall under copyright (an abhorrent idea), or copyright should be abolished as it rarely (if ever) does much good.
The point of copyright is to protect creators from having their work stolen.
If an indie artist creates a work, a larger company could copy that work and distribute it as their own, and since that larger company has deeper pockets than the indie artist, they can flood the market before the original artist has a chance to profit from it. Or perhaps a larger org creates an expensive work (game, movie, etc), and anyone could redistribute it without paying the original creator.
Without copyright, we’d get way more paywalls, invasive DRM, etc. We get a taste of that today, but it can get way worse.
So we definitely need copyright, it just needs to be a lot shorter (say, 10-20 years).
I mean, the point was definitely stated as protecting creators. We’ve seen some solid David Vs Goliath stories of artists taking people who steal their work down.
However, this isn’t the reality for the majority of copyright. A lot of it just ties up works to companies owned by speculative shareholders (think of the lord of the rings).
Limits to duration would definitely help this, and we’d be on the same page there. However, I do still wonder if it shouldn’t be shorter for certain things (e.g. medical treatments or manufacturing), with the option of a public domain buyout to cover (reasonable, non-inflated) research costs.
medical treatments or manufacturing
Both would be patents, no? Those are 20 years in the US, whereas copyright is 70 years after author’s death (or 95-120 years for “work for hire” works). Both should be reduced, but they protect very different things and need different considerations.
What is this perspective?
Oh, that copyright is bollocks. If you follow its intent, you should be including academics, and that state of affairs would be abhorrent (we’d stagnate).
I see the issue as more like thought policing is the inevitable outcome of calling training copyright infringement because there is no difference between a person that recalls information and talks about it with others and the intended use of published information for training. If training an AI with all the knowledge a person learns in a similar manner is somehow wrong, then the inevitable long term way this plays out is a Minority Report like dystopia. It sets the precedent for prosecution of people for their thoughts or intentions and not their actions. This kind of thought policing existed in the darkest depths of the medieval era, or even into more recent eras of witch hunts or McCarthyism. Perhaps we are on the brink of another such dark era.
As far as I am aware:
- Copyright is intended to protect someone from another person copying their work for financial gain, or, to be much more specific, copying work for direct gain using any form of complex social hierarchy such as awards, reputation, or monetary gain.
- What copyright does not protect is the dissemination of knowledge as it relates to publicly published works.
- One has the choice to remain the sole proprietor of one’s knowledge, but to publish publicly is to relinquish ownership of the information contained within.
- Principally, copyright protects that you were the first to write it, and the way in which you wrote it, but it does nothing to protect the knowledge contained within. If a person recalls that knowledge, they are not required to state a citation when speaking aloud, or in some way making use of that knowledge.
- Copyright also has a scope of intent, and that primarily involves competitive works from one’s peers and excludes the scope of general knowledge and usefulness to society at large.
I’m not trying to mock you, or say you are right or wrong. Quite frankly, I don’t think in these terms, or care about the kinds of people who do. I’m heavily abstracted and intuitively driven to understand. I believe everything that is not intuitive is simply not fully understood yet. However naïve that may be is irrelevant here. I’m of the bias that those with something to gain often lack objective thinking and show a measure of envy when unexpected changes occur in society. I’m not accusing you, but only sharing the most minor of biases I am aware of while trying to say I want to understand. I would like to know if there is anything in the framework I just laid out that is overlooked. I would like to better understand why you find this issue upsetting. I’m one of the most flawed and openly human people on Lemmy. Look at my history if in doubt. I have no skin in this game, just curiosity.
My view is that of a scholar - one who does devote a large part of their life to freely creating and disseminating knowledge. I do indeed hold a strong bias here, one I’m happy to admit.
Much of the time, when I’ve run across copyright, it is rarely (if ever, come to think of it) in the name of the author (a common requirement of journals being the giving up of ownership of one’s work). It normally falls to a company; one usually driven by shareholder value with little (if any) concern for the author’s rights. This tends to be the rule rather than the exception, and I’d argue that copyright in its current incarnation merely provides a legal avenue to steal the work of another, or hold to ransom their works from future generations. This contradicts the first point, and also the second (paywalled papers); indeed the lack of availability of academic works (created for free, or with public funding) is, I believe, a key driver of inequality in this world.
One can withhold or even selectively share knowledge, and history will never know what that has cost us.
In terms of AI training, I wouldn’t say it is copyright infringement even in spirit, and I say this as one whose works are vomited out verbatim by LLMs when questioned about the field. The comparison with speaking is an interesting one, for we generally do try and attribute ideas if we hold the speaker in esteem, or feel their name will enhance our point. An AI, however, is not speaking of their own volition, but is instead acting in the interest of the company hosting them (and so would fall under the professional label rather than the personal). This might contradict your final point, if one assumes AI progresses as a subscription product (which looks likely).
I think your framework has merit, mostly because it is built on ideals (and we need more such thinking in the world); however, it does not quite match the observed data. Though, it does suggest the rules a better incarnation of copyright could adhere to.
More so, I think no-one has an issue with training publicly available models - it’s the ones under copyright themselves people are leery of.
I wholeheartedly agree about proprietary models. My perspective is that of someone who saw the initial momentum of AI and only runs models on my own hardware. What you are seeing with your work is not possible from a base model in practice. There are too many holes that need to align in the Swiss cheese to make that possible, especially with softmax settings for general use. Even with deterministic softmax settings this doesn’t happen. I’ve even tried overtraining with a fine-tune, and it won’t reproduce verbatim. What you are seeing is only possible with an agentic RAG architecture. RAG is retrieval-augmented generation: the model’s output is augmented with text retrieved from a database. The common open source libraries are LangChain and ChromaDB for the agent and database. The agent is just a group of models running at the same time, with a central model capable of function calling in the model loader code.
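To make the retrieval idea above concrete, here is a minimal sketch in plain Python. It does not use LangChain or ChromaDB, and every name in it (`PASSAGES`, `retrieve`, `build_prompt`) is hypothetical. It stands in for a vector database with simple word-count cosine similarity, which is an assumption made for brevity rather than how those libraries work. The point is only that the stored passage is pasted into the prompt as-is, so verbatim text can reach the output without the base model having memorized it.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a document store; a real setup would use a vector database.
PASSAGES = [
    "The scheduler picks the next runnable process.",
    "Virtual memory maps pages to frames.",
    "Cats sleep most of the day.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, passages, k=1):
    """Return the k passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(passages, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:k]

def build_prompt(query, passages, k=1):
    """Paste the retrieved passages verbatim ahead of the question."""
    context = "\n".join(retrieve(query, passages, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

if __name__ == "__main__":
    print(build_prompt("How does virtual memory map pages?", PASSAGES))
```

The prompt printed at the end is what would be handed to the model; whatever the model then quotes from the context came from the database, not from its weights.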
I can coax stuff out of a base model that is not supposed to be there, but it is so extreme and unreliable that it is not at all something useful. If I give a model something like 10k tokens (words/fragments) of lead-in then I can start a sentence of the reply and the model might get a sentence or two correct before it goes off on some tangent. Those kinds of paths through the tensor layers are like walking on a knife edge. There is absolutely no way to get that kind of result at random or without extreme efforts. The first few words of a model’s reply are very important too, and with open source models I can control every aspect. Indeed, I run models from a text editor interface where I see and control every aspect of generation.
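For readers unfamiliar with the "softmax settings" mentioned in this comment, the sketch below assumes they refer to sampling temperature, which is my reading and not something the comment states. The function names and the example logits are hypothetical. A temperature of 0 is treated as the deterministic case (always take the highest-scoring token), while higher temperatures flatten the distribution and make tangents more likely.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(logits, temperature, rng):
    """Pick a token index: greedy when temperature is 0, sampled otherwise."""
    if temperature == 0:
        # Deterministic setting: always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
    rng = random.Random(0)
    print("greedy pick:", pick_token(example_logits, 0, rng))
    print("probabilities at T=0.5:", softmax(example_logits, 0.5))
    print("probabilities at T=2.0:", softmax(example_logits, 2.0))
```

This only illustrates the sampling step for a single token; it says nothing by itself about how often a real model reproduces long passages.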
I tried to create a RAG for learning Operating Systems: Principles and Practice, Computer Systems: A Programmer’s Perspective, and Linux Kernel Development as the next step in learning CS on my own. I learned a lot about the limits of present AI systems. They have a lot of promise, but progress mostly involves peripheral model loader code more than it does the base model IMO.
I don’t know the answer to the stagnation and corruption of academia in so many areas. I figure there must be a group somewhere that has figured out civilization is going to collapse soon so why bother.
-