OpenAI now tries to hide that ChatGPT was trained on copyrighted books, including J.K. Rowling’s Harry Potter series

A new research paper laid out ways in which AI developers should try and avoid showing LLMs have been trained on copyrighted material.

  • Blapoo@lemmy.ml
    English
    arrow-up
    2
    ·
    1 year ago

    We have to distinguish between LLMs

    • Trained on copyrighted material and
    • Outputting copyrighted material

    They are not one and the same

    • TwilightVulpine@lemmy.world
      English
      arrow-up
      0
      ·
      1 year ago

      Should we distinguish it though? Why shouldn’t (and didn’t) artists have a say if their art is used to train LLMs? Just like publicly displayed art doesn’t grant permission to copy it and use it for other unspecified purposes, it would be reasonable that the same would apply to AI training.

      • Blapoo@lemmy.ml
        English
        arrow-up
        1
        ·
        1 year ago

        Ah, but that’s the thing. Training isn’t copying. It’s pattern recognition. If you train a model on “The dog says woof” and then ask the model “What does the dog say”, it’s not guaranteed to say “woof”.

        Similarly, just because a model was trained on Harry Potter, all that means is it has a good corpus of how the sentences in that book go.

        Thus the distinction. Can I train on a comment section discussing the book?

      • scv@discuss.online
        English
        arrow-up
        1
        ·
        1 year ago

        Legally the output of the training could be considered a derived work. We treat brains differently here, that’s all.

        I think the current intellectual property system makes no sense and AI is revealing that fact.

      • TropicalDingdong@lemmy.world
        English
        arrow-up
        0
        arrow-down
        1
        ·
        1 year ago

        I think this brings up broader questions about the currently quite extreme interpretation of copyright. Personally I don’t think it’s wrong to sample from or create derivative works from something that is accessible. If it’s not behind lock and key, it’s free to use. If you have a problem with that, then put it behind lock and key. No one is forcing you to share your art with the world.

    • Tetsuo@jlai.lu
      English
      arrow-up
      0
      arrow-down
      1
      ·
      1 year ago

      Output from an AI has just been recently considered as not copyrightable.

      I think it stemmed from the recent actors’ strikes.

      It was stated that only work originating from a human can be copyrighted.

      • Anders429@lemmy.world
        English
        arrow-up
        1
        ·
        1 year ago

        Output from an AI has just been recently considered as not copyrightable.

        Where can I read more about this? I’ve seen it mentioned a few times, but never with any links.

        • Even_Adder@lemmy.dbzer0.com
          English
          arrow-up
          1
          ·
          1 year ago

          They clearly only read the headline. If they’re talking about the ruling that came out this week, that whole thing was about trying to give an AI authorship of a work generated solely by a machine and having the copyright go to the owner of the machine through the work-for-hire doctrine. So an AI itself can’t be an author or hold a copyright, but humans using one can still be copyright holders of any qualifying works.

  • fubo@lemmy.world
    English
    arrow-up
    1
    ·
    edit-2
    1 year ago

    If I memorize the text of Harry Potter, my brain does not thereby become a copyright infringement.

    A copyright infringement only occurs if I then reproduce that text, e.g. by writing it down or reciting it in a public performance.

    Training an LLM from a corpus that includes a piece of copyrighted material does not necessarily produce a work that is legally a derivative work of that copyrighted material. The copyright status of that LLM’s “brain” has not yet been adjudicated by any court anywhere.

    If the developers have taken steps to ensure that the LLM cannot recite copyrighted material, that should count in their favor, not against them. Calling it “hiding” is backwards.

    • Eccitaze@yiffit.net
      English
      arrow-up
      0
      ·
      1 year ago

      If Google took samples from millions of different songs that were under copyright and created a website that allowed users to mix them together into new songs, they would be sued into oblivion before you could say “unauthorized reproduction.”

      You simply cannot compare one single person memorizing a book to corporations feeding literally millions of pieces of copyrighted material into a blender and acting like the resulting sausage is fine because “only a few rats fell into the vat, what’s the big deal?”

          • player2@lemmy.dbzer0.com
            English
            arrow-up
            1
            ·
            edit-2
            1 year ago

            The analogy talks about mixing samples of music together to make new music, but that’s not what is happening in real life.

            The computers learn human language from the source material, but they are not referencing the source material when creating responses. They create new, original responses which do not appear in any of the source material.

    • Gyoza Power@discuss.tchncs.de
      English
      arrow-up
      0
      ·
      1 year ago

      Let’s not pretend that LLMs are like people where you’d read a bunch of books and draw inspiration from them. An LLM does not think nor does it have an actual creative process like we do. It should still be a breach of copyright.

      • efstajas@lemmy.world
        English
        arrow-up
        1
        ·
        1 year ago

        … you’re getting into philosophical territory here. The plain fact is that LLMs generate cohesive text that is original and doesn’t occur in their training sets, and it’s very hard if not impossible to get them to quote back copyrighted source material to you verbatim. Whether you want to call that “creativity” or not is up to you, but it certainly seems to disqualify the notion that LLMs commit copyright infringement.

  • Uriel238 [all pronouns]@lemmy.blahaj.zone
    English
    arrow-up
    1
    ·
    edit-2
    1 year ago

    Training AI on copyrighted material is no more illegal or unethical than training human beings on copyrighted material (from library books or borrowed books, no less!). And trying to challenge the veracity of generative AI systems on the notion that they were trained on copyrighted material only raises the specter that IP law has lost its validity as a public good.

    The only valid concern about generative AI is that it could displace human workers (or swap out skilled jobs for menial ones) which is a problem because our society recognizes the value of human beings only in their capacity to provide a compensation-worthy service to people with money.

    The problem is this is a shitty, unethical way to determine who gets to survive and who doesn’t. All the current controversy about generative AI does is kick this can down the road a bit. But we’re going to have to address soon that our monied elites will be glad to dispose of the rest of us as soon as they can.

    Also, amateur creators are as good as professionals, given the same resources. Maybe we should look at creating content by other means than for-profit companies.

  • RadialMonster@lemmy.world
    English
    arrow-up
    1
    ·
    1 year ago

    What if they scraped a whole lot of the internet, and those excerpts were in random blogs and posts and quotes and memes etc. all over the place? They didn’t ingest the material directly, or knowingly.

  • rosenjcb@lemmy.world
    English
    arrow-up
    1
    ·
    edit-2
    1 year ago

    The powers that be have done a great job convincing the layperson that copyright is about protecting artists and not publishers. It’s historically inaccurate and you can discover that copyright law was pushed by publishers who did not want authors keeping second hand manuscripts of works they sold to publishing companies.

    Additional reading: https://en.m.wikipedia.org/wiki/Statute_of_Anne

  • Technoguyfication@lemmy.ml
    English
    arrow-up
    1
    ·
    1 year ago

    People are acting like ChatGPT is storing the entire Harry Potter series in its neural net somewhere. It’s not storing or reproducing text in a 1:1 manner from the original material. Certain material, like very popular books, has likely been interpreted tens of thousands of times due to how many times it was reposted online (and therefore how many times it appeared in the training data).

    Just because it can recite certain passages almost perfectly doesn’t mean it’s redistributing copyrighted books. How many quotes do you know perfectly from books you’ve read before? I would guess quite a few. LLMs are doing the same thing, but on mega steroids with a nearly limitless capacity for information retention.

    • Teritz@feddit.de
      English
      arrow-up
      0
      arrow-down
      1
      ·
      1 year ago

      Using copyrighted work, such as art, as examples still influences the AI, which they make a profit from.

      If they use my works, then they need to pay, that’s it.

      • coheedcollapse@lemmy.world
        English
        arrow-up
        1
        ·
        1 year ago

        Still kinda blows my mind how like the most socialist people I know (fellow artists) turned super capitalist the second a tool showed like an inkling of potential to impact their bottom line.

        Personally, I’m happy to have my work scraped and permutated by systems that are open to the public. My biggest enemy isn’t the existence of software scraping an open internet, it’s the huge companies who see it as a way to cut us out of the picture.

        If we go all copyright crazy on the models for looking at stuff we’ve already posted openly on the internet, the only companies with access to the tools will be those who already control huge amounts of data.

        I mean, for real, it’s just mind-blowing seeing the entire artistic community pretty much go full-blown “Metallica with the RIAA” after decades of making the “you wouldn’t download a car” joke.

        • angstylittlecatboy@reddthat.com
          English
          arrow-up
          1
          ·
          edit-2
          1 year ago

          I feel like a lot of internet people (not even just socialists) go from seeing copyright as at best a compromise that allows the arts to have value under capitalism to treating it like a holy doctrine when the subject of LLMs comes up.

          Like, people who will say “piracy is always okay” will also say “ban AI, period” (and misrepresent organizations that want regulations on its use as wanting a full ban).

          Like, growing up with an internet full of technically illegal content (or grey area at best) like fangames and YouTube Poops made me a lifelong copyright skeptic. It’s outright confusing to me when people take copyright as seriously as this.

        • Sir_Kevin@lemmy.dbzer0.com
          English
          arrow-up
          1
          ·
          1 year ago

          Fuckin preach! I feel like I’m surrounded by children that didn’t live through the many other technologies that have come along and changed things. People lost their shit when Photoshop became mainstream, when music started using samples, etc. AI is here to stay. These same people are probably listening to autotuned music all day while they complain on the internet about AI looking at their art.

  • scarabic@lemmy.world
    English
    arrow-up
    0
    ·
    1 year ago

    One of the first things I ever did with ChatGPT was ask it to write some Harry Potter fan fiction. It wrote a short story about Ron and Harry getting into trouble. I never said the word McGonagall and yet she appeared in the story.

    So yeah, case closed. They are full of shit.

    • PraiseTheSoup@lemm.ee
      English
      arrow-up
      1
      ·
      1 year ago

      There is enough non-copyrighted Harry Potter fan fiction out there that it would not need to be trained on the actual books to know all the characters. While I agree they are full of shit, your anecdote proves nothing.

      • Cosmic Cleric@lemmy.world
        English
        arrow-up
        0
        ·
        1 year ago

        While I agree they are full of shit, your anecdote proves nothing.

        Why? Because you say so?

        He brings up a valid point, it seems transformative.

        • LittleLordLimerick@lemm.ee
          English
          arrow-up
          1
          ·
          1 year ago

          The anecdote proves nothing because the model could potentially have known of the McGonagall character without ever being trained on the books, since that character appears in a lot of fan fiction. So their point is invalid and their anecdote proves nothing.

  • TropicalDingdong@lemmy.world
    English
    arrow-up
    1
    arrow-down
    1
    ·
    1 year ago

    It’s a bit pedantic, but I’m not really sure I support this kind of extremist view of copyright and the scale of what’s being interpreted as ‘possessed’ under the idea of copyright. Once an idea is communicated, it becomes a part of the collective consciousness. Different people interpret and build upon that idea in various ways, making it a dynamic entity that evolves beyond the original creator’s intention. It’s like the issues with sampling beats or records in the early days of hip-hop. It’s like the very principle of an idea goes against this vision; more than that, once you put something out into the commons, it’s irretrievable. It’s not really yours any more once it’s been communicated. I think if you want to keep an idea truly yours, then you should keep it to yourself. Otherwise you are participating in a shared vision of the idea. You don’t control how the idea is interpreted, so it’s not really yours any more.

    Whether that’s ChatGPT or Public Enemy is neither here nor there to me. The idea that a work like Peter Pan is still possessed is a very real but very silly and obvious malady of this weirdly accepted but very extreme view of the ability to possess an idea.

    • Bogasse@lemmy.world
      English
      arrow-up
      0
      ·
      1 year ago

      Well, I’d consider agreeing if the LLMs were considered as a generic knowledge database. However I had the impression that the whole response from OpenAI & co. to this copyright issue is “they build original content”, both for LLMs and stable diffusion models. Now that they started this line of defence I think that they are stuck with proving that their “original content” is not derived from copyrighted content 🤷

      • TropicalDingdong@lemmy.world
        English
        arrow-up
        0
        arrow-down
        1
        ·
        1 year ago

        Well, I’d consider agreeing if the LLMs were considered as a generic knowledge database. However I had the impression that the whole response from OpenAI & co. to this copyright issue is “they build original content”, both for LLMs and stable diffusion models. Now that they started this line of defence I think that they are stuck with proving that their “original content” is not derived from copyrighted content 🤷

        Yeah I suppose that’s on them.

    • Laticauda@lemmy.ca
      English
      arrow-up
      0
      ·
      edit-2
      1 year ago

      AI isn’t interpreting anything. This isn’t the sci-fi style of AI that people think of; that’s general AI. This is narrow AI, which is really just an advanced algorithm. It can’t create new things with intent and design, it can only regurgitate a mix of pre-existing stuff based on narrow guidelines programmed into it to try and keep it coherent, with no actual thought or interpretation involved in the result. The issue isn’t that it’s derivative, the issue is that it can only ever be inherently derivative without any intentional interpretation or creativity, and nothing else.

      Even collage art has to qualify as fair use to avoid copyright infringement if it’s being done for profit, and fair use requires it to provide commentary, criticism, or parody of the original work used (which requires intent). Even if it’s transformative enough to make the original unrecognizable, if the majority of the work is not your own art, then you need to get permission to use it, otherwise you aren’t automatically safe from getting in trouble over copyright. Even using images for Photoshop involves Creative Commons and commercial use licenses. Fanart and fanfic are also considered a grey area, and the only reason more of a stink isn’t kicked up over them regarding copyright is that they’re generally beneficial to the original creators, and credit is naturally provided by the nature of fan works so long as someone doesn’t try to claim the characters or IP as their own. So most creators turn a blind eye to the copyright aspect of the genre, but if any ever did want to kick up a stink, they could, and have in the past, like with Anne Rice. And as a result most fanfiction sites do not allow writers to profit off of fanfics, or advertise fanfic commissions. And those are cases with actual humans being the ones to produce the works based on something that inspired them or that they are interpreting. So even human-made derivative works have rules and laws applied to them as well. AI isn’t a creative force with thoughts and ideas and intent, it’s just a pattern recognition and replication tool, and it doesn’t benefit creators when it’s used to replace them entirely, like Hollywood is attempting to do (among other corporate entities). Viewing AI at least as critically as actual human beings is the very least we can do, as well as establishing protection for human creators so that they can’t be taken advantage of because of AI.

      I’m not inherently against AI as a concept and as a tool for creators to use, but I am against AI works with no human input being used to replace creators entirely, and I am against using works to train it without the permission of the original creators. Even in the artist/writer/etc communities it’s considered to be a common courtesy to credit other people/works that you based a work on or took inspiration from, even if what you made would be safe under copyright law regardless. Sure, humans get some leeway in this because we are imperfect meat creatures with imperfect memories and may not be aware of all our influences, but a coded algorithm doesn’t have that excuse. If the current AIs in circulation can’t function without being fed stolen works without credit or permission, then they’re simply not ready for commercial use yet as far as I’m concerned. If it’s never going to be possible, which I just simply don’t believe, then it should never be used commercially period. And it should be used by creators to assist in their work, not used to replace them entirely. If it takes longer to develop, fine. If it takes more effort and manpower, fine. That’s the price I’m willing to pay for it to be ethical. If it can’t be done ethically, then imo it shouldn’t be done at all.

      • Kogasa@programming.dev
        English
        arrow-up
        1
        ·
        1 year ago

        Your broader point would be stronger if it weren’t framed around what seems like a misunderstanding of modern AI. To be clear, you don’t need to believe that AI is “just” a “coded algorithm” to believe it’s wrong for humans to exploit other humans with it. But to say that modern AI is “just an advanced algorithm” is technically correct in exactly the same way that a blender is “just a deterministic shuffling algorithm.” We understand that the blender chops up food by spinning a blade, and we understand that it turns solid food into liquid. The precise way in which it rearranges the matter of the food is both incomprehensible and irrelevant. In the same way, we understand the basic algorithms of model training and evaluation, and we understand the basic domain task that a model performs. The “rules” governing this behavior at a fine level are incomprehensible and irrelevant-- and certainly not dictated by humans. They are an emergent property of a simple algorithm applied to billions-to-trillions of numerical parameters, in which all the interesting behavior is encoded in some incomprehensible way.

    • treefrog@lemm.ee
      English
      arrow-up
      0
      ·
      1 year ago

      If you sample someone else’s music and turn around and try to sell it, without first asking permission from the original artist, that’s copyright infringement.

      So, if the same rules apply, as your post suggests, OpenAI is also infringing on copyright.

      • TropicalDingdong@lemmy.world
        English
        arrow-up
        0
        arrow-down
        1
        ·
        1 year ago

        If you sample someone else’s music and turn around and try to sell it, without first asking permission from the original artist, that’s copyright infringement.

        I think you completely and thoroughly do not understand what I’m saying or why I’m saying it. Nowhere did I suggest that I do not understand modern copyright. I’m saying I’m questioning my belief in this extreme interpretation of copyright, which is represented by exactly what you just parroted. This interpretation is not only functionally and materially unworkable, but also antithetical to a reasonable understanding of how ideas and communication work.

  • paraphrand@lemmy.world
    English
    arrow-up
    0
    ·
    1 year ago

    Why are people defending a massive corporation that admits it is attempting to create something that will give them unparalleled power if they are successful?

    • bamboo@lemm.ee
      English
      arrow-up
      1
      ·
      1 year ago

      Mostly because fuck corporations trying to milk their copyright. I have no particular love for OpenAI (though I do like their product), but I do have great disdain for already-successful corporations that would hold back the progress of humanity because they didn’t get paid (again).

  • Jat620DH27@lemmy.world
    English
    arrow-up
    0
    ·
    1 year ago

    I thought everyone knows that OpenAI has the same access to any books, knowledge that human beings have.

    • Redditiscancer789@lemmy.world
      English
      arrow-up
      0
      ·
      1 year ago

      Yes, but it’s what it is doing with it that is the murky grey area. Anyone can read a book, but you can’t use those books for your own commercial stuff. Rowling and other writers are making the case their works are being used in an inappropriate way commercially. Whether they have a case, I dunno, IANAL, but I could see the argument at least.

      • Touching_Grass@lemmy.world
        English
        arrow-up
        1
        ·
        1 year ago

        Harry Potter uses so many tropes and so much inspiration from other works that came before. How is that different? Wizards of the Coast should sue her into the ground.

  • Sentau@lemmy.one
    English
    arrow-up
    0
    ·
    edit-2
    1 year ago

    I think a lot of people are not getting it. AI/LLMs can train on whatever they want, but when these LLMs are used for commercial reasons to make money, an argument can be made that the copyrighted material has been used in a money-making endeavour. Similar to how using copyrighted clips in a monetized video can get you a strike against your channel, but if the video is not monetized, the chances of YouTube taking action against you are lower.

    Edit - If this was an open source model available for use by the general public at no cost, I would be far less bothered by claims of copyright infringement by the model

    • FMT99@lemmy.world
      English
      arrow-up
      1
      ·
      1 year ago

      But wouldn’t this training and the subsequent output be so transformative that being based on the copyrighted work makes no difference? If I read a Harry Potter book and then write a story about a boy wizard who becomes a great hero, anyone trying to copyright strike that would be laughed at.

    • Tyler_Zoro@ttrpg.network
      English
      arrow-up
      0
      ·
      1 year ago

      AI/LLMs can train on whatever they want, but when these LLMs are used for commercial reasons to make money, an argument can be made that the copyrighted material has been used in a money-making endeavour.

      And does this apply equally to all artists who have seen any of my work? Can I start charging all artists born after 1990, for training their neural networks on my work?

      Learning is not and has never been considered a financial transaction.

      • maynarkh@feddit.nl
        English
        arrow-up
        0
        ·
        edit-2
        1 year ago

        Actually, it has. The whole concept of copyright is relatively new, and corporations absolutely tried to have people who learned proprietary copyrighted information not be able to use it in other places.

        It’s just that labor movements got such non-compete agreements thrown out of our society, or at least severely restricted on humanitarian grounds. The argument is that a human being has the right to seek happiness by learning and using the proprietary information they learned to better their station. By the way, it took a lot of violent convincing for us to have this.

        So yes, knowledge and information learned is absolutely within the scope of copyright as it stands, it’s only that the fundamental rights that humans have override copyright. LLMs (and companies for that matter) do not have such fundamental rights.

        Copyright by the way is stupid in its current implementation, but OpenAI and ChatGPT do not get to get out of it IMO just because it’s “learning”. We humans ourselves are only getting out of copyright because of our special legal status.

        • Even_Adder@lemmy.dbzer0.com
          English
          arrow-up
          0
          ·
          1 year ago

          You kind of do. Fair use protects reverse engineering, indexing for search engines, and other forms of analysis that create new knowledge about works or bodies of works. These models are meant to be used to create new works which is where the “generative” part of generative models comes in, and the fact that the models consist only of original analysis of the training data in comparison with one another means as your tool, they are protected.

          • maynarkh@feddit.nl
            English
            arrow-up
            0
            ·
            1 year ago

            https://en.wikipedia.org/wiki/Fair_use

            Fair use only works if what you create is to reflect on the original and not to supersede it. For example if ChatGPT gobbled up a work on the reproduction of fireflies, if you ask it a question about the topic and it just answers, that’s not fair use since you made the original material redundant. If it did what a search engine would do and just tell you that “here’s where you can find it, you might have to pay for it”, that’s fair use. This is of course US law, so it may be different everywhere, and US law is weird so the courts may say anything.

            That’s the gist of it, fair use is fine as long as you are only creating new information and only use the copyrighted old work as is absolutely necessary for your new information to make sense, and even then, you can’t use so much of the copyrighted work that it takes away from the value of it.

            Otherwise if I pirated a movie and put subtitles on it, I could argue it’s fair use since it’s new information and transformative. If I released the subtitles separately, that would be a strong argument for fair use. If I included a 10 sec clip in it to show my customers what the thing is like in action, then that may be argued. If it’s the pivotal 10 seconds that spoils the whole movie, that’s not fair use, since I took away from the value of the original.

            ChatGPT ate up all of these authors’ works and for some, it may take away from the value they have created. It’s telling that OpenAI is trying to be shifty about it as well. If they had a strong argument, they’d want to settle it as soon as possible as this is a big stormcloud on their company IP value. And yeah it sucks that people created something that may turn out to not be legal because some people have a right to profit from some pieces of capital assets, but that’s the story of the world the past 50 years.

            • Even_Adder@lemmy.dbzer0.com
              English
              arrow-up
              0
              ·
              1 year ago

              First of all, fair use is not as simple or clear-cut a concept, applicable uniformly to all cases, as you make it out to be. It’s flexible and context-dependent, resting on careful analysis of four factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use upon the potential market. No one factor is more important than the others, and it is possible to have a fair use defense even if you do not meet all the criteria of fair use.

              Generative models create new and original works based on their weights, such as poems, stories, code, essays, songs, images, video, celebrity parodies, and more. These works may have their own artistic merit and value, and may be considered transformative uses that add new expression or meaning to the original works. Providing your own explanation on the reproduction of fireflies isn’t making the original redundant, nor is it reproducing the original, so it’s likely fair use. Plenty of competing works explaining the same thing exist, and they’re not invalid because someone got to it first, or because they’re based on the same sources.

              Your example about subtitling a movie doesn’t meet the criteria for fair use because subtitling a movie isn’t a transformative use. It doesn’t add any expression or meaning, you doubly reproduce the original work in a different language, and it isn’t commentary, criticism, or parody. Subtitling a movie also involves using the entire work, which again weighs against fair use. The more of the original you use, the less likely it’s fair use. This might also have a negative effect on the potential market for the original, since it could reduce demand for the original or its authorized translations. Now, subtitling a short clip from a movie to illustrate a point in an educational video or a review would likely fly.

              Finally, uses that can result in lost sales in already established markets tend to be found not to be fair use by the courts. This doesn’t mean that every use that affects the market is unfair. That would mean you wouldn’t be able to create a parody movie or use snippets of a work for a review. These can be considered fair use because they comment on or criticize the original work, unlike uploading a full movie, song, or translated script. Though I could be getting the wrong read here, since you didn’t explain how you came to any of your conclusions.

              I think you’re being too narrow and rigid with your interpretation of fair use, and I don’t think you understand the doctrine that well. I recommend reading this article by Kit Walsh, a senior staff attorney at the EFF, a digital rights group that recently won a historic case: border guards now need a warrant to search your phone. I’d like to hear your thoughts.

              • maynarkh@feddit.nl
                link
                fedilink
                English
                arrow-up
                0
                ·
                edit-2
                1 year ago

                I am not a lawyer by the way, I don’t even live in the US, so what I write is just my opinion.

                But fair use seems a ridiculous defense when we talk about the GitHub Copilot case, which is the first tangible lawsuit about it that I know of. The plaintiffs lay out the case of a book for JavaScript developers as their example. The objective of the book is to give you exercises in JavaScript development; I would get the book if I wanted to do JavaScript exercises. The book is copyrighted under a share-alike, attribution-required license. The defendants, GitHub and OpenAI, don’t honour the license with Copilot and Codex. They claim fair use.

                So with the four factors:

                • the purpose and character of your use: Well, they present their JavaScript exercises as original work while it’s obvious they are not; they are reproducing the task they want letter by letter. It is even missing critical context that makes it hard to understand without the book, so their work does not even stand on its own. Also, they do this for monetary compensation while not respecting the original license, which, for someone giving commentary or criticism covered by fair use, would be as trivial as providing a citation of the book. They are also not producing information beyond what’s available in the book. Quite funnily, the plaintiffs mention that the “derivative” work is also not very valuable, as the model answered with an example from a “what’s wrong with this, can you fix it?” section for a question about how to determine if a number is even.

                • the nature of the copyrighted work: It’s freely available; the license only requires that you provide proper attribution if you republish it. It is not impossible to provide use cases based on fair use while honouring the license. There is no monetary or other barrier.

                • the amount and substantiality of the portion taken: All of it, and it is reproduced verbatim.

                • the effect of the use upon the potential market: GitHub Copilot is in the same market as the original work and is competing with it, namely in showing people how to use JavaScript.

                And again, I feel this is one layer. Copyright enforcement has never been predictable, and US courts are not predictable either. I think anything can come of this now that it’s big tech on the defendant side, and they have the resources to fight, unlike random Joe Schmoes caught with bootleg DVDs. Maybe they abolish copyright? Maybe they get an exception? Since US courts have such wide jurisdiction and can effectively make laws, it is still a toss-up. That said, the GitHub Copilot class action case is the one to watch, and so far the judge has denied motions to dismiss the case, so it may go either way.

                Also by the way, the EU has no fair use protections, it only allows very specific exceptions for public criticism and such, none of which fits AI. Going by the example of Copilot, this would mean that EU users can’t use Copilot, and also that anything that was produced with the assistance of Copilot (or ChatGPT for that matter) is not marketable in the EU.

                • Even_Adder@lemmy.dbzer0.com
                  link
                  fedilink
                  English
                  arrow-up
                  1
                  ·
                  edit-2
                  1 year ago

                  I am not a lawyer either, or a programmer for that matter, but the Copilot case looks pretty fucked. We can’t really get a look at the plaintiffs’ examples since they have to be kept anonymous. Generative models’ weights don’t copy and paste from their training data unless there’s been some kind of overfitting, and some cases of similar or identical code snippets might be inevitable given the nature of programming languages and common tasks. If the model was trained correctly, it should only ever see infinitesimally tiny parts of its training data. We also can’t tell how much of the plaintiffs’ code is being used, for the same reasons. The same is true of the plaintiffs’ claims about the “Suggestions matching public code”.

                  This case is still in discovery and mired in secrecy; we might not ever find out what’s going on, even once the proceedings have concluded.

      • zbyte64@lemmy.blahaj.zone
        link
        fedilink
        English
        arrow-up
        0
        ·
        1 year ago

        Ehh, “learning” is doing a lot of lifting. These models “learn” in a way that is foreign to most artists. And that’s ignoring the fact that humans are not capital. When we learn we aren’t building a form of capital; when models learn they are only building a form of capital.

        • Tyler_Zoro@ttrpg.network
          link
          fedilink
          English
          arrow-up
          0
          ·
          1 year ago

          Artists, construction workers, administrative clerks, police and video game developers all develop their neural networks in the same way, a method simulated by ANNs.

          This is not “foreign to most artists”; it’s just that most artists have no idea what the mechanism of learning is.

          The method by which you provide input to the network for training isn’t the same thing as learning.

          • Sentau@lemmy.one
            link
            fedilink
            English
            arrow-up
            0
            ·
            1 year ago

            Artists, construction workers, administrative clerks, police and video game developers all develop their neural networks in the same way, a method simulated by ANNs.

            Do we know enough about how our brain functions and how neural networks function to make this statement?

            • Yendor@reddthat.com
              link
              fedilink
              English
              arrow-up
              1
              ·
              1 year ago

              Do we know enough about how our brain functions and how neural networks function to make this statement?

              Yes, we do. Take a university level course on ML if you want the long answer.

  • Thorny_Thicket@sopuli.xyz
    link
    fedilink
    English
    arrow-up
    0
    ·
    1 year ago

    I don’t get why this is an issue. Assuming they purchased a legal copy of what it was trained on, then what’s the problem? Like, really. What does it matter that it knows a certain book from cover to cover or is able to imitate art styles, etc.? That’s exactly what people do too. We’re just not quite as good at it.

    • Hildegarde@lemmy.world
      link
      fedilink
      English
      arrow-up
      0
      ·
      1 year ago

      A copyright holder has the right to control who may create derivative works based on their copyrighted work. If you want to take someone’s copyrighted work and use it to create something else, you need permission from the copyright holder.

      The one major exception is Fair Use. It is unlikely that AI training is a fair use. However this point has not been adjudicated in a court as far as I am aware.

      • FatCat@lemmy.world
        link
        fedilink
        English
        arrow-up
        1
        ·
        1 year ago

        It is not a derivative work; it is a transformative work. Just like human artists “synthesise” the art they see around them and make new art, so do LLMs.

      • LordShrek@lemmy.world
        link
        fedilink
        English
        arrow-up
        1
        ·
        1 year ago

        this is so fucking stupid though. almost everyone reads books and/or watches movies, and their speech is developed from that. the way we speak is modeled after characters and dialogue in books. the way we think is often from books. do we track down what percentage of each sentence comes from what book every time we think or talk?

  • Tetsuo@jlai.lu
    link
    fedilink
    English
    arrow-up
    0
    arrow-down
    1
    ·
    edit-2
    1 year ago

    If I’m not mistaken AI work was just recently considered as NOT copyrightable.

    So I find it interesting that an AI learning from copyrighted work is an issue even though what will be generated will NOT be copyrightable.

    So even if you generated some copy of Harry Potter you would not be able to copyright it. So in no way could you really compete with the original art.

    I’m not saying that it makes it ok to train AIs on copyrighted art but I think it’s still an interesting aspect of this topic.

    As others probably have stated, the AI may be creating content that is transformative and therefore under fair use. But even if that work is transformative it cannot be copyrighted because it wasn’t created by a human.

    • Even_Adder@lemmy.dbzer0.com
      link
      fedilink
      English
      arrow-up
      1
      ·
      edit-2
      1 year ago

      If you’re talking about the ruling that came out this week, that whole thing was about trying to give an AI authorship of a work generated solely by a machine and having the copyright go to the owner of the machine through the work-for-hire doctrine. So AIs themselves can’t be authors or hold a copyright, but humans using them can still be copyright holders of any qualifying works.

    • habanhero@lemmy.ca
      link
      fedilink
      English
      arrow-up
      1
      ·
      1 year ago

      How do you tell if a piece of work contains AI generated content or not?

      It’s not hard to generate a piece of AI content, put in some hours to smooth out the AI’s signatures / common mistakes, and pass it off as your own. So in practise it’s still easy to benefit from AI systems by masking generated content as largely your own.