From: Alex Bennée
Subject: Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators
Date: Fri, 24 Nov 2023 10:21:17 +0000
User-agent: mu4e 1.11.25; emacs 29.1

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Thu, Nov 23, 2023 at 05:39:18PM -0500, Michael S. Tsirkin wrote:
>> On Thu, Nov 23, 2023 at 05:58:45PM +0000, Daniel P. Berrangé wrote:
>> > The license of a code generation tool itself is usually considered
>> > to be not a factor in the license of its output.
>> 
>> Really? I would find it very surprising if a code generation tool that
>> is not a language model and so is not understanding the code it's
>> generating did not include some code snippets going into the output.
>> It is also possible to unintentionally run afoul of GPL's definition of
>> source code which is "the preferred form of the work for making
>> modifications to it". So even if you have copyright to input, dumping
>> just output and putting GPL on it might or might not be ok.
>
> Consider the C pre-processor. This takes an input .c file, expands
> all the macros, and spits out a new .c file.
>
> The license of the output .c file is determined by the license of the
> input .c file. The license of the CPP impl (whether OSS or proprietary)
> doesn't have any influence on the license of the output file, it cannot
> magically force the output file to be proprietary any more than it can
> force the output file to be GPL.
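
(For concreteness, a minimal sketch of that analogy, using a hypothetical
foo.c rather than anything QEMU-specific:

  /* foo.c - the input, licensed however its author chooses (e.g. GPL-2.0) */
  #define GREETING "hello"
  const char *msg = GREETING;

Running it through the pre-processor with "cpp foo.c" or "gcc -E foo.c"
emits, ignoring line markers, the expanded form:

  const char *msg = "hello";

and the license of that output follows foo.c, not the pre-processor binary.)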

LLMs are just a tool like a compiler (albeit with spookier internals).
The prompt and the instructions are arguably the more important part of
getting good results from the LLM transformation. In fact most of my use
of them so far has been pasting in some existing code and asking for a
review or a transformation of it.

However I totally get that with the various online LLMs you have very
little transparency about what has gone into their training, and therefore
there is a danger of proprietary code being hallucinated out of their
matrices. Conversely, what if I use an LLM like OpenLLaMa:

  https://github.com/openlm-research/open_llama

For that I have fairly exhaustive definitions of what went into the
training data, the most interesting of which is probably the StarCoder
dataset (paper):

  https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view

for which there are tools to detect whether generated code has been lifted
directly from the dataset or is indeed a transformation.
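
The simplest form of that sort of check is just looking for a verbatim
match against a local copy of the corpus. Purely as an illustration (this
is not the actual StarCoder tooling, which does much more, e.g.
near-duplicate detection), it could be as crude as:

  /* lifted.c - hypothetical, crude "was this lifted verbatim?" check:
   * report whether a generated snippet occurs byte-for-byte in a
   * corpus file.  usage: ./lifted snippet.c corpus.c */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static char *slurp(const char *path)
  {
      FILE *f = fopen(path, "rb");
      if (!f) { perror(path); exit(EXIT_FAILURE); }
      fseek(f, 0, SEEK_END);
      long len = ftell(f);
      rewind(f);
      char *buf = malloc(len + 1);
      if (!buf || fread(buf, 1, len, f) != (size_t)len) {
          perror(path);
          exit(EXIT_FAILURE);
      }
      buf[len] = '\0';
      fclose(f);
      return buf;
  }

  int main(int argc, char **argv)
  {
      if (argc != 3) {
          fprintf(stderr, "usage: %s snippet corpus\n", argv[0]);
          return 2;
      }
      char *snippet = slurp(argv[1]);
      char *corpus = slurp(argv[2]);
      printf("%s\n", strstr(corpus, snippet) ?
             "verbatim match found" : "no verbatim match");
      free(snippet);
      free(corpus);
      return 0;
  }

Real detection obviously also has to cope with whitespace and identifier
changes, which is where the dataset-side tooling comes in.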


>
> With regards,
> Daniel

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


