Thursday, March 21, 2024

A short presentation of “Rethinking Machine Unlearning For Large Language Models”

Motivation

In this post we will talk about a topic that first came to my mind during a simple conversation with one of my professors.

During one of the classes I asked him how we can make sure that a model with access to the internet is not corrupted by misleading news or other content that could harm the public it interacts with. His answer was simple: this is a problem that is still being worked on. That is when the idea of machine unlearning (MU) was planted in my head, and in the heads of my colleagues as well.

Introduction

The paper we want to present was written by Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Xiaojun Xu, Yuguang Yao, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, and Yang Liu.

They discuss Large Language Models (LLMs) and various MU methods. More precisely, they acknowledge the exceptional potential of these models to generate text that is close to text written by humans, but they raise a problem that may belong more to the legal or ethical domain than to computer science. The problem is that this ability to absorb massive amounts of data can lead to social biases such as racism, and to legal problems such as jailbreaking or cyberattacks.

A short overview of the MU concept

They cite a paper titled Understanding factors influencing machine unlearning, which tells us that unlearning by retraining from scratch after removing the specific data is considered the gold standard; however, this is “a costly move”.

They also discuss the challenges encountered in MU in the context of LLMs.

1. It is hard to define and localize the unlearning targets

2. The growth of LLMs and the rise of black-box models make it challenging to develop MU techniques that can be adapted

3. Unlearning is under-specified for LLMs (it nevertheless involves a number of fairly complex things)

4. Sensitive information can be reverse-engineered from the edited model

They define the LLM unlearning problem as follows:

(LLM unlearning) How can we efficiently and effectively eliminate the specific influence of the “unlearning targets” and remove the associated model capabilities, while preserving model performance on non-targets?

· Unlearning targets: these are closely tied to the unlearning objectives (e.g. a focus on removing the influence of data, or on removing a model capability)

· Influence erasure: to make sure influences are erased, we must take both the data and the model influences into account simultaneously.

· Unlearning effectiveness: a crucial aspect here is the concept of unlearning scope, which refers to how successfully the influence is erased.

· Unlearning efficiency & feasibility: the costs are quite high, especially when it comes to retraining

Methods for MU:

Gradient ascent and its variants: updates the model parameters by maximizing the probability of misprediction for the samples in the forget set. However, it is not sufficient on its own. Another variant is gradient descent, which minimizes the prediction probability for the data labeled for forgetting.

Localization-informed unlearning: the goal is to identify and localize a subset of the model units that are essential for unlearning. It is important to erase this data so that we are not exposed to cyberattacks.

Input-based vs. model-based: the learnable parameters are provided through input prompts rather than through weights or other components. However, such methods may not necessarily yield an unlearned model, which leads to weak unlearning strategies.
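To make the gradient ascent idea from the list above more concrete, here is a minimal sketch on a toy 1-D logistic regression (not an LLM). The data, the learning rates and all function names are made up for illustration; the only point is that stepping up the loss gradient on the forget set raises the loss on it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(w, b, data):
    """Mean negative log-likelihood of a 1-D logistic model on (x, y) pairs."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def grad(w, b, data):
    """Gradient of the mean NLL with respect to (w, b)."""
    gw = gb = 0.0
    for x, y in data:
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    return gw / len(data), gb / len(data)

def step(w, b, data, lr, ascent=False):
    """One gradient step: descent lowers the loss, ascent raises it."""
    gw, gb = grad(w, b, data)
    sign = 1.0 if ascent else -1.0
    return w + sign * lr * gw, b + sign * lr * gb

# Toy data (hypothetical): samples to keep and samples to forget.
retain_set = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
forget_set = [(0.5, 1), (-0.5, 0)]

# 1) Ordinary training on everything (gradient descent).
w, b = 0.0, 0.0
for _ in range(200):
    w, b = step(w, b, retain_set + forget_set, lr=0.5)
forget_before = nll(w, b, forget_set)
retain_before = nll(w, b, retain_set)

# 2) Unlearning: gradient ascent on the forget set only.
for _ in range(20):
    w, b = step(w, b, forget_set, lr=0.5, ascent=True)
forget_after = nll(w, b, forget_set)
retain_after = nll(w, b, retain_set)
```

In this toy setup the loss on the retain set goes up as well, which is one way to see why gradient ascent is not sufficient on its own.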

Conclusions:

The paper we presented aims to uncover unexplored aspects of LLM unlearning and to present the challenges in this field that are raised by research and practice. These include generality, authenticity, and precision. Going back to the basics of computer science, an algorithm must have properties like the ones above: generality, finiteness, precision.

By presenting this paper we wanted to inform you about this concept and to spark your interest.

The original paper can be found at: “Rethinking Machine Unlearning For Large Language Models”

Wednesday, March 20, 2024

AgentCoder: Multiagent-Code Generation with Iterative Testing and Optimisation

Introduction

   Advances in natural language processing (NLP) have been significantly amplified by the development of transformer-based large language models (LLMs). These models have revolutionized NLP tasks, especially code generation. Despite their advances, challenges remain in balancing the generation of code snippets with the effective generation and execution of test cases.

   The initial structure of the code generation process involves receiving a prompt that is analyzed, understood, and abstracted by the model, which first generates pseudocode satisfying the points identified in the prompt and finally turns the pseudocode into code. But this setup often generates code that does not work.

   Different approaches to optimizing code generation have emerged; the ones worth mentioning are:

Self-Edit: this approach receives tests as part of the prompt and tests the generated code against them. One problem with this approach is that the user has to write the tests that the model will use for code generation;
CodeCoT: to improve on the previous approach, CodeCoT generates both the code and the tests with which the code will be tested. The drawback of this approach is the trade-off between code generation and test generation, since both are carried out in a single conversation by the same agent.

The next level of approaches is AgentCoder, which is presented in the next section. It splits the tasks of code generation, test design, and test execution among three agents. This ensures the functionality of the generated code and removes the trade-off present in CodeCoT, because the agents are independent of one another and work in a much more objective way.

Multi-agent collaboration

A multi-agent system (MAS) is a framework in which multiple autonomous agents (program scripts, software bots, or robots) interact, being able to communicate, cooperate, compete, or negotiate with one another in a shared environment. They can work independently or together to achieve complex goals or solve problems, so integrating LLMs into systems that use multi-agent collaboration is the baseline from which AgentCoder starts, as also stated in the introduction.

Methodology

The process begins with the input of the requirements for code generation, which is carried out by the first agent. Next, the second agent is used; its purpose is to generate tests for checking the correctness of the generated code. The third agent collects the generated code and tests and runs them in a local environment to obtain feedback. If the code passes all the tests, the agent returns it to the user. Otherwise, the error information is sent back to the code generation agent. This operation is repeated until the generated code passes all the tests.

The programmer agent

In the framework, the programmer agent is powered by LLMs. It has to handle two scenarios: code generation and code refinement. For the generation part, the programmer agent uses a Chain-of-Thought (CoT) process to imitate the typical programming workflow, namely methodically breaking tasks down into smaller, more manageable pieces. The four steps the CoT process is instructed with are:

problem understanding and clarification
algorithm and method selection
pseudocode creation
code generation

The code snippets generated by the programmer agent may be incorrect, which causes the test cases provided by the test designer agent to fail. In that case, the programmer agent receives feedback from the other agents in order to refine the code.

The test designer agent

This agent is also powered by LLMs and is the framework component meant to test the code and provide reliable feedback to the programmer agent, so that the latter can optimize the code iteratively. The prompts used by the test designer agent in this framework were designed to meet the following three expectations:

to generate basic test cases
to cover edge test cases
to cover large-scale inputs

The test executor agent

Unlike the other agents, the test executor agent is implemented in the framework as a Python script that interacts with the local environment and with the other two agents.

Upon receiving the code snippets from the programmer agent and the test cases generated by the test designer agent, the test executor agent validates them in the local environment. If all test cases pass, it returns the code to the human developer. Otherwise, it returns the error information to the programmer agent so that it can fix the identified mistake.
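As a rough idea of what such a script could look like, here is a minimal sketch that runs the candidate code together with its tests in a fresh Python process and reports either success or the captured error text. The function name, the temporary file name and the timeout are assumptions made for this example; this is not the script used by the authors.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def execute_tests(code, tests, timeout_s=10):
    """Run code followed by tests in a separate process; return (passed, error_text)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n" + tests + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout_s} seconds"
    if result.returncode == 0:
        return True, ""
    # The traceback on stderr is what would be sent back to the programmer agent.
    return False, result.stderr.strip()

good_ok, good_err = execute_tests("def square(x):\n    return x * x", "assert square(3) == 9")
bad_ok, bad_err = execute_tests("def square(x):\n    return x + x", "assert square(3) == 9")
```

Running the code in a separate process keeps a crash or an infinite loop in the generated code from taking down the executor itself.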

The model is evaluated by answering six questions:

How well does AgentCoder perform?

an improvement of 32.7% over the GPT-4 baseline
8.8% more than the state-of-the-art CodeCoT approaches

How do the different agents contribute to the whole?

How does code refinement affect AgentCoder's performance?
How accurate are the generated tests?
How adequate are the generated tests?

an evaluation of how many lines of code in the solution are covered by the tests generated by GPT-3.5-turbo, CodeCoT, and AgentCoder shows that the latter generates the tests with the highest coverage

Should the roles of the code generation agent and the test generation agent be separated into two different agents?

yes, as can be seen from the data in Tables 6, 7 and 8 below

Conclusion

The authors propose AgentCoder, an approach that integrates multiple agents to improve the code generation process of code generation models. AgentCoder consists of three agents: the programmer, the test designer, and the test executor. During code generation, the programmer creates code snippets, then the test designer generates the corresponding test cases. The test executor tests the generated code in a local environment and, if an error occurs, passes the feedback on to the programmer and the designer so that the problem can be fixed.

The evaluations show that AgentCoder outperforms other existing methods and models, such as LLMs and prompt engineering methods. For example, on the HumanEval-ET and MBPP-ET datasets, AgentCoder significantly improves performance, raising the pass@1 success rate from 69.5% and 63.0% to 77.4% and 89.1%, respectively.

Source

Event Extraction by Answering (Almost) Natural Questions

2004.13625v2.pdf (arxiv.org)

Introduction

The goal of event extraction is to create structured information from unstructured text.

So, basically, it tries to answer questions like: “What is happening?” and “Who or what is involved?”

Let’s see a practical example to get a better understanding:

Input ( unstructured information ):

“As part of the 11-billion-dollar sale of USA Interactive's film and television operations to the French company and its parent in December 2001, USA Interactive received 2.5 billion dollars in preferred shares in Vivendi Universal Entertainment.”

Output ( structured information ):

Event type: Transaction - Transfer - Ownership

Trigger: sale

Buyer: French company

Seller: USA Interactive

Artifact: operations

The trigger is essentially the word in the unstructured text that gives away the event type ( or whatever made the system reach its conclusion ).
The arguments here ( Buyer, Seller, Artifact ) each have a semantic role for the current event type: Transaction - Transfer - Ownership

The problems so far:

Previous approaches rely too much on entity ( entity example: French company ) information for the extraction of arguments. This means that they use pre-trained models to identify entities ( such as persons, places, organisations and so on ), and only after that ( once the entities are established ) can argument roles be assigned to them. The problem here is that if we misidentify the entity or mislabel its semantic class, it’s game over.
We need two steps: identify the entities and categorize them, and after that assign the semantic class. If we get something wrong in the first step, the next step builds on it, so we now have error propagation.
We can’t exploit the similarities between arguments that may be related but are part of different event types.

To get a better understanding of the flow that is imposed by the previous approaches: They first identify the event trigger, then the entities ( with no semantics, just: French company and operations ), and then assign some semantic classes ( argument roles ) to them: Buyer, Seller, Artifact and so on.

Different approach

Let’s try a different approach: QA ( Question Answering )

We are going to use two BERTs:

One for detecting the trigger
One that will answer the questions and find the arguments ( with argument roles already attached ) of the event

Improvements:

This approach does not need to identify any entities as a prerequisite.
The QA templates allow knowledge to be transferred between similar event types.
It can also work in zero-shot settings ( on data that was not included in the training datasets, meaning that it can extract arguments for roles it has never seen before ).

General structure:

The first BERT is used to pick out the trigger token from the unstructured text and associate it with a type ( from a pre-defined set of event types ).
The second BERT will try to answer the questions and identify the arguments of the event.
A dynamic threshold is applied to only retain the candidates that are above it.

How do you ask the questions?

For identifying the trigger we can use one of the following:

What is the trigger
Trigger
Action
Verb

So for Action ( the third option from the ones presented above ), it would be like this: [CLS] Action [SEP] As part of the 11 billion-dollar sale ... [SEP]

Note: [CLS] and [SEP] are part of the standard BERT-style format.
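As a small illustration of that layout, the sketch below joins a question and a sentence into the sequence shown above. The helper name is hypothetical, and in practice BERT tokenizers typically insert the special tokens themselves, so this is only meant to make the format visible.

```python
def build_qa_input(question, sentence):
    """Join a question and a sentence in the [CLS] ... [SEP] ... [SEP] layout."""
    return f"[CLS] {question} [SEP] {sentence} [SEP]"

# Trigger detection with the "Action" question on a short example sentence.
qa_input = build_qa_input(
    "Action",
    "Apple announced the launch of its new iPhone in California",
)
print(qa_input)
```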

After that, for identifying the arguments we can use:

The role name ( like: agent, place ).
A WH-word question: “WHO is the ...” for persons, “WHERE is the ...” for places.
A more natural-sounding question based on the descriptions in the ACE (Automatic Content Extraction) annotation guidelines. The trigger itself can also be used in the questions: WHO is the person in <trigger>?

What models are there to answer the questions?

So, as said before, we use BERT for both trigger and argument detection:

BERT_QA_Trigger
BERT_QA_Arg

First we decide and use a template ( for asking the questions ). It is then translated into the BERT-like sequences:

[CLS]<question>[SEP]<sentence>[SEP].

BERT then produces a contextualized representation for every token in that sequence.

Now, the output layers of the two QA models differ:

BERT_QA_Trigger predicts the event type for each token in the sentence (or None, if it is not an event trigger)

For trigger prediction we use a parameter matrix W_tr ∈ R^(H × T), where H is the hidden size of the transformer and T is the total number of event types + 1 ( for non-trigger tokens ). We use a softmax layer to convert the logits into a multi-class probability ( one probability per event type, basically ). At test time, on new, unseen data, we apply argmax to these probabilities to pick the most likely event type for each token ( obviously 🙂).
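The softmax-then-argmax step can be sketched in a few lines of plain Python. The logits below are made-up numbers standing in for the rows produced by multiplying each token representation with W_tr, and the list of event types is a tiny hypothetical subset, with index 0 playing the role of "None".

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical subset of event types; index 0 is the non-trigger class.
EVENT_TYPES = ["None", "Transaction.Transfer-Ownership", "Conflict.Attack"]

def predict_trigger_types(token_logits):
    """token_logits holds one list of T logits per token; returns (type, prob) per token."""
    predictions = []
    for logits in token_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax
        predictions.append((EVENT_TYPES[best], probs[best]))
    return predictions

tokens = ["the", "sale", "of"]
logits = [[2.0, 0.1, 0.0], [0.2, 3.0, 0.5], [1.5, 0.3, 0.2]]  # made-up values
labels = [label for label, _ in predict_trigger_types(logits)]
```

With these numbers only "sale" is labelled as a trigger; the other two tokens fall into the "None" class.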

BERT_QA_Arg predicts the start and end offsets for the argument span (so the start and end index of the words that represent an argument)

We use 2 new matrices here, W_s ( start weights ) and W_e ( end weights ), and then apply softmax to convert the logits into the probability of each token (word) being selected as the start / end of the span.
When training, we use the sum of the start token loss and the end token loss. We obviously try to minimize that.
At test time things get more complicated than expected: a role can have several argument spans or none at all. That is why we use a dynamic threshold.

Now, let’s describe how we are going to get the arguments.

We will use 2 algorithms:

Harvest all valid argument span candidates ( for each argument role ): enumerate all possible combinations of start and end ( via 2 nested loops ), then eliminate the ones that do not meet these conditions:

Start and end must be in the input sentence.
The length of the span itself must ( obviously 🙂) be smaller than the maximum allowed.
All argument spans should have a higher probability than that of the [CLS] (no argument) token.
For every span that is kept, calculate its score relative to the [CLS] (no argument) score.

Filter out candidate spans:

Get a threshold.
Get rid of the spans that have a lower score than the previously computed threshold.
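The two steps above can be sketched like this. It is a simplified illustration, not the authors' code: index 0 is assumed to be the [CLS] token, the sentence is assumed to occupy the indices `sent_first..sent_last`, the probabilities are made up, and the threshold is simply passed in because how it is obtained is not detailed here.

```python
def harvest_candidates(p_start, p_end, sent_first, sent_last, max_span_len):
    """Enumerate (start, end) pairs inside the sentence and keep the valid ones."""
    cls_start, cls_end = p_start[0], p_end[0]   # index 0 is assumed to be [CLS]
    no_answer_score = cls_start + cls_end
    candidates = []
    for start in range(sent_first, sent_last + 1):
        for end in range(start, sent_last + 1):
            if end - start + 1 > max_span_len:
                continue                         # span too long
            if p_start[start] <= cls_start or p_end[end] <= cls_end:
                continue                         # not more likely than "no argument"
            relative = p_start[start] + p_end[end] - no_answer_score
            candidates.append((start, end, relative))
    return candidates

def filter_candidates(candidates, threshold):
    """Keep only the spans whose relative score reaches the threshold."""
    return [c for c in candidates if c[2] >= threshold]

# Made-up start/end probabilities for [CLS] plus four sentence tokens.
p_start = [0.05, 0.02, 0.60, 0.03, 0.30]
p_end = [0.05, 0.01, 0.10, 0.55, 0.29]

candidates = harvest_candidates(p_start, p_end, sent_first=1, sent_last=4, max_span_len=3)
kept = filter_candidates(candidates, threshold=0.7)   # hypothetical threshold value
kept_spans = [(start, end) for start, end, _ in kept]
```

With these numbers four spans survive the harvesting step and two of them, (2, 3) and (2, 4), pass the threshold. Restricting the loop ranges to the sentence is just a compact way of enforcing the first condition.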

Experiments

Experiments were conducted on:

ACE 2005 corpus with 5272 event triggers and 9612 arguments ( fully annotated )

Evaluation

In order to evaluate, we consider the following:

A trigger is counted as correct if its offsets match those of the gold-standard trigger and its event type is also correctly classified (33 event types in total)
An event argument is counted as correct if its semantic role is correctly identified (22 roles in total)

The performance of argument extraction is directly impacted by that of trigger extraction.

Now, in order to better understand how the dynamic threshold impacts the performance of the framework, experiments were conducted with and without it:

The last row ( of Table 3 ) shows the performance with it.
The row above shows the performance without it. It also shows how well the last 2 question templates perform.

To see how the framework handles unseen roles, the following experiment was conducted:

80% of the argument roles were kept. 20% were removed and only seen in testing.

As seen in Table 5, the BERT_QA model is substantially better than other methods of event extraction.

In order to see the impact that different question forming templates have, experiments were conducted and the differences were not very high ( as seen in Table 6 ).

But using the “in <trigger>” template consistently improved the performance, which makes sense, because it indicates where the trigger is in the sentence.

Because template 3 uses the descriptions of the argument roles from the annotation guidelines, it encodes more semantic information about the role, thus giving the best performance.

Error Analysis

Complex sentence structures seem to be problematic:

E.g.: “[She] visited the store and [Tom] did too.”

Tom is not extracted as an argument ( only “She” is )

E.g.: “Canadian authorities arrested two Vancouver-area men on Friday and charged them in the deaths of [329 passengers and crew members of an Air-India Boeing 747 that blew up over the Irish Sea in 1985, en route from Canada to London]”

The victim was not extracted in full ( 329 passengers and crew members of an Air-India Boeing 747 that blew up over the Irish Sea in 1985, en route from Canada to London )

Conclusion

Most methods go through these 3 steps:

Trigger detection
Entity recognition
Argument role assignment.

The presented framework skips the entity recognition stage.

So, in

“Apple announced the launch of its new iPhone in California”

a regular framework would identify:

Trigger: announced
Entities: Apple, iPhone, California
It would then add roles: Apple-Subject, iPhone-Object, California-Location.

The current QA framework skips directly to assigning roles, with no entity recognition step needed.

It identifies:

Trigger: announced
Based on a few guidelines, it then asks questions like “What was announced?” or “Where was the announcement made?” and directly identifies the semantic classes and the entities in one step.

sistemeinteligente2024