Main

processing priority: 3
site type: 0 (generic, awaiting analysis)
review version: 11
html import: 20 (imported)

Events

first seen date: 2025-02-04 12:17:32
expired found date: -
created at: 2025-02-04 12:17:32
updated at: 2025-02-04 12:17:33

Domain name statistics

length: 28
crc: 24386
tld: 86
nm parts: 0
nm random digits: 0
nm rare letters: 0
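The crawler's exact definitions for the name-statistics fields are not documented in this record; the following is a minimal illustrative sketch under stated assumptions (the domain name, the "rare letters" set, and reading "nm parts" as hyphen-separated extra parts are all hypothetical, not the real implementation):

```python
# Hypothetical reconstruction of a few "Domain name statistics" fields.
# All definitions below are assumptions for illustration only.

RARE_LETTERS = set("qxzj")  # assumed set of letters uncommon in domain names


def name_stats(domain: str) -> dict:
    name = domain.split(".")[0]  # leftmost label of the domain
    return {
        "length": len(domain),                               # full domain length
        "nm parts": name.count("-"),                         # assumed: extra hyphen-separated parts
        "nm random digits": sum(c.isdigit() for c in name),  # digits in the name label
        "nm rare letters": sum(c in RARE_LETTERS for c in name),
    }


print(name_stats("some-project.github.io"))
```

The domain here is a placeholder; the record only reveals that the real domain is a 28-character subdomain of github.io.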

Connections

is subdomain of id: 87719371 (github.io)
previous id: 0
replaced with id: 0
related id: -
dns primary id: 0
dns alternative id: 0
lifecycle status: 0 (unclassified, or currently active)

Subdomains and pages

deleted subdomains: 0
page imported products: 0
page imported random: 0
page imported parking: 0

Error counters

count skipped due to recent timeouts on the same server IP: 0
count content received but rejected due to 11-799: 0
count dns errors: 0
count cert errors: 0
count timeouts: 0
count http 429: 0
count http 404: 0
count http 403: 0
count http 5xx: 0
next operation date: -

Server

server bits:
server ip: -

Mainpage statistics

mp import status: 20
mp rejected date: -
mp saved date: -
mp size orig: 27677
mp size raw text: 15064
mp inner links count: 0
mp inner links status: 1 (no links)

Open Graph

title:
description: Study the role of on-policy sampling and negative gradients in existing preference fine-tuning algorithms.
image:
site name:
author:
updated: 2026-01-09 05:47:12
raw text: Understanding RLHF Abstract Setup Empirical Analysis Theoretical Analysis BibTeX Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data Fahim Tajwar *1 , Anikait Singh *2 , Archit Sharma 2 , Rafael Rafailov 2 , Jeff Schneider 1 , Tengyang Xie 3 , Stefano Ermon 2 Chelsea Finn 2 Aviral Kumar 4 * Equal Contribution (coin-flip) 1 Carnegie Mellon University 2 Stanford University 3 University of Wisconsin, Madison 4 Google DeepMind ICML 2024 Paper arXiv Code Which approaches are important for fine-tuning LLMs with preference data and why? Our main finding is that approaches that utilize on-policy sampling or attempt to push down the likelihood on certain responses (e.g., those with negative gradient such as DPO) tend to outperform offline and maximum likelihood objectives. Combining on-policy sampling with methods that handle ...

Text analysis

redirect type: 0 (-)
block type: 0 (no issues)
detected language: 1 (English)
category id: AI [en] (229)
index version: 2025123101
spam phrases: 0

Text statistics

text nonlatin: 0
text cyrillic: 0
text characters: 11253
text words: 1997
text unique words: 592
text lines: 186
text sentences: 69
text paragraphs: 27
text words per sentence: 28
text matched phrases: 2
text matched dictionaries: 2
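The words-per-sentence figure is consistent with the raw counts in this section; a quick check, assuming the tool truncates the ratio to an integer:

```python
# Verify "text words per sentence" against the raw counts,
# assuming truncating (floor) division.
text_words = 1997
text_sentences = 69

words_per_sentence = text_words // text_sentences
print(words_per_sentence)  # → 28, matching the record (true ratio is ~28.9)
```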

RSS

rss path:
rss status: 1 (priority 1 already searched, no matches found)
rss found date: -
rss size orig: 0
rss items: 0
rss spam phrases: 0
rss detected language: 0 (awaiting analysis)
inbefore feed id: -
inbefore status: 0 (new)

Sitemap

sitemap path:
sitemap status: 1 (priority 1 already searched, no matches found)
sitemap review version: 2
sitemap urls count: 0
sitemap urls adult: 0
sitemap filtered products: 0
sitemap filtered videos: 0
sitemap found date: -
sitemap process date: -
sitemap first import date: -
sitemap last import date: -