Pham Thanh

thanhpn

💬 Senior Full-stack and Blockchain Engineering

Activity
1 activity in 2024
Pinned memos

Proximal policy optimization (PPO) is an algorithm that stabilizes training by avoiding overly large policy updates. It is a popular and effective method for training reinforcement learning models in complex environments. To achieve this, PPO computes the probability ratio between the current policy and the old policy and clips it to a fixed range, so that no single update moves the policy too far and training stays stable...
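The clipping described in the PPO memo above can be illustrated with a minimal sketch for a single sample. The function name `clipped_surrogate`, its arguments, and the default `eps=0.2` are assumptions made for illustration; they do not come from the memo itself.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped objective (hypothetical helper, illustration only)."""
    # Probability ratio between the current policy and the old policy.
    ratio = math.exp(logp_new - logp_old)
    # Clip the ratio to the range [1 - eps, 1 + eps].
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, a ratio far above `1 + eps` gains nothing beyond the clipped value, which is what removes the incentive for an overly large update.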

A reward model is a critical component of reinforcement learning for large language models (LLMs): it evaluates and scores the quality of generated responses. It plays a key role in aligning LLMs with human values and in improving their outputs through iterative refinement.

An introduction to Q-learning, a model-free reinforcement learning algorithm used to learn optimal policies in Markov Decision Processes.
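As a rough illustration of what the Q-learning memo introduces, the standard tabular update can be sketched as below. The dict-of-dicts table layout, the function name `q_update`, and the default `alpha` and `gamma` values are assumptions for this example, not details taken from the memo.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step (hypothetical helper, illustration only)."""
    # Value of the best action in the next state; 0.0 if the state is unknown or terminal.
    next_values = q.get(next_state)
    best_next = max(next_values.values()) if next_values else 0.0
    # Temporal-difference target: immediate reward plus discounted best future value.
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]
```

The update needs no model of the environment's transitions, only observed `(state, action, reward, next_state)` samples, which is what "model-free" refers to.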

July 2024
June 2023
Published Reward model, June 23
Published Q learning, June 22
May 2023
February 2023
Published Plonky2, February 28
January 2023
Published Polygon zkEVM architecture, January 03
December 2022
Published StarkNet architecture, December 26
September 2022
Published Zero-knowledge proofs, September 06
August 2022
Published Multisign wallet, August 10
July 2022
Published Anchor framework, July 01
June 2022
Published Blockchain bridge, June 21
Dwarves Foundation
Memo