Pham Thanh

💬 Senior Full-stack and Blockchain Engineering

blockchain AI LLM reinforcement-learning EVM

Activity

Pinned memos

Proximal policy optimization (PPO) is an algorithm that aims to improve the stability of training by avoiding overly large policy updates. It is a popular and effective method used for training reinforcement learning models in complex environments. To achieve this, PPO uses a ratio that indicates the difference between the current policy and the old policy and clips this ratio within a specific range, ensuring that the policy updates are not too large and the training process is more stable...

AI LLM reinforcement-learning

Reward model

A reward model is a critical component in reinforcement learning for Large Language Models (LLMs), designed to evaluate and score the quality of generated responses. It plays a key role in aligning LLMs with human values and improving their output through iterative refinement.

AI LLM reinforcement-learning

Q learning

An introduction to Q-learning, a model-free reinforcement learning algorithm used to learn optimal policies in Markov Decision Processes.

AI LLM machine-learning

July 2024

Published Proximal policy optimization · July 03

June 2023

Published Reward model · June 23

Published Q learning · June 22

Published Introduction to reinforcement learning and its application with LLMs · June 05

May 2023

Published Select vector database for LLM · May 18