Category Archives: Uncategorized

GOOD

Together with Joseph Modayil, this year I am teaching the part on reinforcement learning of the Advanced Topics in Machine Learning course at UCL.

Lectures

Note that there will be two lectures about AlphaGo on March 24. We will talk about AlphaGo in the context of the whole course at the normal place and time (9:15am in Roberts 412), and in addition David Silver will give a seminar that afternoon. Neither of these will be required for the exam.

Introduction to reinforcement learning updated January 14 (Lecture: January 14)
Exploration and Exploitation updated January 21 (Lecture: January 21)
Markov decision processes updated January 27 (Lecture: January 28)
Dynamic programming updated February 3 (Lecture: February 4)
Learning to predict updated February 10 (Lecture: February 11)
Learning to control updated March 16 (Lecture: February 25)
Value function approximation updated March 2 (Lecture: March 3)
Policy-gradient algorithms updated March 9 (Lecture:…

View original post 51 more words

Uncategorized

Value iteration networks

February 9, 2017 zr9558 Leave a comment

Excellent

the morning paper

Value Iteration Networks Tamar et al., NIPS 2016

‘Value Iteration Networks’ won a best paper award at NIPS 2016. It tackles two of the hot issues in reinforcement learning at the moment: incorporating longer range planning into the learned strategies, and improving transfer learning from one problem to another. It’s two for the price of one, as both of these challenges are addressed by an architecture that learns to plan.

In the grid-world domain shown below, a standard reinforcement learning network, trained on several instances of the world, may still have trouble generalizing to a new unseen domain (right-hand image).

(This setup is very similar to the maze replanning challenge in ‘Strategic attentive writer for learning macro actions‘ from the Google DeepMind team that we looked at earlier this year. Both papers were published at the same time).

… as we show in our experiments, while standard…

View original post 1,025 more words

Uncategorized

LEARNING REINFORCEMENT LEARNING (WITH CODE, EXERCISES AND SOLUTIONS)

January 19, 2017 zr9558 Leave a comment

http://www.wildml.com/2016/10/learning-reinforcement-learning/

Skip all the talk and go directly to the Github Repo with code and exercises.

WHY STUDY REINFORCEMENT LEARNING

Reinforcement Learning is one of the fields I’m most excited about. Over the past few years amazing results like learning to play Atari Games from raw pixels and Mastering the Game of Go have gotten a lot of attention, but RL is also widely used in Robotics, Image Processing and Natural Language Processing.

Combining Reinforcement Learning and Deep Learning techniques works extremely well. Both fields heavily influence each other. On the Reinforcement Learning side Deep Neural Networks are used as function approximators to learn good representations, e.g. to process Atari game images or to understand the board state of Go. In the other direction, RL techniques are making their way into supervised problems usually tackled by Deep Learning. For example, RL techniques are used to implement attention mechanisms in image processing, or to optimize long-term rewards in conversational interfaces and neural translation systems. Finally, as Reinforcement Learning is concerned with making optimal decisions it has some extremely interesting parallels to human Psychology and Neuroscience (and many other fields).

With lots of open problems and opportunities for fundamental research I think we’ll be seeing multiple Reinforcement Learning breakthroughs in the coming years. And what could be more fun than teaching machines to play Starcraft and Doom?

HOW TO STUDY REINFORCEMENT LEARNING

There are many excellent Reinforcement Learning resources out there. Two I recommend the most are:

The latter is still work in progress but it’s ~80% complete. The course is based on the book so the two work quite well together. In fact, these two cover almost everything you need to know to understand most of the recent research papers. The prerequisites are basic Math and some knowledge of Machine Learning.

That covers the theory. But what about practical resources? What about actually implementing the algorithms that are covered in the book/course? That’s where this post and the Github repository comes in. I’ve tried to implement most of the standard Reinforcement Algorithms using Python, OpenAI Gym and Tensorflow. I separated them into chapters (with brief summaries) and exercises and solutions so that you can use them to supplement the theoretical material above.All of this is in the Github repository.

Some of the more time-intensive algorithms are still work in progress, so feel free to contribute. I’ll update this post as I implement them.

Introduction to RL problems, OpenAI gym
MDPs and Bellman Equations
Dynamic Programming: Model-Based RL, Policy Iteration and Value Iteration
Monte Carlo Model-Free Prediction & Control
Temporal Difference Model-Free Prediction & Control
Function Approximation
Deep Q Learning (WIP)
Policy Gradient Methods (WIP)
Learning and Planning (WIP)
Exploration and Exploitation (WIP)

LIST OF IMPLEMENTED ALGORITHMS

Dynamic Programming Policy Evaluation
Dynamic Programming Policy Iteration
Dynamic Programming Value Iteration
Monte Carlo Prediction
Monte Carlo Control with Epsilon-Greedy Policies
Monte Carlo Off-Policy Control with Importance Sampling
SARSA (On Policy TD Learning)
Q-Learning (Off Policy TD Learning)
Q-Learning with Linear Function Approximation
Deep Q-Learning for Atari Games
Double Deep-Q Learning for Atari Games
Deep Q-Learning with Prioritized Experience Replay (WIP)
Policy Gradient: REINFORCE with Baseline
Policy Gradient: Actor Critic with Baseline
Policy Gradient: Actor Critic with Baseline for Continuous Action Spaces
Deterministic Policy Gradients for Continuous Action Spaces (WIP)
Deep Deterministic Policy Gradients (DDPG) (WIP)
Asynchronous Advantage Actor Critic (A3C) (WIP)

DeepMind, Uncategorized

解析 DeepMind 深度强化学习 (Deep Reinforcement Learning) 技术

December 1, 2016 zr9558 Leave a comment

文／Not_GOD（简书作者）
原文链接：http://www.jianshu.com/p/d347bb2ca53c
著作权归作者所有，转载请联系作者获得授权，并标注“简书作者”。

Two years ago, a small company in London called DeepMind uploaded their pioneering paper “Playing Atari with Deep Reinforcement Learning” to Arxiv. In this paper they demonstrated how a computer learned to play Atari 2600 video games by observing just the screen pixels and receiving a reward when the game score increased. The result was remarkable, because the games and the goals in every game were very different and designed to be challenging for humans. The same model architecture, without any change, was used to learn seven different games, and in three of them the algorithm performed even better than a human!

It has been hailed since then as the first step towards general artificial intelligence – an AI that can survive in a variety of environments, instead of being confined to strict realms such as playing chess. No wonder DeepMind was immediately bought by Google and has been on the forefront of deep learning research ever since. In February 2015 their paper “Human-level control through deep reinforcement learning” was featured on the cover of Nature, one of the most prestigious journals in science. In this paper they applied the same model to 49 different games and achieved superhuman performance in half of them.

Still, while deep models for supervised and unsupervised learning have seen widespread adoption in the community, deep reinforcement learning has remained a bit of a mystery. In this blog post I will be trying to demystify this technique and understand the rationale behind it. The intended audience is someone who already has background in machine learning and possibly in neural networks, but hasn’t had time to delve into reinforcement learning yet.

我们按照下面的几个问题来看看到底深度强化学习技术长成什么样？

什么是强化学习的主要挑战？针对这个问题，我们会讨论 credit assignment 问题和 exploration-exploitation 困境。
如何使用数学来形式化强化学习？我们会定义 Markov Decision Process 并用它来对强化学习进行分析推理。
我们如何指定长期的策略？这里，定义了 discounted future reward，这也给出了在下面部分的算法的基础。
如何估计或者近似未来收益？给出了简单的基于表的 Q-learning 算法的定义和分析。
如果状态空间非常巨大该怎么办？这里的 Q-table 就可以使用（深度）神经网络来替代。
怎么样将这个模型真正可行？采用 Experience replay 技术来稳定神经网络的学习。
这已经足够了么？最后会研究一些对 exploration-exploitation 问题的简单解决方案。

强化学习

Consider the game Breakout. In this game you control a paddle at the bottom of the screen and have to bounce the ball back to clear all the bricks in the upper half of the screen. Each time you hit a brick, it disappears and your score increases – you get a reward.

图 1：Atari Breakout 游戏：来自 DeepMind

Suppose you want to teach a neural network to play this game. Input to your network would be screen images, and output would be three actions: left, right or fire (to launch the ball). It would make sense to treat it as a classification problem – for each game screen you have to decide, whether you should move left, right or press fire. Sounds straightforward? Sure, but then you need training examples, and a lots of them. Of course you could go and record game sessions using expert players, but that’s not really how we learn. We don’t need somebody to tell us a million times which move to choose at each screen. We just need occasional feedback that we did the right thing and can then figure out everything else ourselves.
This is the task reinforcement learning tries to solve. Reinforcement learning lies somewhere in between supervised and unsupervised learning. Whereas in supervised learning one has a target label for each training example and in unsupervised learning one has no labels at all, in reinforcement learning one has sparse and time-delayed labels – the rewards. 基于这些收益，agent 必须学会在环境中如何行动。

尽管这个想法非常符合直觉，但是实际操作时却困难重重。例如，当你击中一个砖块，并得到收益时，这常常和最近做出的行动（paddle 的移动）没有关系。在你将 paddle 移动到了正确的位置时就可以将球弹回，其实所有的困难的工作已经完成。这个问题被称作是 credit assignment 问题——先前的行动会影响到当前的收益的获得与否及收益的总量。

一旦你想出来一个策略来收集一定数量的收益，你是要固定在当前的策略上还是尝试其他的策略来得到可能更好的策略呢？在上面的 Breakout 游戏中，简单的策略就是移动到最左边的等在那里。发出球球时，球更可能是向左飞去，这样你能够轻易地在死前获得 10 分。但是，你真的满足于做到这个程度么？ 这就是 exploration-exploit 困境 ——你应当利用已有的策略还是去探索其他的可能更好的策略。

强化学习是关于人类（或者更一般的动物）学习的重要模型。我们受到来自父母、分数、薪水的奖赏都是收益的各类例子。credit assignment 问题 和 exploration-exploitation 困境 在商业和人际关系中常常出现。这也是研究强化学习及那些提供尝试新的观点的沙盒的博弈游戏的重要原因。

Markov Decision Process

现在的问题就是，如何来形式化这个强化学习问题使得我们可以对其进行分析推理。目前最常见的就是将其表示成一个 Markov decision process。

假设你是一个 agent，在一个环境中（比如说 Breakout 游戏）。环境会处在某个状态下（比如说，paddle 的位置、球的位置和方向、每个砖块是否存在等等）。agent 可以在环境中执行某种行动（比如说，将 paddle 向左或者向右移动）。这些行动有时候会带来收益（比如说分数的增加）。行动会改变环境并使其新的状态，然后 agent 又可以这行另外的行动，如此进行下去。如何选择这些行动的规则称为策略。一般来说环境是随机的，这意味着下一个状态的出现在某种程度上是随机的（例如，你输了一个球的时候，重新启动新球，这个时候的射出方向是随机选定的）。

图 2：左：强化学习问题。右：Markov decision process

状态和行动的集合，以及从一个状态跳转到另一个状态的规则，共同真诚了 Markov decision process。这个过程（比方说一次游戏过程）的一个 episode 形成了一个有限的状态、行动和收益的序列：

这里的 s_i 表示状态，a_i 表示行动，而 r_{i+1} 是在执行行动的汇报。episode 以 terminal 状态 s_n 结尾（可能就是“游戏结束”画面）。MDP 依赖于 Markov 假设——下一个状态 s_{i+1} 的概率仅仅依赖于当前的状态 s_i 和行动 a_i，但不依赖此前的状态和行动。

Discounted Future Reward

To perform well in the long-term, we need to take into account not only the immediate rewards, but also the future rewards we are going to get. How should we go about that?

Given one run of the Markov decision process, we can easily calculate the total reward for one episode:

Given that, the total future reward from time point t onward can be expressed as:

But because our environment is stochastic, we can never be sure, if we will get the same rewards the next time we perform the same actions. The more into the future we go, the more it may diverge. For that reason it is common to use discounted future reward instead:

Here γ is the discount factor between 0 and 1 – the more into the future the reward is, the less we take it into consideration. It is easy to see, that discounted future reward at time step t can be expressed in terms of the same thing at time step t+1:

If we set the discount factor γ=0, then our strategy will be short-sighted and we rely only on the immediate rewards. If we want to balance between immediate and future rewards, we should set discount factor to something like γ=0.9. If our environment is deterministic and the same actions always result in same rewards, then we can set discount factor γ=1.

A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward.

Q-learning

In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s, and continue optimally from that point on.

**The way to think about Q(s, a) is that it is “the best possible score at the end of the game after performing action a in state s“. It is called Q-function, because it represents the “quality” of a certain action in a given state.

This may sound like quite a puzzling definition. How can we estimate the score at the end of game, if we know just the current state and action, and not the actions and rewards coming after that? We really can’t. But as a theoretical construct we can assume existence of such a function. Just close your eyes and repeat to yourself five times: “Q(s, a) exists, Q(s, a) exists, …”. Feel it?

If you’re still not convinced, then consider what the implications of having such a function would be. Suppose you are in state and pondering whether you should take action a or b. You want to select the action that results in the highest score at the end of game. Once you have the magical Q-function, the answer becomes really simple – pick the action with the highest Q-value!

Here π represents the policy, the rule how we choose an action in each state.

OK, how do we get that Q-function then? Let’s focus on just one transition <s, a, r, s’>. Just like with discounted future rewards in the previous section, we can express the Q-value of state s and action a in terms of the Q-value of the next state s’.

This is called the Bellman equation. If you think about it, it is quite logical – maximum future reward for this state and action is the immediate reward plus maximum future reward for the next state.

The main idea in Q-learning is that we can iteratively approximate the Q-function using the Bellman equation. In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns. The gist of the Q-learning algorithm is as simple as the following[1]:

α in the algorithm is a learning rate that controls how much of the difference between previous Q-value and newly proposed Q-value is taken into account. In particular, when α=1, then two Q[s,a] cancel and the update is exactly the same as the Bellman equation.

The maxa’ Q[s’,a’] that we use to update Q[s,a] is only an approximation and in early stages of learning it may be completely wrong. However the approximation get more and more accurate with every iteration and it has been shown, that if we perform this update enough times, then the Q-function will converge and represent the true Q-value.

Deep Q Network

环境的状态可以用 paddle 的位置，球的位置和方向以及每个砖块是否消除来确定。不过这个直觉上的表示与游戏相关。我们能不能获得某种更加通用适合所有游戏的表示呢？最明显的选择就是屏幕像素——他们隐式地包含所有关于除了球的速度和方向外的游戏情形的相关信息。不过两个时间上相邻接的屏幕可以包含这两个丢失的信息。

如果我们像 DeepMind 的论文中那样处理游戏屏幕的话——获取四幅最后的屏幕画面，将他们重新规整为 84 X 84 的大小，转换为 256 灰度层级——我们会得到一个 256^{84X84X4} 大小的可能游戏状态。这意味着我们的 Q-table 中需要有 10^67970 行——这比已知的宇宙空间中的原子的数量还要大得多！可能有人会说，很多像素的组合（也就是状态）不会出现——这样其实可以使用一个稀疏的 table 来包含那些被访问到的状态。即使这样，很多的状态仍然是很少被访问到的，也许需要宇宙的生命这么长的时间让 Q-table 收敛。我们希望理想化的情形是有一个对那些还未遇见的状态的 Q-value 的猜测。

这里就是深度学习发挥作用的地方。神经网络其实对从高度结构化的数据中获取特征非常在行。我们可以用神经网络表示 Q-function，以状态（四幅屏幕画面）和行动作为输入，以对应的 Q-value 作为输出。另外，我们可以仅仅用游戏画面作为输入对每个可能的行动输出一个 Q-value。后面这个观点对于我们想要进行 Q-value 的更新或者选择最优的 Q-value 对应操作来说要更方便一些，这样我们仅仅需要进行一遍网络的前向传播就可立即得到所有行动的 Q-value。

图 3：左：DQN 的初级形式；右：DQN 的优化形式，用在 DeepMind 的论文中的版本

DeepMind 使用的深度神经网络架构如下：

这实际上是一个经典的卷积神经网络，包含 3 个卷积层加上 2 个全连接层。对图像识别的人们应该会注意到这里没有包含 pooling 层。但如果你好好想想这里的情况，你会明白，pooling 层会带来变换不变性 —— 网络会对图像中的对象的位置没有很强的感知。这个特性在诸如 ImageNet 这样的分类问题中会有用，但是在这里游戏的球的位置其实是潜在能够决定收益的关键因素，我们自然不希望失去这样的信息了！

网络的输入是 4 幅 84X84 的灰度屏幕画面。网络的输出是对每个可能的行动（在 Atari 中是 18 个）。Q-value 可能是任何实数，其实这是一个回归任务，我们可以通过简单的平方误差来进行优化。

给定转移 <s, a, r, s’>，Q-table 更新规则变动如下：

对当前的状态 s 执行前向传播，获得对所有行动的预测 Q-value
对下一状态 s’ 执行前向传播，计算网络输出最大操作：max_{a’} Q(s’, a’)
设置行动的 Q-value 目标值为 r + γ max_{a’} Q(s’, a’)。使用第二步的 max 值。对所有其他的行动，设置为和第一步返回结果相同的 Q-value 目标值，让这些输出的误差设置为 0
使用反向传播算法更新权重

Experience Replay

到现在，我们有了使用 Q-learning 如何估计未来回报和用卷积神经网络近似 Q-function 的方法。但是有个问题是，使用非线性函数来近似 Q-value 其实是非常不稳定的。对这个问题其实也有一堆技巧来让其收敛。不过这样也会花点时间，在单个 GPU 上估计要一个礼拜。

其中最为重要的技巧是 experience replay。在游戏过程中，所有的经验 <s, a, r’, s’> 都存放在一个 replay memory 中。训练网络时，replay memory 中随机的 minibatch 会取代最近的状态转移。这会将连续的训练样本之间的相似性打破，否则容易将网络导向一个局部最优点。同样 experience replay 让训练任务与通常的监督学习更加相似，这样也简化了程序的 debug 和算法的测试。当然我们实际上也是可以收集人类玩家的 experience 并用这些数据进行训练。

Exploration-Exploitation

Q-learning 试着解决 credit assignment 问题——将受益按时间传播，直到导致获得受益的实际的关键决策点为止。但是我们并没有解决 exploration-exploitation 困境……

首先我们注意到，当 Q-table 或者 Q-network 随机初始化时，其预测结果也是随机的。如果我们选择一个拥有最高的 Q-value 的行动，这个行动是随机的，这样 agent 会进行任性的“exploration”。当 Q-function 收敛时，它会返回一个更加一致的 Q-value 此时 exploration 的次数就下降了。所以我们可以说，Q-learning 将 exploration 引入了算法的一部分。但是这样的 exploration 是贪心的，它会采用找到的最有效的策略。

对上面问题的一个简单却有效的解决方案是 ε-greedy exploration——以概率ε选择一个随机行动，否则按照最高的 Q-value 进行贪心行动。在 DeepMind 的系统中，对ε本身根据时间进行的从 1 到 0.1 的下降，也就是说开始时系统完全进行随机的行动来最大化地 explore 状态空间，然后逐渐下降到一个固定的 exploration 的比例。

Deep Q-learning 算法

现在我们得到加入 experience replay的最终版本：

DeepMind 其实还使用了很多的技巧来让系统工作得更好——如 target network、error clipping、reward clipping 等等，这里我们不做介绍。

该算法最为有趣的一点就是它可以学习任何东西。你仔细想想——由于我们的 Q-function 是随机初始化的，刚开始给出的结果就完全是垃圾。然后我们就用这样的垃圾（下个状态的最高 Q-value）作为网络的目标，仅仅会偶然地引入微小的收益。这听起来非常疯狂，为什么它能够学习任何有意义的东西呢？然而，它确实如此神奇。

Final notes

Many improvements to deep Q-learning have been proposed since its first introduction – Double Q-learning, Prioritized Experience Replay, Dueling Network Architecture and extension to continuous action space to name a few. For latest advancements check out the NIPS 2015 deep reinforcement learning workshop and ICLR 2016 (search for “reinforcement” in title). But beware, that deep Q-learning has been patented by Google.

It is often said, that artificial intelligence is something we haven’t figured out yet. Once we know how it works, it doesn’t seem intelligent any more. But deep Q-networks still continue to amaze me. Watching them figure out a new game is like observing an animal in the wild – a rewarding experience by itself.

Credits

Thanks to Ardi Tampuu, Tanel Pärnamaa, Jaan Aru, Ilya Kuzovkin, Arjun Bansal and Urs Köster for comments and suggestions on the drafts of this post.

Links

David Silver’s lecture about deep reinforcement learning
Slightly awkward but accessible illustration of Q-learning
UC Berkley’s course on deep reinforcement learning
David Silver’s reinforcement learning course
Nando de Freitas’ course on machine learning (two lectures about reinforcement learning in the end)
Andrej Karpathy’s course on convolutional neural networks

[1] Algorithm adapted from http://artint.info/html/ArtInt_265.html

文／Not_GOD（简书作者）
原文链接：http://www.jianshu.com/p/d347bb2ca53c
著作权归作者所有，转载请联系作者获得授权，并标注“简书作者”。

Uncategorized

Excerpts from the Grothendieck-Serre Correspondence

November 15, 2014 zr9558 Leave a comment

Matt Baker's Math Blog

Like many fellow mathematicians, I was very sad to hear the news that Alexander Grothendieck passed away yesterday. The word “genius” is overused; or rather, does not possess sufficiently fine gradations. I know quite a few mathematical geniuses, but Grothendieck was a singularity. His ideas were so original, so profound, and so revolutionary – and he had so many of them! – that I will not even attempt to summarize his contributions to mathematics here. Rather, I thought that I would share some of my favorite passages from the fascinating Grothendieck-Serre Correspondence, published in a bilingual edition by the AMS and SMF. They illuminate in brief flashes what made Grothendieck so extraordinary — but also human. They also illustrate how influential Serre was on Grothendieck’s mathematical development. Before I begin, here is a quote from another wonderful book, Alexander Grothendieck: A Mathematical Portrait, edited by Leila Schneps:

…the…

View original post 2,863 more words

Uncategorized

254A announcement: Analytic prime number theory

November 15, 2014 zr9558 Leave a comment

What's new

In the winter quarter (starting January 5) I will be teaching a graduate topics course entitled “An introduction to analytic prime number theory“. As the name suggests, this is a course covering many of the analytic number theory techniques used to study the distribution of the prime numbers $latex {{mathcal P} = {2,3,5,7,11,dots}}&fg=000000$. I will list the topics I intend to cover in this course below the fold. As with my previous courses, I will place lecture notes online on my blog in advance of the physical lectures.

The type of results about primes that one aspires to prove here is well captured by Landau’s classical list of problems:

Even Goldbach conjecture: every even number $latex {N}&fg=000000$ greater than two is expressible as the sum of two primes.
Twin prime conjecture: there are infinitely many pairs $latex {n,n+2}&fg=000000$ which are simultaneously prime.
Legendre’s conjecture:…

View original post 3,947 more words

Uncategorized

Manjul Bhargava and his 290 theorem

August 15, 2014 zr9558 Leave a comment

Fight with Infinity

ICM 2014今天在韩国首尔召开。正如之前所预测的那样，Manjul Bhargava获得了2014年的Fields Medal. 一同获奖的还有Artur Avila, Martin Hairer和Maryam Mirzakhani.

这份名单相当有趣：史上第一位女性获奖者(或许是为了赶在美国第一位女总统之前？)；4人广义上的“祖国”(印度、巴西、奥地利以及伊朗)此前均无Fields奖得主(何其政治正确！当然这也反映了现今欧美基础学科研究人员的“去欧美化”趋势)；3人的研究工作和遍历理论紧密相关；2人有参加IMO的经历(有此经历的Fields奖得主越来越多，当然，其中并没有华人的身影——丘成桐或许会愿意就数学研究和奥林匹克数学的关系作进一步的评论)；等等。

本文将介绍获奖者Manjul Bhargava的一项“初等”工作：简化了Conway-Schneeberger 15定理的证明，并进一步证明了Conway的290猜想。

1.
我们感兴趣的是在整格$latex Bbb Z^n$上取整值的$latex n$元多项式$latex f$。若$latex f$是齐次的，这相当于要求$latex f$的系数为整数。对可表示集$latex R_f:=f(Bbb N^n) subset Bbb Z$(约定$latex 0 in Bbb N$)的研究贯穿了整个数论史：
(1.1)Fermat集中研究了用2元2次整系数多项式表示素数$latex p$的问题，并发现
若$latex f(x,y)=x^2+y^2$，则$latex p in R_f$当且仅当$latex p$形如$latex 4a+1$；
若$latex f(x,y)=x^2+2y^2$，则$latex p in R_f$当且仅当$latex p$形如$latex 8a+1$或$latex 8a+3$；
若$latex f(x,y)=x^2-2y^2$，则$latex p in R_f$当且仅当$latex p$形如$latex 8a+1$或$latex 8a+7$；
若$latex f(x,y)=x^2+3y^2$，则$latex p in R_f$当且仅当$latex p$形如$latex 3a+1$；
若$latex f(x,y)=x^2+5y^2$，则$latex p in R_f$当且仅当$latex p$形如$latex 20a+1$或$latex 20a+9$；
此类现象是代数数论乃至类域论的渊薮。
(1.2)由Fermat二平方和定理开始，Euler等数学家获得了一系列经典结果。
(Fermat二平方和定理, 由Euler证明) 若$latex f(x,y)=x^2+y^2$，则自然数$latex k in R_f$当且仅当$latex k$的奇素因子(若有)均形如$latex 4a+1$。
(Lagrange四平方和定理) 若$latex f(x,y,z,w)=x^2+y^2+z^2+w^2$，则$latex R_f=Bbb N$。
(Legendre三平方和定理) 若$latex f(x,y,z)=x^2+y^2+z^2$，则自然数$latex k in R_f$当且仅当$latex k$不能写成$latex 2^{2a}(8b+7)$的形式。
(1.3) 平方数有一类推广，即所谓的多边形数：填满正多边形内部的点的个数。
(Gauss三角数定理，“Eureka定理”)令$latex f(x,y,z)=frac{x(x+1)}{2}+frac{y(y+1)}{2}+frac{z(z+1)}{2}$，则$latex R_f=Bbb N$。
推广Gauss三角数定理和Lagrange四平方和定理，我们有如下结果：
(Fermat多边形数定理，由Cauchy证明) 任意自然数均可表示为不超过$latex n$个$latex n$边形数之和。
(1.4)从Lagrange四平方和定理出发，我们也可以研究高次幂多项式的表示问题：
(Waring问题，由Hilbert解决) 给定$latex k geq 2$，$latex f=sum_{1 leq i leq g} x_i^k$。对于充分大的$latex g$，$latex R_f=Bbb N$。
关于$latex g$的下确界$latex…

View original post 234 more words

Uncategorized

Khot, Osher, Griffiths

August 15, 2014 zr9558 Leave a comment

What's new

In addition to the Fields medallists mentioned in the previous post, the IMU also awarded the Nevanlinna prize to Subhash Khot, the Gauss prize to Stan Osher (my colleague here at UCLA!), and the Chern medal to Phillip Griffiths. Like I did in 2010, I’ll try to briefly discuss one result of each of the prize winners, though the fields of mathematics here are even further from my expertise than those discussed in the previous post (and all the caveats from that post apply here also).

Subhash Khot is best known for his Unique Games Conjecture, a problem in complexity theory that is perhaps second in importance only to the $latex {P neq NP}&fg=000000$ problem for the purposes of demarcating the mysterious line between “easy” and “hard” problems (if one follow standard practice and uses “polynomial time” as the definition of “easy”). The $latex {P neq…

View original post 1,244 more words

Uncategorized

Lindenstrauss, Ngo, Smirnov, Villani

August 13, 2014 zr9558 Leave a comment

What's new

As is now widely reported, the Fields medals for 2010 have been awarded to Elon Lindenstrauss, Ngo Bao Chau, Stas Smirnov, and Cedric Villani. Concurrently, the Nevanlinna prize (for outstanding contributions to mathematical aspects of information science) was awarded to Dan Spielman, the Gauss prize (for outstanding mathematical contributions that have found significant applications outside of mathematics) to Yves Meyer, and the Chern medal (for lifelong achievement in mathematics) to Louis Nirenberg. All of the recipients are of course exceptionally qualified and deserving for these awards; congratulations to all of them. (I should mention that I myself was only very tangentially involved in the awards selection process, and like everyone else, had to wait until the ceremony to find out the winners. I imagine that the work of the prize committees must have been extremely difficult.)

Today, I thought I would mention one…

View original post 3,867 more words

Uncategorized

Avila, Bhargava, Hairer, Mirzakhani

August 13, 2014 zr9558 Leave a comment

What's new

The 2014 Fields medallists have just been announced as (in alphabetical order of surname) Artur Avila, Manjul Bhargava, Martin Hairer, and Maryam Mirzakhani (see also these nice video profiles for the winners, which is a new initiative of the IMU and the Simons foundation). This time last year, I wrote a blog post discussing one result from each of the 2010 medallists; I thought I would try to repeat the exercise here, although the work of the medallists this time around is a little bit further away from my own direct area of expertise than last time, and so my discussion will unfortunately be a bit superficial (and possibly not completely accurate) in places. As before, I am picking these results based on my own idiosyncratic tastes, and are not necessarily the “best” work of these medallists.

Artur Avila works in dynamical systems and in the…

View original post 1,412 more words

Uncategorized

245A: Problem solving strategies

July 18, 2014 zr9558 Leave a comment

What's new

This is going to be a somewhat experimental post. In class, I mentioned that when solving the type of homework problems encountered in a graduate real analysis course, there are really only about a dozen or so basic tricks and techniques that are used over and over again. But I had not thought to actually try to make these tricks explicit, so I am going to try to compile here a list of some of these techniques here. But this list is going to be far from exhaustive; perhaps if other recent students of real analysis would like to share their own methods, then I encourage you to do so in the comments (even – or especially – if the techniques are somewhat vague and general in nature).

(See also the Tricki for some general mathematical problem solving tips. Once this page matures somewhat, I might migrate it to the…

View original post 6,008 more words

Uncategorized

The parity problem obstruction for the binary Goldbach problem with bounded error

July 14, 2014 zr9558 Leave a comment

What's new

Two of the most famous open problems in additive prime number theory are the twin prime conjecture and the binary Goldbach conjecture. They have quite similar forms:

Twin prime conjecture The equation $latex {p_1 – p_2 = 2}&fg=000000$ has infinitely many solutions with $latex {p_1,p_2}&fg=000000$ prime.
Binary Goldbach conjecture The equation $latex {p_1 + p_2 = N}&fg=000000$ has at least one solution with $latex {p_1,p_2}&fg=000000$ prime for any given even $latex {N geq 4}&fg=000000$.

In view of this similarity, it is not surprising that the partial progress on these two conjectures have tracked each other fairly closely; the twin prime conjecture is generally considered slightly easier than the binary Goldbach conjecture, but broadly speaking any progress made on one of the conjectures has also led to a comparable amount of progress on the other. (For instance, Chen’s theorem has a version for the twin prime conjecture, and a version…

View original post 1,123 more words

Uncategorized

Mac OS X下MATLAB R2012b安装及破解

November 30, 2013 zr9558 6 Comments

http://blog.sina.com.cn/s/blog_559d66460101caur.html

Mac OS X下MATLAB R2012b安装及破解 (Updated on 10/19/2013)

(2013-03-08 05:28:16)

转载▼

【转载请注明出处！！！】

有条件的请使用正版软件！（正义脸）

1. 下载 MATLAB R2012b 安装包。安装文件的大小约为5G。之前的下载地址都失效了，为了方便大家，po主自己传了一份到百度网盘。因为度娘不允许单个文件大于4G，所以压成了5 parts，请把5 parts都下完后再解压… 侵删，勿跨省跨国追捕…

Part 1: http://pan.baidu.com/s/15lLMe 提取密码：ygxm

Part 2: http://pan.baidu.com/s/1dFl4 提取密码：naxi

Part 3: http://pan.baidu.com/s/19V4cW 提取密码：zrwu

Part 4: http://pan.baidu.com/s/1FvjiY 提取密码：ag8t

Part 5: http://pan.baidu.com/s/1899jD 提取密码：f85x

PS：大家随便下，po主不设权限不收钱，不用谢，作为交换请不要关掉页面的background music，让po主私心promote一下G-Dragon的音乐 Mac OS X下MATLAB R2012b安装及破解 (Updated on 10/19/2013)

各种风格都有，歌曲排序大概是普通青年->文艺青年->黑泡青年，请根据喜好自行切换 Mac OS X下MATLAB R2012b安装及破解 (Updated on 10/19/2013)

2. 下载 MATLAB R2012b 安装密钥/License。地址：http://vdisk.weibo.com/s/sSDYb （微博快被@爆了… 改天传一份到度娘去…）。解压后安置在某处待用。

3. 双击MATLAB安装文件(.iso file)，选择 Install for Mac。进入以下画面时，点选 Install without using the Internet （离线安装），然后点击Next。

4. 在这一步中选择 I have the File Installation Key for my license。在前面下载并解压的安装密钥文件夹中，打开文件“MatLab R2012b 安装密钥.txt”，复制任意一个密钥，粘贴到输入框中。据说不同长度的license代表所含的组件数量不一样。点击Next。

5. 在这一步中，根据自己的需要，选择安装类型。普通用户的话选Typical即可，高端用户可选择自定义安装Custom。然后Next。

6. 在这一步中，点击Browse，选择license文件的位置。找到在前面下载并解压的安装密钥文件夹，在里面的License文件夹中，点选license_standalone.dat。然后点击Next。PS:当时我遇到一个问题，就是没有办法点选license文件，里面的其它文件也一样，全都显示灰色并且不可选。后来我发现是由于路径语言造成的。如果你的系统是中文的，请无视。如果你的系统是英文的，而路径里面包含中文（如下图所示），那么请重命名含有中文的文件夹名称，保证指向license文件的路径是纯英文的即可。

7. 之后就开始安装了。Enjoy the free MATLAB on your Mac!

【Update】其它问题：

(1) A server line could not be found in your license file. You will have to manually edit the SERVER line in /Applications/MATLAB_R2012b.app/licenses/network.lic

PS：这个解决方法来自[2]。因为我之前已经装了Xcode，虽然没有特别去设置什么，但没有碰到这个问题。有这个问题的小伙伴们如果没有装Xcode，就先装Xcode，还是有问题的话就装下面这个patch。

当你在安装过程中遇到上面那段英文时，点击ok，你会进入一个帮助界面http://www.mathworks.co.uk/support/compilers/R2012b/maci64.html#n1

在底部会看到

Notes for the Mac Platform

A patch is required to add support for Xcode 4.6, 4.5, 4.4 and 4.3. See Solution 1-FR6LXJ for the patch and installation instructions.

点击 solution 1-FR6LXJ，按照这里的步骤重新设置和安装你的Xcode, Matlab[3]

References:

[1] http://bbs.weiphone.com/read-htm-tid-5419285.html

[2] http://blog.sina.com.cn/s/blog_c29649af0101f5g4.html

[3] http://www.mathworks.co.uk/support/solutions/en/data/1-FR6LXJ/

Uncategorized

[转]DJVU格式电子图书介绍

November 24, 2013 zr9558 Leave a comment

bestorworst

这篇文章发的比较久了，版本比较旧了，下载2.0.1可以选择这里。
另：WinDjView 主页

———————-
转载请注明源自清溪长河，多谢配合！
原文地址» http://www.10kn.com/114/
由长河发表于:2008-09-06 07:35

自我分享《有限元编程》等几个DJVU格式电子书，近两天经常有朋友问我关于DJVU格式，这才发现这种格式的电子书籍在中国竟然没我想象中那么普及！真实遗憾啊，这可是超级好的格式，如果你对它不了解，也没关系，相信阅读完这片文章你就会喜欢上它啦~

什么是djvu？

djvu是一种电子文档格式，于1996年被美国AT&T实验室研制成功。与我们熟悉的pdf、ppt等一样，djvu文件中存储了文字、图片等信息，可通过软件进行此类文档的制作、分享、阅读操作。

djvu的技术特色，即为什么要选择djvu？

djvu没有ppt与pdf，甚至没有pdg更大众化，但是它的技术特点确是其先天优势，具体来讲有以下几点：

1.双层打印

很多电子书籍都在面临一个问题：书籍页面中有图片和文字，如果使用高分辨率打印，则电子文档会变得相当臃肿；如果降低分辨率，则书籍中图片质量会受到很大影响。djvu的出现完美解决了这个难题，它使用双层打印的技术。具体来讲，用高分辨率还原文字、较低分辨率压缩背景图片——既提高了文字可读性，又使整个图像的质量得到保证。

2.超级强大的压缩功能

这是djvu最吸引我的地方。因为我的资料很多，现在已经超过50盘DVD了，如果每本书都动辄几十MB，实在是吃不消。由于上面提到的双层打印技术的实施，使得djvu成为压缩比最大的一款电子资料格式。ak saint曾经做过试验，打印一份110页、全彩300dpi高分辨率的商业演示文稿，tiff格式占用2.5G、pdf占用155MB、jpeg占用128MB，而djvu只用了区区3MB！这可是1000：1的超高压缩率，而且前提是图片质量得到保证的情况！

3.适合网络分享

这点同时也是pdf的特点，不过pdf靠的是身后强大的Adobe产业链支持，但djvu不是这样，它适合网络传输的理由只因为它很小。你可以轻易地把很小的djvu文档上传到网络，如果安装了djvu浏览器辅助插件（插件安装文件也非常小），也可在网络上直接进行阅读、放缩、旋转等操作，最后直接打印出图。文档还完美支持超链接，可组成一组带链接的文件发送给你的朋友。

通过以上的介绍，你是否已经对djvu产生了浓厚的兴趣了呢？下面给出几个你可能会用到的链接地址：

~~DJVU阅读器：WinDjView 0.4.1 下载~~

View original post

Uncategorized

44个精彩的物理趣题（转自MATRIX67）

November 16, 2013 zr9558 Leave a comment

GaofeiZhang, Uncategorized

[转载] 尤老师的乒乓球之路

November 2, 2013 zr9558 Leave a comment

作者: zr9558

标题: [转载] 尤老师的乒乓球之路

时间: Fri Jan 8 20:51:28 2010

点击: 129

【以下文字转载自 D_Maths 讨论区】

【原文由 gfzhang 所发表】

我们系里和所里各有一张乒乓球台子。水平低一些的就在系里玩，高一些的就在所里。就像CBA与NBA，分得很清楚。比如，何老师和孙老师就总在系里。汪老师，钟老师和武海军他们一般都在所里。在系里打得好的，比如老纪，有时就去一下所里。而状态不能保持的，比如朱老师，现在就只能在系里打一打。当然也有例外，比如尤老师一直在所里打。那只是因为他的办公室在所里。

尤老师不仅自己对乒乓球保持着浓厚的兴趣，而且也热心关注我们系的乒乓球事业。比如在最近的一场师生对抗赛中，尤老师带病参加了比赛。虽然以零比三惜败给我系一个女同学，可是他那种战斗到底永不言输的精神给我们留下了深刻的印象。

尤老师除了训练的很刻苦之外，还请人买了很贵的球拍，同时从网上下载了教学录像来纠正自己的动作。尤老师的心中有一个梦想：就是希望有一天能够得到大家的认可，从而名正言顺的加入到我们系的高手俱乐部。因为尤老师是主任，所以汪老师，朱老师他们和尤老师比赛时总打假球。有时故意回一个很高的球，等尤老师把球扣过来时再大声喊“好球 ”！有一次刚把球回过去，就开始喊“好球”。结果尤老师一拍子把球抡在了网子上。我在旁边实在看不下去了。就说：“尤老师，他们是说这颗球是刚买的，是个好球”。

尽管如此，在过去的一年里有尤老师的乒乓球技术还是取得了可喜的进步。比如他学会了拉下旋球，学会了搓球，并且不再吃我发的不转球。有时我甚至想或许真的有那么一天，即使我不让球也输给了尤老师。那该是怎样的一个世界啊。

— ※ 来源:．南京大学小百合站 http://bbs.nju.edu.cn [FROM: 202.119.34.246] —

※ 来源:．南京大学小百合站 http://bbs.nju.edu.cn [FROM: 123.114.37.153]

Uncategorized

LaTeX 作图工具介绍

October 24, 2013 zr9558 2 Comments

LaTeX 作图工具介绍

logiclife

LaTeX 虽然在公式编辑上远胜任于 Office，但对于作图，初学者可能会觉得门槛较高而对 LaTeX 产生畏难情绪。不过，如果掌握了恰当的工具，LaTeX 下的作图仍然很方便且优于 Office。

利用命令绘图

利用命令可以精准控制图形的形状和位置，对于结构性较强的图形，利用命令画图比手工绘图更值得推荐。LaTeX 本身有一些命令可以绘制简单的图形，但绘制复杂图形则需要使用一些宏包，其中常用的宏包有：

tikz，非常强大的作图宏包，几乎可以画任何图形。甚至可以绘制简单的函数图像。其官方使用手册的最新版厚达726页。网上也有非常多的实例展示如何用 tikz 命令绘制各种图形，例如这个网页。
pstricks，老牌的作图宏包，异常强大。遗憾的是不支持 pdflatex 编译，不过支持xelatex（或许反过来说更对，xelatex支持pstricks）。
metapost，这是在 LaTeX 诞生之初就有的绘图工具，但因为不是 LaTeX 的宏包，而只是一个外部命令行工具，使用起来不够方便。不能直接在 LaTeX 中用代码画图，而必须用 metapost 命令画好图生成 eps 或 pdf 格式的文件供 LaTeX 调用。不过， metapost 的绘图能力独步天下，大概只有 pstricks 可以与之匹敌。
gnuplot，外部命令行工具，绘制函数图像的不二选择。提供和 LaTeX 的接口。这里是一个很好的简明入门教程。
xy-pic（其实宏包名为xy），如果是画交换图，特别是范畴论中的图形，使用 xy 宏包会极为方便。但画结构性不那么好的图形则比较麻烦。
bussproof，写 Gentzen 式树状逻辑推演极为方便。
qtree，画 tableau 证明树或语法分析树极为方便，但树枝没有箭头。
xy-ling，另一个画树状图的宏包，极其灵活，处理语言学中各种语法分析树不在话下。

其中前 3 种熟练掌握一种就完全够用了，后 5 种则是面向特殊用途的。

利用 GUI 绘图软件绘图

毕竟有些复杂的图用命令绘制仍然不方便（特别是结构性不那么好的图），这时需要使用外部绘图软件先手工绘制出图形，然后在 LaTeX 文档中调用由这些软件生成的图片或 tex 代码。理论上，任何绘图软件都可以生成可供 LaTeX 调用的图片，但考虑到有些图形上需要添加公式，这时普通的绘图软件就不够用了。我所了解的支持添加 LaTeX 公式的绘图软件有如下这些：

Inkscape，非常强大的矢量绘图软件，可实现很多复杂的效果，跨平台，且支持多种文件格式保存。Ubuntu 可通过源安装。没有特别声明支持 LaTeX，但实际上所绘图片可以直接存成 tex 格式（其代码利用了 pstricks 宏包），也可以存成 pdf 文件，然后在保存选项中选择包含 LaTeX 代码（用于处理图片中的公式），Inkscape 会生成一个名为<image>.pdf_tex的文件，最后在 LaTeX 主文档中使用 input 命令包含这个文件即可。详见这个文档说明。如果不需要绘制函数图形，Inkscape 是这里所列的绘图软件中绘图能力最强的。
Ipe，比 Inkscape 小巧，因而绘图功能也较弱，但如果只需要绘制简单图形，也够用了。不能导出为 tex 代码，直接生成 eps 或 pdf 格式图片供 LaTeX 文档调用，能自动剪裁图片大小，去掉白边。跨平台。Ubuntu 可通过源安装。Linux 下必须通过命令行启动。
LaTeXDraw，与 Ipe 类似。好处是在手工绘图的同时自动生成 tex 代码（利用了 pstricks 宏包）。跨平台。Ubuntu 可通过源安装。
XFig，比较老牌的支持 LaTeX 的 GUI 绘图软件。手工绘图后生成 .pstex（存储图片信息）和 .pstex_t（存储图片中的公式信息）文件供 tex 主文档调用。跨平台。Ubuntu 可通过源安装。虽然不是专业的图片编辑软件，但与 Inkscapte 相比，XFig 处理简单的数学图形可能更方便。缺点是：界面丑陋，而且不支持 pdflatex 编译，要先用 latex 编译，然后转成 pdf。
TpX，是我接触最早的支持 LaTeX 的 GUI 绘图软件，据说是一个经济学家因为要出书，图片太多，不方便处理，所以自己动手写了这个软件。与 Ipe 类似。小巧，方便。缺点是只支持 Windows。
GeoGebra，专门绘制函数图像，支持导出为 tikz 或 pstricks 代码，跨平台。Ubuntu 可通过源安装。
Dia，专门绘制流程图，支持导出为…

View original post 10 more words

Uncategorized

Polymath8: Writing the paper

October 22, 2013 zr9558 Leave a comment

What's new

The main objectives of the polymath8 project, initiated back in June, were to understand the recent breakthrough paper of Zhang establishing an infinite number of prime gaps bounded by a fixed constant $latex {H}&fg=000000$, and then to lower that value of $latex {H}&fg=000000$ as much as possible. After a large number of refinements, optimisations, and other modifications to Zhang’s method, we have now lowered the value of $latex {H}&fg=000000$ from the initial value of $latex {70,000,000}&fg=000000$ down to (provisionally) $latex {4,680}&fg=000000$, as well as to the slightly worse value of $latex {14,994}&fg=000000$ if one wishes to avoid any reliance on the deep theorems of Deligne on the Weil conjectures.

As has often been the case with other polymath projects, the pace has settled down subtantially after the initial frenzy of activity; in particular, the values of $latex {H}&fg=000000$ (and other key parameters, such as $latex {k_0}&fg=000000$, $latex {\varpi}&fg=000000$…

View original post 691 more words

ZHANG RONG

Category Archives: Uncategorized

Theoretical & empirical analysis of Expected Sarsa

On-Policy First-Visit Monte Carlo

UCL course – 2016

Lectures

Value iteration networks

LEARNING REINFORCEMENT LEARNING (WITH CODE, EXERCISES AND SOLUTIONS)

WHY STUDY REINFORCEMENT LEARNING

HOW TO STUDY REINFORCEMENT LEARNING

TABLE OF CONTENTS

LIST OF IMPLEMENTED ALGORITHMS

解析 DeepMind 深度强化学习 (Deep Reinforcement Learning) 技术

强化学习

Markov Decision Process

Discounted Future Reward

Q-learning

Deep Q Network

Experience Replay

Exploration-Exploitation

Deep Q-learning 算法

Final notes

Credits

Links

Excerpts from the Grothendieck-Serre Correspondence

254A announcement: Analytic prime number theory

Manjul Bhargava and his 290 theorem

Khot, Osher, Griffiths

Lindenstrauss, Ngo, Smirnov, Villani

Avila, Bhargava, Hairer, Mirzakhani

245A: Problem solving strategies

The parity problem obstruction for the binary Goldbach problem with bounded error

Mac OS X下MATLAB R2012b安装及破解

Mac OS X下MATLAB R2012b安装及破解 (Updated on 10/19/2013)

[转]DJVU格式电子图书介绍

44个精彩的物理趣题（转自MATRIX67）

[转载] 尤老师的乒乓球之路

LaTeX 作图工具介绍

Polymath8: Writing the paper

zr9558's Blog