Problem 7.1.5

The exercise statement says that only the states in which it's A's turn are relevant.
Being "on the move" means the game isn't over yet (otherwise nobody is on the move).
Now I have two problems:

  1. I have a state s in which A is on the move and can either win or make some other move that doesn't win immediately. Here I assumed that R(s) = 100. I have to put that number in somewhere, and since every state in which A has already won isn't really a state at all, I figured it has to go here. That feels pretty hacked together, though. Is this allowed?
  2. I have a state s from which A can take an action, after which B can either win (case 1) or move to a state s' from which the game continues (case 2). In case 1 one could model it so that we do have a state. But even then I'd have to take the maximum over an empty set (there are no actions left), and that doesn't make sense. If it isn't a state, however, I have no utility to plug into the computation for s. What do I do here? I also have to combine this with case 2; if I simply drop case 1, the algorithm doesn't "notice" that s is a bad state.

Thanks :huh: :slight_smile:


I'm not sure if I got this correctly, but one "round" [m]T(s, a, s')[/m] is defined as the probability of [m]A[/m] taking action [m]a[/m] followed by [m]B[/m]'s move. After both moves we are in state [m]s'[/m] and can update the utility because it's [m]A[/m]'s move again. In my opinion, this is what's meant by "game states where it's [m]A[/m]'s move". In addition, a reward should be given only in terminal states such as (4,3) or (3,1), and be zero elsewhere. But I'm not a hundred percent sure about this. Hope this helps - otherwise you should wait for @Jazzpirate's answer :smiley:
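Roughly what I have in mind, as a quick Python sketch (all names here - `states`, `actions`, `T`, `R`, `gamma` - are my own placeholders, not from the assignment):

```python
# Rough value-iteration sketch under the reading above: T(s, a, s2) already
# folds B's reply into the transition, so every state in `states` is one
# where it's A's move, and R is non-zero only in terminal states.
def value_iteration(states, actions, T, R, gamma, sweeps=100):
    U = {s: 0.0 for s in states}        # arbitrary initialisation
    for _ in range(sweeps):
        U_new = {}
        for s in states:
            best = max(
                (sum(T(s, a, s2) * U[s2] for s2 in states) for a in actions(s)),
                default=0.0,            # no actions in a terminal state -> 0
            )
            U_new[s] = R(s) + gamma * best
        U = U_new
    return U
```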


This specifically includes terminal states of course! "A won" and "B won" are terminal states. Sorry if that wasn't clear.

No, this is obviously not allowed. If A hasn’t won yet, they don’t get their reward.

Or in other words:

Exactly, thank you :slight_smile:


Can we assume the reward to be zero in non-terminal states?


Damn, yes, that I definitely should have mentioned!


For the drawing exercise: I assume we only need to draw reachable states, and once a terminal state is reached no more actions are performed, correct?

For 5., I'm not really sure about the wording. Can I set gamma (e.g. to 0.99), do I need to do this for a "general" gamma, or should I do it for different gammas that possibly lead to different strategies?


Correct

The second one. Just take an undetermined variable \gamma - ultimately it doesn’t matter; the strategy should be the same in all cases, which is why we can just leave gamma unspecified.


I still have this problem. I get that (3,1) should be a terminal state, but there are no actions, so what is the maximum over all possible actions? An empty set doesn't have a maximum. If I assume that the result is zero, which practically makes sense because we only have non-negative values, I get the problem that I have to divide by zero ("Now assume that your opponent picks a move with probability inversely proportional to our utility of the resulting state").
My idea was to introduce an epsilon greater than zero as the maximum of the empty set. Is there a better way to do this?


What I did was to assume self-loops in terminal states and calculate the utility with [m]s' = s[/m], which works fine with my Bellman equation. In the end, that amounts to multiplying by zero and therefore gives [m]U(s) = R(s)[/m] for all [m]s ∈ F[/m].


I already did that, but my problem is the parents of terminal states. My probability T in the Bellman equation contains 1/U(s) for every successor state. I don't see how to avoid this, because of the inverse proportionality, and so I have to divide by zero. That's why I wanted to introduce an epsilon.


Maybe you should set T to 1 if there is only one reachable state.


Yes, it does - in an ordered set L, the maximum of the empty set is the minimum of L. Hence the maximum is 0, which nicely amounts to "the utility of a terminal node is exactly its reward" - which makes sense if you think about it :slight_smile:
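Spelled out with the Bellman form I assume you are all using (my notation, with $A(s)$ the set of actions available in $s$):

$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} T(s,a,s')\,U(s')$

so for a terminal state, where $A(s) = \emptyset$, the max-term is $0$ and $U(s) = R(s)$.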

Hmm, this doesn’t happen to me, but then again:

This doesn’t happen for me either; mine only contains 1/(the sum of the utilities for all successor states) :wink:


Note that by saying "probability inversely proportional to our utility of the resulting state" I don't mean that there is a fixed proportionality factor between all utilities and their probabilities. What I meant to say is simply "the higher the utility, the lower the probability".


In either case, you may safely assume that for a set of successor states {s_1, …, s_n} of which U(s_1) = … = U(s_k) = 0, the probability is 1/k for each of s_1, …, s_k and 0 for all others; i.e. the opponent always picks a state with utility 0, and if there is more than one, each of those with equal probability. You'll note that this satisfies (a) the proportionality requirement, under (b) the additional assumption that 0 is as bad as it gets and is infinitely worse than every other utility.
Kinda hacky, and I still assume that there's an "error" in your Bellman equation somewhere, but I understand if you don't want to throw everything away or spend hours looking for where it might be wrong, and this should get you unstuck without violating the spirit of the exercise.
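If it helps, that rule as a tiny Python helper (purely illustrative; `U` maps states to their current utilities, and the non-zero branch falls back to normalised inverse utilities as one reasonable choice):

```python
def opponent_probs(successors, U):
    """Pick uniformly among utility-0 successors if there are any;
    otherwise make the probability inversely proportional to utility."""
    zero = [s for s in successors if U[s] == 0]
    if zero:
        return {s: (1 / len(zero) if s in zero else 0.0) for s in successors}
    inv = {s: 1 / U[s] for s in successors}   # higher utility -> lower probability
    total = sum(inv.values())
    return {s: w / total for s, w in inv.items()}
```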


DAMN, but I just noticed that I’M AN IDIOT.

This is incredibly misleading, since this is only the case if s1 and s2 are the only successors >.<

Sorry, I thought the way I put it would clearly imply one unique way to go about doing things, which it clearly did not. Worse, I accidentally suggested a very different way of modelling this than I intended. See the previous comment for how to continue if you encounter a division by zero. :confused:


Could you please give another valid example that makes clear which function you intended? If that is feasible, consider the utilities 30, 20, 10 for 3 successor states.


What I'd say (not validated yet): what you have to do is calculate the ratios between the utilities and invert them.
That means, if:

  • [m]U(s_1) = 10[/m],
  • [m]U(s_2) = 2 * U(s_1) = 20[/m] and
  • [m]U(s_3) = 3 * U(s_1) = 30[/m] for player [m]A[/m]
    then for player [m]B[/m] it is twice as probable to pick [m]s_1[/m] as [m]s_2[/m] and three times as probable to pick [m]s_1[/m] as [m]s_3[/m]. Of course, [m]B[/m] is also 1.5 times as likely to pick [m]s_2[/m] as [m]s_3[/m]. Hope that hint helps you.
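Making that concrete (again, not validated): normalising the inverse utilities gives $P(s_1) : P(s_2) : P(s_3) = \tfrac{1}{10} : \tfrac{1}{20} : \tfrac{1}{30} = 6 : 3 : 2$, i.e. $P(s_1) = \tfrac{6}{11}$, $P(s_2) = \tfrac{3}{11}$ and $P(s_3) = \tfrac{2}{11}$, which reproduces exactly those ratios.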

Actually, what I intended was simply $P(s) = 1 - \frac{U(s)}{\sum_{s'} U(s')}$ (where s' ranges over all successors), but any probability assignment where the enemy is less likely to pick a successor with high utility is fine :wink:
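(To make that concrete: with just two successors, say $U(s_1) = 10$ and $U(s_2) = 20$, this gives $P(s_1) = \tfrac{2}{3}$ and $P(s_2) = \tfrac{1}{3}$, which sums to $1$; with $n > 2$ successors the values sum to $n - 1$ and would have to be renormalised, hence the caveat above.)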


The solution from assignments.pdf says "we initialize with [m]U(s) = 50[/m] for all states", but then during the update of state [m](A_2, B_3)[/m] (second bullet) it seems that the current reward [m]R(s)[/m] for this state is zero, where I expected it to be 50. So specifically, what's the difference between the utility and the reward function?

Edit: And why is B's probability of moving right to [m](A_1, B_4)[/m] the same as moving left to [m](A_1, B_2)[/m], where he'd win instantly?