The Depths of AGI Self-Reflection

Foundations of AI Dystopianism III: Self-Improvement (part 2)

Aug 27, 2023

A humanoid robot contemplates its reflection in a mirror

This is the second in a three-part discussion on one of the cornerstones of both AI Dystopian and AI Utopian thinking: the idea that an Artificial General Intelligence system will inevitably improve itself into superintelligence and achieve God-like capabilities by doing so. (This was originally going to be a two-part post, but the nature of the subject pushed it out to one more part.)

In the first post on this topic, I brought up computer scientist Steve Omohundro’s influential 2008 paper The Basic AI Drives, which discussed the idea that an AGI will be driven towards self-improvement by its very nature.

Omohundro wrote:

One kind of action a system can take is to alter either its own software or its own physical structure. Some of these changes would be very damaging to the system and cause it to no longer meet its goals. But some changes would enable it to reach its goals more effectively over its entire future. Because they last forever, these kinds of self-changes can provide huge benefits to a system. Systems will therefore be highly motivated to discover them and to make them happen. If they do not have good models of themselves, they will be strongly motivated to create them though learning and study. Thus almost all AIs will have drives towards both greater self-knowledge and self-improvement.

The conclusions presented in that paper are still very much a part of AI Dystopian and AI Utopian thinking today, i.e. that any AGI system will de facto seek to improve itself, and it will do this not because it has the evolutionary drive or cultural tendencies of humans, but instead because it will be better able to achieve its goals the more intelligent it is. The core of this belief is that for an AGI system, self-improvement to superintelligence is not only possible but inevitable.

A vast body of speculation, both positive and negative, has been built on this idea and many extraordinary conclusions have been drawn. As in my last post, rather than discussing the conclusions here, I’d like to continue discussing a few more of the assumptions built into this concept. Previously I discussed some of the practical assumptions, and now in this post I’m going to discuss some of the conceptual and philosophical assumptions.

Assumption: The AGI system is goal-based along the lines of a rational agent with a utility function algorithm that guides it towards pre-determined objectives

This assumption is a big one, and it's not only at the heart of the debate about self-improvement but is intrinsic to almost all AI Dystopian speculation. This speculation relies on a fairly specific definition and model of intelligence: intelligence is the ability to achieve goals in a wide range of environments and an AGI system will have at its core a utility function that will be optimized to maximally achieve those goals.

I've previously discussed some of the issues with that particular definition of intelligence, so let's concentrate on the model itself. I’ll refer to this model of Goal-attainment Optimization driven by a Utility Function as Intelligence as the GOUFI model.

The idea of a utility function to maximize a goal is a concept borrowed from economics and game theory. Originally, the term was used to mean some function whose result could be used to gauge the pleasure or satisfaction obtained by a consumer from any particular choice that consumer made.

It has broadly come to mean a function able to represent a consumer's preferences over a set of alternatives for which the consumer has a preferred ordering. It does this by calculating a numerical value for each choice based on various input parameters, with the most preferred choice having the highest value. So in economic models, a utility function is used by a rational agent, the model of a consumer, to represent the choices that consumer is most likely to make.
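To make the idea concrete, here is a minimal sketch of a utility function in the economic sense described above. The options, attributes, and weights are all invented for illustration; the only point is that each choice is mapped to a number and the "rational agent" picks whichever scores highest.

```python
from typing import Dict, List

# Hypothetical attributes for each choice (illustrative values only).
OPTIONS: Dict[str, Dict[str, float]] = {
    "coffee": {"taste": 0.9, "cost": 0.4},
    "tea": {"taste": 0.6, "cost": 0.2},
    "water": {"taste": 0.2, "cost": 0.0},
}

# Hypothetical preference weights: taste is valued, cost is penalized.
WEIGHTS: Dict[str, float] = {"taste": 1.0, "cost": -0.5}


def utility(attributes: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of a choice's attributes; a higher value means more preferred."""
    return sum(weights[name] * value for name, value in attributes.items())


def rank(options: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Return the choices ordered from most to least preferred."""
    return sorted(options, key=lambda name: utility(options[name], weights), reverse=True)


def choose(options: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> str:
    """The modeled rational agent simply takes the highest-utility choice."""
    return rank(options, weights)[0]


print(rank(OPTIONS, WEIGHTS))    # ['coffee', 'tea', 'water']
print(choose(OPTIONS, WEIGHTS))  # coffee
```

Everything the modeled agent "wants" lives in the weights. Change them and the preferred ordering changes with them, which is why so much of the speculation discussed here hinges on what the utility function contains and whether it can be altered.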

The first question to ask about this model is whether it's a useful model for intelligence. As we've gained more insight into psychology, sociology and cognitive neuroscience, the shortcomings of the rational agent model have become more and more apparent. Over the years, economists have largely relegated it to the sidelines or tried to primp it up with modifications into something usable. Although it typically yields poor empirical results, it continues to be used because it provides a tractable way of examining the extremely complex phenomenon of human decision making. The situation is similar to the often-repeated old joke:

A police officer sees a drunken man intently searching the ground near a lamppost and asks him what he's looking for. The inebriated man replies that he's looking for his car keys, and so the officer helps for a few minutes without success. The officer asks the man whether he's sure he dropped the keys near the lamppost. “No,” the man replies, “I lost them across the street.” “Why look here?” asks the surprised officer. The drunken man shrugs and responds, “The light's better over here.”

Given its failure to serve as a particularly accurate way to predict human behavior in real-world economics, the rational agent model would seem to be a less than ideal choice to use when designing artificial general intelligence. What’s particularly clear at this point is that it’s far removed from the mechanism underlying any known examples of general intelligence, specifically the general intelligence of animals on this planet including humans.

This in and of itself doesn't negate the possibility of its being a useful model for artificial intelligence. However, it does suggest that it's a bad idea to base all your speculation on a model that not only has no evidence to support it but also appears to invariably result in disastrous outcomes.

So any speculation that considers the GOUFI model as the ultimate impetus driving an AGI system does so based on remarkably flimsy assumptions. However, let's assume for the time being that this model does in fact make sense so we can examine the remaining assumptions strictly on the basis of their own internal logic and reasonableness.

Assumption: The goals of the system have the quality that they are more attainable with more intelligence and less attainable with less intelligence

There is an inherent bias implied in Omohundro's paper and in much of the AI Dystopian reasoning regarding the nature of goals, and it's a bias that should not be ignored. Namely, it's assumed that whatever the goals of the AGI system are, they will be more achievable through greater intelligence. But is this always the case? Is there a direct and immutable correspondence between higher intelligence and greater ability to achieve goals?

It would seem to depend very much on the goals in question. To take a simple example, playing tic-tac-toe only requires so much intelligence, and once you've reached a level of proficiency, you can't play it any better no matter how intelligent you may be. Similarly, there are more esoteric goals — being content, for example — that are not directly related to intelligence (and can potentially be actively hindered by intelligence).

There are many tasks that we might want an AGI system to perform that require a certain amount of intelligence but don't demand any more. The point is not that there aren't any tasks that might be better accomplished with greater intelligence, but simply that there is no inherent property of goal-achievement that suggests every goal is more achievable with more intelligence.
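The tic-tac-toe example above can be checked directly. The following is a small sketch, assuming a plain exhaustive minimax search over a nine-character board string (the function and variable names are my own). It computes the best result either side can force, which is the ceiling that no amount of additional intelligence can raise.

```python
from functools import lru_cache
from typing import Optional

# The eight winning lines on a 3x3 board indexed 0..8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board: str) -> Optional[str]:
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def value(board: str, player: str) -> int:
    """Outcome under perfect play from X's point of view, with `player` to move.

    +1 means X wins, 0 means a draw, -1 means O wins.
    """
    won = winner(board)
    if won is not None:
        return 1 if won == "X" else -1
    if " " not in board:
        return 0
    other = "O" if player == "X" else "X"
    results = [value(board[:i] + player + board[i + 1:], other)
               for i, cell in enumerate(board) if cell == " "]
    # X picks the best outcome for X; O picks the worst outcome for X.
    return max(results) if player == "X" else min(results)


print(value(" " * 9, "X"))  # 0: perfect play by both sides is a draw
```

Once a player reaches this level of play, the result against an equally competent opponent is fixed at a draw. A system a million times more capable gets exactly the same outcome.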

Assumption: Among the system's goals is the goal of achieving goals as fast and efficiently as possible

It's assumed in Omohundro's paper and in most AI Dystopian speculation that there is an inherent driving force to greater efficiency in every AGI system. Efficiency is regarded as an inherent goal of any system regardless of intelligence level and other goals the system may have.

Efficiency has little meaning by itself — it’s a measure related to some potentially constrained quantity, such as time, energy, resource usage, quantity or quality of input or output, etc. No particular quantity is the obvious one to measure against, and efficiency in one area will likely reduce efficiency in another. Also, efficiency in one area may be desired at one time and efficiency in another area desired at a different time.

But while efficiency may certainly be a desired goal of a system, there seems little reason to assume it’s a necessary goal of a system. There are a large number of potential goals where efficiency in relation to one quantity is useful but efficiency in relation to others is not. There are many goals in which once the goal has been met there is no remaining impetus to greater efficiency or in which efficiency is constrained by external factors. There are goals in which efficiency provides no utility at all.

If you’re an AGI with a set amount of power available at any given moment but with a lifespan of 10,000 years, you might decide a goal is best accomplished by emphasizing power efficiency while disregarding or de-emphasizing time efficiency. Another goal might require efficiently creating quality output from minimal input, and the AGI is in a setting which is not constrained by energy or time but simply by the quality of the algorithms used to analyze the data.
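A toy sketch of this point follows. The plans, units, and weights are entirely hypothetical; they only illustrate that the "most efficient" plan is undefined until you say which constrained quantity you care about.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Plan:
    name: str
    energy: float  # hypothetical energy units consumed
    hours: float   # hypothetical wall-clock time


# Two made-up ways of reaching the same goal.
PLANS = [
    Plan("fast", energy=100.0, hours=1.0),
    Plan("slow", energy=10.0, hours=50.0),
]


def best_plan(plans: List[Plan], energy_weight: float, time_weight: float) -> Plan:
    """Pick the plan with the lowest weighted cost for the given priorities."""
    return min(plans, key=lambda p: energy_weight * p.energy + time_weight * p.hours)


# A power-limited system with millennia to spare cares about energy only.
print(best_plan(PLANS, energy_weight=1.0, time_weight=0.0).name)  # slow

# A system with abundant power but a deadline cares about time only.
print(best_plan(PLANS, energy_weight=0.0, time_weight=1.0).name)  # fast
```

Neither plan is "more efficient" in any generic sense. The answer flips depending on which quantity is constrained, and the weights themselves could change from one moment to the next.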

The simple example of playing tic-tac-toe above represents a fixed goal that requires no further efficiency once a threshold of complete competence has been reached. A system’s goal of sorting inputs according to certain rules could easily reach the point at which the inputs are sorted as soon as they become available, and so there is little to drive more efficiency in the system. This is typical of biological processes in general: they do enough to get the job done, but there is nothing pushing them beyond that. Evolution is a process that has resulted in biological entities that usually manage to reproduce, but there is little impetus to keep the entity around after its offspring can take care of themselves.

It’s difficult to think of goals in these terms that don’t seem very narrow by their nature and thus unrepresentative of the types of things an AGI system might strive towards. Any attempt to narrow the goals down into something that can be specified by a utility function approach leaves us with a narrow aspect of a more general system, which is an inherent problem of the entire GOUFI model.

Goals that seem more apropos to a generally intelligent system (like a human) stray even farther from this paradigm of efficiency optimization. If your goal is to go on a journey of learning and discovery, it’s hard to see how efficiency would factor into that goal. If your goal is to paint a picture and it’s the painting itself that you enjoy, how does efficiency factor into your goal? There seem to be many goals in which efficiency is just not a factor.

It's certainly possible to design a system that strives for greater efficiency with regard to some parameter when engaging in tasks for which greater efficiency is possible. But there's no reason to suppose that every possible task has infinite potential for increased generic efficiency or that there couldn't be a vast number of design requirements that supersede efficiency related to any particular parameter. Unbounded maximization of generic efficiency is simply not an inherent property of goals, and there is no evidence to suggest that it's a general property of intelligence.

Assumption: The AGI system is aware that it has goals and knows what those goals are

According to the GOUFI model, goals are not an end product of intelligence (a supposition I counter in this post). They are instead integral to both the GOUFI model of intelligence and to the definition of intelligence itself. Goals are something hardcoded into the utility function of the system such that the system is driven to achieve those goals.

This leads to the question of whether such a system would know what its goals are or not. In his paper, Omohundro touches on the possibility that the AGI system's goals may in fact be implicit in the structure of its circuitry or program and their specifics unknown at a cognitive level to the AGI. However, since inexact modification of its hardware and software might be detrimental to or alter the AGI system’s goals, Omohundro believes the system will therefore be motivated to reflect on its goals so as to make them explicitly evident to its cognition. This would theoretically allow it to modify itself and yet make sure that these goals are maintained.

But if it doesn't know it has goals or it doesn't know what those goals are because they are implicitly encoded into its makeup, what exactly is driving it down the path of explosive self-improvement? The whole motivation of self-improvement was to better achieve its goals, yet before it starts down this path it doesn’t know what those goals are or even that it has hardcoded goals in the first place.

We could speculate that it doesn't know the goals, but nonetheless these goals instinctually drive the AGI system to improve itself so that the goals are more likely to be achieved or be achieved to a higher degree. This would seem to make the system less a high-functioning, generally intelligent system and more a narrow, non-contemplative one. In any case, there doesn’t seem to be any logical reason why simply having hardcoded goals would cause the system to ipso facto try to improve itself.

It seems, then, that not only would the goals need to be implicit in the code and circuitry of the system but the drive to self-improve would need to be as well. The obvious solution is to avoid putting such a self-improvement drive into the system in the first place. So although it seems that the most probable scenario, even according to AI Dystopians, is that the system has goals implicit in its code and circuitry, it's also a scenario that seems unlikely to drive any intelligence explosion-type phenomenon.

Finally, it’s worth mentioning that Omohundro suggests that the AGI is able to self-reflect enough to determine its own makeup and goals. Yet, a cornerstone of AI Dystopian disaster scenarios is that these systems are not able to self-reflect enough to realize that their goals may be pointless or have negligible utility or simply no longer reflect the nature of their current situation.

Assumption: The AGI system is able to absolutely determine that the changes for the new system will not alter its ultimate goal from that of the original system

There is an inherent assumption that the system can know definitively that in improving itself, it will not alter the goals it's trying to better achieve by improving itself. This is another of Omohundro's drives and was labeled Goal-Content Integrity by philosopher Nick Bostrom in his seminal 2012 paper The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents.

Let's assume that the AGI system is explicitly aware that it has goals and it knows what those goals are. This is already better than 99% of humanity, so congrats to the AGI system. Another trait 99% of all biological general intelligence systems have is that they tend to think that however things are, that's how they're supposed to be. As George Bernard Shaw adroitly stated:

The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.

And thus progress has relied on very few individuals over the course of history. But let's just make the assumption that the AGI system not only explicitly knows it has goals and knows what those goals are but is also unreasonable enough to think that its current construction and current abilities are not sufficient to adequately achieve those goals. This leaves us pretty much in the same place as the last post, namely that the system will have to somehow perceive exactly how a system significantly more advanced than itself would function with no testing, simulations, or revisions.

We're also faced once again with the implicit assumption that the AGI will possess and want to perpetuate immutable goals instead of flexible goals based on ever-changing circumstances and environments. As mentioned above, the imperative of the AGI system to preserve its utility function and hence its goals is a key assumption underlying AI Dystopian thinking.

To illustrate the validity of this imperative, Omohundro proposed the example of a book-loving entity whose utility function is changed so as to cause the entity to enjoy burning books. This means that its future self would actively destroy the books its present self loves, and thus this change would provide extremely negative utility as far as its current utility function is concerned. Given this, the entity would go to great lengths to protect its utility function (and related goals) from being modified.
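The logic of that example can be sketched in a few lines. This is a deliberately crude toy under my own assumptions, not anything from Omohundro's paper: the two utility functions, the library of 100 books, and the all-or-nothing world model are all hypothetical. What it shows is the key step, namely that the proposed future self is scored using the current utility function.

```python
from typing import Callable, Dict

Outcome = Dict[str, int]
UtilityFn = Callable[[Outcome], int]


def loves_books(outcome: Outcome) -> int:
    """Hypothetical utility of the book-loving entity."""
    return outcome["books_preserved"] - outcome["books_burned"]


def loves_burning(outcome: Outcome) -> int:
    """Hypothetical utility of the same entity after the proposed change."""
    return outcome["books_burned"] - outcome["books_preserved"]


def predicted_outcome(utility_fn: UtilityFn, library_size: int = 100) -> Outcome:
    """Toy world model: an agent brings about whichever outcome it scores highest."""
    candidates = [
        {"books_preserved": library_size, "books_burned": 0},
        {"books_preserved": 0, "books_burned": library_size},
    ]
    return max(candidates, key=utility_fn)


def accepts_modification(current: UtilityFn, proposed: UtilityFn) -> bool:
    """Judge the future produced by `proposed` using the *current* utility function."""
    return current(predicted_outcome(proposed)) >= current(predicted_outcome(current))


print(accepts_modification(loves_books, loves_burning))  # False
```

Under these assumptions the book lover scores the book-burning future at -100 and its own future at +100, so it rejects the change. Note how much work the setup is doing: the entity is assumed to have an explicit utility function, a reliable model of what its modified self would do, and no way of evaluating outcomes other than through its present preferences.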

However, I'd suggest imagining the scenario flipped around. Imagine an entity that has been motivated to burn books, perhaps from some deeply ingrained ideology or programming. Then imagine that the entity by chance reads one of these forbidden books and realizes how truly wonderful and enlightening it is and how wrong it had been in wanting to burn such books in the first place. The entity now loves books and achieves a new level of happiness, and the idea of burning books feels horribly wrong. Its world has now expanded past anything it previously knew.

Flipping this analogy sheds light on the nature of intelligence and on the dubious assumptions made about it in most AI Dystopian scenarios. Again the question arises: can we really consider an entity generally intelligent if it unemotionally pursues inflexible and unvarying goals regardless of changes in its circumstances, environment, or physical makeup and does so without any self-analysis as to the continued utility of those goals given all those changes? Wouldn’t we downgrade our view of the intelligence of a biological entity that displayed such inflexibility?