Last year I was taking a class on AI & Society, and we had a lecture on ethical considerations in AI systems. For example, there’s a classic debate about whose safety you should/would prioritize in a self-driving car — e.g. a passenger vs a pedestrian vs someone’s pet dog vs a (stray) raccoon — and how you’d assign blame for an accident. I spoke to the professor after class and made the case that the discussions were based on antiquated models of how AI systems work, and that it would be a lot more useful (to everyone) if the discussions were grounded in how modern systems function. I didn’t really have specific suggestions then, but I’ve been working on production-level robot systems for almost 6 months now, so I want to take a stab at constructing the right mental model.
There are two distinct types of systems: learning-based (think ChatGPT) and not learning-based (think something preprogrammed, like a robotic arm in a car factory). I’m going to refer to the latter as classical from now on, but these methods aren’t necessarily old; their fundamental characteristic is that they aren’t learning-based. Oftentimes, AI systems will be a mix, but individual components will still be clearly one or the other.
Learning-based systems are defined by a dataset plus a model architecture and a training objective. Classical systems are based on a set of assumptions and a spec.
Learning-based Systems
The right model for thinking about the performance of a learning-based system is a basketball court. If you take a certain player, they will have some probability of making a shot from each point on the court. No shot is a guaranteed miss or make — even dunks are not 100% in the NBA. Instead, some shots are just more or less likely than others, and the probability changes gradually: the chances of making a 19 ft shot are pretty close to the chances of making a 20 ft shot.
In the analogy, the point you’re shooting from is the model’s input. The architecture and training objective of a model help determine how that probability changes as you change the input. No matter how you practice, the probability of making a dunk drops off quickly as you move further away from the rim. Likewise, poor architectures and objectives will fail to produce good models no matter how much data or compute you use.
The dataset is what determines where exactly the model does well or poorly. In the analogy, Steph Curry is a great three point shooter (at least partially) because he practices three point shots over and over and over again. He’s able to make 30+ ft shots because he’s trained from that spot i.e. he’s trained on that input. However, there are still shots that are outside his repertoire (i.e. outside his training data) where he will perform poorly.
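To make that shape concrete, here’s a toy sketch in Python. The logistic curve and every number in it are made up; the only point it illustrates is that a learned system gives you a probability that changes gradually with the input and is never exactly 0 or 1.

```python
import math

def p_make(distance_ft: float) -> float:
    """Toy model of shot-make probability as a function of distance.

    The curve and its parameters are invented purely for illustration.
    The point is only that the probability changes gradually with the
    input and is never exactly 0 or 1.
    """
    return 1.0 / (1.0 + math.exp(0.35 * (distance_ft - 18.0)))

for d in (1, 10, 19, 20, 30):
    print(f"{d:>2} ft: {p_make(d):.3f}")

# Even a dunk (1 ft) isn't 1.0, and 19 ft vs 20 ft differ only slightly.
```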
Classical Systems
On the other hand, the assumptions and specs for a classical system are much more clear cut. The spec may be complicated, the assumptions may be explicit or implicit, and some algorithms will fail more gracefully than others, but everything is fundamentally sharper. For example, a robot might have some maximum weight for objects it can lift, or a sensor might have specific environmental conditions where it can and cannot work. You might expect a self-driving car to go at least 65 mph when unobstructed on the highway. My main point is that these systems do not have the same randomness as learned systems. The analogy is closer to something like a one rep max in weightlifting. You have some maximum capability, and this capability can be improved over time, but at any one time, you can either lift a certain weight or you can’t. You might be able to lift more weight by, say, using a weightlifting belt, but that isn’t random: it’s just a new case, i.e. a new assumption or a different part of the spec. You still have some max limit under these new conditions, and it’s still the case that you can either lift a certain weight or you can’t. Now, that rule doesn’t hold perfectly in weightlifting or classical AI systems: people have good and bad days, and so do systems. But on the whole, the behavior of classical systems is much more discrete.
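Here’s the same idea as a toy sketch (the weights and the belt condition are made-up examples): the answer is a hard yes or no that flips at a limit, rather than shading off gradually.

```python
def can_lift(weight_kg: float, belt: bool = False) -> bool:
    """Toy 'spec' for a classical system: a hard capability limit.

    Wearing a belt isn't randomness; it's just a different case in the
    spec, with its own hard limit. All numbers are made up.
    """
    max_kg = 180.0 if belt else 160.0
    return weight_kg <= max_kg

print(can_lift(159.9))              # True
print(can_lift(160.1))              # False: the behavior flips at the limit
print(can_lift(175.0, belt=True))   # True under the new condition
```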
Comparing the Two
To summarize: both types of systems have failure cases, but these failure cases have different shapes. Classical systems either work or they don’t, and the change in performance is more sudden, while learning-based systems are never guaranteed to do anything, and they fail more gradually.
This also means that they are diagnosed and fixed differently. For learning-based systems, you can often fix issues by feeding in more data on areas where it’s weak. For example, all of the LLM companies have (directly or indirectly) hired thousands of contractors to build proprietary datasets for topics like coding and chemistry. It may not always work — the architecture and training objective may still be limiting factors — but the first thing you try is always new and more data.
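As a rough sketch of what that looks like in practice (the dataset contents and the 3x oversampling factor are hypothetical, not anyone’s actual recipe):

```python
# Patch a weak area (say, coding) by adding targeted data, often
# oversampled relative to the base mix. Everything here is invented
# for illustration.
base_dataset = [
    ("general prompt 1", "answer 1"),
    ("general prompt 2", "answer 2"),
]
curated_coding_examples = [
    ("write a function that sorts a list", "def sort_list(xs): return sorted(xs)"),
    ("explain this off-by-one bug", "range(n) stops at n - 1, so use range(n + 1)"),
]

training_data = base_dataset + curated_coding_examples * 3  # upweight the weak slice
print(len(training_data), "training examples")
```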
For classical systems, you just need to extend the logic for the successful cases to include some of the failures. For example, your existing logic might assume that the ground is flat — causing your robot to do something stupid when the ground is slanted. You’d either: a. remove the assumption and tweak all of your logic to be more general, or b. add a new case, i.e. a new set of logic that specifically handles situations where the ground isn’t flat.
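Here’s a toy sketch of option b (the function name and the 5 degree threshold are invented for illustration):

```python
def plan_step(tilt_deg: float) -> str:
    """Toy planner for a classical system.

    The original logic assumed flat ground. This follows option b from
    above: keep the old logic for the flat case and add an explicit
    branch for slanted ground. The 5 degree threshold is made up.
    """
    if abs(tilt_deg) <= 5.0:
        return "walk_normally"      # original flat-ground logic
    return "walk_with_slope_gait"   # new case added to handle the failure

print(plan_step(2.0))   # walk_normally
print(plan_step(12.0))  # walk_with_slope_gait
```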
Back to Ethics
Ok, getting back to ethical questions. I think it’s fairly straightforward for classical systems: the spec acts like a contract. If something bad happens while the system is being used outside of the spec, it’s (generally) the user’s fault. If the usage is within spec, then it’s the engineer’s fault. For example, it wouldn’t be Cuisinart’s fault if you put a toaster in the bathtub. Obviously, it may not always be clear whether something is within the spec, and the more detailed a spec is, the better, but there’s at least a solid, somewhat obvious framework for ethics-related questions. Now, there are definitely still larger (for example, societal) questions that can’t necessarily be answered by a spec, but we have a clear approach for the more immediate concerns.
Most of the impressive systems that have come out recently are learning-based — from LLMs to self-driving cars. That means that they have no performance guarantees i.e. you can’t be sure that the Cybertruck will always drive over the blue-haired liberal instead of the true American patriot, and you can only be 10% sure that it will drive over someone. Engineers only have two ways to control models: a. through training data curation and b. by building scaffolding around the learned model. An example of the former would be a self-driving car company intentionally collecting data with shiny objects because this is a known failure case. Alternatively, if your car is moving too quickly towards an object that is too close, you might want to run a simple algorithm to safely stop as soon as possible — regardless of what the self-driving model suggests. Overriding the model like this is an example of scaffolding.
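Here’s a toy sketch of that kind of scaffolding (the command names, thresholds, and stopping heuristic are all invented for illustration):

```python
def safe_drive_command(model_command: str, speed_mps: float, obstacle_dist_m: float) -> str:
    """Sketch of scaffolding around a learned driving model.

    If the car is closing in on an obstacle too fast, a simple
    hand-written rule overrides whatever the learned model suggests.
    The stopping margin and command names are made up.
    """
    stopping_margin_m = 2.0 * speed_mps  # crude, invented margin
    if obstacle_dist_m < stopping_margin_m:
        return "emergency_brake"     # the guardrail wins; the model is ignored
    return model_command             # otherwise, defer to the learned model

print(safe_drive_command("steer_left", speed_mps=15.0, obstacle_dist_m=10.0))  # emergency_brake
print(safe_drive_command("steer_left", speed_mps=5.0, obstacle_dist_m=50.0))   # steer_left
```

Notice that the guardrail itself is classical: it has a spec and hard limits, wrapped around a model that has neither.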
Often, it takes real effort to a. identify failure cases in learning-based systems, b. find the right data to patch the issue, and c. get the right amount of that new data to fix the problem without degrading a model’s performance elsewhere (i.e. you can’t have too much or too little of the new data). If a failure case is sufficiently rare, an engineering team might have to intentionally put the car/robot/model in that specific scenario just to get enough data for that edge case. That is a very conscious decision. And to be clear, these aren’t necessarily objective failure cases. An LLM company like Anthropic would consider their model answering a question about making bioweapons a failure even if the information provided in the response is completely accurate. I should also mention that data curation isn’t always so deliberate: for example, images that don’t have alt text captions are — often if not always — thrown out when training image generation models because you’re trying to learn the relationship between the image and the accompanying text — which means you have to have the accompanying text. This is a massive portion of potential training data, but the data is thrown out in bulk — engineers aren’t making decisions about individual cases.
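As a rough sketch of that bulk, non-deliberate filtering (the field names here are hypothetical):

```python
# Bulk, non-deliberate curation: image-text pairs without a caption are
# simply dropped. The field names are hypothetical.
raw_examples = [
    {"image": "img_001.jpg", "alt_text": "a red stop sign at a crosswalk"},
    {"image": "img_002.jpg", "alt_text": None},  # no caption, so unusable
    {"image": "img_003.jpg", "alt_text": "a bottle of ketchup on a table"},
]

training_pairs = [ex for ex in raw_examples if ex["alt_text"]]
print(len(training_pairs), "of", len(raw_examples), "examples kept")
```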
In Practice
Alright, so that gives us two forms of responsibility/culpability: if an organization intentionally induces a behavior through either technique (i.e. through training data curation or scaffolding), you might hold them accountable for that behavior. On the other hand, if they fail to address a certain case, then you might be able to argue that they were negligent. An example might be not removing copyrighted work from an LLM’s training data or not training a self-driving model on examples where a car in front suddenly brakes.
Those two forms of accountability — for induced behavior and negligence — don’t sound so extraordinary, but the key difference with learning-based systems is the lack of guarantees. Even the most well trained LLMs can be “jailbroken” (i.e. tricked) to say things they’re not supposed to — they might even just do it randomly. You cannot make a determination about a model based on a single instance of a behavior — rather you have to look for patterns in its performance.
Instead of requiring that a model does something, you might require that a model’s training data includes examples of that behavior or that guardrails are in place to override the model and ensure that behavior — or at least ensure that behavior in most cases. And you’d evaluate the learning-based system by seeing how it does on some relevant test cases. These test cases should be as diverse and numerous as possible, and these specific examples should not be used to train the model. For example, if you want a (learning-based) robot to always stop in front of red objects, you would test what it does in front of a stop sign, a bottle of ketchup, an apple, etc. It wouldn’t make sense to test the model on just one example and call it a day because that wouldn’t tell you anything — regardless of whether it passes that specific test. Going back to our analogy: at some point in time, Shaq has made a free throw, and at some point in time, Steph Curry has missed a wide open three. Those cases don’t tell you anything about their general performance. You just cannot get an accurate depiction without looking at aggregate statistics.
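Here’s a toy sketch of that kind of evaluation. The stand-in robot function and its made-up 90% success rate are purely illustrative; in reality you’d run the actual system on each held-out case and look at the aggregate.

```python
import random

def robot_stops_in_front_of(obj: str) -> bool:
    """Stand-in for the learned system under test.

    In reality you'd run the actual robot or model; the randomness here
    (a made-up 90% success rate) just mimics the fact that a learned
    system is never guaranteed to behave a certain way.
    """
    return random.random() < 0.9

# Held-out test cases: deliberately diverse, and never used for training.
red_objects = [
    "stop sign", "bottle of ketchup", "apple", "fire hydrant",
    "red balloon", "brake light", "tomato", "red traffic cone",
]

results = [robot_stops_in_front_of(obj) for obj in red_objects]
pass_rate = sum(results) / len(results)
print(f"stopped in {sum(results)}/{len(results)} cases ({pass_rate:.0%})")
# One pass or fail tells you little; the aggregate rate is the signal.
```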
Conclusion
I hope this provides some useful context for thinking about system behavior and how you’d create and enforce requirements. At a high level, the antiquated view of AI systems, the modern view of classical systems, and the modern view of learning-based systems are not so different. All three have failure modes that can be chipped away at. But the nature of each failure case is different — how they’re fixed and how you’d create requirements are also different. I think that difference matters when thinking about correct behavior and policy — and when thinking about what is and isn’t possible. Sometimes, you just cannot get certain behavior or certain guarantees out of learning-based systems. That might mean that learning-based systems are just not the right tool for the job, which is quite often the case. That’s not a bad thing.