As you know if you teach at a U.S. public school – or even if you just read the June 2013 New York Times feature about it – a consortium of state boards of education recently decided that we should have uniform standards about what we teach our kids. Hence, the alliteratively named “Common Core curriculum” is here. While you probably know that this curriculum is going to bring with it an increase in standardized testing, what you might not know is that the curriculum is also going to bring with it a strong push to have computers grade student writing. As I write this, about a half-dozen companies are vying for government contracts to do just that. Each of them has brought together a team of statisticians, linguists and computer programmers to produce the best possible software capable of automatically grading essays.
The reason for the push is both grim and obvious: money. Now that our schools are going to have more standardized tests, there are going to be more student paragraphs that need grading. Grading written work is laborious and time-consuming and, from a school board’s point of view, expensive. What computers offer is the ability to do this task faster and – once they are up and running – more cheaply. To be fair to the school boards, assuming the computer programs work, doing this task more efficiently could yield some benefits. It would lift a significant burden from the harried, overworked and underappreciated group of teachers and grad students we currently pay to grade standardized tests. But I suspect that, for most people, the thought of a computer “reading” essays is reflexively anxiety-provoking. It brings out the inner Luddite. Are we really supposed to believe that a machine can do just as good a job as a human being at a task like reading?
“Text analysis,” as computer programmers call it, is the process by which computers are able to grade essays. Text analysis employs techniques called machine learning to accomplish this feat – in particular, it uses supervised (vs. unsupervised) machine learning. The difference is that, in supervised machine learning, the computer’s grades are originally derived from – and then cross-checked against – the grades given by actual human beings. The unsupervised process, by contrast, tries to skip the human beings altogether.
The way supervised machine learning basically works is this: The computer treats the student’s essay as though it were just some random assemblage of words. Indeed, the jargon term for the main analytical technique here is actually “Bag of Words” (the resonance of which is simultaneously kind of insulting and weirdly reassuring). The program then measures and counts some things about the words that the programmer thinks are likely to be correlated with good writing. For example, how long is the average word? How many words are in the average sentence? How accurate are the quotations from the source text, if any? Did the writer remember to put a punctuation mark at the end of every sentence? How long is the essay?
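To make the measuring and counting concrete, here is a minimal sketch in Python of the kind of surface features described above. The function name `surface_features`, the feature names and the crude rules for splitting sentences and words are my own assumptions for illustration; no vendor's actual software is being reproduced here, and quotation accuracy is left out because it would require the source text.

```python
import re


def surface_features(essay: str) -> dict:
    """Count a few surface properties of an essay (illustrative only)."""
    # Crude sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", essay.strip()) if s]
    # Crude word split: runs of letters and apostrophes
    words = re.findall(r"[A-Za-z']+", essay)

    n_words = len(words)
    n_sentences = len(sentences)
    # Sentences that end with a punctuation mark
    punctuated = sum(1 for s in sentences if s[-1] in ".!?")

    return {
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / n_sentences if n_sentences else 0.0,
        "share_punctuated": punctuated / n_sentences if n_sentences else 0.0,
        "essay_length": n_words,
    }


print(surface_features("The cat sat. A dog ran!"))
```

Notice that nothing in this function looks at what the essay says; word order and meaning never enter into it.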
The programmer then “trains” the computer by telling it the grades assigned by human graders to a “training set” of essays. The computer compares – mathematically – the various things it has measured to the grades assigned to the essays in the training set. “Johnny’s essay had an average word length of 5 letters and an average sentence length of 20 words. The human told me that Johnny gets a B+. Now I know that much about essays that get a B+.” From there, given an average word length, sentence length, word frequency and so on, the computer is able to calculate the probability that a given student essay would receive a particular grade. When it encounters a new essay, it can take the things it knows how to measure, and – based on what it learned from the grades assigned to the training essays – simply assign the most probable grade.
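The train-then-predict loop can be sketched in a few lines of Python. This is only one possible way to do it: I am assuming a small Gaussian naive Bayes classifier as a stand-in, since the companies' real models are not described here, and the six "essays" and their grades are invented numbers. Each essay is reduced to two measurements, average word length and average sentence length.

```python
import math
from collections import defaultdict


def train(examples):
    """Learn, for each grade, a prior plus the mean and variance of each feature."""
    by_grade = defaultdict(list)
    for features, grade in examples:
        by_grade[grade].append(features)

    model = {}
    for grade, rows in by_grade.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small constant keeps a variance from collapsing to zero
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-3
            for col, m in zip(columns, means)
        ]
        model[grade] = (math.log(n / len(examples)), means, variances)
    return model


def log_score(features, log_prior, means, variances):
    """Log of prior times Gaussian likelihood of each measured feature."""
    score = log_prior
    for x, m, v in zip(features, means, variances):
        score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
    return score


def predict(model, features):
    """Assign the most probable grade given what was learned from the training set."""
    return max(model, key=lambda grade: log_score(features, *model[grade]))


# Invented training set: (avg word length, avg sentence length) -> human grade
training_set = [
    ((5.0, 20.0), "B+"), ((5.2, 21.0), "B+"), ((4.8, 19.0), "B+"),
    ((4.0, 10.0), "C"), ((4.1, 11.0), "C"), ((3.9, 9.0), "C"),
]
model = train(training_set)
print(predict(model, (5.1, 20.5)))  # B+
```

Everything the model "knows" comes from the human-assigned grades in `training_set`; change those grades and the predictions change with them.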
It turns out that this works surprisingly well. Shockingly well. What we might think of as totally surface-level or accidental features – like having more words per sentence – are actually correlated very strongly with earning better grades. Statistical analyses, at least, tell us that machine-learning techniques perform just as well as human beings – that is, their grades for new essays are the same or similar a huge majority of the time. Plus there are some ways in which the computers might actually be better. The people who presently do essay grading for tests like the AP English exam work long hours and have to grade essay after essay. Unlike computers, they get tired and irritable and bored. Also unlike computers, they come pre-equipped with a whole bunch of biases. These little nuisances and irritations, we might hope, will wash away if we instead let a computer sort unfeelingly through the bag of words.
But, alas, it isn’t so. Since the process by which the computer “learns” is anchored to grades assigned by human beings – the training set teaches the computer what kinds of grades we tend to give to what kinds of essays – the tiresome, unsexy little things that make us imperfect are built right into the system. For instance, if the graders who grade the training set tend to strongly penalize nonstandard uses of English – including nonstandard uses more common among racial minorities – so too will the machine. The computer will operationalize, and then perpetually reinstate, the botherations and biases we feed it. The best strategy, thus, will be to use mechanized, look-alike writing, which will be tautologically defined as good writing because it is associated with receipt of a good grade.
One obvious problem is that if you know what the machine is measuring, it is easy to trick it. You can feed in an “essay” that is actually a bag of words (or very nearly so), and if those words are SAT-vocab-builders arranged in long sentences with punctuation marks at the end, the computer will give you a good grade. The standard automated-essay-scoring-industry response to this criticism is that anyone smart enough to figure out how to trick the algorithm probably deserves a good grade anyway. But this reply smacks of disingenuousness, since it’s obvious that the grade doesn’t reflect what it’s “supposed to” – namely, the ability to write a reasonably high-quality essay on some more or less arbitrary topic.
Another slightly less obvious problem is that, since the computer is just measuring and counting, it can’t actually give you meaningful feedback or criticism. It has no idea what big-picture themes you were exploring, what your tone was, or even what you actually said. It just tries to approximate the score you should get – that is, it tries to put you into a little box. While this kind of box-sorting is fine for literal grading, it doesn’t really help with teaching you to be a better writer.
There are other problems, too. A former professor at MIT named Les Perelman has pointed out that the way the automated-essay-grading companies are analyzing their software’s performance is unfairly biased toward the machine. Perelman’s paper, although eye-glazingly dense in data analysis, notes that while a human grader’s reliability is checked by comparing his or her grades to someone else’s, the machine’s reliability is checked against a resolved grade, which reflects the judgments of multiple human readers. But the standard statistical measure of agreement, called Cohen’s Kappa, is – as Perelman puts it – “meant to compare the scores of two autonomous readers, not a reader score and an artificially resolved score.”
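For readers unfamiliar with the statistic, Cohen’s Kappa takes the observed rate of agreement between two raters, p_o, subtracts the agreement they would reach by chance, p_e, and rescales: kappa = (p_o – p_e) / (1 – p_e). Below is a minimal Python sketch of the plain, unweighted version for two raters; the function name and the sample scores are invented for illustration, and scoring studies may well use weighted variants instead.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same essays."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty score lists")

    n = len(rater_a)
    # p_o: share of essays on which the two raters gave the same score
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # p_e: agreement expected if each rater assigned scores at random,
    # keeping his or her own overall frequency of each score
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[s] * counts_b[s] for s in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one and the same score throughout
    return (observed - expected) / (1 - expected)


# Invented scores from two independent readers on four essays
print(cohens_kappa([4, 4, 3, 3], [4, 4, 3, 4]))  # 0.5
```

The formula treats both inputs as the scores of independent readers, which is the assumption Perelman says is violated when one of the two lists is a resolved score.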
There is a deeper issue, though, to which I suspect many people’s thoughts will have jumped immediately: the idea that reading and writing are uniquely human, and that our ability to do these things is part of what separates us from machines. Show me the most robust correlations in the world, and still I will show you – the horror! the horror! – a robot doing something I had firmly believed only a human could do.
While it may be tempting to dismiss these reactionary worries as empirically ill-informed, I think we should resist. There is, at the end of the day, something soul-shakingly serious about these feelings. The grandiose concepts they invoke – like “the place of humankind in nature” – are banalified and overused now, but I think it’s worth taking them seriously. If computers can do things that we thought only human beings could do, can we continue to think of ourselves as unique? If computers can carry out operations that we thought only the human mind could carry out, are we forced to think of our minds as essentially mechanical? I know these are heady questions, but I don’t mean to ask them as an invitation to fatuous navel-gazing. I just mean that when I honestly contemplate computers reading essays, they just sprout up, as insistent as nettles.
Here, then, is my attempt at an answer: Human beings possess an extraordinary set of intellectual capacities that are not replicable by any combination of computational techniques, no matter how sophisticated. These capacities are those that we traditionally think of as belonging to – and representing the achievements of – (no pun) the humanities. What computers will never do, that is, is create original works of art as emotionally rich, thematically evocative or aesthetically stunning as those created by human beings. There is, in other words, no set of computational techniques capable of mirroring the intelligence we use in creating original artistic works, especially those that reach our deepest emotional depths.
Some people may hasten to respond that apparently I am ignorant of what is already out there. There is, for instance, a program written by David Cope, a music professor at UC Santa Cruz, capable of engineering new works of classical music that sound just like those by, say, J. S. Bach. Indeed, Cope’s program is so artful in its imitations that trained classical musicians have mistaken its compositions for works by the laureled master himself. Cope’s software treats all common features of two musical works as indicators of “style,” measured along several dimensions, including rhythm, melody and harmony. The program then uses that information – plus certain randomizing and recombining functions – to create new compositions. Give it a database of any classical composer’s works, and it can create a pretty convincing mimesis.
In addition, there are programs that can write books. For instance, some algorithms can generate new novels in certain well-established and trope-defined genres like the whodunit or Harlequin romance. These programs, like Cope’s software, find common features of their target genre – character traits, mytharcs, sentence structures – and then recombine them. There is even a program developed by Russian computer scientists that has written its own take on “Anna Karenina” in the style of Haruki Murakami. After performing an extensive analysis of data about each of their books, it produced a novel called – I am not kidding – “True Love.”
If there are several things out there today that look an awful lot like a computer writing an interesting novel or composing a beautiful piece of music, doesn’t that suggest that, someday, a computer might succeed in creating truly moving artwork? Isn’t success, then, really just a matter of highly subjective artistic sensibilities? Indeed, isn’t this whole reaction against artificial intelligence (AI) just a rehash of the sniveling, woeful, “our culture-has-reached-its-nadir” nonsense that holds that the true humanities are beyond the comprehension of the plebes? The same “sweetness and light” nonsense that Matthew Arnold was peddling in 1869, when he argued that our culture would, by neglecting the humanities, “fall into our common fault of overvaluing machinery?”
No. And no for a simple reason. None of these programs can do the thing I’m talking about. In other words, they are, all of them, cheating. Not because there are not enormous intellectual challenges in getting a computer to imitate a great artist. But simply because miming another’s style is not the same thing as original, spontaneous creation. Sorry to be harsh, but it’s true: It is totally obvious that, given enough data about Bach’s piano concertos or Murakami’s literary idiosyncrasies, it is possible to manufacture a convincing imitation. But this is just categorically not the same kind of intelligence as that required to create artistic works of one’s own.
What AI is actually using is data analysis, modeling and measurement, which are, of course, good tools for mimicking works that already exist. But it is only by redefining the goal that this counts as success. There is another domain of human cognition, a creative domain, that draws on a different set of capacities altogether – intuition, aesthetic judgment, emotional awareness and self-expression. Spontaneous creation requires these capacities; mimicry through data-crunching can dispense with them. Show me a computer that engages them.
By its very nature as a pre-programmed device, a computer needs a human being to interpret its acts, which are themselves structured by its human creators. A computer has no long-shuttered pains, no treasured memories, no unhealed heartache, no silly childhood fondnesses, no snarled complexes about its parents, no sexual ecstasy – in short, nothing worth writing or painting or singing about at all. To do as we do, the computer would be forced to approximate all of the messiest parts of human experience – what Freud called the unconscious – through some mind-bogglingly complex amalgam of methods, the functional equivalent of living its own human life. If such a task is not actually impossible, it is so monstrously difficult that it might as well be.
So what? What does this have to do with automated essay grading or standardized testing or the fate of children in the U.S. public education system?
If you’re persuaded by what I’ve said about the limitations of AI, it follows that, as long as automated essay grading is around, the trend toward the mechanization of student writing is not going to change. The present state of automated essay grading is not just some stopgap measure until we can get better robots. It contains, subject to certain tweaks, all the essential elements of the full-blown future itself. In that future, not even the simulacrum of human responsiveness will be available on many of our most important assessments of writing. The apex of the academic year will be a test of writing that no human being will ever read, care about or feel anything for.
Thus, our culture will stop engaging with students on those very aspects of the humanities that make them worth studying in the first place. We are going to end up with a system that dispenses rewards in a way that is indifferent to – and divorced from – the most alluring parts of the humanities, those creative capacities that they let us engage. If our instruction in the humanities necessitates ignoring these abilities, then it is my opinion that there no longer is much point to teaching the humanities at all, and we should end the charade. In other words, if this kind of mechanized, standardized-test-friendly drivel is all we can offer our children as “the humanities,” then who cares about the humanities?
Once the use of automated essay grading becomes common knowledge, the implicit message will be hard to miss. For any self-aware, warm-blooded American teenager, the conclusion will be all but inescapable: Nobody cares what you have to say. It could be brilliant and moving; it could be word-salad or utter balderdash; it really doesn’t matter. Content, feeling, creativity, thematic depth – none of it matters. Today’s students will recognize this; they will react to it; and it will inform who they grow up to be. Indeed, I confess that if I were a teenager, my response would be the same as theirs – the selfsame response that we tend to associate with (and dismiss as just) teen angst. What is the point, after all, in being rewarded by a system that doesn’t care who you are? If no one is going to read the essays, we might as well rip them up.
But in truth, what is to be done? After all, don’t we need guidelines about what our kids should be learning in school, especially if we’re going to entrust them to it many hours a day? Are we really to believe that the answer is to let them do some wishy-washy, personal-feelings writing or painting and hope it will teach them the skills they need? What part of that is going to be useful for getting a job?
But this is precisely the problem. We have become so veritably obsessed with the ideas of performance and achievement and rank that we have let ourselves completely lose sight of the things we cared about in the first place. We try to subsume the qualifications for everything – even disciplines where right and wrong are not only subjective, they are immaterial – under neat, mechanized, formal criteria. Have we forgotten the feeling of naïve, impractical fascination? Have we forgotten that we, too, once were interested in things just because they grabbed our attention? Trying ferociously to kindle that spark with the needs of the “21st-century job market” or anxieties about GPAs is, of course, only going to extinguish it.
The deepest reason to get rid of automated essay grading is not that the statistical correlations aren’t good enough yet (this is fixable) or that one can cleverly trick the computer (this is true, but not the root issue). The reason to get rid of automated essay grading is that the whole point of doing something like writing an essay is to learn to engage on a level that machines cannot participate in or really appreciate. It’s to use the other part of your mind in an effort to communicate with other people. That, the profound and joyous sense of recognition that comes from communication, is the thing worth teaching, and it is the thing worth learning. We should put aside the pretensions of objectivity and practicality and get back to the part that really matters. It is time for us to slacken our grip.