It is becoming easier for a machine to compose a piece of art, be it a painting, a novel or music; however, this still requires an immense amount of data. Machines lack human-like features: they have no personalised perception, cannot think outside the box and can only make sense of spoon-fed data. This test has been designed to pick out near-human behaviour, namely the ability to invent original content on the spot by drawing on personal experiences and emotions. The challenge takes the form of a “story bag” approach. This principle can be applied in the virtual space to construct a framework in which either a human or an artificial intelligence can invent stories for the purpose of our test.
Design
The interrogator picks random objects from a large and diverse dataset, each accompanied by specific descriptive attributes (nouns characterised by adjectives). For example, the selected items could include a “shiny coffee machine” and an “old mountaineer”, which would then form the basis on which the intelligence develops a story.
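The selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word lists, the function name `draw_story_bag` and the fixed seed are hypothetical, and a real run would draw from a far larger and more diverse dataset.

```python
import random

# Hypothetical miniature dataset; the test itself assumes a large, diverse one.
ADJECTIVES = ["shiny", "old", "fragile", "forgotten", "restless"]
NOUNS = ["coffee machine", "mountaineer", "lighthouse", "violin", "astronaut"]


def draw_story_bag(n_items, rng):
    """Draw n_items distinct nouns and pair each with a random adjective."""
    nouns = rng.sample(NOUNS, n_items)
    return [f"{rng.choice(ADJECTIVES)} {noun}" for noun in nouns]


rng = random.Random(42)  # fixed seed (an assumption) so a session can be replayed
for item in draw_story_bag(2, rng):
    print(item)
```

Passing the random generator in explicitly keeps the draw reproducible, which may help a committee audit which items a given entrant received.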
The interrogator may choose to reveal the attributes all at once or piece by piece, in an interactive manner. During the conversation, both participants can ask questions, respond, make remarks or interrupt. Each of the selected elements can be associated with the story’s characters, setting or plot. The story must meet the following requirements:
- Set the scene of the story beyond the laws of nature/real life.
- Show empathy between at least two characters of the story.
- Involve at least two opposing emotions, such as fear and forgiveness.
Once the intelligence is ready, the story can be shared in an interactive session, giving the interrogator the opportunity to further embellish, expand and complement the story to heighten the enjoyment of the session.
Next, the interrogator asks the intelligence to compose a dummy story with the same descriptive attributes, but this story must persuade the reader that the intelligence is a computer, under the following criteria:
- No emotional interaction or empathy should be involved.
- The story must obey the laws of nature/real life.
At the final stage, the tester will follow the evaluation process and decide whether the testee passes the test.
Rationale
Emotions and intelligence are correlated, and when a machine can emulate human behaviour, it can reason and show emotions. Thorndike (1920) described emotional intelligence as one’s ability to perceive, understand, manage, and express emotion within oneself and in dealing with others [1]. Furthermore, Salovey and Mayer (1990) defined five domains critical to emotional intelligence [2].
The testee passes the test when it can respond to changes in the environment, be proactive and be sociable (Balakrishnan, 2019) [3]. The proposed test is expected to capture these properties, as story creation is a complex task that requires a deep understanding of the natural world, relationships and emotions, a plot that stays consistent throughout the narrative, and an ability to communicate effectively.
Evaluation
The human-like story that the testee has created, along with the interaction between the parties, will be submitted to a panel of human individuals (the committee), each of whom will provide a likelihood score from 1 to 10 reflecting how confident they are that the story is an original piece written by a human (1 = a computer is at the other end, 10 = a human is at the other end).
Correspondingly, the committee will evaluate the computer-like story of the entrant with the same likelihood scale. Both stories shall be evaluated against:
- Characters – are they believable, authentic, multi-dimensional and consistent?
- Setting – is the setting realistic, does it suit the plot and characters?
- Plot – are there complex story arcs, do they intertwine, does it make best use of the characters and setting?
- Conflict & resolution – does it fit in with the characters involved, does it drive the plot, and is the conflict resolved in an authentic way?
- Themes – does the plot follow the themes in an artistic and intelligent manner?
- Morals – is there a meaning behind the story and how well developed and subtle is this?
- Symbolism – are the objects used in a symbolic or basic manner?
- Point of view – did the perspective of the narrative fit in with the plot and symbolism portrayed?
- Pulling it all together – did the elements come together in a complete package that was inventive, coherent and entertaining, and that delivered a satisfying story?
The final grade will be the mean score given by the committee members across both stories, with a higher grade indicating that the intelligence is more likely to be human.
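As a rough illustration, the final grade could be computed as below. The function names and the sample scores are made up, and the sketch assumes that the scores for the two stories are pooled with equal weight, which the text does not spell out.

```python
from statistics import mean


def validate(scores):
    """Check that every committee score lies on the 1 to 10 scale."""
    for score in scores:
        if not 1 <= score <= 10:
            raise ValueError(f"score {score} is outside the 1-10 scale")
    return list(scores)


def final_grade(human_like_scores, computer_like_scores):
    """Mean committee score pooled over both stories (assumed equal weighting)."""
    pooled = validate(human_like_scores) + validate(computer_like_scores)
    return mean(pooled)


# Made-up scores from a three-member committee, for illustration only.
human_like = [8, 7, 9]
computer_like = [3, 4, 5]
print(final_grade(human_like, computer_like))
```

When every member scores both stories, the pooled mean equals the average of the two per-story means, so either reading gives the same grade.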
Criteria
A benchmark is created by having a panel of human candidates submit stories. Their average performance can be used as a baseline for the human-like story. The following calculations will be made:
- The mean value of all the entrants taking the test, as an average performance benchmark.
- The standard deviation of the final grade (section Evaluation) of each entrant from the average performance benchmark.
- The standard deviation of the final grade (section Evaluation) of each entrant from the average benchmark of a human-like story which was calculated once as a constant benchmark.
The committee will compare the above metrics among all entrants and decide whether an entrant has passed the test. A winning entrant must achieve a low deviation from the human-like benchmark and a final score above a critical value derived from the deviation among the entrants.
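One possible way to turn these metrics into a decision is sketched below. Several choices here are assumptions rather than part of the test definition: each entrant's deviation is read as the signed difference between its final grade and a benchmark, the critical value is taken as the cohort mean plus one population standard deviation, and `max_human_gap` is a hypothetical tolerance for closeness to the human benchmark.

```python
from statistics import mean, pstdev


def assess(entrant_grades, human_benchmark, max_human_gap=1.0):
    """Compare each entrant's final grade with the cohort and human benchmarks."""
    grades = list(entrant_grades.values())
    cohort_mean = mean(grades)      # average performance benchmark
    cohort_spread = pstdev(grades)  # spread of final grades among the entrants
    # Assumption: the critical value sits one standard deviation above the mean.
    critical_value = cohort_mean + cohort_spread
    report = {}
    for name, grade in entrant_grades.items():
        gap_from_human = grade - human_benchmark
        report[name] = {
            "gap_from_cohort": grade - cohort_mean,
            "gap_from_human": gap_from_human,
            "passed": abs(gap_from_human) <= max_human_gap
            and grade > critical_value,
        }
    return report


# Made-up final grades and human benchmark, for illustration only.
grades = {"entrant_a": 7.5, "entrant_b": 4.0, "entrant_c": 5.0}
for name, result in assess(grades, human_benchmark=7.0).items():
    print(name, result)
```

The committee remains free to choose a different critical value or tolerance; the sketch only shows how the three calculations fit together.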
Discussion
The challenging part of this test is for a story to be original and coherent throughout its entire narrative. True creativity is hard to achieve. Some games already use AI to write stories, such as AI Dungeon (https://play.aidungeon.io/), in which the AI starts with an initial context, e.g., Zombie Apocalypse or Fantasy, and then tries to develop a story based on user input. However, the more the user interacts with the agent, the harder it is for the agent to stay consistent with its earlier narrative. We expect our test entrants to struggle in a similar manner.
Long-term dependencies are something AI agents have struggled with, especially in the fields of natural language processing (NLP) [6] and natural language generation (NLG) [7]. This has fostered the engineering of specific architectures such as LSTM networks [8], and we have seen significant progress with new inventions like the transformer architecture [9] in the GPT-3 [10] model (and soon GPT-4 [11]). Some content produced by the GPT-3 model has already proven very convincing, and we can reasonably expect new techniques to be discovered that will outperform even the transformer architecture.
Citations
[1] Thorndike, E.L. 1920. Intelligence and its use. Harper’s Magazine, 140, 227-235.
[2] Salovey, P., Mayer, J.D. 1990. Emotional intelligence. Imagination, Cognition, and Personality, 9, 185-211.
[3] S. Balakrishnan, “An Overview of Agent Based Intelligent Systems and Its Tools”, CSI Communications magazine, Volume No. 42, Issue No. 10, January 2019, pp. 15-17.
[4] Allison L. Coates, Henry S. Baird, Richard J. Fateman, 2001, “Pessimal Print: A Reverse Turing Test”, IEEE.
[5] Mark O. Riedl, December 2014, “The Lovelace 2.0 Test of Artificial Creativity and Intelligence”, Georgia Institute of Technology.
[6] Natural language processing [Online] Available at https://en.wikipedia.org/wiki/Natural_language_processing (Accessed on 02-06-2021)
[7] Natural-language generation [Online] Available at https://en.wikipedia.org/wiki/Natural-language_generation (Accessed on 02-06-2021)
[8] Long short-term memory [Online] Available at https://en.wikipedia.org/wiki/Long_short-term_memory (Accessed on 02-06-2021)
[9] Transformer (machine learning model) [Online] Available at https://en.wikipedia.org/wiki/Transformer_(machine_learning_model) (Accessed on 02-06-2021)
[10] GPT-3 [Online] Available at https://en.wikipedia.org/wiki/GPT-3 (Accessed on 02-06-2021)
[11] GPT-4 [Online] Available at http://www.gpt-4.com/ (Accessed on 02-06-2021)

