Question Generation is a subfield of Text Generation that deals with the automatic generation of valid and fluent natural language questions based on given passages and target answers.

Evaluation refers to the process of determining the performance or accuracy of a model or algorithm. It involves comparing the model's predictions or outputs against a set of pre-defined correct results (ground truth or gold labels). The choice of evaluation method usually depends on the specific task or problem at hand.

Natural Language Processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of 'understanding' the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Text Generation is a subfield of Natural Language Processing (NLP) which attempts to teach machines to automatically generate texts that are indistinguishable from texts written by humans.

<h1>Evaluating for Diversity in Question Generation over Text</h1>
<p>arXiv:2008.07291v1 [cs.CL] 17 Aug 2020</p>
<h2>Abstract</h2>
<p>Generating diverse and relevant questions over text is a task with widespread applications. We argue that commonly used evaluation metrics such as BLEU and METEOR are not suitable for this task due to the inherent diversity of reference questions, and propose a scheme for extending conventional metrics to reflect diversity. We furthermore propose a variational encoder-decoder model for this task. We show through automatic and human evaluation that our variational model improves diversity without loss of quality, and demonstrate how our evaluation scheme reflects this improvement.</p>
<h2>1 Introduction</h2>
<p>Question generation has widespread applications in online education (Lindberg et al., 2013), search (Rothe et al., 2014), and question answering (Yang et al., 2017; Lewis and Fan, 2019). Generating a single question per context can be inadequate as questions are naturally diverse in aspect, answer, and phrasing. Following Li et al. (2016), the first two correspond to semantic diversity, while the third corresponds to lexical diversity. Evaluation in question generation uses metrics including BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE (Lin, 2004). Common to these metrics is an implicit assumption that references are paraphrases. In question generation this assumption does not hold, as multiple diverse questions may be equally relevant for a given context. To measure the ability of a system to produce diverse sets of questions, a new evaluation metric is needed. We propose a function composition framework consisting of meta-metrics relying on existing evaluation metrics. As we will show in Section 3, our framework generalizes F1, a widely used measure in information retrieval and multivariate analysis.</p>
<p>Context: Amazon.com, Inc. is an American electronic commerce and cloud computing company based in Seattle, Washington, founded by Jeff Bezos on July 5, 1994.</p>
<p>Question 1: Which Seattle-based company was founded in 1994? Answer: Amazon.com</p>
<p>Question 2: What is the name of the company that Jeff Bezos founded? Answer: Amazon.com</p>
<p>Question 3: When was Amazon.com founded? Answer: July 5, 1994</p>
<p>Recent papers on text generation over images have sought to increase diversity through generative modeling such as variational autoencoders (Jain et al., 2017). To demonstrate the usefulness of our F-like metrics, we build a conditional variational autoencoder for text-to-text generation inspired by these ideas and based on Zhang et al. (2016). We achieve significant improvements in diversity without performance loss, in terms of both automatic and human evaluation. We show that our F-like approach better captures the improvement in diversity offered by this model, and that the breakdown into “precision” and “recall” enabled by the meta-metric helps illustrate the strengths and weaknesses of different systems.</p>
<h2>2 Related Work</h2>
<p>Early work on natural language question generation focuses primarily on rule-based models (Heilman and Smith, 2010; Agarwal and Mannem, 2011) and template-based slot filling (Lindberg et al., 2013). End-to-end neural models were introduced by Du et al. (2017), and extended by Chan and Fan (2019) and Bao et al. (2020). Generating diverse questions is discussed in a series of recent papers (Zhou et al., 2017; Harrison and Walker, 2018; Song et al., 2018; Yao et al., 2018; Shen et al., 2020) attempting to improve semantic diversity by conditioning on (potential) answer positions or question types. While showing promising results, such prior information may not always be practically available. Sultan et al.
(2020) recently applied nucleus sampling (Holtzman et al., 2020) to diversify question generation models, demonstrating improved performance on a downstream question answering task. Diverse text generation has been studied for other tasks, including image question generation (Jain et al., 2017), conversation modeling (Li et al., 2016), machine translation (Zhang et al., 2016; Schulz et al., 2018), and image captioning (Vijayakumar et al., 2016; Pu et al., 2016; Dai et al., 2017). Models typically rely either on conditional variational autoencoders (CVAE) (Kingma and Welling, 2013; Sohn et al., 2015) or conditional generative adversarial networks (CGAN) (Mirza and Osindero, 2014), and evaluation has relied on BLEU, ROUGE, and METEOR. Several schemes for evaluating diversity in image captioning were proposed by Alihosseini et al. (2019). Their proposals rely either on statistical divergence between language modeling of the generated and reference sets, or on the Jaccard index computed at the n-gram level. Dhingra et al. (2019) propose another metric for scoring systems where generated sentences that overlap with items on the source side rather than the target side should also be scored highly.</p>
<h2>3 Evaluating for Diversity</h2>
<p>Conventional metrics evaluate the correctness of proposed questions in relation to a set of reference questions; they do not, however, measure the degree to which the proposed questions cover the set of reference questions. Consider the sentence “Germany won the 2014 world cup”, paired with two reference questions “Who won the 2014 world cup?” and “Which event did Germany win in 2014?”. Using traditional metrics, a system that always generates “Who won the 2014 world cup?” and a system that alternates between generating the two would be given equal, perfect scores, when in fact the first system has only learned half the task. Moreover, systems that generate a sentence using parts of both questions may wrongly be scored highly (see Appendix A for an example). In this work, we propose a framework to extend commonly used scoring functions to account for coverage over reference questions. Given a set of reference questions $R \subset \Omega$, a set of predicted questions $P \subset \Omega$, and a scoring function $s : \Omega \times \Omega \to \mathbb{R}$, we propose the following two functions for comparing $P$ and $R$: $$u(P, R) = \frac{1}{|P|} \sum_{p \in P} \max_{r \in R} s(p, r), \tag{1}$$ $$v(P, R) = \frac{1}{|R|} \sum_{r \in R} \max_{p \in P} s(p, r). \tag{2}$$ To compute the function $u$, we identify for each predicted question the best match in the references w.r.t. the scoring function. We then compute how “close” the predictions are to the references by summing up scores between each predicted question and its respective best match, normalized by the number of predictions. The function $v$ is computed analogously in Eq. (2). We combine $u$ and $v$ with their harmonic mean. This leads to an overall measure to compare $P$ and $R$, $$F(P, R) = \frac{2\, u(P, R)\, v(P, R)}{u(P, R) + v(P, R)}. \tag{3}$$ In the special case of a binary scoring function $s : \Omega \times \Omega \to \{0, 1\}$, where 1 is given to exact matches, $u$ and $v$ are identical to conventional precision and recall, respectively, and Eq. (3) is equal to $F_1$. Therefore, we can interpret $u$ and $v$ as generalized precision and recall, respectively. To the best of our knowledge, our construction of the $F$ function is new. The “best match” idea used in comparing $P$ and $R$, i.e., examining the optimal score an item in the set can attain w.r.t. another set, has been applied in local community detection in network analysis (Clauset, 2005; Lancichinetti et al., 2009) and behavior research in social networks (Adali et al., 2010). In machine learning, the work by Goldberg et al. (2010) on clustering analysis bears the closest resemblance to this idea. However, their assumption of a symmetric $s$-function does not hold for many NLP applications.</p>
<h2>4 Variational Question Generation</h2>
<p>Our approach extends the encoder-decoder model proposed by Du et al. (2017) with a simple latent variable following Zhang et al. (2016).
Given a training corpus of context-question pairs $D = \{(X_1, Y_1), (X_2, Y_2), \dots, (X_D, Y_D)\}$ such that $X_i = \{x_{i1}, \dots, x_{ij}\}$ and $Y_i = \{y_{i1}, \dots, y_{ik}\}$, the objective is to minimize the negative log-likelihood of the training data. We first encode each token in the context paragraph using either GloVe embeddings (Pennington et al., 2014) or ELMo embeddings (Peters et al., 2018). The entire sentence is then encoded through a 600-dimensional BiLSTM, giving forwards, backwards, and concatenated representations $\overrightarrow{b}_t$, $\overleftarrow{b}_t$, $b_t = \langle \overrightarrow{b}_t, \overleftarrow{b}_t \rangle$ for each timestep $t$. We construct context representations for each decoding step using bilinear attention as presented by Luong et al. (2015). That is, given a query vector $k_t$ corresponding to decoding step $t$: We compute query vectors using another BiLSTM. The input to each step $t$ consists of the concatenation of the GloVe embedding $e_t$ of the target-side token generated at the previous timestep $t - 1$, and the previous context vector $c_{t-1}$. That is We compute $c_0$ as $\langle \overrightarrow{b}_{|X|}, \overleftarrow{b}_1 \rangle$. Finally, the distribution over target-side tokens at decoder step $t$ is computed as At training time, we rely on teacher forcing to provide the token to be embedded in $e_t$. That is, we use the gold token $y_{t-1}$. At inference time, we use the generated token $\hat{y}_{t-1}$ at the previous timestep. We decode using beam search with a beam size of three, following Du et al. (2017). Following Zhang et al. (2016) and Jain et al. (2017), we introduce a latent variable $z \in \mathbb{R}^d$ conditional on $x$ in the decoder to model the underlying semantic space. We redefine the query vectors used for attention, Eq. (7), as We introduce an approximation $q_\phi(z \mid x, y)$ for the intractable true posterior $p(z \mid x, y)$.
Instead of the true log-likelihood, we optimize the evidence lower bound (ELBO), a key idea underpinning variational autoencoders (Kingma and Welling, 2013; Sohn et al., 2015): Following Zhang et al. (2016), we define $q_\phi(z \mid x, y)$ through a neural network with parameters $\phi$ as a Gaussian of the form We obtain a representation of the context paragraph used in the posterior by mean-pooling over the context encoder states. Similarly, we represent the target question using mean-pooled ELMo vectors (Peters et al., 2018). That is, We then obtain the Gaussian parameters using a neural network</p>
<h2>5 Experiments and Evaluations</h2>
<p>We compare our model to the deterministic baseline from Du et al. (2017) on the SQuAD dataset (Rajpurkar et al., 2016). We demonstrate comparable performance using conventional metrics, and better performance on our proposed metrics. We further corroborate this finding through human evaluation. Inspired by Vijayakumar et al. (2016), we also evaluate a random beam selection (RBS) heuristic, where we induce diversity by sampling from the top $b$ beams. Details of hyperparameters are given in Appendix B. Table 1 reports the results where METEOR is chosen as the $s$-function. METEOR has been shown to correlate well with human judgments (Banerjee and Lavie, 2005). In the interest of space, results of BLEU and ROUGE are given in Appendix C. For METEOR, we average first per context $c$ and subsequently over the dataset $D$ to prevent overweighting contexts with more reference questions: We report results of the variational model using 25-, 50-, and 100-dimensional latent variables, along with the random beam selection extension of two systems: the two-layer baseline and the 100-dimensional CVAE. The 100-dimensional version of our variational model performs favorably, especially in terms of F-metrics. Through our recall score, we can identify exactly where the improvement occurs: our models show larger improvement on recall, i.e., they match more references. Increasing the dimensionality of the latent variable $z$ strictly improves the recall of our models, and consequently the F-metrics. This supports our intuition that the degree of diversity the model can express is controlled by the amount of information that is encoded in the latent variable. The RBS heuristic adds a small but rather consistent gain to both the baseline and the CVAE model. Example outputs of our system can be found in Appendix D. To confirm our findings, we use Amazon Mechanical Turk to conduct human evaluation. We presented the annotators with a context paragraph and questions generated by the baseline and our CVAE model. We tasked them to rate each example on three criteria: fluency, relevancy, and diversity. Table 2 shows that the two systems, with their RBS extensions, performed comparably in terms of fluency and relevancy, whereas the CVAE demonstrated a significant improvement in diversity. (With two-sample t-tests, p < 0.05 for both CVAE vs. baseline and CVAE+RBS vs. baseline+RBS. More details are in the annotation file.) The finding is consistent with the automatic evaluation.</p>
<h2>6 Conclusion</h2>
<p>We have introduced a framework to extend existing evaluation metrics into F1-like scoring functions explicitly rewarding diversity and enabling detailed comparison in terms of precision and recall. Furthermore, we have presented the first variational autoencoder for question generation over general text. Our model shows comparable results to the baseline in terms of conventional evaluation metrics, while producing significantly more diverse questions according to human and automatic evaluation.
Our experiments suggest that the modeling of diversity is an important aspect of question generation systems, both to generate engaging questions and to better model the inherently diverse training inputs, and our development of a family of metrics suitable for evaluating models in this setting represents a step in that direction.</p>
<h2>References</h2>
<p>pages 521–530, Austin, Texas. Association for Computational Linguistics.</p>
<p>Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural question generation from text: A preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pages 662–671. Springer.</p>
<h2>A In-Between Response</h2>
<p>Our evaluation framework gives lower scores to in-between questions. We illustrate this now with an example in Section 5. Suppose we have a context with the reference set $R = \{r_1, r_2\}$, where</p>
<p>$r_1$: who won the 2014 world cup</p>
<p>$r_2$: which event did Germany win in 2014</p>
<p>and the system output is a set $P$ containing a single element</p>
<p>$p_1$: which event did the 2014 world cup</p>
<p>The values of METEOR, ROUGE, BLEU (sentence-level) and the corresponding F-metrics are</p>
<p>BLEU: 0.5946, F-BLEU: 0.2867</p>
<p>ROUGE: 0.6240, F-ROUGE: 0.5987</p>
<p>METEOR: 0.3773, F-METEOR: 0.3516</p>
<h2>B Hyperparameters</h2>
<p>We implemented our models in TensorFlow. Wherever possible, we kept the hyperparameters identical to the ones used by Du et al. (2017). For the variational model, we experiment with different dimensionalities for the latent variable $z$, choosing from {25, 50, 100}. Following Sønderby et al. (2016), we anneal the KL term, with a scaling factor starting at 0 and increasing at a rate of 0.03 per iteration. This rate is determined through greedy search from the set {0.01, 0.03, 0.05, 0.1}. We apply dropout to the latent variable $z$ on the example level, dropping out the variable entirely with a probability of 0.2, selected from the set {0.1, 0.2, 0.3, 0.4, 0.5}. For the baseline, we saw slight improvements from stacking two LSTM layers in the encoder and the decoder (with dropout of probability 0.3 applied in-between), while for the variational model multiple layers had no significant impact. For the RBS heuristic, we selected the number of beams $b = 2$ on the development set as well.</p>
<h2>C Results of Additional Metrics</h2>
<p>Results with ROUGE and BLEU are summarized in Tables 3 and 4.</p>
<h2>D System Outputs</h2>
<p>Some example outputs from the baseline system and our best-performing CVAE model are given below.</p>
<h2>D.1</h2>
<p>Context: according to the doctrine of impermanence , life embodies this flux in the aging process , the cycle of rebirth ( samsara ), and in any experience of loss .</p>
<p>Du et al.: what is the cycle of rebirth ?</p>
<p>CVAE 1: what is the process of impermanence ?</p>
<p>CVAE 2: according to the doctrine of impermanence , what does the cycle of rebirth mean ?</p>
<p>Du et al. + RBS 1: how does the cycle of rebirth ?</p>
<p>Du et al. + RBS 2: how does the cycle of rebirth ?</p>
<p>CVAE + RBS 1: what does the term of impermanence refer to ?</p>
<p>CVAE + RBS 2: according to the doctrine of cycle , what does the cycle of impermanence refer to ?</p>
<h2>D.2</h2>
<p>Context: the world trade center path station , which opened on july 19 , 1909 as the hudson terminal , was also destroyed in the attack .</p>
<p>Du et al.: on what date was the world trade center ?</p>
<p>CVAE 1: what was destroyed in the attack ?</p>
<p>CVAE 2: when was the world trade center station opened ?</p>
<p>Du et al. + RBS 1: when did the world trade center rail open ?</p>
<p>Du et al. + RBS 2: when was the world trade center station opened ?</p>
<p>CVAE + RBS 1: what was the hudson terminal ?</p>
<p>CVAE + RBS 2: when did the hudson terminal open ?</p>
<h2>D.3</h2>
<p>Context: michiru ōshima created orchestral arrangements for the three compositions , later to be performed by an ensemble conducted by yasuzo takemoto .</p>
<p>Du et al.: who created musical arrangement for the three compositions ?</p>
<p>CVAE 1: what was the name of the orchestral ōshima created ?</p>
<p>CVAE 2: who created compositions arrangements ?</p>
<p>CVAE 3: who created created arrangements ?</p>
<p>Du et al. + RBS 1: who performed the ensemble ōshima ?</p>
<p>Du et al. + RBS 2: who performed a ensemble ōshima ?</p>
<p>Du et al. + RBS 3: who performed the ensemble ōshima ?</p>
<p>CVAE + RBS 1: who performed the musical arrangements for the three compositions ?</p>
<p>CVAE + RBS 2: who created the compositions to be performed by michiru ensemble ?</p>
<p>CVAE + RBS 3: who performed musical musical in the three compositions ?</p>
<h2>D.4</h2>
<p>Context: examples include a concert on 23 march 1833 , in which chopin , liszt and hiller performed -lrb- on pianos -rrb- a concerto by j.s. bach for three keyboards ; and , on 3 march 1838 , a concert in which chopin , his pupil adolphe gutmann , charles-valentin alkan , and alkan ’s teacher joseph zimmermann performed alkan ’s arrangement , for eight hands , of two movements from beethoven ’s 7th symphony .</p>
<p>Du et al.: who performed alkan ’s ?</p>
<p>CVAE 1: who performed alkan ’s 8th march ?</p>
<p>CVAE 2: who performed alkan ’s 8th march ?</p>
<p>CVAE 3: how many movements did joseph play in the concert ?</p>
<p>Du et al. + RBS 1: who performed alkan ’s arrangement ?</p>
<p>Du et al. + RBS 2: who performed alkan ’s first solo ?</p>
<p>Du et al. + RBS 3: who performed alkan ’s arrangement ?</p>
<p>CVAE + RBS 1: who performed a solo concert ?</p>
<p>CVAE + RBS 2: who performed the concert at the march of 1876 ?</p>
<p>CVAE + RBS 3: who performed liszt ’s arrangement 23 ?</p>
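
The best-match meta-metric described in Section 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. The division by set size is inferred from the paper's statement that u and v reduce to precision and recall under an exact-match scorer. All function names are hypothetical, and `token_jaccard` is only a toy stand-in for a real scoring function such as METEOR, BLEU, or ROUGE.

```python
from typing import Callable, Sequence

# A scoring function s(prediction, reference); it need not be symmetric.
Scorer = Callable[[str, str], float]


def exact_match(prediction: str, reference: str) -> float:
    """Binary scorer: 1 for an exact string match, 0 otherwise."""
    return 1.0 if prediction == reference else 0.0


def token_jaccard(prediction: str, reference: str) -> float:
    """Toy soft scorer (assumption): Jaccard overlap of whitespace tokens."""
    p, r = set(prediction.split()), set(reference.split())
    union = p | r
    return len(p & r) / len(union) if union else 0.0


def generalized_precision(preds: Sequence[str], refs: Sequence[str], s: Scorer) -> float:
    """u: average, over predictions, of the best score against any reference."""
    return sum(max(s(p, r) for r in refs) for p in preds) / len(preds)


def generalized_recall(preds: Sequence[str], refs: Sequence[str], s: Scorer) -> float:
    """v: average, over references, of the best score from any prediction."""
    return sum(max(s(p, r) for p in preds) for r in refs) / len(refs)


def f_metric(preds: Sequence[str], refs: Sequence[str], s: Scorer) -> float:
    """Harmonic mean of u and v; both sets are assumed to be non-empty."""
    u = generalized_precision(preds, refs, s)
    v = generalized_recall(preds, refs, s)
    return 2 * u * v / (u + v) if u + v > 0 else 0.0


refs = ["who won the 2014 world cup ?", "which event did germany win in 2014 ?"]
one_question = [refs[0]]   # a system that always asks the same question
both_questions = list(refs)  # a system that covers both references

print(f_metric(one_question, refs, exact_match))
print(f_metric(both_questions, refs, exact_match))
print(f_metric(one_question, refs, token_jaccard))
```

With the exact-match scorer, the single-question system has precision 1 and recall 0.5, so its F score is 2/3, while the system covering both references scores 1. This is the distinction that a conventional best-reference score cannot make.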

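
The KL annealing schedule from Appendix B (a scaling factor starting at 0 and increasing by 0.03 per iteration) can be illustrated as follows. This is a sketch under stated assumptions, not the authors' TensorFlow code: the cap of 1.0 on the scaling factor, the diagonal-Gaussian form of both posterior and prior, and all function names are assumptions introduced here for illustration.

```python
import math
from typing import Sequence


def kl_weight(iteration: int, rate: float = 0.03, cap: float = 1.0) -> float:
    """Linear KL annealing: starts at 0 and grows by `rate` per iteration.

    The cap at 1.0 is an assumption; the paper only states the start and rate.
    """
    return min(cap, rate * iteration)


def diag_gaussian_kl(
    mu_q: Sequence[float],
    logvar_q: Sequence[float],
    mu_p: Sequence[float],
    logvar_p: Sequence[float],
) -> float:
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def annealed_loss(reconstruction_nll: float, kl: float, iteration: int) -> float:
    """Negative ELBO with the KL term scaled by the annealing weight."""
    return reconstruction_nll + kl_weight(iteration) * kl


kl = diag_gaussian_kl([1.0], [0.0], [0.0], [0.0])
for step in (0, 10, 50):
    print(step, kl_weight(step), annealed_loss(2.0, kl, step))
```

Early in training the weight is close to 0, so the objective is dominated by reconstruction and the decoder is pushed to use the latent variable before the KL penalty takes full effect.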
