Natural Language Processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of 'understanding' the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Text Generation is a subfield of Natural Language Processing (NLP) which attempts to teach machines to automatically generate texts that are indistinguishable from texts written by humans.

Paraphrasing is a subfield of Text Generation that deals with the automatic generation of natural language text that expresses the same semantic meaning as a given input text, but using different words or syntactic structures. The goal of paraphrasing is to retain the original meaning while avoiding plagiarism, simplifying complex sentences, or adapting the language to a different audience.

<h1>PPDB: The Paraphrase Database</h1> <h2>Abstract</h2> <p>We present the 1.0 release of our paraphrase database, PPDB. Its English portion, PPDB:Eng, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations. The paraphrases are extracted from bilingual parallel corpora totaling over 100 million sentence pairs and over 2 billion English words. We also release PPDB:Spa, a collection of 196 million Spanish paraphrases. Each paraphrase pair in PPDB contains a set of associated scores, including paraphrase probabilities derived from the bitext data and a variety of monolingual distributional similarity scores computed from the Google n-grams and the Annotated Gigaword corpus. Our release includes pruning tools that allow users to determine their own precision/recall tradeoff.</p> <h2>1 Introduction</h2> <p>Paraphrases, i.e. differing textual realizations of the same meaning, have proven useful for a wide variety of natural language processing applications. Past paraphrase collections include automatically derived resources like DIRT (Lin and Pantel, 2001) and the MSR paraphrase corpus and phrase table (Dolan et al., 2004; Quirk et al., 2004), among others. Although several groups have independently extracted paraphrases using Bannard and Callison-Burch (2005)’s bilingual pivoting technique (see Zhou et al. (2006), Riezler et al. (2007), Snover et al. (2010), among others), there has never been an official release of this resource. In this work, we release version 1.0 of the ParaPhrase DataBase (PPDB), a collection of ranked English and Spanish paraphrases derived by: • Extracting lexical, phrasal, and syntactic paraphrases from large bilingual parallel corpora (with associated paraphrase probabilities). 
• Computing distributional similarity scores for each of the paraphrases using the Google n-grams and the Annotated Gigaword corpus. In addition to the paraphrase collection itself, we provide tools to filter PPDB to only retain high precision paraphrases, scripts to limit the collection to phrasal or lexical paraphrases (synonyms), and software that enables users to extract paraphrases for languages other than English.</p> <h2>2 Extracting Paraphrases from Bitexts</h2> <p>To extract paraphrases we follow Bannard and Callison-Burch (2005)’s bilingual pivoting method. The intuition is that two English strings \(e_1\) and \(e_2\) that translate to the same foreign string \(f\) can be assumed to have the same meaning. We can thus pivot over \(f\) and extract \(\langle e_1, e_2 \rangle\) as a pair of paraphrases, as illustrated in Figure 1. The method extracts a diverse set of paraphrases. For <i>thrown into jail</i>, it extracts <i>arrested, detained, imprisoned, incarcerated, jailed, locked up, taken into custody</i>, and <i>thrown into prison</i>, along with a set of incorrect/noisy paraphrases that have different syntactic types or that are due to misalignments. [Figure 1: German-English sentence pairs in which <i>festgenommen</i> is aligned with both <i>thrown into jail</i> and <i>imprisoned</i>.] For PPDB, we formulate our paraphrase collection as a weighted synchronous context-free grammar (SCFG) (Aho and Ullman, 1972; Chiang, 2005) with syntactic nonterminal labels, similar to Cohn and Lapata (2008) and Ganitkevitch et al. (2011). An SCFG rule has the form: \[ r = C \rightarrow \langle f, e, \sim, \vec{\varphi} \rangle \] where the left-hand side of the rule, \(C\), is a nonterminal and the right-hand sides \(f\) and \(e\) are strings of terminal and nonterminal symbols. There is a one-to-one correspondence, \(\sim\), between the nonterminals in \(f\) and \(e\): each nonterminal symbol in \(f\) has to also appear in \(e\). Following Zhao et al. 
(2008), each rule \(r\) is annotated with a vector of feature functions \(\vec{\varphi} = \{\varphi_1 \ldots \varphi_N\}\) which are combined in a log-linear model (with weights \(\vec{\lambda}\)) to compute the cost of applying \(r\): \[ \mathrm{cost}(r) = -\sum_{i=1}^{N} \lambda_i \log \varphi_i \tag{1} \] To create a syntactic paraphrase grammar we first extract a foreign-to-English translation grammar from a bilingual parallel corpus, using techniques from syntactic machine translation (Koehn, 2010). Then, for each pair of translation rules where the left-hand side \(C\) and foreign string \(f\) match: \[ r_1 = C \rightarrow \langle f, e_1, \sim_1, \vec{\varphi}_1 \rangle, \qquad r_2 = C \rightarrow \langle f, e_2, \sim_2, \vec{\varphi}_2 \rangle \] we pivot over \(f\) to create a paraphrase rule \(r_p\): \[ r_p = C \rightarrow \langle e_1, e_2, \sim_p, \vec{\varphi}_p \rangle \] with a combined nonterminal correspondency function \(\sim_p\). Note that the common source side \(f\) implies that \(e_1\) and \(e_2\) share the same set of nonterminal symbols. The paraphrase rules obtained using this method are capable of making well-formed generalizations of meaning-preserving rewrites in English. For instance, we extract the following example paraphrase, capturing the English possessive rule: \[ \mathit{NP} \rightarrow \text{the } \mathit{NP}_1 \text{ of } \mathit{NNS}_2 \;\mid\; \text{the } \mathit{NNS}_2 \text{ 's } \mathit{NP}_1 \] The paraphrase feature vector \(\vec{\varphi}_p\) is computed from the translation feature vectors \(\vec{\varphi}_1\) and \(\vec{\varphi}_2\) by following the pivoting idea. For instance, we estimate the conditional paraphrase probability \(p(e_2 \mid e_1)\) by marginalizing over all shared foreign-language translations \(f\): \[ p(e_2 \mid e_1) \approx \sum_{f} p(e_2 \mid f)\, p(f \mid e_1) \tag{2} \]</p> <h2>3 Scoring Paraphrases Using Monolingual Distributional Similarity</h2> <p>The bilingual pivoting approach anchors paraphrases that share an interpretation because of a shared foreign phrase. Paraphrasing methods based on monolingual text corpora, like DIRT (Lin and Pantel, 2001), measure the similarity of phrases based on distributional similarity. This results in a range of different types of phrases, including paraphrases, inference rules and antonyms. For instance, for <i>thrown into prison</i> DIRT extracts good paraphrases like <i>arrested, detained</i>, and <i>jailed</i>. 
However, it also extracts phrases that are temporally or causally related like <i>began the trial of, cracked down on, interrogated, prosecuted</i> and <i>ordered the execution of</i>, because they have similar distributional properties. Since bilingual pivoting rarely extracts these non-paraphrases, we can use monolingual distributional similarity to re-rank paraphrases extracted from bitexts (following Chan et al. (2011)) or incorporate a set of distributional similarity scores as features in our log-linear model. Each similarity score relies on precomputed distributional signatures that describe the contexts that a phrase occurs in. To describe a phrase \(e\), we gather counts for a set of contextual features for each occurrence of \(e\) in a corpus. Writing the context vector for the \(i\)-th occurrence of \(e\) as \(\vec{s}_{e,i}\), we can aggregate over all occurrences of \(e\), resulting in a distributional signature for \(e\), \(\vec{s}_e = \sum_i \vec{s}_{e,i}\). [Figure 2. (a) The n-gram corpus records <i>the long-term</i> as preceded by <i>revise</i> (43 times), and followed by <i>plans</i> (97 times). We add corresponding features to the phrase’s distributional signature retaining the counts of the original n-grams. (b) Here, position-aware lexical and part-of-speech n-gram features, labeled dependency links, and features reflecting the phrase’s CCG-style label NP/NN are included in the context vector.] Following the intuition that phrases with similar meanings occur in 
similar contexts, we can then quantify the goodness of \(e'\) as a paraphrase of \(e\) by computing the cosine similarity between their distributional signatures: \[ \mathrm{sim}(e, e') = \frac{\vec{s}_e \cdot \vec{s}_{e'}}{\lVert \vec{s}_e \rVert \, \lVert \vec{s}_{e'} \rVert} \] A wide variety of features have been used to describe the distributional context of a phrase. Rich, linguistically informed feature-sets that rely on dependency and constituency parses, part-of-speech tags, or lemmatization have been proposed in work such as Church and Hanks (1991) and Lin and Pantel (2001). For instance, a phrase is described by various syntactic relations such as: “what verbs have this phrase as the subject?”, or “what adjectives modify this phrase?”. Other work has used simpler n-gram features, e.g. “what words or bigrams have we seen to the left of this phrase?”. A substantial body of work has focussed on using this type of feature-set for a variety of purposes in NLP (Lapata and Keller, 2005; Bhagat and Ravichandran, 2008; Lin et al., 2010; Van Durme and Lall, 2010). For PPDB, we compute n-gram-based context signatures for the 200 million most frequent phrases in the Google n-gram corpus (Brants and Franz, 2006; Lin et al., 2010), and richer linguistic signatures for 175 million phrases in the Annotated Gigaword corpus (Napoles et al., 2012). Our features extend beyond those previously used in the work by Ganitkevitch et al. (2012). They are: • n-gram based features for words seen to the left and right of a phrase. • Position-aware lexical, lemma-based, part-of-speech, and named entity class unigram and bigram features, drawn from a three-word window to the right and left of the phrase. • Incoming and outgoing (with respect to the phrase) dependency link features, labeled with the corresponding lexical item, lemmata and POS. • Syntactic features for any constituents governing the phrase, as well as for CCG-style slashed constituent labels for the phrase. 
Figure 2 illustrates the feature extraction for an example phrase.</p> <h2>4 English Paraphrases – PPDB:Eng</h2> <p>We combine several English-to-foreign bitext corpora to extract PPDB:Eng: Europarl v7 (Koehn, 2005), consisting of bitexts for the 19 European languages, the 10<sup>9</sup> French-English corpus (Callison-Burch et al., 2009), the Czech, German, Spanish and French portions of the News Commentary data (Koehn and Schroeder, 2007), the United Nations French- and Spanish-English parallel corpora (Eisele and Chen, 2010), the JRC Acquis corpus (Steinberger et al., 2006), Chinese and Arabic newswire corpora used for the GALE machine translation campaign,<sup>2</sup> parallel Urdu-English data from the NIST translation task,<sup>3</sup> the French portion of the OpenSubtitles corpus (Tiedemann, 2009), and a collection of Spanish-English translation memories provided by TAUS.<sup>4</sup> The resulting composite parallel corpus has more than 106 million sentence pairs, over 2 billion English words, and spans 22 pivot languages. To apply the pivoting technique to this multilingual data, we treat the various pivot languages as a joint Non-English language. This simplifying assumption allows us to share statistics across the different languages and apply Equation 2 unaltered. Table 1 presents a breakdown of PPDB:Eng by paraphrase type. We distinguish lexical (a single word), phrasal (a continuous string of words), and syntactic paraphrases (expressions that may contain both words and nonterminals), and separate out identity paraphrases. 
While we list lexical and phrasal paraphrases separately, it is possible that a single word paraphrases as a multi-word phrase and vice versa, so long as they share the same syntactic label.</p> <p>[Footnotes: 2. http://projects.ldc.upenn.edu/gale/data/Catalog.html; 3. LDC Catalog No. LDC2010T23; 4. http://www.translationautomation.com/]</p> <h2>5 Spanish Paraphrases – PPDB:Spa</h2> <p>We also release a collection of Spanish paraphrases: PPDB:Spa is extracted analogously to its English counterpart and leverages the Spanish portions of the bitext data available to us, totaling almost 355 million Spanish words, in nearly 15 million sentence pairs. The paraphrase pairs in PPDB:Spa are annotated with distributional similarity scores based on lexical features collected from the Spanish portion of the multilingual release of the Google n-gram corpus (Brants and Franz, 2009), and the Spanish Gigaword corpus (Mendonca et al., 2009). Table 2 gives a breakdown of PPDB:Spa.</p> <h2>6 Analysis</h2> <p>To estimate the usefulness of PPDB as a resource for tasks like semantic role labeling or parsing, we analyze its coverage of Propbank predicates and predicate-argument tuples (Kingsbury and Palmer, 2002). We use the Penn Treebank (Marcus et al., 1993) to map Propbank annotations to patterns which allow us to search PPDB:Eng for paraphrases that match the annotated predicate. Figure 3 illustrates this mapping. [Figure 3 caption: For the above annotation predicate, we extract VBP → expect, which is matched by paraphrase rules like VBP → expect | anticipate and VBP → expect | hypothesize. To search for the entire relation, we replace the argument spans with syntactic nonterminals. Here, we obtain S → NP expect S, for which PPDB has matching rules like S → NP expect S | NP would hope S, and S → NP expect S | NP trust S. This allows us to apply sophisticated paraphrases to the predicate while capturing …] [Figure 4: average score, coverage, and paraphrases per type plotted against the pruning threshold; series: relation tokens covered, relation types covered, paraphrases / type. (b) PPDB:Eng’s coverage of Propbank predicates with up to two arguments. Here we consider rules that paraphrase the full predicate-argument expression.] In order to quantify PPDB’s precision-recall tradeoff in this context, we perform a sweep over our collection, beginning with the full set of paraphrase pairs and incrementally discarding the lowest-scoring ones. We choose a simple estimate for each paraphrase pair’s score by uniformly combining its paraphrase probability features in Eq. 1. The top graph in Figure 4a shows PPDB’s coverage of predicates (e.g. VBP → expect) at the type level (i.e. counting distinct predicates), as well as the token level (i.e. counting predicate occurrences in the corpus). We also keep track of the average number of paraphrases per covered predicate type for varying pruning levels. We find that PPDB has a predicate type recall of up to 52% (accounting for 97.5% of tokens). Extending the experiment to full predicate-argument relations with up to two arguments (e.g. S → NNS expect S), we obtain a 27% type coverage rate that accounts for 40% of tokens (Figure 4b). Both rates hold even as we prune the database down to only contain high precision paraphrases. Our pruning method here is based on a simple uniform combination of paraphrase probabilities and similarity scores. To gauge the quality of our paraphrases, the authors judged 1900 randomly sampled predicate paraphrases on a scale of 1 to 5, 5 being the best. The bottom graph in Figure 4a plots the resulting human score average against the sweep used in the coverage experiment. 
Even with a simple weighting approach, the PPDB scores show a clear correlation with human judgements. Therefore they can be used to bias the collection towards greater recall or higher precision.</p> <h2>7 Conclusion and Future Work</h2> <p>We present the 1.0 release of PPDB:Eng and PPDB:Spa, two large-scale collections of paraphrases in English and Spanish. We illustrate the resource’s utility with an analysis of its coverage of Propbank predicates. Our results suggest that PPDB will be useful in a variety of NLP applications. Future releases of PPDB will focus on expanding the paraphrase collection’s coverage with regard to both data size and languages supported. Furthermore, we intend to improve paraphrase scoring by incorporating additional sources of information, as well as by better utilizing information present in the data, like domain or topic. We will also address points of refinement such as handling of phrase ambiguity, and effects specific to individual pivot languages. Our aim is for PPDB to be a continuously updated and improving resource. Finally, we will explore extensions to PPDB to include aspects of related large-scale resources such as lexical-semantic hierarchies (Snow et al., 2006), textual inference rules (Berant et al., 2011), relational patterns (Nakashole et al., 2012), and (lexical) conceptual networks (Navigli and Ponzetto, 2012).</p> <h2>Acknowledgements</h2> <p>We would like to thank Frank Ferraro for his Propbank processing tools. This material is based on research sponsored by the NSF under grant IIS-1249516 and DARPA under agreement number FA8750-13-2-0017 (the DEFT program). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA or the U.S. Government.</p>
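
The bilingual pivoting estimate from Section 2 can be sketched in a few lines of Python. This is only an illustration of marginalizing over shared foreign phrases: the toy phrase tables, their probabilities, and the function name `pivot_paraphrase_probs` are assumptions made up for this example and are not part of the PPDB release or its tools.

```python
from collections import defaultdict

# Toy translation tables (invented numbers, not PPDB data).
# P_F_GIVEN_E[e][f] = p(f | e); P_E_GIVEN_F[f][e] = p(e | f)
P_F_GIVEN_E = {
    "thrown into jail": {"festgenommen": 0.6, "inhaftiert": 0.4},
}
P_E_GIVEN_F = {
    "festgenommen": {"thrown into jail": 0.2, "arrested": 0.5, "imprisoned": 0.3},
    "inhaftiert": {"thrown into jail": 0.25, "imprisoned": 0.5, "jailed": 0.25},
}


def pivot_paraphrase_probs(e1, p_f_given_e, p_e_given_f):
    """Approximate p(e2 | e1) as the sum over f of p(e2 | f) * p(f | e1)."""
    scores = defaultdict(float)
    for f, p_f in p_f_given_e.get(e1, {}).items():
        for e2, p_e2 in p_e_given_f.get(f, {}).items():
            # Skipping the identity paraphrase is a choice of this sketch.
            if e2 != e1:
                scores[e2] += p_e2 * p_f
    return dict(scores)


probs = pivot_paraphrase_probs("thrown into jail", P_F_GIVEN_E, P_E_GIVEN_F)
# Candidates ordered from most to least probable paraphrase.
ranked = sorted(probs, key=probs.get, reverse=True)
```

In this toy table, `imprisoned` ranks first because it is reachable through both pivot phrases, which is the effect the marginalization is meant to capture.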

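
The distributional scoring from Section 3 (summing per-occurrence context vectors into a signature, then comparing signatures by cosine similarity) can be sketched as follows. The feature counts for "the long-term" are the ones quoted in the Figure 2 example; the second phrase and its counts, and the helper names `signature` and `cosine`, are hypothetical and serve only to make the sketch runnable.

```python
import math
from collections import Counter


def signature(occurrence_contexts):
    """Sum per-occurrence context feature counts into one signature."""
    sig = Counter()
    for ctx in occurrence_contexts:
        sig.update(ctx)  # Counter.update adds counts for mappings
    return sig


def cosine(sig_a, sig_b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * sig_b.get(k, 0) for k, v in sig_a.items())
    norm_a = math.sqrt(sum(v * v for v in sig_a.values()))
    norm_b = math.sqrt(sum(v * v for v in sig_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Counts taken from the Figure 2 example for "the long-term".
long_term = Counter({
    "L-achieve": 25, "L-confirmed": 64, "L-revise": 43,
    "R-goals": 23, "R-plans": 97, "R-investment": 10,
})
# Hypothetical counts for a candidate paraphrase (invented for illustration).
candidate = Counter({"L-revise": 12, "R-plans": 30, "R-goals": 9, "R-outlook": 4})

similarity = cosine(long_term, candidate)
```

Sparse dictionaries are used here because context features are overwhelmingly zero for any single phrase; this is a design choice of the sketch, not a description of how PPDB stores its signatures.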
