A core problem in data mining is retrieving data in an easy, human-friendly way.
We approach this problem by learning a mapping between natural language (NL)
and SQL syntactic structures. The mapping is derived automatically by machine learning algorithms.
In particular, we generate a dataset of pairs of NL questions and SQL queries, each represented by
its syntactic tree, which is produced automatically by the corresponding syntactic parser.
We then train a classifier to distinguish correct from incorrect question/query pairs, using kernel methods with Support
Vector Machines [Giordani and Moschitti, 2009].
Here we make available the corpora we generated from the GeoQueries250 and RestQueries corpora. The questions in both corpora were originally collected through a web-based interface and manually translated into logical formulas in Prolog by Mooney's group [Tang and Mooney, 2001]. Popescu et al. [Popescu et al, 2003] manually converted them into SQL. Using our clustering algorithm, we discovered and fixed many errors and inconsistencies in the SQL queries.
Datasets can be downloaded in separate zip files.
Each zip file contains 5 files:
- NLs.txt: contains the generalized NL questions together with an increasing NLid.
- SQLs.txt: contains the generalized SQL queries together with an increasing SQLid.
- NL-trees.txt: contains the parse tree of each NL question together with an increasing NLid.
- SQL-trees.txt: contains the parse tree of each SQL query together with an increasing SQLid.
- AllPos.txt: contains all positive pairings between NL questions and SQL queries grouped into clusters, expressed as (NLid,SQLid,CLid), where CLid is the cluster associated with the pair (NLid,SQLid) and the identifiers refer to the questions/queries (and their parse trees) listed above.
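As an illustration of how the (NLid,SQLid,CLid) triples might be read, here is a minimal Python sketch. It assumes that each line of AllPos.txt holds one triple of integer identifiers, optionally wrapped in parentheses; the exact layout of the file is not specified here and may differ. The function names and the sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Assumed line shape: three integers such as "(12,7,3)" or "12,7,3".
# The real AllPos.txt layout may differ; adjust the pattern if so.
TRIPLE = re.compile(r"\(?\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)?")

def parse_positive_pairs(lines):
    """Return a list of (nl_id, sql_id, cluster_id) integer tuples."""
    triples = []
    for line in lines:
        match = TRIPLE.search(line)
        if match:  # skip blank or unrecognized lines
            triples.append(tuple(int(g) for g in match.groups()))
    return triples

def group_by_cluster(triples):
    """Map each cluster id to the set of (nl_id, sql_id) pairs it contains."""
    clusters = defaultdict(set)
    for nl_id, sql_id, cluster_id in triples:
        clusters[cluster_id].add((nl_id, sql_id))
    return dict(clusters)

# Made-up sample lines, not taken from the released files.
sample_lines = ["(1,1,1)", "(2,1,1)", "", "(3,2,2)"]
sample_triples = parse_positive_pairs(sample_lines)
sample_clusters = group_by_cluster(sample_triples)
```

With a real file, the same functions could be applied to the lines returned by `open("AllPos.txt")`, provided the assumed layout holds.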
GeoQueries250: the first corpus contains geography questions. After the generalization process, the initial 250 question/query pairs were reduced to 155 pairs containing 149 NL questions and 80 SQL queries.
We found 76 clusters, from which we generated 164 positive examples out of a total of 149 x 80 = 11,920 candidate pairs.
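The arithmetic behind these figures can be sketched in a few lines of Python. The sketch assumes that every question/query combination not listed as positive is treated as an incorrect pair, which is one plausible reading of the counts above rather than a documented procedure. The helper names and the toy identifiers are hypothetical.

```python
from itertools import product

def count_examples(n_questions, n_queries, n_positives):
    """Return (total candidate pairs, pairs left over after removing positives)."""
    total = n_questions * n_queries
    return total, total - n_positives

def label_pairs(nl_ids, sql_ids, positives):
    """Label every (nl_id, sql_id) combination True if it is a known positive pair.

    Assumption: any pair not in `positives` is labeled False (incorrect).
    """
    positive_set = set(positives)
    return [((n, s), (n, s) in positive_set) for n, s in product(nl_ids, sql_ids)]

# Geography corpus figures quoted above: 149 questions, 80 queries, 164 positives.
geo_total, geo_remaining = count_examples(149, 80, 164)

# Toy example with made-up identifiers.
toy_labeled = label_pairs([1, 2], [1, 2], [(1, 1), (2, 2)])
```

Under this assumption, the geography corpus yields 11,920 candidate pairs, of which 11,756 are not positive.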
RestQueries: the second dataset contains questions about restaurants. The initial 250 pairs were generalized into 197 pairs involving 126 NL questions and 77 SQL queries. We clustered these pairs into only 26 groups, which led to 852 positive examples.
[Tang and Mooney, 2001] Tang, L.R., Mooney, R.J., Using multiple clause constructors in inductive logic programming for semantic parsing. In: Proceedings of the 12th European Conference on Machine Learning, Freiburg, Germany (2001) 466-477
[Popescu et al, 2003] Popescu, A.M., Etzioni, O., Kautz, H., Towards a theory of natural language interfaces to databases. In: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, Association for Computational Linguistics (2003) 149-157
[Giordani and Moschitti, 2009] Giordani, A., Moschitti, A., Syntactic structural kernels for natural language interfaces to databases. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part I, Springer-Verlag (2009) 391-406