Work performed from the beginning of the project to the end of the period and main results achieved so far
One simple yet operational view of automated data science that Synth has contributed is that of predictive autocompletion in a spreadsheet environment. Imagine that the end-user of spreadsheet software is filling out some entries, and assume that the data contains regularities and has been entered in a systematic manner. The predictive autocompletion task is then to predict automatically which cells the user will fill out, the right values for those cells, and an estimate of the confidence of each prediction. Solving the autocompletion task is, in a nutshell, the overall task addressed by Synth, as it requires solutions to all three challenges mentioned in the abstract. Initial approaches to autocompletion have been contributed.
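As a rough illustration of the autocompletion task, the sketch below suggests values for empty cells in one column from regularities observed in already completed rows, and attaches a confidence to each suggestion. It is a minimal frequency-based baseline written for this summary, not one of the approaches developed in Synth; the function name `autocomplete`, the column names, and the example sheet are all hypothetical.

```python
from collections import Counter, defaultdict


def autocomplete(rows, key_col, target_col):
    """Suggest values for empty target cells, each with a confidence estimate.

    Hypothetical baseline: for every value of `key_col`, suggest the target
    value that most often accompanies it in the rows that are already filled.
    """
    # Count which target values co-occur with each key value in completed rows.
    counts = defaultdict(Counter)
    for row in rows:
        if row[target_col] is not None:
            counts[row[key_col]][row[target_col]] += 1

    suggestions = {}
    for index, row in enumerate(rows):
        seen = counts.get(row[key_col])
        if row[target_col] is None and seen:
            value, freq = seen.most_common(1)[0]
            # Confidence: relative frequency of the suggested value for this key.
            suggestions[index] = (value, freq / sum(seen.values()))
    return suggestions


# Illustrative spreadsheet: None marks a cell the user has not filled in yet.
sheet = [
    {"product": "apple", "category": "fruit"},
    {"product": "apple", "category": "fruit"},
    {"product": "apple", "category": "snack"},
    {"product": "apple", "category": None},
    {"product": "carrot", "category": "vegetable"},
    {"product": "kiwi", "category": None},  # no evidence: no suggestion is made
]

suggestions = autocomplete(sheet, "product", "category")
for index, (value, confidence) in sorted(suggestions.items()):
    print(f"row {index}: category = {value!r} (confidence {confidence:.2f})")
```

The sketch covers all three parts of the task in a very reduced form: which cells to fill (empty cells with supporting evidence), which value to enter, and how confident the prediction is. Cells without evidence are deliberately left empty rather than guessed.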
Furthermore, the key components of an automatic data scientist have been identified, and prototypes for all of them are under way:
The underlying idea of SynthLog is, just as in inductive databases, that data science becomes a querying and inference process in which patterns and models are first-class citizens. But while traditional inductive databases are based on relational databases, SynthLog’s data model is based on the much more expressive probabilistic logical language ProbLog. ProbLog tightly integrates probabilistic models with logic, databases, and constraints, and supports deductive and probabilistic inference as well as constraint solving. SynthLog’s data models (DM) are now viewed as ProbLog programs, which can be used for inference and learning and which can be combined with other data models through a set of algebraic operations. Furthermore, the results of learning and inference, the inductive models (IM), are also assimilated as regular data models, so that they too become first-class citizens and a convenient closure property is satisfied.
SynthLog serves as the back-end of the Synth automated data scientist and it is intended to support the data science expert.
In particular, Synth-a-Sizer is a preliminary automated data wrangling system that starts from a set of .csv files and transforms them into the format typically required by machine learning algorithms. Synth-a-Sizer is being extended towards more automation and towards coping with richer data formats.
The Synth framework will consist of both a front-end and a back-end. The front-end of Synth will extend traditional spreadsheet software (in particular Microsoft Excel) with facilities for autocompletion and is intended to support the naive end-user. The back-end of Synth is the SynthLog language.
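To make the data-model view behind the SynthLog back-end more concrete, the following toy sketch mimics the closure property in plain Python: a data model is a set of probabilistic facts, a learner takes observations and returns another data model, and models are combined with an algebraic operation. This is not SynthLog or ProbLog code and does not use their APIs; the names `DataModel`, `union`, `query`, and `learn_frequencies` are assumptions introduced only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataModel:
    """Toy stand-in for a data model: facts mapped to their probabilities."""
    facts: dict = field(default_factory=dict)

    def union(self, other):
        # Hypothetical algebraic operation: merge two models. If both define
        # the same fact, the probability from `other` takes precedence.
        return DataModel({**self.facts, **other.facts})

    def query(self, fact):
        # Probability of a fact; facts absent from the model count as false.
        return self.facts.get(fact, 0.0)


def learn_frequencies(observations, predicate):
    """Toy learner: estimate fact probabilities by relative frequency.

    The result is itself a DataModel, which mirrors the closure property:
    learned (inductive) models can be queried and combined like any other.
    """
    total = len(observations)
    counts = {}
    for value in observations:
        counts[value] = counts.get(value, 0) + 1
    return DataModel({(predicate, v): c / total for v, c in counts.items()})


# A hand-written model with one certain fact, and a model learned from data.
background = DataModel({("shift", "night"): 1.0})
learned = learn_frequencies(["win", "win", "loss", "win"], "outcome")

# Because the learned model is a regular data model, it can be combined
# with the background model and queried in exactly the same way.
combined = background.union(learned)
print(combined.query(("outcome", "win")))
print(combined.query(("shift", "night")))
```

The design point is that `learn_frequencies` returns the same type it could have been given as input, so inductive models need no special treatment downstream. The real system works with ProbLog programs rather than dictionaries of facts.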
Progress beyond the state of the art and expected results until the end of the project
When comparing the Synth project and approach to the state of the art, the following points are important:
The project is on track. Its expected results until the end of the project include a fully worked-out proof-of-principle of the sketched approach to automated data science, a novel language for data science (SynthLog), a novel semi-automatic data wrangling system, various new approaches to learning constraints and predictive models, prototype implementations of both the front-end and the back-end, and an evaluation of the approach by end-users and on applications in rostering and sports analytics.