Goals and Progress

Work performed from the beginning of the project to the end of the period and main results achieved so far

One simple yet operational view of automated data science that Synth has contributed is that of predictive autocompletion in a spreadsheet environment. Imagine that the end-user of spreadsheet software is filling out some entries, and assume that the data contains regularities and has been entered in a systematic manner. The predictive autocompletion task is then to predict automatically not only which cells the user will fill out, but also the right values for those cells, together with an estimate of the confidence of each prediction. Solving the autocompletion task is, in a nutshell, the overall task addressed by Synth, as it requires solutions to all three challenges mentioned in the abstract. Synth has contributed initial approaches to autocompletion.
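
To make the task concrete, the following is a minimal sketch of autocompletion for a single numeric column. It is not the method developed in Synth: it assumes, purely for illustration, that the only regularity is a constant step between rows, and the function name `autocomplete_column` and the agreement-based confidence score are hypothetical choices.

```python
from typing import Dict, List, Optional, Tuple


def autocomplete_column(
    cells: List[Optional[float]],
) -> Dict[int, Tuple[float, float]]:
    """Predict the empty cells of a column, assuming a constant step per row.

    Returns {row_index: (predicted_value, confidence)} for every empty cell.
    """
    filled = [(i, v) for i, v in enumerate(cells) if v is not None]
    if len(filled) < 2:
        return {}  # too little evidence to detect any regularity

    # Estimate the per-row step from each pair of consecutive filled cells.
    steps = [
        (v2 - v1) / (i2 - i1)
        for (i1, v1), (i2, v2) in zip(filled, filled[1:])
    ]
    # Hypothesised pattern: the most frequently observed step.
    step = max(set(steps), key=steps.count)
    # Confidence (illustrative): share of observed pairs that agree with it.
    confidence = steps.count(step) / len(steps)

    # Extrapolate from the first filled cell to every empty cell.
    i0, v0 = filled[0]
    return {
        i: (v0 + step * (i - i0), confidence)
        for i, v in enumerate(cells)
        if v is None
    }


# Example: rows 2 and 4 are empty and the filled rows increase by 10.
predictions = autocomplete_column([10, 20, None, 40, None])
```

The sketch returns all three ingredients named above: which cells to fill (the dictionary keys), the predicted values, and a confidence estimate. A realistic system would have to consider many more kinds of regularity than a constant step.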

Furthermore, the key components of an automatic data scientist have been identified, and prototypes for all of them are under way:

  1. The SynthLog language as the unifying IDM language for supporting data scientists

    The underlying idea of SynthLog is, just as in inductive databases, that data science becomes a querying and inference process in which patterns and models are first-class citizens. But while traditional inductive databases are based on relational databases, SynthLog’s data model is based on the much more expressive probabilistic logical language ProbLog. ProbLog tightly integrates probabilistic models with logic, databases, and constraints, and supports deductive and probabilistic inference as well as constraint solving. SynthLog’s data models (DM) are now viewed as ProbLog programs, which can be used for inference and learning and which can be combined (through a set of algebraic operations) with other data models. Furthermore, the results of learning and inference, the inductive models (IM), are also assimilated as regular data models, so that they too become first-class citizens and a convenient closure property is satisfied.

    SynthLog serves as the back-end of the Synth automated data scientist and it is intended to support the data science expert.
     

  2. A (partially) automated data wrangling system

    In particular, the Synth-a-Sizer approach is a preliminary automated data wrangling system that starts from a set of .csv files and transforms them into the format typically processed by machine learning algorithms. Synth-a-Sizer is being extended towards more automation and towards coping with richer data formats.
     

  3. Methods for learning inductive models, in particular constraints and predictive models
    1. While constraints are ubiquitous in artificial intelligence and are also commonly used in machine learning and data mining, the problem of learning constraints from examples has received much less attention. The Synth project has devoted special attention to this problem and has contributed a wide variety of techniques for learning constraints, such as Excel constraints in spreadsheets and mathematical programs used in operations research.
    2. Synth is also contributing new algorithms for predictive learning. In particular, it is extending the structure learning approaches for the ProbLog language, and it has contributed to the MERCS approach for learning multi-directional ensembles of multi-target decision trees.
       
  4. Prototype implementation

    The Synth framework will consist of both a front-end and a back-end. The front-end of Synth will extend traditional spreadsheet software (in particular Microsoft’s Excel) with facilities for autocompletion and is intended to support the naive end-user. The back-end of Synth is the SynthLog language.
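
As an illustration of what learning spreadsheet constraints from examples (component 3 above) can mean, the following minimal sketch searches a small table for columns that equal the row-wise sum of other columns. It is a plain generate-and-test procedure written for this summary, not one of the techniques contributed by the project; the function name `learn_sum_constraints`, the restriction to sum constraints, and the tolerance parameter are all assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def learn_sum_constraints(
    table: Dict[str, List[float]], tol: float = 1e-9
) -> List[Tuple[str, Tuple[str, ...]]]:
    """Find (target, operands) pairs where target == sum(operands) in every row."""
    names = list(table)
    n_rows = len(table[names[0]]) if names else 0
    if n_rows == 0:
        return []  # without example rows, every candidate would hold vacuously

    learned = []
    for target in names:
        others = [name for name in names if name != target]
        # Enumerate every candidate set of at least two operand columns.
        for size in range(2, len(others) + 1):
            for operands in combinations(others, size):
                # Keep the candidate only if it is satisfied by all rows.
                if all(
                    abs(table[target][r] - sum(table[c][r] for c in operands)) <= tol
                    for r in range(n_rows)
                ):
                    learned.append((target, operands))
    return learned


# Example: "Total" is the row-wise sum of "Q1" and "Q2".
sheet = {"Q1": [10, 20, 30], "Q2": [5, 5, 5], "Total": [15, 25, 35]}
constraints = learn_sum_constraints(sheet)
```

A constraint learned in this way could then support autocompletion in the front-end, for example by proposing the value of a missing "Total" cell. The exhaustive enumeration used here grows exponentially with the number of columns, so it is only suitable as an illustration.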

Progress beyond the state of the art and expected results until the end of the project

When comparing the Synth project and its approach to the state of the art, the following points are important:

  1. While most approaches to automating data science and machine learning focus on the modeling step, Synth focusses on the overall data science process including the pre-processing steps.
  2. Synth focusses on symbolic and probabilistic modeling and learning methods (rather than on the popular neural networks) because it aims to support both learning and reasoning. In addition, Synth devotes considerable attention to the learning of constraints. While many use constraints in machine learning and in problem solving, there have been only a few attempts at learning them.
  3. Synth’s ultimate grand challenge is to automate data science to such an extent that it becomes accessible to non-expert users. Most other approaches to automating data science and machine learning target data scientists rather than end-users. Synth addresses both, through its front-end and its back-end respectively.
  4. Synth is aimed at identifying a small and principled set of necessary components for automating data science and at developing a unifying language and framework that incorporates these.

The project is on track. Its expected results until the end of the project include a fully worked-out proof of principle of the sketched approach to automated data science, a novel language for data science (SynthLog), a novel semi-automatic data wrangling system, various new approaches to learning constraints and predictive models, prototype implementations of both the front-end and the back-end, as well as an evaluation of the approach by end-users and on applications in rostering and sports analytics.