Producing fair outcomes with synthetic data
Not only can the use of synthetic data make high-stakes use cases more feasible, it can also reduce bias found in datasets
The generation of synthetic datasets for training Machine Learning systems is becoming more popular. This technique, which uses generative AI, allows engineers to build larger datasets that carry the same statistical properties as the original data. Furthermore, because synthetic data is not based on real-world sampling, it is privacy-preserving.
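As a toy sketch of the idea, a simple generative model can be fitted to the original data and then sampled to produce fresh rows that share its statistical properties without copying any real record. Here a plain multivariate Gaussian stands in for the generative model; real systems such as Synthetaic's use far richer generators, so treat the setup below purely as an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" dataset: 200 samples with 3 correlated features.
real = rng.multivariate_normal(
    mean=[0.0, 1.0, -1.0],
    cov=[[1.0, 0.5, 0.2],
         [0.5, 1.0, 0.1],
         [0.2, 0.1, 1.0]],
    size=200,
)

# Fit a simple generative model (a multivariate Gaussian) to the real
# data, then sample new synthetic rows from it. No real row is copied,
# so no individual record is exposed.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mu, sigma, size=1000)

# The synthetic set mirrors the statistics of the original sample.
print("fitted mean:   ", np.round(mu, 2))
print("synthetic mean:", np.round(synthetic.mean(axis=0), 2))
```

The same recipe scales up: replace the Gaussian with a GAN or another deep generative model and you can draw arbitrarily many samples, which is how synthetic augmentation sidesteps both small datasets and direct reuse of sensitive records.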
Synthetaic, a company focused on creating synthetic data for high-stakes Machine Learning solutions, has just raised $3.5M in seed funding.
The company’s founder, Corey Jaskolski, got the idea when he was creating a full digital record of one of the last remaining Sumatran rhinos in Indonesia. The 3D scan (displayed below) was so realistic that he thought it was a photo. He argues that “if my synthetic digitized rhino is indistinguishable from a photo, is it as real as a photo? I realized that if I can create 3D models that look real to me, I can use these images to train AI systems.”
This technique permits the generation of data where real-world examples are sparse. For example, when detecting rare brain cancers or extremist stickers on cars, the imbalance between positive and negative examples heavily degrades the model’s performance.
Recently, Synthetaic worked on generating chest X-rays of COVID-19 patients to assist doctors in detecting the disease.
However, if you generate synthetic datasets from biased sources — and real-life datasets often are biased (e.g. the COMPAS recidivism dataset) — you perpetuate that bias through your data augmentation. Therefore, you must always be aware of the potential biases that exist in your data and include parity penalties in your optimization procedures.
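One simple way to encode such a parity penalty is to add the squared demographic-parity gap (the difference in mean predicted positive rate between protected groups) to the training loss. The sketch below uses plain NumPy logistic regression on made-up data; the variable names, penalty weight, and data-generation step are illustrative assumptions, not Synthetaic's method:

```python
import numpy as np

def train(X, y, group, lam, lr=0.1, steps=800):
    """Logistic regression by gradient descent, plus an optional
    demographic-parity penalty lam * gap**2, where gap is the difference
    in mean predicted positive rate between the two groups."""
    n = len(y)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (p - y) / n          # standard log-loss gradient
        grad_b = np.mean(p - y)
        if lam > 0:
            gap = p[group == 1].mean() - p[group == 0].mean()
            s = p * (1 - p)                 # sigmoid derivative
            d_w = (X[group == 1] * s[group == 1, None]).mean(axis=0) \
                - (X[group == 0] * s[group == 0, None]).mean(axis=0)
            d_b = s[group == 1].mean() - s[group == 0].mean()
            grad_w += 2 * lam * gap * d_w   # gradient of lam * gap**2
            grad_b += 2 * lam * gap * d_b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def parity_gap(X, group, w, b):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return abs(p[group == 1].mean() - p[group == 0].mean())

# Made-up data in which one feature leaks the protected attribute.
rng = np.random.default_rng(1)
n = 400
group = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2))
X[:, 1] += 1.5 * group                       # feature 1 leaks the group
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 0.75).astype(float)

gap_plain = parity_gap(X, group, *train(X, y, group, lam=0.0))
gap_fair = parity_gap(X, group, *train(X, y, group, lam=5.0))
print(f"parity gap without penalty: {gap_plain:.3f}, with penalty: {gap_fair:.3f}")
</```

The penalty weight `lam` is the knob that trades raw accuracy against outcome parity; in practice it would be tuned alongside the fairness criterion that matters for the application.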
Why it matters
Real datasets are often too small for adequate training. This heavily impacts the feasibility of high-stakes, high-reward Machine Learning use cases. While the generation of synthetic data can help solve this problem, researchers must still be aware of social biases in their original training datasets.
It is true that modifying the ratio of a feature in the original dataset during augmentation could be seen as injecting a new bias: the augmented dataset might not reflect reality as faithfully as the original one. One might say that you are simply replacing one bias with another. Remember, however, that the aim is not always to represent reality accurately but to produce fair outcomes.
Performers – the new and improved Transformers
By approximating Transformers’ attention mechanism, researchers have drastically reduced their computational cost
Transformers have recently revolutionized the Artificial Intelligence community. This type of deep learning model has demonstrated state-of-the-art performance in NLP, and promising results show that it is also relevant to Computer Vision tasks. Unfortunately, Transformers scale quadratically with the number of tokens, leading to a heavy computational load when training these models. As a consequence, most AI teams are unable to leverage the power of this technique.
A team combining researchers from Google, the University of Cambridge, DeepMind, and the Alan Turing Institute proposes a new type of Transformer, dubbed the Performer. The new technique estimates regular full-rank-attention Transformers with high accuracy. As can be observed in the image below, the calculation of the attention mechanism is decomposed, reducing its cost from quadratic to linear. The paper contains extensive mathematical theory guaranteeing unbiased (or nearly unbiased) estimation of the attention matrix, uniform convergence, and low estimation variance.
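The decomposition can be sketched with a toy NumPy version of the random-feature idea behind Performers: replace the exponential kernel inside softmax attention with positive random features, so the n × n attention matrix never has to be formed. The dimensions, scales, and number of random features below are arbitrary choices for illustration, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 16, 256   # sequence length, head dim, number of random features

def phi(x, W):
    # Positive random features chosen so that E[phi(q) @ phi(k)] = exp(q @ k)
    return np.exp(x @ W.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(W.shape[0])

# Small-scale queries/keys keep the kernel estimate well-behaved in this toy.
Q = 0.2 * rng.normal(size=(n, d))
K = 0.2 * rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Exact softmax-style attention: materializes an n x n matrix, O(n^2 * d).
A = np.exp(Q @ K.T)
exact = (A / A.sum(axis=1, keepdims=True)) @ V

# Performer-style approximation: the n x n matrix is never formed.
W = rng.normal(size=(m, d))                # random projection shared by Q and K
Qp, Kp = phi(Q, W), phi(K, W)              # (n, m) feature maps
approx = (Qp @ (Kp.T @ V)) / (Qp @ Kp.sum(axis=0))[:, None]   # O(n * m * d)

print("max abs error:", np.max(np.abs(exact - approx)))
```

Because the feature maps are computed per token, the cost grows linearly with sequence length while the output stays close to exact attention; the paper's FAVOR+ mechanism adds further refinements (such as orthogonal random features) that this sketch omits.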
Why it matters
Transformers provide an intelligent mechanism for identifying complex dependencies in input sequences. Unfortunately, that mechanism carries an immense computational cost, prohibiting its use at scale. Performers use a different backbone mechanism to calculate attention, providing comparable accuracy at linear (instead of quadratic) cost. This effectively makes the method more accessible, which in turn democratizes the use of Artificial Intelligence in both research and industry.
Interactive Data Science Communication
A new visual article about COVID-19 marks a trend of increasing interactive communication in Data Science
The past 40 years have seen a complete shift in how people communicate. The internet allows for instant transmission of information. In this age, sorting out the valid from the unreliable and unproven has become an enormous challenge.
Furthermore, video and audio formats make up an increasing share of shared information. Explaining concepts through text and data visualizations is difficult, as a reader’s background heavily influences their level of comprehension.
To cope with this, data science writers aim to make their articles more interactive. Displaying data dynamically enhances readers’ learning experience: they can play with the visualization and understand concepts clearly.
A beautiful example relevant to the current scenario is an article from the Financial Times Visual Journalism Team. The article, entitled “Covid-19: The global crisis — in data”, uses data from around the world to tell the Coronavirus story. Dynamic visualizations coupled with good storytelling and relevant external links make for a poignant article.
The Financial Times Visual Journalism Team has created a plethora of other articles, which you can find here.
In the Machine Learning field, Distill is a publication platform for interactive articles. The platform aims to advance dialogue, promote outstanding communication, and support scientific integrity. Leveraging web tools allows for reactive diagrams, breaking free from the traditional PDF format. Examples include using t-SNE effectively (displayed below), attention and augmented RNNs, and a visual exploration of Gaussian Processes.
Why it matters
New tools, such as Observable, allow you to interact with your readers to convey information with style and clarity. In an age combining misinformation with a trend of ever-increasing amounts of generated data, communicating clearly and efficiently is paramount.