Show and Tell Using a Computer: A Deep Learning Approach

Artificial intelligence (AI) is making its way into every field and improving many kinds of work. The field is founded on understanding cognition, learning, and teaching in ways that resemble humans. It is a fascinating topic, and many people want to understand how it works. Artificial neural networks form an important part of AI, and I would like to share my experience with them. Artificial neural networks are spreading everywhere; at its core, an artificial neural network is an attempt to simulate the function of the biological neurons in the human brain.
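The idea that an artificial neuron loosely simulates a biological one can be shown with a minimal sketch: a neuron computes a weighted sum of its inputs plus a bias and passes the result through an activation function. The weights and inputs below are illustrative values, not parameters from our project.

```python
import math

def neuron(inputs, weights, bias):
    """A single artificial neuron: weighted sum of inputs plus a bias,
    squashed by a sigmoid activation into the range (0, 1)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Example: two inputs with hand-picked weights.
activation = neuron([0.5, -1.0], [0.8, 0.2], 0.1)
print(activation)
```

A full network chains many such neurons in layers, and training adjusts the weights and biases so the network's outputs match known examples.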

This deep learning application can benefit people with visual impairments. Well-trained models can be deployed within screen readers to help visually challenged people understand the context and content that an image is trying to communicate. A screen reader is a type of assistive technology, delivered as software or as specially designed hardware, that visually challenged people use to navigate the web. It is also helpful for general use of electronic devices such as mobile phones, computers, and tablets. The main goal of a screen reader is to transcribe visual data into synthesised speech. If the screen reader supports braille output, it generates the corresponding braille patterns, which the user reads by tracing their fingers over them.
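The braille output a screen reader generates can be illustrated with a small sketch that maps text to Unicode braille patterns. This covers grade-1 letters only and uses the standard Unicode braille block (U+2800 onward, one bit per dot); it is a general illustration, not code from our project.

```python
# Grade-1 braille dot numbers for letters a-j. Letters k-t reuse the
# a-j shapes with dot 3 added; u, v, x, y, z reuse a-e with dots 3 and
# 6 added; w is a historical exception (dots 2, 4, 5, 6).
_AJ = {'a': {1}, 'b': {1, 2}, 'c': {1, 4}, 'd': {1, 4, 5}, 'e': {1, 5},
       'f': {1, 2, 4}, 'g': {1, 2, 4, 5}, 'h': {1, 2, 5},
       'i': {2, 4}, 'j': {2, 4, 5}}

def _letter_dots(ch):
    if ch in _AJ:
        return _AJ[ch]
    if ch == 'w':
        return {2, 4, 5, 6}
    if 'k' <= ch <= 't':
        return _AJ['abcdefghij'[ord(ch) - ord('k')]] | {3}
    if ch in 'uvxyz':
        return _AJ['abcde'['uvxyz'.index(ch)]] | {3, 6}
    raise ValueError(f"unsupported character: {ch!r}")

def to_braille(text):
    """Map lowercase letters and spaces to Unicode braille characters.
    Each dot n sets bit n-1 of the offset from U+2800."""
    cells = []
    for ch in text.lower():
        if ch == ' ':
            cells.append('\u2800')  # blank braille cell
        else:
            dots = _letter_dots(ch)
            cells.append(chr(0x2800 + sum(1 << (d - 1) for d in dots)))
    return ''.join(cells)

print(to_braille("hello"))  # → ⠓⠑⠇⠇⠕
```

A real screen reader would also handle numbers, punctuation, and contractions (grade-2 braille), but the cell-encoding principle is the same.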

Our final-year Master’s project is an image captioning system that produces output in the form of text and speech. Our web application can identify the majority of images and caption them with decent accuracy, but it has a limitation: a new image may not be captioned accurately by the model. The cause of this issue is a lack of training image data, so we have proposed a solution whereby a future version of the model will be trained on a broader range of data sets. We implemented the Transformer deep learning model in our web application; it is the primary model for training on and captioning large image data sets. We used the Django framework and Python for the back end. We trained our data set on a high-performance GPU through the Paperspace cloud service, using an RTX 5000 GPU with 30 GB of RAM. We created a prototype using JavaScript for client-side dynamic interactions and form submissions, with Django handling the web server. Finally, we deployed the web application online using the Heroku deployment service (owned by Salesforce) and synchronised the code to the UbiOps inference service, which processes the user's input image and leverages the training from the data set to produce accurate and descriptive captions.
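The heart of a Transformer captioner is autoregressive decoding: given the extracted image features, the decoder emits one word at a time, feeding each predicted word back in until an end-of-sequence token appears. The greedy-decoding loop can be sketched as follows; `score_next`, the tiny vocabulary, and the canned scores are stand-ins for a trained Transformer decoder, not our actual model.

```python
# Greedy autoregressive decoding sketch. `score_next` stands in for a
# trained Transformer decoder that, given image features and the tokens
# generated so far, returns a score for every vocabulary word.
VOCAB = ["<end>", "a", "dog", "running", "on", "grass"]

def score_next(image_features, tokens):
    # Stub: a real decoder would attend over the image features and the
    # partial caption. Here we simply follow a canned word order.
    order = ["a", "dog", "running", "on", "grass", "<end>"]
    target = order[min(len(tokens), len(order) - 1)]
    return [1.0 if w == target else 0.0 for w in VOCAB]

def generate_caption(image_features, max_len=20):
    """Repeatedly pick the highest-scoring next word until <end>."""
    tokens = []
    for _ in range(max_len):
        scores = score_next(image_features, tokens)
        word = VOCAB[max(range(len(VOCAB)), key=scores.__getitem__)]
        if word == "<end>":
            break
        tokens.append(word)
    return " ".join(tokens)

print(generate_caption(None))  # → "a dog running on grass"
```

In production, greedy decoding is often replaced by beam search, which keeps several candidate captions in parallel and usually yields more fluent sentences.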

Creating an image caption generation application using deep learning, along with its front-end application built on the respective technologies, has been a challenging yet fulfilling endeavour that allowed the members of this team not only to learn and implement new things but also to build character and learn to work as a team. Producing a well-trained deep learning model capable of discerning image features and, at the same time, using those features to generate grammatically correct, semantically meaningful sentences is a remarkable achievement. This project serves as a stepping stone for future developments in web technologies and deep learning, as there is always scope for improvement, be it in image feature extraction, the generation speed of the model, or the quality of the captions it produces. We hope this project marks a milestone in the years to come, and that the hard work and creative ingenuity behind it will be remembered with fondness.