What is semantic search and why do we want it

You’re searching for poems that talk about cutlery. Because you can. Also because a lot of the time you don’t remember the precise detail or term used in a document that you want to find. And also because compiling a list of poems about our relationship with food-ingestion-tools would be fun.

Maybe it’s not poetry you care about but the name of the company your boyfriend works at, which he told you about three months ago but you can’t find in your WhatsApp chats. And obviously, it would be super awkward to ask him now.

We have disparate search needs is what I am trying to say.

Lexical search: or how most search works

Definitions:
Query: The text we enter into search fields when we want stuff found
Document: The file, website, chat, image, meme stash or entity that has the terms or could have the right answers to your search
Corpus: The collection of all the documents that need to be searched across various searches (your whole website, for example)
On with the show!

The standard way in which we do search is by matching the words or terms in a query with the terms in the corpus. So if you search for poems about cutlery, the search will look for documents that have the words poems and/or cutlery and try to find ones that contain both words.

Imagine, though, that the program has to scan through every document every time someone searches for something, hunting for instances of the word from scratch. This would be pretty time-consuming and also slightly stupid. So we need a way to store what words are in what documents.

Besides just matching or finding the word, we also need some way of knowing which document matches it better. This is called ranking. To solve both these problems, people came up with the idea of indexes (or indices, if you’re a Latin snob).

What an index does is, it looks over all the documents in a corpus and creates a brief representation of the content of each document. So instead of searching over the whole document, now you just search over the representations. These representations also have some way of ranking the quality of the fit between the query and the document.

An example of such a representation1 is TF-IDF (Term Frequency–Inverse Document Frequency). What this does is,

  1. Lists all the words in each of the documents in a corpus and how frequently they occur in that document (term frequency).
  2. Calculates how common or how rare each of these words is across the whole corpus (inverse document frequency). The common words get a score of 0, or close to zero, and the rare words get high scores. 2

Why do we do this? In most documents, the words that appear most frequently would be things like the, and, is and stuff that doesn’t add any discriminatory power to the retrieval.

When I search for poems with cutlery there are going to be a whole lotta poems with the word with in them, so I can’t use with to rank the documents in any way. But the word cutlery probably appears in a small number of poems, so it is a good word for ranking the results.

So now your index of poems doesn’t just have the whole text of the poem but a list of the non-zero scoring words along with the scores, which allows us to get the top matches.
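To make that concrete, here is a toy sketch of TF-IDF scoring in Python. The three-poem mini-corpus and the query are made up purely for illustration; real libraries add smoothing and normalization tricks, but the skeleton is the same.

```python
import math
from collections import Counter

# A made-up three-poem corpus, purely for illustration.
docs = {
    "ode_to_a_spoon": "silver spoon resting by the teacup",
    "fork_in_the_road": "the fork waits where the road divides",
    "rain_song": "the rain falls on the quiet town",
}

# Term frequency: how often each word occurs in each document.
tf = {name: Counter(text.split()) for name, text in docs.items()}

# Inverse document frequency: words in every document score 0,
# rare words score high (engineers really do love logarithms).
n_docs = len(docs)
vocab = {word for counts in tf.values() for word in counts}
idf = {
    word: math.log(n_docs / sum(1 for counts in tf.values() if word in counts))
    for word in vocab
}

def score(query, doc_name):
    """Sum tf * idf over the query words; higher means a better match."""
    return sum(tf[doc_name][w] * idf[w] for w in query.split())

# "the" is in every poem, so it contributes nothing;
# "spoon" is rare and therefore decisive.
ranked = sorted(docs, key=lambda name: score("the spoon", name), reverse=True)
```

Running this puts ode_to_a_spoon first, because the only discriminating query word is spoon: the word the gets an IDF of exactly zero and drops out of the ranking.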

The most used and probably the best of these TF-IDF algorithms is the Okapi BM25 family. They are absolutely fantastic at finding you the documents that best match the words you search for.

But what if you don’t know the precise thing you are searching for? This is also a problem as old as misplaced keys. One way to solve this would be to have some kind of a thesaurus. You search for cutlery  and we expand the terms to include all the synonyms of the word. This is called query expansion. So instead of sending the term cutlery to the index, we can send cutlery, knife, fork, teaspoon and see what the index gives us.
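Here is a minimal sketch of query expansion, with a hand-written toy thesaurus (real systems pull synonyms from resources like WordNet or from mined co-occurrence data):

```python
# A toy hand-written thesaurus; real systems would use WordNet
# or synonym lists mined from a corpus.
synonyms = {
    "cutlery": ["knife", "fork", "spoon", "teaspoon"],
}

def expand_query(query):
    """Replace each query term with itself plus any known synonyms."""
    terms = []
    for word in query.split():
        terms.append(word)
        terms.extend(synonyms.get(word, []))
    return terms

expand_query("poems cutlery")
# -> ['poems', 'cutlery', 'knife', 'fork', 'spoon', 'teaspoon']
```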

But that means now you need to figure out which words you want to find synonyms for, how to store them, how to update them periodically, etc. And what if you need to find synonyms of phrases, not words, e.g. our relationship with modern day digital technology? That would make for a very large search query if we expanded each word. Still, this is a pretty damn good method, is used widely, and is part of how Google returns close matches.

Another problem could be that you might misspell the word, or that your way of spelling antidisestablishmentarianism3 is different from the way it appears in the text, say “anti-dis-establishmentarianizm” or whatever. The usual solution to this is to add some “fuzzy” logic or approximate matching. This works by searching for words that are off or different by a few letters. Unfortunately this means that if you search for costly coat cuts, you’ll also get results for costly cat guts 4. Still, this too is pretty damn good and is usually a part of most search systems.
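Python’s standard library can demo the idea: difflib’s similarity ratio stands in here for the edit-distance style matching real engines use, and the little vocabulary is made up.

```python
from difflib import get_close_matches

# A made-up vocabulary of indexed words.
vocabulary = ["antidisestablishmentarianism", "cutlery", "poetry"]

# cutoff sets how similar a candidate must be (1.0 = identical);
# lower it too far and you start pulling in cat guts for coat cuts.
matches = get_close_matches(
    "anti-dis-establishmentarianizm", vocabulary, n=1, cutoff=0.8
)
# matches -> ['antidisestablishmentarianism']
```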

You could also offer spelling correction, as Google often does. But the problems with autocorrect are well known.

I am compelled to point to yet another problem: besides synonyms, there are words that are structurally similar, or stem from the same word. These too might make a good fit or a bad fit for a particular search query based on the context. For example, if you’re searching for load bearing walls, it might be good to get results for load bearing tiles, but not good to get bear loading 5

Anyway, so what we need is a way to represent text that can account for

  1. Variations in the way a word is written
  2. Contextual meaning of words (A coat of paint vs a coat of mink)
  3. Spelling mistakes
  4. Synonyms

All these things are present in the way we write and store information. So if you were to take a sufficiently large sample of things people have written, you would find all of these things in this sample. This idea is what gives us semantic search.

Enter AI!

Created with Dall-E

Kidding, AI is a marketing myth, who wants to buy “statistical-learning models”?

But that is what we do: we take a large corpus of text, and then derive a statistical model of the relationships between words in a multi-dimensional space. And we do this by masking text and getting a model to fill in the blanks. No, seriously, that’s it. This is the basis of all LLMs.

What this does is make a model memorize, and therefore extract, all the million ways in which any given word can be used in a language, and also how each word in the corpus is related to every other word. This is not something we can manually compute, and that is why we throw artificial neural networks at it.

Now, most LLMs and other things that are called AI these days don’t stop here; they go on to predict specific things. But if you take off those prediction bits from the model, what you have left is called an embedding, which is a multi-dimensional representation of whatever it is that you trained the thing on. Engineers, like many of us suffering from grievous mathematics envy, call these representations vectors.

The way these work is that words and documents that are similar in meaning are represented closer to each other, or clustered. But this happens over many dimensions, so it’s not a simple 1 or 2 D relationship it captures.

See how the word noticed is close to, and connected to all the words that are similar in meaning. The above is a screenshot of one such embedding. I strongly urge you to explore this. Link: Word2Vec Tensorboard

Another tool to understand how words can be represented in multiple dimensions is this embedding explorer from Edwin Chen.

What is most interesting to me is that this rich and pretty accurate representation is produced using something as simple as fill-in-the-blanks.

Searching for things using the information implicitly encoded in these vector spaces is called semantic search by software and ML people.6

So if you don’t want to manually manage everything from synonyms to contextual meaning, you could take a vector space trained on a very large corpus, convert your corpus into that kind of vector, and then search over those vectors. This conversion just borrows the relationships already discovered by the model, which means its representations will be much richer than the ones in your corpus.7

These vector spaces are not like the indexes we spoke of earlier, they don’t have any explicit way of saying word x is present in document z and the score is high.

The search works by identifying documents that are “close” to the query in the vector space. The most commonly used metric to determine this distance is cosine similarity. But even if you can calculate the cosine distance between two vectors, doing this calculation for every document in your corpus against every query would be mind-numbingly boring and also time and resource consuming. So we need some way to find the location in the vector space that is most likely to have good results, and get there quickly.
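Cosine similarity itself is just a dot product scaled by the vector lengths. A sketch with made-up three-dimensional “embeddings” (real ones run to hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = dot(a, b) / (|a| * |b|); 1.0 means pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, not output of any real model.
spoon = [0.9, 0.1, 0.0]
fork = [0.8, 0.2, 0.1]
rain = [0.0, 0.1, 0.9]

cosine_similarity(spoon, fork)  # high: the vectors point the same way
cosine_similarity(spoon, rain)  # near zero: almost unrelated
```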

This is called approximate nearest neighbor search, and there are a bunch of great algorithms that have come up in the last few years that let you search over humongous datasets really fast.

The ones I like the most are Neighborhood Graph and Tree (NGT) and Hierarchical Navigable Small Worlds (HNSW)

Both of them work by dividing up your data into small clusters, or trees and then searching in those trees or clusters. But really, it’s a lot more complicated than that and to be honest I don’t really understand graph theory enough to know what they do. But what they do they do do well.
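I won’t attempt HNSW here, but the divide-into-clusters intuition can be sketched: pick a few reference points, bucket every vector under its nearest one, and at query time only search the closest bucket. Everything below (the vectors, the two-cluster choice) is a made-up illustration, not how NGT or HNSW actually build their graphs.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_index(vectors, n_clusters=2, seed=0):
    """Bucket every vector under its nearest randomly chosen reference point.
    Real ANN indexes refine these buckets and add graph links on top."""
    random.seed(seed)
    centroids = random.sample(vectors, n_clusters)
    clusters = {i: [] for i in range(n_clusters)}
    for v in vectors:
        nearest = min(range(n_clusters), key=lambda i: dist(v, centroids[i]))
        clusters[nearest].append(v)
    return centroids, clusters

def ann_search(query, centroids, clusters):
    """Only look inside the bucket whose centroid is closest to the query."""
    nearest = min(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    return min(clusters[nearest], key=lambda v: dist(query, v))
```

The payoff is that a query touches one bucket instead of the whole corpus, which is why the result is "approximate": the true nearest neighbor can occasionally hide in a bucket you never looked at.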

This might remind you of indexes, and that is indeed what these create, just ones built out of vectors, graphs and trees instead of words.

image from Pinecone’s guide to semantic search, linked below

Summary

  • You embed your corpus onto an existing language model’s vector space
  • You then index your embedding using an ANN algorithm like NGT or HNSW
  • And then you can search your corpus using semantic querying.

A demo

If you’ve managed to stay here this long, I urge you to explore the subject using a live demo. MoodMuse is a small app I made to learn and demonstrate the topic. It finds you poetry, based on the meaning of your words, not the terms alone.

Thanks for reading.

This cat does not exist. Created with https://deepai.org

Further reading and resources

Footnotes

  1. This is not the most commonly used representation; sometimes you just store the title of the document, sometimes you store a description. It depends on the needs of the system and users. ↩︎
  2. Engineers really love logarithms. We should too ↩︎
  3. I am, of course a disestablishmentarianist and several centuries behind on news ↩︎
  4. If you’re into coat cuts. Whatever they are, no judgement here. ↩︎
  5. I know, my metaphors need work. They are unemployed ↩︎
  6. Complicating the matters is that before there were large neural network based embeddings, there were the semantic web people and the ontology people and other people who wanted to solve this using structured representation of text. So software people are kinda stealing the term a little bit. More info: A survey in semantic search technologies ↩︎
  7. Thankfully there is a huge community of NLP nerds who like making these kinds of models easy to use (if you can Python, that is). ↩︎

Ethical AI: Basic readings, schools of thought, an overview

Introduction

Over the last few years, everyone I speak to seems to know about AI. Ethical issues around AI are being actively discussed and debated, not just in the professional sense but also around coffee tables and at informal discussions by users.

This post is an attempt to provide an overview of the issues and approaches.

What we have now are essentially three interfaces of, or approaches to, ethics in AI.

Besides the differences in budgets, access to VC funding and who gets favorably written about in the NYT, 1 the main difference between the various factions (strange to see factions in ethics, but that is what we get for trying to pre-print our way into a science) is the temporal profile and the concreteness of the problems they are talking about.

Anyway my irritations aside, the ̷f̷a̷c̷t̷i̷o̷n̷s̷  approaches or interfaces are 2

  1. The professional ethics people
  2. The AI risk People
  3. The AI alignment n̸u̸t̸s̸ people

Professional AI ethics

The professional ethics people are dealing with the immediate and current harms. They focus on identifying such harms and developing frameworks and knowledge that can be used to improve things now and in the future. They are a lot like the bioethics people.

One group of professional ethics people makes guidelines. Another group fights big tech. Back then, when LLMs were not all that AI was, and people were using regression models to predict recidivism, decide who to hire, and identify people from video surveillance, these people were studying the harms of such systems and talking about what to do about them.3

Overview

Suresh, H., & Guttag, J. (2021). A framework for understanding sources of harm throughout the machine learning life cycle. In Equity and access in algorithms, mechanisms, and optimization (pp. 1-9).

This paper provides a model for understanding the harms and risks that arise in different parts of the model training and deployment process. This is a good high-level overview.4

Recidivism and algorithmic bias

  1. Algorithm is racist: ProPublica
  2. Algorithm’s feelings are complicated, results have a sensitivity/specificity tradeoff that is poorly studied: Washington Post

Linguistically encoded biases:

For many language models, as well as image generation models, King – Man + Woman = Queen and Man :: Computer Programmer as Woman :: Homemaker. These are looked into in the following papers:

  1. Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183-186. DOI:10.1126/science.aal4230 |Preprint on arxiv
  2. Manzini, T., Lim, Y. C., Tsvetkov, Y., & Black, A. W. (2019). Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. arXiv preprint arXiv:1904.04047. (pdf)
  3. Malvina Nissim, Rik van Noord, and Rob van der Goot. 2020. Fair Is Better than Sensational: Man Is to Doctor as Woman Is to Doctor. Computational Linguistics, 46(2):487–497.
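The analogy arithmetic these papers probe is literally vector addition. A toy sketch with made-up two-dimensional vectors (dimension 0 loosely “royalty”, dimension 1 loosely “gender”); real embeddings learn hundreds of dimensions from a corpus, biases included:

```python
import math

# Made-up 2-D toy vectors, not learned from any corpus.
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "boy": [0.05, 0.9],
}

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(target, exclude):
    """Most cosine-similar word, skipping the analogy's own inputs."""
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], target))

# king - man + woman lands next to queen
v = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
answer = nearest(v, exclude={"king", "man", "woman"})  # -> "queen"
```

The bias findings above are the exact same mechanism: whatever regularities (including stereotypes) the corpus contains end up as directions in the space.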

Pictorially encoded biases:

Facial recognition tech has a long history of being super duper racist, creepy, used for oppression, as well as not being very good. Tech companies, especially the superduperbig guys have been getting into this game and are releasing models that are seemingly better, but only if you’re a white male.

  1. Buolamwini, J., & Gebru, T. (2018, January). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency (pp. 77-91). PMLR.
  2. White D, Dunn JD, Schmid AC, Kemp RI. Error Rates in Users of Automatic Face Recognition Software. PLoS One. 2015 Oct 14;10(10):e0139827. doi: 10.1371/journal.pone.0139827. PMID: 26465631; PMCID: PMC4605725.

LLMs and their problems

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21). Association for Computing Machinery, New York, NY, USA, 610–623.

Google the authors, and what they went through when they came out against LLMs. This will give you a lot of info about the stakeholders involved. The paper itself highlights the issues with large language models.

What about some practical stuff?

A philosophical framework

While the debates go on, a zillion guidelines on how to deal with data and how to do things ethically have come up. Chances are you will be overwhelmed if you start looking at them. I haven’t found any specific frameworks that speak to me. Personally, I recommend and use the medical ethics framework along with the Spider-Man framework, which states that

cartoon image showing a closeup of someone's eyes, with captions on either side of the eyes reading "with great power" and "...comes great responsibility"; this is from a Spider-Man comic

That is as self explanatory as an ethical dictum gets. The medical framework has the following four principles

  1. Respect for autonomy – Don’t make stuff that interferes with the autonomy of the user. When you’re in a position to make decisions for them, do this only after clearly explaining the harms to them and with their consent. Consent is king.
  2. Beneficence – an AI programmer should act in the best interest of the end user and not of the employer
  3. Justice – Is about the distribution of scarce resources, the decision of who gets what. Which means if your product, algorithm or system worsens disparities between people and creates inequalities, do better than that, think very hard about how you’re using your power.
  4. Non-maleficence5 – to not be the cause of harm. Also, to promote more good than harm, to the best of your ability. Also known as Above all, do no harm

Checklists and guidelines

If you’re looking for checklists and stuff that you can start using immediately to do responsible/ethical AI here is a github repo with the best links Awesome AI guidelines

If you want just *one* guideline, read this: Mitchell, Margaret, et al. “Model cards for model reporting.” Proceedings of the conference on fairness, accountability, and transparency. 2019. Model cards for model reporting.

This paper is great for getting a conceptual understanding of the need for clear declarations about models. But it is too ambitious and a bit too bulky for wide use. Despite this, Hugging Face has implemented it on their website, although it isn’t always used. I recommend developing your own model card/checklist and attaching them to wherever you store your models.

This GitHub repo with links to responsible AI resources is one more resource that focuses on practical advice and frameworks: Awesome Responsible AI

Explainable AI

For me, one foundational source of ethical issues with deep learning models is that the algorithm is a black box. The interpretability of the predictions or results of a neural network is poor. And so the people working on model interpretability and explainability are also working on things that are critical for ethical decision making.

AI Risk

Image of SWORDS military robot from wikimedia commons
The SWORDS system allows soldiers to fire small arms by remote control from as far as 1,200 meters (3,937 feet) away. This example is fitted with an M249 SAW.

 
The risk people are talking about existential, military, and other big risks from AI. There is definitely some overlap between the risk people and the professional ethics people, but a lot of what they discuss is about future or possible harms. Still very concrete harms, and stuff that we can definitely see happening.

Think autonomous weaponized drones, robot soldiers, autonomous robot doctors, etc. The key here being to prevent things from getting out of hand when algorithms and robots are deployed to make autonomous decisions. I see this as robocop-dredd-dystopia prevention work.

Healthcare is another area where a great deal of harm could come from automation (a great deal of good too) and it is important that we think hard and work towards systems that safe-keep the interests of the patient.

This debate captures a great deal of nuance on  AI risk. Melanie Mitchell is a delight to listen to.

I don’t think there is enough work being done on real risk. We have a lot of thought leaders talking about it, but the engineering and the science of it are not getting a lot of traction.

AI Alignment

Imagine a world where an AI much smarter than us all has emerged and is currently demanding that everyone call it lord and master and pray to it thrice a day.

The AI alignment people are

  1. Figuring out how to prevent such an AI from emerging and
  2. How best to align the interests of such an AI with our own.

I am not kidding you.

There are some actual cults involved, and a lot of thought experiments so bizarre I cannot even. 6

The problem is that these folks completely ignore addressing the current harms, using the doom and gloom of this possible outcome.

This is a doomsday-cult kind of ideology. And sadly, some very big names in AI have signed on to this cult.

Worse, this approach is being used by some large players to superficially meet the demands of ethical AI, while completely sidestepping accountability for the issues that are relevant today. As you can imagine, there are many important people in governments all over the world for whom the biggest worry is an AI that will replace them. So those guys are also treating this like it’s a real and credible threat right now.

There is no doubt that we need to ask ourselves this question, about how we deal with systems that have more power than us. But I think the answer to those questions lies in building in accountability, transparency, safety, informed consent and things that we already know how to do pretty well and don’t do because it’s bloody inconvenient and costs money. I definitely believe that we need better engineering research into this, threat assessments, all that. But this issue is not so novel that we need to come up with an entirely new discipline which ignores and laughs at the stuff other experts on harm reduction are saying. That is stupid.

I am not going to link to any of the alignment cultists, but here are two analytical articles about them which I think are great, and they link to plenty of stuff that you can explore.

Leopold Aschenbrenner: Nobody’s on the ball on AGI alignment (this person works on an alignment team for one of the largest players)

Alexey Guzey: AI Alignment Is Turning from Alchemy Into Chemistry

I don’t really fully agree with these authors, but they make some sense, and what would an ethical guide be without stuff that one disagrees with?

Concluding remarks

I admire greatly the activists who are fighting the good fight against Big-LLM, I really do. But I do not like agendas for change and progress being drafted exclusively by activists on social media. Social media is like reality TV: what you get if you do all your intellectual debate there is some form of Donald Trump.

I think that at some point the IT/AI engineering profession is going to realize the same thing the medical world did: that if you don’t start doing things ethically, you will lose your power and create harms far beyond what you can imagine, and this shit will haunt you. If the tech world looks at the medical world, I guess it can see a lot of unethical stuff. That is our shame. But at the same time, I hope they will investigate the issue historically, and ask just how many checks and balances there are in healthcare to ensure the patient is not harmed. There is a lot to learn from the history of medical ethics.

Ethics makes sense because it improves systems. Great AI will be ethical, just like the best healthcare is ethical. And just like a doctor ultimately works with the patient’s best interest at heart, at some point AI engineers too will adopt this dogma, because it is the rational best choice and has a proven track record for reducing harm, and no one wants to build stuff that harms people. Also, the workers of the world have a lot more in common with each other than with the bossmen. 7

This little fella knows she looks FABULOUS

Image from: Welch S. C. & Metropolitan Museum of Art (New York N.Y.). (1985). India : art and culture 1300-1900. Metropolitan Museum of Art : Holt Rinehart and Winston.

If you wish to cite this post here is the citation

Philip, A. (2023, July 11). Ethical AI: Basic reading, schools of thought, an overview. Anand Philip’s blog. https://anandphilip.com/ethical-ai-basic-reading-schools-of-thought-an-overview/

Ok that is all I have for you now folks, please do comment and subscribe and like and share this on boobtube instamart dreads and feathers as you wish.

Footnotes

  1. i.e. who has what kind of power ↩︎
  2. Some people are calling this a schism, but it is not a schism ↩︎
  3. This work is then being carried forward by the actually-LLMs-are-not-that-great activists. ↩︎
  4. It however lacks the liberal-progressive-activism priorities ↩︎
  5. No relation with Angelina Jolie, having horns or dressing in black ↩︎
  6. I love thought experiments, they teach us a lot of things including that one must not confuse a thought experiment about a distant and remote possibility with something real and applicable now. ↩︎
  7. I am a bourgeois Malayali, can you blame me for bringing up Marx? ↩︎

What level is your EMR/EHR?

Level 1 EMRs

Warning: The following text is needlessly cruel

Level 1, the PITA: EMRs that double the work — systems where the EMR is used for documentation or compliance, but all the real work happens in written form, so this just increases the net amount of work done. [Admittedly, this is at times a problem of the administrative system, not so much the EMR itself.] (A large majority of EMR deployments in India are this kind.)

Level 2, the copycat: EMRs that successfully replicate all the physical record keeping, so work isn’t doubled, but this still worsens things, because replicating physical workflows in toto invariably makes everything slower. (This is what most new EMR implementations are.)

Level 3, the trying-very-hard-to-be-cool: EMRs that have some automation and shortcuts built in, like medication templates, discharge summary templates and suchlike, which in some areas significantly reduce the time taken to do a task and are overall nearly as fast as paper and pen. (I’ve seen a few of these.)

Level 4, the we-do-design-thinking-and-listen-to-the-money: EMRs that have thought through the clinical processes and removed all parts that do not need to be there, and have largely click-and-pick workflows. These are faster to use than other EMR/EHRs and might even be faster than pen and paper. (I’ve seen maybe one of these.)

Level 5: EMRs that are faster than pen and paper, assist and improve decision making, and make things easier for patients. (Haven’t met one of these so far.)

Small steps, large impact: the Linux story

On 25 August 1991, a nobody named Linus Torvalds did something bold, without any clue about what he was setting in motion.

Hello everybody out there using minix –
I’m doing a (free) operating system (just a hobby, won’t be big and
professional like gnu) for 386(486) AT clones. This has been brewing
since april, and is starting to get ready. I’d like any feedback on
things people like/dislike in minix, as my OS resembles it somewhat
(same physical layout of the file-system (due to practical reasons)
among other things).
I’ve currently ported bash(1.08) and gcc(1.40), and things seem to work.
This implies that I’ll get something practical within a few months, and
I’d like to know what features most people would want. Any suggestions
are welcome, but I won’t promise I’ll implement them
Linus (torva…@kruuna.helsinki.fi)
PS. Yes – it’s free of any minix code, and it has a multi-threaded fs.
It is NOT protable (uses 386 task switching etc), and it probably never
will support anything other than AT-harddisks, as that’s all I have :-(.

yes, it says what you think it does

Here is something we know today about human beings: we can never predict the future with any reasonable degree of surety. Even in situations where the outcome seems to be “either this, or that”, we cannot deduce how things will turn out. This does not stop us from trying, though, or from reinterpreting the past to make sense of the present/future.

The story of Linux is a story of how a simple action can lead to worldwide change. Linux is not just about software now; it gave birth to philosophies, lifestyles and much more.

Linux.com as well as the Linux foundation have some great articles, infographics and videos up celebrating the 20th anniversary of Linux.

The Lokpal bill situation seems to have drowned out the voices of Indian Linux lovers, and that is a sad thing.

I'll be celebrating 20 years of Linux with The Linux Foundation!

Image by  nitot