Searching by Intent

When I search with Mojeek, I tend to break my search into two broad steps.

One, locate the relevant pool of knowledge.

For example, I might search for PBS and filter by site.

  • Search for “pbs”
  • Click on the link “See more results from www.pbs.org »”

And, step two, search that pool for my subject.

  • Update the search terms to “site:www.pbs.org baratunde thurston”
  • Search

This works well when the pool of knowledge that I’m targeting is confined to a particular website.
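
For anyone who wants to script the two steps above, here is a minimal sketch in Python. It assumes only what appears in this thread: the `https://www.mojeek.com/search` endpoint, its `q` parameter, and the `site:` operator. The function name `mojeek_site_query` is made up for illustration.

```python
from urllib.parse import urlencode

MOJEEK_SEARCH = "https://www.mojeek.com/search"


def mojeek_site_query(host: str, terms: str) -> str:
    """Build a search URL that restricts `terms` to a single host."""
    query = f"site:{host} {terms}"
    # urlencode percent-encodes the colon and turns spaces into '+'.
    return f"{MOJEEK_SEARCH}?{urlencode({'q': query})}"


print(mojeek_site_query("www.pbs.org", "baratunde thurston"))
```

This only covers the case where the pool is a single, known website.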

What I would like to do is create my own pool on the fly.

Conceptually, I would first search for a broad category (step one) and then search within those results (step two).

I think this would address an issue where Mojeek sometimes jumbles search terms and produces unexpected results. I think that happens because I can’t communicate my intent to Mojeek: in my experience, I can’t effectively group terms together, and features like Focus are too cumbersome for one-off searches.

So, I thought these two-step searches might be a solution. I would just need a search operator that could capture the initial search. For example, if there were a link called “Search within these results” on the first results page, then the initial search would have to be reflected in the query on the next results page. This is exactly the relationship between “See more results from www.pbs.org »” and “site:www.pbs.org”.

I don’t have a convincing example today. But, I’ve thought about this before. And, I hope you can see the value in communicating intent to the algorithm without forcing it to guess at semantic meaning like other search engines do.

You might also think of this request as prefiltering your results or as a cascading search.

4 Likes

How about introducing an AND and an OR operator? For example, if | were the OR operator and you wanted to search multiple sites, you could do something like this:

query insite:example1.com|example2.com|example3.com

where any page on those 3 websites that includes the query would be included in the search results. With this, you can do that 2-step search, creating your own pool on the fly by adding the websites to the search terms.
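
Until something like this exists, a client could approximate the idea by expanding the pipe-separated list into one ordinary query per site and merging the results itself. The sketch below is only an illustration: the `|` syntax is the proposal above, not a Mojeek feature, it uses the existing `site:` spelling rather than `insite:`, and the function name `expand_site_alternatives` is made up.

```python
def expand_site_alternatives(query: str) -> list[str]:
    """Split 'terms site:a|b|c' into one 'terms site:x' query per host."""
    terms: list[str] = []
    hosts: list[str] = []
    for word in query.split():
        if word.startswith("site:"):
            # Hypothetical syntax: hosts separated by '|' after a single site:
            hosts.extend(h for h in word[len("site:"):].split("|") if h)
        else:
            terms.append(word)
    base = " ".join(terms)
    if not hosts:
        return [base]
    return [f"{base} site:{host}" for host in hosts]


for q in expand_site_alternatives("query site:example1.com|example2.com|example3.com"):
    print(q)
```

Each expanded query would still have to be sent separately, so this trades one long query for several short ones.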

Having an AND operator would further expand this capability. The current allin operator must be written at the beginning of the query, which limits its use. Having, for example, + as an AND operator would allow something like this:

query1 query2 intitle:query3+query4 insite:example1.com|example2.com|example3.com

to filter the results even more.

1 Like

This is a possibility because of Focus, so an operator could theoretically create a Focus, or use the same functionality, on the fly. I can’t say that it would be a priority given other things being worked on, but I’ve taken a note on it.

1 Like

No. This misses the point I’m making.

I recently searched for:

adam grant doctor vaccine

Hopefully, you can recognize that I mean the person, Adam Grant, and perhaps a few keywords, doctor and vaccine.

When I search with DuckDuckGo, the results reflect the fact that I’m talking about Adam Grant.

Whereas, when I perform the same search on Mojeek, the results reflect “adam” or “doctor”. The first one hundred results don’t even place the keywords Adam and Grant together.

In this specific case, I can use the query:

"adam grant" doctor vaccine

Those keywords allow Mojeek to find relevant information.

But this doesn’t answer my original request.

There, I made the point that I cannot communicate intent or reliably group keywords together. The reason is, on the first point, Mojeek is based on lexical not semantic meaning. And, on the second point, there is no explicit grouping operator.

While in this specific case I can use double quotes to trick Mojeek into producing results relating to Grant, the same trick won’t work, for example, on:

furniture store adirondack chair

These keywords will likely be jumbled by Mojeek. And using quotes—"furniture store"—won’t give me a list of stores.

This is why I made the argument about meaning and grouping. And it is why I suggested a cascading search. Mojeek can probably find furniture stores. And it can probably find Adirondack chairs. But, without a way to group terms or communicate meaning, there is no efficient way to search for Adirondack chairs in furniture stores.

I need the ability to create a set and then limit my second search to that set.

And, in the case of the chair, it would not make sense to invest the time to create a list of stores or to retain the list in a Focus per se. In order to compete with semantic search engines, I need to be able to create the relevant set on the fly.
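
To make "create a set and then limit my second search to that set" concrete, here is a rough sketch in Python. It assumes the set is simply the hosts of the first search's results, which is one possible reading of the request and not anything Mojeek implements. The result lists are made-up placeholders, and `hosts_of` and `search_within` are hypothetical names.

```python
from urllib.parse import urlsplit


def hosts_of(result_urls: list[str]) -> set[str]:
    """Step one: collect the hosts behind the first search's results."""
    return {urlsplit(url).hostname for url in result_urls}


def search_within(pool_hosts: set[str], result_urls: list[str]) -> list[str]:
    """Step two: keep second-search results hosted in the pool, in rank order."""
    return [url for url in result_urls if urlsplit(url).hostname in pool_hosts]


# Made-up results for "furniture store" and "adirondack chair".
store_results = ["https://store-a.example/about", "https://store-b.example/"]
chair_results = [
    "https://blog.example/diy-adirondack-chair",
    "https://store-b.example/adirondack-chair",
    "https://store-a.example/chairs/adirondack",
]

pool = hosts_of(store_results)
print(search_within(pool, chair_results))
```

The pool lives only for the duration of the two searches, so nothing has to be saved the way a Focus would be.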


2 Likes

By cascading search, do you mean you want to search for furniture stores first, then using those websites of furniture stores you found, you want to search for adirondack chair within them? I think that could work on my suggestion of an or operator, like this:

adirondack chair site:furniturestore1.com|furniturestore2.com|furniturestore3.com

where you could copy and paste the websites you found into the search terms. Though I can see how an advanced search feature, like having checkboxes next to the sites, would make it more convenient.

But how about introducing a parenthesis () operator, making it possible to do this:

(furniture store) (adirondack chair)

This would directly communicate the intent that “furniture store”, and its variations from semantic search and stemming, should be kept close together, and likewise for “adirondack chair”. It would also be simpler and faster.

What do you think of these? Did I miss your point again, or did I finally get it?

1 Like

In the particular case of “furniture store adirondack chair”, I think the practical solution is to only search for “adirondack chair”, and forget about the rest. First, finding something to buy is, for good or ill, the default expectation. Second, actual furniture stores are generally not tagged anywhere saying “this is an actual furniture store, as opposed to some other way of selling chairs” – if they don’t provide the information, there’s not much point in searching for it. (Anthropomorphizing a bit, if you ask Mojeek “Which of these are actual furniture stores?”, it can only respond “I don’t know either”.) Even if there were a registry of real stores, it would be constantly corrupted by non-store advertisers determined to get themselves on the list, or it would wait years between updates, or be silently abandoned.

But I’m pretty sure I understand the concept you’re getting at.

1 Like

I have an example where searching within a set of existing results would be useful.

The US Supreme Court term recently ended. And, a panelist in a roundtable discussion mentioned that a lower court judge responded to one of the decisions.

Mojeek did a good job of finding the case:

# Query
federal judge non compete since:20240630

# URL
https://www.mojeek.com/search?q=federal+judge+non+compete+since%3A20240630

# Decoded URL
https://www.mojeek.com/search?q=federal+judge+non+compete+since:20240630

# Result
Federal Judge in Texas Issues Injunction Putting FTC
https://www.jdsupra.com/legalnews/federal-judge-in-texas-issues-3019713/
crawled today
See more results from www.jdsupra.com »

Texas judge temporarily blocks government ban on non-compete
https://abc7.com/post/texas-judge-temporarily-blocks-government-ban-compete-agreements/15028372/
crawled 1 day ago

Judge delays ban on employee non-compete agreements
https://westsidepeoplemag.com/judge-delays-ban-on-employee-non-compete-agreements/
crawled 1 day ago

Federal Judge Considering Stay of FTC Non-Compete Rule
https://www.gkglaw.com/publications/820-federal-judge-considering-stay-ftc-non-compete-rule
crawled today
See more results from www.gkglaw.com »
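
The URL and Decoded URL lines above differ only in percent-encoding. As a small sketch, Python's standard library can recover the query text from a search URL; the helper name `query_from_url` is made up, and the only assumption is that the query lives in the `q` parameter, as it does in the URLs in this post.

```python
from urllib.parse import parse_qs, urlsplit


def query_from_url(url: str) -> str:
    """Return the decoded search terms from the 'q' parameter of a URL."""
    params = parse_qs(urlsplit(url).query)
    # parse_qs decodes '+' to a space and '%3A' to ':'.
    return params["q"][0]


url = "https://www.mojeek.com/search?q=federal+judge+non+compete+since%3A20240630"
print(query_from_url(url))  # federal judge non compete since:20240630
```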

However, the story I read did not mention the Supreme Court.

At this point, I could simply add keywords such as “supreme court”. But, I suspect that those added keywords would overpower my existing ones. And, the results would be dominated by higher-ranked but less relevant results.

What I really want to do is to search within these existing results. That would keep the part that is working and which I like. And, it would narrow the results and allow me to focus just on the ones which are most relevant.


Going ahead and conducting the follow-on search, Mojeek produced the following results:

# Query
supreme court federal judge non compete since:20240630

# URL
https://www.mojeek.com/search?q=supreme+court+federal+judge+non+compete+since%3A20240630

# Decoded URL
https://www.mojeek.com/search?q=supreme+court+federal+judge+non+compete+since:20240630

# Result
Federal Cases (but not US Supreme Court) | Drew Capuder's
https://dcemploymentlawblog.com/category/federal-cases-but-not-us-supreme-court/
crawled 1 day ago
See more results from dcemploymentlawblog.com »

New York Court Declines to Enforce Non-Compete Clause Against
https://www.thenjemploymentlawfirmblog.com/new-york-court-declines-to-enforce-non-compete-clause-against-former-employees/
crawled 2 days ago

Federal High Court and National Industrial Court lack
https://pmnewsnigeria.com/2024/06/21/federal-high-court-and-national-industrial-court-lack-jurisdiction-on-chieftaincy-matters/
crawled 1 day ago

Supreme Court of N.H. v. Piper :: 470 U.S. 274 (1985) :: Justia
https://supreme.justia.com/cases/federal/us/470/274/
crawled 1 day ago
See more results from supreme.justia.com »

BREAKING: District Court in Texas Issues Injunction on FTC
https://www.ctemploymentlawblog.com/2024/07/articles/breaking-district-court-in-texas-issues-injunction-on-ftc-non-compete-rule-leaving-future-in-doubt/
crawled 1 day ago
See more results from www.ctemploymentlawblog.com »

My second result was published in 2018 but was only recently crawled.

My fifth result was recently published and relates the Supreme Court decision to the lower court injunction. This was the answer I was looking for.

Returning to the original results, ctemploymentlawblog dot com appeared as my result number nineteen. So, if I had been able to search within my existing results then it is likely that would have been the top search result.
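
Here is a rough sketch of what searching within existing results could mean in practice. It keeps the first result list fixed and filters it by the added keywords, so the original ranking is preserved among the survivors. The records below are invented placeholders, not the real pages, and `refine` is a hypothetical name. A real implementation would match against the indexed page text, which I am only assuming is available.

```python
def refine(results: list[dict], extra_terms: list[str]) -> list[dict]:
    """Keep results whose text contains every extra term, in original rank order."""
    wanted = [term.lower() for term in extra_terms]
    return [r for r in results if all(t in r["text"].lower() for t in wanted)]


# Invented stand-ins for a first result list; only the ranks matter here.
results = [
    {"rank": 1, "url": "https://one.example/", "text": "Judge issues injunction against the rule"},
    {"rank": 2, "url": "https://two.example/", "text": "Judge delays ban on agreements"},
    {"rank": 19, "url": "https://nineteen.example/", "text": "District court cites Supreme Court decision in injunction"},
]

refined = refine(results, ["supreme court"])
print([r["rank"] for r in refined])  # [19]
```

Because the added keywords only filter and never re-score, they cannot overpower the keywords that produced the first list.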


With respect to the replies I’ve missed, I want people to think about what they would do if they had a set of Mojeek results and wanted to search within those results. That is the issue I’m trying to focus on and solve.

And, my additional concern is that I would like the search process to remain simple and not involve complicated and long queries.

Also, please keep in mind that Mojeek is mostly a lexical search engine. And the implicit meaning humans use in everyday speech is not handled there. That is by design. And that’s fine. Only, there are opportunities to use lexical search more effectively. And I hope this thread demonstrates one case.

As a sidenote, if you’d like to know more about making the Web machine-readable then you might try reading this article about the semantic web:

2 Likes

mike brings up a very interesting issue here, and maybe more than one

the issue is not only with mojeek, but it seems mojeek tends to treat search terms with OR rather than AND and i find this to be a problem - i notice this a lot when using eTools for odd-ball searches where all of my selected engines return few, if any results, while mojeek often tends to return several, all of which are irrelevant because some of the terms are ignored

if i include multiple search terms, i absolutely expect all of the results to include them as well

i sort of like the ‘+’ operator idea (rather than dbbl quotes) because it could provide some sequencing of the terms, but maybe there’s a better way???

the “jumbling” of terms that mike mentions is perhaps another issue - there are many times where i want the terms to be sequenced so the first term needs to be found in the document before the second term, and so on, but all should be equally weighted - potentially this could be addressed by numbering the terms…

1.<term> 2.<term> ... (same as 2.<term> 1.<term> ...)

or perhaps just a single character, such as preceding the pre-ordered search terms with a dot …

.term1 .term2

… and a result page might be “term1 and the dog bit term2”
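
To pin down what the sequencing idea would mean, here is a small sketch in Python of the check a page would have to pass. The dot syntax is only the proposal above, the function name `contains_in_order` is made up, and the sketch matches plain substrings rather than whole words to keep it short.

```python
def contains_in_order(text: str, terms: list[str]) -> bool:
    """True if every term occurs in the text, each one after the previous."""
    lowered = text.lower()
    position = 0
    for term in terms:
        found = lowered.find(term.lower(), position)
        if found == -1:
            return False
        # Continue searching after the end of this match.
        position = found + len(term)
    return True


page = "term1 and the dog bit term2"
print(contains_in_order(page, ["term1", "term2"]))  # True
print(contains_in_order(page, ["term2", "term1"]))  # False
```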

YES! by default (no AND), AND should be implicit and OR is needed also - for ex. cat|dog seems to currently return only pages that include both … and as the poster said, ‘|’ could also be extended to work with site: searches

term site:a.com|b.com (the equivalent of term site:a.com site:b.com)

yeah, again mojeek seems to make OR implicit rather than AND, however i don’t think that mojeek should assume “adam grant”, the phrase, unless it’s quoted, or “adam” “grant”, the sequence, unless the user says so (1.adam 2.grant)

off topic a bit, but what if mr. grant has a middle name? …

"adam * grant" ← one wildcard for one unknown word

wouldn’t that be the same as dbbl quoting the 2 phrases?

2 Likes

I think that @snatchlightning was suggesting this syntax would first search for furniture store and then search within those results for adirondack chair which was my suggestion.

Whereas, double quoting each phrase would conduct one search where each phrase appeared in the results exactly as entered.

Technically, the only problem with parentheses is I think other search engines already use this for logical grouping such as:
supreme court (site:pbs.org/newshour OR site:bbc.com)
on DuckDuckGo. So, Mojeek might want to reserve parentheses for that use.

References


Note: I’m linking to my original Focus feedback above. And, in general, my reply there doesn’t make sense because I did not know what a Focus was. The only feature Focus provides is searching multiple websites. But, I did not have a working definition of Focus. I thought it was for saving search parameters and that site: was only the first of many. That makes sense because I mistook Focus for making a template out of the site: advanced search operator:

1 Like

AND is the default for Mojeek; there is no OR.

Wildcards can be used with * - take a look at:

https://www.mojeek.com/search?q="wooden+*+legs"

for example.
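
As an illustration of the "one wildcard for one unknown word" reading, the sketch below turns a quoted phrase into a regular expression in Python. This shows the intended semantics only. It is not a description of how Mojeek matches wildcards internally, and `phrase_to_regex` is a made-up helper.

```python
import re


def phrase_to_regex(phrase: str) -> re.Pattern:
    """Compile a phrase where each '*' stands for exactly one word."""
    parts = [r"\S+" if word == "*" else re.escape(word) for word in phrase.split()]
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


pattern = phrase_to_regex("wooden * legs")
print(bool(pattern.search("quality wooden furniture legs, sofa legs")))  # True
print(bool(pattern.search("wooden legs")))  # False
```

Under this reading, "adam * grant" would match "Adam M. Grant" but not "Adam Grant".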

1 Like

hmmm, perhaps the issue is how search term matches are considered - for example, searching for ‘sting bing coming’ produces results where the terms are parts of words rather than whole words and while in some cases that’s desirable, there’s clear cut cases where it isn’t

i can see where a search for these terms perhaps should include ‘comings’, but not ‘comes’, or ‘binge’ and certainly not ‘by’, ‘be’ or ‘b’, all of which were highlighted as keywords in the results

sure enough - thanks for that - but does it work in the way one expects given that the search terms are quoted? for example the search you provided returns …

… quality wooden furniture legs, sofa legs, chair legs, bed legs and …

“wooden furniture legs” is expected to be highlighted in the results, but the remainder, i’m not so sure

anyway, i think i’ve driven this thread a bit off-topic, so i’ll let you guys get back to it

1 Like

I found another example for my suggestion to offer search by intent.

When I search for “ubuntu snap store powershell” the keyword “ubuntu” pushes my preferred result (snapcraft.io/powershell) down the ranking.

Note that I have explicitly turned site clustering off with the URL parameter &si=0

# Query
ubuntu snap store powershell

# URL
https://www.mojeek.com/search?q=ubuntu+snap+store+powershell&cdate=1&dlen=0&rp_i=0&si=0

# Decoded URL
https://www.mojeek.com/search?q=ubuntu+snap+store+powershell&cdate=1&dlen=0&rp_i=0&si=0

# Result
Microsoft Brings PowerShell to the Ubuntu Snap Store - OMG!
https://www.omgubuntu.co.uk/.../07/install-powershell-ubuntu-linux-...
crawled 6 months ago

Install powershell on Ubuntu using the Snap Store | Snapcraft
https://snapcraft.io/install/powershell/ubuntu
crawled 1 month ago

Install powershell-preview on Ubuntu using the Snap Store |
https://snapcraft.io/install/powershell-preview/ubuntu
crawled 3 months ago

Install powershell on Linux | Snap Store
https://snapcraft.io/powershell    <<<< My preferred result.
crawled 2 months ago

What I’d want is more like the site: results for keywords with the same semantic meaning as above.

# Query
site:snapcraft.io powershell

# URL
https://www.mojeek.com/search?q=site%3Asnapcraft.io+powershell&cdate=1&dlen=0&rp_i=0&si=0

# Decoded URL
https://www.mojeek.com/search?q=site:snapcraft.io+powershell&cdate=1&dlen=0&rp_i=0&si=0

# Result
Install powershell on Linux | Snap Store
https://snapcraft.io/powershell    <<<< My preferred result.
crawled 2 months ago

Install powershell-preview on Linux | Snap Store
https://snapcraft.io/powershell-preview
crawled 3 months ago

PowerShell launches as a snap | Snapcraft
https://snapcraft.io/blog/powershell-launches-as-a-snap
crawled 1 year ago

Install powershell-preview on elementary OS using the Snap
https://snapcraft.io/install/powershell-preview/elementary
crawled 8 months ago

And, of course, the reason why I keep asking for a new operator is I don’t know the precise host (snapcraft.io) before I search. So, I’d want to create the set to search in with the keywords “ubuntu snap store”.

So the prospective search would be something like:

# Query 
ubuntu snap store AND powershell

I’ve changed my mind. And now I think it makes sense to use “AND” as the explicit search operator that would trigger the new algorithm.
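
As a sketch of how a client or front end might read such a query, the snippet below splits it into the part that builds the pool and the part that searches within it. The explicit “AND” operator is only my proposal, and `split_cascade` is a hypothetical name.

```python
def split_cascade(query: str) -> tuple[str, str]:
    """Split 'pool terms AND subject terms' into (pool, subject)."""
    pool, separator, subject = query.partition(" AND ")
    if not separator:
        # No explicit AND: treat the whole query as an ordinary search.
        return "", query.strip()
    return pool.strip(), subject.strip()


print(split_cascade("ubuntu snap store AND powershell"))
# ('ubuntu snap store', 'powershell')
```

The first element would drive step one (find the pool) and the second would drive step two (search within it).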

1 Like

I want to link to your blog post here. And say that the expanded number of supported sites can really help API users.

Also, as of right now, the API documentation is slightly out of date. fi now supports 100 sites per the blog post.



2 Likes

Thanks @mike for the reminder, this has been raised so it’s on our list.