However, when opened, the sites clearly show that the listed dates are wrong, and this problem appears on almost every result. Looking closely, the dates are last-crawled dates instead (e.g., the date 24 Jan 2024 on the 1st result is indeed 2 months ago). According to the blog post about search operators:
The date a page has will be either when Mojeek has noticed that it was last modified or, if the page has never been found as modified, when Mojeek first crawled the page. The dates do not reflect when a page was first published, although in some cases these won’t be far apart.
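Read literally, the rule quoted above could be sketched like this. It only illustrates the wording and is not Mojeek's code; `PageRecord`, `first_crawled`, `last_modified_seen` and `index_date` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PageRecord:
    # Hypothetical record; the field names are assumptions, not Mojeek's schema.
    first_crawled: date
    last_modified_seen: Optional[date] = None  # set only when a recrawl notices a change


def index_date(page: PageRecord) -> date:
    """Return the date attached to a page under the quoted rule."""
    if page.last_modified_seen is not None:
        # The crawler has noticed a modification: use that date.
        return page.last_modified_seen
    # Never seen as modified: fall back to the first crawl date.
    return page.first_crawled


# A page first crawled on 24 Jan 2024 and never seen as modified,
# regardless of when it was actually published:
never_modified = PageRecord(first_crawled=date(2024, 1, 24))
print(index_date(never_modified))  # 2024-01-24
```

Note that the publication date never enters the calculation, which is the behaviour being questioned here.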
Do you mean that if a page is never modified, Mojeek uses the first crawled date instead of the date the page was actually published? I find that logic, to put it kindly, extremely bizarre. Who, except the SEO folks, would prefer the first crawled date over the actual date? So here are my questions for Mojeek:
Is this a bug, or is this intentional?
If intentional, for what purpose? For the benefit of the users, or because of limitations of the search engine?
If it can’t be repaired immediately or at all, can you at least indicate it to avoid confusing users? Something like “We did not find any articles within the date you specified”, then a button if they want to use the first crawled date instead.
This is one of those things which is very easy for a human to pick out, but more difficult to automate and scale. It is the reason why this functionality since/before is on the operators page but not in a UI element.
This is correct: the date of a page is when it first entered the index.
This should improve as we crawl more, but it’s a case of where to allocate effort. We’re aware of it and return every now and again to thinking about ways it could be improved.
So:
intentional
limitations
the dates are all going to be last crawled/modified, so there would be no two-step process; there is only one date to offer. The text in Operators says:
to restrict results to those that were last modified since then
which I guess could be better as
to restrict results to those that were last modified since then (in cases where a page has not been modified this will be the date we found the page)
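To make that wording concrete, here is a rough sketch of how a since/before filter over that single date would behave. The function name and parameters are made up for illustration, and treating `before` as exclusive is an assumption; none of this is taken from Mojeek's implementation.

```python
from datetime import date
from typing import Optional


def matches_date_filter(page_date: date,
                        since: Optional[date] = None,
                        before: Optional[date] = None) -> bool:
    """Check a page's single stored date against optional since/before bounds.

    page_date is the last-modified date if a change was ever noticed,
    otherwise the date the page was found (hypothetical model).
    """
    if since is not None and page_date < since:
        return False
    # Assumption: `before` excludes the boundary day itself.
    if before is not None and page_date >= before:
        return False
    return True


# A page found on 24 Jan 2024 and never seen as modified matches
# "since 1 Jan 2024", even if it was written long before that.
print(matches_date_filter(date(2024, 1, 24), since=date(2024, 1, 1)))  # True
```

Because there is only one date per page, the filter cannot tell a recently written page from an old page that was only recently found.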
First, put “beta” on date operators. I don’t think operators is shorthand for incomplete features.
Second, instead of tucking it into the description, make it more prominent. Something like:
On Mojeek there are two ways in which you can narrow down the pages returned based upon dates. (Note: Currently, we can only check if the page has been modified AFTER the first crawled date, but we’re working on fixing this.)
I believe the proper word is “fix”, not “improve”, as it can produce wrong results, breaking basic assumptions about such a feature. Anything less would sound deceptive.
That said, I appreciate your admission of this problem, and would be willing to help test any possible solutions you find.
I don’t care when a page was last crawled. I can’t think of anyone I know who would care either. If “date” doesn’t always mean “latest update by the page’s owner”, I don’t see the point of offering it at all.
(I’m sure that last crawled dates are significant for those who build and maintain Mojeek, but that’s different.)
My reasoning: When I look at a Mojeek search result, I shouldn’t be having to worry whether Mojeek has done its job well; I should be focused on the content.
When using this feature it is the last modified date as determined by our crawl. As we’re not crawling/recrawling fast enough for some places, the date might not be accurate.
It will however be 100% accurate for some sites, ones where that recrawl is happening very fast. We are working on this specific area and it is becoming more accurate over time.
I understood that, and it’s not really what I was referring to.
Let’s say that some site was last updated twenty years ago. It’s no use to me to know that it was crawled last week or last month; if Mojeek isn’t going to show “2004” on it, then I’d regard all of Mojeek’s date information as corrupted.