nnir imap with gmail: query encoding

Discussion:

Carlos Pita

2014-09-01 17:41:35 UTC

Hi all,

I'm having some trouble getting nnir to work with non-ASCII queries.

In my Gmail account there is an email containing the word "diría" in the
"facultad" folder. When I search that folder for "diría" there are no
matches. Searching the same folder for "diria" gives no matches either,
so I have no way to search the folder for that word.

Is that supported? Should I configure anything encoding-related?

Cheers
--
Carlos

Carlos Pita

2014-09-01 18:09:12 UTC

Doing a raw IMAP search won't do the trick either:

CHARSET ISO-8859-1 BODY "diría" -> no matches
CHARSET UTF-8 BODY "diría" -> no matches

CHARSET ISO-8859-1 BODY "que" -> fine, lots of matches
CHARSET UTF-8 BODY "que" -> fine, lots of matches

Here is some info about my Emacs encoding:

Coding system for saving this buffer:
U -- utf-8-emacs

Default coding system (for new files):
U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for keyboard input:
U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for terminal output:
U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for inter-client cut and paste:
nil

Defaults for subprocess I/O:
decoding: U -- utf-8-unix (alias: mule-utf-8-unix)
encoding: U -- utf-8-unix (alias: mule-utf-8-unix)

Priority order for recognizing coding systems when reading files:
1. utf-8 (alias: mule-utf-8)
2. iso-2022-7bit
3. iso-latin-1 (alias: iso-8859-1 latin-1)
4. iso-2022-7bit-lock (alias: iso-2022-int-1)
5. iso-2022-8bit-ss2
6. emacs-mule
7. raw-text
8. iso-2022-jp (alias: junet)
9. in-is13194-devanagari (alias: devanagari)
10. chinese-iso-8bit (alias: cn-gb-2312 euc-china euc-cn cn-gb gb2312)
11. utf-8-auto
12. utf-8-with-signature
13. utf-16
14. utf-16be-with-signature (alias: utf-16-be)
15. utf-16le-with-signature (alias: utf-16-le)
16. utf-16be
17. utf-16le
18. japanese-shift-jis (alias: shift_jis sjis)
19. chinese-big5 (alias: big5 cn-big5 cp950)
20. undecided

Eric Abrahamsen

2014-09-03 02:01:58 UTC

Post by Carlos Pita
CHARSET ISO-8859-1 BODY "diría" -> no matches
CHARSET UTF-8 BODY "diría" -> no matches
CHARSET ISO-8859-1 BODY "que" -> fine, lots of matches
CHARSET UTF-8 BODY "que" -> fine, lots of matches

Are you doing the above in a telnet session, or where? IMAP uses a
variant of UTF-7 as its internal coding, maybe try with that and see
what happens?

FWIW, it doesn't look like nnir/IMAP works for me here, either, when
searching for Chinese characters.

Eric

Andrew Cohen

2014-09-03 14:00:02 UTC

Post by Carlos Pita
CHARSET ISO-8859-1 BODY "diría" -> no matches
CHARSET UTF-8 BODY "diría" -> no matches
CHARSET ISO-8859-1 BODY "que" -> fine, lots of matches
CHARSET UTF-8 BODY "que" -> fine, lots of matches

Eric> Are you doing the above in a telnet session, or where? IMAP
Eric> uses a variant of UTF-7 as its internal coding, maybe try with
Eric> that and see what happens?

Eric> FWIW, it doesn't look like nnir/IMAP work for me here, either,
Eric> searching for Chinese characters.

Eric> Eric

This isn't currently supported in nnir directly, but can be done with a
raw imap query. As I recall the trick is you can't use a quoted string
but have to use a literal; literals need to include the number of octets
(including CR and LF) in brackets following the search term (this is
what I understand from reading the imap RFC).
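
For illustration, here is a minimal Python sketch (not nnir's code) of what
such a literal looks like on the wire. In RFC 3501 syntax the octet count goes
in braces just before the data, and it counts only the octets of the search
term itself, not the CRLF that ends the line. The function name and the tag
"a001" are made up for the example.

```python
def build_literal_search(tag, key, term, charset="UTF-8"):
    """Return the two wire chunks for a SEARCH whose term is sent as a literal.

    The first chunk ends with the literal's octet count in braces. A real
    client must wait for the server's "+" continuation response before it
    sends the second chunk.
    """
    payload = term.encode(charset)  # count octets, not characters
    first = f"{tag} SEARCH CHARSET {charset} {key} {{{len(payload)}}}\r\n"
    second = payload + b"\r\n"      # this CRLF ends the command, it is not counted
    return first.encode("ascii"), second


first, second = build_literal_search("a001", "BODY", "dir\u00eda")  # "diría"
print(first)   # b'a001 SEARCH CHARSET UTF-8 BODY {6}\r\n'
print(second)  # b'dir\xc3\xada\r\n'
```

The count is 6 because "í" takes two octets in UTF-8; with ISO-8859-1 the same
word would be announced as {5}.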

Carlos Pita

2014-09-03 16:14:30 UTC

Post by Andrew Cohen
This isn't currently supported in nnir directly, but can be done with a
raw imap query. As I recall the trick is you can't use a quoted string
but have to use a literal; literals need to include the number of octets
(including CR and LF) in brackets following the search term (this is
what I understand from reading the imap RFC).

I've also tested it using Python's imaplib and there is no way to encode
the query inline: Gmail will complain about being unable to parse it as
soon as you move outside the ASCII charset, no matter the
encoding. Indeed, you have to use a literal to get it working, but I
have no idea how to patch nnir in order to do that.

Cheers
--
Carlos
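
A rough sketch of the imaplib approach described above, for readers who want
to try it. It assumes that imaplib appends whatever bytes are stored in the
connection's `literal` attribute to the next command as an IMAP literal; that
attribute is an implementation detail rather than documented API, so treat
this as an experiment. The function name `search_body` is made up here.

```python
import imaplib


def search_body(conn: imaplib.IMAP4, term: str, charset: str = "UTF-8"):
    """Search the selected mailbox for `term` in message bodies.

    `conn` must already be logged in and have a mailbox selected.
    Returns the matching message numbers as a list of bytes objects.
    """
    # Assumption: imaplib sends `conn.literal` as a literal at the end of
    # the next command, so the term never appears inline in the query.
    conn.literal = term.encode(charset)
    status, data = conn.search(charset, "BODY")
    if status != "OK":
        raise RuntimeError(f"SEARCH failed: {data!r}")
    return data[0].split() if data and data[0] else []


# Usage (needs real credentials, so it is left commented out):
# conn = imaplib.IMAP4_SSL("imap.gmail.com")
# conn.login(user, password)
# conn.select("facultad")
# print(search_body(conn, "diría"))
```

Because the literal is appended at the end of the command, this only works
when the non-ASCII term is the last argument of the search.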

Eric Abrahamsen

2014-09-06 04:26:44 UTC

Post by Andrew Cohen

CHARSET ISO-8859-1 BODY "diría" -> no matches
CHARSET UTF-8 BODY "diría" -> no matches
CHARSET ISO-8859-1 BODY "que" -> fine, lots of matches
CHARSET UTF-8 BODY "que" -> fine, lots of matches

This isn't currently supported in nnir directly, but can be done with a
raw imap query. As I recall the trick is you can't use a quoted string
but have to use a literal; literals need to include the number of octets
(including CR and LF) in brackets following the search term (this is
what I understand from reading the imap RFC).

Here's a patch that (mostly) makes this work. It's probably not ready
for application, but I'm fairly confident that the basic approach is
sound. I had to make changes to nnir-run-imap, because literals have to
be fed to the server on separate lines.

Things that bear examination:

1. Raw imap queries are still not touched. You couldn't put non-ascii
search terms in raw queries anyway, because, as mentioned above, they
have to be sent as separate lines to the server, and the current search
routine can't do that.

2. The 'coding' let-var that gets sent to the search command (the
argument to CHARSET) is a total guess on my part. It should be easy to
fix if my guess is wrong, though. Do we need to care about the coding
system used in the server's message storage?

3. I've only tested this with my local dovecot (2.2.13). Imap servers
are weird, and edge cases abound. It would be nice if someone could test
this directly with Gmail and Exchange, at least.

What are the future plans for nnir's mini-imap search language? Right
now it doesn't seem like a whole lot is gained by the
nnir-imap-search-arguments/default-search-key setup. It seems like it
would be simpler and more flexible to allow all valid imap search
criteria as lower-cased keys followed by a colon, which would make the
language look a little bit more like notmuch or what have you. Unknown
keys would be considered header values. Or something else like that --
mostly it just seems very limiting that we can specify at most one
criterion to search on.

Anyway, just curious what the plan is.

Eric
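
To make the "lower-cased keys followed by a colon" idea concrete, here is a
small Python sketch of how such a query could be turned into IMAP search
criteria. It is purely illustrative and not part of nnir. The set of known
keys, the function name `translate`, and the choice to map bare words to TEXT
are all assumptions made for the example.

```python
# Hypothetical subset of IMAP search keys that take one string argument.
STRING_KEYS = {"bcc", "body", "cc", "from", "subject", "text", "to"}


def translate(query):
    """Translate 'key:value' tokens into a flat list of IMAP search criteria."""
    criteria = []
    for token in query.split():
        key, sep, value = token.partition(":")
        if not sep:
            # Assumption: a bare word searches the whole message.
            criteria += ["TEXT", token]
        elif key in STRING_KEYS:
            criteria += [key.upper(), value]
        else:
            # Unknown keys are treated as header names, as suggested above.
            criteria += ["HEADER", key, value]
    return criteria


print(translate("from:carlos body:que x-mailer:gnus"))
# ['FROM', 'carlos', 'BODY', 'que', 'HEADER', 'x-mailer', 'gnus']
```

IMAP combines several search keys with an implicit AND, so a flat list like
this already expresses more than one criterion. Values containing spaces would
need quoting rules that this sketch does not attempt.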

Lars Ingebrigtsen

2015-01-27 05:37:36 UTC

Post by Eric Abrahamsen
2. The 'coding' let-var that gets sent to the search command (the
argument to CHARSET) is a total guess on my part. It should be easy to
fix if my guess is wrong, though. Do we need to care about the coding
system used in the server's message storage?
3. I've only tested this with my local dovecot (2.2.13). Imap servers
are weird, and edge cases abound. It would be nice if someone could test
this directly with Gmail and Exchange, at least.

I've applied your patch, so we'll get some testing. :-)

Post by Eric Abrahamsen
What are the future plans for nnir's mini-imap search language? Right
now it doesn't seem like a whole lot is gained by the
nnir-imap-search-arguments/default-search-key setup. It seems like it
would be simpler and more flexible to allow all valid imap search
criteria as lower-cased keys followed by a colon, which would make the
language look a little bit more like notmuch or what have you. Unknown
keys would be considered header values. Or something else like that --
mostly it just seems very limiting that we can specify at most one
criterion to search on.
Anyway, just curious what the plan is.

A simpler search language would be nice...

--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog http://lars.ingebrigtsen.no/

Eric Abrahamsen

2015-01-27 06:01:51 UTC

Post by Lars Ingebrigtsen

I've applied your patch, so we'll get some testing. :-)

Would this be the source of the flood of nastygrams I'm getting from the
buildbot? I guess we *did* get some testing!

Post by Lars Ingebrigtsen

A simpler search language would be nice...

And a unified one, that nnir could translate for its various
backends.

You'll forgive me if my patch-writing confidence is at a low ebb,
though. I've had a pretty poor batting average of late.

Lars Ingebrigtsen

2015-01-27 06:24:26 UTC

Post by Eric Abrahamsen
Would this be the source of the flood of nastygrams I'm getting from the
buildbot? I guess we *did* get some testing!

Yup. :-) I think I've fixed them now, but have a look over the code.

Post by Eric Abrahamsen
You'll forgive me if my patch-writing confidence is at a low ebb,
though. I've had a pretty poor batting average of late.

Never mind the buildbot. :-)

Lars Ingebrigtsen

2015-01-27 06:32:53 UTC

The remaining problem is that `default-process-coding-system' isn't
defined in XEmacs. Can that part of the patch be done in a different
way?

Eric Abrahamsen

2015-01-27 06:48:11 UTC

Post by Lars Ingebrigtsen
The remaining problem is that `default-process-coding-system' isn't
defined in XEmacs. Can that part of the patch be done in a different
way?

I'll take a look tonight!