Talk:Base64

From Wikipedia, the free encyclopedia

The PGP Mimic page (http://www.spammimic.com/encodepgp.shtml) shows one crazy use of Base64 encoding.

[edit] Split

The page currently is about two completely different things:

  1. the numeral system in base 64
  2. the base64 encoding
  • Actually, I think that the numeral system should be "Base 64" with a space, but I am not sure. Anyway, I think that a split with a disambiguation page is in order. Paolo Liberatore (Talk) 15:05, 30 November 2005 (UTC)
I'm not sure I'd call them completely different; after all, a binary file is really just a very big number if you think about it. The page describes a numeral system and then goes on to describe its uses. I see nothing wrong with this structure. Plugwash 15:47, 30 November 2005 (UTC)
I rather agree, particularly since the numeral system isn't particularly notable except for its use in the encoding. —Ilmari Karonen (talk) 15:52, 30 November 2005 (UTC)

Sorry folks, I made a mistake here: contrary to what I remembered, the numeral system used by the Babylonians was base 60, not 64 (which is also why we divide time into sixtieths). Obviously, there is not much to say about the numeral system, except that it is the basis of the base64 encoding. I will remove the split tag. Paolo Liberatore (Talk) 17:00, 30 November 2005 (UTC)

[edit] Base64 in freeware applications

I recently updated this article to include the wide range of possible uses for Base64, including in freeware applications like Mozilla and Thunderbird.

The simplest of examples:

 C:\Documents and Settings\<UNAME>\Application Data\Thunderbird\Profiles\<PRNUMBER>.default
 type signons.txt
 mailbox://henrique@venus
 \=username=\
 ~
 *\=password=\
 TW9ua2V5
 <UNAME> stands for your username in a Windows XP distribution, for example.

The password can be easily decoded, and is: Monkey.
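For illustration, reversing the stored string takes a single call in most languages. A minimal Python sketch using only the standard library base64 module (the variable names are illustrative, not taken from any Mozilla code):

```python
import base64

# The Base64 string stored in signons.txt, as quoted above.
stored = "TW9ua2V5"

# Base64 is an encoding, not encryption: no key is needed to reverse it.
password = base64.b64decode(stored).decode("ascii")
print(password)  # Monkey
```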

This does not detract from or diminish the great software provided by Mozilla (a subjective opinion, I know, which is why I posted it on this discussion page). The majority of users will not notice these security flaws, or even be bothered that their personal data is exposed to trojans on their desktops, which can decode these passwords quite easily and deliver them worldwide.

Of course both Mozilla and Thunderbird offer an option to use symmetric ciphers (considerably more difficult to decode) on all Managed Passwords.

I know this is not the right place for software considerations, but I found it remarkably interesting that even Mail User Agents (MUAs) use the basic concepts of mail encoding (which is what Base64 is mainly used for!) to obscure plain-text passwords.

Unless the user is asked to enter a key, the ONLY purpose encrypting the password serves is to prevent someone accidentally remembering a password they shouldn't when poking around in a config file. If someone has access to the encrypted password then they almost certainly have access to the key as well! Plugwash 01:26, 19 December 2005 (UTC)
It's my expectation that Mozilla considers the password to be stored in clear text, and uses Base64 not for 'encryption' but rather in case the password contains special characters (such as the space). Plugwash is right, of course: unless the user supplies a key, the password could easily be decrypted by reverse engineering the executable that reads it. 59.167.212.218 23:25, 4 August 2007 (UTC) (aka Calrion)

[edit] mIRC trojans

mIRC trojans often use Base 64 because mIRC has built-in functions for it: $encode(text,m) and $decode(text,m). The trojans are spread over /amsg (message to all channels) or private messages and rely on the naive trust of users. They try to make users run commands encoded in Base 64 by claiming, for example, that doing so will get them the latest Matrix movie or operator (administrator) status in a certain channel.

Some of them come in the form '//write somename $decode(Base 64 encoded script,m) || .load -rs somename', which installs a script that keeps spreading this code and sometimes comes with a backdoor. Other trojans hide the whole code by making use of brackets, '//[ $decode(Base 64 encoded commands,m) ]', and can run any commands. Then there are the ones that use $findfile to execute commands while appearing to be a harmless /echo: '//echo $($decode(Base 64 encoded $findfile mostly executing /amsg $cb,m),2)', where $cb is the clipboard content (usually the command itself) and $(...,2) evaluates the decoded $findfile.

Perhaps someone could add a note on this to the article. I have never written in a WP article and feel a bit lost.

[edit] UUU becomes VVVV

Maybe mention that $ echo -n UUU|base64-encode ;echo prints VVVV, and say why, just for the fun of it.
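As for the "why": the byte value of 'U' is 0x55, i.e. 01010101, so three of them give 24 strictly alternating bits, and every 6-bit group comes out as 010101 = 21, which is 'V' in the standard alphabet. A small Python sketch that walks through this (the variable names are illustrative):

```python
import base64

data = b"UUU"

# Each 'U' is 0x55 = 01010101, so three of them form 24 alternating bits.
bits = "".join(f"{byte:08b}" for byte in data)

# Regrouping the 24 bits into four 6-bit values gives 010101 = 21 each time.
indices = [int(bits[i:i + 6], 2) for i in range(0, len(bits), 6)]

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
print(indices)                                # [21, 21, 21, 21]
print("".join(alphabet[i] for i in indices))  # VVVV
print(base64.b64encode(data).decode())        # VVVV
```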

[edit] UTF, really?

"This data encoding scheme is used to encode the UTF-16" Is it really doing this? I doubt it. It's encoding Unicode code points, just like UTF-8, UTF-16 and UCS-2 do.

The RFC for UTF-7 seems to date back to the days before supplementary characters, so it's no help. Using UTF-16 surrogates would be the only sane way to support those planes in UTF-7 without massive changes, but I do not know if current implementations do so. Plugwash 18:24, 17 June 2006 (UTC)
UTF-7 is generally deprecated these days. Rootless 13:44, 18 June 2006 (UTC)

[edit] MIME Line breaks are <CR><LF>

From the article:

As newlines are inserted in the encoded data every 76 characters, the actual length of the encoded data is approximately 135.1% of the original.

To the best of my knowledge, MIME defines a line break as the character pair <CR><LF> (in that order). Therefore, every 57 bytes from the source are expanded to 76 Base64 characters + <CR> + <LF>, or 78 characters. This gives an expansion to approximately 136.8% of the original size (78/57), compared with 135.1% (77/57) if a line break were a single character.

Thiadmer Riemersma (thiadmer at compuphase dot com)

Googling and reading the article newline seem to verify this, so I modified the article accordingly. –Mysid(t) 18:21, 8 August 2006 (UTC)
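The two figures in this thread can be reproduced with a few lines of Python. This is only a sketch: mime_expansion is a hypothetical helper that wraps the output manually at 76 characters rather than going through any particular MIME library.

```python
import base64

def mime_expansion(n_bytes: int, newline: str) -> float:
    """Ratio of encoded size to raw size with a line break after every 76 characters."""
    encoded = base64.b64encode(bytes(n_bytes)).decode("ascii")
    lines = [encoded[i:i + 76] for i in range(0, len(encoded), 76)]
    wrapped = "".join(line + newline for line in lines)
    return len(wrapped) / n_bytes

n = 57 * 1000  # a whole number of 57-byte input lines
print(f"LF only: {mime_expansion(n, chr(10)):.4f}")    # LF only: 1.3509
print(f"CR LF:   {mime_expansion(n, chr(13) + chr(10)):.4f}")  # CR LF:   1.3684
```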

[edit] modified Base64 for URL

The section URL Applications contains a short paragraph about "modified Base64 for URL". However, according to the referenced page http://tools.ietf.org/html/rfc3548#page-6, it is wrong.

rfc3548 seems to think that URL and file name encodings use '-' and '_' instead of '+' and '/'. Not '*' and '-'.

And unless I am missing something, the RFC variant should also keep the padding '=', but as far as I know '=' is reserved in URLs... which would indicate that the current wiki text is more correct on that point.

In the Wiki text * is the character to use, but that is not right because * will get percent encoded by url encoders. Therefore I believe the correct base64 is as the RFC states —Preceding unsigned comment added by 207.58.192.150 (talk) 20:19, 17 January 2008 (UTC)

They thought about that too (section 5, paragraph 3):
The pad character "=" is typically percent-encoded when used in an URI [9], but if the data length is known implicitly, this can be avoided by skipping the padding; see section 3.2.
85.149.120.16 (talk) 23:11, 18 May 2008 (UTC)
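For what it's worth, Python's standard library implements the RFC's URL-safe alphabet ('-' and '_'), which makes the difference easy to see. A short sketch; the sample bytes are chosen only because their standard encoding happens to contain both '+' and '/', and the re-padding step is one possible way to handle skipped padding, not something the RFC prescribes:

```python
import base64

raw = b"\xfb\xff"  # chosen so that the standard encoding contains '+' and '/'

standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")
print(standard)  # +/8=
print(url_safe)  # -_8=

# Skipping the padding, as the RFC allows when the data length is known implicitly.
unpadded = url_safe.rstrip("=")

# Restore the padding before decoding: the length must be a multiple of 4.
restored = unpadded + "=" * (-len(unpadded) % 4)
decoded = base64.urlsafe_b64decode(restored)
print(decoded == raw)  # True
```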

[edit] Example

I felt the example wasn't quite as intuitive as it could be, so I created the table version. (Sometime reader, new to editing.) aes

The "Man" --> "TWFu" table is a great addition. Thanks. — Omegatron 22:03, 27 December 2007 (UTC)

[edit] Material added by Ultimater

I have reverted the addition of the following material by user:Ultimater. I think there may be some merit in it, but I also think some more attention should be paid to style and formatting before it is added to the article proper. E.g., use of whole-word capitalization, rhetorical questions, sentences starting with "Notice...", "Remember..." and the like should be avoided or limited.--Niels Ø 11:31, 23 August 2006 (UTC)

[edit] Added before heading "An example"

Also notice that the lengths of the outputs are all multiples of 4. Not only MUST every base64-encoded string consist of an even number of characters, the number of total characters MUST be evenly divisible by 4. The reason is that base64 is used to represent an exact binary sequence of data in groups of 8 bits.

[edit] Added before heading "UTF-7"

Remember: The text doesn't need to be exactly 3 characters in length. Notice the usage of the padding character.

Text content     M        a        (empty)
ASCII            77       97
Bit pattern      010011   010110   000100
Index            19       22       4
Base64-Encoded   T        W        E        =


Notice that the equals character (the padding character) is appended to the generated base64-encoded string and ONLY when there is an empty slot in the Text content. The padding character will NEVER appear in the middle or beginning of a base64-encoded string. The padding character can be totally OMITTED from your base64-encoded string and it will not harm the string's contents. The reason is that the number of unused bits can be recalculated. However it's always a good idea to include the padding character in your strings.

It's possible to have two padding characters but NEVER three:

Text content     M        (empty)  (empty)
ASCII            77
Bit pattern      010011   010000
Index            19       16
Base64-Encoded   T        Q        =        =

Why won't you encounter 3 padding characters? Because the string is read 3 characters at a time and 3 padding characters would translate as 000000 000000 000000 000000 which is "AAAA" and can be totally ignored -- however feel free to add as many extra A's or padding characters to the end of your base64-encoded string as you wish.

Let's have a second look at our previous example base64-encoded string again:

----+---10----+---20----+---30----+---40----+---50----+---60----+---70----76

TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlz
IHNpbmd1bGFyIHBhc3Npb24gZnJvbSBvdGhlciBhbmltYWxzLCB3aGljaCBpcyBhIGx1c3Qgb2Yg
dGhlIG1pbmQsIHRoYXQgYnkgYSBwZXJzZXZlcmFuY2Ugb2YgZGVsaWdodCBpbiB0aGUgY29udGlu
dWVkIGFuZCBpbmRlZmF0aWdhYmxlIGdlbmVyYXRpb24gb2Yga25vd2xlZGdlLCBleGNlZWRzIHRo
ZSBzaG9ydCB2ZWhlbWVuY2Ugb2YgYW55IGNhcm5hbCBwbGVhc3VyZS4=

Tell me, if you were to decode that string back into its original ASCII form, how many characters would it consist of? How long would it take you to turn that into a sequence of 0's and 1's, count the number of bits, divide it by 4 and then calculate the remainder so you know the number of unused bits? Who needs to count it!? Just count the number of padding characters at the end of the string (in this case one) and you will know the number of unused bits (two unused bits per padding character). Hence in this case, the length of the original string was 1 character short (the padding character is a blank slot) of being a multiple of three.
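The counting argument in the quoted material can be checked mechanically. The sketch below assumes a padded string with no line breaks; decoded_length is a hypothetical helper, not a library function.

```python
import base64

def decoded_length(encoded: str) -> int:
    """Length in bytes of the data behind a padded Base64 string (no line breaks)."""
    padding = len(encoded) - len(encoded.rstrip("="))
    # Every 4 characters carry 3 bytes; each '=' stands for one missing byte.
    return len(encoded) // 4 * 3 - padding

for text in (b"Man", b"Ma", b"M"):
    encoded = base64.b64encode(text).decode("ascii")
    print(text, encoded, decoded_length(encoded))
# b'Man' TWFu 3
# b'Ma' TWE= 2
# b'M' TQ== 1
```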

[edit] How are the last two rows in the diagram derived?

If you understand how the top three rows are derived it is fairly straightforward. If you look at the binary digits above the index row, the first 6 are 010011, which is 16+2+1, giving the decimal 19. Likewise the next 6 are 010000, which is decimal 16. These are the values in the index row.

If you take the set of characters from the article:

"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

Use the 19 to index into here (starting with A as zero, B as 1, etc.) and you get T, the character in the last row. Similarly 16 indexes to Q. Does the article need more step-by-step details? -- Q Chris 07:04, 19 April 2007 (UTC)
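The derivation above can be written out as a short Python sketch. It handles a single group of up to three bytes only, and encode_indices is a hypothetical name used for illustration:

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_indices(data: bytes) -> list:
    """Return the 6-bit index values for one group of up to three input bytes."""
    bits = "".join(f"{byte:08b}" for byte in data)
    # Fill with zero bits so the length is a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    return [int(bits[i:i + 6], 2) for i in range(0, len(bits), 6)]

indices = encode_indices(b"M")
chars = "".join(ALPHABET[i] for i in indices)
print(indices)              # [19, 16]
print(chars.ljust(4, "="))  # TQ==
```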

[edit] I don't understand the need for padding

I don't understand the need for padding. You can always tell the number of bytes of decoded chars from the encoded.

  • 1 byte clear-> 2 bytes encoded
  • 2 bytes clear -> 3 bytes encoded
  • 3 bytes clear -> 4 bytes encoded

Why did they think padding was needed? -- Chris Q 13:25, 13 December 2006 (UTC)

Good question. AFAICT the specs are silent on the matter too. The only reasons I can think of would be either poor understanding on the part of the creators or possibly an intention to allow encoded data to be concatenated without decoding. Plugwash 13:48, 13 December 2006 (UTC)
It is a decoding optimisation. For decoding, the input is always a multiple of 4 characters when you take padding into account. This means you can "read" the input as an int32_t in C. It also allows you to do a minor consistency check using the length of the input. -- James Antill 18:10, 14 December 2006 (UTC)
After having implemented a decoder I can confirm the optimization. It allows you to remove EOF/length checking from the decoding loop, giving a very small performance improvement. As far as the minor consistency check goes I am a little sceptical, as it would only work 2/3 of the time and most MIME usages allow non-MIME characters to be appended. -- Chris Q 12:12, 22 January 2007 (UTC)
I don't buy that it improves performance, since you could process the last chunk in a separate piece of code, outside the inner loop. However, I think it does allow the code to be simpler. I found a nice simple example of an implementation in VB at Base64Dec01. I spent a fair amount of time pondering this question myself before I found this discussion there. Perhaps it should be mentioned in the article? --LCarl 23:01, 9 February 2007 (UTC)
What if you want to read or interpret a base64-encoded file backwards? Without padding there is only one way to read the content, and that is from the start of the file. (r.saiprasad at gmail dot com)
True, but given the intended use of base64 there is little need to read it randomly or backwards. Plugwash 13:18, 6 March 2007 (UTC)
I think it's so that you can distinguish between a final null byte in the data (which should be retained) and a null byte added to make up a triad to be encoded. A base64 sequence "AA==" should be distinct from "AAAA". --wintermute (talk) 21:29, 14 December 2007 (UTC)
"AA" and "AAA" are already distinct from "AAAA" —Preceding unsigned comment added by 66.235.5.33 (talk) 17:19, 12 January 2008 (UTC)
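To make the points in this thread concrete, here is a small Python sketch. decode_unpadded is a hypothetical helper that simply re-adds the padding implied by the length. It shows that the unpadded forms stay distinct, and that the concatenation case Plugwash mentions is where the padding carries real information.

```python
import base64

def decode_unpadded(encoded: str) -> bytes:
    """Decode Base64 text whose trailing '=' characters were stripped."""
    return base64.b64decode(encoded + "=" * (-len(encoded) % 4))

# The length alone distinguishes one, two and three null bytes.
print(decode_unpadded("AA"))    # b'\x00'
print(decode_unpadded("AAA"))   # b'\x00\x00'
print(decode_unpadded("AAAA"))  # b'\x00\x00\x00'

# Concatenation: padded pieces can be split apart and decoded separately...
print(base64.b64decode("TQ==") + base64.b64decode("TQ=="))  # b'MM'
# ...but the unpadded pieces run together into different data.
print(decode_unpadded("TQ" + "TQ") == b"MM")  # False
```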

[edit] Quadrosexagesimal

What's up with this particular expression? It's the equivalent of hexadecimal, but it appears to be all wrong. Could somebody break this expression down into the Latin word roots it is made up from and fix the spelling, please? --89.212.75.6 16:01, 24 February 2007 (UTC)

Quadrosexagesimal is not the same as hexadecimal; it is base 64 as opposed to base 16. I am no linguist, but I guess that quadro refers to the 4 and sexagesimal to the 60, just as in hexadecimal the hex refers to six and the decimal to 10. Note that the term Quadrosexagesimal applies to the mathematical base 64, not the Base64 encoding. IMHO there should really be two articles: the current one does not separate the two completely different terms very clearly, though separating them would require a lot of work -- Q Chris 15:00, 19 April 2007 (UTC)
It's not Latin. It's a neologism created in English by analogy of words like hexadecimal. My question is whether this is actually an English word (it's not in the OED), or an invention for Wikipedia. kwami 04:42, 27 September 2007 (UTC)
Actually, it should be quattuorsexagesimal, though that's still faux Latin. I'm deleting the word until s.o. can come up with a non-Wikipedia based source. kwami 04:49, 27 September 2007 (UTC)

[edit] Justification for URL variant

The justification given for the URL variant of Base64 is unsourced and it doesn't really sound right:

"Using a URL-encoder on standard Base64, however, is inconvenient as it will translate the '+' and '/' characters into special '%XX' hexadecimal sequences ('+' = '%2B' and '/' = '%2F'). When this is later used with database storage or across heterogeneous systems, they will themselves choke on the '%' character generated by URL-encoders (because the '%' character is also used in ANSI SQL as a wildcard)."

It is my understanding that the problem isn't with URL-escaping the data, it's with the interpretation of UNescaped base64 data in URLs. Specifically, the / is used as a path separator, and + is typically converted to a space. So you can't have a URL like http://example.com/base64-encoded-text-here because it will be mangled. Both / and + are perfectly legal in a URL and there's no reason that a developer or toolset should automatically think to escape it. In fact, it would be great if they did because their application would then unescape it and they'd end up with valid Base64 data again.

Further, the justification involving "database storage or across heterogeneous systems" choking on '%' sounds a bit odd to me. Applications should be unescaping their URLs before using the data to begin with (bug 1), and back-ends should never be blindly using user-provided data directly in their queries (bug 2). Any application that fits into this category has more serious problems to worry about, so it makes no sense to me why this should be used as justification for a URL-friendly version of Base64. —Fastolfe00 04:38, 25 October 2007 (UTC)

We don't seem to have a source for either version of the explanation. I'd put a {fact} tag on it, but it seems that the whole article needs sourcing, not just that one point. On the other hand, if you know a source that backs you up, please put your explanation in the article. In my opinion, it makes more sense. -- trlkly 07:19, 15 April 2008 (UTC)
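The behaviour Fastolfe00 describes can be seen with Python's standard urllib.parse functions. This is only an illustration of form-style decoding of '+', not a claim about any particular web framework; the sample string is the standard Base64 encoding of the two bytes FB FF.

```python
from urllib.parse import quote, unquote_plus

encoded = "+/8="  # standard Base64 for the bytes FB FF

# Left unescaped in a query string, '+' is read back as a space by form-style decoders.
mangled = unquote_plus(encoded)
print(repr(mangled))  # ' /8='

# Percent-escaping first makes the round trip lossless.
escaped = quote(encoded, safe="")
print(escaped)                # %2B%2F8%3D
print(unquote_plus(escaped))  # +/8=
```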

[edit] Requested move

The article should be moved to Base 64. Other bases have a space such as Base 24. I know the encoding method is called base64, not base 64, but the former is based on the latter, not vice versa. - TAKASUGI Shinji (talk) 06:51, 21 November 2007 (UTC)

Oppose I would oppose this move on the grounds that I reckon that most people finding this article will be looking for the base64 encoding mechanism rather than the mathematical base 64. Also, the article is almost exclusively about base64 encoding, not the mathematical base. Why not create an article about the mathematical base 64 at Base 64 rather than having the redirect? Each article would naturally reference the other. -- Q Chris (talk) 08:38, 21 November 2007 (UTC)
Then please separate Base 32 and Base32 too. - TAKASUGI Shinji (talk) 11:30, 21 November 2007 (UTC)
Yes, I would go along with that -- Q Chris (talk) 14:12, 21 November 2007 (UTC)
I've noticed this page has been renamed. Now, if we divide it into Base 64 and Base64, I think it's better to move this article again to Base64 and create the new article Base 64. What do you think? - TAKASUGI Shinji (talk) 01:01, 22 November 2007 (UTC)
I have moved it back! There is no justification for moving a page that is almost entirely about base64 to base 64. You are quite correct, it makes sense for this article to be called base64 and a new article about the mathematical base 64 should be written. —Preceding unsigned comment added by Q Chris (talkcontribs) 08:06, 22 November 2007 (UTC)
I have separated Base64 and Base 64. - TAKASUGI Shinji (talk) 10:30, 22 November 2007 (UTC)
Good stuff, thanks -- Q Chris (talk) 12:28, 22 November 2007 (UTC)

[edit] Is this true?

It is the largest power-of-two base that can be represented using single printable ASCII characters.

I'd have thought that base 128 would hold that honour, and that base 64 is used because it's the largest power-of-2 base that can be represented using characters that are pretty much certain to have the same value on different systems. Thoughts? --wintermute (talk) 21:20, 14 December 2007 (UTC)

There are 95 printable ASCII (not Extended ASCII) characters, and since BASE95 isn't a power of 2, I guess it's true. —EncMstr 21:37, 14 December 2007 (UTC)
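EncMstr's count is easy to verify. A tiny Python sketch, taking printable ASCII to mean the range 0x20 (space) through 0x7E (~):

```python
# Printable ASCII is the range 0x20 (space) through 0x7E (~).
printable = [chr(code) for code in range(0x20, 0x7F)]
count = len(printable)

# Largest power of two that does not exceed the number of available symbols.
largest_base = 1 << (count.bit_length() - 1)

print(count)         # 95
print(largest_base)  # 64
```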

[edit] JavaScript Code Sample Usage

May I legally use the JavaScript base64 implementation given in the article in GNU GPL software? Does anyone have an idea where this code comes from? Thanks. Dadudadu (talk) 12:58, 20 March 2008 (UTC)

Although I hate to be unhelpful, I can't answer either question. Answering the first would be giving legal advice, which Wikipedia can't do, and for the second I don't know the answer. What I can tell you is that there are implementations of base64 available under the GPL, so you might want to consider using one of those. -- trlkly 07:10, 15 April 2008 (UTC)

[edit] Citations Missing Tag

This article needs citations. It's not necessarily inaccurate, but it has only one citation, and that is for a relatively insignificant point. So I have added the {{citations missing}} tag, as I feel it is the most accurate representation of what this article needs, i.e. both citations and footnotes. -- trlkly 07:31, 15 April 2008 (UTC)

[edit] Apparent buffer overflow bug in C code

I'm looking at the C code, and it looks like it can write one byte beyond the specified end of the output buffer:

      result[resultIndex++] = base64chars[n0];
      if(resultIndex > maxResultLength) return;
      result[resultIndex++] = base64chars[n1];
      if(resultIndex > maxResultLength) return;
      // one more instance omitted for brevity

The problem is that it writes to the output buffer before it checks for overflow. For example, suppose maxResultLength == 0. (Yes, this is a dumb value, but the same argument will apply regardless of the value.) It will write the value to result[0], increment resultIndex to 1, then notice it is too big and return. I am pretty sure it should be:

      if(resultIndex >= maxResultLength) return;
      result[resultIndex++] = base64chars[n0];
      if(resultIndex >= maxResultLength) return;
      result[resultIndex++] = base64chars[n1];
      // and the third instance of writing to result[] should be fixed likewise

If I am missing something and this is not really a bug, will someone please let me know? Otherwise I will fix the code, test it, and edit this article accordingly. CosineKitty (talk) 16:54, 13 May 2008 (UTC)
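The difference between the two orderings can be modelled outside C. The following is a Python simulation of the bounds logic only, not the article's code: a growing list stands in for the fixed-size buffer, and both function names are hypothetical.

```python
def write_then_check(values, max_len):
    """Model of the current order: store first, test afterwards."""
    out = []
    for value in values:
        out.append(value)  # the write happens unconditionally
        if len(out) > max_len:
            return out
    return out

def check_then_write(values, max_len):
    """Model of the proposed order: test first, store only if there is room."""
    out = []
    for value in values:
        if len(out) >= max_len:
            return out
        out.append(value)
    return out

# With room for 0 or 2 characters, the first version writes one element too many.
print(len(write_then_check("TWFu", 0)), len(check_then_write("TWFu", 0)))  # 1 0
print(len(write_then_check("TWFu", 2)), len(check_then_write("TWFu", 2)))  # 3 2
```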

[edit] Proposal for merging

In reality, the only interesting thing one can say about this base is its use in Base64. In my opinion, the article Base 64 should be merged into this article in accordance with Wikipedia:Notability (numbers). QQ (talk) 11:20, 23 May 2008 (UTC)

Merge and redirect I don't care which spelling is used, but one should be a redirect to the other. —EncMstr (talk) 07:29, 28 May 2008 (UTC)
Merge and redirect as per EncMstr, also for Base 32 into Base32 (I just edited Base32, and needed Base64 in UTF-1). --217.184.142.41 (talk) 11:30, 3 June 2008 (UTC)