Searching \ for '[ee]: text compression' in subject line. ()
Make payments with PayPal - it's fast, free and secure! Help us get a faster server
FAQ page: massmind.org/techref/index.htm?key=text+compression
Search entire site for: 'text compression'.

Exact match. Not showing close matches.
PICList Thread
'[ee]: text compression'
2001\10\10@120102 by Andy N1YEW

picon face
how would i compress small text bits like 30 characters?

thanks,
andy

--
http://www.piclist.com hint: The list server can filter out subtopics
(like ads or off topics) for you. See http://www.piclist.com/#topics


2001\10\10@123754 by Scott Dattalo

face
flavicon
face
On Wed, 10 Oct 2001, Andy N1YEW wrote:

> how would i compress small text bits like 30 characters?

This is going to be a long thread...

If you have a limited set of characters from which to compose your text
stream then it's possible to encoded them in a different number base. For
example, if you had only 16 symbols, then the encoding would only require
4-bits per symbol. Unfortunately, 16 symbols is limiting. A radix 40 (base
40) allows 3 symbols to be encoded into 16 bits since 40^3 = 64000 < 2^16.
The encoding is:

  16 bits => S1*40^2 + S2*40 + S3

Forty symbols will allow one case of the 26 letters, the 10 digits, and 4
extra characters like space, comma, period, and double quote.

If this is not enough, then you can incorporate escape sequences that
either define rarely used symbols or switch to different character sets.
For example, 64001 may mean encode the letter 'A' and then switch to the
lower case character set.

Another trick is to use 64000 - 65024 to encode two characters from a 32
character symbol table. So rarely used abbreviations consisting of strings
of capital letters could be easily encoded. Or you could encode one
character from a 40 character set and one character from a 38 character
set (or ignore the last two in a 40) and cover the codes from 64000 to
65520. This would leave 16 symbols for special escape sequences.

You can also take advantage of common punctuations. E.g. A capital letter
is almost always followed by a lower case letter; a period and comma are
almost always followed by a space.

BTW, Huffman coding and the like don't work too well for small strings
like this.

Scott

--
http://www.piclist.com hint: The list server can filter out subtopics
(like ads or off topics) for you. See http://www.piclist.com/#topics


2001\10\10@130736 by Bob Ammerman

picon face
Huffman will work fine if you have a large set of small strings that all
similar occurance frequencies of each character code.

Bob Ammerman
RAm Systems
(contract development of high performance, high function, low-level
software)

{Original Message removed}

2001\10\10@142653 by Scott Dattalo

face
flavicon
face
On Wed, 10 Oct 2001, Bob Ammerman wrote:

> Huffman will work fine if you have a large set of small strings that all
> similar occurance frequencies of each character code.


I should have been clearer.

*Adaptive* Huffman encoding doesn't work well for encoding small strings
when memory is limited (like it is on a PIC). The problem with adaptive
Huffman encoding is that you need on the order of N memory cells for the N
symbols. If N is 128 for example, you run out of memory.

A non-adaptive Huffman encoder that scans the strings before hand could
work. In fact, it would work quite well if you have a fixed number of
known strings.

--
http://www.piclist.com hint: The list server can filter out subtopics
(like ads or off topics) for you. See http://www.piclist.com/#topics


2001\10\10@144815 by Bob Ammerman

picon face
Scott,

You are absolutely correct.

In fact, if the goal is the pack as many LCD type messages as possible into
the minimum amount of memory (which is probably what the OP was after), then
there are many tricks that can be used that depend on having the processing
power of a desktop machine to do the compression, and leaving the PIC a
relatively simple job for the decompression.

Bob Ammerman
RAm Systems
(contract development of high performance, high function, low-level
software)

{Original Message removed}

2001\10\10@181600 by Jeff DeMaagd

flavicon
face
----- Original Message -----
From: Andy N1YEW <spam_OUTn1yewTakeThisOuTspamSOFTHOME.NET>

>how would i compress small text bits like 30 characters?<

You didn't really specify a whole lot.  Is it just one text string or do you
have a sizable number of text strings?  How complex to you want to get?

The simplest method I can think of is for words used more than once I make
it into a separate string and call it in for each instance used.  I have
been doing this, this way I can save some space at minimal extra coding
hassle, so I can spend my efforts elsewhere.

Now I have to look up Huffman, assuming I can dig out my Algorithms book...

Jeff

--
http://www.piclist.com hint: The list server can filter out subtopics
(like ads or off topics) for you. See http://www.piclist.com/#topics


2001\10\10@184759 by Andy N1YEW

picon face
> ----- Original Message -----
> From: Andy N1YEW <.....n1yewKILLspamspam@spam@SOFTHOME.NET>
>
> >how would i compress small text bits like 30 characters?<
>
> You didn't really specify a whole lot.  Is it just one text string or do
you
> have a sizable number of text strings?  How complex to you want to get?

I want a simple routine, and there are many strings but they cannot be
combined together(IE: made into one long compressed string).  I thought of
making a character set "a-z" "0-9" "@" and  "."  That may work I think..

Andy

--
http://www.piclist.com hint: The list server can filter out subtopics
(like ads or off topics) for you. See http://www.piclist.com/#topics


2001\10\10@223837 by steve

flavicon
face
> The simplest method I can think of is for words used more than once I
> make it into a separate string and call it in for each instance used.
> I have been doing this, this way I can save some space at minimal
> extra coding hassle, so I can spend my efforts elsewhere.

A method that I have used with good results and along the same
lines is to take advantage of the high bit for encoding. The simplest
implementation is where there are a lot of spaces. Count how
many spaces, add 128 to it and embed that in the string. In your
print routine, if you come across a character >128 you output a
space and decrement the character until it reaches 128.
You can extend that further by building a table of phrases (eg.
"the", "turn on", "turn off", "Option", "detonate", etc) and add the
table index to 192 (128 + 64). Assuming that you don't have strings
with more than 63 spaces.

For example, you would store the string

"* 1.*You must * A before you *." (31 bytes)

Where "*" are the values 195, 138, 193 and 196 respectively.

Which would expand to

"Option 1.          You must turn on A before you detonate."
(58 bytes)

It doesn't take long to do with a simple word count program and
search/replace in a text editor and it is microcontroller friendly to
unpack.

Steve.




======================================================
Steve Baldwin                Electronic Product Design
TLA Microsystems Ltd         Microcontroller Specialists
PO Box 15-680, New Lynn      http://www.tla.co.nz
Auckland, New Zealand        ph  +64 9 820-2221
email: stevebspamKILLspamtla.co.nz      fax +64 9 820-1929
======================================================

--
http://www.piclist.com hint: The list server can filter out subtopics
(like ads or off topics) for you. See http://www.piclist.com/#topics


2001\10\11@020859 by dr. Imre Bartfai

flavicon
face
Hi,

if you can use a PIC capable of reading its own memory (e. g. 16F87x or
18Cxxx or 17Cxxx) then u can probably store 2 characters in program
word. This compression rate is unbeatable I think regarding to resources.

Imre


+-----------------------------------------------------------------------+
| The information transmitted is intended only for the person or entity |
| to which it is addressed and may contain confidential and/or          |
| privileged material.  Any review, retransmission, dissemination or    |
| other use of, or taking of any action in reliance upon, this          |
| information by persons or entities other than the intended recipient  |
| is prohibited. If you received this in error, please contact the      |
| sender and delete the material from any computer.                     |
+-----------------------------------------------------------------------+

On Wed, 10 Oct 2001, Andy N1YEW wrote:

> > {Original Message removed}

2001\10\12@031801 by jeethur

flavicon
face
> Huffman will work fine if you have a large set of small strings that all
> similar occurance frequencies of each character code.
>

Bob,
This is real interesting. Have you done huffman on PICs ? I managed with
RLE.
And RLE is of no help on text strings. But for huffman I could'nt imagine of
a way to do.

Jeethu Rao

--
http://www.piclist.com#nomail Going offline? Don't AutoReply us!
email .....listservKILLspamspam.....mitvma.mit.edu with SET PICList DIGEST in the body


2001\10\12@070811 by Byron A Jeff
face picon face
On Fri, Oct 12, 2001 at 12:46:26PM +0530, Jeethu Rao wrote:
> > Huffman will work fine if you have a large set of small strings that all
> > similar occurance frequencies of each character code.
> >
>
> Bob,
> This is real interesting. Have you done huffman on PICs ? I managed with
> RLE.
> And RLE is of no help on text strings. But for huffman I could'nt imagine of
> a way to do.

Well first off you'd only do huffman decoding on the PIC. The encoding of the
strings would be done offline.

Huffman decoding is generally done with a binary tree. It shouldn't be too
difficult to collapse a binary tree representation into a table. All you'd have
to do is follow some simple rules like leaves of the tree have bit 7 turned off
and interior pointers would have bit 7 turned on. For each interior node have
a a consecutive pair of pointers for the left and right subtrees. Then it's
just a matter of reading the next compressed bit, adding it to current pointer
and accessing the table. If the top bit is set then proceed to the next bit.
If not then that's the character.

Here's a simple example. Three characters and their huffman encodings:

A - 0
B - 10
C - 11

Here's the table:

0:01000001   ; The character A
1:10000010   ; Subtree starts at table element 2
2:01000010   ; The character B
3:01000011   ; The character C

So the bit string 11010 would map out like this:

- init to table element p=0, the root of the tree.
- Access the first bit, a 1, and add it to p, Access table[p] = 10000010
- Since the top bit is 1, set p to the lower 7 bits (p=2) and continue.
- Access the next bit, a 1,  and add it to p, access table[p] = 01000011 ('C')
- Since the top bit is 0, We're done. Output a 'C'. Reset p=0

- Access the next bit, a 0, and add it to p, access table[p] = 01000001 ('A')
- Since the top bit is 0, We're done. Output a 'A'. Reset p=0

- Access the next bit, a 1, and add it to p, access table[p] = 10000010
- Since the top bit is 1, set p to the lower 7 bits (p=2) and continue.
- Access the next bit, a 0, and add it to p, access table[p] = 01000010 ('B')
- Since the top bit is 0, We're done. Output a 'B'. Reset p=0

The output would be "CAB".

Depending on the size of the string you can Huffman encode a null character at
the end of the string, or preceed the string with its length.

And of course a quick review of Huffman coding. Do a frequency analysis of
the characters in your strings and build a tree so that the most frequent
characters (statistically) have the shortest path in the tree. So in the
above example across all the strings, 'A' must have occured more than 'B' or
'C' because it has a shorter Huffman encoding.

Hope this helps,

BAJ

--
http://www.piclist.com#nomail Going offline? Don't AutoReply us!
email EraseMElistservspam_OUTspamTakeThisOuTmitvma.mit.edu with SET PICList DIGEST in the body


2001\10\14@005505 by jeethur

flavicon
face
Byron,
       Thanks for the long explanation. Currently I'm not working on
       anything like text de-compression on PICs. But its nevertheless
       a very interesting topic. I've saved your message. Maybe it'll come
       in hand some time in the fututre.

Thanks,

Jeethu Rao

> {Original Message removed}

More... (looser matching)
- Last day of these posts
- In 2001 , 2002 only
- Today
- New search...