WKdmCompress_16k.s source code [xnu/osfmk/arm64/WKdmCompress_16k.s]

1	/*
2	* Copyright (c) 2000-2014 Apple Inc. All rights reserved.
3	*
4	* @APPLE_OSREFERENCE_LICENSE_HEADER_START@
5	*
6	* This file contains Original Code and/or Modifications of Original Code
7	* as defined in and that are subject to the Apple Public Source License
8	* Version 2.0 (the 'License'). You may not use this file except in
9	* compliance with the License. The rights granted to you under the License
10	* may not be used to create, or enable the creation or redistribution of,
11	* unlawful or unlicensed copies of an Apple operating system, or to
12	* circumvent, violate, or enable the circumvention or violation of, any
13	* terms of an Apple operating system software license agreement.
14	*
15	* Please obtain a copy of the License at
16	* http://www.opensource.apple.com/apsl/ and read it before using this file.
17	*
18	* The Original Code and all software distributed under the License are
19	* distributed on an 'AS IS' basis, WITHOUT WARRANTY OF ANY KIND, EITHER
20	* EXPRESS OR IMPLIED, AND APPLE HEREBY DISCLAIMS ALL SUCH WARRANTIES,
21	* INCLUDING WITHOUT LIMITATION, ANY WARRANTIES OF MERCHANTABILITY,
22	* FITNESS FOR A PARTICULAR PURPOSE, QUIET ENJOYMENT OR NON-INFRINGEMENT.
23	* Please see the License for the specific language governing rights and
24	* limitations under the License.
25	*
26	* @APPLE_OSREFERENCE_LICENSE_HEADER_END@
27	*/
28
29	/*
30	This file contains arm64 hand optimized implementation of WKdm memory page compressor.
31
32	int WKdm_compress (WK_word src_buf, WK_word* dest_buf, WK_word* scratch, unsigned int bytes_budget);*
33
34	input :
35	src_buf : address of input page (length = 1024 words)
36	dest_buf : address of output buffer (may not be 16-byte aligned)
37	scratch : a 16-byte aligned 4k bytes scratch memory provided by the caller,
38	bytes_budget : a given byte target in compression
39
40	output :
41
42	if the input buffer can be compressed within the given byte budget, the dest_buf is written with compressed data and the function returns with number of bytes for the compressed data
43	o.w., the function returns -1 to signal that the input data can not be compressed with the given byte budget.
44	During the scan and tag process, each word that can not be compressed will be written to dest_buf, followed by a 12-bytes header + 256-bytes tag area.
45	When the functions returns -1, dest_buf is filled with all those words that can not be compressed and should be considered undefined.
46	The worst-case scenario is that all words can not be compressed. Hence, the minimum size requirement for dest_buf should be 12+256+4096 = 4364 bytes to prevent from memory fault.
47
48	The 4th argument bytes_budget is the target compress budget in bytes.
49	Should the input page can be compressed within the budget, the compressed data is written to dest_buf, and the function returns the number of compressed bytes.*
50	Otherwise, the function returns -1 (to signal to the caller that the page can not be compressed).
51
52	WKdm Compression algorithm is briefly stated as follows:
53
54	There is a dynamically updated dictionary consisting of 16 words. Each dictionary word is initialized to 1 at the point of entry to the function.
55	For a nonzero input word x, its 8-bits (10-bits scaled up) is used to determine a corresponding word from the dictionary, represented by dict_index (4-bits) and dict_word (32-bits).
56	a. k = (x>>10)&255; // 8-bit hash table index
57	b. dict_index = hashTable[k]; // 4-bit dictionary index, hashTable[] is fixed
58	c. dict_word = dictionary[dict_index]; // 32-bit dictionary word, dictionary[] is dynamically updated
59
60	Each input word x is classified/tagged into 4 classes :
61	0 : x = 0
62	1 : (x>>10) == (dict_word>>10), bits 10:31 of the input word match a dictionary word
63	2 : (x>>10) != (dict_word>>10), the above condition (22 higher bits matched) is not met, meaning a dictionary miss
64	3 : (x == dict_word), the exact input word is in the dictionary
65
66	For each class, different numbers of bits are needed for the decompressor to reproduce the original input word.
67	0 : 2-bits tag (32->2 compression)
68	1 : 2-bits tag + 4-bits dict_index + 10-bits lower bits (32->16 compression)
69	2 : 2-bits tag + 32-bits new word (32->34 expansion)
70	3 : 2-bits tag + 4-bits dict_index (32->6 compression)
71
72	It is obvious now that WKdm compress algorithm works well for pages where there are lots of zero words (32->2) and/or there are freqeunt repeats of some word patterns (32->6).
73
74	the output bit stream (dest_buf) consists of*
75	a. 12 bytes header
76	b. 256 bytes for 1024 packed tags
77	c. (varying number of) words for new words not matched to dictionary word.
78	d. (varying number of) 32-bit words for packed 4-bit dict_indices (for class 1 and 3)
79	e. (varying number of) 32-bit words for packed 10-bit low bits (for class 1)
80
81	the header is actually of 3 words that specify the ending offset (in 32-bit words) from the start of the bit stream of c,d,e, respectively.
82	Note that there might be padding bits in d (if the number of dict_indices does not divide by 8), and there are 2/12/22 padding bits for packing 3/2/1 low 10-bits in a 32-bit word.
83
84
85	The WKdm compress algorithm 1st runs a scan and classification pass, tagging and write unpacked data into temporary buffers. It follows by packing those data into the output buffer.
86
87	The temp buffers are
88
89	uint8_t tempTagsArray[1024]; // temporary saving for tags before final packing
90	uint8_t tempQPosArray[1024]; // temporary saving for dict_indices before final packing
91	uint16_t tempLowBitsArray[1024]; // temporary saving for partially matched lower 10 bits before final packing
92
93	Since the new words (that can not matched fully or partially to the dictionary) are stored right after the header and the tags section and need no packing, we directly write them to
94	the destination buffer.
95
96	uint32_t new_word = dest_buf+3+64; // 3 words for header, 64 words for tags, new words come right after the tags.*
97
98	Now since we are given a byte budget for this compressor, we can monitor the byte (or bit) usage on the fly in the scanning and tagging pass.
99
100	byte_count -= 12 + 256; // bit budget minus header and tags
101
102	whenever an input word is classified as class
103
104	2 : byte_count -= 4;
105
106	the compress function can early exit (return -1) should the page can not be compressed with the given byte budget (i.e., byte_count <= 0).
107
108	without showing the bit budget management, the pseudo code is given as follows:
109
110	uint8_t tags=tempTagsArray;*
111	uint8_t dict=tempQPosArray;*
112	uint8_t partial=tempLowBitsArray;*
113
114	for (i=0;i<1024;i++) {
115	x = src_buf++;*
116	if (x == 0) { // zero, 2-bits tag
117	*tags++ = 0;
118	} else {
119
120	// find dict_index and dict_word from x
121	k = (x>>10)&255;
122	dict_index = hashTable[k];
123	dict_word = dictionary[dict_index];
124
125	if (dict_word == x) { // exactly match
126	// 2-bits tag + 4-bits table index
127	*tags++ = 3;
128	*dict++ = dict_index;
129	} else if (((x^dict_word)>>10)==0) { // 22 higher bits matched
130	// 2-bits tag + 4-bits table index + 10-bits lower partial
131	*tags++ = 1;
132	*dict++ = dict_index;
133	*partial++ = x &0x3ff;
134	dictionary[dict_index] = x;
135	} else { // not matched
136	// 2-bits tag + 32-bits new word
137	*tags++ = 2;
138	*new_word++ = x;
139	dictionary[dict_index] = x;
140	}
141	}
142	}
143
144	after this classification/tagging pass is completed, the 3 temp buffers are packed into the output dest_buf:*
145
146	1. 1024 tags are packed into 256 bytes right after the 12-bytes header
147	2. dictionary indices (4-bits each) are packed into are right after the new words section
148	3. 3 low 10-bits are packed into a 32-bit word, this is after the dictionary indices section.
149
150	cclee, 11/9/12
151
152	Added zero page, single value page, sparse page, early abort optimizations
153	rsrini, 09/14/14
154	*/
155
156	#define PAGES_SIZE_IN_KBYTES 16
157
158	#ifndef PAGES_SIZE_IN_KBYTES
159	#define PAGES_SIZE_IN_KBYTES 4
160	#endif
161
162	#if !((PAGES_SIZE_IN_KBYTES==4) \|\| (PAGES_SIZE_IN_KBYTES==16))
163	#error "Only PAGES_SIZE_IN_KBYTES = 4 or 16 is supported"
164	#endif
165
166
167	.text
168	.align `4`
169
170	/*
171	int WKdm_compress (WK_word src_buf, WK_word* dest_buf, WK_word* scratch, unsigned int bytes_budget);*
172	*/
173
174	.globl _WKdm_compress_16k
175	_WKdm_compress_16k:
176
177	/*
178	------------------------- symbolizing register use -----------------------------------
179	*/
180	#define src_buf x0
181	#define next_input_word x0
182	#define dest_buf x1
183	#define scratch x2
184	#define byte_count x3
185	#define next_tag x4
186	#define tempTagsArray x2 // scratch
187	#define dictionary x5
188	#define remaining x6
189	#define next_full_patt x7
190	#define dict_location x8
191	#define wdict_location w8
192	#define next_qp x9
193	#define hashTable x10
194	#define tempQPosArray x11
195	#define next_low_bits x12
196
197	/*
198	this arm64 assembly code is ported from x86_64 assembly code,
199	therefore need such symbolization to quickly reuse the x86_64 assembly code
200	for these intermediate/temporary register use
201	*/
202	#define rax x13
203	#define eax w13
204	#define rcx x14
205	#define ecx w14
206	#define rdx x15
207	#define edx w15
208	#define rdi x0 /* after some point, x0/rdi becomes free other usage */
209
210
211	/*
212	------------------------- scratch memory --------------------------------------
213
214	need 164 (dictionary) + 2564 (tempTagsArray) + 2564 (tempQPosArray) + 10244 (tempLowBitsArray)
215	total 6208 bytes
216	[sp,#0] : dictionary
217	[scratch,#0] : tempTagsArray
218	[scratch,#1024] : tempQPosArray
219	[scratch,#2048] : tempLowBitsArray
220	*/
221
222	#define scale (PAGES_SIZE_IN_KBYTES/4)
223
224	#define SV_RETURN 0 // return value when SV, ZV page is found
225	#define MZV_MAGIC 17185 // magic value used to identify MZV page encoding
226	#define CHKPT_BYTES 416 // for early aborts: checkpoint after processing this many bytes. Must be in range [4..4096]
227	#define CHKPT_WORDS (CHKPT_BYTES/4) // checkpoint bytes in words
228	#define CHKPT_TAG_BYTES (CHKPT_BYTES/16) // size of the tags for CHKPT_BYTES of data
229	#define CHKPT_SHRUNK_BYTES 426 // for early aborts: max size of compressed stream to allow further processing ..
230	// .. to disable early aborts, set CHKPT_SHRUNK_BYTES to 4096
231	#if CHKPT_BYTES > 4096
232	#error CHKPT_BYTES must be <= 4096
233	#endif
234	#if CHKPT_BYTES < 4
235	#error CHKPT_BYTES must be >= 4
236	#endif
237
238	#if KERNEL
239	sub sp, sp, #`64`
240	st1`.4s` {v0,v1,v2,v3},[sp]
241	#endif
242
243	sub sp, sp, #`64` // allocate for dictionary
244	mov dictionary, sp // use x5 to point to sp, so we can use sub xd, xn, sp
245
246	sub sp, sp, #`64` // allocate space for saving callee-saved registers
247	mov x15, sp
248	stp x20, x21, [x15, #`0`] // save x20, x21
249	stp x22, x23, [x15, #`16`] // save x22, x23
250	stp x24, x25, [x15, #`32`] // save x24, x25
251	stp x26, x27, [x15, #`48`] // save x26, x27
252
253	/*
254	------- entwined statck space allocation, registers set up, and PRELOAD_DICTIONARY -------------------
255	*/
256
257	// NOTE: ALL THE DICTIONARY VALUES MUST BE INITIALIZED TO ZERO
258	// THIS IS NEEDED TO EFFICIENTLY DETECT SINGLE VALUE PAGES
259	mov next_tag, tempTagsArray // &tempTagsArray[0]
260	add next_qp, scratch, #(`1024`scale) // next_qp*
261	mov remaining, #(CHKPT_WORDSscale) // remaining input words .. initially set to checkpoint*
262	add next_full_patt, dest_buf, #(`12`+`256`scale) // dest_buf + [TAGS_AREA_OFFSET + (num_input_words / 16)]4
263	sub byte_count, byte_count, #(`12`+`256`scale) // bit_count - header - tags*
264	add next_low_bits, scratch, #(`2048`scale) // &tempLowBitsArray[0]*
265	stp xzr, xzr, [dictionary, #`0`] // initialize dictionary
266	adrp hashTable, _hashLookupTable@GOTPAGE
267	stp xzr, xzr, [dictionary, #`16`] // initialize dictionary
268	stp xzr, xzr, [dictionary, #`32`] // initialize dictionary
269	ldr hashTable, [hashTable, _hashLookupTable@GOTPAGEOFF]
270	stp xzr, xzr, [dictionary, #`48`] // initialize dictionary
271
272	#define EARLYCHECK 0
273	#define NORMAL 1
274
275	#define mode w20
276	#define start_next_full_patt x21
277	#define start_next_input_word x22
278	#define start_next_low_bits x23
279	#define r11 x24
280	#define r13 x25
281	#define byte_budget x26
282	#define start_next_qp tempQPosArray
283
284	add tempQPosArray, scratch, #(`1024`scale) // &tempQPosArray[0]*
285	mov mode, EARLYCHECK // indicate we are yet to evaluate the early aborts
286	mov start_next_full_patt, next_full_patt // remember the start of next_full_patt
287	mov start_next_input_word, next_input_word // remember the start of next_input_word
288	mov start_next_low_bits, next_low_bits // remember the start of next_low_bit
289	add byte_budget, byte_count, #(`12`+`256`scale) // remember the byte budget*
290
291	b L_loop
292
293	.align `4`, `0x90`
294
295	/ we've just detected a zero input word in edx /
296	L_RECORD_ZERO:
297	strb edx, [next_tag], #`1` // next_tag++ = ZERO; edx is used as input word, and if we are here edx = 0*
298	subs remaining, remaining, #`1` // remaing--;
299	b.le CHECKPOINT // if remaining = 0, break
300
301	/ -------------- scan/tag pass loop ------------------------- /
302	L_loop:
303
304	/ load new input word to edx /
305	ldr edx, [next_input_word], #`4`
306	cbz edx, L_RECORD_ZERO // if (input_word==0) RECORD_ZERO
307
308	/*
309	now the input word edx is nonzero, we next find the corresponding dictionary word (eax) and dict_location
310	*/
311	ubfm eax, edx, #`10`, #`17`
312	ldrb wdict_location, [hashTable, rax] // HASH_TO_DICT_BYTE_OFFSET(input_word)
313	ldr eax, [dictionary, dict_location] // dict_word = dict_location;*
314
315	/ detect whether we match input to its corresponding dictionary word /
316	eor eax, eax, edx // dict_word vs input_word
317	cbz eax, L_RECORD_EXACT // if identical, RECORD_EXACT
318	lsr eax, eax, #`10` // HIGH_BITS(dict_word^input_word)
319	cbz eax, L_RECORD_PARTIAL // if identical, RECORD_PARTIAL
320
321	L_RECORD_MISS:
322	/*
323	if we are here, the input word can not be derived from the dictionary,
324	we write the input word as a new word,
325	and update the dictionary with this new word
326	*/
327	subs byte_count, byte_count, #`4` // byte_count -= 4
328	b.le L_budgetExhausted // return -1 to signal this page is not compressable
329	str edx, [next_full_patt], #`4` // next_full_patt++ = input_word;*
330	mov eax, #`2` // tag for MISS
331	subs remaining, remaining, #`1` // remaing--;
332	str edx, [dictionary, dict_location] // dict_location = input_word*
333	strb eax, [next_tag], #`1` // next_tag++ = 2 for miss*
334	b.gt L_loop // // if remaining > 0, repeat
335	b CHECKPOINT
336
337	L_done_search:
338
339	// SET_QPOS_AREA_START(dest_buf,next_full_patt);
340	/ 1st word in dest_buf header = 4-byte offset (from start) of end of new word section /
341
342	sub rax, next_full_patt, dest_buf // next_full_patt - dest_buf
343	lsr eax, eax, #`2` // offset in 4-bytes
344	str eax, [dest_buf] // dest_buf[0] = next_full_patt - dest_buf
345
346	/ -------------------------- packing 1024 tags into 256 bytes ----------------------------------------/
347	// boundary_tmp = WK_pack_2bits(tempTagsArray, (WK_word ) next_tag, dest_buf + HEADER_SIZE_IN_WORDS);*
348
349	add rdi, dest_buf, #`12` // dest_buf
350	mov rcx, tempTagsArray // &tempTagsArray[0]
351
352	L_pack_2bits:
353	ld1`.2s` {v0,v1,v2,v3},[rcx],#`32`
354
355	shl`.2d` v1,v1,#`4`
356	shl`.2d` v3,v3,#`4`
357
358	orr`.8b` v0, v0, v1
359	orr`.8b` v2, v2, v3
360
361	ushr`.2d` v1, v0, #`30`
362	ushr`.2d` v3, v2, #`30`
363
364	orr`.8b` v0, v0, v1
365	orr`.8b` v2, v2, v3
366
367	zip1`.2s` v0, v0, v2
368	st1`.2s` {v0},[rdi],#`8`
369	cmp next_tag, rcx
370	b.hi L_pack_2bits
371
372	/ --------------------------------- packing 4-bits dict indices into dest_buf ---------------------------------- /
373
374	/ 1st, round up number of 4-bits dict_indices to a multiple of 8 and fill in 0 if needed /
375	sub rax, next_qp, tempQPosArray // eax = num_bytes_to_pack = next_qp - (char ) tempQPosArray;*
376	add eax, eax, #`7` // num_bytes_to_pack+7
377	lsr eax, eax, #`3` // num_packed_words = (num_bytes_to_pack + 7) >> 3
378	add rcx, tempQPosArray, rax, lsl #`3` // endQPosArray = tempQPosArray + 2num_source_words*
379	lsl rax, rax, #`2`
380	subs byte_count, byte_count, rax
381	b.lt L_budgetExhausted
382
383	cmp rcx, next_qp // endQPosArray vs next_qp
384	b.ls `2f` // if (next_qp >= endQPosArray) skip the following zero paddings
385	sub rax, rcx, next_qp
386	mov edx, #`0`
387	tst eax, #`4`
388	b.eq `1f`
389	str edx, [next_qp], #`4`
390	`1`: tst eax, #`2`
391	b.eq `1f`
392	strh edx, [next_qp], #`2`
393	`1`: tst eax, #`1`
394	b.eq `2f`
395	strb edx, [next_qp], #`1`
396	`2`:
397	mov rdi, next_full_patt // next_full_patt
398	cmp rcx, tempQPosArray // endQPosArray vs tempQPosArray
399	ldr eax, [dest_buf]
400	b.ls L20 // if (endQPosArray <= tempQPosArray) skip the following
401	mov rdx, tempQPosArray // tempQPosArray
402
403	/ packing 4-bits dict indices into dest_buf /
404	L_pack_4bits:
405	ldr rax, [rdx], #`8` // src_next[1]:src_next[0]
406	orr rax, rax, rax, lsr #`28` // eax = src_next[0] \| (src_next[1] << 4)
407	cmp rcx, rdx // source_end vs src_next
408	str eax, [rdi], #`4` // dest_next++ = temp;*
409	b.hi L_pack_4bits // while (src_next < source_end) repeat the loop
410
411	// SET_LOW_BITS_AREA_START(dest_buf,boundary_tmp);
412	sub rax, rdi, dest_buf // boundary_tmp - dest_buf
413	lsr eax, eax, #`2` // boundary_tmp - dest_buf in words
414	L20:
415	str eax, [dest_buf,#`4`] // dest_buf[1] = boundary_tmp - dest_buf
416
417
418
419	/ --------------------------- packing 3 10-bits low bits into a 32-bit word in dest_buf[] ----------------------------------------- /
420
421	add rcx, scratch, #(`2048`scale) // tempLowBitsArray*
422	sub rdx, next_low_bits, rcx // next_low_bits - tempLowBitsArray (in bytes)
423	lsr rdx, rdx, #`1` // num_tenbits_to_pack (in half-words)
424	subs edx, edx, #`3` // pre-decrement num_tenbits_to_pack by 3
425	b.lt `1f` // if num_tenbits_to_pack < 3, skip the following loop
426	`0`:
427	subs byte_count, byte_count, #`4` // byte_count -= 4
428	b.le L_budgetExhausted // return -1 to signal this page is not compressable
429	subs edx, edx, #`3` // num_tenbits_to_pack-=3
430	ldr rax, [rcx], #`6`
431	bfm rax, rax, #`58`, #`9` // pack 1st toward 2nd
432	bfm rax, rax, #`58`, #`25` // pack 1st/2nd toward 3rd
433	lsr rax, rax, #`12`
434	str eax, [rdi], #`4` // pack w0,w1,w2 into 1 dest_buf word
435	b.ge `0b` // if no less than 3 elements, back to loop head
436
437	`1`: adds edx, edx, #`3` // post-increment num_tenbits_to_pack by 3
438	b.eq `3f` // if num_tenbits_to_pack is a multiple of 3, skip the following
439	subs byte_count, byte_count, #`4` // byte_count -= 4
440	b.le L_budgetExhausted // return -1 to signal this page is not compressable
441	ldrh eax,[rcx] // w0
442	subs edx, edx, #`1` // num_tenbits_to_pack--
443	b.eq `2f` //
444	ldrh edx, [rcx, #`2`] // w1
445	orr eax, eax, edx, lsl #`10` // w0 \| (w1<<10)
446
447	`2`: str eax, [rdi], #`4` // write the final dest_buf word
448
449	`3`: sub rax, rdi, dest_buf // boundary_tmp - dest_buf
450	lsr eax, eax, #`2` // boundary_tmp - dest_buf in terms of words
451	str eax, [dest_buf, #`8`] // SET_LOW_BITS_AREA_END(dest_buf,boundary_tmp)
452	lsl w0, eax, #`2` // boundary_tmp - dest_buf in terms of bytes
453
454	L_done:
455
456	// restore registers and return
457	mov x15, sp
458	ldp x20, x21, [x15, #`0`] // restore x20, x21
459	ldp x22, x23, [x15, #`16`] // restore x22, x23
460	ldp x24, x25, [x15, #`32`] // restore x24, x25
461	ldp x26, x27, [x15, #`48`] // restore x26, x27
462	add sp, sp, #`128` // deallocate for dictionary + saved register space
463
464	#if KERNEL
465	ld1`.4s` {v0,v1,v2,v3},[sp],#`64`
466	#endif
467	ret lr
468
469	.align `4`
470	L_budgetExhausted:
471	mov x0, #-`1`
472	b L_done
473
474
475	.align `4`,`0x90`
476	L_RECORD_EXACT:
477	/*
478	we have an exact match of the input word to its corresponding dictionary word
479	write tag/dict_index to the temorary buffers
480	*/
481	mov eax, #`3`
482	lsr w14, wdict_location, #`2` // divide by 4 for word offset
483	subs remaining, remaining, #`1` // remaing--;
484	strb eax, [next_tag], #`1` // next_tag++ = 3 for exact*
485	strb w14, [next_qp], #`1` // next_qp = word offset (4-bit)*
486	b.gt L_loop
487	b CHECKPOINT // if remaining = 0, break
488
489	.align `4`,`0x90`
490	L_RECORD_PARTIAL:
491	/*
492	we have a partial (high 22-bits) match of the input word to its corresponding dictionary word
493	write tag/dict_index/low 10 bits to the temorary buffers
494	*/
495	mov ecx, #`1`
496	strb ecx, [next_tag], #`1` // next_tag++ = 1 for partial matched*
497	str edx, [dictionary, dict_location] // dict_location = input_word;*
498	subs remaining, remaining, #`1` // remaing--;
499	lsr eax, wdict_location, #`2` // offset in 32-bit word
500	and edx, edx, #`1023` // lower 10 bits
501	strb eax, [next_qp], #`1` // update next_qp++*
502	strh edx, [next_low_bits], #`2` // save next_low_bits++
503	b.gt L_loop
504
505	CHECKPOINT:
506
507	cbz mode, L_check_compression_ratio // if this this an early abort check..
508
509	L_check_zero_page:
510
511	cmp start_next_full_patt, next_full_patt // check if any dictionary misses in page
512	b.ne L_check_single_value_page
513
514	cmp start_next_qp, next_qp // check if any partial or exact dictionary matches
515	b.ne L_check_single_value_page
516
517	mov x0, #SV_RETURN // Magic return value
518	b L_done
519
520	L_check_single_value_page:
521
522	sub rax, next_full_patt, start_next_full_patt // get # dictionary misses
523	lsr rax, rax, #`2`
524
525	sub r11, next_qp, start_next_qp // get # dictionary hits (exact + partial)
526
527	sub r13, next_low_bits, start_next_low_bits // get # dictionary partial hits
528	lsr r13, r13, #`1`
529
530	// Single value page if one of the follwoing is true:
531	// partial == 0 AND hits == 1023(for 4K page) AND miss == 1 AND tag[0] == 2 (i.e. miss)
532	// partial == 1 AND hits == 1024(for 4K page) AND tag[0] == 1 (i.e. partial)
533	//
534	cbnz r13, `1f` // were there 0 partial hits?
535
536	cmp r11, #(`256`PAGES_SIZE_IN_KBYTES - `1`) // were there 1023 dictionary hits*
537	b.ne `1f`
538
539	cmp rax, #`1` // was there exacly 1 dictionary miss?
540	b.ne `1f`
541
542	ldrb edx, [tempTagsArray] // read the very 1st tag
543	cmp edx, #`2` // was the very 1st tag a miss?
544	b.eq L_is_single_value_page
545
546	`1`:
547	cmp r13, #`1` // was there 1 partial hit?
548	b.ne L_check_mostly_zero
549
550	cmp r11, #(`256`PAGES_SIZE_IN_KBYTES) // were there 1024 dictionary hits*
551	b.ne L_check_mostly_zero
552
553	ldrb edx, [tempTagsArray] // read the very 1st tag
554	cmp edx, #`1` // was the very 1st tag a partial?
555	b.ne L_check_mostly_zero
556
557	L_is_single_value_page:
558
559	mov x0, #SV_RETURN // Magic return value
560	b L_done
561
562	L_check_mostly_zero:
563	// how much space will the sparse packer take?
564	add rax, rax, r11 // rax += (next_qp - start_next_qp)
565	mov rdx, #`6`
566	mov rcx, #`4`
567	madd r11, rax, rdx, rcx // r11 = rax 6 (i.e. 4 byte word + 2 byte offset) + 4 byte for header*
568
569	sub rax, next_low_bits, start_next_low_bits // get bytes consumed by lower-10 bits
570	mov rdx, #`1365`
571	mul rax, rax, rdx
572
573	sub rdx, next_full_patt, start_next_full_patt // get bytes consumed by dictionary misses
574	add rax, rdx, rax, lsr #`11` // rax = 2/3(next_low_bits - start_next_low_bits) + (next_full_patt - start_next_full_patt)*
575
576	sub rdx, next_qp, start_next_qp
577	add rax, rax, rdx, lsr #`1` // rax += (next_qp - start_next_qp)/2
578	add rax, rax, #(`12`+`256`scale) // rax += bytes taken by the header + tags*
579
580	cmp rax, r11 // is the default packer the better option?
581	b.lt L_done_search
582
583	cmp r11, byte_budget // can the sparse packer fit into the given budget?
584	b.gt L_budgetExhausted
585
586	L_sparse_packer:
587	mov edx, #MZV_MAGIC
588	str edx, [dest_buf], #`4` // header to indicate a sparse packer
589
590	mov rdx, #`0` // rdx = byte offset in src of non-0 word
591	`1`:
592	ldr rax, [start_next_input_word, rdx] // rax = read dword
593	cbnz rax, `5f` // is dword != 0
594	`3`:
595	add rdx, rdx, #`8` // 8 more bytes have been processed
596	`4`:
597	cmp rdx, #(`4096`scale) // has the entire page been processed*
598	b.ne `1b`
599	mov x0, r11 // store the size of the compressed stream
600	b L_done
601
602	`5`:
603	cbz eax, `6f` // is lower word == 0
604	str eax, [dest_buf], #`4` // store the non-0 word in the dest buffer
605	strh edx, [dest_buf], #`2` // store the byte index
606	`6`:
607	lsr rax, rax, `32` // get the upper word into position
608	cbz eax, `3b` // is dword == 0
609	add rdx, rdx, #`4`
610	str eax, [dest_buf], #`4` // store the non-0 word in the dest buffer
611	strh edx, [dest_buf], #`2` // store the byte index
612	add rdx, rdx, #`4`
613	b `4b`
614
615	L_check_compression_ratio:
616
617	mov mode, NORMAL
618	mov remaining, #((`1024` - CHKPT_WORDS)scale) // remaining input words to process*
619	cbz remaining, CHECKPOINT // if there are no remaining words to process
620
621	sub rax, next_low_bits, start_next_low_bits // get bytes consumed by lower-10 bits
622	mov rdx, #`1365`
623	mul rax, rax, rdx
624
625	sub rdx, next_full_patt, start_next_full_patt // get bytes consumed by dictionary misses
626	add rax, rdx, rax, lsr #`11` // rax = 2/3(next_low_bits - start_next_low_bits) + (next_full_patt - start_next_full_patt)*
627
628	sub rdx, next_qp, start_next_qp
629	add rax, rax, rdx, lsr #`1` // rax += (next_qp - start_next_qp)/2
630	subs rax, rax, #((CHKPT_SHRUNK_BYTES - CHKPT_TAG_BYTES)*scale)
631	// rax += CHKPT_TAG_BYTES; rax -= CHKPT_SHRUNK_BYTES
632
633	b.gt L_budgetExhausted // if rax is > 0, we need to early abort
634	b L_loop // we are done
635

Browse the source code of xnu/osfmk/arm64/WKdmCompress_16k.s