xref: /sqlite-3.40.0/src/wal.c (revision f2fcd075)
/*
** 2010 February 1
**
** The author disclaims copyright to this source code.  In place of
** a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
*************************************************************************
**
** This file contains the implementation of a write-ahead log (WAL) used in
** "journal_mode=WAL" mode.
**
** WRITE-AHEAD LOG (WAL) FILE FORMAT
**
** A WAL file consists of a header followed by zero or more "frames".
** Each frame records the revised content of a single page from the
** database file.  All changes to the database are recorded by writing
** frames into the WAL.  Transactions commit when a frame is written that
** contains a commit marker.  A single WAL can and usually does record
** multiple transactions.  Periodically, the content of the WAL is
** transferred back into the database file in an operation called a
** "checkpoint".
**
** A single WAL file can be used multiple times.  In other words, the
** WAL can fill up with frames and then be checkpointed and then new
** frames can overwrite the old ones.  A WAL always grows from beginning
** toward the end.  Checksums and counters attached to each frame are
** used to determine which frames within the WAL are valid and which
** are leftovers from prior checkpoints.
**
** The WAL header is 32 bytes in size and consists of the following eight
** big-endian 32-bit unsigned integer values:
**
**     0: Magic number.  0x377f0682 or 0x377f0683
**     4: File format version.  Currently 3007000
**     8: Database page size.  Example: 1024
**    12: Checkpoint sequence number
**    16: Salt-1, random integer incremented with each checkpoint
**    20: Salt-2, a different random integer changing with each checkpoint
**    24: Checksum-1 (first part of checksum for first 24 bytes of header).
**    28: Checksum-2 (second part of checksum for first 24 bytes of header).
**
** Immediately following the wal-header are zero or more frames. Each
** frame consists of a 24-byte frame-header followed by <page-size> bytes
** of page data. The frame-header is six big-endian 32-bit unsigned
** integer values, as follows:
**
**     0: Page number.
**     4: For commit records, the size of the database image in pages
**        after the commit. For all other records, zero.
**     8: Salt-1 (copied from the header)
**    12: Salt-2 (copied from the header)
**    16: Checksum-1.
**    20: Checksum-2.
**
** A frame is considered valid if and only if the following conditions are
** true:
**
**    (1) The salt-1 and salt-2 values in the frame-header match
**        salt values in the wal-header
**
**    (2) The checksum values in the final 8 bytes of the frame-header
**        exactly match the checksum computed consecutively on the
**        first 24 bytes of the WAL header and then, for every frame up
**        to and including the current frame, the first 8 bytes of the
**        frame-header followed by the page content.
**
** The checksum is computed using 32-bit big-endian integers if the
** magic number in the first 4 bytes of the WAL is 0x377f0683 and it
** is computed using little-endian if the magic number is 0x377f0682.
** The checksum values are always stored in the frame header in a
** big-endian format regardless of which byte order is used to compute
** the checksum.  The checksum is computed by interpreting the input as
** an even number of unsigned 32-bit integers: x[0] through x[n-1].  The
** algorithm used for the checksum is as follows:
**
**   for i from 0 to n-1 step 2:
**     s0 += x[i] + s1;
**     s1 += x[i+1] + s0;
**   endfor
**
** Note that s0 and s1 are both weighted checksums using fibonacci weights
** in reverse order (the largest fibonacci weight occurs on the first element
** of the sequence being summed.)  The s1 value spans all 32-bit
** terms of the sequence whereas s0 omits the final term.
**
** On a checkpoint, the WAL is first VFS.xSync-ed, then valid content of the
** WAL is transferred into the database, then the database is VFS.xSync-ed.
** The VFS.xSync operations serve as write barriers - all writes launched
** before the xSync must complete before any write that launches after the
** xSync begins.
**
** After each checkpoint, the salt-1 value is incremented and the salt-2
** value is randomized.  This prevents old and new frames in the WAL from
** being considered valid at the same time and being checkpointed together
** following a crash.
**
** READER ALGORITHM
**
** To read a page from the database (call it page number P), a reader
** first checks the WAL to see if it contains page P.  If so, then the
** last valid instance of page P that is followed by a commit frame
** or is a commit frame itself becomes the value read.  If the WAL
** contains no copies of page P that are valid and which are a commit
** frame or are followed by a commit frame, then page P is read from
** the database file.
**
** To start a read transaction, the reader records the index of the last
** valid frame in the WAL.  The reader uses this recorded "mxFrame" value
** for all subsequent read operations.  New transactions can be appended
** to the WAL, but as long as the reader uses its original mxFrame value
** and ignores the newly appended content, it will see a consistent snapshot
** of the database from a single point in time.  This technique allows
** multiple concurrent readers to view different versions of the database
** content simultaneously.
**
** The reader algorithm in the previous paragraphs works correctly, but
** because frames for page P can appear anywhere within the WAL, the
** reader has to scan the entire WAL looking for page P frames.  If the
** WAL is large (multiple megabytes is typical) that scan can be slow,
** and read performance suffers.  To overcome this problem, a separate
** data structure called the wal-index is maintained to expedite the
** search for frames of a particular page.
**
** WAL-INDEX FORMAT
**
** Conceptually, the wal-index is shared memory, though VFS implementations
** might choose to implement the wal-index using a mmapped file.  Because
** the wal-index is shared memory, SQLite does not support journal_mode=WAL
** on a network filesystem.  All users of the database must be able to
** share memory.
**
** The wal-index is transient.  After a crash, the wal-index can (and should
** be) reconstructed from the original WAL file.  In fact, the VFS is required
** to either truncate or zero the header of the wal-index when the last
** connection to it closes.  Because the wal-index is transient, it can
** use an architecture-specific format; it does not have to be cross-platform.
** Hence, unlike the database and WAL file formats which store all values
** as big endian, the wal-index can store multi-byte values in the native
** byte order of the host computer.
**
** The purpose of the wal-index is to answer this question quickly:  Given
** a page number P, return the index of the last frame for page P in the WAL,
** or return NULL if there are no frames for page P in the WAL.
**
** The wal-index consists of a header region, followed by one or
** more index blocks.
**
** The wal-index header contains the total number of frames within the WAL
** in the mxFrame field.
**
** Each index block except for the first contains information on
** HASHTABLE_NPAGE frames. The first index block contains information on
** HASHTABLE_NPAGE_ONE frames. The values of HASHTABLE_NPAGE_ONE and
** HASHTABLE_NPAGE are selected so that together the wal-index header and
** first index block are the same size as all other index blocks in the
** wal-index.
**
** Each index block contains two sections, a page-mapping that contains the
** database page number associated with each wal frame, and a hash-table
** that allows readers to query an index block for a specific page number.
** The page-mapping is an array of HASHTABLE_NPAGE (or HASHTABLE_NPAGE_ONE
** for the first index block) 32-bit page numbers. The first entry in the
** first index-block contains the database page number corresponding to the
** first frame in the WAL file. The first entry in the second index block
** corresponds to the (HASHTABLE_NPAGE_ONE+1)th frame in the log, and
** so on.
**
** The last index block in a wal-index usually contains fewer than the full
** complement of HASHTABLE_NPAGE (or HASHTABLE_NPAGE_ONE) page-numbers,
** depending on the contents of the WAL file. This does not change the
** allocated size of the page-mapping array - the page-mapping array merely
** contains unused entries.
**
** Even without using the hash table, the last frame for page P
** can be found by scanning the page-mapping sections of each index block
** starting with the last index block and moving toward the first, and
** within each index block, starting at the end and moving toward the
** beginning.  The first entry that equals P corresponds to the frame
** holding the content for that page.
**
** The hash table consists of HASHTABLE_NSLOT 16-bit unsigned integers.
** HASHTABLE_NSLOT = 2*HASHTABLE_NPAGE, and there is one entry in the
** hash table for each page number in the mapping section, so the hash
** table is never more than half full.  The expected number of collisions
** prior to finding a match is 1.  Each entry of the hash table is a
** 1-based index of an entry in the mapping section of the same
** index block.   Let K be the 1-based index of the largest entry in
** the mapping section.  (For index blocks other than the last, K will
** always be exactly HASHTABLE_NPAGE (4096) and for the last index block
** K will be (mxFrame%HASHTABLE_NPAGE).)  Unused slots of the hash table
** contain a value of 0.
**
** To look for page P in the hash table, first compute a hash iKey on
** P as follows:
**
**      iKey = (P * 383) % HASHTABLE_NSLOT
**
** Then start scanning entries of the hash table, starting with iKey
** (wrapping around to the beginning when the end of the hash table is
** reached) until an unused hash slot is found. Let the first unused slot
** be at index iUnused.  (iUnused might be less than iKey if there was
** wrap-around.) Because the hash table is never more than half full,
** the search is guaranteed to eventually hit an unused entry.  Let
** iMax be the slot between iKey and iUnused, closest to iUnused,
** for which the mapping entry selected by aHash[iMax] equals P.  If
** there is no such slot then page P is not in the current index
** block.  Otherwise the aHash[iMax]-th mapping entry of the
** current index block corresponds to the last entry that references
** page P.
**
** A hash search begins with the last index block and moves toward the
** first index block, looking for entries corresponding to page P.  On
** average, only two or three slots in each index block need to be
** examined in order to either find the last entry for page P, or to
** establish that no such entry exists in the block.  Each index block
** holds over 4000 entries.  So two or three index blocks are sufficient
** to cover a typical 10 megabyte WAL file, assuming 1K pages.  8 or 10
** comparisons (on average) suffice to either locate a frame in the
** WAL or to establish that the frame does not exist in the WAL.  This
** is much faster than scanning the entire 10MB WAL.
**
** Note that entries are added in order of increasing K.  Hence, one
** reader might be using some value K0 and a second reader that started
** at a later time (after additional transactions were added to the WAL
** and to the wal-index) might be using a different value K1, where K1>K0.
** Both readers can use the same hash table and mapping section to get
** the correct result.  There may be entries in the hash table with
** K>K0 but to the first reader, those entries will appear to be unused
** slots in the hash table and so the first reader will get an answer as
** if no values greater than K0 had ever been inserted into the hash table
** in the first place - which is what reader one wants.  Meanwhile, the
** second reader using K1 will see additional values that were inserted
** later, which is exactly what reader two wants.
**
** When a rollback occurs, the value of K is decreased. Hash table entries
** that correspond to frames greater than the new K value are removed
** from the hash table at this point.
*/
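
/*
** Illustrative sketch, not part of SQLite: one way the 32-byte WAL header
** described in the comment above could be decoded.  The names SketchWalHdr,
** sketchGet4byte() and sketchDecodeWalHdr() are hypothetical and exist only
** for this example.  The two checksum fields are extracted but are not
** verified here.
*/

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory form of the eight big-endian header fields. */
typedef struct SketchWalHdr {
  uint32_t magic;      /* Offset  0: 0x377f0682 or 0x377f0683 */
  uint32_t version;    /* Offset  4: file format version */
  uint32_t pageSize;   /* Offset  8: database page size */
  uint32_t ckptSeq;    /* Offset 12: checkpoint sequence number */
  uint32_t aSalt[2];   /* Offset 16: salt-1 and salt-2 */
  uint32_t aCksum[2];  /* Offset 24: checksum-1 and checksum-2 */
} SketchWalHdr;

/* Read one big-endian 32-bit value from p[0..3]. */
static uint32_t sketchGet4byte(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  | (uint32_t)p[3];
}

/* Decode the 32 bytes at a[] into *pHdr.  Return 1 if the magic number
** is one of the two recognized values, or 0 otherwise. */
static int sketchDecodeWalHdr(const unsigned char *a, SketchWalHdr *pHdr){
  assert( a!=0 && pHdr!=0 );
  pHdr->magic     = sketchGet4byte(&a[0]);
  pHdr->version   = sketchGet4byte(&a[4]);
  pHdr->pageSize  = sketchGet4byte(&a[8]);
  pHdr->ckptSeq   = sketchGet4byte(&a[12]);
  pHdr->aSalt[0]  = sketchGet4byte(&a[16]);
  pHdr->aSalt[1]  = sketchGet4byte(&a[20]);
  pHdr->aCksum[0] = sketchGet4byte(&a[24]);
  pHdr->aCksum[1] = sketchGet4byte(&a[28]);
  return (pHdr->magic & 0xFFFFFFFEu)==0x377f0682u;
}

/* True if the low bit of the magic number selects big-endian checksums. */
static int sketchBigEndCksum(const SketchWalHdr *pHdr){
  return (int)(pHdr->magic & 1u);
}
```
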
#ifndef SQLITE_OMIT_WAL

#include "wal.h"

/*
** Trace output macros
*/
#if defined(SQLITE_TEST) && defined(SQLITE_DEBUG)
int sqlite3WalTrace = 0;
# define WALTRACE(X)  if(sqlite3WalTrace) sqlite3DebugPrintf X
#else
# define WALTRACE(X)
#endif

/*
** The maximum (and only) versions of the wal and wal-index formats
** that may be interpreted by this version of SQLite.
**
** If a client begins recovering a WAL file and finds that (a) the checksum
** values in the wal-header are correct and (b) the version field is not
** WAL_MAX_VERSION, recovery fails and SQLite returns SQLITE_CANTOPEN.
**
** Similarly, if a client successfully reads a wal-index header (i.e. the
** checksum test is successful) and finds that the version field is not
** WALINDEX_MAX_VERSION, then no read-transaction is opened and SQLite
** returns SQLITE_CANTOPEN.
*/
#define WAL_MAX_VERSION      3007000
#define WALINDEX_MAX_VERSION 3007000

/*
** Indices of various locking bytes.   WAL_NREADER is the number
** of available reader locks and should be at least 3.
*/
#define WAL_WRITE_LOCK         0
#define WAL_ALL_BUT_WRITE      1
#define WAL_CKPT_LOCK          1
#define WAL_RECOVER_LOCK       2
#define WAL_READ_LOCK(I)       (3+(I))
#define WAL_NREADER            (SQLITE_SHM_NLOCK-3)


/* Object declarations */
typedef struct WalIndexHdr WalIndexHdr;
typedef struct WalIterator WalIterator;
typedef struct WalCkptInfo WalCkptInfo;


/*
** The following object holds a copy of the wal-index header content.
**
** The actual header in the wal-index consists of two copies of this
** object.
**
** The szPage value can be any power of 2 between 512 and 32768, inclusive.
** Or it can be 1 to represent a 65536-byte page.  The latter case was
** added in 3.7.1 when support for 64K pages was added.
*/
struct WalIndexHdr {
  u32 iVersion;                   /* Wal-index version */
  u32 unused;                     /* Unused (padding) field */
  u32 iChange;                    /* Counter incremented each transaction */
  u8 isInit;                      /* 1 when initialized */
  u8 bigEndCksum;                 /* True if checksums in WAL are big-endian */
  u16 szPage;                     /* Database page size in bytes. 1==64K */
  u32 mxFrame;                    /* Index of last valid frame in the WAL */
  u32 nPage;                      /* Size of database in pages */
  u32 aFrameCksum[2];             /* Checksum of last frame in log */
  u32 aSalt[2];                   /* Two salt values copied from WAL header */
  u32 aCksum[2];                  /* Checksum over all prior fields */
};

/*
** A copy of the following object occurs in the wal-index immediately
** following the second copy of the WalIndexHdr.  This object stores
** information used by checkpoint.
**
** nBackfill is the number of frames in the WAL that have been written
** back into the database. (We call the act of moving content from WAL to
** database "backfilling".)  The nBackfill number is never greater than
** WalIndexHdr.mxFrame.  nBackfill can only be increased by threads
** holding the WAL_CKPT_LOCK lock (which includes a recovery thread).
** However, a WAL_WRITE_LOCK thread can move the value of nBackfill from
** mxFrame back to zero when the WAL is reset.
**
** There is one entry in aReadMark[] for each reader lock.  If a reader
** holds read-lock K, then the value in aReadMark[K] is no greater than
** the mxFrame for that reader.  The value READMARK_NOT_USED (0xffffffff)
** for any aReadMark[] means that entry is unused.  aReadMark[0] is
** a special case; its value is never used and it exists as a place-holder
** to avoid having to offset aReadMark[] indexes by one.  Readers holding
** WAL_READ_LOCK(0) always ignore the entire WAL and read all content
** directly from the database.
**
** The value of aReadMark[K] may only be changed by a thread that
** is holding an exclusive lock on WAL_READ_LOCK(K).  Thus, the value of
** aReadMark[K] cannot be changed while a reader is using that mark
** since the reader will be holding a shared lock on WAL_READ_LOCK(K).
**
** The checkpointer may only transfer frames from WAL to database where
** the frame numbers are less than or equal to every aReadMark[] that is
** in use (that is, every aReadMark[j] for which there is a corresponding
** WAL_READ_LOCK(j)).  New readers (usually) pick the aReadMark[] with the
** largest value and will increase an unused aReadMark[] to mxFrame if there
** is not already an aReadMark[] equal to mxFrame.  The exception to the
** previous sentence is when nBackfill equals mxFrame (meaning that everything
** in the WAL has been backfilled into the database) then new readers
** will choose aReadMark[0] which has value 0 and hence such readers will
** get all their content directly from the database file and ignore
** the WAL.
**
** Writers normally append new frames to the end of the WAL.  However,
** if nBackfill equals mxFrame (meaning that all WAL content has been
** written back into the database) and if no readers are using the WAL
** (in other words, if there are no WAL_READ_LOCK(i) where i>0) then
** the writer will first "reset" the WAL back to the beginning and start
** writing new content beginning at frame 1.
**
** We assume that 32-bit loads are atomic and so no locks are needed in
** order to read from any aReadMark[] entries.
*/
struct WalCkptInfo {
  u32 nBackfill;                  /* Number of WAL frames backfilled into DB */
  u32 aReadMark[WAL_NREADER];     /* Reader marks */
};
#define READMARK_NOT_USED  0xffffffff
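
/*
** Illustrative sketch, not part of SQLite: the checkpoint limit described
** in the comment above, reduced to plain arithmetic.  The function name
** sketchMxSafeFrame() and the aLocked[] array are hypothetical; aLocked[j]
** stands in for "some reader holds WAL_READ_LOCK(j)", which the real code
** would have to determine through the locking primitives rather than a
** flag array.  Entry 0 is skipped because readers on WAL_READ_LOCK(0)
** ignore the WAL entirely.
*/

```c
#include <assert.h>
#include <stdint.h>

/* Return the largest frame number that may be backfilled: mxFrame,
** lowered to the smallest aReadMark[j] (j>=1) that is currently in use. */
static uint32_t sketchMxSafeFrame(
  uint32_t mxFrame,              /* Last valid frame in the WAL */
  const uint32_t *aReadMark,     /* Reader marks, nReader entries */
  const unsigned char *aLocked,  /* Assumed: non-zero if mark j is in use */
  int nReader                    /* Number of entries in both arrays */
){
  uint32_t mxSafe = mxFrame;
  int j;
  assert( nReader>=1 );
  for(j=1; j<nReader; j++){
    if( aLocked[j] && aReadMark[j]<mxSafe ){
      mxSafe = aReadMark[j];
    }
  }
  return mxSafe;
}
```
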


/* A block of WALINDEX_LOCK_RESERVED bytes beginning at
** WALINDEX_LOCK_OFFSET is reserved for locks. Since some systems
** only support mandatory file-locks, we do not read or write data
** from the region of the file on which locks are applied.
*/
#define WALINDEX_LOCK_OFFSET   (sizeof(WalIndexHdr)*2 + sizeof(WalCkptInfo))
#define WALINDEX_LOCK_RESERVED 16
#define WALINDEX_HDR_SIZE      (WALINDEX_LOCK_OFFSET+WALINDEX_LOCK_RESERVED)

/* Size of header before each frame in wal */
#define WAL_FRAME_HDRSIZE 24

/* Size of write ahead log header, including checksum. */
/* #define WAL_HDRSIZE 24 */
#define WAL_HDRSIZE 32

/* WAL magic value. Either this value, or the same value with the least
** significant bit also set (WAL_MAGIC | 0x00000001) is stored in 32-bit
** big-endian format in the first 4 bytes of a WAL file.
**
** If the LSB is set, then the checksums for each frame within the WAL
** file are calculated by treating all data as an array of 32-bit
** big-endian words. Otherwise, they are calculated by interpreting
** all data as 32-bit little-endian words.
*/
#define WAL_MAGIC 0x377f0682

/*
** Return the offset of frame iFrame in the write-ahead log file,
** assuming a database page size of szPage bytes. The offset returned
** is to the start of the write-ahead log frame-header.
*/
#define walFrameOffset(iFrame, szPage) (                               \
  WAL_HDRSIZE + ((iFrame)-1)*(i64)((szPage)+WAL_FRAME_HDRSIZE)         \
)
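
/*
** Illustrative sketch, not part of SQLite: the same offset arithmetic as
** walFrameOffset(), written as a function, plus the inverse calculation
** (how many whole frames fit in a file of a given size).  The names
** sketchFrameOffset() and sketchFrameCount() and the SKETCH_* constants
** are hypothetical.  Frames are numbered from 1, so frame 1 starts right
** after the 32-byte WAL header.
*/

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_WAL_HDRSIZE    32   /* Size of the WAL header */
#define SKETCH_FRAME_HDRSIZE  24   /* Size of each frame-header */

/* Byte offset of the frame-header of frame iFrame (1-based). */
static int64_t sketchFrameOffset(uint32_t iFrame, uint32_t szPage){
  assert( iFrame>=1 );
  return SKETCH_WAL_HDRSIZE
       + (int64_t)(iFrame-1) * ((int64_t)szPage + SKETCH_FRAME_HDRSIZE);
}

/* Number of complete frames in a WAL file that is nFileByte bytes long. */
static uint32_t sketchFrameCount(int64_t nFileByte, uint32_t szPage){
  if( nFileByte<=SKETCH_WAL_HDRSIZE ) return 0;
  return (uint32_t)((nFileByte - SKETCH_WAL_HDRSIZE)
                    / ((int64_t)szPage + SKETCH_FRAME_HDRSIZE));
}
```
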

/*
** An open write-ahead log file is represented by an instance of the
** following object.
*/
struct Wal {
  sqlite3_vfs *pVfs;         /* The VFS used to create pDbFd */
  sqlite3_file *pDbFd;       /* File handle for the database file */
  sqlite3_file *pWalFd;      /* File handle for WAL file */
  u32 iCallback;             /* Value to pass to log callback (or 0) */
  int nWiData;               /* Size of array apWiData */
  volatile u32 **apWiData;   /* Pointer to wal-index content in memory */
  u32 szPage;                /* Database page size */
  i16 readLock;              /* Which read lock is being held.  -1 for none */
  u8 exclusiveMode;          /* Non-zero if connection is in exclusive mode */
  u8 writeLock;              /* True if in a write transaction */
  u8 ckptLock;               /* True if holding a checkpoint lock */
  u8 readOnly;               /* True if the WAL file is open read-only */
  WalIndexHdr hdr;           /* Wal-index header for current transaction */
  const char *zWalName;      /* Name of WAL file */
  u32 nCkpt;                 /* Checkpoint sequence counter in the wal-header */
#ifdef SQLITE_DEBUG
  u8 lockError;              /* True if a locking error has occurred */
#endif
};

/*
** Each page of the wal-index mapping contains a hash-table made up of
** an array of HASHTABLE_NSLOT elements of the following type.
*/
typedef u16 ht_slot;

/*
** This structure is used to implement an iterator that loops through
** all frames in the WAL in database page order. Where two or more frames
** correspond to the same database page, the iterator visits only the
** frame most recently written to the WAL (in other words, the frame with
** the largest index).
**
** The internals of this structure are only accessed by:
**
**   walIteratorInit() - Create a new iterator,
**   walIteratorNext() - Step an iterator,
**   walIteratorFree() - Free an iterator.
**
** This functionality is used by the checkpoint code (see walCheckpoint()).
*/
struct WalIterator {
  int iPrior;                     /* Last result returned from the iterator */
  int nSegment;                   /* Size of the aSegment[] array */
  struct WalSegment {
    int iNext;                    /* Next slot in aIndex[] not yet returned */
    ht_slot *aIndex;              /* i0, i1, i2... such that aPgno[iN] ascend */
    u32 *aPgno;                   /* Array of page numbers. */
    int nEntry;                   /* Max size of aPgno[] and aIndex[] arrays */
    int iZero;                    /* Frame number associated with aPgno[0] */
  } aSegment[1];                  /* One for every 32KB page in the wal-index */
};

/*
** Define the parameters of the hash tables in the wal-index file. There
** is a hash-table following every HASHTABLE_NPAGE page numbers in the
** wal-index.
**
** Changing any of these constants will alter the wal-index format and
** create incompatibilities.
*/
#define HASHTABLE_NPAGE      4096                 /* Must be power of 2 */
#define HASHTABLE_HASH_1     383                  /* Should be prime */
#define HASHTABLE_NSLOT      (HASHTABLE_NPAGE*2)  /* Must be a power of 2 */

/*
** The block of page numbers associated with the first hash-table in a
** wal-index is smaller than usual. This is so that there is a complete
** hash-table on each aligned 32KB page of the wal-index.
*/
#define HASHTABLE_NPAGE_ONE  (HASHTABLE_NPAGE - (WALINDEX_HDR_SIZE/sizeof(u32)))

/* The wal-index is divided into pages of WALINDEX_PGSZ bytes each. */
#define WALINDEX_PGSZ   (                                         \
    sizeof(ht_slot)*HASHTABLE_NSLOT + HASHTABLE_NPAGE*sizeof(u32) \
)
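
/*
** Illustrative sketch, not part of SQLite: a single index block with its
** page-mapping and hash table, using the hash function and linear probing
** described in the header comment of this file.  SketchBlock and the
** sketch*() functions are hypothetical names, and the layout is simplified
** (one full-size block, with the 1-based entry K stored at aPgno[K-1]).
** sketchLookup() takes the reader's own limit kMax, so entries appended
** after the reader's snapshot are skipped.
*/

```c
#include <assert.h>
#include <stdint.h>

#define SK_NPAGE   4096            /* Entries per index block */
#define SK_NSLOT   (SK_NPAGE*2)    /* Hash slots; never more than half full */
#define SK_HASH_1  383

typedef struct SketchBlock {
  uint32_t aPgno[SK_NPAGE];   /* Page-mapping: page number of entry K */
  uint16_t aHash[SK_NSLOT];   /* 0 if unused, else a 1-based entry index */
  uint32_t nEntry;            /* Number of entries in use (the current K) */
} SketchBlock;

static uint32_t sketchHash(uint32_t pgno){
  return (pgno*SK_HASH_1) & (SK_NSLOT-1);
}
static uint32_t sketchNextHash(uint32_t iKey){
  return (iKey+1) & (SK_NSLOT-1);
}

/* Append an entry for page pgno.  Return its 1-based index, 0 if full. */
static uint32_t sketchAppend(SketchBlock *p, uint32_t pgno){
  uint32_t iKey;
  if( p->nEntry>=SK_NPAGE ) return 0;
  p->nEntry++;
  p->aPgno[p->nEntry-1] = pgno;
  for(iKey=sketchHash(pgno); p->aHash[iKey]; iKey=sketchNextHash(iKey)){}
  p->aHash[iKey] = (uint16_t)p->nEntry;
  return p->nEntry;
}

/* Return the largest entry index K<=kMax that maps to page pgno, or 0. */
static uint32_t sketchLookup(const SketchBlock *p, uint32_t pgno, uint32_t kMax){
  uint32_t iKey, kFound = 0;
  for(iKey=sketchHash(pgno); p->aHash[iKey]; iKey=sketchNextHash(iKey)){
    uint32_t k = p->aHash[iKey];
    if( k<=kMax && p->aPgno[k-1]==pgno && k>kFound ) kFound = k;
  }
  return kFound;
}
```
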

/*
** Obtain a pointer to the iPage'th page of the wal-index. The wal-index
** is broken into pages of WALINDEX_PGSZ bytes. Wal-index pages are
** numbered from zero.
**
** If this call is successful, *ppPage is set to point to the wal-index
** page and SQLITE_OK is returned. If an error (an OOM or VFS error) occurs,
** then an SQLite error code is returned and *ppPage is set to 0.
*/
static int walIndexPage(Wal *pWal, int iPage, volatile u32 **ppPage){
  int rc = SQLITE_OK;

  /* Enlarge the pWal->apWiData[] array if required */
  if( pWal->nWiData<=iPage ){
    int nByte = sizeof(u32*)*(iPage+1);
    volatile u32 **apNew;
    apNew = (volatile u32 **)sqlite3_realloc((void *)pWal->apWiData, nByte);
    if( !apNew ){
      *ppPage = 0;
      return SQLITE_NOMEM;
    }
    memset((void*)&apNew[pWal->nWiData], 0,
           sizeof(u32*)*(iPage+1-pWal->nWiData));
    pWal->apWiData = apNew;
    pWal->nWiData = iPage+1;
  }

  /* Request a pointer to the required page from the VFS */
  if( pWal->apWiData[iPage]==0 ){
    rc = sqlite3OsShmMap(pWal->pDbFd, iPage, WALINDEX_PGSZ,
        pWal->writeLock, (void volatile **)&pWal->apWiData[iPage]
    );
  }

  *ppPage = pWal->apWiData[iPage];
  assert( iPage==0 || *ppPage || rc!=SQLITE_OK );
  return rc;
}

/*
** Return a pointer to the WalCkptInfo structure in the wal-index.
*/
static volatile WalCkptInfo *walCkptInfo(Wal *pWal){
  assert( pWal->nWiData>0 && pWal->apWiData[0] );
  return (volatile WalCkptInfo*)&(pWal->apWiData[0][sizeof(WalIndexHdr)/2]);
}

/*
** Return a pointer to the WalIndexHdr structure in the wal-index.
*/
static volatile WalIndexHdr *walIndexHdr(Wal *pWal){
  assert( pWal->nWiData>0 && pWal->apWiData[0] );
  return (volatile WalIndexHdr*)pWal->apWiData[0];
}

/*
** The argument to this macro must be of type u32. On a little-endian
** architecture, it returns the u32 value that results from interpreting
** the 4 bytes as a big-endian value. On a big-endian architecture, it
** returns the value that would be produced by interpreting the 4 bytes
** of the input value as a little-endian integer.
*/
#define BYTESWAP32(x) ( \
    (((x)&0x000000FF)<<24) + (((x)&0x0000FF00)<<8)  \
  + (((x)&0x00FF0000)>>8)  + (((x)&0xFF000000)>>24) \
)

/*
** Generate or extend an 8 byte checksum based on the data in
** array a[] and the initial values of aIn[0] and aIn[1] (or
** initial values of 0 and 0 if aIn==NULL).
**
** The checksum is written back into aOut[] before returning.
**
** nByte must be a positive multiple of 8.
*/
static void walChecksumBytes(
  int nativeCksum, /* True for native byte-order, false for non-native */
  u8 *a,           /* Content to be checksummed */
  int nByte,       /* Bytes of content in a[].  Must be a multiple of 8. */
  const u32 *aIn,  /* Initial checksum value input */
  u32 *aOut        /* OUT: Final checksum value output */
){
  u32 s1, s2;
  u32 *aData = (u32 *)a;
  u32 *aEnd = (u32 *)&a[nByte];

  if( aIn ){
    s1 = aIn[0];
    s2 = aIn[1];
  }else{
    s1 = s2 = 0;
  }

  assert( nByte>=8 );
  assert( (nByte&0x00000007)==0 );

  if( nativeCksum ){
    do {
      s1 += *aData++ + s2;
      s2 += *aData++ + s1;
    }while( aData<aEnd );
  }else{
    do {
      s1 += BYTESWAP32(aData[0]) + s2;
      s2 += BYTESWAP32(aData[1]) + s1;
      aData += 2;
    }while( aData<aEnd );
  }

  aOut[0] = s1;
  aOut[1] = s2;
}
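
/*
** Illustrative sketch, not part of SQLite: the same running checksum,
** written so that the byte order is chosen explicitly instead of being
** expressed as native versus non-native.  sketchLoad32() and
** sketchChecksum() are hypothetical names.  The two-element array s[]
** is both the starting value and the result, so the checksum can be
** extended across several buffers, which is how a frame checksum can
** continue from the one before it.
*/

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble a 32-bit value from p[0..3] in the requested byte order. */
static uint32_t sketchLoad32(const unsigned char *p, int bigEnd){
  if( bigEnd ){
    return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
         | ((uint32_t)p[2]<<8)  | (uint32_t)p[3];
  }
  return ((uint32_t)p[3]<<24) | ((uint32_t)p[2]<<16)
       | ((uint32_t)p[1]<<8)  | (uint32_t)p[0];
}

/* Extend the checksum in s[0],s[1] over nByte bytes of a[].
** nByte must be a positive multiple of 8. */
static void sketchChecksum(
  int bigEnd,               /* True to read words as big-endian */
  const unsigned char *a,   /* Content to be checksummed */
  size_t nByte,             /* Bytes of content in a[] */
  uint32_t s[2]             /* IN/OUT: running checksum */
){
  size_t i;
  assert( nByte>=8 && (nByte&7)==0 );
  for(i=0; i<nByte; i+=8){
    s[0] += sketchLoad32(&a[i], bigEnd) + s[1];
    s[1] += sketchLoad32(&a[i+4], bigEnd) + s[0];
  }
}
```
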

/*
** Write the header information in pWal->hdr into the wal-index.
**
** The checksum on pWal->hdr is updated before it is written.
*/
static void walIndexWriteHdr(Wal *pWal){
  volatile WalIndexHdr *aHdr = walIndexHdr(pWal);
  const int nCksum = offsetof(WalIndexHdr, aCksum);

  assert( pWal->writeLock );
  pWal->hdr.isInit = 1;
  pWal->hdr.iVersion = WALINDEX_MAX_VERSION;
  walChecksumBytes(1, (u8*)&pWal->hdr, nCksum, 0, pWal->hdr.aCksum);
  memcpy((void *)&aHdr[1], (void *)&pWal->hdr, sizeof(WalIndexHdr));
  sqlite3OsShmBarrier(pWal->pDbFd);
  memcpy((void *)&aHdr[0], (void *)&pWal->hdr, sizeof(WalIndexHdr));
}

/*
** This function encodes a single frame header and writes it to a buffer
** supplied by the caller. A frame-header is made up of a series of
** 4-byte big-endian integers, as follows:
**
**     0: Page number.
**     4: For commit records, the size of the database image in pages
**        after the commit. For all other records, zero.
**     8: Salt-1 (copied from the wal-header)
**    12: Salt-2 (copied from the wal-header)
**    16: Checksum-1.
**    20: Checksum-2.
*/
static void walEncodeFrame(
  Wal *pWal,                      /* The write-ahead log */
  u32 iPage,                      /* Database page number for frame */
  u32 nTruncate,                  /* New db size (or 0 for non-commit frames) */
  u8 *aData,                      /* Pointer to page data */
  u8 *aFrame                      /* OUT: Write encoded frame here */
){
  int nativeCksum;                /* True for native byte-order checksums */
  u32 *aCksum = pWal->hdr.aFrameCksum;
  assert( WAL_FRAME_HDRSIZE==24 );
  sqlite3Put4byte(&aFrame[0], iPage);
  sqlite3Put4byte(&aFrame[4], nTruncate);
  memcpy(&aFrame[8], pWal->hdr.aSalt, 8);

  nativeCksum = (pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN);
  walChecksumBytes(nativeCksum, aFrame, 8, aCksum, aCksum);
  walChecksumBytes(nativeCksum, aData, pWal->szPage, aCksum, aCksum);

  sqlite3Put4byte(&aFrame[16], aCksum[0]);
  sqlite3Put4byte(&aFrame[20], aCksum[1]);
}
654 
655 /*
656 ** Check to see if the frame with header in aFrame[] and content
657 ** in aData[] is valid.  If it is a valid frame, fill *piPage and
658 ** *pnTruncate and return true.  Return if the frame is not valid.
659 */
660 static int walDecodeFrame(
661   Wal *pWal,                      /* The write-ahead log */
662   u32 *piPage,                    /* OUT: Database page number for frame */
663   u32 *pnTruncate,                /* OUT: New db size (or 0 if not commit) */
664   u8 *aData,                      /* Pointer to page data (for checksum) */
665   u8 *aFrame                      /* Frame data */
666 ){
667   int nativeCksum;                /* True for native byte-order checksums */
668   u32 *aCksum = pWal->hdr.aFrameCksum;
669   u32 pgno;                       /* Page number of the frame */
670   assert( WAL_FRAME_HDRSIZE==24 );
671 
672   /* A frame is only valid if the salt values in the frame-header
673   ** match the salt values in the wal-header.
674   */
675   if( memcmp(&pWal->hdr.aSalt, &aFrame[8], 8)!=0 ){
676     return 0;
677   }
678 
679   /* A frame is only valid if the page number is greater than zero.
680   */
681   pgno = sqlite3Get4byte(&aFrame[0]);
682   if( pgno==0 ){
683     return 0;
684   }
685 
686   /* A frame is only valid if a checksum of the WAL header,
687   ** all prior frames, the first 16 bytes of this frame-header,
688   ** and the frame-data matches the checksum in the last 8
689   ** bytes of this frame-header.
690   */
691   nativeCksum = (pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN);
692   walChecksumBytes(nativeCksum, aFrame, 8, aCksum, aCksum);
693   walChecksumBytes(nativeCksum, aData, pWal->szPage, aCksum, aCksum);
694   if( aCksum[0]!=sqlite3Get4byte(&aFrame[16])
695    || aCksum[1]!=sqlite3Get4byte(&aFrame[20])
696   ){
697     /* Checksum failed. */
698     return 0;
699   }
700 
701   /* If we reach this point, the frame is valid.  Return the page number
702   ** and the new database size.
703   */
704   *piPage = pgno;
705   *pnTruncate = sqlite3Get4byte(&aFrame[4]);
706   return 1;
707 }
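
/*
** Illustrative sketch, not part of wal.c: the first 16 bytes of the
** 24-byte frame header as walEncodeFrame() and walDecodeFrame() lay them
** out (page number, commit size, salt), leaving out the checksum step.
** The names put4(), get4(), demoEncodeHdr() and demoDecodeHdr() are
** hypothetical.  put4()/get4() assume that sqlite3Put4byte() and
** sqlite3Get4byte() store big-endian values, matching the WAL header
** format described at the top of this file.
*/

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_FRAME_HDRSIZE 24   /* Mirrors WAL_FRAME_HDRSIZE in the listing */

/* Hypothetical stand-in for sqlite3Put4byte(): store v big-endian. */
static void put4(uint8_t *p, uint32_t v){
  p[0] = (uint8_t)(v>>24);
  p[1] = (uint8_t)(v>>16);
  p[2] = (uint8_t)(v>>8);
  p[3] = (uint8_t)v;
}

/* Hypothetical stand-in for sqlite3Get4byte(): load a big-endian value. */
static uint32_t get4(const uint8_t *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  | (uint32_t)p[3];
}

/* Fill bytes 0..15 of a frame header.  Bytes 16..23 (checksum) are left
** untouched because the checksum routine is outside this sketch. */
static void demoEncodeHdr(
  uint8_t *aFrame,                /* OUT: 24-byte frame header buffer */
  uint32_t iPage,                 /* Database page number */
  uint32_t nTruncate,             /* New db size, or 0 for non-commit frames */
  const uint8_t *aSalt            /* 8 bytes of salt from the WAL header */
){
  assert( DEMO_FRAME_HDRSIZE==24 );
  put4(&aFrame[0], iPage);
  put4(&aFrame[4], nTruncate);
  memcpy(&aFrame[8], aSalt, 8);
}

/* Return 1 and fill the outputs if the salt matches and the page number
** is non-zero.  Return 0 otherwise.  No checksum verification is done. */
static int demoDecodeHdr(
  const uint8_t *aFrame,          /* 24-byte frame header */
  const uint8_t *aSalt,           /* Salt expected from the WAL header */
  uint32_t *piPage,               /* OUT: page number */
  uint32_t *pnTruncate            /* OUT: commit size (0 if not a commit) */
){
  uint32_t pgno;
  if( memcmp(aSalt, &aFrame[8], 8)!=0 ) return 0;
  pgno = get4(&aFrame[0]);
  if( pgno==0 ) return 0;
  *piPage = pgno;
  *pnTruncate = get4(&aFrame[4]);
  return 1;
}
```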
708 
709 
710 #if defined(SQLITE_TEST) && defined(SQLITE_DEBUG)
711 /*
712 ** Names of locks.  This routine is used to provide debugging output and is not
713 ** a part of an ordinary build.
714 */
715 static const char *walLockName(int lockIdx){
716   if( lockIdx==WAL_WRITE_LOCK ){
717     return "WRITE-LOCK";
718   }else if( lockIdx==WAL_CKPT_LOCK ){
719     return "CKPT-LOCK";
720   }else if( lockIdx==WAL_RECOVER_LOCK ){
721     return "RECOVER-LOCK";
722   }else{
723     static char zName[15];
724     sqlite3_snprintf(sizeof(zName), zName, "READ-LOCK[%d]",
725                      lockIdx-WAL_READ_LOCK(0));
726     return zName;
727   }
728 }
729 #endif /* defined(SQLITE_TEST) && defined(SQLITE_DEBUG) */
730 
731 
732 /*
733 ** Set or release locks on the WAL.  Locks are either shared or exclusive.
734 ** A lock cannot be moved directly between shared and exclusive - it must go
735 ** through the unlocked state first.
736 **
737 ** In locking_mode=EXCLUSIVE, all of these routines become no-ops.
738 */
739 static int walLockShared(Wal *pWal, int lockIdx){
740   int rc;
741   if( pWal->exclusiveMode ) return SQLITE_OK;
742   rc = sqlite3OsShmLock(pWal->pDbFd, lockIdx, 1,
743                         SQLITE_SHM_LOCK | SQLITE_SHM_SHARED);
744   WALTRACE(("WAL%p: acquire SHARED-%s %s\n", pWal,
745             walLockName(lockIdx), rc ? "failed" : "ok"));
746   VVA_ONLY( pWal->lockError = (u8)(rc!=SQLITE_OK && rc!=SQLITE_BUSY); )
747   return rc;
748 }
749 static void walUnlockShared(Wal *pWal, int lockIdx){
750   if( pWal->exclusiveMode ) return;
751   (void)sqlite3OsShmLock(pWal->pDbFd, lockIdx, 1,
752                          SQLITE_SHM_UNLOCK | SQLITE_SHM_SHARED);
753   WALTRACE(("WAL%p: release SHARED-%s\n", pWal, walLockName(lockIdx)));
754 }
755 static int walLockExclusive(Wal *pWal, int lockIdx, int n){
756   int rc;
757   if( pWal->exclusiveMode ) return SQLITE_OK;
758   rc = sqlite3OsShmLock(pWal->pDbFd, lockIdx, n,
759                         SQLITE_SHM_LOCK | SQLITE_SHM_EXCLUSIVE);
760   WALTRACE(("WAL%p: acquire EXCLUSIVE-%s cnt=%d %s\n", pWal,
761             walLockName(lockIdx), n, rc ? "failed" : "ok"));
762   VVA_ONLY( pWal->lockError = (u8)(rc!=SQLITE_OK && rc!=SQLITE_BUSY); )
763   return rc;
764 }
765 static void walUnlockExclusive(Wal *pWal, int lockIdx, int n){
766   if( pWal->exclusiveMode ) return;
767   (void)sqlite3OsShmLock(pWal->pDbFd, lockIdx, n,
768                          SQLITE_SHM_UNLOCK | SQLITE_SHM_EXCLUSIVE);
769   WALTRACE(("WAL%p: release EXCLUSIVE-%s cnt=%d\n", pWal,
770              walLockName(lockIdx), n));
771 }
772 
773 /*
774 ** Compute a hash on a page number.  The resulting hash value must land
775 ** between 0 and (HASHTABLE_NSLOT-1).  The walNextHash() function advances
776 ** the hash to the next value in the event of a collision.
777 */
778 static int walHash(u32 iPage){
779   assert( iPage>0 );
780   assert( (HASHTABLE_NSLOT & (HASHTABLE_NSLOT-1))==0 );
781   return (iPage*HASHTABLE_HASH_1) & (HASHTABLE_NSLOT-1);
782 }
783 static int walNextHash(int iPriorHash){
784   return (iPriorHash+1)&(HASHTABLE_NSLOT-1);
785 }
786 
787 /*
788 ** Return pointers to the hash table and page number array stored on
789 ** page iHash of the wal-index. The wal-index is broken into 32KB pages
790 ** numbered starting from 0.
791 **
792 ** Set output variable *paHash to point to the start of the hash table
793 ** in the wal-index file. Set *piZero to one less than the frame
794 ** number of the first frame indexed by this hash table. If a
795 ** slot in the hash table is set to N, it refers to frame number
796 ** (*piZero+N) in the log.
797 **
798 ** Finally, set *paPgno so that *paPgno[1] is the page number of the
799 ** first frame indexed by the hash table, frame (*piZero+1).
800 */
801 static int walHashGet(
802   Wal *pWal,                      /* WAL handle */
803   int iHash,                      /* Find the iHash'th table */
804   volatile ht_slot **paHash,      /* OUT: Pointer to hash index */
805   volatile u32 **paPgno,          /* OUT: Pointer to page number array */
806   u32 *piZero                     /* OUT: Frame associated with *paPgno[0] */
807 ){
808   int rc;                         /* Return code */
809   volatile u32 *aPgno;
810 
811   rc = walIndexPage(pWal, iHash, &aPgno);
812   assert( rc==SQLITE_OK || iHash>0 );
813 
814   if( rc==SQLITE_OK ){
815     u32 iZero;
816     volatile ht_slot *aHash;
817 
818     aHash = (volatile ht_slot *)&aPgno[HASHTABLE_NPAGE];
819     if( iHash==0 ){
820       aPgno = &aPgno[WALINDEX_HDR_SIZE/sizeof(u32)];
821       iZero = 0;
822     }else{
823       iZero = HASHTABLE_NPAGE_ONE + (iHash-1)*HASHTABLE_NPAGE;
824     }
825 
826     *paPgno = &aPgno[-1];
827     *paHash = aHash;
828     *piZero = iZero;
829   }
830   return rc;
831 }
832 
833 /*
834 ** Return the number of the wal-index page that contains the hash-table
835 ** and page-number array that contain entries corresponding to WAL frame
836 ** iFrame. The wal-index is broken up into 32KB pages. Wal-index pages
837 ** are numbered starting from 0.
838 */
839 static int walFramePage(u32 iFrame){
840   int iHash = (iFrame+HASHTABLE_NPAGE-HASHTABLE_NPAGE_ONE-1) / HASHTABLE_NPAGE;
841   assert( (iHash==0 || iFrame>HASHTABLE_NPAGE_ONE)
842        && (iHash>=1 || iFrame<=HASHTABLE_NPAGE_ONE)
843        && (iHash<=1 || iFrame>(HASHTABLE_NPAGE_ONE+HASHTABLE_NPAGE))
844        && (iHash>=2 || iFrame<=HASHTABLE_NPAGE_ONE+HASHTABLE_NPAGE)
845        && (iHash<=2 || iFrame>(HASHTABLE_NPAGE_ONE+2*HASHTABLE_NPAGE))
846   );
847   return iHash;
848 }
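
/*
** Illustrative sketch, not part of wal.c: the arithmetic that maps a frame
** number to a wal-index page, plus the matching iZero value computed by
** walHashGet().  The constants are assumptions, since their definitions
** are not shown in this part of the file: 4096 page-number slots per
** wal-index page, and 4062 on page 0 because the wal-index header is
** assumed to occupy 34 of them.  demoFramePage() and demoPageZero() are
** hypothetical names.
*/

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NPAGE      4096   /* Assumed value of HASHTABLE_NPAGE */
#define DEMO_NPAGE_ONE  4062   /* Assumed value of HASHTABLE_NPAGE_ONE */

/* Wal-index page (numbered from 0) that indexes frame iFrame. */
static int demoFramePage(uint32_t iFrame){
  assert( iFrame>0 );
  return (int)((iFrame + DEMO_NPAGE - DEMO_NPAGE_ONE - 1) / DEMO_NPAGE);
}

/* One less than the first frame number indexed by wal-index page iHash. */
static uint32_t demoPageZero(int iHash){
  assert( iHash>=0 );
  if( iHash==0 ) return 0;
  return DEMO_NPAGE_ONE + (uint32_t)(iHash-1)*DEMO_NPAGE;
}
```

/*
** Under these assumptions frames 1..4062 land on page 0, frames 4063..8158
** on page 1, and frame 8159 is the first frame on page 2.  Subtracting
** demoPageZero() from a frame number gives its 1-based position in that
** page's aPgno[] array.
*/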
849 
850 /*
851 ** Return the page number associated with frame iFrame in this WAL.
852 */
853 static u32 walFramePgno(Wal *pWal, u32 iFrame){
854   int iHash = walFramePage(iFrame);
855   if( iHash==0 ){
856     return pWal->apWiData[0][WALINDEX_HDR_SIZE/sizeof(u32) + iFrame - 1];
857   }
858   return pWal->apWiData[iHash][(iFrame-1-HASHTABLE_NPAGE_ONE)%HASHTABLE_NPAGE];
859 }
860 
861 /*
862 ** Remove entries from the hash table that point to WAL slots greater
863 ** than pWal->hdr.mxFrame.
864 **
865 ** This function is called whenever pWal->hdr.mxFrame is decreased due
866 ** to a rollback or savepoint.
867 **
868 ** At most only the hash table containing pWal->hdr.mxFrame needs to be
869 ** updated.  Any later hash tables will be automatically cleared when
870 ** pWal->hdr.mxFrame advances to the point where those hash tables are
871 ** actually needed.
872 */
873 static void walCleanupHash(Wal *pWal){
874   volatile ht_slot *aHash = 0;    /* Pointer to hash table to clear */
875   volatile u32 *aPgno = 0;        /* Page number array for hash table */
876   u32 iZero = 0;                  /* frame == (aHash[x]+iZero) */
877   int iLimit = 0;                 /* Zero values greater than this */
878   int nByte;                      /* Number of bytes to zero in aPgno[] */
879   int i;                          /* Used to iterate through aHash[] */
880 
881   assert( pWal->writeLock );
882   testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE-1 );
883   testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE );
884   testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE+1 );
885 
886   if( pWal->hdr.mxFrame==0 ) return;
887 
888   /* Obtain pointers to the hash-table and page-number array containing
889   ** the entry that corresponds to frame pWal->hdr.mxFrame. It is guaranteed
890   ** that the page said hash-table and array reside on is already mapped.
891   */
892   assert( pWal->nWiData>walFramePage(pWal->hdr.mxFrame) );
893   assert( pWal->apWiData[walFramePage(pWal->hdr.mxFrame)] );
894   walHashGet(pWal, walFramePage(pWal->hdr.mxFrame), &aHash, &aPgno, &iZero);
895 
896   /* Zero all hash-table entries that correspond to frame numbers greater
897   ** than pWal->hdr.mxFrame.
898   */
899   iLimit = pWal->hdr.mxFrame - iZero;
900   assert( iLimit>0 );
901   for(i=0; i<HASHTABLE_NSLOT; i++){
902     if( aHash[i]>iLimit ){
903       aHash[i] = 0;
904     }
905   }
906 
907   /* Zero the entries in the aPgno array that correspond to frames with
908   ** frame numbers greater than pWal->hdr.mxFrame.
909   */
910   nByte = (int)((char *)aHash - (char *)&aPgno[iLimit+1]);
911   memset((void *)&aPgno[iLimit+1], 0, nByte);
912 
913 #ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT
914   /* Verify that every entry in the mapping region is still reachable
915   ** via the hash table even after the cleanup.
916   */
917   if( iLimit ){
918     int i;           /* Loop counter */
919     int iKey;        /* Hash key */
920     for(i=1; i<=iLimit; i++){
921       for(iKey=walHash(aPgno[i]); aHash[iKey]; iKey=walNextHash(iKey)){
922         if( aHash[iKey]==i ) break;
923       }
924       assert( aHash[iKey]==i );
925     }
926   }
927 #endif /* SQLITE_ENABLE_EXPENSIVE_ASSERT */
928 }
929 
930 
931 /*
932 ** Set an entry in the wal-index that will map database page number
933 ** iPage into WAL frame iFrame.
934 */
935 static int walIndexAppend(Wal *pWal, u32 iFrame, u32 iPage){
936   int rc;                         /* Return code */
937   u32 iZero = 0;                  /* One less than frame number of aPgno[1] */
938   volatile u32 *aPgno = 0;        /* Page number array */
939   volatile ht_slot *aHash = 0;    /* Hash table */
940 
941   rc = walHashGet(pWal, walFramePage(iFrame), &aHash, &aPgno, &iZero);
942 
943   /* Assuming the wal-index file was successfully mapped, populate the
944   ** page number array and hash table entry.
945   */
946   if( rc==SQLITE_OK ){
947     int iKey;                     /* Hash table key */
948     int idx;                      /* Value to write to hash-table slot */
949     int nCollide;                 /* Number of hash collisions */
950 
951     idx = iFrame - iZero;
952     assert( idx <= HASHTABLE_NSLOT/2 + 1 );
953 
954     /* If this is the first entry to be added to this hash-table, zero the
955     ** entire hash table and aPgno[] array before proceeding.
956     */
957     if( idx==1 ){
958       int nByte = (int)((u8 *)&aHash[HASHTABLE_NSLOT] - (u8 *)&aPgno[1]);
959       memset((void*)&aPgno[1], 0, nByte);
960     }
961 
962     /* If the entry in aPgno[] is already set, then the previous writer
963     ** must have exited unexpectedly in the middle of a transaction (after
964     ** writing one or more dirty pages to the WAL to free up memory).
965     ** Remove the remnants of that writer's uncommitted transaction from
966     ** the hash-table before writing any new entries.
967     */
968     if( aPgno[idx] ){
969       walCleanupHash(pWal);
970       assert( !aPgno[idx] );
971     }
972 
973     /* Write the aPgno[] array entry and the hash-table slot. */
974     nCollide = idx;
975     for(iKey=walHash(iPage); aHash[iKey]; iKey=walNextHash(iKey)){
976       if( (nCollide--)==0 ) return SQLITE_CORRUPT_BKPT;
977     }
978     aPgno[idx] = iPage;
979     aHash[iKey] = (ht_slot)idx;
980 
981 #ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT
982     /* Verify that the number of entries in the hash table exactly equals
983     ** the number of entries in the mapping region.
984     */
985     {
986       int i;           /* Loop counter */
987       int nEntry = 0;  /* Number of entries in the hash table */
988       for(i=0; i<HASHTABLE_NSLOT; i++){ if( aHash[i] ) nEntry++; }
989       assert( nEntry==idx );
990     }
991 
992     /* Verify that every entry in the mapping region is reachable
993     ** via the hash table.  This turns out to be a really, really expensive
994     ** thing to check, so only do this occasionally - not on every
995     ** iteration.
996     */
997     if( (idx&0x3ff)==0 ){
998       int i;           /* Loop counter */
999       for(i=1; i<=idx; i++){
1000         for(iKey=walHash(aPgno[i]); aHash[iKey]; iKey=walNextHash(iKey)){
1001           if( aHash[iKey]==i ) break;
1002         }
1003         assert( aHash[iKey]==i );
1004       }
1005     }
1006 #endif /* SQLITE_ENABLE_EXPENSIVE_ASSERT */
1007   }
1008 
1009 
1010   return rc;
1011 }
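
/*
** Illustrative sketch, not part of wal.c: the open-addressing scheme used
** by walHash(), walNextHash() and walIndexAppend(), reduced to a tiny
** in-memory table.  The slot count (16) and multiplier (383) are
** assumptions chosen for the demonstration.  The DemoIndex type and the
** demoAppend() and demoFind() helpers are hypothetical.  demoFind() shows
** one plausible way a reader could look up the most recent entry for a
** page; it is not a copy of any routine in this file.
*/

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NSLOT   16     /* Assumed slot count.  Must be a power of two */
#define DEMO_HASH_1  383    /* Assumed hash multiplier */

typedef struct DemoIndex {
  uint32_t aPgno[DEMO_NSLOT/2 + 1];  /* aPgno[1..nEntry]; entry 0 unused */
  uint16_t aHash[DEMO_NSLOT];        /* 0 is empty, else index into aPgno[] */
  int nEntry;                        /* Number of entries appended so far */
} DemoIndex;

static int demoHash(uint32_t iPage){
  assert( iPage>0 );
  return (int)((iPage*DEMO_HASH_1) & (DEMO_NSLOT-1));
}
static int demoNextHash(int iPriorHash){
  return (iPriorHash+1) & (DEMO_NSLOT-1);
}

/* Append iPage as the next entry.  Return 0 on success, or -1 if the
** table is half full already or the probe limit is exceeded. */
static int demoAppend(DemoIndex *p, uint32_t iPage){
  int idx = p->nEntry + 1;        /* Value to write to the hash slot */
  int nCollide = idx;             /* Upper bound on collisions */
  int iKey;
  if( idx>DEMO_NSLOT/2 ) return -1;
  for(iKey=demoHash(iPage); p->aHash[iKey]; iKey=demoNextHash(iKey)){
    if( (nCollide--)==0 ) return -1;
  }
  p->aPgno[idx] = iPage;
  p->aHash[iKey] = (uint16_t)idx;
  p->nEntry = idx;
  return 0;
}

/* Return the largest entry index that holds iPage, or 0 if none does.
** The loop stops at the first empty slot, which always exists because
** the table is never more than half full. */
static int demoFind(const DemoIndex *p, uint32_t iPage){
  int iKey;
  int iBest = 0;
  for(iKey=demoHash(iPage); p->aHash[iKey]; iKey=demoNextHash(iKey)){
    int idx = p->aHash[iKey];
    if( p->aPgno[idx]==iPage && idx>iBest ) iBest = idx;
  }
  return iBest;
}
```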
1012 
1013 
1014 /*
1015 ** Recover the wal-index by reading the write-ahead log file.
1016 **
1017 ** This routine first tries to establish an exclusive lock on the
1018 ** wal-index to prevent other threads/processes from doing anything
1019 ** with the WAL or wal-index while recovery is running.  The
1020 ** WAL_RECOVER_LOCK is also held so that other threads will know
1021 ** that this thread is running recovery.  If unable to establish
1022 ** the necessary locks, this routine returns SQLITE_BUSY.
1023 */
1024 static int walIndexRecover(Wal *pWal){
1025   int rc;                         /* Return Code */
1026   i64 nSize;                      /* Size of log file */
1027   u32 aFrameCksum[2] = {0, 0};
1028   int iLock;                      /* Lock offset to lock for checkpoint */
1029   int nLock;                      /* Number of locks to hold */
1030 
1031   /* Obtain an exclusive lock on all bytes in the locking range not already
1032   ** locked by the caller. The caller is guaranteed to have locked the
1033   ** WAL_WRITE_LOCK byte, and may have also locked the WAL_CKPT_LOCK byte.
1034   ** If successful, the same bytes that are locked here are unlocked before
1035   ** this function returns.
1036   */
1037   assert( pWal->ckptLock==1 || pWal->ckptLock==0 );
1038   assert( WAL_ALL_BUT_WRITE==WAL_WRITE_LOCK+1 );
1039   assert( WAL_CKPT_LOCK==WAL_ALL_BUT_WRITE );
1040   assert( pWal->writeLock );
1041   iLock = WAL_ALL_BUT_WRITE + pWal->ckptLock;
1042   nLock = SQLITE_SHM_NLOCK - iLock;
1043   rc = walLockExclusive(pWal, iLock, nLock);
1044   if( rc ){
1045     return rc;
1046   }
1047   WALTRACE(("WAL%p: recovery begin...\n", pWal));
1048 
1049   memset(&pWal->hdr, 0, sizeof(WalIndexHdr));
1050 
1051   rc = sqlite3OsFileSize(pWal->pWalFd, &nSize);
1052   if( rc!=SQLITE_OK ){
1053     goto recovery_error;
1054   }
1055 
1056   if( nSize>WAL_HDRSIZE ){
1057     u8 aBuf[WAL_HDRSIZE];         /* Buffer to load WAL header into */
1058     u8 *aFrame = 0;               /* Malloc'd buffer to load entire frame */
1059     int szFrame;                  /* Number of bytes in buffer aFrame[] */
1060     u8 *aData;                    /* Pointer to data part of aFrame buffer */
1061     int iFrame;                   /* Index of last frame read */
1062     i64 iOffset;                  /* Next offset to read from log file */
1063     int szPage;                   /* Page size according to the log */
1064     u32 magic;                    /* Magic value read from WAL header */
1065     u32 version;                  /* Version number read from WAL header */
1066 
1067     /* Read in the WAL header. */
1068     rc = sqlite3OsRead(pWal->pWalFd, aBuf, WAL_HDRSIZE, 0);
1069     if( rc!=SQLITE_OK ){
1070       goto recovery_error;
1071     }
1072 
1073     /* If the database page size is not a power of two, is less than 512,
1074     ** or is greater than SQLITE_MAX_PAGE_SIZE, conclude that the WAL file
1075     ** contains no valid data. Similarly, if the 'magic' value is invalid,
1076     ** ignore the whole WAL file.
1077     */
1078     magic = sqlite3Get4byte(&aBuf[0]);
1079     szPage = sqlite3Get4byte(&aBuf[8]);
1080     if( (magic&0xFFFFFFFE)!=WAL_MAGIC
1081      || szPage&(szPage-1)
1082      || szPage>SQLITE_MAX_PAGE_SIZE
1083      || szPage<512
1084     ){
1085       goto finished;
1086     }
1087     pWal->hdr.bigEndCksum = (u8)(magic&0x00000001);
1088     pWal->szPage = szPage;
1089     pWal->nCkpt = sqlite3Get4byte(&aBuf[12]);
1090     memcpy(&pWal->hdr.aSalt, &aBuf[16], 8);
1091 
1092     /* Verify that the WAL header checksum is correct */
1093     walChecksumBytes(pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN,
1094         aBuf, WAL_HDRSIZE-2*4, 0, pWal->hdr.aFrameCksum
1095     );
1096     if( pWal->hdr.aFrameCksum[0]!=sqlite3Get4byte(&aBuf[24])
1097      || pWal->hdr.aFrameCksum[1]!=sqlite3Get4byte(&aBuf[28])
1098     ){
1099       goto finished;
1100     }
1101 
1102     /* Verify that the version number on the WAL format is one that we
1103     ** are able to understand */
1104     version = sqlite3Get4byte(&aBuf[4]);
1105     if( version!=WAL_MAX_VERSION ){
1106       rc = SQLITE_CANTOPEN_BKPT;
1107       goto finished;
1108     }
1109 
1110     /* Malloc a buffer to read frames into. */
1111     szFrame = szPage + WAL_FRAME_HDRSIZE;
1112     aFrame = (u8 *)sqlite3_malloc(szFrame);
1113     if( !aFrame ){
1114       rc = SQLITE_NOMEM;
1115       goto recovery_error;
1116     }
1117     aData = &aFrame[WAL_FRAME_HDRSIZE];
1118 
1119     /* Read all frames from the log file. */
1120     iFrame = 0;
1121     for(iOffset=WAL_HDRSIZE; (iOffset+szFrame)<=nSize; iOffset+=szFrame){
1122       u32 pgno;                   /* Database page number for frame */
1123       u32 nTruncate;              /* dbsize field from frame header */
1124       int isValid;                /* True if this frame is valid */
1125 
1126       /* Read and decode the next log frame. */
1127       rc = sqlite3OsRead(pWal->pWalFd, aFrame, szFrame, iOffset);
1128       if( rc!=SQLITE_OK ) break;
1129       isValid = walDecodeFrame(pWal, &pgno, &nTruncate, aData, aFrame);
1130       if( !isValid ) break;
1131       rc = walIndexAppend(pWal, ++iFrame, pgno);
1132       if( rc!=SQLITE_OK ) break;
1133 
1134       /* If nTruncate is non-zero, this is a commit record. */
1135       if( nTruncate ){
1136         pWal->hdr.mxFrame = iFrame;
1137         pWal->hdr.nPage = nTruncate;
1138         pWal->hdr.szPage = (u16)((szPage&0xff00) | (szPage>>16));
1139         testcase( szPage<=32768 );
1140         testcase( szPage>=65536 );
1141         aFrameCksum[0] = pWal->hdr.aFrameCksum[0];
1142         aFrameCksum[1] = pWal->hdr.aFrameCksum[1];
1143       }
1144     }
1145 
1146     sqlite3_free(aFrame);
1147   }
1148 
1149 finished:
1150   if( rc==SQLITE_OK ){
1151     volatile WalCkptInfo *pInfo;
1152     int i;
1153     pWal->hdr.aFrameCksum[0] = aFrameCksum[0];
1154     pWal->hdr.aFrameCksum[1] = aFrameCksum[1];
1155     walIndexWriteHdr(pWal);
1156 
1157     /* Reset the checkpoint-header. This is safe because this thread is
1158     ** currently holding locks that exclude all other readers, writers and
1159     ** checkpointers.
1160     */
1161     pInfo = walCkptInfo(pWal);
1162     pInfo->nBackfill = 0;
1163     pInfo->aReadMark[0] = 0;
1164     for(i=1; i<WAL_NREADER; i++) pInfo->aReadMark[i] = READMARK_NOT_USED;
1165 
1166     /* If one or more frames were recovered from the log file, report an
1167     ** event via sqlite3_log(). This is to help with identifying performance
1168     ** problems caused by applications routinely shutting down without
1169     ** checkpointing the log file.
1170     */
1171     if( pWal->hdr.nPage ){
1172       sqlite3_log(SQLITE_OK, "Recovered %d frames from WAL file %s",
1173           pWal->hdr.mxFrame, pWal->zWalName
1174       );
1175     }
1176   }
1177 
1178 recovery_error:
1179   WALTRACE(("WAL%p: recovery %s\n", pWal, rc ? "failed" : "ok"));
1180   walUnlockExclusive(pWal, iLock, nLock);
1181   return rc;
1182 }
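
/*
** Illustrative sketch, not part of wal.c: how the expression
** (u16)((szPage&0xff00) | (szPage>>16)) in walIndexRecover() squeezes a
** page size of up to 65536 into 16 bits.  Sizes from 512 to 32768 are
** stored unchanged, and 65536 is stored as 1.  The decoder is an
** assumption about how the stored value could be reversed; both function
** names are hypothetical.
*/

```c
#include <assert.h>
#include <stdint.h>

/* Encode a power-of-two page size in the range 512..65536 as 16 bits. */
static uint16_t demoEncodePageSize(uint32_t szPage){
  assert( szPage>=512 && szPage<=65536 );
  assert( (szPage & (szPage-1))==0 );
  return (uint16_t)((szPage&0xff00) | (szPage>>16));
}

/* Assumed inverse: bit 0 stands for 65536, the upper bits for the rest. */
static uint32_t demoDecodePageSize(uint16_t v){
  return (uint32_t)(v&0xfe00) + ((uint32_t)(v&0x0001)<<16);
}
```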
1183 
1184 /*
1185 ** Close an open wal-index.
1186 */
1187 static void walIndexClose(Wal *pWal, int isDelete){
1188   sqlite3OsShmUnmap(pWal->pDbFd, isDelete);
1189 }
1190 
1191 /*
1192 ** Open a connection to the WAL file zWalName. The database file must
1193 ** already be opened on connection pDbFd. The buffer that zWalName points
1194 ** to must remain valid for the lifetime of the returned Wal* handle.
1195 **
1196 ** A SHARED lock should be held on the database file when this function
1197 ** is called. The purpose of this SHARED lock is to prevent any other
1198 ** client from unlinking the WAL or wal-index file. If another process
1199 ** were to do this just after this client opened one of these files, the
1200 ** system would be badly broken.
1201 **
1202 ** If the log file is successfully opened, SQLITE_OK is returned and
1203 ** *ppWal is set to point to a new WAL handle. If an error occurs,
1204 ** an SQLite error code is returned and *ppWal is left unmodified.
1205 */
1206 int sqlite3WalOpen(
1207   sqlite3_vfs *pVfs,              /* vfs module to open wal and wal-index */
1208   sqlite3_file *pDbFd,            /* The open database file */
1209   const char *zWalName,           /* Name of the WAL file */
1210   Wal **ppWal                     /* OUT: Allocated Wal handle */
1211 ){
1212   int rc;                         /* Return Code */
1213   Wal *pRet;                      /* Object to allocate and return */
1214   int flags;                      /* Flags passed to OsOpen() */
1215 
1216   assert( zWalName && zWalName[0] );
1217   assert( pDbFd );
1218 
1219   /* In the amalgamation, the os_unix.c and os_win.c source files come before
1220   ** this source file.  Verify that the #defines of the locking byte offsets
1221   ** in os_unix.c and os_win.c agree with the WALINDEX_LOCK_OFFSET value.
1222   */
1223 #ifdef WIN_SHM_BASE
1224   assert( WIN_SHM_BASE==WALINDEX_LOCK_OFFSET );
1225 #endif
1226 #ifdef UNIX_SHM_BASE
1227   assert( UNIX_SHM_BASE==WALINDEX_LOCK_OFFSET );
1228 #endif
1229 
1230 
1231   /* Allocate an instance of struct Wal to return. */
1232   *ppWal = 0;
1233   pRet = (Wal*)sqlite3MallocZero(sizeof(Wal) + pVfs->szOsFile);
1234   if( !pRet ){
1235     return SQLITE_NOMEM;
1236   }
1237 
1238   pRet->pVfs = pVfs;
1239   pRet->pWalFd = (sqlite3_file *)&pRet[1];
1240   pRet->pDbFd = pDbFd;
1241   pRet->readLock = -1;
1242   pRet->zWalName = zWalName;
1243 
1244   /* Open file handle on the write-ahead log file. */
1245   flags = (SQLITE_OPEN_READWRITE|SQLITE_OPEN_CREATE|SQLITE_OPEN_WAL);
1246   rc = sqlite3OsOpen(pVfs, zWalName, pRet->pWalFd, flags, &flags);
1247   if( rc==SQLITE_OK && flags&SQLITE_OPEN_READONLY ){
1248     pRet->readOnly = 1;
1249   }
1250 
1251   if( rc!=SQLITE_OK ){
1252     walIndexClose(pRet, 0);
1253     sqlite3OsClose(pRet->pWalFd);
1254     sqlite3_free(pRet);
1255   }else{
1256     *ppWal = pRet;
1257     WALTRACE(("WAL%p: opened\n", pRet));
1258   }
1259   return rc;
1260 }
1261 
1262 /*
1263 ** Find the smallest page number out of all pages held in the WAL that
1264 ** has not been returned by any prior invocation of this method on the
1265 ** same WalIterator object.   Write into *piFrame the frame index where
1266 ** that page was last written into the WAL.  Write into *piPage the page
1267 ** number.
1268 **
1269 ** Return 0 on success.  If there are no pages in the WAL with a page
1270 ** number larger than *piPage, then return 1.
1271 */
1272 static int walIteratorNext(
1273   WalIterator *p,               /* Iterator */
1274   u32 *piPage,                  /* OUT: The page number of the next page */
1275   u32 *piFrame                  /* OUT: Wal frame index of next page */
1276 ){
1277   u32 iMin;                     /* Result pgno must be greater than iMin */
1278   u32 iRet = 0xFFFFFFFF;        /* 0xffffffff is never a valid page number */
1279   int i;                        /* For looping through segments */
1280 
1281   iMin = p->iPrior;
1282   assert( iMin<0xffffffff );
1283   for(i=p->nSegment-1; i>=0; i--){
1284     struct WalSegment *pSegment = &p->aSegment[i];
1285     while( pSegment->iNext<pSegment->nEntry ){
1286       u32 iPg = pSegment->aPgno[pSegment->aIndex[pSegment->iNext]];
1287       if( iPg>iMin ){
1288         if( iPg<iRet ){
1289           iRet = iPg;
1290           *piFrame = pSegment->iZero + pSegment->aIndex[pSegment->iNext];
1291         }
1292         break;
1293       }
1294       pSegment->iNext++;
1295     }
1296   }
1297 
1298   *piPage = p->iPrior = iRet;
1299   return (iRet==0xFFFFFFFF);
1300 }
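
/*
** Illustrative sketch, not part of wal.c: the selection rule used by
** walIteratorNext(), reduced to segments that hold plain sorted page
** arrays instead of an aIndex[] indirection.  Each call returns the
** smallest page number greater than the previous result.  When several
** segments hold that page, the highest-numbered segment is reported
** because segments are scanned from last to first and only a strictly
** smaller page replaces the current candidate.  DemoSegment, DemoIter and
** demoIterNext() are hypothetical names.
*/

```c
#include <assert.h>
#include <stdint.h>

typedef struct DemoSegment {
  const uint32_t *aPgno;          /* Page numbers, ascending, no duplicates */
  int nEntry;                     /* Number of entries in aPgno[] */
  int iNext;                      /* Next entry to consider */
} DemoSegment;

typedef struct DemoIter {
  uint32_t iPrior;                /* Last page number returned (0 at start) */
  int nSegment;                   /* Number of segments */
  DemoSegment *aSegment;          /* Segments, oldest first */
} DemoIter;

/* Return 0 and set *piPage and *piSeg on success.  Return 1 once no page
** larger than the previous result remains. */
static int demoIterNext(DemoIter *p, uint32_t *piPage, int *piSeg){
  uint32_t iMin = p->iPrior;      /* Result must be greater than this */
  uint32_t iRet = 0xFFFFFFFF;     /* Never a valid page number */
  int i;
  assert( iMin<0xFFFFFFFF );
  for(i=p->nSegment-1; i>=0; i--){
    DemoSegment *pSeg = &p->aSegment[i];
    while( pSeg->iNext<pSeg->nEntry ){
      uint32_t iPg = pSeg->aPgno[pSeg->iNext];
      if( iPg>iMin ){
        if( iPg<iRet ){
          iRet = iPg;
          *piSeg = i;
        }
        break;
      }
      pSeg->iNext++;              /* Skip entries already returned */
    }
  }
  *piPage = p->iPrior = iRet;
  return (iRet==0xFFFFFFFF);
}
```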
1301 
1302 /*
1303 ** This function merges two sorted lists into a single sorted list.
1304 */
1305 static void walMerge(
1306   u32 *aContent,                  /* Pages in wal */
1307   ht_slot *aLeft,                 /* IN: Left hand input list */
1308   int nLeft,                      /* IN: Elements in array aLeft */
1309   ht_slot **paRight,              /* IN/OUT: Right hand input list */
1310   int *pnRight,                   /* IN/OUT: Elements in *paRight */
1311   ht_slot *aTmp                   /* Temporary buffer */
1312 ){
1313   int iLeft = 0;                  /* Current index in aLeft */
1314   int iRight = 0;                 /* Current index in aRight */
1315   int iOut = 0;                   /* Current index in output buffer */
1316   int nRight = *pnRight;
1317   ht_slot *aRight = *paRight;
1318 
1319   assert( nLeft>0 && nRight>0 );
1320   while( iRight<nRight || iLeft<nLeft ){
1321     ht_slot logpage;
1322     Pgno dbpage;
1323 
1324     if( (iLeft<nLeft)
1325      && (iRight>=nRight || aContent[aLeft[iLeft]]<aContent[aRight[iRight]])
1326     ){
1327       logpage = aLeft[iLeft++];
1328     }else{
1329       logpage = aRight[iRight++];
1330     }
1331     dbpage = aContent[logpage];
1332 
1333     aTmp[iOut++] = logpage;
1334     if( iLeft<nLeft && aContent[aLeft[iLeft]]==dbpage ) iLeft++;
1335 
1336     assert( iLeft>=nLeft || aContent[aLeft[iLeft]]>dbpage );
1337     assert( iRight>=nRight || aContent[aRight[iRight]]>dbpage );
1338   }
1339 
1340   *paRight = aLeft;
1341   *pnRight = iOut;
1342   memcpy(aLeft, aTmp, sizeof(aTmp[0])*iOut);
1343 }
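
/*
** Illustrative sketch, not part of wal.c: the merge step of walMerge()
** written against a separate output buffer.  Both inputs are lists of
** indexes into aContent[], each sorted by the page number it refers to.
** When the same page appears in both lists, the entry from the right-hand
** list is kept and the left-hand one is dropped.  demoMerge() is a
** hypothetical name, and uint16_t is an assumed stand-in for ht_slot.
*/

```c
#include <assert.h>
#include <stdint.h>

/* Merge aLeft[] and aRight[] into aOut[] and return the output length.
** aOut[] must have room for nLeft+nRight entries. */
static int demoMerge(
  const uint32_t *aContent,       /* Page number for each index */
  const uint16_t *aLeft,          /* Left-hand (older) sorted list */
  int nLeft,                      /* Elements in aLeft[] */
  const uint16_t *aRight,         /* Right-hand (newer) sorted list */
  int nRight,                     /* Elements in aRight[] */
  uint16_t *aOut                  /* OUT: merged list */
){
  int iLeft = 0;
  int iRight = 0;
  int iOut = 0;
  assert( nLeft>0 && nRight>0 );
  while( iRight<nRight || iLeft<nLeft ){
    uint16_t logpage;
    uint32_t dbpage;
    if( iLeft<nLeft
     && (iRight>=nRight || aContent[aLeft[iLeft]]<aContent[aRight[iRight]])
    ){
      logpage = aLeft[iLeft++];
    }else{
      logpage = aRight[iRight++];   /* On a tie the right entry wins */
    }
    dbpage = aContent[logpage];
    aOut[iOut++] = logpage;
    /* Drop the left-hand duplicate of the page just emitted, if any. */
    if( iLeft<nLeft && aContent[aLeft[iLeft]]==dbpage ) iLeft++;
  }
  return iOut;
}
```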
1344 
1345 /*
1346 ** Sort the elements in list aList, removing any duplicates.
1347 */
1348 static void walMergesort(
1349   u32 *aContent,                  /* Pages in wal */
1350   ht_slot *aBuffer,               /* Buffer of at least *pnList items to use */
1351   ht_slot *aList,                 /* IN/OUT: List to sort */
1352   int *pnList                     /* IN/OUT: Number of elements in aList[] */
1353 ){
1354   struct Sublist {
1355     int nList;                    /* Number of elements in aList */
1356     ht_slot *aList;               /* Pointer to sub-list content */
1357   };
1358 
1359   const int nList = *pnList;      /* Size of input list */
1360   int nMerge = 0;                 /* Number of elements in list aMerge */
1361   ht_slot *aMerge = 0;            /* List to be merged */
1362   int iList;                      /* Index into input list */
1363   int iSub = 0;                   /* Index into aSub array */
1364   struct Sublist aSub[13];        /* Array of sub-lists */
1365 
1366   memset(aSub, 0, sizeof(aSub));
1367   assert( nList<=HASHTABLE_NPAGE && nList>0 );
1368   assert( HASHTABLE_NPAGE==(1<<(ArraySize(aSub)-1)) );
1369 
1370   for(iList=0; iList<nList; iList++){
1371     nMerge = 1;
1372     aMerge = &aList[iList];
1373     for(iSub=0; iList & (1<<iSub); iSub++){
1374       struct Sublist *p = &aSub[iSub];
1375       assert( p->aList && p->nList<=(1<<iSub) );
1376       assert( p->aList==&aList[iList&~((2<<iSub)-1)] );
1377       walMerge(aContent, p->aList, p->nList, &aMerge, &nMerge, aBuffer);
1378     }
1379     aSub[iSub].aList = aMerge;
1380     aSub[iSub].nList = nMerge;
1381   }
1382 
1383   for(iSub++; iSub<ArraySize(aSub); iSub++){
1384     if( nList & (1<<iSub) ){
1385       struct Sublist *p = &aSub[iSub];
1386       assert( p->nList<=(1<<iSub) );
1387       assert( p->aList==&aList[nList&~((2<<iSub)-1)] );
1388       walMerge(aContent, p->aList, p->nList, &aMerge, &nMerge, aBuffer);
1389     }
1390   }
1391   assert( aMerge==aList );
1392   *pnList = nMerge;
1393 
1394 #ifdef SQLITE_DEBUG
1395   {
1396     int i;
1397     for(i=1; i<*pnList; i++){
1398       assert( aContent[aList[i]] > aContent[aList[i-1]] );
1399     }
1400   }
1401 #endif
1402 }
1403 
1404 /*
1405 ** Free an iterator allocated by walIteratorInit().
1406 */
1407 static void walIteratorFree(WalIterator *p){
1408   sqlite3ScratchFree(p);
1409 }
1410 
1411 /*
1412 ** Construct a WalIterator object that can be used to loop over all pages
1413 ** in the WAL in ascending order. The caller must hold the checkpoint lock.
1414 **
1415 ** On success, make *pp point to the newly allocated WalIterator object
1416 ** and return SQLITE_OK. Otherwise, return an error code. If this routine
1417 ** returns an error, the value of *pp is undefined.
1418 **
1419 ** The calling routine should invoke walIteratorFree() to destroy the
1420 ** WalIterator object when it has finished with it.
1421 */
1422 static int walIteratorInit(Wal *pWal, WalIterator **pp){
1423   WalIterator *p;                 /* Return value */
1424   int nSegment;                   /* Number of segments to merge */
1425   u32 iLast;                      /* Last frame in log */
1426   int nByte;                      /* Number of bytes to allocate */
1427   int i;                          /* Iterator variable */
1428   ht_slot *aTmp;                  /* Temp space used by merge-sort */
1429   int rc = SQLITE_OK;             /* Return Code */
1430 
1431   /* This routine only runs while holding the checkpoint lock. And
1432   ** it only runs if there is actually content in the log (mxFrame>0).
1433   */
1434   assert( pWal->ckptLock && pWal->hdr.mxFrame>0 );
1435   iLast = pWal->hdr.mxFrame;
1436 
1437   /* Allocate space for the WalIterator object. */
1438   nSegment = walFramePage(iLast) + 1;
1439   nByte = sizeof(WalIterator)
1440         + (nSegment-1)*sizeof(struct WalSegment)
1441         + iLast*sizeof(ht_slot);
1442   p = (WalIterator *)sqlite3ScratchMalloc(nByte);
1443   if( !p ){
1444     return SQLITE_NOMEM;
1445   }
1446   memset(p, 0, nByte);
1447   p->nSegment = nSegment;
1448 
1449   /* Allocate temporary space used by the merge-sort routine. This block
1450   ** of memory will be freed before this function returns.
1451   */
1452   aTmp = (ht_slot *)sqlite3ScratchMalloc(
1453       sizeof(ht_slot) * (iLast>HASHTABLE_NPAGE?HASHTABLE_NPAGE:iLast)
1454   );
1455   if( !aTmp ){
1456     rc = SQLITE_NOMEM;
1457   }
1458 
1459   for(i=0; rc==SQLITE_OK && i<nSegment; i++){
1460     volatile ht_slot *aHash;
1461     u32 iZero;
1462     volatile u32 *aPgno;
1463 
1464     rc = walHashGet(pWal, i, &aHash, &aPgno, &iZero);
1465     if( rc==SQLITE_OK ){
1466       int j;                      /* Counter variable */
1467       int nEntry;                 /* Number of entries in this segment */
1468       ht_slot *aIndex;            /* Sorted index for this segment */
1469 
1470       aPgno++;
1471       if( (i+1)==nSegment ){
1472         nEntry = (int)(iLast - iZero);
1473       }else{
1474         nEntry = (int)((u32*)aHash - (u32*)aPgno);
1475       }
1476       aIndex = &((ht_slot *)&p->aSegment[p->nSegment])[iZero];
1477       iZero++;
1478 
1479       for(j=0; j<nEntry; j++){
1480         aIndex[j] = (ht_slot)j;
1481       }
1482       walMergesort((u32 *)aPgno, aTmp, aIndex, &nEntry);
1483       p->aSegment[i].iZero = iZero;
1484       p->aSegment[i].nEntry = nEntry;
1485       p->aSegment[i].aIndex = aIndex;
1486       p->aSegment[i].aPgno = (u32 *)aPgno;
1487     }
1488   }
1489   sqlite3ScratchFree(aTmp);
1490 
1491   if( rc!=SQLITE_OK ){
1492     walIteratorFree(p);
1493   }
1494   *pp = p;
1495   return rc;
1496 }
1497 
1498 /*
1499 ** Copy as much content as we can from the WAL back into the database file
1500 ** in response to an sqlite3_wal_checkpoint() request or the equivalent.
1501 **
1502 ** The amount of information copied from WAL to database might be limited
1503 ** by active readers.  This routine will never overwrite a database page
1504 ** that a concurrent reader might be using.
1505 **
1506 ** All I/O barrier operations (a.k.a. fsyncs) occur in this routine when
1507 ** SQLite is in WAL-mode in synchronous=NORMAL.  That means that if
1508 ** checkpoints are always run by a background thread or background
1509 ** process, foreground threads will never block on a lengthy fsync call.
1510 **
1511 ** Fsync is called on the WAL before writing content out of the WAL and
1512 ** into the database.  This ensures that the new content is persistent
1513 ** in the WAL and can be recovered following a power-loss or hard reset.
1514 **
1515 ** Fsync is also called on the database file if (and only if) the entire
1516 ** WAL content is copied into the database file.  This second fsync makes
1517 ** it safe to delete the WAL since the new content will persist in the
1518 ** database file.
1519 **
1520 ** This routine uses and updates the nBackfill field of the wal-index header.
1521 ** This is the only routine that will increase the value of nBackfill.
1522 ** (A WAL reset or recovery will revert nBackfill to zero, but not increase
1523 ** its value.)
1524 **
1525 ** The caller must be holding sufficient locks to ensure that no other
1526 ** checkpoint is running (in any other thread or process) at the same
1527 ** time.
1528 */
1529 static int walCheckpoint(
1530   Wal *pWal,                      /* Wal connection */
1531   int sync_flags,                 /* Flags for OsSync() (or 0) */
1532   int nBuf,                       /* Size of zBuf in bytes */
1533   u8 *zBuf                        /* Temporary buffer to use */
1534 ){
1535   int rc;                         /* Return code */
1536   int szPage;                     /* Database page-size */
1537   WalIterator *pIter = 0;         /* Wal iterator context */
1538   u32 iDbpage = 0;                /* Next database page to write */
1539   u32 iFrame = 0;                 /* Wal frame containing data for iDbpage */
1540   u32 mxSafeFrame;                /* Max frame that can be backfilled */
1541   u32 mxPage;                     /* Max database page to write */
1542   int i;                          /* Loop counter */
1543   volatile WalCkptInfo *pInfo;    /* The checkpoint status information */
1544 
1545   szPage = (pWal->hdr.szPage&0xfe00) + ((pWal->hdr.szPage&0x0001)<<16);
1546   testcase( szPage<=32768 );
1547   testcase( szPage>=65536 );
1548   if( pWal->hdr.mxFrame==0 ) return SQLITE_OK;
1549 
1550   /* Allocate the iterator */
1551   rc = walIteratorInit(pWal, &pIter);
1552   if( rc!=SQLITE_OK ){
1553     return rc;
1554   }
1555   assert( pIter );
1556 
1557   /*** TODO:  Move this test out to the caller.  Make it an assert() here ***/
1558   if( szPage!=nBuf ){
1559     rc = SQLITE_CORRUPT_BKPT;
1560     goto walcheckpoint_out;
1561   }
1562 
1563   /* Compute in mxSafeFrame the index of the last frame of the WAL that is
1564   ** safe to write into the database.  Frames beyond mxSafeFrame might
1565   ** overwrite database pages that are in use by active readers and thus
1566   ** cannot be backfilled from the WAL.
1567   */
1568   mxSafeFrame = pWal->hdr.mxFrame;
1569   mxPage = pWal->hdr.nPage;
1570   pInfo = walCkptInfo(pWal);
1571   for(i=1; i<WAL_NREADER; i++){
1572     u32 y = pInfo->aReadMark[i];
1573     if( mxSafeFrame>=y ){
1574       assert( y<=pWal->hdr.mxFrame );
1575       rc = walLockExclusive(pWal, WAL_READ_LOCK(i), 1);
1576       if( rc==SQLITE_OK ){
1577         pInfo->aReadMark[i] = READMARK_NOT_USED;
1578         walUnlockExclusive(pWal, WAL_READ_LOCK(i), 1);
1579       }else if( rc==SQLITE_BUSY ){
1580         mxSafeFrame = y;
1581       }else{
1582         goto walcheckpoint_out;
1583       }
1584     }
1585   }
1586 
1587   if( pInfo->nBackfill<mxSafeFrame
1588    && (rc = walLockExclusive(pWal, WAL_READ_LOCK(0), 1))==SQLITE_OK
1589   ){
1590     i64 nSize;                    /* Current size of database file */
1591     u32 nBackfill = pInfo->nBackfill;
1592 
1593     /* Sync the WAL to disk */
1594     if( sync_flags ){
1595       rc = sqlite3OsSync(pWal->pWalFd, sync_flags);
1596     }
1597 
1598     /* If the database file may grow as a result of this checkpoint, hint
1599     ** about the eventual size of the db file to the VFS layer.
1600     */
1601     if( rc==SQLITE_OK ){
1602       i64 nReq = ((i64)mxPage * szPage);
1603       rc = sqlite3OsFileSize(pWal->pDbFd, &nSize);
1604       if( rc==SQLITE_OK && nSize<nReq ){
1605         sqlite3OsFileControl(pWal->pDbFd, SQLITE_FCNTL_SIZE_HINT, &nReq);
1606       }
1607     }
1608 
1609     /* Iterate through the contents of the WAL, copying data to the db file. */
1610     while( rc==SQLITE_OK && 0==walIteratorNext(pIter, &iDbpage, &iFrame) ){
1611       i64 iOffset;
1612       assert( walFramePgno(pWal, iFrame)==iDbpage );
1613       if( iFrame<=nBackfill || iFrame>mxSafeFrame || iDbpage>mxPage ) continue;
1614       iOffset = walFrameOffset(iFrame, szPage) + WAL_FRAME_HDRSIZE;
1615       /* testcase( IS_BIG_INT(iOffset) ); // requires a 4GiB WAL file */
1616       rc = sqlite3OsRead(pWal->pWalFd, zBuf, szPage, iOffset);
1617       if( rc!=SQLITE_OK ) break;
1618       iOffset = (iDbpage-1)*(i64)szPage;
1619       testcase( IS_BIG_INT(iOffset) );
1620       rc = sqlite3OsWrite(pWal->pDbFd, zBuf, szPage, iOffset);
1621       if( rc!=SQLITE_OK ) break;
1622     }
1623 
1624     /* If work was actually accomplished... */
1625     if( rc==SQLITE_OK ){
1626       if( mxSafeFrame==walIndexHdr(pWal)->mxFrame ){
1627         i64 szDb = pWal->hdr.nPage*(i64)szPage;
1628         testcase( IS_BIG_INT(szDb) );
1629         rc = sqlite3OsTruncate(pWal->pDbFd, szDb);
1630         if( rc==SQLITE_OK && sync_flags ){
1631           rc = sqlite3OsSync(pWal->pDbFd, sync_flags);
1632         }
1633       }
1634       if( rc==SQLITE_OK ){
1635         pInfo->nBackfill = mxSafeFrame;
1636       }
1637     }
1638 
1639     /* Release the reader lock held while backfilling */
1640     walUnlockExclusive(pWal, WAL_READ_LOCK(0), 1);
1641   }else if( rc==SQLITE_BUSY ){
1642     /* Reset the return code so as not to report a checkpoint failure
1643     ** just because active readers prevent any backfill.
1644     */
1645     rc = SQLITE_OK;
1646   }
1647 
1648  walcheckpoint_out:
1649   walIteratorFree(pIter);
1650   return rc;
1651 }
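
/*
** Illustrative sketch (not part of wal.c): how active readers cap the
** backfill limit computed at the top of walCheckpoint().  The names
** max_safe_frame(), NREADER, MARK_NOT_USED and the isBusy[] array are
** assumptions made for this example.  isBusy[i] stands in for the case
** where walLockExclusive() on WAL_READ_LOCK(i) returns SQLITE_BUSY, and
** the value 5 for NREADER is only a placeholder for WAL_NREADER.
*/
```c
#include <assert.h>
#include <stdint.h>

#define NREADER 5                   /* Hypothetical stand-in for WAL_NREADER */
#define MARK_NOT_USED 0xffffffffu   /* Stand-in for READMARK_NOT_USED */

/* Return the highest frame that may be copied into the database.
** aReadMark[i] is the frame limit recorded for reader slot i.  isBusy[i]
** is non-zero if slot i is held by an active reader.  Idle slots at or
** below the current limit are marked unused, as walCheckpoint() does
** while it briefly holds the exclusive lock on that slot.
*/
static uint32_t max_safe_frame(
  uint32_t mxFrame,                 /* Last valid frame in the WAL */
  uint32_t *aReadMark,              /* IN/OUT: read marks, NREADER entries */
  const int *isBusy                 /* Non-zero if the slot is locked */
){
  uint32_t mxSafeFrame = mxFrame;
  int i;
  for(i=1; i<NREADER; i++){
    uint32_t y = aReadMark[i];
    if( mxSafeFrame>=y ){
      if( isBusy[i] ){
        mxSafeFrame = y;            /* A reader still needs frames past y */
      }else{
        aReadMark[i] = MARK_NOT_USED;
      }
    }
  }
  return mxSafeFrame;
}
```
/*
** Slot 0 is skipped in the sketch because readers holding WAL_READ_LOCK(0)
** ignore the WAL entirely, so they never constrain which frames may be
** backfilled; walCheckpoint() instead takes that lock exclusively while
** it writes to the database file.
*/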
1652 
1653 /*
1654 ** Close a connection to a log file.
1655 */
1656 int sqlite3WalClose(
1657   Wal *pWal,                      /* Wal to close */
1658   int sync_flags,                 /* Flags to pass to OsSync() (or 0) */
1659   int nBuf,
1660   u8 *zBuf                        /* Buffer of at least nBuf bytes */
1661 ){
1662   int rc = SQLITE_OK;
1663   if( pWal ){
1664     int isDelete = 0;             /* True to unlink wal and wal-index files */
1665 
1666     /* If an EXCLUSIVE lock can be obtained on the database file (using the
1667     ** ordinary, rollback-mode locking methods), this guarantees that the
1668     ** connection associated with this log file is the only connection to
1669     ** the database. In this case checkpoint the database and unlink both
1670     ** the wal and wal-index files.
1671     **
1672     ** The EXCLUSIVE lock is not released before returning.
1673     */
1674     rc = sqlite3OsLock(pWal->pDbFd, SQLITE_LOCK_EXCLUSIVE);
1675     if( rc==SQLITE_OK ){
1676       pWal->exclusiveMode = 1;
1677       rc = sqlite3WalCheckpoint(pWal, sync_flags, nBuf, zBuf);
1678       if( rc==SQLITE_OK ){
1679         isDelete = 1;
1680       }
1681     }
1682 
1683     walIndexClose(pWal, isDelete);
1684     sqlite3OsClose(pWal->pWalFd);
1685     if( isDelete ){
1686       sqlite3OsDelete(pWal->pVfs, pWal->zWalName, 0);
1687     }
1688     WALTRACE(("WAL%p: closed\n", pWal));
1689     sqlite3_free((void *)pWal->apWiData);
1690     sqlite3_free(pWal);
1691   }
1692   return rc;
1693 }
1694 
1695 /*
1696 ** Try to read the wal-index header.  Return 0 on success and 1 if
1697 ** there is a problem.
1698 **
1699 ** The wal-index is in shared memory.  Another thread or process might
1700 ** be writing the header at the same time this procedure is trying to
1701 ** read it, which might result in inconsistency.  A dirty read is detected
1702 ** by verifying that both copies of the header are the same and also by
1703 ** a checksum on the header.
1704 **
1705 ** If and only if the read is consistent and the header is different from
1706 ** pWal->hdr, then pWal->hdr is updated to the content of the new header
1707 ** and *pChanged is set to 1.
1708 **
1709 ** If the checksum cannot be verified return non-zero. If the header
1710 ** is read successfully and the checksum verified, return zero.
1711 */
1712 static int walIndexTryHdr(Wal *pWal, int *pChanged){
1713   u32 aCksum[2];                  /* Checksum on the header content */
1714   WalIndexHdr h1, h2;             /* Two copies of the header content */
1715   WalIndexHdr volatile *aHdr;     /* Header in shared memory */
1716 
1717   /* The first page of the wal-index must be mapped at this point. */
1718   assert( pWal->nWiData>0 && pWal->apWiData[0] );
1719 
1720   /* Read the header. This might happen concurrently with a write to the
1721   ** same area of shared memory on a different CPU in a SMP,
1722   ** meaning it is possible that an inconsistent snapshot is read
1723   ** from the file. If this happens, return non-zero.
1724   **
1725   ** There are two copies of the header at the beginning of the wal-index.
1726   ** When reading, read [0] first then [1].  Writes are in the reverse order.
1727   ** Memory barriers are used to prevent the compiler or the hardware from
1728   ** reordering the reads and writes.
1729   */
1730   aHdr = walIndexHdr(pWal);
1731   memcpy(&h1, (void *)&aHdr[0], sizeof(h1));
1732   sqlite3OsShmBarrier(pWal->pDbFd);
1733   memcpy(&h2, (void *)&aHdr[1], sizeof(h2));
1734 
1735   if( memcmp(&h1, &h2, sizeof(h1))!=0 ){
1736     return 1;   /* Dirty read */
1737   }
1738   if( h1.isInit==0 ){
1739     return 1;   /* Malformed header - probably all zeros */
1740   }
1741   walChecksumBytes(1, (u8*)&h1, sizeof(h1)-sizeof(h1.aCksum), 0, aCksum);
1742   if( aCksum[0]!=h1.aCksum[0] || aCksum[1]!=h1.aCksum[1] ){
1743     return 1;   /* Checksum does not match */
1744   }
1745 
1746   if( memcmp(&pWal->hdr, &h1, sizeof(WalIndexHdr)) ){
1747     *pChanged = 1;
1748     memcpy(&pWal->hdr, &h1, sizeof(WalIndexHdr));
1749     pWal->szPage = (pWal->hdr.szPage&0xfe00) + ((pWal->hdr.szPage&0x0001)<<16);
1750     testcase( pWal->szPage<=32768 );
1751     testcase( pWal->szPage>=65536 );
1752   }
1753 
1754   /* The header was successfully read. Return zero. */
1755   return 0;
1756 }
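
/*
** Illustrative sketch (not part of wal.c): the page-size decoding used
** in walIndexTryHdr(), walCheckpoint() and sqlite3WalRead().  The sketch
** assumes the header field is 16 bits wide, so that a 65536-byte page
** cannot be stored directly and is represented by the value 1.  The
** helper names decode_page_size() and encode_page_size() are invented
** for this example, and the encoder is only one plausible inverse of
** the decoding expression.
*/
```c
#include <assert.h>
#include <stdint.h>

/* Convert the stored 16-bit value into a page size in bytes.  Bits 9..15
** carry sizes from 512 to 32768; bit 0 set means 65536.
*/
static int decode_page_size(uint16_t stored){
  return (stored & 0xfe00) + ((stored & 0x0001) << 16);
}

/* Hypothetical inverse: fold bit 16 of the page size down into bit 0. */
static uint16_t encode_page_size(int szPage){
  return (uint16_t)((szPage & 0xff00) | (szPage >> 16));
}
```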
1757 
1758 /*
1759 ** Read the wal-index header from the wal-index and into pWal->hdr.
1760 ** If the wal-index header appears to be corrupt, try to reconstruct the
1761 ** wal-index from the WAL before returning.
1762 **
1763 ** Set *pChanged to 1 if the wal-index header value in pWal->hdr is
1764 ** changed by this operation.  If pWal->hdr is unchanged, set *pChanged
1765 ** to 0.
1766 **
1767 ** If the wal-index header is successfully read, return SQLITE_OK.
1768 ** Otherwise an SQLite error code.
1769 */
1770 static int walIndexReadHdr(Wal *pWal, int *pChanged){
1771   int rc;                         /* Return code */
1772   int badHdr;                     /* True if a header read failed */
1773   volatile u32 *page0;            /* Chunk of wal-index containing header */
1774 
1775   /* Ensure that page 0 of the wal-index (the page that contains the
1776   ** wal-index header) is mapped. Return early if an error occurs here.
1777   */
1778   assert( pChanged );
1779   rc = walIndexPage(pWal, 0, &page0);
1780   if( rc!=SQLITE_OK ){
1781     return rc;
1782   };
1783   assert( page0 || pWal->writeLock==0 );
1784 
1785   /* If the first page of the wal-index has been mapped, try to read the
1786   ** wal-index header immediately, without holding any lock. This usually
1787   ** works, but may fail if the wal-index header is corrupt or currently
1788   ** being modified by another thread or process.
1789   */
1790   badHdr = (page0 ? walIndexTryHdr(pWal, pChanged) : 1);
1791 
1792   /* If the first attempt failed, it might have been due to a race
1793   ** with a writer.  So get a WRITE lock and try again.
1794   */
1795   assert( badHdr==0 || pWal->writeLock==0 );
1796   if( badHdr && SQLITE_OK==(rc = walLockExclusive(pWal, WAL_WRITE_LOCK, 1)) ){
1797     pWal->writeLock = 1;
1798     if( SQLITE_OK==(rc = walIndexPage(pWal, 0, &page0)) ){
1799       badHdr = walIndexTryHdr(pWal, pChanged);
1800       if( badHdr ){
1801         /* If the wal-index header is still malformed even while holding
1802         ** a WRITE lock, it can only mean that the header is corrupted and
1803         ** needs to be reconstructed.  So run recovery to do exactly that.
1804         */
1805         rc = walIndexRecover(pWal);
1806         *pChanged = 1;
1807       }
1808     }
1809     pWal->writeLock = 0;
1810     walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1);
1811   }
1812 
1813   /* If the header is read successfully, check the version number to make
1814   ** sure the wal-index was not constructed with some future format that
1815   ** this version of SQLite cannot understand.
1816   */
1817   if( badHdr==0 && pWal->hdr.iVersion!=WALINDEX_MAX_VERSION ){
1818     rc = SQLITE_CANTOPEN_BKPT;
1819   }
1820 
1821   return rc;
1822 }
1823 
1824 /*
1825 ** This is the value that walTryBeginRead returns when it needs to
1826 ** be retried.
1827 */
1828 #define WAL_RETRY  (-1)
1829 
1830 /*
1831 ** Attempt to start a read transaction.  This might fail due to a race or
1832 ** other transient condition.  When that happens, it returns WAL_RETRY to
1833 ** indicate to the caller that it is safe to retry immediately.
1834 **
1835 ** On success return SQLITE_OK.  On a permanent failure (such as an
1836 ** I/O error or an SQLITE_BUSY because another process is running
1837 ** recovery) return a positive error code.
1838 **
1839 ** The useWal parameter is true to force the use of the WAL and disable
1840 ** the case where the WAL is bypassed because it has been completely
1841 ** checkpointed.  If useWal==0 then this routine calls walIndexReadHdr()
1842 ** to make a copy of the wal-index header into pWal->hdr.  If the
1843 ** wal-index header has changed, *pChanged is set to 1 (as an indication
1844 ** to the caller that the local page cache is obsolete and needs to be
1845 ** flushed.)  When useWal==1, the wal-index header is assumed to already
1846 ** be loaded and the pChanged parameter is unused.
1847 **
1848 ** The caller must set the cnt parameter to the number of prior calls to
1849 ** this routine during the current read attempt that returned WAL_RETRY.
1850 ** This routine will start taking more aggressive measures to clear the
1851 ** race conditions after multiple WAL_RETRY returns, and after an excessive
1852 ** number of errors will ultimately return SQLITE_PROTOCOL.  The
1853 ** SQLITE_PROTOCOL return indicates that some other process has gone rogue
1854 ** and is not honoring the locking protocol.  There is a vanishingly small
1855 ** chance that SQLITE_PROTOCOL could be returned because of a run of really
1856 ** bad luck when there is lots of contention for the wal-index, but that
1857 ** possibility is so small that it can be safely neglected, we believe.
1858 **
1859 ** On success, this routine obtains a read lock on
1860 ** WAL_READ_LOCK(pWal->readLock).  The pWal->readLock integer is
1861 ** in the range 0 <= pWal->readLock < WAL_NREADER.  If pWal->readLock==(-1)
1862 ** that means the Wal does not hold any read lock.  The reader must not
1863 ** access any database page that is modified by a WAL frame up to and
1864 ** including frame number aReadMark[pWal->readLock].  The reader will
1865 ** use WAL frames up to and including pWal->hdr.mxFrame if pWal->readLock>0.
1866 ** Or if pWal->readLock==0, then the reader will ignore the WAL
1867 ** completely and get all content directly from the database file.
1868 ** If the useWal parameter is 1 then the WAL will never be ignored and
1869 ** this routine will always set pWal->readLock>0 on success.
1870 ** When the read transaction is completed, the caller must release the
1871 ** lock on WAL_READ_LOCK(pWal->readLock) and set pWal->readLock to -1.
1872 **
1873 ** This routine uses the nBackfill and aReadMark[] fields of the header
1874 ** to select a particular WAL_READ_LOCK() that strives to let the
1875 ** checkpoint process do as much work as possible.  This routine might
1876 ** update values of the aReadMark[] array in the header, but if it does
1877 ** so it takes care to hold an exclusive lock on the corresponding
1878 ** WAL_READ_LOCK() while changing values.
1879 */
1880 static int walTryBeginRead(Wal *pWal, int *pChanged, int useWal, int cnt){
1881   volatile WalCkptInfo *pInfo;    /* Checkpoint information in wal-index */
1882   u32 mxReadMark;                 /* Largest aReadMark[] value */
1883   int mxI;                        /* Index of largest aReadMark[] value */
1884   int i;                          /* Loop counter */
1885   int rc = SQLITE_OK;             /* Return code  */
1886 
1887   assert( pWal->readLock<0 );     /* Not currently locked */
1888 
1889   /* Take steps to avoid spinning forever if there is a protocol error. */
1890   if( cnt>5 ){
1891     if( cnt>100 ) return SQLITE_PROTOCOL;
1892     sqlite3OsSleep(pWal->pVfs, 1);
1893   }
1894 
1895   if( !useWal ){
1896     rc = walIndexReadHdr(pWal, pChanged);
1897     if( rc==SQLITE_BUSY ){
1898       /* If there is not a recovery running in another thread or process
1899       ** then convert BUSY errors to WAL_RETRY.  If recovery is known to
1900       ** be running, convert BUSY to BUSY_RECOVERY.  There is a race here
1901       ** which might cause WAL_RETRY to be returned even if BUSY_RECOVERY
1902       ** would be technically correct.  But the race is benign since with
1903       ** WAL_RETRY this routine will be called again and will probably be
1904       ** right on the second iteration.
1905       */
1906       if( pWal->apWiData[0]==0 ){
1907         /* This branch is taken when the xShmMap() method returns SQLITE_BUSY.
1908         ** We assume this is a transient condition, so return WAL_RETRY. The
1909         ** xShmMap() implementation used by the default unix and win32 VFS
1910         ** modules may return SQLITE_BUSY due to a race condition in the
1911         ** code that determines whether or not the shared-memory region
1912         ** must be zeroed before the requested page is returned.
1913         */
1914         rc = WAL_RETRY;
1915       }else if( SQLITE_OK==(rc = walLockShared(pWal, WAL_RECOVER_LOCK)) ){
1916         walUnlockShared(pWal, WAL_RECOVER_LOCK);
1917         rc = WAL_RETRY;
1918       }else if( rc==SQLITE_BUSY ){
1919         rc = SQLITE_BUSY_RECOVERY;
1920       }
1921     }
1922     if( rc!=SQLITE_OK ){
1923       return rc;
1924     }
1925   }
1926 
1927   pInfo = walCkptInfo(pWal);
1928   if( !useWal && pInfo->nBackfill==pWal->hdr.mxFrame ){
1929     /* The WAL has been completely backfilled (or it is empty)
1930     ** and can be safely ignored.
1931     */
1932     rc = walLockShared(pWal, WAL_READ_LOCK(0));
1933     sqlite3OsShmBarrier(pWal->pDbFd);
1934     if( rc==SQLITE_OK ){
1935       if( memcmp((void *)walIndexHdr(pWal), &pWal->hdr, sizeof(WalIndexHdr)) ){
1936         /* It is not safe to allow the reader to continue here if frames
1937         ** may have been appended to the log before READ_LOCK(0) was obtained.
1938         ** When holding READ_LOCK(0), the reader ignores the entire log file,
1939         ** which implies that the database file contains a trustworthy
1940         ** snapshot. Since holding READ_LOCK(0) prevents a checkpoint from
1941         ** happening, this is usually correct.
1942         **
1943         ** However, if frames have been appended to the log (or if the log
1944         ** is wrapped and written for that matter) before the READ_LOCK(0)
1945         ** is obtained, that is not necessarily true. A checkpointer may
1946         ** have started to backfill the appended frames but crashed before
1947         ** it finished, leaving a corrupt image in the database file.
1948         */
1949         walUnlockShared(pWal, WAL_READ_LOCK(0));
1950         return WAL_RETRY;
1951       }
1952       pWal->readLock = 0;
1953       return SQLITE_OK;
1954     }else if( rc!=SQLITE_BUSY ){
1955       return rc;
1956     }
1957   }
1958 
1959   /* If we get this far, it means that the reader will want to use
1960   ** the WAL to get at content from recent commits.  The job now is
1961   ** to select one of the aReadMark[] entries that is closest to
1962   ** but not exceeding pWal->hdr.mxFrame and lock that entry.
1963   */
1964   mxReadMark = 0;
1965   mxI = 0;
1966   for(i=1; i<WAL_NREADER; i++){
1967     u32 thisMark = pInfo->aReadMark[i];
1968     if( mxReadMark<=thisMark && thisMark<=pWal->hdr.mxFrame ){
1969       assert( thisMark!=READMARK_NOT_USED );
1970       mxReadMark = thisMark;
1971       mxI = i;
1972     }
1973   }
1974   if( mxI==0 ){
1975     /* If we get here, it means that all of the aReadMark[] entries between
1976     ** 1 and WAL_NREADER-1 are zero.  Try to initialize aReadMark[1] to
1977     ** be mxFrame, then retry.
1978     */
1979     rc = walLockExclusive(pWal, WAL_READ_LOCK(1), 1);
1980     if( rc==SQLITE_OK ){
1981       pInfo->aReadMark[1] = pWal->hdr.mxFrame;
1982       walUnlockExclusive(pWal, WAL_READ_LOCK(1), 1);
1983       rc = WAL_RETRY;
1984     }else if( rc==SQLITE_BUSY ){
1985       rc = WAL_RETRY;
1986     }
1987     return rc;
1988   }else{
1989     if( mxReadMark < pWal->hdr.mxFrame ){
1990       for(i=1; i<WAL_NREADER; i++){
1991         rc = walLockExclusive(pWal, WAL_READ_LOCK(i), 1);
1992         if( rc==SQLITE_OK ){
1993           mxReadMark = pInfo->aReadMark[i] = pWal->hdr.mxFrame;
1994           mxI = i;
1995           walUnlockExclusive(pWal, WAL_READ_LOCK(i), 1);
1996           break;
1997         }else if( rc!=SQLITE_BUSY ){
1998           return rc;
1999         }
2000       }
2001     }
2002 
2003     rc = walLockShared(pWal, WAL_READ_LOCK(mxI));
2004     if( rc ){
2005       return rc==SQLITE_BUSY ? WAL_RETRY : rc;
2006     }
2007     /* Now that the read-lock has been obtained, check that neither the
2008     ** value in the aReadMark[] array nor the contents of the wal-index
2009     ** header have changed.
2010     **
2011     ** It is necessary to check that the wal-index header did not change
2012     ** between the time it was read and when the shared-lock was obtained
2013     ** on WAL_READ_LOCK(mxI), to account for the possibility
2014     ** that the log file may have been wrapped by a writer, or that frames
2015     ** that occur later in the log than pWal->hdr.mxFrame may have been
2016     ** copied into the database by a checkpointer. If either of these things
2017     ** happened, then reading the database with the current value of
2018     ** pWal->hdr.mxFrame risks reading a corrupted snapshot. So, retry
2019     ** instead.
2020     **
2021     ** This does not guarantee that the copy of the wal-index header is up to
2022     ** date before proceeding. That would not be possible without somehow
2023     ** blocking writers. It only guarantees that a dangerous checkpoint or
2024     ** log-wrap (either of which would require an exclusive lock on
2025     ** WAL_READ_LOCK(mxI)) has not occurred since the snapshot was valid.
2026     */
2027     sqlite3OsShmBarrier(pWal->pDbFd);
2028     if( pInfo->aReadMark[mxI]!=mxReadMark
2029      || memcmp((void *)walIndexHdr(pWal), &pWal->hdr, sizeof(WalIndexHdr))
2030     ){
2031       walUnlockShared(pWal, WAL_READ_LOCK(mxI));
2032       return WAL_RETRY;
2033     }else{
2034       assert( mxReadMark<=pWal->hdr.mxFrame );
2035       pWal->readLock = (i16)mxI;
2036     }
2037   }
2038   return rc;
2039 }
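
/*
** Illustrative sketch (not part of wal.c): the read-mark selection loop
** from walTryBeginRead(), isolated from the locking.  The function name
** pick_read_mark() and the constants NREADER and MARK_NOT_USED are
** assumptions made for this example.  The sketch only chooses a slot.
** The real routine must still lock the slot and then re-check both the
** mark and the wal-index header.
*/
```c
#include <assert.h>
#include <stdint.h>

#define NREADER 5                   /* Hypothetical stand-in for WAL_NREADER */
#define MARK_NOT_USED 0xffffffffu   /* Stand-in for READMARK_NOT_USED */

/* Return the index (1..NREADER-1) of the slot holding the largest read
** mark that does not exceed mxFrame, or 0 if no slot qualifies.  When
** two slots hold the same mark, the higher-numbered slot wins because
** the comparison uses "<=".
*/
static int pick_read_mark(const uint32_t *aReadMark, uint32_t mxFrame){
  uint32_t mxReadMark = 0;
  int mxI = 0;
  int i;
  for(i=1; i<NREADER; i++){
    uint32_t thisMark = aReadMark[i];
    if( mxReadMark<=thisMark && thisMark<=mxFrame ){
      mxReadMark = thisMark;
      mxI = i;
    }
  }
  return mxI;
}
```
/*
** A result of 0 corresponds to the branch in walTryBeginRead() that
** tries to set aReadMark[1] to mxFrame and then returns WAL_RETRY.
*/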
2040 
2041 /*
2042 ** Begin a read transaction on the database.
2043 **
2044 ** This routine used to be called sqlite3OpenSnapshot() and with good reason:
2045 ** it takes a snapshot of the state of the WAL and wal-index for the current
2046 ** instant in time.  The current thread will continue to use this snapshot.
2047 ** Other threads might append new content to the WAL and wal-index but
2048 ** that extra content is ignored by the current thread.
2049 **
2050 ** If the database contents have changed since the previous read
2051 ** transaction, then *pChanged is set to 1 before returning.  The
2052 ** Pager layer will use this to know that its cache is stale and
2053 ** needs to be flushed.
2054 */
2055 int sqlite3WalBeginReadTransaction(Wal *pWal, int *pChanged){
2056   int rc;                         /* Return code */
2057   int cnt = 0;                    /* Number of TryBeginRead attempts */
2058 
2059   do{
2060     rc = walTryBeginRead(pWal, pChanged, 0, ++cnt);
2061   }while( rc==WAL_RETRY );
2062   return rc;
2063 }
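
/*
** Illustrative sketch (not part of wal.c): the retry protocol between
** sqlite3WalBeginReadTransaction() and walTryBeginRead().  The Attempt
** structure, try_begin() and begin_read() are hypothetical.  The real
** routine's locking work is replaced by a counter that reports a
** transient failure a fixed number of times.  RC_PROTOCOL uses 15, the
** numeric value of SQLITE_PROTOCOL.
*/
```c
#include <assert.h>

#define RC_OK        0
#define RC_PROTOCOL  15             /* Numeric value of SQLITE_PROTOCOL */
#define RC_RETRY     (-1)           /* Same value as WAL_RETRY */

typedef struct Attempt Attempt;
struct Attempt {
  int nFailBeforeOk;                /* Attempts that report RC_RETRY first */
  int nSleep;                       /* Times the sketch would have slept */
};

/* One attempt.  cnt is the 1-based attempt number. */
static int try_begin(Attempt *p, int cnt){
  if( cnt>5 ){
    if( cnt>100 ) return RC_PROTOCOL;
    p->nSleep++;                    /* wal.c calls sqlite3OsSleep() here */
  }
  return cnt<=p->nFailBeforeOk ? RC_RETRY : RC_OK;
}

/* Keep trying for as long as the attempt reports a transient failure. */
static int begin_read(Attempt *p){
  int rc;
  int cnt = 0;
  do{
    rc = try_begin(p, ++cnt);
  }while( rc==RC_RETRY );
  return rc;
}
```
/*
** The first five attempts retry immediately, attempts 6 through 100
** sleep first, and attempt 101 gives up.
*/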
2064 
2065 /*
2066 ** Finish with a read transaction.  All this does is release the
2067 ** read-lock.
2068 */
2069 void sqlite3WalEndReadTransaction(Wal *pWal){
2070   sqlite3WalEndWriteTransaction(pWal);
2071   if( pWal->readLock>=0 ){
2072     walUnlockShared(pWal, WAL_READ_LOCK(pWal->readLock));
2073     pWal->readLock = -1;
2074   }
2075 }
2076 
2077 /*
2078 ** Read a page from the WAL, if it is present in the WAL and if the
2079 ** current read transaction is configured to use the WAL.
2080 **
2081 ** The *pInWal is set to 1 if the requested page is in the WAL and
2082 ** has been loaded.  Or *pInWal is set to 0 if the page was not in
2083 ** the WAL and needs to be read out of the database.
2084 */
2085 int sqlite3WalRead(
2086   Wal *pWal,                      /* WAL handle */
2087   Pgno pgno,                      /* Database page number to read data for */
2088   int *pInWal,                    /* OUT: True if data is read from WAL */
2089   int nOut,                       /* Size of buffer pOut in bytes */
2090   u8 *pOut                        /* Buffer to write page data to */
2091 ){
2092   u32 iRead = 0;                  /* If !=0, WAL frame to return data from */
2093   u32 iLast = pWal->hdr.mxFrame;  /* Last page in WAL for this reader */
2094   int iHash;                      /* Used to loop through N hash tables */
2095 
2096   /* This routine is only called from within a read transaction. */
2097   assert( pWal->readLock>=0 || pWal->lockError );
2098 
2099   /* If the "last page" field of the wal-index header snapshot is 0, then
2100   ** no data will be read from the wal under any circumstances. Return early
2101   ** in this case as an optimization.  Likewise, if pWal->readLock==0,
2102   ** then the WAL is ignored by the reader so return early, as if the
2103   ** WAL were empty.
2104   */
2105   if( iLast==0 || pWal->readLock==0 ){
2106     *pInWal = 0;
2107     return SQLITE_OK;
2108   }
2109 
2110   /* Search the hash table or tables for an entry matching page number
2111   ** pgno. Each iteration of the following for() loop searches one
2112   ** hash table (each hash table indexes up to HASHTABLE_NPAGE frames).
2113   **
2114   ** This code might run concurrently with the code in walIndexAppend()
2115   ** that adds entries to the wal-index (and possibly to this hash
2116   ** table). This means the value just read from the hash
2117   ** slot (aHash[iKey]) may have been added before or after the
2118   ** current read transaction was opened. Values added after the
2119   ** read transaction was opened may have been written incorrectly -
2120   ** i.e. these slots may contain garbage data. However, we assume
2121   ** that any slots written before the current read transaction was
2122   ** opened remain unmodified.
2123   **
2124   ** For the reasons above, the if(...) condition featured in the inner
2125   ** loop of the following block is more stringent than would be required
2126   ** if we had exclusive access to the hash-table:
2127   **
2128   **   (aPgno[iFrame]==pgno):
2129   **     This condition filters out normal hash-table collisions.
2130   **
2131   **   (iFrame<=iLast):
2132   **     This condition filters out entries that were added to the hash
2133   **     table after the current read-transaction had started.
2134   */
2135   for(iHash=walFramePage(iLast); iHash>=0 && iRead==0; iHash--){
2136     volatile ht_slot *aHash;      /* Pointer to hash table */
2137     volatile u32 *aPgno;          /* Pointer to array of page numbers */
2138     u32 iZero;                    /* Frame number corresponding to aPgno[0] */
2139     int iKey;                     /* Hash slot index */
2140     int nCollide;                 /* Number of hash collisions remaining */
2141     int rc;                       /* Error code */
2142 
2143     rc = walHashGet(pWal, iHash, &aHash, &aPgno, &iZero);
2144     if( rc!=SQLITE_OK ){
2145       return rc;
2146     }
2147     nCollide = HASHTABLE_NSLOT;
2148     for(iKey=walHash(pgno); aHash[iKey]; iKey=walNextHash(iKey)){
2149       u32 iFrame = aHash[iKey] + iZero;
2150       if( iFrame<=iLast && aPgno[aHash[iKey]]==pgno ){
2151         assert( iFrame>iRead );
2152         iRead = iFrame;
2153       }
2154       if( (nCollide--)==0 ){
2155         return SQLITE_CORRUPT_BKPT;
2156       }
2157     }
2158   }
2159 
2160 #ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT
2161   /* If expensive assert() statements are available, do a linear search
2162   ** of the wal-index file content. Make sure the results agree with the
2163   ** result obtained using the hash indexes above.  */
2164   {
2165     u32 iRead2 = 0;
2166     u32 iTest;
2167     for(iTest=iLast; iTest>0; iTest--){
2168       if( walFramePgno(pWal, iTest)==pgno ){
2169         iRead2 = iTest;
2170         break;
2171       }
2172     }
2173     assert( iRead==iRead2 );
2174   }
2175 #endif
2176 
2177   /* If iRead is non-zero, then it is the log frame number that contains the
2178   ** required page. Read and return data from the log file.
2179   */
2180   if( iRead ){
2181     int sz;
2182     i64 iOffset;
2183     sz = pWal->hdr.szPage;
2184     sz = (pWal->hdr.szPage&0xfe00) + ((pWal->hdr.szPage&0x0001)<<16);
2185     testcase( sz<=32768 );
2186     testcase( sz>=65536 );
2187     iOffset = walFrameOffset(iRead, sz) + WAL_FRAME_HDRSIZE;
2188     *pInWal = 1;
2189     /* testcase( IS_BIG_INT(iOffset) ); // requires a 4GiB WAL */
2190     return sqlite3OsRead(pWal->pWalFd, pOut, nOut, iOffset);
2191   }
2192 
2193   *pInWal = 0;
2194   return SQLITE_OK;
2195 }
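
/*
** Illustrative sketch (not part of wal.c): the linear-probing lookup
** performed by the inner loop of sqlite3WalRead(), reduced to a single
** small hash table.  MiniIndex, mini_hash(), mini_next(), mini_append()
** and mini_find() are hypothetical names, the table size of 16 slots is
** chosen only to keep the example small, and the multiplier 383 in
** mini_hash() is an assumption about how walHash() spreads page numbers.
** Frame numbers are 1-based and slot value 0 means "empty".
*/
```c
#include <assert.h>
#include <stdint.h>

#define NSLOT 16                    /* Hypothetical; must be a power of two */
#define NPAGE (NSLOT/2)             /* Never fill more than half the slots */

typedef struct MiniIndex MiniIndex;
struct MiniIndex {
  uint32_t aPgno[NPAGE+1];          /* aPgno[k] is the page held by frame k */
  uint16_t aHash[NSLOT];            /* 0 for empty, else a frame number */
  uint32_t nFrame;                  /* Number of frames appended so far */
};

static int mini_hash(uint32_t pgno){ return (int)((pgno*383u) & (NSLOT-1)); }
static int mini_next(int k){ return (k+1) & (NSLOT-1); }

/* Append a frame for pgno.  Return its frame number, or 0 if full. */
static uint32_t mini_append(MiniIndex *p, uint32_t pgno){
  int k;
  if( p->nFrame>=NPAGE ) return 0;
  p->nFrame++;
  p->aPgno[p->nFrame] = pgno;
  for(k=mini_hash(pgno); p->aHash[k]; k=mini_next(k)){}
  p->aHash[k] = (uint16_t)p->nFrame;
  return p->nFrame;
}

/* Return the newest frame numbered iLast or less that holds pgno, or 0
** if there is none.  Entries past iLast belong to later transactions
** and are skipped, which is what keeps the reader's snapshot stable.
*/
static uint32_t mini_find(const MiniIndex *p, uint32_t pgno, uint32_t iLast){
  uint32_t iRead = 0;
  int k;
  for(k=mini_hash(pgno); p->aHash[k]; k=mini_next(k)){
    uint32_t iFrame = p->aHash[k];
    if( iFrame<=iLast && p->aPgno[iFrame]==pgno ){
      iRead = iFrame;               /* Later chain entries are newer */
    }
  }
  return iRead;
}
```
/*
** Because entries are never deleted and every probe for a page starts
** at the same slot, a newer frame for that page always sits further
** along the chain than an older one, so the last match is the newest.
*/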
2196 
2197 
2198 /*
2199 ** Return the size of the database in pages (or zero, if unknown).
2200 */
2201 Pgno sqlite3WalDbsize(Wal *pWal){
2202   if( pWal && ALWAYS(pWal->readLock>=0) ){
2203     return pWal->hdr.nPage;
2204   }
2205   return 0;
2206 }


/*
** This function starts a write transaction on the WAL.
**
** A read transaction must have already been started by a prior call
** to sqlite3WalBeginReadTransaction().
**
** If another thread or process has written into the database since
** the read transaction was started, then it is not possible for this
** thread to write as doing so would cause a fork.  So this routine
** returns SQLITE_BUSY in that case and no write transaction is started.
**
** There can only be a single writer active at a time.
*/
int sqlite3WalBeginWriteTransaction(Wal *pWal){
  int rc;

  /* Cannot start a write transaction without first holding a read
  ** transaction. */
  assert( pWal->readLock>=0 );

  if( pWal->readOnly ){
    return SQLITE_READONLY;
  }

  /* Only one writer allowed at a time.  Get the write lock.  Return
  ** SQLITE_BUSY if unable.
  */
  rc = walLockExclusive(pWal, WAL_WRITE_LOCK, 1);
  if( rc ){
    return rc;
  }
  pWal->writeLock = 1;

  /* If another connection has written to the database file since the
  ** time the read transaction on this connection was started, then
  ** the write is disallowed.
  */
  if( memcmp(&pWal->hdr, (void *)walIndexHdr(pWal), sizeof(WalIndexHdr))!=0 ){
    walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1);
    pWal->writeLock = 0;
    rc = SQLITE_BUSY;
  }

  return rc;
}
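
/*
** Illustrative sketch, not part of wal.c. It models the stale-snapshot
** check above: the writer compares its private copy of the header with
** the shared copy and backs off if they differ. DemoHdr, DEMO_OK,
** DEMO_BUSY and demo_begin_write() are hypothetical stand-ins, and the
** integer flag only imitates taking and releasing the write lock.
*/
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_OK    0
#define DEMO_BUSY  5

/* Three same-sized fields, so the struct has no padding and a bytewise
** comparison is meaningful. */
typedef struct DemoHdr {
  uint32_t iChange;   /* Change counter */
  uint32_t mxFrame;   /* Index of the last valid frame */
  uint32_t nPage;     /* Database size in pages */
} DemoHdr;

/* Try to become the writer. The write is refused with DEMO_BUSY when the
** shared header has moved on since the private snapshot was taken. */
static int demo_begin_write(
  const DemoHdr *pPrivate,    /* Snapshot held by this connection */
  const DemoHdr *pShared,     /* Current shared header */
  int *pWriteLock             /* IN/OUT: 1 while the write lock is held */
){
  assert( *pWriteLock==0 );
  *pWriteLock = 1;                      /* Stand-in for taking the lock */
  if( memcmp(pPrivate, pShared, sizeof(DemoHdr))!=0 ){
    *pWriteLock = 0;                    /* Release the lock again */
    return DEMO_BUSY;
  }
  return DEMO_OK;
}
```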

/*
** End a write transaction.  The commit has already been done.  This
** routine merely releases the lock.
*/
int sqlite3WalEndWriteTransaction(Wal *pWal){
  if( pWal->writeLock ){
    walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1);
    pWal->writeLock = 0;
  }
  return SQLITE_OK;
}

/*
** If any data has been written (but not committed) to the log file, this
** function moves the write-pointer back to the start of the transaction.
**
** Additionally, the callback function is invoked for each frame written
** to the WAL since the start of the transaction. If the callback returns
** other than SQLITE_OK, it is not invoked again and the error code is
** returned to the caller.
**
** Otherwise, if the callback function does not return an error, this
** function returns SQLITE_OK.
*/
int sqlite3WalUndo(Wal *pWal, int (*xUndo)(void *, Pgno), void *pUndoCtx){
  int rc = SQLITE_OK;
  if( ALWAYS(pWal->writeLock) ){
    Pgno iMax = pWal->hdr.mxFrame;
    Pgno iFrame;

    /* Restore the client's cache of the wal-index header to the state it
    ** was in before the client began writing to the database.
    */
    memcpy(&pWal->hdr, (void *)walIndexHdr(pWal), sizeof(WalIndexHdr));

    for(iFrame=pWal->hdr.mxFrame+1;
        ALWAYS(rc==SQLITE_OK) && iFrame<=iMax;
        iFrame++
    ){
      /* This call cannot fail. Unless the page for which the page number
      ** is passed as the second argument is (a) in the cache and
      ** (b) has an outstanding reference, then xUndo is either a no-op
      ** (if (a) is false) or simply expels the page from the cache (if (b)
      ** is false).
      **
      ** If the upper layer is doing a rollback, it is guaranteed that there
      ** are no outstanding references to any page other than page 1. And
      ** page 1 is never written to the log until the transaction is
      ** committed. As a result, the call to xUndo may not fail.
      */
      assert( walFramePgno(pWal, iFrame)!=1 );
      rc = xUndo(pUndoCtx, walFramePgno(pWal, iFrame));
    }
    walCleanupHash(pWal);
  }
  assert( rc==SQLITE_OK );
  return rc;
}
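
/*
** Illustrative sketch, not part of wal.c. It mirrors the shape of the
** undo loop above: the write position is moved back to the last
** committed frame, then a callback is invoked once for every frame that
** was written after it. The array of page numbers and all demo_* names
** are hypothetical; the real code looks page numbers up in the wal-index.
*/
```c
#include <assert.h>
#include <stdint.h>

typedef int (*demo_undo_fn)(void *pCtx, uint32_t pgno);

/* aPgno[i-1] holds the page number stored in frame i. Returns 0 on
** success or the first non-zero value returned by xUndo. */
static int demo_undo(
  const uint32_t *aPgno,      /* Page number for each frame */
  uint32_t mxCommitted,       /* Last frame of the last commit */
  uint32_t *pMxFrame,         /* IN/OUT: current write position */
  demo_undo_fn xUndo,         /* Callback invoked per discarded frame */
  void *pCtx                  /* First argument passed to xUndo */
){
  uint32_t iMax = *pMxFrame;
  uint32_t iFrame;
  int rc = 0;
  assert( mxCommitted<=iMax );
  *pMxFrame = mxCommitted;    /* Rewind the write position */
  for(iFrame=mxCommitted+1; rc==0 && iFrame<=iMax; iFrame++){
    rc = xUndo(pCtx, aPgno[iFrame-1]);
  }
  return rc;
}

/* Example callback: count how many frames were discarded. */
static int demo_count_cb(void *pCtx, uint32_t pgno){
  (void)pgno;
  ++*(int*)pCtx;
  return 0;
}
```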

/*
** Argument aWalData must point to an array of WAL_SAVEPOINT_NDATA u32
** values. This function populates the array with values required to
** "rollback" the write position of the WAL handle back to the current
** point in the event of a savepoint rollback (via WalSavepointUndo()).
*/
void sqlite3WalSavepoint(Wal *pWal, u32 *aWalData){
  assert( pWal->writeLock );
  aWalData[0] = pWal->hdr.mxFrame;
  aWalData[1] = pWal->hdr.aFrameCksum[0];
  aWalData[2] = pWal->hdr.aFrameCksum[1];
  aWalData[3] = pWal->nCkpt;
}

/*
** Move the write position of the WAL back to the point identified by
** the values in the aWalData[] array. aWalData must point to an array
** of WAL_SAVEPOINT_NDATA u32 values that has been previously populated
** by a call to WalSavepoint().
*/
int sqlite3WalSavepointUndo(Wal *pWal, u32 *aWalData){
  int rc = SQLITE_OK;

  assert( pWal->writeLock );
  assert( aWalData[3]!=pWal->nCkpt || aWalData[0]<=pWal->hdr.mxFrame );

  if( aWalData[3]!=pWal->nCkpt ){
    /* This savepoint was opened immediately after the write-transaction
    ** was started. Right after that, the writer decided to wrap around
    ** to the start of the log. Update the savepoint values to match.
    */
    aWalData[0] = 0;
    aWalData[3] = pWal->nCkpt;
  }

  if( aWalData[0]<pWal->hdr.mxFrame ){
    pWal->hdr.mxFrame = aWalData[0];
    pWal->hdr.aFrameCksum[0] = aWalData[1];
    pWal->hdr.aFrameCksum[1] = aWalData[2];
    walCleanupHash(pWal);
  }

  return rc;
}
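
/*
** Illustrative sketch, not part of wal.c. It replays the savepoint
** bookkeeping of the two routines above on a reduced structure: four
** u32 values are saved, and a later undo either rewinds to the saved
** frame or, if the log was restarted in between, rewinds to frame 0.
** DemoWal and the demo_* functions are hypothetical; the array length
** of 4 is inferred from the indices used above.
*/
```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SAVEPOINT_NDATA 4

typedef struct DemoWal {
  uint32_t mxFrame;           /* Current write position */
  uint32_t aFrameCksum[2];    /* Running checksum of the last frame */
  uint32_t nCkpt;             /* Bumped each time the log is restarted */
} DemoWal;

/* Record the current write position in aData[]. */
static void demo_savepoint(const DemoWal *p, uint32_t *aData){
  aData[0] = p->mxFrame;
  aData[1] = p->aFrameCksum[0];
  aData[2] = p->aFrameCksum[1];
  aData[3] = p->nCkpt;
}

/* Move the write position back to the one recorded in aData[]. */
static void demo_savepoint_undo(DemoWal *p, uint32_t *aData){
  assert( aData[3]!=p->nCkpt || aData[0]<=p->mxFrame );
  if( aData[3]!=p->nCkpt ){
    /* The log was restarted after the savepoint was taken. */
    aData[0] = 0;
    aData[3] = p->nCkpt;
  }
  if( aData[0]<p->mxFrame ){
    p->mxFrame = aData[0];
    p->aFrameCksum[0] = aData[1];
    p->aFrameCksum[1] = aData[2];
  }
}
```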

/*
** This function is called just before writing a set of frames to the log
** file (see sqlite3WalFrames()). It checks to see if, instead of appending
** to the current log file, it is possible to overwrite the start of the
** existing log file with the new frames (i.e. "reset" the log). If so,
** it sets pWal->hdr.mxFrame to 0. Otherwise, pWal->hdr.mxFrame is left
** unchanged.
**
** SQLITE_OK is returned if no error is encountered (regardless of whether
** or not pWal->hdr.mxFrame is modified). An SQLite error code is returned
** if an error occurs.
*/
static int walRestartLog(Wal *pWal){
  int rc = SQLITE_OK;
  int cnt;

  if( pWal->readLock==0 ){
    volatile WalCkptInfo *pInfo = walCkptInfo(pWal);
    assert( pInfo->nBackfill==pWal->hdr.mxFrame );
    if( pInfo->nBackfill>0 ){
      rc = walLockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1);
      if( rc==SQLITE_OK ){
        /* If all readers are using WAL_READ_LOCK(0) (in other words if no
        ** readers are currently using the WAL), then this transaction's
        ** frames will overwrite the start of the existing log. Update the
        ** wal-index header to reflect this.
        **
        ** In theory it would be Ok to update the cache of the header only
        ** at this point. But updating the actual wal-index header is also
        ** safe and means there is no special case for sqlite3WalUndo()
        ** to handle if this transaction is rolled back.
        */
        int i;                    /* Loop counter */
        u32 *aSalt = pWal->hdr.aSalt;       /* Big-endian salt values */
        pWal->nCkpt++;
        pWal->hdr.mxFrame = 0;
        sqlite3Put4byte((u8*)&aSalt[0], 1 + sqlite3Get4byte((u8*)&aSalt[0]));
        sqlite3_randomness(4, &aSalt[1]);
        walIndexWriteHdr(pWal);
        pInfo->nBackfill = 0;
        for(i=1; i<WAL_NREADER; i++) pInfo->aReadMark[i] = READMARK_NOT_USED;
        assert( pInfo->aReadMark[0]==0 );
        walUnlockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1);
      }
    }
    walUnlockShared(pWal, WAL_READ_LOCK(0));
    pWal->readLock = -1;
    cnt = 0;
    do{
      int notUsed;
      rc = walTryBeginRead(pWal, &notUsed, 1, ++cnt);
    }while( rc==WAL_RETRY );
  }
  return rc;
}
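
/*
** Illustrative sketch, not part of wal.c. The restart code above bumps
** salt-1 while treating it as a big-endian value, so the stored bytes
** come out the same on any host. The helpers below are hypothetical
** stand-ins for sqlite3Get4byte() and sqlite3Put4byte(), written out so
** the byte-level effect of the increment is visible.
*/
```c
#include <assert.h>
#include <stdint.h>

/* Read a big-endian 32-bit value from p[0..3]. */
static uint32_t demo_get4byte(const uint8_t *p){
  assert( p!=0 );
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  |  (uint32_t)p[3];
}

/* Write v to p[0..3] as a big-endian 32-bit value. */
static void demo_put4byte(uint8_t *p, uint32_t v){
  assert( p!=0 );
  p[0] = (uint8_t)(v>>24);
  p[1] = (uint8_t)(v>>16);
  p[2] = (uint8_t)(v>>8);
  p[3] = (uint8_t)v;
}

/* Increment a salt stored as four big-endian bytes. The addition is
** done on an unsigned value, so it wraps from 0xffffffff to 0. */
static void demo_bump_salt(uint8_t *aSalt){
  demo_put4byte(aSalt, 1 + demo_get4byte(aSalt));
}
```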

/*
** Write a set of frames to the log. The caller must hold the write-lock
** on the log file (obtained using sqlite3WalBeginWriteTransaction()).
*/
int sqlite3WalFrames(
  Wal *pWal,                      /* Wal handle to write to */
  int szPage,                     /* Database page-size in bytes */
  PgHdr *pList,                   /* List of dirty pages to write */
  Pgno nTruncate,                 /* Database size after this commit */
  int isCommit,                   /* True if this is a commit */
  int sync_flags                  /* Flags to pass to OsSync() (or 0) */
){
  int rc;                         /* Used to catch return codes */
  u32 iFrame;                     /* Next frame address */
  u8 aFrame[WAL_FRAME_HDRSIZE];   /* Buffer to assemble frame-header in */
  PgHdr *p;                       /* Iterator to run through pList with. */
  PgHdr *pLast = 0;               /* Last frame in list */
  int nLast = 0;                  /* Number of extra copies of last page */

  assert( pList );
  assert( pWal->writeLock );

#if defined(SQLITE_TEST) && defined(SQLITE_DEBUG)
  { int cnt; for(cnt=0, p=pList; p; p=p->pDirty, cnt++){}
    WALTRACE(("WAL%p: frame write begin. %d frames. mxFrame=%d. %s\n",
              pWal, cnt, pWal->hdr.mxFrame, isCommit ? "Commit" : "Spill"));
  }
#endif

  /* See if it is possible to write these frames into the start of the
  ** log file, instead of appending to it at pWal->hdr.mxFrame.
  */
  if( SQLITE_OK!=(rc = walRestartLog(pWal)) ){
    return rc;
  }

  /* If this is the first frame written into the log, write the WAL
  ** header to the start of the WAL file. See comments at the top of
  ** this source file for a description of the WAL header format.
  */
  iFrame = pWal->hdr.mxFrame;
  if( iFrame==0 ){
    u8 aWalHdr[WAL_HDRSIZE];      /* Buffer to assemble wal-header in */
    u32 aCksum[2];                /* Checksum for wal-header */

    sqlite3Put4byte(&aWalHdr[0], (WAL_MAGIC | SQLITE_BIGENDIAN));
    sqlite3Put4byte(&aWalHdr[4], WAL_MAX_VERSION);
    sqlite3Put4byte(&aWalHdr[8], szPage);
    sqlite3Put4byte(&aWalHdr[12], pWal->nCkpt);
    sqlite3_randomness(8, pWal->hdr.aSalt);
    memcpy(&aWalHdr[16], pWal->hdr.aSalt, 8);
    walChecksumBytes(1, aWalHdr, WAL_HDRSIZE-2*4, 0, aCksum);
    sqlite3Put4byte(&aWalHdr[24], aCksum[0]);
    sqlite3Put4byte(&aWalHdr[28], aCksum[1]);

    pWal->szPage = szPage;
    pWal->hdr.bigEndCksum = SQLITE_BIGENDIAN;
    pWal->hdr.aFrameCksum[0] = aCksum[0];
    pWal->hdr.aFrameCksum[1] = aCksum[1];

    rc = sqlite3OsWrite(pWal->pWalFd, aWalHdr, sizeof(aWalHdr), 0);
    WALTRACE(("WAL%p: wal-header write %s\n", pWal, rc ? "failed" : "ok"));
    if( rc!=SQLITE_OK ){
      return rc;
    }
  }
  assert( (int)pWal->szPage==szPage );

  /* Write the log file. */
  for(p=pList; p; p=p->pDirty){
    u32 nDbsize;                  /* Db-size field for frame header */
    i64 iOffset;                  /* Write offset in log file */
    void *pData;

    iOffset = walFrameOffset(++iFrame, szPage);
    /* testcase( IS_BIG_INT(iOffset) ); // requires a 4GiB WAL */

    /* Populate and write the frame header */
    nDbsize = (isCommit && p->pDirty==0) ? nTruncate : 0;
#if defined(SQLITE_HAS_CODEC)
    if( (pData = sqlite3PagerCodec(p))==0 ) return SQLITE_NOMEM;
#else
    pData = p->pData;
#endif
    walEncodeFrame(pWal, p->pgno, nDbsize, pData, aFrame);
    rc = sqlite3OsWrite(pWal->pWalFd, aFrame, sizeof(aFrame), iOffset);
    if( rc!=SQLITE_OK ){
      return rc;
    }

    /* Write the page data */
    rc = sqlite3OsWrite(pWal->pWalFd, pData, szPage, iOffset+sizeof(aFrame));
    if( rc!=SQLITE_OK ){
      return rc;
    }
    pLast = p;
  }

  /* Sync the log file if sync_flags is non-zero. Before syncing, pad the
  ** log out to the next sector boundary by appending extra copies of the
  ** last frame. */
  if( sync_flags ){
    i64 iSegment = sqlite3OsSectorSize(pWal->pWalFd);
    i64 iOffset = walFrameOffset(iFrame+1, szPage);

    assert( isCommit );
    assert( iSegment>0 );

    iSegment = (((iOffset+iSegment-1)/iSegment) * iSegment);
    while( iOffset<iSegment ){
      void *pData;
#if defined(SQLITE_HAS_CODEC)
      if( (pData = sqlite3PagerCodec(pLast))==0 ) return SQLITE_NOMEM;
#else
      pData = pLast->pData;
#endif
      walEncodeFrame(pWal, pLast->pgno, nTruncate, pData, aFrame);
      /* testcase( IS_BIG_INT(iOffset) ); // requires a 4GiB WAL */
      rc = sqlite3OsWrite(pWal->pWalFd, aFrame, sizeof(aFrame), iOffset);
      if( rc!=SQLITE_OK ){
        return rc;
      }
      iOffset += WAL_FRAME_HDRSIZE;
      rc = sqlite3OsWrite(pWal->pWalFd, pData, szPage, iOffset);
      if( rc!=SQLITE_OK ){
        return rc;
      }
      nLast++;
      iOffset += szPage;
    }

    rc = sqlite3OsSync(pWal->pWalFd, sync_flags);
  }

  /* Append data to the wal-index. It is not necessary to lock the
  ** wal-index to do this as the SQLITE_SHM_WRITE lock held on the wal-index
  ** guarantees that there are no other writers, and no data that may
  ** be in use by existing readers is being overwritten.
  */
  iFrame = pWal->hdr.mxFrame;
  for(p=pList; p && rc==SQLITE_OK; p=p->pDirty){
    iFrame++;
    rc = walIndexAppend(pWal, iFrame, p->pgno);
  }
  while( nLast>0 && rc==SQLITE_OK ){
    iFrame++;
    nLast--;
    rc = walIndexAppend(pWal, iFrame, pLast->pgno);
  }

  if( rc==SQLITE_OK ){
    /* Update the private copy of the header. */
    pWal->hdr.szPage = (u16)((szPage&0xff00) | (szPage>>16));
    testcase( szPage<=32768 );
    testcase( szPage>=65536 );
    pWal->hdr.mxFrame = iFrame;
    if( isCommit ){
      pWal->hdr.iChange++;
      pWal->hdr.nPage = nTruncate;
    }
    /* If this is a commit, update the wal-index header too. */
    if( isCommit ){
      walIndexWriteHdr(pWal);
      pWal->iCallback = iFrame;
    }
  }

  WALTRACE(("WAL%p: frame write %s\n", pWal, rc ? "failed" : "ok"));
  return rc;
}
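
/*
** Illustrative sketch, not part of wal.c. It isolates the arithmetic of
** the sync branch above: the end of the log is rounded up to a sector
** boundary and whole frames are appended until that boundary is reached
** or passed. The demo_* names are hypothetical, and the 32-byte WAL
** header, 24-byte frame header and offset formula are assumptions about
** definitions made elsewhere in this file.
*/
```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 32-byte WAL header, then frames made of a 24-byte
** frame header followed by one page of data. */
static int64_t demo_frame_off(uint32_t iFrame, int szPage){
  assert( iFrame>=1 );
  return 32 + (int64_t)(iFrame-1) * (szPage + 24);
}

/* Number of extra copies of the last frame needed so that the log ends
** at or beyond the next sector boundary. iFrame is the index of the
** last frame already written. */
static int demo_pad_frames(uint32_t iFrame, int szPage, int64_t szSector){
  int64_t iOffset = demo_frame_off(iFrame+1, szPage);
  int64_t iEnd;
  int nExtra = 0;
  assert( szSector>0 );
  iEnd = ((iOffset + szSector - 1) / szSector) * szSector;
  while( iOffset<iEnd ){
    iOffset += 24 + szPage;   /* One more frame header plus page */
    nExtra++;
  }
  return nExtra;
}
```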

/*
** This routine is called to implement sqlite3_wal_checkpoint() and
** related interfaces.
**
** Obtain a CHECKPOINT lock and then backfill as much information as
** we can from WAL into the database.
*/
int sqlite3WalCheckpoint(
  Wal *pWal,                      /* Wal connection */
  int sync_flags,                 /* Flags to sync db file with (or 0) */
  int nBuf,                       /* Size of temporary buffer */
  u8 *zBuf                        /* Temporary buffer to use */
){
  int rc;                         /* Return code */
  int isChanged = 0;              /* True if a new wal-index header is loaded */

  assert( pWal->ckptLock==0 );

  WALTRACE(("WAL%p: checkpoint begins\n", pWal));
  rc = walLockExclusive(pWal, WAL_CKPT_LOCK, 1);
  if( rc ){
    /* Usually this is SQLITE_BUSY meaning that another thread or process
    ** is already running a checkpoint, or maybe a recovery.  But it might
    ** also be SQLITE_IOERR. */
    return rc;
  }
  pWal->ckptLock = 1;

  /* Copy data from the log to the database file. */
  rc = walIndexReadHdr(pWal, &isChanged);
  if( rc==SQLITE_OK ){
    rc = walCheckpoint(pWal, sync_flags, nBuf, zBuf);
  }
  if( isChanged ){
    /* If a new wal-index header was loaded before the checkpoint was
    ** performed, then the pager-cache associated with pWal is now
    ** out of date. So zero the cached wal-index header to ensure that
    ** next time the pager opens a snapshot on this database it knows that
    ** the cache needs to be reset.
    */
    memset(&pWal->hdr, 0, sizeof(WalIndexHdr));
  }

  /* Release the locks. */
  walUnlockExclusive(pWal, WAL_CKPT_LOCK, 1);
  pWal->ckptLock = 0;
  WALTRACE(("WAL%p: checkpoint %s\n", pWal, rc ? "failed" : "ok"));
  return rc;
}

/* Return the value to pass to a sqlite3_wal_hook callback, the
** number of frames in the WAL at the point of the last commit since
** sqlite3WalCallback() was called.  If no commits have occurred since
** the last call, then return 0.
*/
int sqlite3WalCallback(Wal *pWal){
  u32 ret = 0;
  if( pWal ){
    ret = pWal->iCallback;
    pWal->iCallback = 0;
  }
  return (int)ret;
}

/*
** This function is called to change the WAL subsystem into or out
** of locking_mode=EXCLUSIVE.
**
** If op is zero, then attempt to change from locking_mode=EXCLUSIVE
** into locking_mode=NORMAL.  This means that we must acquire a lock
** on the pWal->readLock byte.  If the WAL is already in locking_mode=NORMAL
** or if the acquisition of the lock fails, then return 0.  If the
** transition out of exclusive-mode is successful, return 1.  This
** operation must occur while the pager is still holding the exclusive
** lock on the main database file.
**
** If op is one, then change from locking_mode=NORMAL into
** locking_mode=EXCLUSIVE.  This means that the pWal->readLock must
** be released.  Return 1 if the transition is made and 0 if the
** WAL is already in exclusive-locking mode - meaning that this
** routine is a no-op.  The pager must already hold the exclusive lock
** on the main database file before invoking this operation.
**
** If op is negative, then do a dry-run of the op==1 case but do
** not actually change anything.  The pager uses this to see if it
** should acquire the database exclusive lock prior to invoking
** the op==1 case.
*/
int sqlite3WalExclusiveMode(Wal *pWal, int op){
  int rc;
  assert( pWal->writeLock==0 );

  /* pWal->readLock is usually set, but might be -1 if there was a
  ** prior error while attempting to acquire a read-lock. This cannot
  ** happen if the connection is actually in exclusive mode (as no xShmLock
  ** locks are taken in this case). Nor should the pager attempt to
  ** upgrade to exclusive-mode following such an error.
  */
  assert( pWal->readLock>=0 || pWal->lockError );
  assert( pWal->readLock>=0 || (op<=0 && pWal->exclusiveMode==0) );

  if( op==0 ){
    if( pWal->exclusiveMode ){
      pWal->exclusiveMode = 0;
      if( walLockShared(pWal, WAL_READ_LOCK(pWal->readLock))!=SQLITE_OK ){
        pWal->exclusiveMode = 1;
      }
      rc = pWal->exclusiveMode==0;
    }else{
      /* Already in locking_mode=NORMAL */
      rc = 0;
    }
  }else if( op>0 ){
    assert( pWal->exclusiveMode==0 );
    assert( pWal->readLock>=0 );
    walUnlockShared(pWal, WAL_READ_LOCK(pWal->readLock));
    pWal->exclusiveMode = 1;
    rc = 1;
  }else{
    rc = pWal->exclusiveMode==0;
  }
  return rc;
}
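
/*
** Illustrative sketch, not part of wal.c. It reduces the three op cases
** above to a small state machine so the return values can be checked in
** isolation. DemoWal and demo_exclusive_mode() are hypothetical, and the
** lockWillSucceed argument merely simulates whether taking the shared
** read lock would succeed. The sketch follows the code path above, which
** requires normal mode on entry when op is positive.
*/
```c
#include <assert.h>

typedef struct DemoWal {
  int exclusiveMode;    /* 1 while in exclusive locking mode */
  int sharedLockHeld;   /* 1 while the shared read lock is held */
} DemoWal;

/* op==0: try to leave exclusive mode. op>0: enter exclusive mode.
** op<0: dry run, report whether op>0 would change anything. */
static int demo_exclusive_mode(DemoWal *p, int op, int lockWillSucceed){
  if( op==0 ){
    if( p->exclusiveMode && lockWillSucceed ){
      p->sharedLockHeld = 1;
      p->exclusiveMode = 0;
      return 1;
    }
    return 0;   /* Already in normal mode, or the lock was not obtained */
  }else if( op>0 ){
    assert( p->exclusiveMode==0 );
    p->sharedLockHeld = 0;
    p->exclusiveMode = 1;
    return 1;
  }
  return p->exclusiveMode==0;
}
```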

#endif /* #ifndef SQLITE_OMIT_WAL */