1c438efd6Sdrh /* 27ed91f23Sdrh ** 2010 February 1 37ed91f23Sdrh ** 47ed91f23Sdrh ** The author disclaims copyright to this source code. In place of 57ed91f23Sdrh ** a legal notice, here is a blessing: 67ed91f23Sdrh ** 77ed91f23Sdrh ** May you do good and not evil. 87ed91f23Sdrh ** May you find forgiveness for yourself and forgive others. 97ed91f23Sdrh ** May you share freely, never taking more than you give. 107ed91f23Sdrh ** 117ed91f23Sdrh ************************************************************************* 127ed91f23Sdrh ** 13027a128aSdrh ** This file contains the implementation of a write-ahead log (WAL) used in 14027a128aSdrh ** "journal_mode=WAL" mode. 1529d4dbefSdrh ** 167ed91f23Sdrh ** WRITE-AHEAD LOG (WAL) FILE FORMAT 17c438efd6Sdrh ** 187e263728Sdrh ** A WAL file consists of a header followed by zero or more "frames". 19027a128aSdrh ** Each frame records the revised content of a single page from the 2029d4dbefSdrh ** database file. All changes to the database are recorded by writing 2129d4dbefSdrh ** frames into the WAL. Transactions commit when a frame is written that 2229d4dbefSdrh ** contains a commit marker. A single WAL can and usually does record 2329d4dbefSdrh ** multiple transactions. Periodically, the content of the WAL is 2429d4dbefSdrh ** transferred back into the database file in an operation called a 2529d4dbefSdrh ** "checkpoint". 2629d4dbefSdrh ** 2729d4dbefSdrh ** A single WAL file can be used multiple times. In other words, the 28027a128aSdrh ** WAL can fill up with frames and then be checkpointed and then new 2929d4dbefSdrh ** frames can overwrite the old ones. A WAL always grows from beginning 3029d4dbefSdrh ** toward the end. Checksums and counters attached to each frame are 3129d4dbefSdrh ** used to determine which frames within the WAL are valid and which 3229d4dbefSdrh ** are leftovers from prior checkpoints. 3329d4dbefSdrh ** 34cd28508eSdrh ** The WAL header is 32 bytes in size and consists of the following eight 35c438efd6Sdrh ** big-endian 32-bit unsigned integer values: 36c438efd6Sdrh ** 371b78eaf0Sdrh ** 0: Magic number. 0x377f0682 or 0x377f0683 3823ea97b6Sdrh ** 4: File format version. Currently 3007000 3923ea97b6Sdrh ** 8: Database page size. Example: 1024 4023ea97b6Sdrh ** 12: Checkpoint sequence number 417e263728Sdrh ** 16: Salt-1, random integer incremented with each checkpoint 427e263728Sdrh ** 20: Salt-2, a different random integer changing with each ckpt 4310f5a50eSdan ** 24: Checksum-1 (first part of checksum for first 24 bytes of header). 4410f5a50eSdan ** 28: Checksum-2 (second part of checksum for first 24 bytes of header). 45c438efd6Sdrh ** 4623ea97b6Sdrh ** Immediately following the wal-header are zero or more frames. Each 4723ea97b6Sdrh ** frame consists of a 24-byte frame-header followed by a <page-size> bytes 48cd28508eSdrh ** of page data. The frame-header is six big-endian 32-bit unsigned 49c438efd6Sdrh ** integer values, as follows: 50c438efd6Sdrh ** 51c438efd6Sdrh ** 0: Page number. 52c438efd6Sdrh ** 4: For commit records, the size of the database image in pages 53c438efd6Sdrh ** after the commit. For all other records, zero. 547e263728Sdrh ** 8: Salt-1 (copied from the header) 557e263728Sdrh ** 12: Salt-2 (copied from the header) 5623ea97b6Sdrh ** 16: Checksum-1. 5723ea97b6Sdrh ** 20: Checksum-2. 5829d4dbefSdrh ** 597e263728Sdrh ** A frame is considered valid if and only if the following conditions are 607e263728Sdrh ** true: 617e263728Sdrh ** 627e263728Sdrh ** (1) The salt-1 and salt-2 values in the frame-header match 637e263728Sdrh ** salt values in the wal-header 647e263728Sdrh ** 657e263728Sdrh ** (2) The checksum values in the final 8 bytes of the frame-header 661b78eaf0Sdrh ** exactly match the checksum computed consecutively on the 671b78eaf0Sdrh ** WAL header and the first 8 bytes and the content of all frames 681b78eaf0Sdrh ** up to and including the current frame. 691b78eaf0Sdrh ** 701b78eaf0Sdrh ** The checksum is computed using 32-bit big-endian integers if the 711b78eaf0Sdrh ** magic number in the first 4 bytes of the WAL is 0x377f0683 and it 721b78eaf0Sdrh ** is computed using little-endian if the magic number is 0x377f0682. 7351b21b16Sdrh ** The checksum values are always stored in the frame header in a 7451b21b16Sdrh ** big-endian format regardless of which byte order is used to compute 7551b21b16Sdrh ** the checksum. The checksum is computed by interpreting the input as 7651b21b16Sdrh ** an even number of unsigned 32-bit integers: x[0] through x[N]. The 77ffca4301Sdrh ** algorithm used for the checksum is as follows: 7851b21b16Sdrh ** 7951b21b16Sdrh ** for i from 0 to n-1 step 2: 8051b21b16Sdrh ** s0 += x[i] + s1; 8151b21b16Sdrh ** s1 += x[i+1] + s0; 8251b21b16Sdrh ** endfor 837e263728Sdrh ** 84cd28508eSdrh ** Note that s0 and s1 are both weighted checksums using fibonacci weights 85cd28508eSdrh ** in reverse order (the largest fibonacci weight occurs on the first element 86cd28508eSdrh ** of the sequence being summed.) The s1 value spans all 32-bit 87cd28508eSdrh ** terms of the sequence whereas s0 omits the final term. 88cd28508eSdrh ** 897e263728Sdrh ** On a checkpoint, the WAL is first VFS.xSync-ed, then valid content of the 907e263728Sdrh ** WAL is transferred into the database, then the database is VFS.xSync-ed. 91ffca4301Sdrh ** The VFS.xSync operations serve as write barriers - all writes launched 927e263728Sdrh ** before the xSync must complete before any write that launches after the 937e263728Sdrh ** xSync begins. 947e263728Sdrh ** 957e263728Sdrh ** After each checkpoint, the salt-1 value is incremented and the salt-2 967e263728Sdrh ** value is randomized. This prevents old and new frames in the WAL from 977e263728Sdrh ** being considered valid at the same time and being checkpointing together 987e263728Sdrh ** following a crash. 997e263728Sdrh ** 10029d4dbefSdrh ** READER ALGORITHM 10129d4dbefSdrh ** 10229d4dbefSdrh ** To read a page from the database (call it page number P), a reader 10329d4dbefSdrh ** first checks the WAL to see if it contains page P. If so, then the 10473b64e4dSdrh ** last valid instance of page P that is a followed by a commit frame 10573b64e4dSdrh ** or is a commit frame itself becomes the value read. If the WAL 10673b64e4dSdrh ** contains no copies of page P that are valid and which are a commit 10773b64e4dSdrh ** frame or are followed by a commit frame, then page P is read from 10873b64e4dSdrh ** the database file. 10929d4dbefSdrh ** 11073b64e4dSdrh ** To start a read transaction, the reader records the index of the last 11173b64e4dSdrh ** valid frame in the WAL. The reader uses this recorded "mxFrame" value 11273b64e4dSdrh ** for all subsequent read operations. New transactions can be appended 11373b64e4dSdrh ** to the WAL, but as long as the reader uses its original mxFrame value 11473b64e4dSdrh ** and ignores the newly appended content, it will see a consistent snapshot 11573b64e4dSdrh ** of the database from a single point in time. This technique allows 11673b64e4dSdrh ** multiple concurrent readers to view different versions of the database 11773b64e4dSdrh ** content simultaneously. 11873b64e4dSdrh ** 11973b64e4dSdrh ** The reader algorithm in the previous paragraphs works correctly, but 12029d4dbefSdrh ** because frames for page P can appear anywhere within the WAL, the 121027a128aSdrh ** reader has to scan the entire WAL looking for page P frames. If the 12229d4dbefSdrh ** WAL is large (multiple megabytes is typical) that scan can be slow, 123027a128aSdrh ** and read performance suffers. To overcome this problem, a separate 12429d4dbefSdrh ** data structure called the wal-index is maintained to expedite the 12529d4dbefSdrh ** search for frames of a particular page. 12629d4dbefSdrh ** 12729d4dbefSdrh ** WAL-INDEX FORMAT 12829d4dbefSdrh ** 12929d4dbefSdrh ** Conceptually, the wal-index is shared memory, though VFS implementations 13029d4dbefSdrh ** might choose to implement the wal-index using a mmapped file. Because 13129d4dbefSdrh ** the wal-index is shared memory, SQLite does not support journal_mode=WAL 13229d4dbefSdrh ** on a network filesystem. All users of the database must be able to 13329d4dbefSdrh ** share memory. 13429d4dbefSdrh ** 13507dae088Sdrh ** In the default unix and windows implementation, the wal-index is a mmapped 13607dae088Sdrh ** file whose name is the database name with a "-shm" suffix added. For that 13707dae088Sdrh ** reason, the wal-index is sometimes called the "shm" file. 13807dae088Sdrh ** 13929d4dbefSdrh ** The wal-index is transient. After a crash, the wal-index can (and should 14029d4dbefSdrh ** be) reconstructed from the original WAL file. In fact, the VFS is required 14129d4dbefSdrh ** to either truncate or zero the header of the wal-index when the last 14229d4dbefSdrh ** connection to it closes. Because the wal-index is transient, it can 14329d4dbefSdrh ** use an architecture-specific format; it does not have to be cross-platform. 14429d4dbefSdrh ** Hence, unlike the database and WAL file formats which store all values 14529d4dbefSdrh ** as big endian, the wal-index can store multi-byte values in the native 14629d4dbefSdrh ** byte order of the host computer. 14729d4dbefSdrh ** 14829d4dbefSdrh ** The purpose of the wal-index is to answer this question quickly: Given 149610b8d85Sdrh ** a page number P and a maximum frame index M, return the index of the 150610b8d85Sdrh ** last frame in the wal before frame M for page P in the WAL, or return 151610b8d85Sdrh ** NULL if there are no frames for page P in the WAL prior to M. 15229d4dbefSdrh ** 15329d4dbefSdrh ** The wal-index consists of a header region, followed by an one or 15429d4dbefSdrh ** more index blocks. 15529d4dbefSdrh ** 156027a128aSdrh ** The wal-index header contains the total number of frames within the WAL 157d5578433Smistachkin ** in the mxFrame field. 158ad3cadd8Sdan ** 159ad3cadd8Sdan ** Each index block except for the first contains information on 160ad3cadd8Sdan ** HASHTABLE_NPAGE frames. The first index block contains information on 161ad3cadd8Sdan ** HASHTABLE_NPAGE_ONE frames. The values of HASHTABLE_NPAGE_ONE and 162ad3cadd8Sdan ** HASHTABLE_NPAGE are selected so that together the wal-index header and 163ad3cadd8Sdan ** first index block are the same size as all other index blocks in the 164ad3cadd8Sdan ** wal-index. 165ad3cadd8Sdan ** 166ad3cadd8Sdan ** Each index block contains two sections, a page-mapping that contains the 167ad3cadd8Sdan ** database page number associated with each wal frame, and a hash-table 168ffca4301Sdrh ** that allows readers to query an index block for a specific page number. 169ad3cadd8Sdan ** The page-mapping is an array of HASHTABLE_NPAGE (or HASHTABLE_NPAGE_ONE 170ad3cadd8Sdan ** for the first index block) 32-bit page numbers. The first entry in the 171ad3cadd8Sdan ** first index-block contains the database page number corresponding to the 172ad3cadd8Sdan ** first frame in the WAL file. The first entry in the second index block 173ad3cadd8Sdan ** in the WAL file corresponds to the (HASHTABLE_NPAGE_ONE+1)th frame in 174ad3cadd8Sdan ** the log, and so on. 175ad3cadd8Sdan ** 176ad3cadd8Sdan ** The last index block in a wal-index usually contains less than the full 177ad3cadd8Sdan ** complement of HASHTABLE_NPAGE (or HASHTABLE_NPAGE_ONE) page-numbers, 178ad3cadd8Sdan ** depending on the contents of the WAL file. This does not change the 179ad3cadd8Sdan ** allocated size of the page-mapping array - the page-mapping array merely 180ad3cadd8Sdan ** contains unused entries. 181027a128aSdrh ** 182027a128aSdrh ** Even without using the hash table, the last frame for page P 183ad3cadd8Sdan ** can be found by scanning the page-mapping sections of each index block 184027a128aSdrh ** starting with the last index block and moving toward the first, and 185027a128aSdrh ** within each index block, starting at the end and moving toward the 186027a128aSdrh ** beginning. The first entry that equals P corresponds to the frame 187027a128aSdrh ** holding the content for that page. 188027a128aSdrh ** 189027a128aSdrh ** The hash table consists of HASHTABLE_NSLOT 16-bit unsigned integers. 190027a128aSdrh ** HASHTABLE_NSLOT = 2*HASHTABLE_NPAGE, and there is one entry in the 191027a128aSdrh ** hash table for each page number in the mapping section, so the hash 192027a128aSdrh ** table is never more than half full. The expected number of collisions 193027a128aSdrh ** prior to finding a match is 1. Each entry of the hash table is an 194027a128aSdrh ** 1-based index of an entry in the mapping section of the same 195027a128aSdrh ** index block. Let K be the 1-based index of the largest entry in 196027a128aSdrh ** the mapping section. (For index blocks other than the last, K will 197027a128aSdrh ** always be exactly HASHTABLE_NPAGE (4096) and for the last index block 198027a128aSdrh ** K will be (mxFrame%HASHTABLE_NPAGE).) Unused slots of the hash table 19973b64e4dSdrh ** contain a value of 0. 200027a128aSdrh ** 201027a128aSdrh ** To look for page P in the hash table, first compute a hash iKey on 202027a128aSdrh ** P as follows: 203027a128aSdrh ** 204027a128aSdrh ** iKey = (P * 383) % HASHTABLE_NSLOT 205027a128aSdrh ** 206027a128aSdrh ** Then start scanning entries of the hash table, starting with iKey 207027a128aSdrh ** (wrapping around to the beginning when the end of the hash table is 208027a128aSdrh ** reached) until an unused hash slot is found. Let the first unused slot 209027a128aSdrh ** be at index iUnused. (iUnused might be less than iKey if there was 210027a128aSdrh ** wrap-around.) Because the hash table is never more than half full, 211027a128aSdrh ** the search is guaranteed to eventually hit an unused entry. Let 212027a128aSdrh ** iMax be the value between iKey and iUnused, closest to iUnused, 213027a128aSdrh ** where aHash[iMax]==P. If there is no iMax entry (if there exists 214027a128aSdrh ** no hash slot such that aHash[i]==p) then page P is not in the 215027a128aSdrh ** current index block. Otherwise the iMax-th mapping entry of the 216027a128aSdrh ** current index block corresponds to the last entry that references 217027a128aSdrh ** page P. 218027a128aSdrh ** 219027a128aSdrh ** A hash search begins with the last index block and moves toward the 220027a128aSdrh ** first index block, looking for entries corresponding to page P. On 221027a128aSdrh ** average, only two or three slots in each index block need to be 222027a128aSdrh ** examined in order to either find the last entry for page P, or to 223027a128aSdrh ** establish that no such entry exists in the block. Each index block 224027a128aSdrh ** holds over 4000 entries. So two or three index blocks are sufficient 225027a128aSdrh ** to cover a typical 10 megabyte WAL file, assuming 1K pages. 8 or 10 226027a128aSdrh ** comparisons (on average) suffice to either locate a frame in the 227027a128aSdrh ** WAL or to establish that the frame does not exist in the WAL. This 228027a128aSdrh ** is much faster than scanning the entire 10MB WAL. 229027a128aSdrh ** 230027a128aSdrh ** Note that entries are added in order of increasing K. Hence, one 231027a128aSdrh ** reader might be using some value K0 and a second reader that started 232027a128aSdrh ** at a later time (after additional transactions were added to the WAL 233027a128aSdrh ** and to the wal-index) might be using a different value K1, where K1>K0. 234027a128aSdrh ** Both readers can use the same hash table and mapping section to get 235027a128aSdrh ** the correct result. There may be entries in the hash table with 236027a128aSdrh ** K>K0 but to the first reader, those entries will appear to be unused 237027a128aSdrh ** slots in the hash table and so the first reader will get an answer as 238027a128aSdrh ** if no values greater than K0 had ever been inserted into the hash table 239027a128aSdrh ** in the first place - which is what reader one wants. Meanwhile, the 240027a128aSdrh ** second reader using K1 will see additional values that were inserted 241027a128aSdrh ** later, which is exactly what reader two wants. 242027a128aSdrh ** 2436f150148Sdan ** When a rollback occurs, the value of K is decreased. Hash table entries 2446f150148Sdan ** that correspond to frames greater than the new K value are removed 2456f150148Sdan ** from the hash table at this point. 246c438efd6Sdrh */ 24729d4dbefSdrh #ifndef SQLITE_OMIT_WAL 248c438efd6Sdrh 24929d4dbefSdrh #include "wal.h" 25029d4dbefSdrh 25173b64e4dSdrh /* 252c74c3334Sdrh ** Trace output macros 253c74c3334Sdrh */ 254c74c3334Sdrh #if defined(SQLITE_TEST) && defined(SQLITE_DEBUG) 25515d68092Sdrh int sqlite3WalTrace = 0; 256c74c3334Sdrh # define WALTRACE(X) if(sqlite3WalTrace) sqlite3DebugPrintf X 257c74c3334Sdrh #else 258c74c3334Sdrh # define WALTRACE(X) 259c74c3334Sdrh #endif 260c74c3334Sdrh 26110f5a50eSdan /* 26210f5a50eSdan ** The maximum (and only) versions of the wal and wal-index formats 26310f5a50eSdan ** that may be interpreted by this version of SQLite. 26410f5a50eSdan ** 26510f5a50eSdan ** If a client begins recovering a WAL file and finds that (a) the checksum 26610f5a50eSdan ** values in the wal-header are correct and (b) the version field is not 26710f5a50eSdan ** WAL_MAX_VERSION, recovery fails and SQLite returns SQLITE_CANTOPEN. 26810f5a50eSdan ** 26910f5a50eSdan ** Similarly, if a client successfully reads a wal-index header (i.e. the 27010f5a50eSdan ** checksum test is successful) and finds that the version field is not 27110f5a50eSdan ** WALINDEX_MAX_VERSION, then no read-transaction is opened and SQLite 27210f5a50eSdan ** returns SQLITE_CANTOPEN. 27310f5a50eSdan */ 27410f5a50eSdan #define WAL_MAX_VERSION 3007000 27510f5a50eSdan #define WALINDEX_MAX_VERSION 3007000 276c74c3334Sdrh 277c74c3334Sdrh /* 27807dae088Sdrh ** Index numbers for various locking bytes. WAL_NREADER is the number 279998147ecSdrh ** of available reader locks and should be at least 3. The default 280998147ecSdrh ** is SQLITE_SHM_NLOCK==8 and WAL_NREADER==5. 28107dae088Sdrh ** 28207dae088Sdrh ** Technically, the various VFSes are free to implement these locks however 28307dae088Sdrh ** they see fit. However, compatibility is encouraged so that VFSes can 28407dae088Sdrh ** interoperate. The standard implemention used on both unix and windows 28507dae088Sdrh ** is for the index number to indicate a byte offset into the 28607dae088Sdrh ** WalCkptInfo.aLock[] array in the wal-index header. In other words, all 28707dae088Sdrh ** locks are on the shm file. The WALINDEX_LOCK_OFFSET constant (which 28807dae088Sdrh ** should be 120) is the location in the shm file for the first locking 28907dae088Sdrh ** byte. 29073b64e4dSdrh */ 29173b64e4dSdrh #define WAL_WRITE_LOCK 0 29273b64e4dSdrh #define WAL_ALL_BUT_WRITE 1 29373b64e4dSdrh #define WAL_CKPT_LOCK 1 29473b64e4dSdrh #define WAL_RECOVER_LOCK 2 29573b64e4dSdrh #define WAL_READ_LOCK(I) (3+(I)) 29673b64e4dSdrh #define WAL_NREADER (SQLITE_SHM_NLOCK-3) 29773b64e4dSdrh 298c438efd6Sdrh 2997ed91f23Sdrh /* Object declarations */ 3007ed91f23Sdrh typedef struct WalIndexHdr WalIndexHdr; 3017ed91f23Sdrh typedef struct WalIterator WalIterator; 30273b64e4dSdrh typedef struct WalCkptInfo WalCkptInfo; 303c438efd6Sdrh 304c438efd6Sdrh 305c438efd6Sdrh /* 306286a2884Sdrh ** The following object holds a copy of the wal-index header content. 307286a2884Sdrh ** 308286a2884Sdrh ** The actual header in the wal-index consists of two copies of this 309998147ecSdrh ** object followed by one instance of the WalCkptInfo object. 310998147ecSdrh ** For all versions of SQLite through 3.10.0 and probably beyond, 311998147ecSdrh ** the locking bytes (WalCkptInfo.aLock) start at offset 120 and 312998147ecSdrh ** the total header size is 136 bytes. 3139b78f791Sdrh ** 3149b78f791Sdrh ** The szPage value can be any power of 2 between 512 and 32768, inclusive. 3159b78f791Sdrh ** Or it can be 1 to represent a 65536-byte page. The latter case was 3169b78f791Sdrh ** added in 3.7.1 when support for 64K pages was added. 317c438efd6Sdrh */ 3187ed91f23Sdrh struct WalIndexHdr { 31910f5a50eSdan u32 iVersion; /* Wal-index version */ 32010f5a50eSdan u32 unused; /* Unused (padding) field */ 321c438efd6Sdrh u32 iChange; /* Counter incremented each transaction */ 3224b82c387Sdrh u8 isInit; /* 1 when initialized */ 3234b82c387Sdrh u8 bigEndCksum; /* True if checksums in WAL are big-endian */ 3249b78f791Sdrh u16 szPage; /* Database page size in bytes. 1==64K */ 325027a128aSdrh u32 mxFrame; /* Index of last valid frame in the WAL */ 326c438efd6Sdrh u32 nPage; /* Size of database in pages */ 32771d89919Sdan u32 aFrameCksum[2]; /* Checksum of last frame in log */ 32871d89919Sdan u32 aSalt[2]; /* Two salt values copied from WAL header */ 3297e263728Sdrh u32 aCksum[2]; /* Checksum over all prior fields */ 330c438efd6Sdrh }; 331c438efd6Sdrh 33273b64e4dSdrh /* 33373b64e4dSdrh ** A copy of the following object occurs in the wal-index immediately 33473b64e4dSdrh ** following the second copy of the WalIndexHdr. This object stores 33573b64e4dSdrh ** information used by checkpoint. 33673b64e4dSdrh ** 33773b64e4dSdrh ** nBackfill is the number of frames in the WAL that have been written 33873b64e4dSdrh ** back into the database. (We call the act of moving content from WAL to 33973b64e4dSdrh ** database "backfilling".) The nBackfill number is never greater than 34073b64e4dSdrh ** WalIndexHdr.mxFrame. nBackfill can only be increased by threads 34173b64e4dSdrh ** holding the WAL_CKPT_LOCK lock (which includes a recovery thread). 34273b64e4dSdrh ** However, a WAL_WRITE_LOCK thread can move the value of nBackfill from 34373b64e4dSdrh ** mxFrame back to zero when the WAL is reset. 34473b64e4dSdrh ** 345998147ecSdrh ** nBackfillAttempted is the largest value of nBackfill that a checkpoint 346998147ecSdrh ** has attempted to achieve. Normally nBackfill==nBackfillAtempted, however 347998147ecSdrh ** the nBackfillAttempted is set before any backfilling is done and the 348998147ecSdrh ** nBackfill is only set after all backfilling completes. So if a checkpoint 349998147ecSdrh ** crashes, nBackfillAttempted might be larger than nBackfill. The 350998147ecSdrh ** WalIndexHdr.mxFrame must never be less than nBackfillAttempted. 351998147ecSdrh ** 352998147ecSdrh ** The aLock[] field is a set of bytes used for locking. These bytes should 353998147ecSdrh ** never be read or written. 354998147ecSdrh ** 35573b64e4dSdrh ** There is one entry in aReadMark[] for each reader lock. If a reader 35673b64e4dSdrh ** holds read-lock K, then the value in aReadMark[K] is no greater than 357db7f647eSdrh ** the mxFrame for that reader. The value READMARK_NOT_USED (0xffffffff) 358db7f647eSdrh ** for any aReadMark[] means that entry is unused. aReadMark[0] is 359db7f647eSdrh ** a special case; its value is never used and it exists as a place-holder 360db7f647eSdrh ** to avoid having to offset aReadMark[] indexs by one. Readers holding 361db7f647eSdrh ** WAL_READ_LOCK(0) always ignore the entire WAL and read all content 362db7f647eSdrh ** directly from the database. 36373b64e4dSdrh ** 36473b64e4dSdrh ** The value of aReadMark[K] may only be changed by a thread that 36573b64e4dSdrh ** is holding an exclusive lock on WAL_READ_LOCK(K). Thus, the value of 36673b64e4dSdrh ** aReadMark[K] cannot changed while there is a reader is using that mark 36773b64e4dSdrh ** since the reader will be holding a shared lock on WAL_READ_LOCK(K). 36873b64e4dSdrh ** 36973b64e4dSdrh ** The checkpointer may only transfer frames from WAL to database where 37073b64e4dSdrh ** the frame numbers are less than or equal to every aReadMark[] that is 37173b64e4dSdrh ** in use (that is, every aReadMark[j] for which there is a corresponding 37273b64e4dSdrh ** WAL_READ_LOCK(j)). New readers (usually) pick the aReadMark[] with the 37373b64e4dSdrh ** largest value and will increase an unused aReadMark[] to mxFrame if there 37473b64e4dSdrh ** is not already an aReadMark[] equal to mxFrame. The exception to the 37573b64e4dSdrh ** previous sentence is when nBackfill equals mxFrame (meaning that everything 37673b64e4dSdrh ** in the WAL has been backfilled into the database) then new readers 37773b64e4dSdrh ** will choose aReadMark[0] which has value 0 and hence such reader will 37873b64e4dSdrh ** get all their all content directly from the database file and ignore 37973b64e4dSdrh ** the WAL. 38073b64e4dSdrh ** 38173b64e4dSdrh ** Writers normally append new frames to the end of the WAL. However, 38273b64e4dSdrh ** if nBackfill equals mxFrame (meaning that all WAL content has been 38373b64e4dSdrh ** written back into the database) and if no readers are using the WAL 38473b64e4dSdrh ** (in other words, if there are no WAL_READ_LOCK(i) where i>0) then 38573b64e4dSdrh ** the writer will first "reset" the WAL back to the beginning and start 38673b64e4dSdrh ** writing new content beginning at frame 1. 38773b64e4dSdrh ** 38873b64e4dSdrh ** We assume that 32-bit loads are atomic and so no locks are needed in 38973b64e4dSdrh ** order to read from any aReadMark[] entries. 39073b64e4dSdrh */ 39173b64e4dSdrh struct WalCkptInfo { 39273b64e4dSdrh u32 nBackfill; /* Number of WAL frames backfilled into DB */ 39373b64e4dSdrh u32 aReadMark[WAL_NREADER]; /* Reader marks */ 394998147ecSdrh u8 aLock[SQLITE_SHM_NLOCK]; /* Reserved space for locks */ 395998147ecSdrh u32 nBackfillAttempted; /* WAL frames perhaps written, or maybe not */ 396998147ecSdrh u32 notUsed0; /* Available for future enhancements */ 39773b64e4dSdrh }; 398db7f647eSdrh #define READMARK_NOT_USED 0xffffffff 39973b64e4dSdrh 40073b64e4dSdrh 4017e263728Sdrh /* A block of WALINDEX_LOCK_RESERVED bytes beginning at 4027e263728Sdrh ** WALINDEX_LOCK_OFFSET is reserved for locks. Since some systems 4037e263728Sdrh ** only support mandatory file-locks, we do not read or write data 4047e263728Sdrh ** from the region of the file on which locks are applied. 405c438efd6Sdrh */ 406998147ecSdrh #define WALINDEX_LOCK_OFFSET (sizeof(WalIndexHdr)*2+offsetof(WalCkptInfo,aLock)) 407998147ecSdrh #define WALINDEX_HDR_SIZE (sizeof(WalIndexHdr)*2+sizeof(WalCkptInfo)) 408c438efd6Sdrh 4097ed91f23Sdrh /* Size of header before each frame in wal */ 41023ea97b6Sdrh #define WAL_FRAME_HDRSIZE 24 411c438efd6Sdrh 41210f5a50eSdan /* Size of write ahead log header, including checksum. */ 41310f5a50eSdan #define WAL_HDRSIZE 32 414c438efd6Sdrh 415b8fd6c2fSdan /* WAL magic value. Either this value, or the same value with the least 416b8fd6c2fSdan ** significant bit also set (WAL_MAGIC | 0x00000001) is stored in 32-bit 417b8fd6c2fSdan ** big-endian format in the first 4 bytes of a WAL file. 418b8fd6c2fSdan ** 419b8fd6c2fSdan ** If the LSB is set, then the checksums for each frame within the WAL 420b8fd6c2fSdan ** file are calculated by treating all data as an array of 32-bit 421b8fd6c2fSdan ** big-endian words. Otherwise, they are calculated by interpreting 422b8fd6c2fSdan ** all data as 32-bit little-endian words. 423b8fd6c2fSdan */ 424b8fd6c2fSdan #define WAL_MAGIC 0x377f0682 425b8fd6c2fSdan 426c438efd6Sdrh /* 4277ed91f23Sdrh ** Return the offset of frame iFrame in the write-ahead log file, 4286e81096fSdrh ** assuming a database page size of szPage bytes. The offset returned 4297ed91f23Sdrh ** is to the start of the write-ahead log frame-header. 430c438efd6Sdrh */ 4316e81096fSdrh #define walFrameOffset(iFrame, szPage) ( \ 432bd0e9070Sdan WAL_HDRSIZE + ((iFrame)-1)*(i64)((szPage)+WAL_FRAME_HDRSIZE) \ 433c438efd6Sdrh ) 434c438efd6Sdrh 435c438efd6Sdrh /* 4367ed91f23Sdrh ** An open write-ahead log file is represented by an instance of the 4377ed91f23Sdrh ** following object. 438c438efd6Sdrh */ 4397ed91f23Sdrh struct Wal { 44073b64e4dSdrh sqlite3_vfs *pVfs; /* The VFS used to create pDbFd */ 441d9e5c4f6Sdrh sqlite3_file *pDbFd; /* File handle for the database file */ 442d9e5c4f6Sdrh sqlite3_file *pWalFd; /* File handle for WAL file */ 443c438efd6Sdrh u32 iCallback; /* Value to pass to log callback (or 0) */ 44485a83755Sdrh i64 mxWalSize; /* Truncate WAL to this size upon reset */ 44513a3cb82Sdan int nWiData; /* Size of array apWiData */ 44688f975a7Sdrh int szFirstBlock; /* Size of first block written to WAL file */ 44713a3cb82Sdan volatile u32 **apWiData; /* Pointer to wal-index content in memory */ 448b2eced5dSdrh u32 szPage; /* Database page size */ 44973b64e4dSdrh i16 readLock; /* Which read lock is being held. -1 for none */ 4504eb02a45Sdrh u8 syncFlags; /* Flags to use to sync header writes */ 4515543759bSdan u8 exclusiveMode; /* Non-zero if connection is in exclusive mode */ 45273b64e4dSdrh u8 writeLock; /* True if in a write transaction */ 45373b64e4dSdrh u8 ckptLock; /* True if holding a checkpoint lock */ 45466dfec8bSdrh u8 readOnly; /* WAL_RDWR, WAL_RDONLY, or WAL_SHM_RDONLY */ 455f60b7f36Sdan u8 truncateOnCommit; /* True to truncate WAL file on commit */ 456d992b150Sdrh u8 syncHeader; /* Fsync the WAL header if true */ 457374f4a04Sdrh u8 padToSectorBoundary; /* Pad transactions out to the next sector */ 45885bc6df2Sdrh u8 bShmUnreliable; /* SHM content is read-only and unreliable */ 45973b64e4dSdrh WalIndexHdr hdr; /* Wal-index header for current transaction */ 460b8c7cfb8Sdan u32 minFrame; /* Ignore wal frames before this one */ 461c9a9022bSdan u32 iReCksum; /* On commit, recalculate checksums from here */ 4623e875ef3Sdan const char *zWalName; /* Name of WAL file */ 4637e263728Sdrh u32 nCkpt; /* Checkpoint sequence counter in the wal-header */ 464aab4c02eSdrh #ifdef SQLITE_DEBUG 465aab4c02eSdrh u8 lockError; /* True if a locking error has occurred */ 466aab4c02eSdrh #endif 467fc1acf33Sdan #ifdef SQLITE_ENABLE_SNAPSHOT 468998147ecSdrh WalIndexHdr *pSnapshot; /* Start transaction here if not NULL */ 469fc1acf33Sdan #endif 470c438efd6Sdrh }; 471c438efd6Sdrh 47273b64e4dSdrh /* 4738c408004Sdan ** Candidate values for Wal.exclusiveMode. 4748c408004Sdan */ 4758c408004Sdan #define WAL_NORMAL_MODE 0 4768c408004Sdan #define WAL_EXCLUSIVE_MODE 1 4778c408004Sdan #define WAL_HEAPMEMORY_MODE 2 4788c408004Sdan 4798c408004Sdan /* 48066dfec8bSdrh ** Possible values for WAL.readOnly 48166dfec8bSdrh */ 48266dfec8bSdrh #define WAL_RDWR 0 /* Normal read/write connection */ 48366dfec8bSdrh #define WAL_RDONLY 1 /* The WAL file is readonly */ 48466dfec8bSdrh #define WAL_SHM_RDONLY 2 /* The SHM file is readonly */ 48566dfec8bSdrh 48666dfec8bSdrh /* 487067f3165Sdan ** Each page of the wal-index mapping contains a hash-table made up of 488067f3165Sdan ** an array of HASHTABLE_NSLOT elements of the following type. 489067f3165Sdan */ 490067f3165Sdan typedef u16 ht_slot; 491067f3165Sdan 492067f3165Sdan /* 493ad3cadd8Sdan ** This structure is used to implement an iterator that loops through 494ad3cadd8Sdan ** all frames in the WAL in database page order. Where two or more frames 495ad3cadd8Sdan ** correspond to the same database page, the iterator visits only the 496ad3cadd8Sdan ** frame most recently written to the WAL (in other words, the frame with 497ad3cadd8Sdan ** the largest index). 498ad3cadd8Sdan ** 499ad3cadd8Sdan ** The internals of this structure are only accessed by: 500ad3cadd8Sdan ** 501ad3cadd8Sdan ** walIteratorInit() - Create a new iterator, 502ad3cadd8Sdan ** walIteratorNext() - Step an iterator, 503ad3cadd8Sdan ** walIteratorFree() - Free an iterator. 504ad3cadd8Sdan ** 505ad3cadd8Sdan ** This functionality is used by the checkpoint code (see walCheckpoint()). 506ad3cadd8Sdan */ 507ad3cadd8Sdan struct WalIterator { 508ad3cadd8Sdan int iPrior; /* Last result returned from the iterator */ 509d9c9b78eSdrh int nSegment; /* Number of entries in aSegment[] */ 510ad3cadd8Sdan struct WalSegment { 511ad3cadd8Sdan int iNext; /* Next slot in aIndex[] not yet returned */ 512ad3cadd8Sdan ht_slot *aIndex; /* i0, i1, i2... such that aPgno[iN] ascend */ 513ad3cadd8Sdan u32 *aPgno; /* Array of page numbers. */ 514d9c9b78eSdrh int nEntry; /* Nr. of entries in aPgno[] and aIndex[] */ 515ad3cadd8Sdan int iZero; /* Frame number associated with aPgno[0] */ 516d9c9b78eSdrh } aSegment[1]; /* One for every 32KB page in the wal-index */ 517ad3cadd8Sdan }; 518ad3cadd8Sdan 519ad3cadd8Sdan /* 52013a3cb82Sdan ** Define the parameters of the hash tables in the wal-index file. There 52113a3cb82Sdan ** is a hash-table following every HASHTABLE_NPAGE page numbers in the 52213a3cb82Sdan ** wal-index. 52313a3cb82Sdan ** 52413a3cb82Sdan ** Changing any of these constants will alter the wal-index format and 52513a3cb82Sdan ** create incompatibilities. 52613a3cb82Sdan */ 527067f3165Sdan #define HASHTABLE_NPAGE 4096 /* Must be power of 2 */ 52813a3cb82Sdan #define HASHTABLE_HASH_1 383 /* Should be prime */ 52913a3cb82Sdan #define HASHTABLE_NSLOT (HASHTABLE_NPAGE*2) /* Must be a power of 2 */ 53013a3cb82Sdan 531ad3cadd8Sdan /* 532ad3cadd8Sdan ** The block of page numbers associated with the first hash-table in a 53313a3cb82Sdan ** wal-index is smaller than usual. This is so that there is a complete 53413a3cb82Sdan ** hash-table on each aligned 32KB page of the wal-index. 53513a3cb82Sdan */ 536067f3165Sdan #define HASHTABLE_NPAGE_ONE (HASHTABLE_NPAGE - (WALINDEX_HDR_SIZE/sizeof(u32))) 53713a3cb82Sdan 538067f3165Sdan /* The wal-index is divided into pages of WALINDEX_PGSZ bytes each. */ 539067f3165Sdan #define WALINDEX_PGSZ ( \ 540067f3165Sdan sizeof(ht_slot)*HASHTABLE_NSLOT + HASHTABLE_NPAGE*sizeof(u32) \ 541067f3165Sdan ) 54213a3cb82Sdan 54313a3cb82Sdan /* 54413a3cb82Sdan ** Obtain a pointer to the iPage'th page of the wal-index. The wal-index 545067f3165Sdan ** is broken into pages of WALINDEX_PGSZ bytes. Wal-index pages are 54613a3cb82Sdan ** numbered from zero. 54713a3cb82Sdan ** 548c05a063cSdrh ** If the wal-index is currently smaller the iPage pages then the size 549c05a063cSdrh ** of the wal-index might be increased, but only if it is safe to do 550c05a063cSdrh ** so. It is safe to enlarge the wal-index if pWal->writeLock is true 551c05a063cSdrh ** or pWal->exclusiveMode==WAL_HEAPMEMORY_MODE. 552c05a063cSdrh ** 55313a3cb82Sdan ** If this call is successful, *ppPage is set to point to the wal-index 55413a3cb82Sdan ** page and SQLITE_OK is returned. If an error (an OOM or VFS error) occurs, 55513a3cb82Sdan ** then an SQLite error code is returned and *ppPage is set to 0. 55613a3cb82Sdan */ 5572e178d73Sdrh static SQLITE_NOINLINE int walIndexPageRealloc( 5582e178d73Sdrh Wal *pWal, /* The WAL context */ 5592e178d73Sdrh int iPage, /* The page we seek */ 5602e178d73Sdrh volatile u32 **ppPage /* Write the page pointer here */ 5612e178d73Sdrh ){ 56213a3cb82Sdan int rc = SQLITE_OK; 56313a3cb82Sdan 56413a3cb82Sdan /* Enlarge the pWal->apWiData[] array if required */ 56513a3cb82Sdan if( pWal->nWiData<=iPage ){ 56613a3cb82Sdan int nByte = sizeof(u32*)*(iPage+1); 56713a3cb82Sdan volatile u32 **apNew; 568f3cdcdccSdrh apNew = (volatile u32 **)sqlite3_realloc64((void *)pWal->apWiData, nByte); 56913a3cb82Sdan if( !apNew ){ 57013a3cb82Sdan *ppPage = 0; 571fad3039cSmistachkin return SQLITE_NOMEM_BKPT; 57213a3cb82Sdan } 573519426aaSdrh memset((void*)&apNew[pWal->nWiData], 0, 574519426aaSdrh sizeof(u32*)*(iPage+1-pWal->nWiData)); 57513a3cb82Sdan pWal->apWiData = apNew; 57613a3cb82Sdan pWal->nWiData = iPage+1; 57713a3cb82Sdan } 57813a3cb82Sdan 57913a3cb82Sdan /* Request a pointer to the required page from the VFS */ 580c0ec2f77Sdrh assert( pWal->apWiData[iPage]==0 ); 5818c408004Sdan if( pWal->exclusiveMode==WAL_HEAPMEMORY_MODE ){ 5828c408004Sdan pWal->apWiData[iPage] = (u32 volatile *)sqlite3MallocZero(WALINDEX_PGSZ); 583fad3039cSmistachkin if( !pWal->apWiData[iPage] ) rc = SQLITE_NOMEM_BKPT; 5848c408004Sdan }else{ 58518801915Sdan rc = sqlite3OsShmMap(pWal->pDbFd, iPage, WALINDEX_PGSZ, 58613a3cb82Sdan pWal->writeLock, (void volatile **)&pWal->apWiData[iPage] 58713a3cb82Sdan ); 588c05a063cSdrh assert( pWal->apWiData[iPage]!=0 || rc!=SQLITE_OK || pWal->writeLock==0 ); 589c05a063cSdrh testcase( pWal->apWiData[iPage]==0 && rc==SQLITE_OK ); 59092c02da3Sdan if( (rc&0xff)==SQLITE_READONLY ){ 59166dfec8bSdrh pWal->readOnly |= WAL_SHM_RDONLY; 59292c02da3Sdan if( rc==SQLITE_READONLY ){ 59366dfec8bSdrh rc = SQLITE_OK; 5944edc6bf3Sdan } 59513a3cb82Sdan } 5968c408004Sdan } 597b6d2f9c5Sdan 59866dfec8bSdrh *ppPage = pWal->apWiData[iPage]; 59913a3cb82Sdan assert( iPage==0 || *ppPage || rc!=SQLITE_OK ); 60013a3cb82Sdan return rc; 60113a3cb82Sdan } 6022e178d73Sdrh static int walIndexPage( 6032e178d73Sdrh Wal *pWal, /* The WAL context */ 6042e178d73Sdrh int iPage, /* The page we seek */ 6052e178d73Sdrh volatile u32 **ppPage /* Write the page pointer here */ 6062e178d73Sdrh ){ 6072e178d73Sdrh if( pWal->nWiData<=iPage || (*ppPage = pWal->apWiData[iPage])==0 ){ 6082e178d73Sdrh return walIndexPageRealloc(pWal, iPage, ppPage); 6092e178d73Sdrh } 6102e178d73Sdrh return SQLITE_OK; 6112e178d73Sdrh } 61213a3cb82Sdan 61313a3cb82Sdan /* 61473b64e4dSdrh ** Return a pointer to the WalCkptInfo structure in the wal-index. 61573b64e4dSdrh */ 61673b64e4dSdrh static volatile WalCkptInfo *walCkptInfo(Wal *pWal){ 6174280eb30Sdan assert( pWal->nWiData>0 && pWal->apWiData[0] ); 6184280eb30Sdan return (volatile WalCkptInfo*)&(pWal->apWiData[0][sizeof(WalIndexHdr)/2]); 6194280eb30Sdan } 6204280eb30Sdan 6214280eb30Sdan /* 6224280eb30Sdan ** Return a pointer to the WalIndexHdr structure in the wal-index. 6234280eb30Sdan */ 6244280eb30Sdan static volatile WalIndexHdr *walIndexHdr(Wal *pWal){ 6254280eb30Sdan assert( pWal->nWiData>0 && pWal->apWiData[0] ); 6264280eb30Sdan return (volatile WalIndexHdr*)pWal->apWiData[0]; 62773b64e4dSdrh } 62873b64e4dSdrh 629c438efd6Sdrh /* 630b8fd6c2fSdan ** The argument to this macro must be of type u32. On a little-endian 631b8fd6c2fSdan ** architecture, it returns the u32 value that results from interpreting 632b8fd6c2fSdan ** the 4 bytes as a big-endian value. On a big-endian architecture, it 63360ec914cSpeter.d.reid ** returns the value that would be produced by interpreting the 4 bytes 634b8fd6c2fSdan ** of the input value as a little-endian integer. 635b8fd6c2fSdan */ 636b8fd6c2fSdan #define BYTESWAP32(x) ( \ 637b8fd6c2fSdan (((x)&0x000000FF)<<24) + (((x)&0x0000FF00)<<8) \ 638b8fd6c2fSdan + (((x)&0x00FF0000)>>8) + (((x)&0xFF000000)>>24) \ 639b8fd6c2fSdan ) 640c438efd6Sdrh 641c438efd6Sdrh /* 6427e263728Sdrh ** Generate or extend an 8 byte checksum based on the data in 6437e263728Sdrh ** array aByte[] and the initial values of aIn[0] and aIn[1] (or 6447e263728Sdrh ** initial values of 0 and 0 if aIn==NULL). 6457e263728Sdrh ** 6467e263728Sdrh ** The checksum is written back into aOut[] before returning. 6477e263728Sdrh ** 6487e263728Sdrh ** nByte must be a positive multiple of 8. 649c438efd6Sdrh */ 6507e263728Sdrh static void walChecksumBytes( 651b8fd6c2fSdan int nativeCksum, /* True for native byte-order, false for non-native */ 6527e263728Sdrh u8 *a, /* Content to be checksummed */ 6537e263728Sdrh int nByte, /* Bytes of content in a[]. Must be a multiple of 8. */ 6547e263728Sdrh const u32 *aIn, /* Initial checksum value input */ 6557e263728Sdrh u32 *aOut /* OUT: Final checksum value output */ 6567e263728Sdrh ){ 6577e263728Sdrh u32 s1, s2; 658b8fd6c2fSdan u32 *aData = (u32 *)a; 659b8fd6c2fSdan u32 *aEnd = (u32 *)&a[nByte]; 660b8fd6c2fSdan 6617e263728Sdrh if( aIn ){ 6627e263728Sdrh s1 = aIn[0]; 6637e263728Sdrh s2 = aIn[1]; 6647e263728Sdrh }else{ 6657e263728Sdrh s1 = s2 = 0; 6667e263728Sdrh } 667c438efd6Sdrh 668584c754dSdrh assert( nByte>=8 ); 669b8fd6c2fSdan assert( (nByte&0x00000007)==0 ); 670c438efd6Sdrh 671b8fd6c2fSdan if( nativeCksum ){ 672c438efd6Sdrh do { 673b8fd6c2fSdan s1 += *aData++ + s2; 674b8fd6c2fSdan s2 += *aData++ + s1; 675b8fd6c2fSdan }while( aData<aEnd ); 676b8fd6c2fSdan }else{ 677b8fd6c2fSdan do { 678b8fd6c2fSdan s1 += BYTESWAP32(aData[0]) + s2; 679b8fd6c2fSdan s2 += BYTESWAP32(aData[1]) + s1; 680b8fd6c2fSdan aData += 2; 681b8fd6c2fSdan }while( aData<aEnd ); 682b8fd6c2fSdan } 683b8fd6c2fSdan 6847e263728Sdrh aOut[0] = s1; 6857e263728Sdrh aOut[1] = s2; 686c438efd6Sdrh } 687c438efd6Sdrh 6888c408004Sdan static void walShmBarrier(Wal *pWal){ 6898c408004Sdan if( pWal->exclusiveMode!=WAL_HEAPMEMORY_MODE ){ 6908c408004Sdan sqlite3OsShmBarrier(pWal->pDbFd); 6918c408004Sdan } 6928c408004Sdan } 6938c408004Sdan 694c438efd6Sdrh /* 6957e263728Sdrh ** Write the header information in pWal->hdr into the wal-index. 6967e263728Sdrh ** 6977e263728Sdrh ** The checksum on pWal->hdr is updated before it is written. 6987ed91f23Sdrh */ 6997e263728Sdrh static void walIndexWriteHdr(Wal *pWal){ 7004280eb30Sdan volatile WalIndexHdr *aHdr = walIndexHdr(pWal); 7014280eb30Sdan const int nCksum = offsetof(WalIndexHdr, aCksum); 70273b64e4dSdrh 70373b64e4dSdrh assert( pWal->writeLock ); 7044b82c387Sdrh pWal->hdr.isInit = 1; 70510f5a50eSdan pWal->hdr.iVersion = WALINDEX_MAX_VERSION; 7064280eb30Sdan walChecksumBytes(1, (u8*)&pWal->hdr, nCksum, 0, pWal->hdr.aCksum); 707f6bff3f5Sdrh memcpy((void*)&aHdr[1], (const void*)&pWal->hdr, sizeof(WalIndexHdr)); 7088c408004Sdan walShmBarrier(pWal); 709f6bff3f5Sdrh memcpy((void*)&aHdr[0], (const void*)&pWal->hdr, sizeof(WalIndexHdr)); 710c438efd6Sdrh } 711c438efd6Sdrh 712c438efd6Sdrh /* 713c438efd6Sdrh ** This function encodes a single frame header and writes it to a buffer 7147ed91f23Sdrh ** supplied by the caller. A frame-header is made up of a series of 715c438efd6Sdrh ** 4-byte big-endian integers, as follows: 716c438efd6Sdrh ** 71723ea97b6Sdrh ** 0: Page number. 71823ea97b6Sdrh ** 4: For commit records, the size of the database image in pages 71923ea97b6Sdrh ** after the commit. For all other records, zero. 7207e263728Sdrh ** 8: Salt-1 (copied from the wal-header) 7217e263728Sdrh ** 12: Salt-2 (copied from the wal-header) 72223ea97b6Sdrh ** 16: Checksum-1. 72323ea97b6Sdrh ** 20: Checksum-2. 724c438efd6Sdrh */ 7257ed91f23Sdrh static void walEncodeFrame( 72623ea97b6Sdrh Wal *pWal, /* The write-ahead log */ 727c438efd6Sdrh u32 iPage, /* Database page number for frame */ 728c438efd6Sdrh u32 nTruncate, /* New db size (or 0 for non-commit frames) */ 7297e263728Sdrh u8 *aData, /* Pointer to page data */ 730c438efd6Sdrh u8 *aFrame /* OUT: Write encoded frame here */ 731c438efd6Sdrh ){ 732b8fd6c2fSdan int nativeCksum; /* True for native byte-order checksums */ 73371d89919Sdan u32 *aCksum = pWal->hdr.aFrameCksum; 73423ea97b6Sdrh assert( WAL_FRAME_HDRSIZE==24 ); 735c438efd6Sdrh sqlite3Put4byte(&aFrame[0], iPage); 736c438efd6Sdrh sqlite3Put4byte(&aFrame[4], nTruncate); 737c9a9022bSdan if( pWal->iReCksum==0 ){ 7387e263728Sdrh memcpy(&aFrame[8], pWal->hdr.aSalt, 8); 739c438efd6Sdrh 740b8fd6c2fSdan nativeCksum = (pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN); 74171d89919Sdan walChecksumBytes(nativeCksum, aFrame, 8, aCksum, aCksum); 742b8fd6c2fSdan walChecksumBytes(nativeCksum, aData, pWal->szPage, aCksum, aCksum); 743c438efd6Sdrh 74423ea97b6Sdrh sqlite3Put4byte(&aFrame[16], aCksum[0]); 74523ea97b6Sdrh sqlite3Put4byte(&aFrame[20], aCksum[1]); 746869aaf09Sdrh }else{ 747869aaf09Sdrh memset(&aFrame[8], 0, 16); 748c438efd6Sdrh } 749c9a9022bSdan } 750c438efd6Sdrh 751c438efd6Sdrh /* 7527e263728Sdrh ** Check to see if the frame with header in aFrame[] and content 7537e263728Sdrh ** in aData[] is valid. If it is a valid frame, fill *piPage and 7547e263728Sdrh ** *pnTruncate and return true. Return if the frame is not valid. 755c438efd6Sdrh */ 7567ed91f23Sdrh static int walDecodeFrame( 75723ea97b6Sdrh Wal *pWal, /* The write-ahead log */ 758c438efd6Sdrh u32 *piPage, /* OUT: Database page number for frame */ 759c438efd6Sdrh u32 *pnTruncate, /* OUT: New db size (or 0 if not commit) */ 760c438efd6Sdrh u8 *aData, /* Pointer to page data (for checksum) */ 761c438efd6Sdrh u8 *aFrame /* Frame data */ 762c438efd6Sdrh ){ 763b8fd6c2fSdan int nativeCksum; /* True for native byte-order checksums */ 76471d89919Sdan u32 *aCksum = pWal->hdr.aFrameCksum; 765c8179157Sdrh u32 pgno; /* Page number of the frame */ 76623ea97b6Sdrh assert( WAL_FRAME_HDRSIZE==24 ); 76723ea97b6Sdrh 7687e263728Sdrh /* A frame is only valid if the salt values in the frame-header 7697e263728Sdrh ** match the salt values in the wal-header. 7707e263728Sdrh */ 7717e263728Sdrh if( memcmp(&pWal->hdr.aSalt, &aFrame[8], 8)!=0 ){ 77223ea97b6Sdrh return 0; 77323ea97b6Sdrh } 774c438efd6Sdrh 775c8179157Sdrh /* A frame is only valid if the page number is creater than zero. 776c8179157Sdrh */ 777c8179157Sdrh pgno = sqlite3Get4byte(&aFrame[0]); 778c8179157Sdrh if( pgno==0 ){ 779c8179157Sdrh return 0; 780c8179157Sdrh } 781c8179157Sdrh 782519426aaSdrh /* A frame is only valid if a checksum of the WAL header, 783519426aaSdrh ** all prior frams, the first 16 bytes of this frame-header, 784519426aaSdrh ** and the frame-data matches the checksum in the last 8 785519426aaSdrh ** bytes of this frame-header. 7867e263728Sdrh */ 787b8fd6c2fSdan nativeCksum = (pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN); 78871d89919Sdan walChecksumBytes(nativeCksum, aFrame, 8, aCksum, aCksum); 789b8fd6c2fSdan walChecksumBytes(nativeCksum, aData, pWal->szPage, aCksum, aCksum); 79023ea97b6Sdrh if( aCksum[0]!=sqlite3Get4byte(&aFrame[16]) 79123ea97b6Sdrh || aCksum[1]!=sqlite3Get4byte(&aFrame[20]) 792c438efd6Sdrh ){ 793c438efd6Sdrh /* Checksum failed. */ 794c438efd6Sdrh return 0; 795c438efd6Sdrh } 796c438efd6Sdrh 7977e263728Sdrh /* If we reach this point, the frame is valid. Return the page number 7987e263728Sdrh ** and the new database size. 7997e263728Sdrh */ 800c8179157Sdrh *piPage = pgno; 801c438efd6Sdrh *pnTruncate = sqlite3Get4byte(&aFrame[4]); 802c438efd6Sdrh return 1; 803c438efd6Sdrh } 804c438efd6Sdrh 805c438efd6Sdrh 806c74c3334Sdrh #if defined(SQLITE_TEST) && defined(SQLITE_DEBUG) 807c74c3334Sdrh /* 808181e091fSdrh ** Names of locks. This routine is used to provide debugging output and is not 809181e091fSdrh ** a part of an ordinary build. 810c74c3334Sdrh */ 811c74c3334Sdrh static const char *walLockName(int lockIdx){ 812c74c3334Sdrh if( lockIdx==WAL_WRITE_LOCK ){ 813c74c3334Sdrh return "WRITE-LOCK"; 814c74c3334Sdrh }else if( lockIdx==WAL_CKPT_LOCK ){ 815c74c3334Sdrh return "CKPT-LOCK"; 816c74c3334Sdrh }else if( lockIdx==WAL_RECOVER_LOCK ){ 817c74c3334Sdrh return "RECOVER-LOCK"; 818c74c3334Sdrh }else{ 819c74c3334Sdrh static char zName[15]; 820c74c3334Sdrh sqlite3_snprintf(sizeof(zName), zName, "READ-LOCK[%d]", 821c74c3334Sdrh lockIdx-WAL_READ_LOCK(0)); 822c74c3334Sdrh return zName; 823c74c3334Sdrh } 824c74c3334Sdrh } 825c74c3334Sdrh #endif /*defined(SQLITE_TEST) || defined(SQLITE_DEBUG) */ 826c74c3334Sdrh 827c74c3334Sdrh 828c438efd6Sdrh /* 829181e091fSdrh ** Set or release locks on the WAL. Locks are either shared or exclusive. 830181e091fSdrh ** A lock cannot be moved directly between shared and exclusive - it must go 831181e091fSdrh ** through the unlocked state first. 83273b64e4dSdrh ** 83373b64e4dSdrh ** In locking_mode=EXCLUSIVE, all of these routines become no-ops. 83473b64e4dSdrh */ 83573b64e4dSdrh static int walLockShared(Wal *pWal, int lockIdx){ 836c74c3334Sdrh int rc; 83773b64e4dSdrh if( pWal->exclusiveMode ) return SQLITE_OK; 838c74c3334Sdrh rc = sqlite3OsShmLock(pWal->pDbFd, lockIdx, 1, 83973b64e4dSdrh SQLITE_SHM_LOCK | SQLITE_SHM_SHARED); 840c74c3334Sdrh WALTRACE(("WAL%p: acquire SHARED-%s %s\n", pWal, 841c74c3334Sdrh walLockName(lockIdx), rc ? "failed" : "ok")); 8425eba1f60Sshaneh VVA_ONLY( pWal->lockError = (u8)(rc!=SQLITE_OK && rc!=SQLITE_BUSY); ) 843c74c3334Sdrh return rc; 84473b64e4dSdrh } 84573b64e4dSdrh static void walUnlockShared(Wal *pWal, int lockIdx){ 84673b64e4dSdrh if( pWal->exclusiveMode ) return; 84773b64e4dSdrh (void)sqlite3OsShmLock(pWal->pDbFd, lockIdx, 1, 84873b64e4dSdrh SQLITE_SHM_UNLOCK | SQLITE_SHM_SHARED); 849c74c3334Sdrh WALTRACE(("WAL%p: release SHARED-%s\n", pWal, walLockName(lockIdx))); 85073b64e4dSdrh } 851ab372773Sdrh static int walLockExclusive(Wal *pWal, int lockIdx, int n){ 852c74c3334Sdrh int rc; 85373b64e4dSdrh if( pWal->exclusiveMode ) return SQLITE_OK; 854c74c3334Sdrh rc = sqlite3OsShmLock(pWal->pDbFd, lockIdx, n, 85573b64e4dSdrh SQLITE_SHM_LOCK | SQLITE_SHM_EXCLUSIVE); 856c74c3334Sdrh WALTRACE(("WAL%p: acquire EXCLUSIVE-%s cnt=%d %s\n", pWal, 857c74c3334Sdrh walLockName(lockIdx), n, rc ? "failed" : "ok")); 8585eba1f60Sshaneh VVA_ONLY( pWal->lockError = (u8)(rc!=SQLITE_OK && rc!=SQLITE_BUSY); ) 859c74c3334Sdrh return rc; 86073b64e4dSdrh } 86173b64e4dSdrh static void walUnlockExclusive(Wal *pWal, int lockIdx, int n){ 86273b64e4dSdrh if( pWal->exclusiveMode ) return; 86373b64e4dSdrh (void)sqlite3OsShmLock(pWal->pDbFd, lockIdx, n, 86473b64e4dSdrh SQLITE_SHM_UNLOCK | SQLITE_SHM_EXCLUSIVE); 865c74c3334Sdrh WALTRACE(("WAL%p: release EXCLUSIVE-%s cnt=%d\n", pWal, 866c74c3334Sdrh walLockName(lockIdx), n)); 86773b64e4dSdrh } 86873b64e4dSdrh 86973b64e4dSdrh /* 87029d4dbefSdrh ** Compute a hash on a page number. The resulting hash value must land 871181e091fSdrh ** between 0 and (HASHTABLE_NSLOT-1). The walHashNext() function advances 872181e091fSdrh ** the hash to the next value in the event of a collision. 87329d4dbefSdrh */ 87429d4dbefSdrh static int walHash(u32 iPage){ 87529d4dbefSdrh assert( iPage>0 ); 87629d4dbefSdrh assert( (HASHTABLE_NSLOT & (HASHTABLE_NSLOT-1))==0 ); 87729d4dbefSdrh return (iPage*HASHTABLE_HASH_1) & (HASHTABLE_NSLOT-1); 87829d4dbefSdrh } 87929d4dbefSdrh static int walNextHash(int iPriorHash){ 88029d4dbefSdrh return (iPriorHash+1)&(HASHTABLE_NSLOT-1); 881bb23aff3Sdan } 882bb23aff3Sdan 8834280eb30Sdan /* 8844ece2f26Sdrh ** An instance of the WalHashLoc object is used to describe the location 8854ece2f26Sdrh ** of a page hash table in the wal-index. This becomes the return value 8864ece2f26Sdrh ** from walHashGet(). 8874ece2f26Sdrh */ 8884ece2f26Sdrh typedef struct WalHashLoc WalHashLoc; 8894ece2f26Sdrh struct WalHashLoc { 8904ece2f26Sdrh volatile ht_slot *aHash; /* Start of the wal-index hash table */ 8914ece2f26Sdrh volatile u32 *aPgno; /* aPgno[1] is the page of first frame indexed */ 8924ece2f26Sdrh u32 iZero; /* One less than the frame number of first indexed*/ 8934ece2f26Sdrh }; 8944ece2f26Sdrh 8954ece2f26Sdrh /* 8964280eb30Sdan ** Return pointers to the hash table and page number array stored on 8974280eb30Sdan ** page iHash of the wal-index. The wal-index is broken into 32KB pages 8984280eb30Sdan ** numbered starting from 0. 8994280eb30Sdan ** 9004ece2f26Sdrh ** Set output variable pLoc->aHash to point to the start of the hash table 9014ece2f26Sdrh ** in the wal-index file. Set pLoc->iZero to one less than the frame 9024280eb30Sdan ** number of the first frame indexed by this hash table. If a 9034280eb30Sdan ** slot in the hash table is set to N, it refers to frame number 9044ece2f26Sdrh ** (pLoc->iZero+N) in the log. 9054280eb30Sdan ** 9064ece2f26Sdrh ** Finally, set pLoc->aPgno so that pLoc->aPgno[1] is the page number of the 9074ece2f26Sdrh ** first frame indexed by the hash table, frame (pLoc->iZero+1). 9084280eb30Sdan */ 9094280eb30Sdan static int walHashGet( 91013a3cb82Sdan Wal *pWal, /* WAL handle */ 91113a3cb82Sdan int iHash, /* Find the iHash'th table */ 9124ece2f26Sdrh WalHashLoc *pLoc /* OUT: Hash table location */ 91313a3cb82Sdan ){ 9144280eb30Sdan int rc; /* Return code */ 9154280eb30Sdan 9164ece2f26Sdrh rc = walIndexPage(pWal, iHash, &pLoc->aPgno); 9174280eb30Sdan assert( rc==SQLITE_OK || iHash>0 ); 9184280eb30Sdan 9194280eb30Sdan if( rc==SQLITE_OK ){ 9204ece2f26Sdrh pLoc->aHash = (volatile ht_slot *)&pLoc->aPgno[HASHTABLE_NPAGE]; 92113a3cb82Sdan if( iHash==0 ){ 9224ece2f26Sdrh pLoc->aPgno = &pLoc->aPgno[WALINDEX_HDR_SIZE/sizeof(u32)]; 9234ece2f26Sdrh pLoc->iZero = 0; 92413a3cb82Sdan }else{ 9254ece2f26Sdrh pLoc->iZero = HASHTABLE_NPAGE_ONE + (iHash-1)*HASHTABLE_NPAGE; 92613a3cb82Sdan } 9274ece2f26Sdrh pLoc->aPgno = &pLoc->aPgno[-1]; 92813a3cb82Sdan } 9294280eb30Sdan return rc; 9304280eb30Sdan } 93113a3cb82Sdan 9324280eb30Sdan /* 9334280eb30Sdan ** Return the number of the wal-index page that contains the hash-table 9344280eb30Sdan ** and page-number array that contain entries corresponding to WAL frame 9354280eb30Sdan ** iFrame. The wal-index is broken up into 32KB pages. Wal-index pages 9364280eb30Sdan ** are numbered starting from 0. 9374280eb30Sdan */ 93813a3cb82Sdan static int walFramePage(u32 iFrame){ 93913a3cb82Sdan int iHash = (iFrame+HASHTABLE_NPAGE-HASHTABLE_NPAGE_ONE-1) / HASHTABLE_NPAGE; 94013a3cb82Sdan assert( (iHash==0 || iFrame>HASHTABLE_NPAGE_ONE) 94113a3cb82Sdan && (iHash>=1 || iFrame<=HASHTABLE_NPAGE_ONE) 94213a3cb82Sdan && (iHash<=1 || iFrame>(HASHTABLE_NPAGE_ONE+HASHTABLE_NPAGE)) 94313a3cb82Sdan && (iHash>=2 || iFrame<=HASHTABLE_NPAGE_ONE+HASHTABLE_NPAGE) 94413a3cb82Sdan && (iHash<=2 || iFrame>(HASHTABLE_NPAGE_ONE+2*HASHTABLE_NPAGE)) 94513a3cb82Sdan ); 94613a3cb82Sdan return iHash; 94713a3cb82Sdan } 94813a3cb82Sdan 94913a3cb82Sdan /* 95013a3cb82Sdan ** Return the page number associated with frame iFrame in this WAL. 95113a3cb82Sdan */ 95213a3cb82Sdan static u32 walFramePgno(Wal *pWal, u32 iFrame){ 95313a3cb82Sdan int iHash = walFramePage(iFrame); 95413a3cb82Sdan if( iHash==0 ){ 95513a3cb82Sdan return pWal->apWiData[0][WALINDEX_HDR_SIZE/sizeof(u32) + iFrame - 1]; 95613a3cb82Sdan } 95713a3cb82Sdan return pWal->apWiData[iHash][(iFrame-1-HASHTABLE_NPAGE_ONE)%HASHTABLE_NPAGE]; 95813a3cb82Sdan } 959bb23aff3Sdan 960bb23aff3Sdan /* 961ca6b5babSdan ** Remove entries from the hash table that point to WAL slots greater 962ca6b5babSdan ** than pWal->hdr.mxFrame. 963ca6b5babSdan ** 964ca6b5babSdan ** This function is called whenever pWal->hdr.mxFrame is decreased due 965ca6b5babSdan ** to a rollback or savepoint. 966ca6b5babSdan ** 967181e091fSdrh ** At most only the hash table containing pWal->hdr.mxFrame needs to be 968181e091fSdrh ** updated. Any later hash tables will be automatically cleared when 969181e091fSdrh ** pWal->hdr.mxFrame advances to the point where those hash tables are 970181e091fSdrh ** actually needed. 971ca6b5babSdan */ 972ca6b5babSdan static void walCleanupHash(Wal *pWal){ 9734ece2f26Sdrh WalHashLoc sLoc; /* Hash table location */ 974f77bbd9fSdrh int iLimit = 0; /* Zero values greater than this */ 97513a3cb82Sdan int nByte; /* Number of bytes to zero in aPgno[] */ 97613a3cb82Sdan int i; /* Used to iterate through aHash[] */ 977ca6b5babSdan 97873b64e4dSdrh assert( pWal->writeLock ); 979ffca4301Sdrh testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE-1 ); 980ffca4301Sdrh testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE ); 981ffca4301Sdrh testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE+1 ); 9829c156477Sdrh 9834280eb30Sdan if( pWal->hdr.mxFrame==0 ) return; 9844280eb30Sdan 9854280eb30Sdan /* Obtain pointers to the hash-table and page-number array containing 9864280eb30Sdan ** the entry that corresponds to frame pWal->hdr.mxFrame. It is guaranteed 9874280eb30Sdan ** that the page said hash-table and array reside on is already mapped. 9884280eb30Sdan */ 9894280eb30Sdan assert( pWal->nWiData>walFramePage(pWal->hdr.mxFrame) ); 9904280eb30Sdan assert( pWal->apWiData[walFramePage(pWal->hdr.mxFrame)] ); 9914ece2f26Sdrh walHashGet(pWal, walFramePage(pWal->hdr.mxFrame), &sLoc); 9924280eb30Sdan 9934280eb30Sdan /* Zero all hash-table entries that correspond to frame numbers greater 9944280eb30Sdan ** than pWal->hdr.mxFrame. 9954280eb30Sdan */ 9964ece2f26Sdrh iLimit = pWal->hdr.mxFrame - sLoc.iZero; 9979c156477Sdrh assert( iLimit>0 ); 998ca6b5babSdan for(i=0; i<HASHTABLE_NSLOT; i++){ 9994ece2f26Sdrh if( sLoc.aHash[i]>iLimit ){ 10004ece2f26Sdrh sLoc.aHash[i] = 0; 1001ca6b5babSdan } 1002ca6b5babSdan } 1003ca6b5babSdan 1004ca6b5babSdan /* Zero the entries in the aPgno array that correspond to frames with 1005ca6b5babSdan ** frame numbers greater than pWal->hdr.mxFrame. 1006ca6b5babSdan */ 10074ece2f26Sdrh nByte = (int)((char *)sLoc.aHash - (char *)&sLoc.aPgno[iLimit+1]); 10084ece2f26Sdrh memset((void *)&sLoc.aPgno[iLimit+1], 0, nByte); 1009ca6b5babSdan 1010ca6b5babSdan #ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT 1011ca6b5babSdan /* Verify that the every entry in the mapping region is still reachable 1012ca6b5babSdan ** via the hash table even after the cleanup. 1013ca6b5babSdan */ 1014f77bbd9fSdrh if( iLimit ){ 10156b67a8aeSmistachkin int j; /* Loop counter */ 1016ca6b5babSdan int iKey; /* Hash key */ 10176b67a8aeSmistachkin for(j=1; j<=iLimit; j++){ 10184ece2f26Sdrh for(iKey=walHash(sLoc.aPgno[j]);sLoc.aHash[iKey];iKey=walNextHash(iKey)){ 10194ece2f26Sdrh if( sLoc.aHash[iKey]==j ) break; 1020ca6b5babSdan } 10214ece2f26Sdrh assert( sLoc.aHash[iKey]==j ); 1022ca6b5babSdan } 1023ca6b5babSdan } 1024ca6b5babSdan #endif /* SQLITE_ENABLE_EXPENSIVE_ASSERT */ 1025ca6b5babSdan } 1026ca6b5babSdan 1027bb23aff3Sdan 10287ed91f23Sdrh /* 102929d4dbefSdrh ** Set an entry in the wal-index that will map database page number 103029d4dbefSdrh ** pPage into WAL frame iFrame. 1031c438efd6Sdrh */ 10327ed91f23Sdrh static int walIndexAppend(Wal *pWal, u32 iFrame, u32 iPage){ 10334280eb30Sdan int rc; /* Return code */ 10344ece2f26Sdrh WalHashLoc sLoc; /* Wal-index hash table location */ 10354280eb30Sdan 10364ece2f26Sdrh rc = walHashGet(pWal, walFramePage(iFrame), &sLoc); 10374280eb30Sdan 10384280eb30Sdan /* Assuming the wal-index file was successfully mapped, populate the 10394280eb30Sdan ** page number array and hash table entry. 10404280eb30Sdan */ 10414280eb30Sdan if( rc==SQLITE_OK ){ 10424280eb30Sdan int iKey; /* Hash table key */ 1043bb23aff3Sdan int idx; /* Value to write to hash-table slot */ 1044519426aaSdrh int nCollide; /* Number of hash collisions */ 1045c438efd6Sdrh 10464ece2f26Sdrh idx = iFrame - sLoc.iZero; 10474280eb30Sdan assert( idx <= HASHTABLE_NSLOT/2 + 1 ); 10484280eb30Sdan 10494280eb30Sdan /* If this is the first entry to be added to this hash-table, zero the 105060ec914cSpeter.d.reid ** entire hash table and aPgno[] array before proceeding. 10514280eb30Sdan */ 1052ca6b5babSdan if( idx==1 ){ 10534ece2f26Sdrh int nByte = (int)((u8 *)&sLoc.aHash[HASHTABLE_NSLOT] 10544ece2f26Sdrh - (u8 *)&sLoc.aPgno[1]); 10554ece2f26Sdrh memset((void*)&sLoc.aPgno[1], 0, nByte); 1056ca6b5babSdan } 1057ca6b5babSdan 1058ca6b5babSdan /* If the entry in aPgno[] is already set, then the previous writer 1059ca6b5babSdan ** must have exited unexpectedly in the middle of a transaction (after 1060ca6b5babSdan ** writing one or more dirty pages to the WAL to free up memory). 1061ca6b5babSdan ** Remove the remnants of that writers uncommitted transaction from 1062ca6b5babSdan ** the hash-table before writing any new entries. 1063ca6b5babSdan */ 10644ece2f26Sdrh if( sLoc.aPgno[idx] ){ 1065ca6b5babSdan walCleanupHash(pWal); 10664ece2f26Sdrh assert( !sLoc.aPgno[idx] ); 1067ca6b5babSdan } 10684280eb30Sdan 10694280eb30Sdan /* Write the aPgno[] array entry and the hash-table slot. */ 1070519426aaSdrh nCollide = idx; 10714ece2f26Sdrh for(iKey=walHash(iPage); sLoc.aHash[iKey]; iKey=walNextHash(iKey)){ 1072519426aaSdrh if( (nCollide--)==0 ) return SQLITE_CORRUPT_BKPT; 107329d4dbefSdrh } 10744ece2f26Sdrh sLoc.aPgno[idx] = iPage; 10754ece2f26Sdrh sLoc.aHash[iKey] = (ht_slot)idx; 10764fa95bfcSdrh 10774fa95bfcSdrh #ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT 10784fa95bfcSdrh /* Verify that the number of entries in the hash table exactly equals 10794fa95bfcSdrh ** the number of entries in the mapping region. 10804fa95bfcSdrh */ 10814fa95bfcSdrh { 10824fa95bfcSdrh int i; /* Loop counter */ 10834fa95bfcSdrh int nEntry = 0; /* Number of entries in the hash table */ 10844ece2f26Sdrh for(i=0; i<HASHTABLE_NSLOT; i++){ if( sLoc.aHash[i] ) nEntry++; } 10854fa95bfcSdrh assert( nEntry==idx ); 1086c438efd6Sdrh } 108731f98fc8Sdan 10884fa95bfcSdrh /* Verify that the every entry in the mapping region is reachable 10894fa95bfcSdrh ** via the hash table. This turns out to be a really, really expensive 10904fa95bfcSdrh ** thing to check, so only do this occasionally - not on every 10914fa95bfcSdrh ** iteration. 10924fa95bfcSdrh */ 10934fa95bfcSdrh if( (idx&0x3ff)==0 ){ 10944fa95bfcSdrh int i; /* Loop counter */ 10954fa95bfcSdrh for(i=1; i<=idx; i++){ 10964ece2f26Sdrh for(iKey=walHash(sLoc.aPgno[i]); 10974ece2f26Sdrh sLoc.aHash[iKey]; 10984ece2f26Sdrh iKey=walNextHash(iKey)){ 10994ece2f26Sdrh if( sLoc.aHash[iKey]==i ) break; 11004fa95bfcSdrh } 11014ece2f26Sdrh assert( sLoc.aHash[iKey]==i ); 11024fa95bfcSdrh } 11034fa95bfcSdrh } 11044fa95bfcSdrh #endif /* SQLITE_ENABLE_EXPENSIVE_ASSERT */ 11054fa95bfcSdrh } 11064fa95bfcSdrh 11074fa95bfcSdrh 1108bb23aff3Sdan return rc; 1109c438efd6Sdrh } 1110c438efd6Sdrh 1111c438efd6Sdrh 1112c438efd6Sdrh /* 11137ed91f23Sdrh ** Recover the wal-index by reading the write-ahead log file. 111473b64e4dSdrh ** 111573b64e4dSdrh ** This routine first tries to establish an exclusive lock on the 111673b64e4dSdrh ** wal-index to prevent other threads/processes from doing anything 111773b64e4dSdrh ** with the WAL or wal-index while recovery is running. The 111873b64e4dSdrh ** WAL_RECOVER_LOCK is also held so that other threads will know 111973b64e4dSdrh ** that this thread is running recovery. If unable to establish 112073b64e4dSdrh ** the necessary locks, this routine returns SQLITE_BUSY. 1121c438efd6Sdrh */ 11227ed91f23Sdrh static int walIndexRecover(Wal *pWal){ 1123c438efd6Sdrh int rc; /* Return Code */ 1124c438efd6Sdrh i64 nSize; /* Size of log file */ 112571d89919Sdan u32 aFrameCksum[2] = {0, 0}; 1126d0aa3427Sdan int iLock; /* Lock offset to lock for checkpoint */ 1127c438efd6Sdrh 1128d0aa3427Sdan /* Obtain an exclusive lock on all byte in the locking range not already 1129d0aa3427Sdan ** locked by the caller. The caller is guaranteed to have locked the 1130d0aa3427Sdan ** WAL_WRITE_LOCK byte, and may have also locked the WAL_CKPT_LOCK byte. 1131d0aa3427Sdan ** If successful, the same bytes that are locked here are unlocked before 1132d0aa3427Sdan ** this function returns. 1133d0aa3427Sdan */ 1134d0aa3427Sdan assert( pWal->ckptLock==1 || pWal->ckptLock==0 ); 1135d0aa3427Sdan assert( WAL_ALL_BUT_WRITE==WAL_WRITE_LOCK+1 ); 1136d0aa3427Sdan assert( WAL_CKPT_LOCK==WAL_ALL_BUT_WRITE ); 1137d0aa3427Sdan assert( pWal->writeLock ); 1138d0aa3427Sdan iLock = WAL_ALL_BUT_WRITE + pWal->ckptLock; 1139dea5ce36Sdan rc = walLockExclusive(pWal, iLock, WAL_READ_LOCK(0)-iLock); 1140dea5ce36Sdan if( rc==SQLITE_OK ){ 1141dea5ce36Sdan rc = walLockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1); 1142dea5ce36Sdan if( rc!=SQLITE_OK ){ 1143dea5ce36Sdan walUnlockExclusive(pWal, iLock, WAL_READ_LOCK(0)-iLock); 1144dea5ce36Sdan } 1145dea5ce36Sdan } 114673b64e4dSdrh if( rc ){ 114773b64e4dSdrh return rc; 114873b64e4dSdrh } 1149dea5ce36Sdan 1150c74c3334Sdrh WALTRACE(("WAL%p: recovery begin...\n", pWal)); 115173b64e4dSdrh 115271d89919Sdan memset(&pWal->hdr, 0, sizeof(WalIndexHdr)); 1153c438efd6Sdrh 1154d9e5c4f6Sdrh rc = sqlite3OsFileSize(pWal->pWalFd, &nSize); 1155c438efd6Sdrh if( rc!=SQLITE_OK ){ 115673b64e4dSdrh goto recovery_error; 1157c438efd6Sdrh } 1158c438efd6Sdrh 1159b8fd6c2fSdan if( nSize>WAL_HDRSIZE ){ 1160b8fd6c2fSdan u8 aBuf[WAL_HDRSIZE]; /* Buffer to load WAL header into */ 1161c438efd6Sdrh u8 *aFrame = 0; /* Malloc'd buffer to load entire frame */ 1162584c754dSdrh int szFrame; /* Number of bytes in buffer aFrame[] */ 1163c438efd6Sdrh u8 *aData; /* Pointer to data part of aFrame buffer */ 1164c438efd6Sdrh int iFrame; /* Index of last frame read */ 1165c438efd6Sdrh i64 iOffset; /* Next offset to read from log file */ 11666e81096fSdrh int szPage; /* Page size according to the log */ 1167b8fd6c2fSdan u32 magic; /* Magic value read from WAL header */ 116810f5a50eSdan u32 version; /* Magic value read from WAL header */ 1169fe6163d7Sdrh int isValid; /* True if this frame is valid */ 1170c438efd6Sdrh 1171b8fd6c2fSdan /* Read in the WAL header. */ 1172d9e5c4f6Sdrh rc = sqlite3OsRead(pWal->pWalFd, aBuf, WAL_HDRSIZE, 0); 1173c438efd6Sdrh if( rc!=SQLITE_OK ){ 117473b64e4dSdrh goto recovery_error; 1175c438efd6Sdrh } 1176c438efd6Sdrh 1177c438efd6Sdrh /* If the database page size is not a power of two, or is greater than 1178b8fd6c2fSdan ** SQLITE_MAX_PAGE_SIZE, conclude that the WAL file contains no valid 1179b8fd6c2fSdan ** data. Similarly, if the 'magic' value is invalid, ignore the whole 1180b8fd6c2fSdan ** WAL file. 1181c438efd6Sdrh */ 1182b8fd6c2fSdan magic = sqlite3Get4byte(&aBuf[0]); 118323ea97b6Sdrh szPage = sqlite3Get4byte(&aBuf[8]); 1184b8fd6c2fSdan if( (magic&0xFFFFFFFE)!=WAL_MAGIC 1185b8fd6c2fSdan || szPage&(szPage-1) 1186b8fd6c2fSdan || szPage>SQLITE_MAX_PAGE_SIZE 1187b8fd6c2fSdan || szPage<512 1188b8fd6c2fSdan ){ 1189c438efd6Sdrh goto finished; 1190c438efd6Sdrh } 11915eba1f60Sshaneh pWal->hdr.bigEndCksum = (u8)(magic&0x00000001); 1192b2eced5dSdrh pWal->szPage = szPage; 119323ea97b6Sdrh pWal->nCkpt = sqlite3Get4byte(&aBuf[12]); 11947e263728Sdrh memcpy(&pWal->hdr.aSalt, &aBuf[16], 8); 1195cd28508eSdrh 1196cd28508eSdrh /* Verify that the WAL header checksum is correct */ 119771d89919Sdan walChecksumBytes(pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN, 119810f5a50eSdan aBuf, WAL_HDRSIZE-2*4, 0, pWal->hdr.aFrameCksum 119971d89919Sdan ); 120010f5a50eSdan if( pWal->hdr.aFrameCksum[0]!=sqlite3Get4byte(&aBuf[24]) 120110f5a50eSdan || pWal->hdr.aFrameCksum[1]!=sqlite3Get4byte(&aBuf[28]) 120210f5a50eSdan ){ 120310f5a50eSdan goto finished; 120410f5a50eSdan } 120510f5a50eSdan 1206cd28508eSdrh /* Verify that the version number on the WAL format is one that 1207cd28508eSdrh ** are able to understand */ 120810f5a50eSdan version = sqlite3Get4byte(&aBuf[4]); 120910f5a50eSdan if( version!=WAL_MAX_VERSION ){ 121010f5a50eSdan rc = SQLITE_CANTOPEN_BKPT; 121110f5a50eSdan goto finished; 121210f5a50eSdan } 121310f5a50eSdan 1214c438efd6Sdrh /* Malloc a buffer to read frames into. */ 1215584c754dSdrh szFrame = szPage + WAL_FRAME_HDRSIZE; 1216f3cdcdccSdrh aFrame = (u8 *)sqlite3_malloc64(szFrame); 1217c438efd6Sdrh if( !aFrame ){ 1218fad3039cSmistachkin rc = SQLITE_NOMEM_BKPT; 121973b64e4dSdrh goto recovery_error; 1220c438efd6Sdrh } 12217ed91f23Sdrh aData = &aFrame[WAL_FRAME_HDRSIZE]; 1222c438efd6Sdrh 1223c438efd6Sdrh /* Read all frames from the log file. */ 1224c438efd6Sdrh iFrame = 0; 1225584c754dSdrh for(iOffset=WAL_HDRSIZE; (iOffset+szFrame)<=nSize; iOffset+=szFrame){ 1226c438efd6Sdrh u32 pgno; /* Database page number for frame */ 1227c438efd6Sdrh u32 nTruncate; /* dbsize field from frame header */ 1228c438efd6Sdrh 1229c438efd6Sdrh /* Read and decode the next log frame. */ 1230fe6163d7Sdrh iFrame++; 1231584c754dSdrh rc = sqlite3OsRead(pWal->pWalFd, aFrame, szFrame, iOffset); 1232c438efd6Sdrh if( rc!=SQLITE_OK ) break; 12337e263728Sdrh isValid = walDecodeFrame(pWal, &pgno, &nTruncate, aData, aFrame); 1234f694aa64Sdrh if( !isValid ) break; 1235fe6163d7Sdrh rc = walIndexAppend(pWal, iFrame, pgno); 1236c7991bdfSdan if( rc!=SQLITE_OK ) break; 1237c438efd6Sdrh 1238c438efd6Sdrh /* If nTruncate is non-zero, this is a commit record. */ 1239c438efd6Sdrh if( nTruncate ){ 124071d89919Sdan pWal->hdr.mxFrame = iFrame; 124171d89919Sdan pWal->hdr.nPage = nTruncate; 12421df2db7fSshaneh pWal->hdr.szPage = (u16)((szPage&0xff00) | (szPage>>16)); 12439b78f791Sdrh testcase( szPage<=32768 ); 12449b78f791Sdrh testcase( szPage>=65536 ); 124571d89919Sdan aFrameCksum[0] = pWal->hdr.aFrameCksum[0]; 124671d89919Sdan aFrameCksum[1] = pWal->hdr.aFrameCksum[1]; 1247c438efd6Sdrh } 1248c438efd6Sdrh } 1249c438efd6Sdrh 1250c438efd6Sdrh sqlite3_free(aFrame); 1251c438efd6Sdrh } 1252c438efd6Sdrh 1253c438efd6Sdrh finished: 1254576bc329Sdan if( rc==SQLITE_OK ){ 1255db7f647eSdrh volatile WalCkptInfo *pInfo; 1256db7f647eSdrh int i; 125771d89919Sdan pWal->hdr.aFrameCksum[0] = aFrameCksum[0]; 125871d89919Sdan pWal->hdr.aFrameCksum[1] = aFrameCksum[1]; 12597e263728Sdrh walIndexWriteHdr(pWal); 12603dee6da9Sdan 1261db7f647eSdrh /* Reset the checkpoint-header. This is safe because this thread is 12623dee6da9Sdan ** currently holding locks that exclude all other readers, writers and 12633dee6da9Sdan ** checkpointers. 12643dee6da9Sdan */ 1265db7f647eSdrh pInfo = walCkptInfo(pWal); 1266db7f647eSdrh pInfo->nBackfill = 0; 12673bf83ccdSdan pInfo->nBackfillAttempted = pWal->hdr.mxFrame; 1268db7f647eSdrh pInfo->aReadMark[0] = 0; 1269db7f647eSdrh for(i=1; i<WAL_NREADER; i++) pInfo->aReadMark[i] = READMARK_NOT_USED; 12705373b76bSdan if( pWal->hdr.mxFrame ) pInfo->aReadMark[1] = pWal->hdr.mxFrame; 1271eb8763d7Sdan 1272eb8763d7Sdan /* If more than one frame was recovered from the log file, report an 1273eb8763d7Sdan ** event via sqlite3_log(). This is to help with identifying performance 1274eb8763d7Sdan ** problems caused by applications routinely shutting down without 1275eb8763d7Sdan ** checkpointing the log file. 1276eb8763d7Sdan */ 1277eb8763d7Sdan if( pWal->hdr.nPage ){ 1278d040e764Sdrh sqlite3_log(SQLITE_NOTICE_RECOVER_WAL, 1279d040e764Sdrh "recovered %d frames from WAL file %s", 12800943f0bdSdan pWal->hdr.mxFrame, pWal->zWalName 1281eb8763d7Sdan ); 1282eb8763d7Sdan } 1283576bc329Sdan } 128473b64e4dSdrh 128573b64e4dSdrh recovery_error: 1286c74c3334Sdrh WALTRACE(("WAL%p: recovery %s\n", pWal, rc ? "failed" : "ok")); 1287dea5ce36Sdan walUnlockExclusive(pWal, iLock, WAL_READ_LOCK(0)-iLock); 1288dea5ce36Sdan walUnlockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1); 1289c438efd6Sdrh return rc; 1290c438efd6Sdrh } 1291c438efd6Sdrh 1292c438efd6Sdrh /* 12931018e90bSdan ** Close an open wal-index. 1294a8e654ebSdrh */ 12951018e90bSdan static void walIndexClose(Wal *pWal, int isDelete){ 129685bc6df2Sdrh if( pWal->exclusiveMode==WAL_HEAPMEMORY_MODE || pWal->bShmUnreliable ){ 12978c408004Sdan int i; 12988c408004Sdan for(i=0; i<pWal->nWiData; i++){ 12998c408004Sdan sqlite3_free((void *)pWal->apWiData[i]); 13008c408004Sdan pWal->apWiData[i] = 0; 13018c408004Sdan } 130211caf4f4Sdan } 130311caf4f4Sdan if( pWal->exclusiveMode!=WAL_HEAPMEMORY_MODE ){ 1304e11fedc5Sdrh sqlite3OsShmUnmap(pWal->pDbFd, isDelete); 1305a8e654ebSdrh } 13068c408004Sdan } 1307a8e654ebSdrh 1308a8e654ebSdrh /* 13093e875ef3Sdan ** Open a connection to the WAL file zWalName. The database file must 13103e875ef3Sdan ** already be opened on connection pDbFd. The buffer that zWalName points 13113e875ef3Sdan ** to must remain valid for the lifetime of the returned Wal* handle. 1312c438efd6Sdrh ** 1313c438efd6Sdrh ** A SHARED lock should be held on the database file when this function 1314c438efd6Sdrh ** is called. The purpose of this SHARED lock is to prevent any other 1315181e091fSdrh ** client from unlinking the WAL or wal-index file. If another process 1316c438efd6Sdrh ** were to do this just after this client opened one of these files, the 1317c438efd6Sdrh ** system would be badly broken. 1318ef378025Sdan ** 1319ef378025Sdan ** If the log file is successfully opened, SQLITE_OK is returned and 1320ef378025Sdan ** *ppWal is set to point to a new WAL handle. If an error occurs, 1321ef378025Sdan ** an SQLite error code is returned and *ppWal is left unmodified. 1322c438efd6Sdrh */ 1323c438efd6Sdrh int sqlite3WalOpen( 13247ed91f23Sdrh sqlite3_vfs *pVfs, /* vfs module to open wal and wal-index */ 1325d9e5c4f6Sdrh sqlite3_file *pDbFd, /* The open database file */ 13263e875ef3Sdan const char *zWalName, /* Name of the WAL file */ 13278c408004Sdan int bNoShm, /* True to run in heap-memory mode */ 132885a83755Sdrh i64 mxWalSize, /* Truncate WAL to this size on reset */ 13297ed91f23Sdrh Wal **ppWal /* OUT: Allocated Wal handle */ 1330c438efd6Sdrh ){ 1331ef378025Sdan int rc; /* Return Code */ 13327ed91f23Sdrh Wal *pRet; /* Object to allocate and return */ 1333c438efd6Sdrh int flags; /* Flags passed to OsOpen() */ 1334c438efd6Sdrh 13353e875ef3Sdan assert( zWalName && zWalName[0] ); 1336d9e5c4f6Sdrh assert( pDbFd ); 1337c438efd6Sdrh 13381b78eaf0Sdrh /* In the amalgamation, the os_unix.c and os_win.c source files come before 13391b78eaf0Sdrh ** this source file. Verify that the #defines of the locking byte offsets 13401b78eaf0Sdrh ** in os_unix.c and os_win.c agree with the WALINDEX_LOCK_OFFSET value. 1341998147ecSdrh ** For that matter, if the lock offset ever changes from its initial design 1342998147ecSdrh ** value of 120, we need to know that so there is an assert() to check it. 13431b78eaf0Sdrh */ 1344998147ecSdrh assert( 120==WALINDEX_LOCK_OFFSET ); 1345998147ecSdrh assert( 136==WALINDEX_HDR_SIZE ); 13461b78eaf0Sdrh #ifdef WIN_SHM_BASE 13471b78eaf0Sdrh assert( WIN_SHM_BASE==WALINDEX_LOCK_OFFSET ); 13481b78eaf0Sdrh #endif 13491b78eaf0Sdrh #ifdef UNIX_SHM_BASE 13501b78eaf0Sdrh assert( UNIX_SHM_BASE==WALINDEX_LOCK_OFFSET ); 13511b78eaf0Sdrh #endif 13521b78eaf0Sdrh 13531b78eaf0Sdrh 13547ed91f23Sdrh /* Allocate an instance of struct Wal to return. */ 13557ed91f23Sdrh *ppWal = 0; 13563e875ef3Sdan pRet = (Wal*)sqlite3MallocZero(sizeof(Wal) + pVfs->szOsFile); 135776ed3bc0Sdan if( !pRet ){ 1358fad3039cSmistachkin return SQLITE_NOMEM_BKPT; 135976ed3bc0Sdan } 136076ed3bc0Sdan 1361c438efd6Sdrh pRet->pVfs = pVfs; 1362d9e5c4f6Sdrh pRet->pWalFd = (sqlite3_file *)&pRet[1]; 1363d9e5c4f6Sdrh pRet->pDbFd = pDbFd; 136473b64e4dSdrh pRet->readLock = -1; 136585a83755Sdrh pRet->mxWalSize = mxWalSize; 13663e875ef3Sdan pRet->zWalName = zWalName; 1367d992b150Sdrh pRet->syncHeader = 1; 1368374f4a04Sdrh pRet->padToSectorBoundary = 1; 13698c408004Sdan pRet->exclusiveMode = (bNoShm ? WAL_HEAPMEMORY_MODE: WAL_NORMAL_MODE); 1370c438efd6Sdrh 13717ed91f23Sdrh /* Open file handle on the write-ahead log file. */ 1372ddb0ac4bSdan flags = (SQLITE_OPEN_READWRITE|SQLITE_OPEN_CREATE|SQLITE_OPEN_WAL); 13733e875ef3Sdan rc = sqlite3OsOpen(pVfs, zWalName, pRet->pWalFd, flags, &flags); 137450833e32Sdan if( rc==SQLITE_OK && flags&SQLITE_OPEN_READONLY ){ 137566dfec8bSdrh pRet->readOnly = WAL_RDONLY; 137650833e32Sdan } 1377c438efd6Sdrh 1378c438efd6Sdrh if( rc!=SQLITE_OK ){ 13791018e90bSdan walIndexClose(pRet, 0); 1380d9e5c4f6Sdrh sqlite3OsClose(pRet->pWalFd); 1381c438efd6Sdrh sqlite3_free(pRet); 1382ef378025Sdan }else{ 1383dd973548Sdan int iDC = sqlite3OsDeviceCharacteristics(pDbFd); 1384d992b150Sdrh if( iDC & SQLITE_IOCAP_SEQUENTIAL ){ pRet->syncHeader = 0; } 1385cb15f35fSdrh if( iDC & SQLITE_IOCAP_POWERSAFE_OVERWRITE ){ 1386cb15f35fSdrh pRet->padToSectorBoundary = 0; 1387cb15f35fSdrh } 13887ed91f23Sdrh *ppWal = pRet; 1389c74c3334Sdrh WALTRACE(("WAL%d: opened\n", pRet)); 1390ef378025Sdan } 1391c438efd6Sdrh return rc; 1392c438efd6Sdrh } 1393c438efd6Sdrh 1394a2a42013Sdrh /* 139585a83755Sdrh ** Change the size to which the WAL file is trucated on each reset. 139685a83755Sdrh */ 139785a83755Sdrh void sqlite3WalLimit(Wal *pWal, i64 iLimit){ 139885a83755Sdrh if( pWal ) pWal->mxWalSize = iLimit; 139985a83755Sdrh } 140085a83755Sdrh 140185a83755Sdrh /* 1402a2a42013Sdrh ** Find the smallest page number out of all pages held in the WAL that 1403a2a42013Sdrh ** has not been returned by any prior invocation of this method on the 1404a2a42013Sdrh ** same WalIterator object. Write into *piFrame the frame index where 1405a2a42013Sdrh ** that page was last written into the WAL. Write into *piPage the page 1406a2a42013Sdrh ** number. 1407a2a42013Sdrh ** 1408a2a42013Sdrh ** Return 0 on success. If there are no pages in the WAL with a page 1409a2a42013Sdrh ** number larger than *piPage, then return 1. 1410a2a42013Sdrh */ 14117ed91f23Sdrh static int walIteratorNext( 14127ed91f23Sdrh WalIterator *p, /* Iterator */ 1413a2a42013Sdrh u32 *piPage, /* OUT: The page number of the next page */ 1414a2a42013Sdrh u32 *piFrame /* OUT: Wal frame index of next page */ 1415c438efd6Sdrh ){ 1416a2a42013Sdrh u32 iMin; /* Result pgno must be greater than iMin */ 1417a2a42013Sdrh u32 iRet = 0xFFFFFFFF; /* 0xffffffff is never a valid page number */ 1418a2a42013Sdrh int i; /* For looping through segments */ 1419c438efd6Sdrh 1420a2a42013Sdrh iMin = p->iPrior; 1421a2a42013Sdrh assert( iMin<0xffffffff ); 1422c438efd6Sdrh for(i=p->nSegment-1; i>=0; i--){ 14237ed91f23Sdrh struct WalSegment *pSegment = &p->aSegment[i]; 142413a3cb82Sdan while( pSegment->iNext<pSegment->nEntry ){ 1425a2a42013Sdrh u32 iPg = pSegment->aPgno[pSegment->aIndex[pSegment->iNext]]; 1426c438efd6Sdrh if( iPg>iMin ){ 1427c438efd6Sdrh if( iPg<iRet ){ 1428c438efd6Sdrh iRet = iPg; 142913a3cb82Sdan *piFrame = pSegment->iZero + pSegment->aIndex[pSegment->iNext]; 1430c438efd6Sdrh } 1431c438efd6Sdrh break; 1432c438efd6Sdrh } 1433c438efd6Sdrh pSegment->iNext++; 1434c438efd6Sdrh } 1435c438efd6Sdrh } 1436c438efd6Sdrh 1437a2a42013Sdrh *piPage = p->iPrior = iRet; 1438c438efd6Sdrh return (iRet==0xFFFFFFFF); 1439c438efd6Sdrh } 1440c438efd6Sdrh 1441f544b4c4Sdan /* 1442f544b4c4Sdan ** This function merges two sorted lists into a single sorted list. 1443d9c9b78eSdrh ** 1444d9c9b78eSdrh ** aLeft[] and aRight[] are arrays of indices. The sort key is 1445d9c9b78eSdrh ** aContent[aLeft[]] and aContent[aRight[]]. Upon entry, the following 1446d9c9b78eSdrh ** is guaranteed for all J<K: 1447d9c9b78eSdrh ** 1448d9c9b78eSdrh ** aContent[aLeft[J]] < aContent[aLeft[K]] 1449d9c9b78eSdrh ** aContent[aRight[J]] < aContent[aRight[K]] 1450d9c9b78eSdrh ** 1451d9c9b78eSdrh ** This routine overwrites aRight[] with a new (probably longer) sequence 1452d9c9b78eSdrh ** of indices such that the aRight[] contains every index that appears in 1453d9c9b78eSdrh ** either aLeft[] or the old aRight[] and such that the second condition 1454d9c9b78eSdrh ** above is still met. 1455d9c9b78eSdrh ** 1456d9c9b78eSdrh ** The aContent[aLeft[X]] values will be unique for all X. And the 1457d9c9b78eSdrh ** aContent[aRight[X]] values will be unique too. But there might be 1458d9c9b78eSdrh ** one or more combinations of X and Y such that 1459d9c9b78eSdrh ** 1460d9c9b78eSdrh ** aLeft[X]!=aRight[Y] && aContent[aLeft[X]] == aContent[aRight[Y]] 1461d9c9b78eSdrh ** 1462d9c9b78eSdrh ** When that happens, omit the aLeft[X] and use the aRight[Y] index. 1463f544b4c4Sdan */ 1464f544b4c4Sdan static void walMerge( 1465d9c9b78eSdrh const u32 *aContent, /* Pages in wal - keys for the sort */ 1466f544b4c4Sdan ht_slot *aLeft, /* IN: Left hand input list */ 1467f544b4c4Sdan int nLeft, /* IN: Elements in array *paLeft */ 1468f544b4c4Sdan ht_slot **paRight, /* IN/OUT: Right hand input list */ 1469f544b4c4Sdan int *pnRight, /* IN/OUT: Elements in *paRight */ 1470f544b4c4Sdan ht_slot *aTmp /* Temporary buffer */ 1471a2a42013Sdrh ){ 1472a2a42013Sdrh int iLeft = 0; /* Current index in aLeft */ 1473f544b4c4Sdan int iRight = 0; /* Current index in aRight */ 1474a2a42013Sdrh int iOut = 0; /* Current index in output buffer */ 1475f544b4c4Sdan int nRight = *pnRight; 1476f544b4c4Sdan ht_slot *aRight = *paRight; 1477a2a42013Sdrh 1478f544b4c4Sdan assert( nLeft>0 && nRight>0 ); 1479a2a42013Sdrh while( iRight<nRight || iLeft<nLeft ){ 1480067f3165Sdan ht_slot logpage; 1481a2a42013Sdrh Pgno dbpage; 1482a2a42013Sdrh 1483a2a42013Sdrh if( (iLeft<nLeft) 1484a2a42013Sdrh && (iRight>=nRight || aContent[aLeft[iLeft]]<aContent[aRight[iRight]]) 1485a2a42013Sdrh ){ 1486a2a42013Sdrh logpage = aLeft[iLeft++]; 1487a2a42013Sdrh }else{ 1488a2a42013Sdrh logpage = aRight[iRight++]; 1489a2a42013Sdrh } 1490a2a42013Sdrh dbpage = aContent[logpage]; 1491a2a42013Sdrh 1492f544b4c4Sdan aTmp[iOut++] = logpage; 1493a2a42013Sdrh if( iLeft<nLeft && aContent[aLeft[iLeft]]==dbpage ) iLeft++; 1494a2a42013Sdrh 1495a2a42013Sdrh assert( iLeft>=nLeft || aContent[aLeft[iLeft]]>dbpage ); 1496a2a42013Sdrh assert( iRight>=nRight || aContent[aRight[iRight]]>dbpage ); 1497a2a42013Sdrh } 1498f544b4c4Sdan 1499f544b4c4Sdan *paRight = aLeft; 1500f544b4c4Sdan *pnRight = iOut; 1501f544b4c4Sdan memcpy(aLeft, aTmp, sizeof(aTmp[0])*iOut); 1502a2a42013Sdrh } 1503a2a42013Sdrh 1504f544b4c4Sdan /* 1505d9c9b78eSdrh ** Sort the elements in list aList using aContent[] as the sort key. 1506d9c9b78eSdrh ** Remove elements with duplicate keys, preferring to keep the 1507d9c9b78eSdrh ** larger aList[] values. 1508d9c9b78eSdrh ** 1509d9c9b78eSdrh ** The aList[] entries are indices into aContent[]. The values in 1510d9c9b78eSdrh ** aList[] are to be sorted so that for all J<K: 1511d9c9b78eSdrh ** 1512d9c9b78eSdrh ** aContent[aList[J]] < aContent[aList[K]] 1513d9c9b78eSdrh ** 1514d9c9b78eSdrh ** For any X and Y such that 1515d9c9b78eSdrh ** 1516d9c9b78eSdrh ** aContent[aList[X]] == aContent[aList[Y]] 1517d9c9b78eSdrh ** 1518d9c9b78eSdrh ** Keep the larger of the two values aList[X] and aList[Y] and discard 1519d9c9b78eSdrh ** the smaller. 1520f544b4c4Sdan */ 1521f544b4c4Sdan static void walMergesort( 1522d9c9b78eSdrh const u32 *aContent, /* Pages in wal */ 1523f544b4c4Sdan ht_slot *aBuffer, /* Buffer of at least *pnList items to use */ 1524f544b4c4Sdan ht_slot *aList, /* IN/OUT: List to sort */ 1525f544b4c4Sdan int *pnList /* IN/OUT: Number of elements in aList[] */ 1526f544b4c4Sdan ){ 1527f544b4c4Sdan struct Sublist { 1528f544b4c4Sdan int nList; /* Number of elements in aList */ 1529f544b4c4Sdan ht_slot *aList; /* Pointer to sub-list content */ 1530f544b4c4Sdan }; 1531f544b4c4Sdan 1532f544b4c4Sdan const int nList = *pnList; /* Size of input list */ 1533ff82894fSdrh int nMerge = 0; /* Number of elements in list aMerge */ 1534ff82894fSdrh ht_slot *aMerge = 0; /* List to be merged */ 1535f544b4c4Sdan int iList; /* Index into input list */ 1536f4fa0b80Sdrh u32 iSub = 0; /* Index into aSub array */ 1537f544b4c4Sdan struct Sublist aSub[13]; /* Array of sub-lists */ 1538f544b4c4Sdan 1539f544b4c4Sdan memset(aSub, 0, sizeof(aSub)); 1540f544b4c4Sdan assert( nList<=HASHTABLE_NPAGE && nList>0 ); 1541f544b4c4Sdan assert( HASHTABLE_NPAGE==(1<<(ArraySize(aSub)-1)) ); 1542f544b4c4Sdan 1543f544b4c4Sdan for(iList=0; iList<nList; iList++){ 1544f544b4c4Sdan nMerge = 1; 1545f544b4c4Sdan aMerge = &aList[iList]; 1546f544b4c4Sdan for(iSub=0; iList & (1<<iSub); iSub++){ 1547f4fa0b80Sdrh struct Sublist *p; 1548f4fa0b80Sdrh assert( iSub<ArraySize(aSub) ); 1549f4fa0b80Sdrh p = &aSub[iSub]; 1550f544b4c4Sdan assert( p->aList && p->nList<=(1<<iSub) ); 1551bdf1e243Sdan assert( p->aList==&aList[iList&~((2<<iSub)-1)] ); 1552f544b4c4Sdan walMerge(aContent, p->aList, p->nList, &aMerge, &nMerge, aBuffer); 1553f544b4c4Sdan } 1554f544b4c4Sdan aSub[iSub].aList = aMerge; 1555f544b4c4Sdan aSub[iSub].nList = nMerge; 1556f544b4c4Sdan } 1557f544b4c4Sdan 1558f544b4c4Sdan for(iSub++; iSub<ArraySize(aSub); iSub++){ 1559f544b4c4Sdan if( nList & (1<<iSub) ){ 1560f4fa0b80Sdrh struct Sublist *p; 1561f4fa0b80Sdrh assert( iSub<ArraySize(aSub) ); 1562f4fa0b80Sdrh p = &aSub[iSub]; 1563bdf1e243Sdan assert( p->nList<=(1<<iSub) ); 1564bdf1e243Sdan assert( p->aList==&aList[nList&~((2<<iSub)-1)] ); 1565f544b4c4Sdan walMerge(aContent, p->aList, p->nList, &aMerge, &nMerge, aBuffer); 1566f544b4c4Sdan } 1567f544b4c4Sdan } 1568f544b4c4Sdan assert( aMerge==aList ); 1569f544b4c4Sdan *pnList = nMerge; 1570f544b4c4Sdan 1571a2a42013Sdrh #ifdef SQLITE_DEBUG 1572a2a42013Sdrh { 1573a2a42013Sdrh int i; 1574a2a42013Sdrh for(i=1; i<*pnList; i++){ 1575a2a42013Sdrh assert( aContent[aList[i]] > aContent[aList[i-1]] ); 1576a2a42013Sdrh } 1577a2a42013Sdrh } 1578a2a42013Sdrh #endif 1579a2a42013Sdrh } 1580a2a42013Sdrh 1581a2a42013Sdrh /* 15825d656852Sdan ** Free an iterator allocated by walIteratorInit(). 15835d656852Sdan */ 15845d656852Sdan static void walIteratorFree(WalIterator *p){ 1585cbd55b03Sdrh sqlite3_free(p); 15865d656852Sdan } 15875d656852Sdan 15885d656852Sdan /* 1589bdf1e243Sdan ** Construct a WalInterator object that can be used to loop over all 1590302ce475Sdan ** pages in the WAL following frame nBackfill in ascending order. Frames 1591302ce475Sdan ** nBackfill or earlier may be included - excluding them is an optimization 1592302ce475Sdan ** only. The caller must hold the checkpoint lock. 1593a2a42013Sdrh ** 1594a2a42013Sdrh ** On success, make *pp point to the newly allocated WalInterator object 1595bdf1e243Sdan ** return SQLITE_OK. Otherwise, return an error code. If this routine 1596bdf1e243Sdan ** returns an error, the value of *pp is undefined. 1597a2a42013Sdrh ** 1598a2a42013Sdrh ** The calling routine should invoke walIteratorFree() to destroy the 1599bdf1e243Sdan ** WalIterator object when it has finished with it. 1600a2a42013Sdrh */ 1601302ce475Sdan static int walIteratorInit(Wal *pWal, u32 nBackfill, WalIterator **pp){ 16027ed91f23Sdrh WalIterator *p; /* Return value */ 1603c438efd6Sdrh int nSegment; /* Number of segments to merge */ 1604c438efd6Sdrh u32 iLast; /* Last frame in log */ 1605c438efd6Sdrh int nByte; /* Number of bytes to allocate */ 1606c438efd6Sdrh int i; /* Iterator variable */ 1607067f3165Sdan ht_slot *aTmp; /* Temp space used by merge-sort */ 1608bdf1e243Sdan int rc = SQLITE_OK; /* Return Code */ 1609a2a42013Sdrh 1610bdf1e243Sdan /* This routine only runs while holding the checkpoint lock. And 1611bdf1e243Sdan ** it only runs if there is actually content in the log (mxFrame>0). 1612a2a42013Sdrh */ 1613bdf1e243Sdan assert( pWal->ckptLock && pWal->hdr.mxFrame>0 ); 161413a3cb82Sdan iLast = pWal->hdr.mxFrame; 1615a2a42013Sdrh 1616bdf1e243Sdan /* Allocate space for the WalIterator object. */ 161713a3cb82Sdan nSegment = walFramePage(iLast) + 1; 161813a3cb82Sdan nByte = sizeof(WalIterator) 161952d6fc0eSdan + (nSegment-1)*sizeof(struct WalSegment) 162052d6fc0eSdan + iLast*sizeof(ht_slot); 1621f3cdcdccSdrh p = (WalIterator *)sqlite3_malloc64(nByte); 16228f6097c2Sdan if( !p ){ 1623fad3039cSmistachkin return SQLITE_NOMEM_BKPT; 1624a2a42013Sdrh } 1625c438efd6Sdrh memset(p, 0, nByte); 1626a2a42013Sdrh p->nSegment = nSegment; 1627bdf1e243Sdan 1628bdf1e243Sdan /* Allocate temporary space used by the merge-sort routine. This block 1629bdf1e243Sdan ** of memory will be freed before this function returns. 1630bdf1e243Sdan */ 1631f3cdcdccSdrh aTmp = (ht_slot *)sqlite3_malloc64( 163252d6fc0eSdan sizeof(ht_slot) * (iLast>HASHTABLE_NPAGE?HASHTABLE_NPAGE:iLast) 163352d6fc0eSdan ); 1634bdf1e243Sdan if( !aTmp ){ 1635fad3039cSmistachkin rc = SQLITE_NOMEM_BKPT; 1636bdf1e243Sdan } 1637bdf1e243Sdan 1638302ce475Sdan for(i=walFramePage(nBackfill+1); rc==SQLITE_OK && i<nSegment; i++){ 16394ece2f26Sdrh WalHashLoc sLoc; 164013a3cb82Sdan 16414ece2f26Sdrh rc = walHashGet(pWal, i, &sLoc); 1642bdf1e243Sdan if( rc==SQLITE_OK ){ 164352d6fc0eSdan int j; /* Counter variable */ 164452d6fc0eSdan int nEntry; /* Number of entries in this segment */ 164552d6fc0eSdan ht_slot *aIndex; /* Sorted index for this segment */ 164652d6fc0eSdan 16474ece2f26Sdrh sLoc.aPgno++; 1648519426aaSdrh if( (i+1)==nSegment ){ 16494ece2f26Sdrh nEntry = (int)(iLast - sLoc.iZero); 1650519426aaSdrh }else{ 16514ece2f26Sdrh nEntry = (int)((u32*)sLoc.aHash - (u32*)sLoc.aPgno); 1652519426aaSdrh } 16534ece2f26Sdrh aIndex = &((ht_slot *)&p->aSegment[p->nSegment])[sLoc.iZero]; 16544ece2f26Sdrh sLoc.iZero++; 165513a3cb82Sdan 165613a3cb82Sdan for(j=0; j<nEntry; j++){ 16575eba1f60Sshaneh aIndex[j] = (ht_slot)j; 1658c438efd6Sdrh } 16594ece2f26Sdrh walMergesort((u32 *)sLoc.aPgno, aTmp, aIndex, &nEntry); 16604ece2f26Sdrh p->aSegment[i].iZero = sLoc.iZero; 166113a3cb82Sdan p->aSegment[i].nEntry = nEntry; 1662bdf1e243Sdan p->aSegment[i].aIndex = aIndex; 16634ece2f26Sdrh p->aSegment[i].aPgno = (u32 *)sLoc.aPgno; 1664c438efd6Sdrh } 1665bdf1e243Sdan } 1666cbd55b03Sdrh sqlite3_free(aTmp); 1667c438efd6Sdrh 1668bdf1e243Sdan if( rc!=SQLITE_OK ){ 1669bdf1e243Sdan walIteratorFree(p); 167049cc2f3bSdrh p = 0; 1671bdf1e243Sdan } 16728f6097c2Sdan *pp = p; 1673bdf1e243Sdan return rc; 1674c438efd6Sdrh } 1675c438efd6Sdrh 1676c438efd6Sdrh /* 1677a58f26f9Sdan ** Attempt to obtain the exclusive WAL lock defined by parameters lockIdx and 1678a58f26f9Sdan ** n. If the attempt fails and parameter xBusy is not NULL, then it is a 1679a58f26f9Sdan ** busy-handler function. Invoke it and retry the lock until either the 1680a58f26f9Sdan ** lock is successfully obtained or the busy-handler returns 0. 1681a58f26f9Sdan */ 1682a58f26f9Sdan static int walBusyLock( 1683a58f26f9Sdan Wal *pWal, /* WAL connection */ 1684a58f26f9Sdan int (*xBusy)(void*), /* Function to call when busy */ 1685a58f26f9Sdan void *pBusyArg, /* Context argument for xBusyHandler */ 1686a58f26f9Sdan int lockIdx, /* Offset of first byte to lock */ 1687a58f26f9Sdan int n /* Number of bytes to lock */ 1688a58f26f9Sdan ){ 1689a58f26f9Sdan int rc; 1690a58f26f9Sdan do { 1691ab372773Sdrh rc = walLockExclusive(pWal, lockIdx, n); 1692a58f26f9Sdan }while( xBusy && rc==SQLITE_BUSY && xBusy(pBusyArg) ); 1693a58f26f9Sdan return rc; 1694a58f26f9Sdan } 1695a58f26f9Sdan 1696a58f26f9Sdan /* 1697f2b8dd58Sdan ** The cache of the wal-index header must be valid to call this function. 1698f2b8dd58Sdan ** Return the page-size in bytes used by the database. 1699f2b8dd58Sdan */ 1700f2b8dd58Sdan static int walPagesize(Wal *pWal){ 1701f2b8dd58Sdan return (pWal->hdr.szPage&0xfe00) + ((pWal->hdr.szPage&0x0001)<<16); 1702f2b8dd58Sdan } 1703f2b8dd58Sdan 1704f2b8dd58Sdan /* 1705f26a1549Sdan ** The following is guaranteed when this function is called: 1706f26a1549Sdan ** 1707f26a1549Sdan ** a) the WRITER lock is held, 1708f26a1549Sdan ** b) the entire log file has been checkpointed, and 1709f26a1549Sdan ** c) any existing readers are reading exclusively from the database 1710f26a1549Sdan ** file - there are no readers that may attempt to read a frame from 1711f26a1549Sdan ** the log file. 1712f26a1549Sdan ** 1713f26a1549Sdan ** This function updates the shared-memory structures so that the next 1714f26a1549Sdan ** client to write to the database (which may be this one) does so by 1715f26a1549Sdan ** writing frames into the start of the log file. 17160fe8c1b9Sdan ** 17170fe8c1b9Sdan ** The value of parameter salt1 is used as the aSalt[1] value in the 17180fe8c1b9Sdan ** new wal-index header. It should be passed a pseudo-random value (i.e. 17190fe8c1b9Sdan ** one obtained from sqlite3_randomness()). 1720f26a1549Sdan */ 17210fe8c1b9Sdan static void walRestartHdr(Wal *pWal, u32 salt1){ 1722f26a1549Sdan volatile WalCkptInfo *pInfo = walCkptInfo(pWal); 1723f26a1549Sdan int i; /* Loop counter */ 1724f26a1549Sdan u32 *aSalt = pWal->hdr.aSalt; /* Big-endian salt values */ 1725f26a1549Sdan pWal->nCkpt++; 1726f26a1549Sdan pWal->hdr.mxFrame = 0; 1727f26a1549Sdan sqlite3Put4byte((u8*)&aSalt[0], 1 + sqlite3Get4byte((u8*)&aSalt[0])); 17280fe8c1b9Sdan memcpy(&pWal->hdr.aSalt[1], &salt1, 4); 1729f26a1549Sdan walIndexWriteHdr(pWal); 1730f26a1549Sdan pInfo->nBackfill = 0; 1731998147ecSdrh pInfo->nBackfillAttempted = 0; 1732f26a1549Sdan pInfo->aReadMark[1] = 0; 1733f26a1549Sdan for(i=2; i<WAL_NREADER; i++) pInfo->aReadMark[i] = READMARK_NOT_USED; 1734f26a1549Sdan assert( pInfo->aReadMark[0]==0 ); 1735f26a1549Sdan } 1736f26a1549Sdan 1737f26a1549Sdan /* 173873b64e4dSdrh ** Copy as much content as we can from the WAL back into the database file 173973b64e4dSdrh ** in response to an sqlite3_wal_checkpoint() request or the equivalent. 174073b64e4dSdrh ** 174173b64e4dSdrh ** The amount of information copies from WAL to database might be limited 174273b64e4dSdrh ** by active readers. This routine will never overwrite a database page 174373b64e4dSdrh ** that a concurrent reader might be using. 174473b64e4dSdrh ** 174573b64e4dSdrh ** All I/O barrier operations (a.k.a fsyncs) occur in this routine when 174673b64e4dSdrh ** SQLite is in WAL-mode in synchronous=NORMAL. That means that if 174773b64e4dSdrh ** checkpoints are always run by a background thread or background 174873b64e4dSdrh ** process, foreground threads will never block on a lengthy fsync call. 174973b64e4dSdrh ** 175073b64e4dSdrh ** Fsync is called on the WAL before writing content out of the WAL and 175173b64e4dSdrh ** into the database. This ensures that if the new content is persistent 175273b64e4dSdrh ** in the WAL and can be recovered following a power-loss or hard reset. 175373b64e4dSdrh ** 175473b64e4dSdrh ** Fsync is also called on the database file if (and only if) the entire 175573b64e4dSdrh ** WAL content is copied into the database file. This second fsync makes 175673b64e4dSdrh ** it safe to delete the WAL since the new content will persist in the 175773b64e4dSdrh ** database file. 175873b64e4dSdrh ** 175973b64e4dSdrh ** This routine uses and updates the nBackfill field of the wal-index header. 176060ec914cSpeter.d.reid ** This is the only routine that will increase the value of nBackfill. 176173b64e4dSdrh ** (A WAL reset or recovery will revert nBackfill to zero, but not increase 176273b64e4dSdrh ** its value.) 176373b64e4dSdrh ** 176473b64e4dSdrh ** The caller must be holding sufficient locks to ensure that no other 176573b64e4dSdrh ** checkpoint is running (in any other thread or process) at the same 176673b64e4dSdrh ** time. 1767c438efd6Sdrh */ 17687ed91f23Sdrh static int walCheckpoint( 17697ed91f23Sdrh Wal *pWal, /* Wal connection */ 17707fb89906Sdan sqlite3 *db, /* Check for interrupts on this handle */ 1771cdc1f049Sdan int eMode, /* One of PASSIVE, FULL or RESTART */ 1772dd90d7eeSdrh int (*xBusy)(void*), /* Function to call when busy */ 1773a58f26f9Sdan void *pBusyArg, /* Context argument for xBusyHandler */ 1774c438efd6Sdrh int sync_flags, /* Flags for OsSync() (or 0) */ 17759c5e3680Sdan u8 *zBuf /* Temporary buffer to use */ 1776c438efd6Sdrh ){ 1777976b0033Sdan int rc = SQLITE_OK; /* Return code */ 1778b2eced5dSdrh int szPage; /* Database page-size */ 17797ed91f23Sdrh WalIterator *pIter = 0; /* Wal iterator context */ 1780c438efd6Sdrh u32 iDbpage = 0; /* Next database page to write */ 17817ed91f23Sdrh u32 iFrame = 0; /* Wal frame containing data for iDbpage */ 178273b64e4dSdrh u32 mxSafeFrame; /* Max frame that can be backfilled */ 1783502019c8Sdan u32 mxPage; /* Max database page to write */ 178473b64e4dSdrh int i; /* Loop counter */ 178573b64e4dSdrh volatile WalCkptInfo *pInfo; /* The checkpoint status information */ 1786c438efd6Sdrh 1787f2b8dd58Sdan szPage = walPagesize(pWal); 17889b78f791Sdrh testcase( szPage<=32768 ); 17899b78f791Sdrh testcase( szPage>=65536 ); 17907d208445Sdrh pInfo = walCkptInfo(pWal); 1791976b0033Sdan if( pInfo->nBackfill<pWal->hdr.mxFrame ){ 1792f544b4c4Sdan 1793dd90d7eeSdrh /* EVIDENCE-OF: R-62920-47450 The busy-handler callback is never invoked 1794dd90d7eeSdrh ** in the SQLITE_CHECKPOINT_PASSIVE mode. */ 1795dd90d7eeSdrh assert( eMode!=SQLITE_CHECKPOINT_PASSIVE || xBusy==0 ); 1796b6e099a9Sdan 179773b64e4dSdrh /* Compute in mxSafeFrame the index of the last frame of the WAL that is 179873b64e4dSdrh ** safe to write into the database. Frames beyond mxSafeFrame might 179973b64e4dSdrh ** overwrite database pages that are in use by active readers and thus 180073b64e4dSdrh ** cannot be backfilled from the WAL. 180173b64e4dSdrh */ 1802d54ff60bSdan mxSafeFrame = pWal->hdr.mxFrame; 1803502019c8Sdan mxPage = pWal->hdr.nPage; 180473b64e4dSdrh for(i=1; i<WAL_NREADER; i++){ 18051fe0af20Sdan /* Thread-sanitizer reports that the following is an unsafe read, 18061fe0af20Sdan ** as some other thread may be in the process of updating the value 18071fe0af20Sdan ** of the aReadMark[] slot. The assumption here is that if that is 18081fe0af20Sdan ** happening, the other client may only be increasing the value, 18091fe0af20Sdan ** not decreasing it. So assuming either that either the "old" or 18101fe0af20Sdan ** "new" version of the value is read, and not some arbitrary value 18111fe0af20Sdan ** that would never be written by a real client, things are still 18121fe0af20Sdan ** safe. */ 181373b64e4dSdrh u32 y = pInfo->aReadMark[i]; 1814f2b8dd58Sdan if( mxSafeFrame>y ){ 181583f42d1bSdan assert( y<=pWal->hdr.mxFrame ); 1816f2b8dd58Sdan rc = walBusyLock(pWal, xBusy, pBusyArg, WAL_READ_LOCK(i), 1); 181783f42d1bSdan if( rc==SQLITE_OK ){ 18185373b76bSdan pInfo->aReadMark[i] = (i==1 ? mxSafeFrame : READMARK_NOT_USED); 181973b64e4dSdrh walUnlockExclusive(pWal, WAL_READ_LOCK(i), 1); 18202d37e1cfSdrh }else if( rc==SQLITE_BUSY ){ 1821db7f647eSdrh mxSafeFrame = y; 1822f2b8dd58Sdan xBusy = 0; 18232d37e1cfSdrh }else{ 182483f42d1bSdan goto walcheckpoint_out; 182573b64e4dSdrh } 182673b64e4dSdrh } 182773b64e4dSdrh } 182873b64e4dSdrh 1829f0cb61d6Sdan /* Allocate the iterator */ 1830f0cb61d6Sdan if( pInfo->nBackfill<mxSafeFrame ){ 1831f0cb61d6Sdan rc = walIteratorInit(pWal, pInfo->nBackfill, &pIter); 1832f0cb61d6Sdan assert( rc==SQLITE_OK || pIter==0 ); 1833f0cb61d6Sdan } 1834f0cb61d6Sdan 1835f0cb61d6Sdan if( pIter 1836a58f26f9Sdan && (rc = walBusyLock(pWal, xBusy, pBusyArg, WAL_READ_LOCK(0),1))==SQLITE_OK 183773b64e4dSdrh ){ 1838502019c8Sdan i64 nSize; /* Current size of database file */ 183973b64e4dSdrh u32 nBackfill = pInfo->nBackfill; 184073b64e4dSdrh 18413bf83ccdSdan pInfo->nBackfillAttempted = mxSafeFrame; 18423bf83ccdSdan 184373b64e4dSdrh /* Sync the WAL to disk */ 1844daaae7b9Sdrh rc = sqlite3OsSync(pWal->pWalFd, CKPT_SYNC_FLAGS(sync_flags)); 1845c438efd6Sdrh 1846f23da966Sdan /* If the database may grow as a result of this checkpoint, hint 1847f23da966Sdan ** about the eventual size of the db file to the VFS layer. 1848f23da966Sdan */ 1849007820d6Sdan if( rc==SQLITE_OK ){ 1850007820d6Sdan i64 nReq = ((i64)mxPage * szPage); 1851f23da966Sdan rc = sqlite3OsFileSize(pWal->pDbFd, &nSize); 1852f23da966Sdan if( rc==SQLITE_OK && nSize<nReq ){ 1853f23da966Sdan sqlite3OsFileControlHint(pWal->pDbFd, SQLITE_FCNTL_SIZE_HINT, &nReq); 1854007820d6Sdan } 1855f23da966Sdan } 1856f23da966Sdan 1857502019c8Sdan 1858976b0033Sdan /* Iterate through the contents of the WAL, copying data to the db file */ 185973b64e4dSdrh while( rc==SQLITE_OK && 0==walIteratorNext(pIter, &iDbpage, &iFrame) ){ 18603e8e7ecbSdrh i64 iOffset; 186113a3cb82Sdan assert( walFramePgno(pWal, iFrame)==iDbpage ); 18627fb89906Sdan if( db->u1.isInterrupted ){ 18637fb89906Sdan rc = db->mallocFailed ? SQLITE_NOMEM_BKPT : SQLITE_INTERRUPT; 18647fb89906Sdan break; 18657fb89906Sdan } 1866976b0033Sdan if( iFrame<=nBackfill || iFrame>mxSafeFrame || iDbpage>mxPage ){ 1867976b0033Sdan continue; 1868976b0033Sdan } 18693e8e7ecbSdrh iOffset = walFrameOffset(iFrame, szPage) + WAL_FRAME_HDRSIZE; 187009b5dbc5Sdrh /* testcase( IS_BIG_INT(iOffset) ); // requires a 4GiB WAL file */ 18713e8e7ecbSdrh rc = sqlite3OsRead(pWal->pWalFd, zBuf, szPage, iOffset); 18723e8e7ecbSdrh if( rc!=SQLITE_OK ) break; 18733e8e7ecbSdrh iOffset = (iDbpage-1)*(i64)szPage; 18743e8e7ecbSdrh testcase( IS_BIG_INT(iOffset) ); 1875f23da966Sdan rc = sqlite3OsWrite(pWal->pDbFd, zBuf, szPage, iOffset); 18763e8e7ecbSdrh if( rc!=SQLITE_OK ) break; 1877c438efd6Sdrh } 1878c438efd6Sdrh 187973b64e4dSdrh /* If work was actually accomplished... */ 1880d764c7deSdan if( rc==SQLITE_OK ){ 18814280eb30Sdan if( mxSafeFrame==walIndexHdr(pWal)->mxFrame ){ 18823e8e7ecbSdrh i64 szDb = pWal->hdr.nPage*(i64)szPage; 18833e8e7ecbSdrh testcase( IS_BIG_INT(szDb) ); 18843e8e7ecbSdrh rc = sqlite3OsTruncate(pWal->pDbFd, szDb); 1885daaae7b9Sdrh if( rc==SQLITE_OK ){ 1886daaae7b9Sdrh rc = sqlite3OsSync(pWal->pDbFd, CKPT_SYNC_FLAGS(sync_flags)); 1887c438efd6Sdrh } 188873b64e4dSdrh } 1889d764c7deSdan if( rc==SQLITE_OK ){ 1890d764c7deSdan pInfo->nBackfill = mxSafeFrame; 1891d764c7deSdan } 189273b64e4dSdrh } 1893c438efd6Sdrh 189473b64e4dSdrh /* Release the reader lock held while backfilling */ 189573b64e4dSdrh walUnlockExclusive(pWal, WAL_READ_LOCK(0), 1); 1896a58f26f9Sdan } 1897a58f26f9Sdan 1898a58f26f9Sdan if( rc==SQLITE_BUSY ){ 189934116eafSdrh /* Reset the return code so as not to report a checkpoint failure 1900a58f26f9Sdan ** just because there are active readers. */ 190134116eafSdrh rc = SQLITE_OK; 190273b64e4dSdrh } 1903976b0033Sdan } 190473b64e4dSdrh 1905f26a1549Sdan /* If this is an SQLITE_CHECKPOINT_RESTART or TRUNCATE operation, and the 1906f26a1549Sdan ** entire wal file has been copied into the database file, then block 1907f26a1549Sdan ** until all readers have finished using the wal file. This ensures that 1908f26a1549Sdan ** the next process to write to the database restarts the wal file. 1909f2b8dd58Sdan */ 1910f2b8dd58Sdan if( rc==SQLITE_OK && eMode!=SQLITE_CHECKPOINT_PASSIVE ){ 1911cdc1f049Sdan assert( pWal->writeLock ); 1912f2b8dd58Sdan if( pInfo->nBackfill<pWal->hdr.mxFrame ){ 1913f2b8dd58Sdan rc = SQLITE_BUSY; 1914f26a1549Sdan }else if( eMode>=SQLITE_CHECKPOINT_RESTART ){ 19150fe8c1b9Sdan u32 salt1; 19160fe8c1b9Sdan sqlite3_randomness(4, &salt1); 1917976b0033Sdan assert( pInfo->nBackfill==pWal->hdr.mxFrame ); 1918cdc1f049Sdan rc = walBusyLock(pWal, xBusy, pBusyArg, WAL_READ_LOCK(1), WAL_NREADER-1); 1919cdc1f049Sdan if( rc==SQLITE_OK ){ 1920f26a1549Sdan if( eMode==SQLITE_CHECKPOINT_TRUNCATE ){ 1921a25165faSdrh /* IMPLEMENTATION-OF: R-44699-57140 This mode works the same way as 1922a25165faSdrh ** SQLITE_CHECKPOINT_RESTART with the addition that it also 1923a25165faSdrh ** truncates the log file to zero bytes just prior to a 1924a25165faSdrh ** successful return. 1925f26a1549Sdan ** 1926f26a1549Sdan ** In theory, it might be safe to do this without updating the 1927f26a1549Sdan ** wal-index header in shared memory, as all subsequent reader or 1928f26a1549Sdan ** writer clients should see that the entire log file has been 1929f26a1549Sdan ** checkpointed and behave accordingly. This seems unsafe though, 1930f26a1549Sdan ** as it would leave the system in a state where the contents of 1931f26a1549Sdan ** the wal-index header do not match the contents of the 1932f26a1549Sdan ** file-system. To avoid this, update the wal-index header to 1933f26a1549Sdan ** indicate that the log file contains zero valid frames. */ 19340fe8c1b9Sdan walRestartHdr(pWal, salt1); 1935f26a1549Sdan rc = sqlite3OsTruncate(pWal->pWalFd, 0); 1936f26a1549Sdan } 1937cdc1f049Sdan walUnlockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1); 1938cdc1f049Sdan } 1939cdc1f049Sdan } 1940f2b8dd58Sdan } 1941cdc1f049Sdan 194283f42d1bSdan walcheckpoint_out: 19437ed91f23Sdrh walIteratorFree(pIter); 1944c438efd6Sdrh return rc; 1945c438efd6Sdrh } 1946c438efd6Sdrh 1947c438efd6Sdrh /* 1948f60b7f36Sdan ** If the WAL file is currently larger than nMax bytes in size, truncate 1949f60b7f36Sdan ** it to exactly nMax bytes. If an error occurs while doing so, ignore it. 19508dd4afadSdrh */ 1951f60b7f36Sdan static void walLimitSize(Wal *pWal, i64 nMax){ 19528dd4afadSdrh i64 sz; 19538dd4afadSdrh int rx; 19548dd4afadSdrh sqlite3BeginBenignMalloc(); 19558dd4afadSdrh rx = sqlite3OsFileSize(pWal->pWalFd, &sz); 1956f60b7f36Sdan if( rx==SQLITE_OK && (sz > nMax ) ){ 1957f60b7f36Sdan rx = sqlite3OsTruncate(pWal->pWalFd, nMax); 19588dd4afadSdrh } 19598dd4afadSdrh sqlite3EndBenignMalloc(); 19608dd4afadSdrh if( rx ){ 19618dd4afadSdrh sqlite3_log(rx, "cannot limit WAL size: %s", pWal->zWalName); 19628dd4afadSdrh } 19638dd4afadSdrh } 19648dd4afadSdrh 19658dd4afadSdrh /* 1966c438efd6Sdrh ** Close a connection to a log file. 1967c438efd6Sdrh */ 1968c438efd6Sdrh int sqlite3WalClose( 19697ed91f23Sdrh Wal *pWal, /* Wal to close */ 19707fb89906Sdan sqlite3 *db, /* For interrupt flag */ 1971c438efd6Sdrh int sync_flags, /* Flags to pass to OsSync() (or 0) */ 1972b6e099a9Sdan int nBuf, 1973b6e099a9Sdan u8 *zBuf /* Buffer of at least nBuf bytes */ 1974c438efd6Sdrh ){ 1975c438efd6Sdrh int rc = SQLITE_OK; 19767ed91f23Sdrh if( pWal ){ 197730c8629eSdan int isDelete = 0; /* True to unlink wal and wal-index files */ 197830c8629eSdan 197930c8629eSdan /* If an EXCLUSIVE lock can be obtained on the database file (using the 198030c8629eSdan ** ordinary, rollback-mode locking methods, this guarantees that the 198130c8629eSdan ** connection associated with this log file is the only connection to 198230c8629eSdan ** the database. In this case checkpoint the database and unlink both 198330c8629eSdan ** the wal and wal-index files. 198430c8629eSdan ** 198530c8629eSdan ** The EXCLUSIVE lock is not released before returning. 198630c8629eSdan */ 19874a5bad57Sdan if( zBuf!=0 1988298af023Sdan && SQLITE_OK==(rc = sqlite3OsLock(pWal->pDbFd, SQLITE_LOCK_EXCLUSIVE)) 1989298af023Sdan ){ 19908c408004Sdan if( pWal->exclusiveMode==WAL_NORMAL_MODE ){ 19918c408004Sdan pWal->exclusiveMode = WAL_EXCLUSIVE_MODE; 19928c408004Sdan } 19937fb89906Sdan rc = sqlite3WalCheckpoint(pWal, db, 19947fb89906Sdan SQLITE_CHECKPOINT_PASSIVE, 0, 0, sync_flags, nBuf, zBuf, 0, 0 1995cdc1f049Sdan ); 1996eed42505Sdrh if( rc==SQLITE_OK ){ 1997eed42505Sdrh int bPersist = -1; 1998c02372ceSdrh sqlite3OsFileControlHint( 19996f2f19a1Sdan pWal->pDbFd, SQLITE_FCNTL_PERSIST_WAL, &bPersist 20006f2f19a1Sdan ); 2001eed42505Sdrh if( bPersist!=1 ){ 2002eed42505Sdrh /* Try to delete the WAL file if the checkpoint completed and 2003eed42505Sdrh ** fsyned (rc==SQLITE_OK) and if we are not in persistent-wal 2004eed42505Sdrh ** mode (!bPersist) */ 200530c8629eSdan isDelete = 1; 2006f60b7f36Sdan }else if( pWal->mxWalSize>=0 ){ 2007eed42505Sdrh /* Try to truncate the WAL file to zero bytes if the checkpoint 2008eed42505Sdrh ** completed and fsynced (rc==SQLITE_OK) and we are in persistent 2009eed42505Sdrh ** WAL mode (bPersist) and if the PRAGMA journal_size_limit is a 2010eed42505Sdrh ** non-negative value (pWal->mxWalSize>=0). Note that we truncate 2011eed42505Sdrh ** to zero bytes as truncating to the journal_size_limit might 2012eed42505Sdrh ** leave a corrupt WAL file on disk. */ 2013eed42505Sdrh walLimitSize(pWal, 0); 2014eed42505Sdrh } 201530c8629eSdan } 201630c8629eSdan } 201730c8629eSdan 20181018e90bSdan walIndexClose(pWal, isDelete); 2019d9e5c4f6Sdrh sqlite3OsClose(pWal->pWalFd); 202030c8629eSdan if( isDelete ){ 202192c45cf0Sdrh sqlite3BeginBenignMalloc(); 2022d9e5c4f6Sdrh sqlite3OsDelete(pWal->pVfs, pWal->zWalName, 0); 202392c45cf0Sdrh sqlite3EndBenignMalloc(); 202430c8629eSdan } 2025c74c3334Sdrh WALTRACE(("WAL%p: closed\n", pWal)); 20268a300f80Sshaneh sqlite3_free((void *)pWal->apWiData); 20277ed91f23Sdrh sqlite3_free(pWal); 2028c438efd6Sdrh } 2029c438efd6Sdrh return rc; 2030c438efd6Sdrh } 2031c438efd6Sdrh 2032c438efd6Sdrh /* 2033a2a42013Sdrh ** Try to read the wal-index header. Return 0 on success and 1 if 2034a2a42013Sdrh ** there is a problem. 2035a2a42013Sdrh ** 2036a2a42013Sdrh ** The wal-index is in shared memory. Another thread or process might 2037a2a42013Sdrh ** be writing the header at the same time this procedure is trying to 2038a2a42013Sdrh ** read it, which might result in inconsistency. A dirty read is detected 203973b64e4dSdrh ** by verifying that both copies of the header are the same and also by 204073b64e4dSdrh ** a checksum on the header. 2041a2a42013Sdrh ** 2042a2a42013Sdrh ** If and only if the read is consistent and the header is different from 2043a2a42013Sdrh ** pWal->hdr, then pWal->hdr is updated to the content of the new header 2044a2a42013Sdrh ** and *pChanged is set to 1. 2045c438efd6Sdrh ** 204684670502Sdan ** If the checksum cannot be verified return non-zero. If the header 204784670502Sdan ** is read successfully and the checksum verified, return zero. 2048c438efd6Sdrh */ 20497750ab48Sdrh static int walIndexTryHdr(Wal *pWal, int *pChanged){ 2050286a2884Sdrh u32 aCksum[2]; /* Checksum on the header content */ 2051f0b20f88Sdrh WalIndexHdr h1, h2; /* Two copies of the header content */ 20524280eb30Sdan WalIndexHdr volatile *aHdr; /* Header in shared memory */ 2053c438efd6Sdrh 20544280eb30Sdan /* The first page of the wal-index must be mapped at this point. */ 20554280eb30Sdan assert( pWal->nWiData>0 && pWal->apWiData[0] ); 205679e6c78cSdrh 20576cef0cf7Sdrh /* Read the header. This might happen concurrently with a write to the 205873b64e4dSdrh ** same area of shared memory on a different CPU in a SMP, 205973b64e4dSdrh ** meaning it is possible that an inconsistent snapshot is read 206084670502Sdan ** from the file. If this happens, return non-zero. 2061f0b20f88Sdrh ** 2062f0b20f88Sdrh ** There are two copies of the header at the beginning of the wal-index. 2063f0b20f88Sdrh ** When reading, read [0] first then [1]. Writes are in the reverse order. 2064f0b20f88Sdrh ** Memory barriers are used to prevent the compiler or the hardware from 2065f0b20f88Sdrh ** reordering the reads and writes. 2066c438efd6Sdrh */ 20674280eb30Sdan aHdr = walIndexHdr(pWal); 20684280eb30Sdan memcpy(&h1, (void *)&aHdr[0], sizeof(h1)); 20698c408004Sdan walShmBarrier(pWal); 20704280eb30Sdan memcpy(&h2, (void *)&aHdr[1], sizeof(h2)); 2071286a2884Sdrh 2072f0b20f88Sdrh if( memcmp(&h1, &h2, sizeof(h1))!=0 ){ 2073f0b20f88Sdrh return 1; /* Dirty read */ 2074286a2884Sdrh } 20754b82c387Sdrh if( h1.isInit==0 ){ 2076f0b20f88Sdrh return 1; /* Malformed header - probably all zeros */ 2077f0b20f88Sdrh } 2078b8fd6c2fSdan walChecksumBytes(1, (u8*)&h1, sizeof(h1)-sizeof(h1.aCksum), 0, aCksum); 2079f0b20f88Sdrh if( aCksum[0]!=h1.aCksum[0] || aCksum[1]!=h1.aCksum[1] ){ 2080f0b20f88Sdrh return 1; /* Checksum does not match */ 2081c438efd6Sdrh } 2082c438efd6Sdrh 2083f0b20f88Sdrh if( memcmp(&pWal->hdr, &h1, sizeof(WalIndexHdr)) ){ 2084c438efd6Sdrh *pChanged = 1; 2085f0b20f88Sdrh memcpy(&pWal->hdr, &h1, sizeof(WalIndexHdr)); 20869b78f791Sdrh pWal->szPage = (pWal->hdr.szPage&0xfe00) + ((pWal->hdr.szPage&0x0001)<<16); 20879b78f791Sdrh testcase( pWal->szPage<=32768 ); 20889b78f791Sdrh testcase( pWal->szPage>=65536 ); 2089c438efd6Sdrh } 209084670502Sdan 209184670502Sdan /* The header was successfully read. Return zero. */ 209284670502Sdan return 0; 2093c438efd6Sdrh } 2094c438efd6Sdrh 2095c438efd6Sdrh /* 209608ecefc5Sdan ** This is the value that walTryBeginRead returns when it needs to 209708ecefc5Sdan ** be retried. 209808ecefc5Sdan */ 209908ecefc5Sdan #define WAL_RETRY (-1) 210008ecefc5Sdan 210108ecefc5Sdan /* 2102a2a42013Sdrh ** Read the wal-index header from the wal-index and into pWal->hdr. 2103a927e94eSdrh ** If the wal-header appears to be corrupt, try to reconstruct the 2104a927e94eSdrh ** wal-index from the WAL before returning. 2105a2a42013Sdrh ** 2106a2a42013Sdrh ** Set *pChanged to 1 if the wal-index header value in pWal->hdr is 210760ec914cSpeter.d.reid ** changed by this operation. If pWal->hdr is unchanged, set *pChanged 2108a2a42013Sdrh ** to 0. 2109a2a42013Sdrh ** 21107ed91f23Sdrh ** If the wal-index header is successfully read, return SQLITE_OK. 2111c438efd6Sdrh ** Otherwise an SQLite error code. 2112c438efd6Sdrh */ 21137ed91f23Sdrh static int walIndexReadHdr(Wal *pWal, int *pChanged){ 211484670502Sdan int rc; /* Return code */ 211573b64e4dSdrh int badHdr; /* True if a header read failed */ 2116a927e94eSdrh volatile u32 *page0; /* Chunk of wal-index containing header */ 2117c438efd6Sdrh 21184280eb30Sdan /* Ensure that page 0 of the wal-index (the page that contains the 21194280eb30Sdan ** wal-index header) is mapped. Return early if an error occurs here. 21204280eb30Sdan */ 2121a861469aSdan assert( pChanged ); 21224280eb30Sdan rc = walIndexPage(pWal, 0, &page0); 212385bc6df2Sdrh if( rc!=SQLITE_OK ){ 212485bc6df2Sdrh assert( rc!=SQLITE_READONLY ); /* READONLY changed to OK in walIndexPage */ 21257e45e3a5Sdrh if( rc==SQLITE_READONLY_CANTINIT ){ 212685bc6df2Sdrh /* The SQLITE_READONLY_CANTINIT return means that the shared-memory 212785bc6df2Sdrh ** was openable but is not writable, and this thread is unable to 212885bc6df2Sdrh ** confirm that another write-capable connection has the shared-memory 212985bc6df2Sdrh ** open, and hence the content of the shared-memory is unreliable, 213085bc6df2Sdrh ** since the shared-memory might be inconsistent with the WAL file 213185bc6df2Sdrh ** and there is no writer on hand to fix it. */ 2132c05a063cSdrh assert( page0==0 ); 2133c05a063cSdrh assert( pWal->writeLock==0 ); 2134c05a063cSdrh assert( pWal->readOnly & WAL_SHM_RDONLY ); 213585bc6df2Sdrh pWal->bShmUnreliable = 1; 213611caf4f4Sdan pWal->exclusiveMode = WAL_HEAPMEMORY_MODE; 213711caf4f4Sdan *pChanged = 1; 213885bc6df2Sdrh }else{ 213985bc6df2Sdrh return rc; /* Any other non-OK return is just an error */ 214085bc6df2Sdrh } 2141c05a063cSdrh }else{ 2142c05a063cSdrh /* page0 can be NULL if the SHM is zero bytes in size and pWal->writeLock 2143c05a063cSdrh ** is zero, which prevents the SHM from growing */ 2144c05a063cSdrh testcase( page0!=0 ); 2145c05a063cSdrh } 2146c05a063cSdrh assert( page0!=0 || pWal->writeLock==0 ); 21477ed91f23Sdrh 21484280eb30Sdan /* If the first page of the wal-index has been mapped, try to read the 21494280eb30Sdan ** wal-index header immediately, without holding any lock. This usually 21504280eb30Sdan ** works, but may fail if the wal-index header is corrupt or currently 2151a927e94eSdrh ** being modified by another thread or process. 2152c438efd6Sdrh */ 21534280eb30Sdan badHdr = (page0 ? walIndexTryHdr(pWal, pChanged) : 1); 2154c438efd6Sdrh 215573b64e4dSdrh /* If the first attempt failed, it might have been due to a race 215666dfec8bSdrh ** with a writer. So get a WRITE lock and try again. 2157c438efd6Sdrh */ 2158d54ff60bSdan assert( badHdr==0 || pWal->writeLock==0 ); 21594edc6bf3Sdan if( badHdr ){ 216085bc6df2Sdrh if( pWal->bShmUnreliable==0 && (pWal->readOnly & WAL_SHM_RDONLY) ){ 21614edc6bf3Sdan if( SQLITE_OK==(rc = walLockShared(pWal, WAL_WRITE_LOCK)) ){ 21624edc6bf3Sdan walUnlockShared(pWal, WAL_WRITE_LOCK); 21634edc6bf3Sdan rc = SQLITE_READONLY_RECOVERY; 21644edc6bf3Sdan } 2165ab372773Sdrh }else if( SQLITE_OK==(rc = walLockExclusive(pWal, WAL_WRITE_LOCK, 1)) ){ 216673b64e4dSdrh pWal->writeLock = 1; 21674280eb30Sdan if( SQLITE_OK==(rc = walIndexPage(pWal, 0, &page0)) ){ 216873b64e4dSdrh badHdr = walIndexTryHdr(pWal, pChanged); 216973b64e4dSdrh if( badHdr ){ 217073b64e4dSdrh /* If the wal-index header is still malformed even while holding 217173b64e4dSdrh ** a WRITE lock, it can only mean that the header is corrupted and 217273b64e4dSdrh ** needs to be reconstructed. So run recovery to do exactly that. 217373b64e4dSdrh */ 21747ed91f23Sdrh rc = walIndexRecover(pWal); 21753dee6da9Sdan *pChanged = 1; 2176c438efd6Sdrh } 2177c438efd6Sdrh } 21784280eb30Sdan pWal->writeLock = 0; 21794280eb30Sdan walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1); 2180bab7b91eSdrh } 21814edc6bf3Sdan } 2182bab7b91eSdrh 2183a927e94eSdrh /* If the header is read successfully, check the version number to make 2184a927e94eSdrh ** sure the wal-index was not constructed with some future format that 2185a927e94eSdrh ** this version of SQLite cannot understand. 2186a927e94eSdrh */ 2187a927e94eSdrh if( badHdr==0 && pWal->hdr.iVersion!=WALINDEX_MAX_VERSION ){ 2188a927e94eSdrh rc = SQLITE_CANTOPEN_BKPT; 2189a927e94eSdrh } 219085bc6df2Sdrh if( pWal->bShmUnreliable ){ 219111caf4f4Sdan if( rc!=SQLITE_OK ){ 219211caf4f4Sdan walIndexClose(pWal, 0); 219385bc6df2Sdrh pWal->bShmUnreliable = 0; 219408ecefc5Sdan assert( pWal->nWiData>0 && pWal->apWiData[0]==0 ); 21958b17ac19Sdrh /* walIndexRecover() might have returned SHORT_READ if a concurrent 21968b17ac19Sdrh ** writer truncated the WAL out from under it. If that happens, it 21978b17ac19Sdrh ** indicates that a writer has fixed the SHM file for us, so retry */ 219808ecefc5Sdan if( rc==SQLITE_IOERR_SHORT_READ ) rc = WAL_RETRY; 219911caf4f4Sdan } 220011caf4f4Sdan pWal->exclusiveMode = WAL_NORMAL_MODE; 220111caf4f4Sdan } 2202a927e94eSdrh 2203c438efd6Sdrh return rc; 2204c438efd6Sdrh } 2205c438efd6Sdrh 2206c438efd6Sdrh /* 220785bc6df2Sdrh ** Open a transaction in a connection where the shared-memory is read-only 220885bc6df2Sdrh ** and where we cannot verify that there is a separate write-capable connection 220985bc6df2Sdrh ** on hand to keep the shared-memory up-to-date with the WAL file. 221085bc6df2Sdrh ** 221185bc6df2Sdrh ** This can happen, for example, when the shared-memory is implemented by 221285bc6df2Sdrh ** memory-mapping a *-shm file, where a prior writer has shut down and 221385bc6df2Sdrh ** left the *-shm file on disk, and now the present connection is trying 221485bc6df2Sdrh ** to use that database but lacks write permission on the *-shm file. 221585bc6df2Sdrh ** Other scenarios are also possible, depending on the VFS implementation. 221685bc6df2Sdrh ** 221785bc6df2Sdrh ** Precondition: 221885bc6df2Sdrh ** 221985bc6df2Sdrh ** The *-wal file has been read and an appropriate wal-index has been 222085bc6df2Sdrh ** constructed in pWal->apWiData[] using heap memory instead of shared 222185bc6df2Sdrh ** memory. 222211caf4f4Sdan ** 222311caf4f4Sdan ** If this function returns SQLITE_OK, then the read transaction has 222411caf4f4Sdan ** been successfully opened. In this case output variable (*pChanged) 222511caf4f4Sdan ** is set to true before returning if the caller should discard the 222611caf4f4Sdan ** contents of the page cache before proceeding. Or, if it returns 222711caf4f4Sdan ** WAL_RETRY, then the heap memory wal-index has been discarded and 222811caf4f4Sdan ** the caller should retry opening the read transaction from the 222911caf4f4Sdan ** beginning (including attempting to map the *-shm file). 223011caf4f4Sdan ** 223111caf4f4Sdan ** If an error occurs, an SQLite error code is returned. 223211caf4f4Sdan */ 223385bc6df2Sdrh static int walBeginShmUnreliable(Wal *pWal, int *pChanged){ 223411caf4f4Sdan i64 szWal; /* Size of wal file on disk in bytes */ 223511caf4f4Sdan i64 iOffset; /* Current offset when reading wal file */ 223611caf4f4Sdan u8 aBuf[WAL_HDRSIZE]; /* Buffer to load WAL header into */ 223711caf4f4Sdan u8 *aFrame = 0; /* Malloc'd buffer to load entire frame */ 223811caf4f4Sdan int szFrame; /* Number of bytes in buffer aFrame[] */ 223911caf4f4Sdan u8 *aData; /* Pointer to data part of aFrame buffer */ 224011caf4f4Sdan volatile void *pDummy; /* Dummy argument for xShmMap */ 224111caf4f4Sdan int rc; /* Return code */ 224211caf4f4Sdan u32 aSaveCksum[2]; /* Saved copy of pWal->hdr.aFrameCksum */ 224311caf4f4Sdan 224485bc6df2Sdrh assert( pWal->bShmUnreliable ); 224511caf4f4Sdan assert( pWal->readOnly & WAL_SHM_RDONLY ); 224611caf4f4Sdan assert( pWal->nWiData>0 && pWal->apWiData[0] ); 224711caf4f4Sdan 224811caf4f4Sdan /* Take WAL_READ_LOCK(0). This has the effect of preventing any 224985bc6df2Sdrh ** writers from running a checkpoint, but does not stop them 225011caf4f4Sdan ** from running recovery. */ 225111caf4f4Sdan rc = walLockShared(pWal, WAL_READ_LOCK(0)); 225211caf4f4Sdan if( rc!=SQLITE_OK ){ 2253ab548384Sdan if( rc==SQLITE_BUSY ) rc = WAL_RETRY; 225485bc6df2Sdrh goto begin_unreliable_shm_out; 225511caf4f4Sdan } 225611caf4f4Sdan pWal->readLock = 0; 225711caf4f4Sdan 225885bc6df2Sdrh /* Check to see if a separate writer has attached to the shared-memory area, 225985bc6df2Sdrh ** thus making the shared-memory "reliable" again. Do this by invoking 226085bc6df2Sdrh ** the xShmMap() routine of the VFS and looking to see if the return 226185bc6df2Sdrh ** is SQLITE_READONLY instead of SQLITE_READONLY_CANTINIT. 226211caf4f4Sdan ** 226385bc6df2Sdrh ** If the shared-memory is now "reliable" return WAL_RETRY, which will 226485bc6df2Sdrh ** cause the heap-memory WAL-index to be discarded and the actual 226585bc6df2Sdrh ** shared memory to be used in its place. 2266870655bbSdrh ** 2267870655bbSdrh ** This step is important because, even though this connection is holding 2268870655bbSdrh ** the WAL_READ_LOCK(0) which prevents a checkpoint, a writer might 2269870655bbSdrh ** have already checkpointed the WAL file and, while the current 2270870655bbSdrh ** is active, wrap the WAL and start overwriting frames that this 2271870655bbSdrh ** process wants to use. 2272870655bbSdrh ** 2273870655bbSdrh ** Once sqlite3OsShmMap() has been called for an sqlite3_file and has 2274870655bbSdrh ** returned any SQLITE_READONLY value, it must return only SQLITE_READONLY 2275870655bbSdrh ** or SQLITE_READONLY_CANTINIT or some error for all subsequent invocations, 2276870655bbSdrh ** even if some external agent does a "chmod" to make the shared-memory 2277870655bbSdrh ** writable by us, until sqlite3OsShmUnmap() has been called. 2278870655bbSdrh ** This is a requirement on the VFS implementation. 227985bc6df2Sdrh */ 228011caf4f4Sdan rc = sqlite3OsShmMap(pWal->pDbFd, 0, WALINDEX_PGSZ, 0, &pDummy); 22819214c1efSdrh assert( rc!=SQLITE_OK ); /* SQLITE_OK not possible for read-only connection */ 22827e45e3a5Sdrh if( rc!=SQLITE_READONLY_CANTINIT ){ 228311caf4f4Sdan rc = (rc==SQLITE_READONLY ? WAL_RETRY : rc); 228485bc6df2Sdrh goto begin_unreliable_shm_out; 228511caf4f4Sdan } 228611caf4f4Sdan 2287870655bbSdrh /* We reach this point only if the real shared-memory is still unreliable. 228885bc6df2Sdrh ** Assume the in-memory WAL-index substitute is correct and load it 228985bc6df2Sdrh ** into pWal->hdr. 229085bc6df2Sdrh */ 229111caf4f4Sdan memcpy(&pWal->hdr, (void*)walIndexHdr(pWal), sizeof(WalIndexHdr)); 229285bc6df2Sdrh 2293870655bbSdrh /* Make sure some writer hasn't come in and changed the WAL file out 2294870655bbSdrh ** from under us, then disconnected, while we were not looking. 229585bc6df2Sdrh */ 229611caf4f4Sdan rc = sqlite3OsFileSize(pWal->pWalFd, &szWal); 2297ab548384Sdan if( rc!=SQLITE_OK ){ 229885bc6df2Sdrh goto begin_unreliable_shm_out; 2299ab548384Sdan } 2300ab548384Sdan if( szWal<WAL_HDRSIZE ){ 230111caf4f4Sdan /* If the wal file is too small to contain a wal-header and the 230211caf4f4Sdan ** wal-index header has mxFrame==0, then it must be safe to proceed 230311caf4f4Sdan ** reading the database file only. However, the page cache cannot 230411caf4f4Sdan ** be trusted, as a read/write connection may have connected, written 230511caf4f4Sdan ** the db, run a checkpoint, truncated the wal file and disconnected 230611caf4f4Sdan ** since this client's last read transaction. */ 230711caf4f4Sdan *pChanged = 1; 2308ab548384Sdan rc = (pWal->hdr.mxFrame==0 ? SQLITE_OK : WAL_RETRY); 230985bc6df2Sdrh goto begin_unreliable_shm_out; 231011caf4f4Sdan } 231111caf4f4Sdan 231211caf4f4Sdan /* Check the salt keys at the start of the wal file still match. */ 231311caf4f4Sdan rc = sqlite3OsRead(pWal->pWalFd, aBuf, WAL_HDRSIZE, 0); 231411caf4f4Sdan if( rc!=SQLITE_OK ){ 231585bc6df2Sdrh goto begin_unreliable_shm_out; 231611caf4f4Sdan } 231711caf4f4Sdan if( memcmp(&pWal->hdr.aSalt, &aBuf[16], 8) ){ 2318870655bbSdrh /* Some writer has wrapped the WAL file while we were not looking. 2319870655bbSdrh ** Return WAL_RETRY which will cause the in-memory WAL-index to be 2320870655bbSdrh ** rebuilt. */ 232111caf4f4Sdan rc = WAL_RETRY; 232285bc6df2Sdrh goto begin_unreliable_shm_out; 232311caf4f4Sdan } 232411caf4f4Sdan 232511caf4f4Sdan /* Allocate a buffer to read frames into */ 232611caf4f4Sdan szFrame = pWal->hdr.szPage + WAL_FRAME_HDRSIZE; 232711caf4f4Sdan aFrame = (u8 *)sqlite3_malloc64(szFrame); 232811caf4f4Sdan if( aFrame==0 ){ 232911caf4f4Sdan rc = SQLITE_NOMEM_BKPT; 233085bc6df2Sdrh goto begin_unreliable_shm_out; 233111caf4f4Sdan } 233211caf4f4Sdan aData = &aFrame[WAL_FRAME_HDRSIZE]; 233311caf4f4Sdan 2334cbd33219Sdan /* Check to see if a complete transaction has been appended to the 2335cbd33219Sdan ** wal file since the heap-memory wal-index was created. If so, the 2336cbd33219Sdan ** heap-memory wal-index is discarded and WAL_RETRY returned to 2337cbd33219Sdan ** the caller. */ 233811caf4f4Sdan aSaveCksum[0] = pWal->hdr.aFrameCksum[0]; 233911caf4f4Sdan aSaveCksum[1] = pWal->hdr.aFrameCksum[1]; 234011caf4f4Sdan for(iOffset=walFrameOffset(pWal->hdr.mxFrame+1, pWal->hdr.szPage); 234111caf4f4Sdan iOffset+szFrame<=szWal; 234211caf4f4Sdan iOffset+=szFrame 234311caf4f4Sdan ){ 234411caf4f4Sdan u32 pgno; /* Database page number for frame */ 234511caf4f4Sdan u32 nTruncate; /* dbsize field from frame header */ 234611caf4f4Sdan 234711caf4f4Sdan /* Read and decode the next log frame. */ 234811caf4f4Sdan rc = sqlite3OsRead(pWal->pWalFd, aFrame, szFrame, iOffset); 2349ab548384Sdan if( rc!=SQLITE_OK ) break; 235011caf4f4Sdan if( !walDecodeFrame(pWal, &pgno, &nTruncate, aData, aFrame) ) break; 235111caf4f4Sdan 2352cbd33219Sdan /* If nTruncate is non-zero, then a complete transaction has been 2353cbd33219Sdan ** appended to this wal file. Set rc to WAL_RETRY and break out of 2354cbd33219Sdan ** the loop. */ 235511caf4f4Sdan if( nTruncate ){ 235611caf4f4Sdan rc = WAL_RETRY; 235711caf4f4Sdan break; 235811caf4f4Sdan } 235911caf4f4Sdan } 236011caf4f4Sdan pWal->hdr.aFrameCksum[0] = aSaveCksum[0]; 236111caf4f4Sdan pWal->hdr.aFrameCksum[1] = aSaveCksum[1]; 236211caf4f4Sdan 236385bc6df2Sdrh begin_unreliable_shm_out: 236411caf4f4Sdan sqlite3_free(aFrame); 236511caf4f4Sdan if( rc!=SQLITE_OK ){ 236611caf4f4Sdan int i; 236711caf4f4Sdan for(i=0; i<pWal->nWiData; i++){ 236811caf4f4Sdan sqlite3_free((void*)pWal->apWiData[i]); 236911caf4f4Sdan pWal->apWiData[i] = 0; 237011caf4f4Sdan } 237185bc6df2Sdrh pWal->bShmUnreliable = 0; 237211caf4f4Sdan sqlite3WalEndReadTransaction(pWal); 237311caf4f4Sdan *pChanged = 1; 237411caf4f4Sdan } 237511caf4f4Sdan return rc; 237611caf4f4Sdan } 237711caf4f4Sdan 237811caf4f4Sdan /* 237973b64e4dSdrh ** Attempt to start a read transaction. This might fail due to a race or 238073b64e4dSdrh ** other transient condition. When that happens, it returns WAL_RETRY to 238173b64e4dSdrh ** indicate to the caller that it is safe to retry immediately. 238273b64e4dSdrh ** 2383a927e94eSdrh ** On success return SQLITE_OK. On a permanent failure (such an 238473b64e4dSdrh ** I/O error or an SQLITE_BUSY because another process is running 238573b64e4dSdrh ** recovery) return a positive error code. 238673b64e4dSdrh ** 2387a927e94eSdrh ** The useWal parameter is true to force the use of the WAL and disable 2388a927e94eSdrh ** the case where the WAL is bypassed because it has been completely 2389a927e94eSdrh ** checkpointed. If useWal==0 then this routine calls walIndexReadHdr() 2390a927e94eSdrh ** to make a copy of the wal-index header into pWal->hdr. If the 2391a927e94eSdrh ** wal-index header has changed, *pChanged is set to 1 (as an indication 2392183f0aa6Sdrh ** to the caller that the local page cache is obsolete and needs to be 2393a927e94eSdrh ** flushed.) When useWal==1, the wal-index header is assumed to already 2394a927e94eSdrh ** be loaded and the pChanged parameter is unused. 2395a927e94eSdrh ** 2396a927e94eSdrh ** The caller must set the cnt parameter to the number of prior calls to 2397a927e94eSdrh ** this routine during the current read attempt that returned WAL_RETRY. 2398a927e94eSdrh ** This routine will start taking more aggressive measures to clear the 2399a927e94eSdrh ** race conditions after multiple WAL_RETRY returns, and after an excessive 2400a927e94eSdrh ** number of errors will ultimately return SQLITE_PROTOCOL. The 2401a927e94eSdrh ** SQLITE_PROTOCOL return indicates that some other process has gone rogue 2402a927e94eSdrh ** and is not honoring the locking protocol. There is a vanishingly small 2403a927e94eSdrh ** chance that SQLITE_PROTOCOL could be returned because of a run of really 2404a927e94eSdrh ** bad luck when there is lots of contention for the wal-index, but that 2405a927e94eSdrh ** possibility is so small that it can be safely neglected, we believe. 2406a927e94eSdrh ** 240773b64e4dSdrh ** On success, this routine obtains a read lock on 240873b64e4dSdrh ** WAL_READ_LOCK(pWal->readLock). The pWal->readLock integer is 240973b64e4dSdrh ** in the range 0 <= pWal->readLock < WAL_NREADER. If pWal->readLock==(-1) 241073b64e4dSdrh ** that means the Wal does not hold any read lock. The reader must not 241173b64e4dSdrh ** access any database page that is modified by a WAL frame up to and 241273b64e4dSdrh ** including frame number aReadMark[pWal->readLock]. The reader will 241373b64e4dSdrh ** use WAL frames up to and including pWal->hdr.mxFrame if pWal->readLock>0 241473b64e4dSdrh ** Or if pWal->readLock==0, then the reader will ignore the WAL 241573b64e4dSdrh ** completely and get all content directly from the database file. 2416a927e94eSdrh ** If the useWal parameter is 1 then the WAL will never be ignored and 2417a927e94eSdrh ** this routine will always set pWal->readLock>0 on success. 241873b64e4dSdrh ** When the read transaction is completed, the caller must release the 241973b64e4dSdrh ** lock on WAL_READ_LOCK(pWal->readLock) and set pWal->readLock to -1. 242073b64e4dSdrh ** 242173b64e4dSdrh ** This routine uses the nBackfill and aReadMark[] fields of the header 242273b64e4dSdrh ** to select a particular WAL_READ_LOCK() that strives to let the 242373b64e4dSdrh ** checkpoint process do as much work as possible. This routine might 242473b64e4dSdrh ** update values of the aReadMark[] array in the header, but if it does 242573b64e4dSdrh ** so it takes care to hold an exclusive lock on the corresponding 242673b64e4dSdrh ** WAL_READ_LOCK() while changing values. 242773b64e4dSdrh */ 2428aab4c02eSdrh static int walTryBeginRead(Wal *pWal, int *pChanged, int useWal, int cnt){ 242973b64e4dSdrh volatile WalCkptInfo *pInfo; /* Checkpoint information in wal-index */ 243073b64e4dSdrh u32 mxReadMark; /* Largest aReadMark[] value */ 243173b64e4dSdrh int mxI; /* Index of largest aReadMark[] value */ 243273b64e4dSdrh int i; /* Loop counter */ 243313a3cb82Sdan int rc = SQLITE_OK; /* Return code */ 2434c49e960dSdrh u32 mxFrame; /* Wal frame to lock to */ 2435c438efd6Sdrh 243661e4acecSdrh assert( pWal->readLock<0 ); /* Not currently locked */ 2437c438efd6Sdrh 24382e9b0923Sdrh /* useWal may only be set for read/write connections */ 24392e9b0923Sdrh assert( (pWal->readOnly & WAL_SHM_RDONLY)==0 || useWal==0 ); 24402e9b0923Sdrh 2441658d76c9Sdrh /* Take steps to avoid spinning forever if there is a protocol error. 2442658d76c9Sdrh ** 2443658d76c9Sdrh ** Circumstances that cause a RETRY should only last for the briefest 2444658d76c9Sdrh ** instances of time. No I/O or other system calls are done while the 2445658d76c9Sdrh ** locks are held, so the locks should not be held for very long. But 2446658d76c9Sdrh ** if we are unlucky, another process that is holding a lock might get 2447658d76c9Sdrh ** paged out or take a page-fault that is time-consuming to resolve, 2448658d76c9Sdrh ** during the few nanoseconds that it is holding the lock. In that case, 2449658d76c9Sdrh ** it might take longer than normal for the lock to free. 2450658d76c9Sdrh ** 2451658d76c9Sdrh ** After 5 RETRYs, we begin calling sqlite3OsSleep(). The first few 2452658d76c9Sdrh ** calls to sqlite3OsSleep() have a delay of 1 microsecond. Really this 2453658d76c9Sdrh ** is more of a scheduler yield than an actual delay. But on the 10th 2454658d76c9Sdrh ** an subsequent retries, the delays start becoming longer and longer, 24555b6e3b97Sdrh ** so that on the 100th (and last) RETRY we delay for 323 milliseconds. 24565b6e3b97Sdrh ** The total delay time before giving up is less than 10 seconds. 2457658d76c9Sdrh */ 2458aab4c02eSdrh if( cnt>5 ){ 2459658d76c9Sdrh int nDelay = 1; /* Pause time in microseconds */ 246003c6967fSdrh if( cnt>100 ){ 246103c6967fSdrh VVA_ONLY( pWal->lockError = 1; ) 246203c6967fSdrh return SQLITE_PROTOCOL; 246303c6967fSdrh } 24645b6e3b97Sdrh if( cnt>=10 ) nDelay = (cnt-9)*(cnt-9)*39; 2465658d76c9Sdrh sqlite3OsSleep(pWal->pVfs, nDelay); 2466aab4c02eSdrh } 2467aab4c02eSdrh 246873b64e4dSdrh if( !useWal ){ 246911caf4f4Sdan assert( rc==SQLITE_OK ); 247085bc6df2Sdrh if( pWal->bShmUnreliable==0 ){ 24717ed91f23Sdrh rc = walIndexReadHdr(pWal, pChanged); 247211caf4f4Sdan } 247373b64e4dSdrh if( rc==SQLITE_BUSY ){ 247473b64e4dSdrh /* If there is not a recovery running in another thread or process 247573b64e4dSdrh ** then convert BUSY errors to WAL_RETRY. If recovery is known to 247673b64e4dSdrh ** be running, convert BUSY to BUSY_RECOVERY. There is a race here 247773b64e4dSdrh ** which might cause WAL_RETRY to be returned even if BUSY_RECOVERY 247873b64e4dSdrh ** would be technically correct. But the race is benign since with 247973b64e4dSdrh ** WAL_RETRY this routine will be called again and will probably be 248073b64e4dSdrh ** right on the second iteration. 248173b64e4dSdrh */ 24827d4514a4Sdan if( pWal->apWiData[0]==0 ){ 24837d4514a4Sdan /* This branch is taken when the xShmMap() method returns SQLITE_BUSY. 24847d4514a4Sdan ** We assume this is a transient condition, so return WAL_RETRY. The 24857d4514a4Sdan ** xShmMap() implementation used by the default unix and win32 VFS 24867d4514a4Sdan ** modules may return SQLITE_BUSY due to a race condition in the 24877d4514a4Sdan ** code that determines whether or not the shared-memory region 24887d4514a4Sdan ** must be zeroed before the requested page is returned. 24897d4514a4Sdan */ 24907d4514a4Sdan rc = WAL_RETRY; 24917d4514a4Sdan }else if( SQLITE_OK==(rc = walLockShared(pWal, WAL_RECOVER_LOCK)) ){ 249273b64e4dSdrh walUnlockShared(pWal, WAL_RECOVER_LOCK); 249373b64e4dSdrh rc = WAL_RETRY; 249473b64e4dSdrh }else if( rc==SQLITE_BUSY ){ 249573b64e4dSdrh rc = SQLITE_BUSY_RECOVERY; 249673b64e4dSdrh } 249773b64e4dSdrh } 2498c438efd6Sdrh if( rc!=SQLITE_OK ){ 249973b64e4dSdrh return rc; 250073b64e4dSdrh } 250185bc6df2Sdrh else if( pWal->bShmUnreliable ){ 250285bc6df2Sdrh return walBeginShmUnreliable(pWal, pChanged); 250311caf4f4Sdan } 2504a927e94eSdrh } 250573b64e4dSdrh 250692c02da3Sdan assert( pWal->nWiData>0 ); 25072e9b0923Sdrh assert( pWal->apWiData[0]!=0 ); 25082e9b0923Sdrh pInfo = walCkptInfo(pWal); 25092e9b0923Sdrh if( !useWal && pInfo->nBackfill==pWal->hdr.mxFrame 2510fc1acf33Sdan #ifdef SQLITE_ENABLE_SNAPSHOT 251121f2bafdSdan && (pWal->pSnapshot==0 || pWal->hdr.mxFrame==0) 2512fc1acf33Sdan #endif 2513fc1acf33Sdan ){ 251473b64e4dSdrh /* The WAL has been completely backfilled (or it is empty). 251573b64e4dSdrh ** and can be safely ignored. 251673b64e4dSdrh */ 251773b64e4dSdrh rc = walLockShared(pWal, WAL_READ_LOCK(0)); 25188c408004Sdan walShmBarrier(pWal); 251973b64e4dSdrh if( rc==SQLITE_OK ){ 25202e9b0923Sdrh if( memcmp((void *)walIndexHdr(pWal), &pWal->hdr, sizeof(WalIndexHdr)) ){ 2521493cc590Sdan /* It is not safe to allow the reader to continue here if frames 2522493cc590Sdan ** may have been appended to the log before READ_LOCK(0) was obtained. 2523493cc590Sdan ** When holding READ_LOCK(0), the reader ignores the entire log file, 2524493cc590Sdan ** which implies that the database file contains a trustworthy 252560ec914cSpeter.d.reid ** snapshot. Since holding READ_LOCK(0) prevents a checkpoint from 2526493cc590Sdan ** happening, this is usually correct. 2527493cc590Sdan ** 2528493cc590Sdan ** However, if frames have been appended to the log (or if the log 2529493cc590Sdan ** is wrapped and written for that matter) before the READ_LOCK(0) 2530493cc590Sdan ** is obtained, that is not necessarily true. A checkpointer may 2531493cc590Sdan ** have started to backfill the appended frames but crashed before 2532493cc590Sdan ** it finished. Leaving a corrupt image in the database file. 2533493cc590Sdan */ 253473b64e4dSdrh walUnlockShared(pWal, WAL_READ_LOCK(0)); 253573b64e4dSdrh return WAL_RETRY; 253673b64e4dSdrh } 253773b64e4dSdrh pWal->readLock = 0; 253873b64e4dSdrh return SQLITE_OK; 253973b64e4dSdrh }else if( rc!=SQLITE_BUSY ){ 254073b64e4dSdrh return rc; 2541c438efd6Sdrh } 2542c438efd6Sdrh } 2543ba51590bSdan 254473b64e4dSdrh /* If we get this far, it means that the reader will want to use 254573b64e4dSdrh ** the WAL to get at content from recent commits. The job now is 254673b64e4dSdrh ** to select one of the aReadMark[] entries that is closest to 254773b64e4dSdrh ** but not exceeding pWal->hdr.mxFrame and lock that entry. 254873b64e4dSdrh */ 254973b64e4dSdrh mxReadMark = 0; 255073b64e4dSdrh mxI = 0; 2551fc1acf33Sdan mxFrame = pWal->hdr.mxFrame; 2552fc1acf33Sdan #ifdef SQLITE_ENABLE_SNAPSHOT 2553818b11aeSdan if( pWal->pSnapshot && pWal->pSnapshot->mxFrame<mxFrame ){ 2554818b11aeSdan mxFrame = pWal->pSnapshot->mxFrame; 2555818b11aeSdan } 2556fc1acf33Sdan #endif 255773b64e4dSdrh for(i=1; i<WAL_NREADER; i++){ 255873b64e4dSdrh u32 thisMark = pInfo->aReadMark[i]; 2559fc1acf33Sdan if( mxReadMark<=thisMark && thisMark<=mxFrame ){ 2560db7f647eSdrh assert( thisMark!=READMARK_NOT_USED ); 256173b64e4dSdrh mxReadMark = thisMark; 256273b64e4dSdrh mxI = i; 256373b64e4dSdrh } 256473b64e4dSdrh } 256566dfec8bSdrh if( (pWal->readOnly & WAL_SHM_RDONLY)==0 2566fc1acf33Sdan && (mxReadMark<mxFrame || mxI==0) 256766dfec8bSdrh ){ 2568d54ff60bSdan for(i=1; i<WAL_NREADER; i++){ 2569ab372773Sdrh rc = walLockExclusive(pWal, WAL_READ_LOCK(i), 1); 257073b64e4dSdrh if( rc==SQLITE_OK ){ 2571fc1acf33Sdan mxReadMark = pInfo->aReadMark[i] = mxFrame; 257273b64e4dSdrh mxI = i; 257373b64e4dSdrh walUnlockExclusive(pWal, WAL_READ_LOCK(i), 1); 257473b64e4dSdrh break; 257538933f2cSdrh }else if( rc!=SQLITE_BUSY ){ 257638933f2cSdrh return rc; 257773b64e4dSdrh } 257873b64e4dSdrh } 257973b64e4dSdrh } 2580658d76c9Sdrh if( mxI==0 ){ 25815bf39346Sdrh assert( rc==SQLITE_BUSY || (pWal->readOnly & WAL_SHM_RDONLY)!=0 ); 25827e45e3a5Sdrh return rc==SQLITE_BUSY ? WAL_RETRY : SQLITE_READONLY_CANTINIT; 2583658d76c9Sdrh } 258473b64e4dSdrh 258573b64e4dSdrh rc = walLockShared(pWal, WAL_READ_LOCK(mxI)); 258673b64e4dSdrh if( rc ){ 258773b64e4dSdrh return rc==SQLITE_BUSY ? WAL_RETRY : rc; 258873b64e4dSdrh } 2589eb8cb3a8Sdan /* Now that the read-lock has been obtained, check that neither the 2590eb8cb3a8Sdan ** value in the aReadMark[] array or the contents of the wal-index 2591eb8cb3a8Sdan ** header have changed. 2592eb8cb3a8Sdan ** 2593eb8cb3a8Sdan ** It is necessary to check that the wal-index header did not change 2594eb8cb3a8Sdan ** between the time it was read and when the shared-lock was obtained 2595eb8cb3a8Sdan ** on WAL_READ_LOCK(mxI) was obtained to account for the possibility 2596eb8cb3a8Sdan ** that the log file may have been wrapped by a writer, or that frames 2597eb8cb3a8Sdan ** that occur later in the log than pWal->hdr.mxFrame may have been 2598eb8cb3a8Sdan ** copied into the database by a checkpointer. If either of these things 2599eb8cb3a8Sdan ** happened, then reading the database with the current value of 2600eb8cb3a8Sdan ** pWal->hdr.mxFrame risks reading a corrupted snapshot. So, retry 2601eb8cb3a8Sdan ** instead. 2602eb8cb3a8Sdan ** 2603b8c7cfb8Sdan ** Before checking that the live wal-index header has not changed 2604b8c7cfb8Sdan ** since it was read, set Wal.minFrame to the first frame in the wal 2605b8c7cfb8Sdan ** file that has not yet been checkpointed. This client will not need 2606b8c7cfb8Sdan ** to read any frames earlier than minFrame from the wal file - they 2607b8c7cfb8Sdan ** can be safely read directly from the database file. 2608b8c7cfb8Sdan ** 2609b8c7cfb8Sdan ** Because a ShmBarrier() call is made between taking the copy of 2610b8c7cfb8Sdan ** nBackfill and checking that the wal-header in shared-memory still 2611b8c7cfb8Sdan ** matches the one cached in pWal->hdr, it is guaranteed that the 2612b8c7cfb8Sdan ** checkpointer that set nBackfill was not working with a wal-index 2613b8c7cfb8Sdan ** header newer than that cached in pWal->hdr. If it were, that could 2614b8c7cfb8Sdan ** cause a problem. The checkpointer could omit to checkpoint 2615b8c7cfb8Sdan ** a version of page X that lies before pWal->minFrame (call that version 2616b8c7cfb8Sdan ** A) on the basis that there is a newer version (version B) of the same 2617b8c7cfb8Sdan ** page later in the wal file. But if version B happens to like past 2618b8c7cfb8Sdan ** frame pWal->hdr.mxFrame - then the client would incorrectly assume 2619b8c7cfb8Sdan ** that it can read version A from the database file. However, since 2620b8c7cfb8Sdan ** we can guarantee that the checkpointer that set nBackfill could not 2621b8c7cfb8Sdan ** see any pages past pWal->hdr.mxFrame, this problem does not come up. 2622eb8cb3a8Sdan */ 2623b8c7cfb8Sdan pWal->minFrame = pInfo->nBackfill+1; 26248c408004Sdan walShmBarrier(pWal); 262573b64e4dSdrh if( pInfo->aReadMark[mxI]!=mxReadMark 26264280eb30Sdan || memcmp((void *)walIndexHdr(pWal), &pWal->hdr, sizeof(WalIndexHdr)) 262773b64e4dSdrh ){ 262873b64e4dSdrh walUnlockShared(pWal, WAL_READ_LOCK(mxI)); 262973b64e4dSdrh return WAL_RETRY; 263073b64e4dSdrh }else{ 2631db7f647eSdrh assert( mxReadMark<=pWal->hdr.mxFrame ); 26325eba1f60Sshaneh pWal->readLock = (i16)mxI; 263373b64e4dSdrh } 263473b64e4dSdrh return rc; 263573b64e4dSdrh } 263673b64e4dSdrh 2637bc88711dSdrh #ifdef SQLITE_ENABLE_SNAPSHOT 263873b64e4dSdrh /* 263993f51132Sdan ** Attempt to reduce the value of the WalCkptInfo.nBackfillAttempted 264093f51132Sdan ** variable so that older snapshots can be accessed. To do this, loop 264193f51132Sdan ** through all wal frames from nBackfillAttempted to (nBackfill+1), 264293f51132Sdan ** comparing their content to the corresponding page with the database 264393f51132Sdan ** file, if any. Set nBackfillAttempted to the frame number of the 264493f51132Sdan ** first frame for which the wal file content matches the db file. 264593f51132Sdan ** 264693f51132Sdan ** This is only really safe if the file-system is such that any page 264793f51132Sdan ** writes made by earlier checkpointers were atomic operations, which 264893f51132Sdan ** is not always true. It is also possible that nBackfillAttempted 264993f51132Sdan ** may be left set to a value larger than expected, if a wal frame 265093f51132Sdan ** contains content that duplicate of an earlier version of the same 265193f51132Sdan ** page. 265293f51132Sdan ** 265393f51132Sdan ** SQLITE_OK is returned if successful, or an SQLite error code if an 265493f51132Sdan ** error occurs. It is not an error if nBackfillAttempted cannot be 265593f51132Sdan ** decreased at all. 26561158498dSdan */ 26571158498dSdan int sqlite3WalSnapshotRecover(Wal *pWal){ 26581158498dSdan int rc; 26591158498dSdan 266093f51132Sdan assert( pWal->readLock>=0 ); 26611158498dSdan rc = walLockExclusive(pWal, WAL_CKPT_LOCK, 1); 26621158498dSdan if( rc==SQLITE_OK ){ 26631158498dSdan volatile WalCkptInfo *pInfo = walCkptInfo(pWal); 26641158498dSdan int szPage = (int)pWal->szPage; 26656a9e7f16Sdan i64 szDb; /* Size of db file in bytes */ 26666a9e7f16Sdan 26676a9e7f16Sdan rc = sqlite3OsFileSize(pWal->pDbFd, &szDb); 26686a9e7f16Sdan if( rc==SQLITE_OK ){ 26691158498dSdan void *pBuf1 = sqlite3_malloc(szPage); 26701158498dSdan void *pBuf2 = sqlite3_malloc(szPage); 26711158498dSdan if( pBuf1==0 || pBuf2==0 ){ 26721158498dSdan rc = SQLITE_NOMEM; 26731158498dSdan }else{ 26741158498dSdan u32 i = pInfo->nBackfillAttempted; 26751158498dSdan for(i=pInfo->nBackfillAttempted; i>pInfo->nBackfill; i--){ 26764ece2f26Sdrh WalHashLoc sLoc; /* Hash table location */ 26771158498dSdan u32 pgno; /* Page number in db file */ 26781158498dSdan i64 iDbOff; /* Offset of db file entry */ 26791158498dSdan i64 iWalOff; /* Offset of wal file entry */ 26801158498dSdan 26814ece2f26Sdrh rc = walHashGet(pWal, walFramePage(i), &sLoc); 26826a9e7f16Sdan if( rc!=SQLITE_OK ) break; 26834ece2f26Sdrh pgno = sLoc.aPgno[i-sLoc.iZero]; 26841158498dSdan iDbOff = (i64)(pgno-1) * szPage; 26856a9e7f16Sdan 26866a9e7f16Sdan if( iDbOff+szPage<=szDb ){ 26871158498dSdan iWalOff = walFrameOffset(i, szPage) + WAL_FRAME_HDRSIZE; 26881158498dSdan rc = sqlite3OsRead(pWal->pWalFd, pBuf1, szPage, iWalOff); 26891158498dSdan 26901158498dSdan if( rc==SQLITE_OK ){ 26911158498dSdan rc = sqlite3OsRead(pWal->pDbFd, pBuf2, szPage, iDbOff); 26921158498dSdan } 26931158498dSdan 26941158498dSdan if( rc!=SQLITE_OK || 0==memcmp(pBuf1, pBuf2, szPage) ){ 26951158498dSdan break; 26961158498dSdan } 26976a9e7f16Sdan } 26981158498dSdan 26991158498dSdan pInfo->nBackfillAttempted = i-1; 27001158498dSdan } 27011158498dSdan } 27021158498dSdan 27031158498dSdan sqlite3_free(pBuf1); 27041158498dSdan sqlite3_free(pBuf2); 27056a9e7f16Sdan } 27061158498dSdan walUnlockExclusive(pWal, WAL_CKPT_LOCK, 1); 27071158498dSdan } 27081158498dSdan 27091158498dSdan return rc; 27101158498dSdan } 2711bc88711dSdrh #endif /* SQLITE_ENABLE_SNAPSHOT */ 27121158498dSdan 27131158498dSdan /* 271473b64e4dSdrh ** Begin a read transaction on the database. 271573b64e4dSdrh ** 271673b64e4dSdrh ** This routine used to be called sqlite3OpenSnapshot() and with good reason: 271773b64e4dSdrh ** it takes a snapshot of the state of the WAL and wal-index for the current 271873b64e4dSdrh ** instant in time. The current thread will continue to use this snapshot. 271973b64e4dSdrh ** Other threads might append new content to the WAL and wal-index but 272073b64e4dSdrh ** that extra content is ignored by the current thread. 272173b64e4dSdrh ** 272273b64e4dSdrh ** If the database contents have changes since the previous read 272373b64e4dSdrh ** transaction, then *pChanged is set to 1 before returning. The 272473b64e4dSdrh ** Pager layer will use this to know that is cache is stale and 272573b64e4dSdrh ** needs to be flushed. 272673b64e4dSdrh */ 272766dfec8bSdrh int sqlite3WalBeginReadTransaction(Wal *pWal, int *pChanged){ 272873b64e4dSdrh int rc; /* Return code */ 2729aab4c02eSdrh int cnt = 0; /* Number of TryBeginRead attempts */ 273073b64e4dSdrh 2731fc1acf33Sdan #ifdef SQLITE_ENABLE_SNAPSHOT 2732fc1acf33Sdan int bChanged = 0; 2733fc1acf33Sdan WalIndexHdr *pSnapshot = pWal->pSnapshot; 2734998147ecSdrh if( pSnapshot && memcmp(pSnapshot, &pWal->hdr, sizeof(WalIndexHdr))!=0 ){ 2735fc1acf33Sdan bChanged = 1; 2736fc1acf33Sdan } 2737fc1acf33Sdan #endif 2738fc1acf33Sdan 273973b64e4dSdrh do{ 2740aab4c02eSdrh rc = walTryBeginRead(pWal, pChanged, 0, ++cnt); 274173b64e4dSdrh }while( rc==WAL_RETRY ); 2742ab1cc746Sdrh testcase( (rc&0xff)==SQLITE_BUSY ); 2743ab1cc746Sdrh testcase( (rc&0xff)==SQLITE_IOERR ); 2744ab1cc746Sdrh testcase( rc==SQLITE_PROTOCOL ); 2745ab1cc746Sdrh testcase( rc==SQLITE_OK ); 2746fc1acf33Sdan 2747fc1acf33Sdan #ifdef SQLITE_ENABLE_SNAPSHOT 2748fc1acf33Sdan if( rc==SQLITE_OK ){ 2749998147ecSdrh if( pSnapshot && memcmp(pSnapshot, &pWal->hdr, sizeof(WalIndexHdr))!=0 ){ 275065127cd5Sdan /* At this point the client has a lock on an aReadMark[] slot holding 27513bf83ccdSdan ** a value equal to or smaller than pSnapshot->mxFrame, but pWal->hdr 27523bf83ccdSdan ** is populated with the wal-index header corresponding to the head 27533bf83ccdSdan ** of the wal file. Verify that pSnapshot is still valid before 27543bf83ccdSdan ** continuing. Reasons why pSnapshot might no longer be valid: 275565127cd5Sdan ** 2756998147ecSdrh ** (1) The WAL file has been reset since the snapshot was taken. 2757998147ecSdrh ** In this case, the salt will have changed. 275865127cd5Sdan ** 2759998147ecSdrh ** (2) A checkpoint as been attempted that wrote frames past 2760998147ecSdrh ** pSnapshot->mxFrame into the database file. Note that the 2761998147ecSdrh ** checkpoint need not have completed for this to cause problems. 276265127cd5Sdan */ 2763fc1acf33Sdan volatile WalCkptInfo *pInfo = walCkptInfo(pWal); 276465127cd5Sdan 276571b62fa4Sdrh assert( pWal->readLock>0 || pWal->hdr.mxFrame==0 ); 2766fc1acf33Sdan assert( pInfo->aReadMark[pWal->readLock]<=pSnapshot->mxFrame ); 276765127cd5Sdan 27683bf83ccdSdan /* It is possible that there is a checkpointer thread running 27693bf83ccdSdan ** concurrent with this code. If this is the case, it may be that the 27703bf83ccdSdan ** checkpointer has already determined that it will checkpoint 27713bf83ccdSdan ** snapshot X, where X is later in the wal file than pSnapshot, but 27723bf83ccdSdan ** has not yet set the pInfo->nBackfillAttempted variable to indicate 27733bf83ccdSdan ** its intent. To avoid the race condition this leads to, ensure that 27743bf83ccdSdan ** there is no checkpointer process by taking a shared CKPT lock 27751158498dSdan ** before checking pInfo->nBackfillAttempted. 27761158498dSdan ** 27771158498dSdan ** TODO: Does the aReadMark[] lock prevent a checkpointer from doing 27781158498dSdan ** this already? 27791158498dSdan */ 27803bf83ccdSdan rc = walLockShared(pWal, WAL_CKPT_LOCK); 27813bf83ccdSdan 2782a7aeb398Sdan if( rc==SQLITE_OK ){ 27833bf83ccdSdan /* Check that the wal file has not been wrapped. Assuming that it has 2784a7aeb398Sdan ** not, also check that no checkpointer has attempted to checkpoint any 2785a7aeb398Sdan ** frames beyond pSnapshot->mxFrame. If either of these conditions are 2786a7aeb398Sdan ** true, return SQLITE_BUSY_SNAPSHOT. Otherwise, overwrite pWal->hdr 27873bf83ccdSdan ** with *pSnapshot and set *pChanged as appropriate for opening the 27883bf83ccdSdan ** snapshot. */ 2789a7aeb398Sdan if( !memcmp(pSnapshot->aSalt, pWal->hdr.aSalt, sizeof(pWal->hdr.aSalt)) 2790998147ecSdrh && pSnapshot->mxFrame>=pInfo->nBackfillAttempted 279165127cd5Sdan ){ 27920f308f5dSdan assert( pWal->readLock>0 ); 2793fc1acf33Sdan memcpy(&pWal->hdr, pSnapshot, sizeof(WalIndexHdr)); 2794fc1acf33Sdan *pChanged = bChanged; 2795fc1acf33Sdan }else{ 2796fc1acf33Sdan rc = SQLITE_BUSY_SNAPSHOT; 2797fc1acf33Sdan } 279865127cd5Sdan 27993bf83ccdSdan /* Release the shared CKPT lock obtained above. */ 28003bf83ccdSdan walUnlockShared(pWal, WAL_CKPT_LOCK); 2801a7aeb398Sdan } 2802a7aeb398Sdan 28033bf83ccdSdan 2804fc1acf33Sdan if( rc!=SQLITE_OK ){ 2805fc1acf33Sdan sqlite3WalEndReadTransaction(pWal); 2806fc1acf33Sdan } 2807fc1acf33Sdan } 2808fc1acf33Sdan } 2809fc1acf33Sdan #endif 2810c438efd6Sdrh return rc; 2811c438efd6Sdrh } 2812c438efd6Sdrh 2813c438efd6Sdrh /* 281473b64e4dSdrh ** Finish with a read transaction. All this does is release the 281573b64e4dSdrh ** read-lock. 2816c438efd6Sdrh */ 281773b64e4dSdrh void sqlite3WalEndReadTransaction(Wal *pWal){ 281873d66fdbSdan sqlite3WalEndWriteTransaction(pWal); 281973b64e4dSdrh if( pWal->readLock>=0 ){ 282073b64e4dSdrh walUnlockShared(pWal, WAL_READ_LOCK(pWal->readLock)); 282173b64e4dSdrh pWal->readLock = -1; 282273b64e4dSdrh } 2823c438efd6Sdrh } 2824c438efd6Sdrh 2825c438efd6Sdrh /* 282699bd1097Sdan ** Search the wal file for page pgno. If found, set *piRead to the frame that 282799bd1097Sdan ** contains the page. Otherwise, if pgno is not in the wal file, set *piRead 282899bd1097Sdan ** to zero. 282973b64e4dSdrh ** 283099bd1097Sdan ** Return SQLITE_OK if successful, or an error code if an error occurs. If an 283199bd1097Sdan ** error does occur, the final value of *piRead is undefined. 2832c438efd6Sdrh */ 283399bd1097Sdan int sqlite3WalFindFrame( 2834bb23aff3Sdan Wal *pWal, /* WAL handle */ 2835bb23aff3Sdan Pgno pgno, /* Database page number to read data for */ 283699bd1097Sdan u32 *piRead /* OUT: Frame number (or zero) */ 2837b6e099a9Sdan ){ 2838bb23aff3Sdan u32 iRead = 0; /* If !=0, WAL frame to return data from */ 2839027a128aSdrh u32 iLast = pWal->hdr.mxFrame; /* Last page in WAL for this reader */ 2840bb23aff3Sdan int iHash; /* Used to loop through N hash tables */ 28416df003c7Sdan int iMinHash; 2842c438efd6Sdrh 2843aab4c02eSdrh /* This routine is only be called from within a read transaction. */ 2844aab4c02eSdrh assert( pWal->readLock>=0 || pWal->lockError ); 284573b64e4dSdrh 2846bb23aff3Sdan /* If the "last page" field of the wal-index header snapshot is 0, then 2847bb23aff3Sdan ** no data will be read from the wal under any circumstances. Return early 2848a927e94eSdrh ** in this case as an optimization. Likewise, if pWal->readLock==0, 2849a927e94eSdrh ** then the WAL is ignored by the reader so return early, as if the 2850a927e94eSdrh ** WAL were empty. 2851bb23aff3Sdan */ 285285bc6df2Sdrh if( iLast==0 || (pWal->readLock==0 && pWal->bShmUnreliable==0) ){ 285399bd1097Sdan *piRead = 0; 2854bb23aff3Sdan return SQLITE_OK; 2855bb23aff3Sdan } 2856bb23aff3Sdan 2857bb23aff3Sdan /* Search the hash table or tables for an entry matching page number 2858bb23aff3Sdan ** pgno. Each iteration of the following for() loop searches one 2859bb23aff3Sdan ** hash table (each hash table indexes up to HASHTABLE_NPAGE frames). 2860bb23aff3Sdan ** 2861a927e94eSdrh ** This code might run concurrently to the code in walIndexAppend() 2862bb23aff3Sdan ** that adds entries to the wal-index (and possibly to this hash 28636e81096fSdrh ** table). This means the value just read from the hash 2864bb23aff3Sdan ** slot (aHash[iKey]) may have been added before or after the 2865bb23aff3Sdan ** current read transaction was opened. Values added after the 2866bb23aff3Sdan ** read transaction was opened may have been written incorrectly - 2867bb23aff3Sdan ** i.e. these slots may contain garbage data. However, we assume 2868bb23aff3Sdan ** that any slots written before the current read transaction was 2869bb23aff3Sdan ** opened remain unmodified. 2870bb23aff3Sdan ** 2871bb23aff3Sdan ** For the reasons above, the if(...) condition featured in the inner 2872bb23aff3Sdan ** loop of the following block is more stringent that would be required 2873bb23aff3Sdan ** if we had exclusive access to the hash-table: 2874bb23aff3Sdan ** 2875bb23aff3Sdan ** (aPgno[iFrame]==pgno): 2876bb23aff3Sdan ** This condition filters out normal hash-table collisions. 2877bb23aff3Sdan ** 2878bb23aff3Sdan ** (iFrame<=iLast): 2879bb23aff3Sdan ** This condition filters out entries that were added to the hash 2880bb23aff3Sdan ** table after the current read-transaction had started. 2881c438efd6Sdrh */ 2882b8c7cfb8Sdan iMinHash = walFramePage(pWal->minFrame); 28838d3e15eeSdrh for(iHash=walFramePage(iLast); iHash>=iMinHash; iHash--){ 28844ece2f26Sdrh WalHashLoc sLoc; /* Hash table location */ 2885bb23aff3Sdan int iKey; /* Hash slot index */ 2886519426aaSdrh int nCollide; /* Number of hash collisions remaining */ 2887519426aaSdrh int rc; /* Error code */ 2888bb23aff3Sdan 28894ece2f26Sdrh rc = walHashGet(pWal, iHash, &sLoc); 28904280eb30Sdan if( rc!=SQLITE_OK ){ 28914280eb30Sdan return rc; 28924280eb30Sdan } 2893519426aaSdrh nCollide = HASHTABLE_NSLOT; 28944ece2f26Sdrh for(iKey=walHash(pgno); sLoc.aHash[iKey]; iKey=walNextHash(iKey)){ 28954ece2f26Sdrh u32 iFrame = sLoc.aHash[iKey] + sLoc.iZero; 28964ece2f26Sdrh if( iFrame<=iLast && iFrame>=pWal->minFrame 28974ece2f26Sdrh && sLoc.aPgno[sLoc.aHash[iKey]]==pgno ){ 2898622a53d5Sdrh assert( iFrame>iRead || CORRUPT_DB ); 2899bb23aff3Sdan iRead = iFrame; 2900c438efd6Sdrh } 2901519426aaSdrh if( (nCollide--)==0 ){ 2902519426aaSdrh return SQLITE_CORRUPT_BKPT; 2903519426aaSdrh } 2904c438efd6Sdrh } 29058d3e15eeSdrh if( iRead ) break; 2906bb23aff3Sdan } 2907c438efd6Sdrh 2908bb23aff3Sdan #ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT 2909bb23aff3Sdan /* If expensive assert() statements are available, do a linear search 2910bb23aff3Sdan ** of the wal-index file content. Make sure the results agree with the 2911bb23aff3Sdan ** result obtained using the hash indexes above. */ 2912bb23aff3Sdan { 2913bb23aff3Sdan u32 iRead2 = 0; 2914bb23aff3Sdan u32 iTest; 291585bc6df2Sdrh assert( pWal->bShmUnreliable || pWal->minFrame>0 ); 29166c9d8f64Sdan for(iTest=iLast; iTest>=pWal->minFrame && iTest>0; iTest--){ 291713a3cb82Sdan if( walFramePgno(pWal, iTest)==pgno ){ 2918bb23aff3Sdan iRead2 = iTest; 2919c438efd6Sdrh break; 2920c438efd6Sdrh } 2921c438efd6Sdrh } 2922bb23aff3Sdan assert( iRead==iRead2 ); 2923c438efd6Sdrh } 2924bb23aff3Sdan #endif 2925cd11fb28Sdan 292699bd1097Sdan *piRead = iRead; 292799bd1097Sdan return SQLITE_OK; 292899bd1097Sdan } 292999bd1097Sdan 293099bd1097Sdan /* 293199bd1097Sdan ** Read the contents of frame iRead from the wal file into buffer pOut 293299bd1097Sdan ** (which is nOut bytes in size). Return SQLITE_OK if successful, or an 293399bd1097Sdan ** error code otherwise. 2934c438efd6Sdrh */ 293599bd1097Sdan int sqlite3WalReadFrame( 293699bd1097Sdan Wal *pWal, /* WAL handle */ 293799bd1097Sdan u32 iRead, /* Frame to read */ 293899bd1097Sdan int nOut, /* Size of buffer pOut in bytes */ 293999bd1097Sdan u8 *pOut /* Buffer to write page data to */ 294099bd1097Sdan ){ 2941b2eced5dSdrh int sz; 2942b2eced5dSdrh i64 iOffset; 2943b2eced5dSdrh sz = pWal->hdr.szPage; 2944b07028f7Sdrh sz = (sz&0xfe00) + ((sz&0x0001)<<16); 29459b78f791Sdrh testcase( sz<=32768 ); 29469b78f791Sdrh testcase( sz>=65536 ); 2947b2eced5dSdrh iOffset = walFrameOffset(iRead, sz) + WAL_FRAME_HDRSIZE; 294809b5dbc5Sdrh /* testcase( IS_BIG_INT(iOffset) ); // requires a 4GiB WAL */ 2949f602963dSdan return sqlite3OsRead(pWal->pWalFd, pOut, (nOut>sz ? sz : nOut), iOffset); 2950c438efd6Sdrh } 2951c438efd6Sdrh 2952c438efd6Sdrh /* 2953763afe62Sdan ** Return the size of the database in pages (or zero, if unknown). 2954c438efd6Sdrh */ 2955763afe62Sdan Pgno sqlite3WalDbsize(Wal *pWal){ 29567e9e70b1Sdrh if( pWal && ALWAYS(pWal->readLock>=0) ){ 2957763afe62Sdan return pWal->hdr.nPage; 2958763afe62Sdan } 2959763afe62Sdan return 0; 2960c438efd6Sdrh } 2961c438efd6Sdrh 296230c8629eSdan 296373b64e4dSdrh /* 296473b64e4dSdrh ** This function starts a write transaction on the WAL. 296573b64e4dSdrh ** 296673b64e4dSdrh ** A read transaction must have already been started by a prior call 296773b64e4dSdrh ** to sqlite3WalBeginReadTransaction(). 296873b64e4dSdrh ** 296973b64e4dSdrh ** If another thread or process has written into the database since 297073b64e4dSdrh ** the read transaction was started, then it is not possible for this 297173b64e4dSdrh ** thread to write as doing so would cause a fork. So this routine 297273b64e4dSdrh ** returns SQLITE_BUSY in that case and no write transaction is started. 297373b64e4dSdrh ** 297473b64e4dSdrh ** There can only be a single writer active at a time. 297530c8629eSdan */ 297673b64e4dSdrh int sqlite3WalBeginWriteTransaction(Wal *pWal){ 297773b64e4dSdrh int rc; 297873b64e4dSdrh 297973b64e4dSdrh /* Cannot start a write transaction without first holding a read 298073b64e4dSdrh ** transaction. */ 298173b64e4dSdrh assert( pWal->readLock>=0 ); 2982c9a9022bSdan assert( pWal->writeLock==0 && pWal->iReCksum==0 ); 298373b64e4dSdrh 29841e5de5a1Sdan if( pWal->readOnly ){ 29851e5de5a1Sdan return SQLITE_READONLY; 29861e5de5a1Sdan } 29871e5de5a1Sdan 298873b64e4dSdrh /* Only one writer allowed at a time. Get the write lock. Return 298973b64e4dSdrh ** SQLITE_BUSY if unable. 299073b64e4dSdrh */ 2991ab372773Sdrh rc = walLockExclusive(pWal, WAL_WRITE_LOCK, 1); 299273b64e4dSdrh if( rc ){ 299373b64e4dSdrh return rc; 299430c8629eSdan } 2995c99597caSdrh pWal->writeLock = 1; 299673b64e4dSdrh 299773b64e4dSdrh /* If another connection has written to the database file since the 299873b64e4dSdrh ** time the read transaction on this connection was started, then 299973b64e4dSdrh ** the write is disallowed. 300073b64e4dSdrh */ 30014280eb30Sdan if( memcmp(&pWal->hdr, (void *)walIndexHdr(pWal), sizeof(WalIndexHdr))!=0 ){ 300273b64e4dSdrh walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1); 3003c99597caSdrh pWal->writeLock = 0; 3004f73819afSdan rc = SQLITE_BUSY_SNAPSHOT; 300530c8629eSdan } 300673b64e4dSdrh 3007c438efd6Sdrh return rc; 3008c438efd6Sdrh } 3009c438efd6Sdrh 3010c438efd6Sdrh /* 301173b64e4dSdrh ** End a write transaction. The commit has already been done. This 301273b64e4dSdrh ** routine merely releases the lock. 301373b64e4dSdrh */ 301473b64e4dSdrh int sqlite3WalEndWriteTransaction(Wal *pWal){ 3015da9fe0c3Sdan if( pWal->writeLock ){ 301673b64e4dSdrh walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1); 3017d54ff60bSdan pWal->writeLock = 0; 3018c9a9022bSdan pWal->iReCksum = 0; 3019f60b7f36Sdan pWal->truncateOnCommit = 0; 3020da9fe0c3Sdan } 302173b64e4dSdrh return SQLITE_OK; 302273b64e4dSdrh } 302373b64e4dSdrh 302473b64e4dSdrh /* 3025c438efd6Sdrh ** If any data has been written (but not committed) to the log file, this 3026c438efd6Sdrh ** function moves the write-pointer back to the start of the transaction. 3027c438efd6Sdrh ** 3028c438efd6Sdrh ** Additionally, the callback function is invoked for each frame written 302973b64e4dSdrh ** to the WAL since the start of the transaction. If the callback returns 3030c438efd6Sdrh ** other than SQLITE_OK, it is not invoked again and the error code is 3031c438efd6Sdrh ** returned to the caller. 3032c438efd6Sdrh ** 3033c438efd6Sdrh ** Otherwise, if the callback function does not return an error, this 3034c438efd6Sdrh ** function returns SQLITE_OK. 3035c438efd6Sdrh */ 30367ed91f23Sdrh int sqlite3WalUndo(Wal *pWal, int (*xUndo)(void *, Pgno), void *pUndoCtx){ 30375543759bSdan int rc = SQLITE_OK; 30387e9e70b1Sdrh if( ALWAYS(pWal->writeLock) ){ 3039027a128aSdrh Pgno iMax = pWal->hdr.mxFrame; 3040c438efd6Sdrh Pgno iFrame; 3041c438efd6Sdrh 30425d656852Sdan /* Restore the clients cache of the wal-index header to the state it 30435d656852Sdan ** was in before the client began writing to the database. 30445d656852Sdan */ 3045067f3165Sdan memcpy(&pWal->hdr, (void *)walIndexHdr(pWal), sizeof(WalIndexHdr)); 30465d656852Sdan 30470626bd65Sdan for(iFrame=pWal->hdr.mxFrame+1; 3048664f85ddSdrh ALWAYS(rc==SQLITE_OK) && iFrame<=iMax; 30490626bd65Sdan iFrame++ 30500626bd65Sdan ){ 30510626bd65Sdan /* This call cannot fail. Unless the page for which the page number 30520626bd65Sdan ** is passed as the second argument is (a) in the cache and 30530626bd65Sdan ** (b) has an outstanding reference, then xUndo is either a no-op 30540626bd65Sdan ** (if (a) is false) or simply expels the page from the cache (if (b) 30550626bd65Sdan ** is false). 30560626bd65Sdan ** 30570626bd65Sdan ** If the upper layer is doing a rollback, it is guaranteed that there 30580626bd65Sdan ** are no outstanding references to any page other than page 1. And 30590626bd65Sdan ** page 1 is never written to the log until the transaction is 30600626bd65Sdan ** committed. As a result, the call to xUndo may not fail. 30610626bd65Sdan */ 306213a3cb82Sdan assert( walFramePgno(pWal, iFrame)!=1 ); 306313a3cb82Sdan rc = xUndo(pUndoCtx, walFramePgno(pWal, iFrame)); 3064c438efd6Sdrh } 30657eb05752Sdan if( iMax!=pWal->hdr.mxFrame ) walCleanupHash(pWal); 30666f150148Sdan } 3067c438efd6Sdrh return rc; 3068c438efd6Sdrh } 3069c438efd6Sdrh 307071d89919Sdan /* 307171d89919Sdan ** Argument aWalData must point to an array of WAL_SAVEPOINT_NDATA u32 307271d89919Sdan ** values. This function populates the array with values required to 307371d89919Sdan ** "rollback" the write position of the WAL handle back to the current 307471d89919Sdan ** point in the event of a savepoint rollback (via WalSavepointUndo()). 30757ed91f23Sdrh */ 307671d89919Sdan void sqlite3WalSavepoint(Wal *pWal, u32 *aWalData){ 307773b64e4dSdrh assert( pWal->writeLock ); 307871d89919Sdan aWalData[0] = pWal->hdr.mxFrame; 307971d89919Sdan aWalData[1] = pWal->hdr.aFrameCksum[0]; 308071d89919Sdan aWalData[2] = pWal->hdr.aFrameCksum[1]; 30816e6bd565Sdan aWalData[3] = pWal->nCkpt; 30824cd78b4dSdan } 30834cd78b4dSdan 308471d89919Sdan /* 308571d89919Sdan ** Move the write position of the WAL back to the point identified by 308671d89919Sdan ** the values in the aWalData[] array. aWalData must point to an array 308771d89919Sdan ** of WAL_SAVEPOINT_NDATA u32 values that has been previously populated 308871d89919Sdan ** by a call to WalSavepoint(). 30897ed91f23Sdrh */ 309071d89919Sdan int sqlite3WalSavepointUndo(Wal *pWal, u32 *aWalData){ 30914cd78b4dSdan int rc = SQLITE_OK; 30924cd78b4dSdan 30936e6bd565Sdan assert( pWal->writeLock ); 30946e6bd565Sdan assert( aWalData[3]!=pWal->nCkpt || aWalData[0]<=pWal->hdr.mxFrame ); 30956e6bd565Sdan 30966e6bd565Sdan if( aWalData[3]!=pWal->nCkpt ){ 30976e6bd565Sdan /* This savepoint was opened immediately after the write-transaction 30986e6bd565Sdan ** was started. Right after that, the writer decided to wrap around 30996e6bd565Sdan ** to the start of the log. Update the savepoint values to match. 31006e6bd565Sdan */ 31016e6bd565Sdan aWalData[0] = 0; 31026e6bd565Sdan aWalData[3] = pWal->nCkpt; 31036e6bd565Sdan } 31046e6bd565Sdan 310571d89919Sdan if( aWalData[0]<pWal->hdr.mxFrame ){ 310671d89919Sdan pWal->hdr.mxFrame = aWalData[0]; 310771d89919Sdan pWal->hdr.aFrameCksum[0] = aWalData[1]; 310871d89919Sdan pWal->hdr.aFrameCksum[1] = aWalData[2]; 31094fa95bfcSdrh walCleanupHash(pWal); 31106e6bd565Sdan } 31116e6bd565Sdan 31124cd78b4dSdan return rc; 31134cd78b4dSdan } 31144cd78b4dSdan 3115c438efd6Sdrh /* 31169971e710Sdan ** This function is called just before writing a set of frames to the log 31179971e710Sdan ** file (see sqlite3WalFrames()). It checks to see if, instead of appending 31189971e710Sdan ** to the current log file, it is possible to overwrite the start of the 31199971e710Sdan ** existing log file with the new frames (i.e. "reset" the log). If so, 31209971e710Sdan ** it sets pWal->hdr.mxFrame to 0. Otherwise, pWal->hdr.mxFrame is left 31219971e710Sdan ** unchanged. 31229971e710Sdan ** 31239971e710Sdan ** SQLITE_OK is returned if no error is encountered (regardless of whether 31249971e710Sdan ** or not pWal->hdr.mxFrame is modified). An SQLite error code is returned 31254533cd05Sdrh ** if an error occurs. 31269971e710Sdan */ 31279971e710Sdan static int walRestartLog(Wal *pWal){ 31289971e710Sdan int rc = SQLITE_OK; 3129aab4c02eSdrh int cnt; 3130aab4c02eSdrh 313113a3cb82Sdan if( pWal->readLock==0 ){ 31329971e710Sdan volatile WalCkptInfo *pInfo = walCkptInfo(pWal); 31339971e710Sdan assert( pInfo->nBackfill==pWal->hdr.mxFrame ); 31349971e710Sdan if( pInfo->nBackfill>0 ){ 3135658d76c9Sdrh u32 salt1; 3136658d76c9Sdrh sqlite3_randomness(4, &salt1); 3137ab372773Sdrh rc = walLockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1); 31389971e710Sdan if( rc==SQLITE_OK ){ 31399971e710Sdan /* If all readers are using WAL_READ_LOCK(0) (in other words if no 31409971e710Sdan ** readers are currently using the WAL), then the transactions 31419971e710Sdan ** frames will overwrite the start of the existing log. Update the 31429971e710Sdan ** wal-index header to reflect this. 31439971e710Sdan ** 31449971e710Sdan ** In theory it would be Ok to update the cache of the header only 31459971e710Sdan ** at this point. But updating the actual wal-index header is also 31469971e710Sdan ** safe and means there is no special case for sqlite3WalUndo() 3147f26a1549Sdan ** to handle if this transaction is rolled back. */ 31480fe8c1b9Sdan walRestartHdr(pWal, salt1); 31499971e710Sdan walUnlockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1); 31504533cd05Sdrh }else if( rc!=SQLITE_BUSY ){ 31514533cd05Sdrh return rc; 31529971e710Sdan } 31539971e710Sdan } 31549971e710Sdan walUnlockShared(pWal, WAL_READ_LOCK(0)); 31559971e710Sdan pWal->readLock = -1; 3156aab4c02eSdrh cnt = 0; 31579971e710Sdan do{ 31589971e710Sdan int notUsed; 3159aab4c02eSdrh rc = walTryBeginRead(pWal, ¬Used, 1, ++cnt); 31609971e710Sdan }while( rc==WAL_RETRY ); 3161c90e0811Sdrh assert( (rc&0xff)!=SQLITE_BUSY ); /* BUSY not possible when useWal==1 */ 3162ab1cc746Sdrh testcase( (rc&0xff)==SQLITE_IOERR ); 3163ab1cc746Sdrh testcase( rc==SQLITE_PROTOCOL ); 3164ab1cc746Sdrh testcase( rc==SQLITE_OK ); 31659971e710Sdan } 31669971e710Sdan return rc; 31679971e710Sdan } 31689971e710Sdan 31699971e710Sdan /* 3170d992b150Sdrh ** Information about the current state of the WAL file and where 3171d992b150Sdrh ** the next fsync should occur - passed from sqlite3WalFrames() into 3172d992b150Sdrh ** walWriteToLog(). 3173d992b150Sdrh */ 3174d992b150Sdrh typedef struct WalWriter { 3175d992b150Sdrh Wal *pWal; /* The complete WAL information */ 3176d992b150Sdrh sqlite3_file *pFd; /* The WAL file to which we write */ 3177d992b150Sdrh sqlite3_int64 iSyncPoint; /* Fsync at this offset */ 3178d992b150Sdrh int syncFlags; /* Flags for the fsync */ 3179d992b150Sdrh int szPage; /* Size of one page */ 3180d992b150Sdrh } WalWriter; 3181d992b150Sdrh 3182d992b150Sdrh /* 318388f975a7Sdrh ** Write iAmt bytes of content into the WAL file beginning at iOffset. 3184d992b150Sdrh ** Do a sync when crossing the p->iSyncPoint boundary. 318588f975a7Sdrh ** 3186d992b150Sdrh ** In other words, if iSyncPoint is in between iOffset and iOffset+iAmt, 3187d992b150Sdrh ** first write the part before iSyncPoint, then sync, then write the 3188d992b150Sdrh ** rest. 318988f975a7Sdrh */ 319088f975a7Sdrh static int walWriteToLog( 3191d992b150Sdrh WalWriter *p, /* WAL to write to */ 319288f975a7Sdrh void *pContent, /* Content to be written */ 319388f975a7Sdrh int iAmt, /* Number of bytes to write */ 319488f975a7Sdrh sqlite3_int64 iOffset /* Start writing at this offset */ 319588f975a7Sdrh ){ 319688f975a7Sdrh int rc; 3197d992b150Sdrh if( iOffset<p->iSyncPoint && iOffset+iAmt>=p->iSyncPoint ){ 3198d992b150Sdrh int iFirstAmt = (int)(p->iSyncPoint - iOffset); 3199d992b150Sdrh rc = sqlite3OsWrite(p->pFd, pContent, iFirstAmt, iOffset); 320088f975a7Sdrh if( rc ) return rc; 3201d992b150Sdrh iOffset += iFirstAmt; 3202d992b150Sdrh iAmt -= iFirstAmt; 320388f975a7Sdrh pContent = (void*)(iFirstAmt + (char*)pContent); 3204daaae7b9Sdrh assert( WAL_SYNC_FLAGS(p->syncFlags)!=0 ); 3205daaae7b9Sdrh rc = sqlite3OsSync(p->pFd, WAL_SYNC_FLAGS(p->syncFlags)); 3206cc8d10a0Sdrh if( iAmt==0 || rc ) return rc; 320788f975a7Sdrh } 3208d992b150Sdrh rc = sqlite3OsWrite(p->pFd, pContent, iAmt, iOffset); 3209d992b150Sdrh return rc; 3210d992b150Sdrh } 3211d992b150Sdrh 3212d992b150Sdrh /* 3213d992b150Sdrh ** Write out a single frame of the WAL 3214d992b150Sdrh */ 3215d992b150Sdrh static int walWriteOneFrame( 3216d992b150Sdrh WalWriter *p, /* Where to write the frame */ 3217d992b150Sdrh PgHdr *pPage, /* The page of the frame to be written */ 3218d992b150Sdrh int nTruncate, /* The commit flag. Usually 0. >0 for commit */ 3219d992b150Sdrh sqlite3_int64 iOffset /* Byte offset at which to write */ 3220d992b150Sdrh ){ 3221d992b150Sdrh int rc; /* Result code from subfunctions */ 3222d992b150Sdrh void *pData; /* Data actually written */ 3223d992b150Sdrh u8 aFrame[WAL_FRAME_HDRSIZE]; /* Buffer to assemble frame-header in */ 3224d992b150Sdrh #if defined(SQLITE_HAS_CODEC) 3225fad3039cSmistachkin if( (pData = sqlite3PagerCodec(pPage))==0 ) return SQLITE_NOMEM_BKPT; 3226d992b150Sdrh #else 3227d992b150Sdrh pData = pPage->pData; 3228d992b150Sdrh #endif 3229d992b150Sdrh walEncodeFrame(p->pWal, pPage->pgno, nTruncate, pData, aFrame); 3230d992b150Sdrh rc = walWriteToLog(p, aFrame, sizeof(aFrame), iOffset); 3231d992b150Sdrh if( rc ) return rc; 3232d992b150Sdrh /* Write the page data */ 3233d992b150Sdrh rc = walWriteToLog(p, pData, p->szPage, iOffset+sizeof(aFrame)); 323488f975a7Sdrh return rc; 323588f975a7Sdrh } 323688f975a7Sdrh 323788f975a7Sdrh /* 3238d6f7c979Sdan ** This function is called as part of committing a transaction within which 3239d6f7c979Sdan ** one or more frames have been overwritten. It updates the checksums for 3240c9a9022bSdan ** all frames written to the wal file by the current transaction starting 3241c9a9022bSdan ** with the earliest to have been overwritten. 3242d6f7c979Sdan ** 3243d6f7c979Sdan ** SQLITE_OK is returned if successful, or an SQLite error code otherwise. 3244d6f7c979Sdan */ 3245c9a9022bSdan static int walRewriteChecksums(Wal *pWal, u32 iLast){ 3246d6f7c979Sdan const int szPage = pWal->szPage;/* Database page size */ 3247d6f7c979Sdan int rc = SQLITE_OK; /* Return code */ 3248d6f7c979Sdan u8 *aBuf; /* Buffer to load data from wal file into */ 3249d6f7c979Sdan u8 aFrame[WAL_FRAME_HDRSIZE]; /* Buffer to assemble frame-headers in */ 3250d6f7c979Sdan u32 iRead; /* Next frame to read from wal file */ 3251c9a9022bSdan i64 iCksumOff; 3252d6f7c979Sdan 3253d6f7c979Sdan aBuf = sqlite3_malloc(szPage + WAL_FRAME_HDRSIZE); 3254fad3039cSmistachkin if( aBuf==0 ) return SQLITE_NOMEM_BKPT; 3255d6f7c979Sdan 3256c9a9022bSdan /* Find the checksum values to use as input for the recalculating the 3257c9a9022bSdan ** first checksum. If the first frame is frame 1 (implying that the current 3258c9a9022bSdan ** transaction restarted the wal file), these values must be read from the 3259c9a9022bSdan ** wal-file header. Otherwise, read them from the frame header of the 3260c9a9022bSdan ** previous frame. */ 3261c9a9022bSdan assert( pWal->iReCksum>0 ); 3262c9a9022bSdan if( pWal->iReCksum==1 ){ 3263c9a9022bSdan iCksumOff = 24; 3264c9a9022bSdan }else{ 3265c9a9022bSdan iCksumOff = walFrameOffset(pWal->iReCksum-1, szPage) + 16; 3266c9a9022bSdan } 3267c9a9022bSdan rc = sqlite3OsRead(pWal->pWalFd, aBuf, sizeof(u32)*2, iCksumOff); 3268d6f7c979Sdan pWal->hdr.aFrameCksum[0] = sqlite3Get4byte(aBuf); 3269d6f7c979Sdan pWal->hdr.aFrameCksum[1] = sqlite3Get4byte(&aBuf[sizeof(u32)]); 3270d6f7c979Sdan 3271c9a9022bSdan iRead = pWal->iReCksum; 3272c9a9022bSdan pWal->iReCksum = 0; 3273c9a9022bSdan for(; rc==SQLITE_OK && iRead<=iLast; iRead++){ 3274d6f7c979Sdan i64 iOff = walFrameOffset(iRead, szPage); 3275d6f7c979Sdan rc = sqlite3OsRead(pWal->pWalFd, aBuf, szPage+WAL_FRAME_HDRSIZE, iOff); 3276d6f7c979Sdan if( rc==SQLITE_OK ){ 3277d6f7c979Sdan u32 iPgno, nDbSize; 3278d6f7c979Sdan iPgno = sqlite3Get4byte(aBuf); 3279d6f7c979Sdan nDbSize = sqlite3Get4byte(&aBuf[4]); 3280d6f7c979Sdan 3281d6f7c979Sdan walEncodeFrame(pWal, iPgno, nDbSize, &aBuf[WAL_FRAME_HDRSIZE], aFrame); 3282d6f7c979Sdan rc = sqlite3OsWrite(pWal->pWalFd, aFrame, sizeof(aFrame), iOff); 3283d6f7c979Sdan } 3284d6f7c979Sdan } 3285d6f7c979Sdan 3286d6f7c979Sdan sqlite3_free(aBuf); 3287d6f7c979Sdan return rc; 3288d6f7c979Sdan } 3289d6f7c979Sdan 3290d6f7c979Sdan /* 32914cd78b4dSdan ** Write a set of frames to the log. The caller must hold the write-lock 32929971e710Sdan ** on the log file (obtained using sqlite3WalBeginWriteTransaction()). 3293c438efd6Sdrh */ 3294c438efd6Sdrh int sqlite3WalFrames( 32957ed91f23Sdrh Wal *pWal, /* Wal handle to write to */ 32966e81096fSdrh int szPage, /* Database page-size in bytes */ 3297c438efd6Sdrh PgHdr *pList, /* List of dirty pages to write */ 3298c438efd6Sdrh Pgno nTruncate, /* Database size after this commit */ 3299c438efd6Sdrh int isCommit, /* True if this is a commit */ 3300c438efd6Sdrh int sync_flags /* Flags to pass to OsSync() (or 0) */ 3301c438efd6Sdrh ){ 3302c438efd6Sdrh int rc; /* Used to catch return codes */ 3303c438efd6Sdrh u32 iFrame; /* Next frame address */ 3304c438efd6Sdrh PgHdr *p; /* Iterator to run through pList with. */ 3305e874d9edSdrh PgHdr *pLast = 0; /* Last frame in list */ 3306d992b150Sdrh int nExtra = 0; /* Number of extra copies of last page */ 3307d992b150Sdrh int szFrame; /* The size of a single frame */ 3308d992b150Sdrh i64 iOffset; /* Next byte to write in WAL file */ 3309d992b150Sdrh WalWriter w; /* The writer */ 3310d6f7c979Sdan u32 iFirst = 0; /* First frame that may be overwritten */ 3311d6f7c979Sdan WalIndexHdr *pLive; /* Pointer to shared header */ 3312c438efd6Sdrh 3313c438efd6Sdrh assert( pList ); 331473b64e4dSdrh assert( pWal->writeLock ); 3315c438efd6Sdrh 33164120994fSdrh /* If this frame set completes a transaction, then nTruncate>0. If 33174120994fSdrh ** nTruncate==0 then this frame set does not complete the transaction. */ 33184120994fSdrh assert( (isCommit!=0)==(nTruncate!=0) ); 33194120994fSdrh 3320c74c3334Sdrh #if defined(SQLITE_TEST) && defined(SQLITE_DEBUG) 3321c74c3334Sdrh { int cnt; for(cnt=0, p=pList; p; p=p->pDirty, cnt++){} 3322c74c3334Sdrh WALTRACE(("WAL%p: frame write begin. %d frames. mxFrame=%d. %s\n", 3323c74c3334Sdrh pWal, cnt, pWal->hdr.mxFrame, isCommit ? "Commit" : "Spill")); 3324c74c3334Sdrh } 3325c74c3334Sdrh #endif 3326c74c3334Sdrh 3327d6f7c979Sdan pLive = (WalIndexHdr*)walIndexHdr(pWal); 3328b7c2f86bSdrh if( memcmp(&pWal->hdr, (void *)pLive, sizeof(WalIndexHdr))!=0 ){ 3329d6f7c979Sdan iFirst = pLive->mxFrame+1; 3330d6f7c979Sdan } 3331d6f7c979Sdan 33329971e710Sdan /* See if it is possible to write these frames into the start of the 33339971e710Sdan ** log file, instead of appending to it at pWal->hdr.mxFrame. 33349971e710Sdan */ 33359971e710Sdan if( SQLITE_OK!=(rc = walRestartLog(pWal)) ){ 33369971e710Sdan return rc; 33379971e710Sdan } 33389971e710Sdan 3339a2a42013Sdrh /* If this is the first frame written into the log, write the WAL 3340a2a42013Sdrh ** header to the start of the WAL file. See comments at the top of 3341a2a42013Sdrh ** this source file for a description of the WAL header format. 3342c438efd6Sdrh */ 3343027a128aSdrh iFrame = pWal->hdr.mxFrame; 3344c438efd6Sdrh if( iFrame==0 ){ 334510f5a50eSdan u8 aWalHdr[WAL_HDRSIZE]; /* Buffer to assemble wal-header in */ 334610f5a50eSdan u32 aCksum[2]; /* Checksum for wal-header */ 334710f5a50eSdan 3348b8fd6c2fSdan sqlite3Put4byte(&aWalHdr[0], (WAL_MAGIC | SQLITE_BIGENDIAN)); 334910f5a50eSdan sqlite3Put4byte(&aWalHdr[4], WAL_MAX_VERSION); 335023ea97b6Sdrh sqlite3Put4byte(&aWalHdr[8], szPage); 335123ea97b6Sdrh sqlite3Put4byte(&aWalHdr[12], pWal->nCkpt); 3352d2980310Sdrh if( pWal->nCkpt==0 ) sqlite3_randomness(8, pWal->hdr.aSalt); 33537e263728Sdrh memcpy(&aWalHdr[16], pWal->hdr.aSalt, 8); 335410f5a50eSdan walChecksumBytes(1, aWalHdr, WAL_HDRSIZE-2*4, 0, aCksum); 335510f5a50eSdan sqlite3Put4byte(&aWalHdr[24], aCksum[0]); 335610f5a50eSdan sqlite3Put4byte(&aWalHdr[28], aCksum[1]); 335710f5a50eSdan 3358b2eced5dSdrh pWal->szPage = szPage; 335910f5a50eSdan pWal->hdr.bigEndCksum = SQLITE_BIGENDIAN; 336010f5a50eSdan pWal->hdr.aFrameCksum[0] = aCksum[0]; 336110f5a50eSdan pWal->hdr.aFrameCksum[1] = aCksum[1]; 3362f60b7f36Sdan pWal->truncateOnCommit = 1; 336310f5a50eSdan 336423ea97b6Sdrh rc = sqlite3OsWrite(pWal->pWalFd, aWalHdr, sizeof(aWalHdr), 0); 3365c74c3334Sdrh WALTRACE(("WAL%p: wal-header write %s\n", pWal, rc ? "failed" : "ok")); 3366c438efd6Sdrh if( rc!=SQLITE_OK ){ 3367c438efd6Sdrh return rc; 3368c438efd6Sdrh } 3369d992b150Sdrh 3370d992b150Sdrh /* Sync the header (unless SQLITE_IOCAP_SEQUENTIAL is true or unless 3371d992b150Sdrh ** all syncing is turned off by PRAGMA synchronous=OFF). Otherwise 3372d992b150Sdrh ** an out-of-order write following a WAL restart could result in 3373d992b150Sdrh ** database corruption. See the ticket: 3374d992b150Sdrh ** 33759c6e07d2Sdrh ** https://sqlite.org/src/info/ff5be73dee 3376d992b150Sdrh */ 3377daaae7b9Sdrh if( pWal->syncHeader ){ 3378daaae7b9Sdrh rc = sqlite3OsSync(pWal->pWalFd, CKPT_SYNC_FLAGS(sync_flags)); 3379d992b150Sdrh if( rc ) return rc; 3380d992b150Sdrh } 3381c438efd6Sdrh } 3382bd2aaf9aSshaneh assert( (int)pWal->szPage==szPage ); 3383c438efd6Sdrh 3384d992b150Sdrh /* Setup information needed to write frames into the WAL */ 3385d992b150Sdrh w.pWal = pWal; 3386d992b150Sdrh w.pFd = pWal->pWalFd; 3387d992b150Sdrh w.iSyncPoint = 0; 3388d992b150Sdrh w.syncFlags = sync_flags; 3389d992b150Sdrh w.szPage = szPage; 3390d992b150Sdrh iOffset = walFrameOffset(iFrame+1, szPage); 3391d992b150Sdrh szFrame = szPage + WAL_FRAME_HDRSIZE; 339288f975a7Sdrh 3393d992b150Sdrh /* Write all frames into the log file exactly once */ 3394c438efd6Sdrh for(p=pList; p; p=p->pDirty){ 3395d992b150Sdrh int nDbSize; /* 0 normally. Positive == commit flag */ 3396d6f7c979Sdan 3397d6f7c979Sdan /* Check if this page has already been written into the wal file by 3398d6f7c979Sdan ** the current transaction. If so, overwrite the existing frame and 3399d6f7c979Sdan ** set Wal.writeLock to WAL_WRITELOCK_RECKSUM - indicating that 3400d6f7c979Sdan ** checksums must be recomputed when the transaction is committed. */ 3401d6f7c979Sdan if( iFirst && (p->pDirty || isCommit==0) ){ 3402d6f7c979Sdan u32 iWrite = 0; 34038997087aSdrh VVA_ONLY(rc =) sqlite3WalFindFrame(pWal, p->pgno, &iWrite); 34048997087aSdrh assert( rc==SQLITE_OK || iWrite==0 ); 3405d6f7c979Sdan if( iWrite>=iFirst ){ 3406d6f7c979Sdan i64 iOff = walFrameOffset(iWrite, szPage) + WAL_FRAME_HDRSIZE; 34078e0cea1aSdrh void *pData; 3408c9a9022bSdan if( pWal->iReCksum==0 || iWrite<pWal->iReCksum ){ 3409c9a9022bSdan pWal->iReCksum = iWrite; 3410c9a9022bSdan } 34118e0cea1aSdrh #if defined(SQLITE_HAS_CODEC) 34128e0cea1aSdrh if( (pData = sqlite3PagerCodec(p))==0 ) return SQLITE_NOMEM; 34138e0cea1aSdrh #else 34148e0cea1aSdrh pData = p->pData; 34158e0cea1aSdrh #endif 34168e0cea1aSdrh rc = sqlite3OsWrite(pWal->pWalFd, pData, szPage, iOff); 3417d6f7c979Sdan if( rc ) return rc; 3418d6f7c979Sdan p->flags &= ~PGHDR_WAL_APPEND; 3419d6f7c979Sdan continue; 3420d6f7c979Sdan } 3421d6f7c979Sdan } 3422d6f7c979Sdan 3423d992b150Sdrh iFrame++; 3424d992b150Sdrh assert( iOffset==walFrameOffset(iFrame, szPage) ); 3425d992b150Sdrh nDbSize = (isCommit && p->pDirty==0) ? nTruncate : 0; 3426d992b150Sdrh rc = walWriteOneFrame(&w, p, nDbSize, iOffset); 3427d992b150Sdrh if( rc ) return rc; 3428c438efd6Sdrh pLast = p; 3429d992b150Sdrh iOffset += szFrame; 3430d6f7c979Sdan p->flags |= PGHDR_WAL_APPEND; 3431d6f7c979Sdan } 3432d6f7c979Sdan 3433d6f7c979Sdan /* Recalculate checksums within the wal file if required. */ 3434c9a9022bSdan if( isCommit && pWal->iReCksum ){ 3435c9a9022bSdan rc = walRewriteChecksums(pWal, iFrame); 3436d6f7c979Sdan if( rc ) return rc; 3437c438efd6Sdrh } 3438c438efd6Sdrh 3439d992b150Sdrh /* If this is the end of a transaction, then we might need to pad 3440d992b150Sdrh ** the transaction and/or sync the WAL file. 3441d992b150Sdrh ** 3442d992b150Sdrh ** Padding and syncing only occur if this set of frames complete a 3443d992b150Sdrh ** transaction and if PRAGMA synchronous=FULL. If synchronous==NORMAL 344460ec914cSpeter.d.reid ** or synchronous==OFF, then no padding or syncing are needed. 3445d992b150Sdrh ** 3446cb15f35fSdrh ** If SQLITE_IOCAP_POWERSAFE_OVERWRITE is defined, then padding is not 3447cb15f35fSdrh ** needed and only the sync is done. If padding is needed, then the 3448cb15f35fSdrh ** final frame is repeated (with its commit mark) until the next sector 3449d992b150Sdrh ** boundary is crossed. Only the part of the WAL prior to the last 3450d992b150Sdrh ** sector boundary is synced; the part of the last frame that extends 3451d992b150Sdrh ** past the sector boundary is written after the sync. 3452d992b150Sdrh */ 3453daaae7b9Sdrh if( isCommit && WAL_SYNC_FLAGS(sync_flags)!=0 ){ 3454fe912510Sdan int bSync = 1; 3455374f4a04Sdrh if( pWal->padToSectorBoundary ){ 3456c9a53269Sdan int sectorSize = sqlite3SectorSize(pWal->pWalFd); 3457d992b150Sdrh w.iSyncPoint = ((iOffset+sectorSize-1)/sectorSize)*sectorSize; 3458fe912510Sdan bSync = (w.iSyncPoint==iOffset); 3459fe912510Sdan testcase( bSync ); 3460d992b150Sdrh while( iOffset<w.iSyncPoint ){ 3461d992b150Sdrh rc = walWriteOneFrame(&w, pLast, nTruncate, iOffset); 3462d992b150Sdrh if( rc ) return rc; 3463d992b150Sdrh iOffset += szFrame; 3464d992b150Sdrh nExtra++; 3465c438efd6Sdrh } 3466fe912510Sdan } 3467fe912510Sdan if( bSync ){ 3468fe912510Sdan assert( rc==SQLITE_OK ); 3469daaae7b9Sdrh rc = sqlite3OsSync(w.pFd, WAL_SYNC_FLAGS(sync_flags)); 3470c438efd6Sdrh } 34714e5e108eSdrh } 3472c438efd6Sdrh 3473d992b150Sdrh /* If this frame set completes the first transaction in the WAL and 3474d992b150Sdrh ** if PRAGMA journal_size_limit is set, then truncate the WAL to the 3475d992b150Sdrh ** journal size limit, if possible. 3476d992b150Sdrh */ 3477f60b7f36Sdan if( isCommit && pWal->truncateOnCommit && pWal->mxWalSize>=0 ){ 3478f60b7f36Sdan i64 sz = pWal->mxWalSize; 3479d992b150Sdrh if( walFrameOffset(iFrame+nExtra+1, szPage)>pWal->mxWalSize ){ 3480d992b150Sdrh sz = walFrameOffset(iFrame+nExtra+1, szPage); 3481f60b7f36Sdan } 3482f60b7f36Sdan walLimitSize(pWal, sz); 3483f60b7f36Sdan pWal->truncateOnCommit = 0; 3484f60b7f36Sdan } 3485f60b7f36Sdan 3486e730fec8Sdrh /* Append data to the wal-index. It is not necessary to lock the 3487a2a42013Sdrh ** wal-index to do this as the SQLITE_SHM_WRITE lock held on the wal-index 3488c438efd6Sdrh ** guarantees that there are no other writers, and no data that may 3489c438efd6Sdrh ** be in use by existing readers is being overwritten. 3490c438efd6Sdrh */ 3491027a128aSdrh iFrame = pWal->hdr.mxFrame; 3492c7991bdfSdan for(p=pList; p && rc==SQLITE_OK; p=p->pDirty){ 3493d6f7c979Sdan if( (p->flags & PGHDR_WAL_APPEND)==0 ) continue; 3494c438efd6Sdrh iFrame++; 3495c7991bdfSdan rc = walIndexAppend(pWal, iFrame, p->pgno); 3496c438efd6Sdrh } 349720e226d9Sdrh while( rc==SQLITE_OK && nExtra>0 ){ 3498c438efd6Sdrh iFrame++; 3499d992b150Sdrh nExtra--; 3500c7991bdfSdan rc = walIndexAppend(pWal, iFrame, pLast->pgno); 3501c438efd6Sdrh } 3502c438efd6Sdrh 3503c7991bdfSdan if( rc==SQLITE_OK ){ 3504c438efd6Sdrh /* Update the private copy of the header. */ 35051df2db7fSshaneh pWal->hdr.szPage = (u16)((szPage&0xff00) | (szPage>>16)); 35069b78f791Sdrh testcase( szPage<=32768 ); 35079b78f791Sdrh testcase( szPage>=65536 ); 3508027a128aSdrh pWal->hdr.mxFrame = iFrame; 3509c438efd6Sdrh if( isCommit ){ 35107ed91f23Sdrh pWal->hdr.iChange++; 35117ed91f23Sdrh pWal->hdr.nPage = nTruncate; 3512c438efd6Sdrh } 35137ed91f23Sdrh /* If this is a commit, update the wal-index header too. */ 35147ed91f23Sdrh if( isCommit ){ 35157e263728Sdrh walIndexWriteHdr(pWal); 35167ed91f23Sdrh pWal->iCallback = iFrame; 3517c438efd6Sdrh } 3518c7991bdfSdan } 3519c438efd6Sdrh 3520c74c3334Sdrh WALTRACE(("WAL%p: frame write %s\n", pWal, rc ? "failed" : "ok")); 3521c438efd6Sdrh return rc; 3522c438efd6Sdrh } 3523c438efd6Sdrh 3524c438efd6Sdrh /* 352573b64e4dSdrh ** This routine is called to implement sqlite3_wal_checkpoint() and 352673b64e4dSdrh ** related interfaces. 3527c438efd6Sdrh ** 352873b64e4dSdrh ** Obtain a CHECKPOINT lock and then backfill as much information as 352973b64e4dSdrh ** we can from WAL into the database. 3530a58f26f9Sdan ** 3531a58f26f9Sdan ** If parameter xBusy is not NULL, it is a pointer to a busy-handler 3532a58f26f9Sdan ** callback. In this case this function runs a blocking checkpoint. 3533c438efd6Sdrh */ 3534c438efd6Sdrh int sqlite3WalCheckpoint( 35357ed91f23Sdrh Wal *pWal, /* Wal connection */ 35367fb89906Sdan sqlite3 *db, /* Check this handle's interrupt flag */ 3537dd90d7eeSdrh int eMode, /* PASSIVE, FULL, RESTART, or TRUNCATE */ 3538a58f26f9Sdan int (*xBusy)(void*), /* Function to call when busy */ 3539a58f26f9Sdan void *pBusyArg, /* Context argument for xBusyHandler */ 3540c438efd6Sdrh int sync_flags, /* Flags to sync db file with (or 0) */ 3541b6e099a9Sdan int nBuf, /* Size of temporary buffer */ 3542cdc1f049Sdan u8 *zBuf, /* Temporary buffer to use */ 3543cdc1f049Sdan int *pnLog, /* OUT: Number of frames in WAL */ 3544cdc1f049Sdan int *pnCkpt /* OUT: Number of backfilled frames in WAL */ 3545c438efd6Sdrh ){ 3546c438efd6Sdrh int rc; /* Return code */ 354731c03907Sdan int isChanged = 0; /* True if a new wal-index header is loaded */ 3548f2b8dd58Sdan int eMode2 = eMode; /* Mode to pass to walCheckpoint() */ 3549dd90d7eeSdrh int (*xBusy2)(void*) = xBusy; /* Busy handler for eMode2 */ 3550c438efd6Sdrh 3551d54ff60bSdan assert( pWal->ckptLock==0 ); 3552a58f26f9Sdan assert( pWal->writeLock==0 ); 3553c438efd6Sdrh 3554dd90d7eeSdrh /* EVIDENCE-OF: R-62920-47450 The busy-handler callback is never invoked 3555dd90d7eeSdrh ** in the SQLITE_CHECKPOINT_PASSIVE mode. */ 3556dd90d7eeSdrh assert( eMode!=SQLITE_CHECKPOINT_PASSIVE || xBusy==0 ); 3557dd90d7eeSdrh 355866dfec8bSdrh if( pWal->readOnly ) return SQLITE_READONLY; 3559c74c3334Sdrh WALTRACE(("WAL%p: checkpoint begins\n", pWal)); 3560dd90d7eeSdrh 3561dd90d7eeSdrh /* IMPLEMENTATION-OF: R-62028-47212 All calls obtain an exclusive 3562dd90d7eeSdrh ** "checkpoint" lock on the database file. */ 3563ab372773Sdrh rc = walLockExclusive(pWal, WAL_CKPT_LOCK, 1); 356473b64e4dSdrh if( rc ){ 3565dd90d7eeSdrh /* EVIDENCE-OF: R-10421-19736 If any other process is running a 3566dd90d7eeSdrh ** checkpoint operation at the same time, the lock cannot be obtained and 3567dd90d7eeSdrh ** SQLITE_BUSY is returned. 3568dd90d7eeSdrh ** EVIDENCE-OF: R-53820-33897 Even if there is a busy-handler configured, 3569dd90d7eeSdrh ** it will not be invoked in this case. 3570dd90d7eeSdrh */ 3571dd90d7eeSdrh testcase( rc==SQLITE_BUSY ); 3572dd90d7eeSdrh testcase( xBusy!=0 ); 3573c438efd6Sdrh return rc; 3574c438efd6Sdrh } 3575d54ff60bSdan pWal->ckptLock = 1; 3576c438efd6Sdrh 3577dd90d7eeSdrh /* IMPLEMENTATION-OF: R-59782-36818 The SQLITE_CHECKPOINT_FULL, RESTART and 3578dd90d7eeSdrh ** TRUNCATE modes also obtain the exclusive "writer" lock on the database 3579dd90d7eeSdrh ** file. 3580f2b8dd58Sdan ** 3581dd90d7eeSdrh ** EVIDENCE-OF: R-60642-04082 If the writer lock cannot be obtained 3582dd90d7eeSdrh ** immediately, and a busy-handler is configured, it is invoked and the 3583dd90d7eeSdrh ** writer lock retried until either the busy-handler returns 0 or the 3584dd90d7eeSdrh ** lock is successfully obtained. 3585a58f26f9Sdan */ 3586cdc1f049Sdan if( eMode!=SQLITE_CHECKPOINT_PASSIVE ){ 3587a58f26f9Sdan rc = walBusyLock(pWal, xBusy, pBusyArg, WAL_WRITE_LOCK, 1); 3588c438efd6Sdrh if( rc==SQLITE_OK ){ 3589f2b8dd58Sdan pWal->writeLock = 1; 3590f2b8dd58Sdan }else if( rc==SQLITE_BUSY ){ 3591f2b8dd58Sdan eMode2 = SQLITE_CHECKPOINT_PASSIVE; 3592dd90d7eeSdrh xBusy2 = 0; 3593f2b8dd58Sdan rc = SQLITE_OK; 3594c438efd6Sdrh } 3595a58f26f9Sdan } 3596a58f26f9Sdan 3597f2b8dd58Sdan /* Read the wal-index header. */ 35987ed91f23Sdrh if( rc==SQLITE_OK ){ 3599a58f26f9Sdan rc = walIndexReadHdr(pWal, &isChanged); 3600f55a4cf8Sdan if( isChanged && pWal->pDbFd->pMethods->iVersion>=3 ){ 3601f55a4cf8Sdan sqlite3OsUnfetch(pWal->pDbFd, 0, 0); 3602f55a4cf8Sdan } 3603a58f26f9Sdan } 3604f2b8dd58Sdan 3605f2b8dd58Sdan /* Copy data from the log to the database file. */ 36069c5e3680Sdan if( rc==SQLITE_OK ){ 3607d6f7c979Sdan 36089c5e3680Sdan if( pWal->hdr.mxFrame && walPagesize(pWal)!=nBuf ){ 3609f2b8dd58Sdan rc = SQLITE_CORRUPT_BKPT; 3610f2b8dd58Sdan }else{ 36117fb89906Sdan rc = walCheckpoint(pWal, db, eMode2, xBusy2, pBusyArg, sync_flags, zBuf); 36129c5e3680Sdan } 36139c5e3680Sdan 36149c5e3680Sdan /* If no error occurred, set the output variables. */ 36159c5e3680Sdan if( rc==SQLITE_OK || rc==SQLITE_BUSY ){ 3616cdc1f049Sdan if( pnLog ) *pnLog = (int)pWal->hdr.mxFrame; 36179c5e3680Sdan if( pnCkpt ) *pnCkpt = (int)(walCkptInfo(pWal)->nBackfill); 3618c438efd6Sdrh } 3619f2b8dd58Sdan } 3620f2b8dd58Sdan 362131c03907Sdan if( isChanged ){ 362231c03907Sdan /* If a new wal-index header was loaded before the checkpoint was 3623a2a42013Sdrh ** performed, then the pager-cache associated with pWal is now 362431c03907Sdan ** out of date. So zero the cached wal-index header to ensure that 362531c03907Sdan ** next time the pager opens a snapshot on this database it knows that 362631c03907Sdan ** the cache needs to be reset. 362731c03907Sdan */ 362831c03907Sdan memset(&pWal->hdr, 0, sizeof(WalIndexHdr)); 362931c03907Sdan } 3630c438efd6Sdrh 3631c438efd6Sdrh /* Release the locks. */ 3632a58f26f9Sdan sqlite3WalEndWriteTransaction(pWal); 363373b64e4dSdrh walUnlockExclusive(pWal, WAL_CKPT_LOCK, 1); 3634d54ff60bSdan pWal->ckptLock = 0; 3635c74c3334Sdrh WALTRACE(("WAL%p: checkpoint %s\n", pWal, rc ? "failed" : "ok")); 3636f2b8dd58Sdan return (rc==SQLITE_OK && eMode!=eMode2 ? SQLITE_BUSY : rc); 3637c438efd6Sdrh } 3638c438efd6Sdrh 36397ed91f23Sdrh /* Return the value to pass to a sqlite3_wal_hook callback, the 36407ed91f23Sdrh ** number of frames in the WAL at the point of the last commit since 36417ed91f23Sdrh ** sqlite3WalCallback() was called. If no commits have occurred since 36427ed91f23Sdrh ** the last call, then return 0. 36437ed91f23Sdrh */ 36447ed91f23Sdrh int sqlite3WalCallback(Wal *pWal){ 3645c438efd6Sdrh u32 ret = 0; 36467ed91f23Sdrh if( pWal ){ 36477ed91f23Sdrh ret = pWal->iCallback; 36487ed91f23Sdrh pWal->iCallback = 0; 3649c438efd6Sdrh } 3650c438efd6Sdrh return (int)ret; 3651c438efd6Sdrh } 36525543759bSdan 36535543759bSdan /* 365461e4acecSdrh ** This function is called to change the WAL subsystem into or out 365561e4acecSdrh ** of locking_mode=EXCLUSIVE. 36565543759bSdan ** 365761e4acecSdrh ** If op is zero, then attempt to change from locking_mode=EXCLUSIVE 365861e4acecSdrh ** into locking_mode=NORMAL. This means that we must acquire a lock 365961e4acecSdrh ** on the pWal->readLock byte. If the WAL is already in locking_mode=NORMAL 366061e4acecSdrh ** or if the acquisition of the lock fails, then return 0. If the 366161e4acecSdrh ** transition out of exclusive-mode is successful, return 1. This 366261e4acecSdrh ** operation must occur while the pager is still holding the exclusive 366361e4acecSdrh ** lock on the main database file. 36645543759bSdan ** 366561e4acecSdrh ** If op is one, then change from locking_mode=NORMAL into 366661e4acecSdrh ** locking_mode=EXCLUSIVE. This means that the pWal->readLock must 366761e4acecSdrh ** be released. Return 1 if the transition is made and 0 if the 366861e4acecSdrh ** WAL is already in exclusive-locking mode - meaning that this 366961e4acecSdrh ** routine is a no-op. The pager must already hold the exclusive lock 367061e4acecSdrh ** on the main database file before invoking this operation. 367161e4acecSdrh ** 367261e4acecSdrh ** If op is negative, then do a dry-run of the op==1 case but do 367361e4acecSdrh ** not actually change anything. The pager uses this to see if it 367461e4acecSdrh ** should acquire the database exclusive lock prior to invoking 367561e4acecSdrh ** the op==1 case. 36765543759bSdan */ 36775543759bSdan int sqlite3WalExclusiveMode(Wal *pWal, int op){ 367861e4acecSdrh int rc; 3679aab4c02eSdrh assert( pWal->writeLock==0 ); 36808c408004Sdan assert( pWal->exclusiveMode!=WAL_HEAPMEMORY_MODE || op==-1 ); 36813cac5dc9Sdan 36823cac5dc9Sdan /* pWal->readLock is usually set, but might be -1 if there was a 36833cac5dc9Sdan ** prior error while attempting to acquire are read-lock. This cannot 36843cac5dc9Sdan ** happen if the connection is actually in exclusive mode (as no xShmLock 36853cac5dc9Sdan ** locks are taken in this case). Nor should the pager attempt to 36863cac5dc9Sdan ** upgrade to exclusive-mode following such an error. 36873cac5dc9Sdan */ 3688aab4c02eSdrh assert( pWal->readLock>=0 || pWal->lockError ); 36893cac5dc9Sdan assert( pWal->readLock>=0 || (op<=0 && pWal->exclusiveMode==0) ); 36903cac5dc9Sdan 369161e4acecSdrh if( op==0 ){ 3692c05a063cSdrh if( pWal->exclusiveMode!=WAL_NORMAL_MODE ){ 3693c05a063cSdrh pWal->exclusiveMode = WAL_NORMAL_MODE; 36943cac5dc9Sdan if( walLockShared(pWal, WAL_READ_LOCK(pWal->readLock))!=SQLITE_OK ){ 3695c05a063cSdrh pWal->exclusiveMode = WAL_EXCLUSIVE_MODE; 36965543759bSdan } 3697c05a063cSdrh rc = pWal->exclusiveMode==WAL_NORMAL_MODE; 369861e4acecSdrh }else{ 3699aab4c02eSdrh /* Already in locking_mode=NORMAL */ 370061e4acecSdrh rc = 0; 370161e4acecSdrh } 370261e4acecSdrh }else if( op>0 ){ 3703c05a063cSdrh assert( pWal->exclusiveMode==WAL_NORMAL_MODE ); 3704aab4c02eSdrh assert( pWal->readLock>=0 ); 370561e4acecSdrh walUnlockShared(pWal, WAL_READ_LOCK(pWal->readLock)); 3706c05a063cSdrh pWal->exclusiveMode = WAL_EXCLUSIVE_MODE; 370761e4acecSdrh rc = 1; 370861e4acecSdrh }else{ 3709c05a063cSdrh rc = pWal->exclusiveMode==WAL_NORMAL_MODE; 371061e4acecSdrh } 371161e4acecSdrh return rc; 37125543759bSdan } 37135543759bSdan 37148c408004Sdan /* 37158c408004Sdan ** Return true if the argument is non-NULL and the WAL module is using 37168c408004Sdan ** heap-memory for the wal-index. Otherwise, if the argument is NULL or the 37178c408004Sdan ** WAL module is using shared-memory, return false. 37188c408004Sdan */ 37198c408004Sdan int sqlite3WalHeapMemory(Wal *pWal){ 37208c408004Sdan return (pWal && pWal->exclusiveMode==WAL_HEAPMEMORY_MODE ); 37218c408004Sdan } 37228c408004Sdan 3723fc1acf33Sdan #ifdef SQLITE_ENABLE_SNAPSHOT 3724e230a899Sdrh /* Create a snapshot object. The content of a snapshot is opaque to 3725e230a899Sdrh ** every other subsystem, so the WAL module can put whatever it needs 3726e230a899Sdrh ** in the object. 3727e230a899Sdrh */ 3728fc1acf33Sdan int sqlite3WalSnapshotGet(Wal *pWal, sqlite3_snapshot **ppSnapshot){ 3729fc1acf33Sdan int rc = SQLITE_OK; 3730fc1acf33Sdan WalIndexHdr *pRet; 3731ba6eb876Sdrh static const u32 aZero[4] = { 0, 0, 0, 0 }; 3732fc1acf33Sdan 3733fc1acf33Sdan assert( pWal->readLock>=0 && pWal->writeLock==0 ); 3734fc1acf33Sdan 3735ba6eb876Sdrh if( memcmp(&pWal->hdr.aFrameCksum[0],aZero,16)==0 ){ 3736ba6eb876Sdrh *ppSnapshot = 0; 3737ba6eb876Sdrh return SQLITE_ERROR; 3738ba6eb876Sdrh } 3739fc1acf33Sdan pRet = (WalIndexHdr*)sqlite3_malloc(sizeof(WalIndexHdr)); 3740fc1acf33Sdan if( pRet==0 ){ 3741fad3039cSmistachkin rc = SQLITE_NOMEM_BKPT; 3742fc1acf33Sdan }else{ 3743fc1acf33Sdan memcpy(pRet, &pWal->hdr, sizeof(WalIndexHdr)); 3744fc1acf33Sdan *ppSnapshot = (sqlite3_snapshot*)pRet; 3745fc1acf33Sdan } 3746fc1acf33Sdan 3747fc1acf33Sdan return rc; 3748fc1acf33Sdan } 3749fc1acf33Sdan 3750e230a899Sdrh /* Try to open on pSnapshot when the next read-transaction starts 3751e230a899Sdrh */ 3752fc1acf33Sdan void sqlite3WalSnapshotOpen(Wal *pWal, sqlite3_snapshot *pSnapshot){ 3753fc1acf33Sdan pWal->pSnapshot = (WalIndexHdr*)pSnapshot; 3754fc1acf33Sdan } 3755ad2d5bafSdan 3756ad2d5bafSdan /* 3757ad2d5bafSdan ** Return a +ve value if snapshot p1 is newer than p2. A -ve value if 3758ad2d5bafSdan ** p1 is older than p2 and zero if p1 and p2 are the same snapshot. 3759ad2d5bafSdan */ 3760ad2d5bafSdan int sqlite3_snapshot_cmp(sqlite3_snapshot *p1, sqlite3_snapshot *p2){ 3761ad2d5bafSdan WalIndexHdr *pHdr1 = (WalIndexHdr*)p1; 3762ad2d5bafSdan WalIndexHdr *pHdr2 = (WalIndexHdr*)p2; 3763ad2d5bafSdan 3764ad2d5bafSdan /* aSalt[0] is a copy of the value stored in the wal file header. It 3765ad2d5bafSdan ** is incremented each time the wal file is restarted. */ 3766ad2d5bafSdan if( pHdr1->aSalt[0]<pHdr2->aSalt[0] ) return -1; 3767ad2d5bafSdan if( pHdr1->aSalt[0]>pHdr2->aSalt[0] ) return +1; 3768ad2d5bafSdan if( pHdr1->mxFrame<pHdr2->mxFrame ) return -1; 3769ad2d5bafSdan if( pHdr1->mxFrame>pHdr2->mxFrame ) return +1; 3770ad2d5bafSdan return 0; 3771ad2d5bafSdan } 3772*fa3d4c19Sdan 3773*fa3d4c19Sdan /* 3774*fa3d4c19Sdan ** The caller currently has a read transaction open on the database. 3775*fa3d4c19Sdan ** This function takes a SHARED lock on the CHECKPOINTER slot and then 3776*fa3d4c19Sdan ** checks if the snapshot passed as the second argument is still 3777*fa3d4c19Sdan ** available. If so, SQLITE_OK is returned. 3778*fa3d4c19Sdan ** 3779*fa3d4c19Sdan ** If the snapshot is not available, SQLITE_ERROR is returned. Or, if 3780*fa3d4c19Sdan ** the CHECKPOINTER lock cannot be obtained, SQLITE_BUSY. If any error 3781*fa3d4c19Sdan ** occurs (any value other than SQLITE_OK is returned), the CHECKPOINTER 3782*fa3d4c19Sdan ** lock is released before returning. 3783*fa3d4c19Sdan */ 3784*fa3d4c19Sdan int sqlite3WalSnapshotCheck(Wal *pWal, sqlite3_snapshot *pSnapshot){ 3785*fa3d4c19Sdan int rc; 3786*fa3d4c19Sdan rc = walLockShared(pWal, WAL_CKPT_LOCK); 3787*fa3d4c19Sdan if( rc==SQLITE_OK ){ 3788*fa3d4c19Sdan WalIndexHdr *pNew = (WalIndexHdr*)pSnapshot; 3789*fa3d4c19Sdan if( memcmp(pNew->aSalt, pWal->hdr.aSalt, sizeof(pWal->hdr.aSalt)) 3790*fa3d4c19Sdan || pNew->mxFrame<walCkptInfo(pWal)->nBackfillAttempted 3791*fa3d4c19Sdan ){ 3792*fa3d4c19Sdan rc = SQLITE_BUSY_SNAPSHOT; 3793*fa3d4c19Sdan walUnlockShared(pWal, WAL_CKPT_LOCK); 3794*fa3d4c19Sdan } 3795*fa3d4c19Sdan } 3796*fa3d4c19Sdan return rc; 3797*fa3d4c19Sdan } 3798*fa3d4c19Sdan 3799*fa3d4c19Sdan /* 3800*fa3d4c19Sdan ** Release a lock obtained by an earlier successful call to 3801*fa3d4c19Sdan ** sqlite3WalSnapshotCheck(). 3802*fa3d4c19Sdan */ 3803*fa3d4c19Sdan void sqlite3WalSnapshotUnlock(Wal *pWal){ 3804*fa3d4c19Sdan assert( pWal ); 3805*fa3d4c19Sdan walUnlockShared(pWal, WAL_CKPT_LOCK); 3806*fa3d4c19Sdan } 3807*fa3d4c19Sdan 3808*fa3d4c19Sdan 3809fc1acf33Sdan #endif /* SQLITE_ENABLE_SNAPSHOT */ 3810fc1acf33Sdan 381170708600Sdrh #ifdef SQLITE_ENABLE_ZIPVFS 3812b3bdc72dSdan /* 3813b3bdc72dSdan ** If the argument is not NULL, it points to a Wal object that holds a 3814b3bdc72dSdan ** read-lock. This function returns the database page-size if it is known, 3815b3bdc72dSdan ** or zero if it is not (or if pWal is NULL). 3816b3bdc72dSdan */ 3817b3bdc72dSdan int sqlite3WalFramesize(Wal *pWal){ 3818b3bdc72dSdan assert( pWal==0 || pWal->readLock>=0 ); 3819b3bdc72dSdan return (pWal ? pWal->szPage : 0); 3820b3bdc72dSdan } 382170708600Sdrh #endif 3822b3bdc72dSdan 382321d61853Sdrh /* Return the sqlite3_file object for the WAL file 382421d61853Sdrh */ 382521d61853Sdrh sqlite3_file *sqlite3WalFile(Wal *pWal){ 382621d61853Sdrh return pWal->pWalFd; 382721d61853Sdrh } 382821d61853Sdrh 38295cf53537Sdan #endif /* #ifndef SQLITE_OMIT_WAL */ 3830