How to Speed Up Uploading Video to Dropbox on Android
Camera uploads is a feature in our Android and iOS apps that automatically backs up a user's photos and videos from their mobile device to Dropbox. The feature was first introduced in 2012, and uploads millions of photos and videos for hundreds of thousands of users every day. People who use camera uploads are some of our most dedicated and engaged users. They care deeply about their photo libraries, and expect their backups to be quick and dependable every time. It's important that we offer a service they can trust.
Until recently, camera uploads was built on a C++ library shared between the Android and iOS Dropbox apps. This library served us well for a long time, uploading billions of images over many years. However, it had numerous problems. The shared code had grown polluted with complex platform-specific hacks that made it hard to understand and risky to modify. This risk was compounded by a lack of tooling support, and a shortage of in-house C++ expertise. Plus, after more than five years in production, the C++ implementation was starting to show its age. It was unaware of platform-specific restrictions on background processes, had bugs that could delay uploads for long periods of time, and made outage recovery difficult and time-consuming.
In 2019, we decided that rewriting the feature was the best way to offer a reliable, trustworthy user experience for years to come. This time, the Android and iOS implementations would be separate and use platform-native languages (Kotlin and Swift respectively) and libraries (such as WorkManager and Room for Android). The implementations could then be optimized for each platform and evolve independently, without being constrained by design decisions from the other.
This post is about some of the design, validation, and release decisions we made while building the new camera uploads feature for Android, which we released to all users during the summer of 2021. The project shipped successfully, with no outages or major issues; error rates went down, and upload performance greatly improved. If you haven't already enabled camera uploads, you should try it out for yourself.
Designing for background reliability
The main value proposition of camera uploads is that it works silently in the background. For users who don't open the app for weeks or even months at a time, new photos should still upload promptly.
How does this work? When someone takes a new photo or modifies an existing photo, the OS notifies the Dropbox mobile app. A background worker we call the scanner carefully identifies all the photos (or videos) that haven't yet been uploaded to Dropbox and queues them for upload. Then another background worker, the uploader, batch uploads all the photos in the queue.
Uploading is a two-step process. First, like many Dropbox systems, we break the file into 4 MB blocks, compute the hash of each block, and upload each block to the server. Once all the file blocks are uploaded, we make a final commit request to the server with a list of all block hashes in the file. This creates a new file consisting of those blocks in the user's Camera Uploads folder. Photos and videos uploaded to this folder can then be accessed from any linked device.
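To make the block-and-commit flow concrete, here is a minimal sketch. The client API (BlocksApi, uploadBlock, commitFile) and the use of SHA-256 as the block hash are illustrative assumptions for this example, not the exact interfaces used in the app; only the "upload each block, then commit the list of block hashes" structure comes from the description above.

```kotlin
import java.io.File
import java.security.MessageDigest

const val BLOCK_SIZE = 4 * 1024 * 1024 // 4 MB blocks, as described above

// Hypothetical client interface for this sketch.
interface BlocksApi {
    suspend fun uploadBlock(hash: String, bytes: ByteArray)
    suspend fun commitFile(path: String, blockHashes: List<String>)
}

suspend fun uploadFile(api: BlocksApi, file: File, remotePath: String) {
    val hashes = mutableListOf<String>()
    file.inputStream().use { input ->
        val buffer = ByteArray(BLOCK_SIZE)
        while (true) {
            val read = input.read(buffer)
            if (read <= 0) break
            val block = buffer.copyOf(read)
            // Hash each block, upload it, and remember the hash for the final commit.
            val hash = MessageDigest.getInstance("SHA-256")
                .digest(block)
                .joinToString("") { "%02x".format(it) }
            api.uploadBlock(hash, block)
            hashes += hash
        }
    }
    // Final commit: the server assembles the file from the listed blocks.
    api.commitFile(remotePath, hashes)
}
```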
One of our biggest challenges is that Android places strong constraints on how often apps can run in the background and what capabilities they have. For example, App Standby limits our background network access if the Dropbox app hasn't recently been foregrounded. This means we might only be allowed to access the network for a 10-minute interval once every 24 hours. These restrictions have grown stricter in recent versions of Android, and the cross-platform C++ version of camera uploads was not well equipped to handle them. It would sometimes try to perform uploads that were doomed to fail because of a lack of network access, or fail to restart uploads during the system-provided window when network access became available.
Our rewrite does not escape these background restrictions; they still apply unless the user chooses to disable them in Android's system settings. However, we reduce delays as much as possible by taking maximum advantage of the network access we do receive. We use WorkManager to handle these background constraints for us, guaranteeing that uploads are attempted if, and only if, network access becomes available. Unlike our C++ implementation, we also do as much work as possible while offline (for example, by performing rudimentary checks on new photos for duplicates) before asking WorkManager to schedule us for network access.
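As a rough sketch of what "let WorkManager handle the constraint" looks like, the snippet below enqueues a unique, network-constrained worker. The worker class, work name, and policy choice are illustrative, not our production code; the WorkManager calls themselves are standard APIs.

```kotlin
import android.content.Context
import androidx.work.Constraints
import androidx.work.CoroutineWorker
import androidx.work.ExistingWorkPolicy
import androidx.work.NetworkType
import androidx.work.OneTimeWorkRequestBuilder
import androidx.work.WorkManager
import androidx.work.WorkerParameters

// Placeholder worker: in a real app this would drain the upload queue.
class UploadWorker(context: Context, params: WorkerParameters) :
    CoroutineWorker(context, params) {
    override suspend fun doWork(): Result {
        // Perform pending uploads; return Result.retry() on transient failure.
        return Result.success()
    }
}

fun scheduleUploads(context: Context) {
    val request = OneTimeWorkRequestBuilder<UploadWorker>()
        .setConstraints(
            Constraints.Builder()
                .setRequiredNetworkType(NetworkType.CONNECTED) // only run with network access
                .build()
        )
        .build()
    // Unique work so repeated scheduling doesn't enqueue duplicate uploaders.
    WorkManager.getInstance(context)
        .enqueueUniqueWork("camera-uploads", ExistingWorkPolicy.KEEP, request)
}
```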
To further optimize use of our limited network access, we also refined our handling of failed uploads. C++ camera uploads aggressively retried failed uploads an unlimited number of times. In the rewrite we added backoff intervals between retry attempts, and also tuned our retry behavior for different error categories. If an error is likely to be transient, we retry multiple times. If it's likely to be permanent, we don't bother retrying at all. As a result, we make fewer overall retry attempts, which limits network and battery usage, and users see fewer errors.
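Here is one way such category-aware retries with backoff can be expressed; the error taxonomy, attempt limits, and backoff values below are made-up illustrations rather than our actual tuning.

```kotlin
import kotlinx.coroutines.delay

// Hypothetical error categories for this sketch.
sealed class UploadError : Exception() {
    class Transient : UploadError()   // e.g. network dropped mid-upload
    class Permanent : UploadError()   // e.g. file deleted before upload finished
}

suspend fun <T> withRetries(
    maxAttempts: Int = 5,
    initialBackoffMs: Long = 1_000,
    block: suspend () -> T
): T {
    var backoffMs = initialBackoffMs
    repeat(maxAttempts - 1) {
        try {
            return block()
        } catch (e: UploadError.Permanent) {
            throw e                     // permanent errors are never retried
        } catch (e: UploadError.Transient) {
            delay(backoffMs)            // wait before the next attempt
            backoffMs *= 2              // exponential backoff between attempts
        }
    }
    return block()                      // final attempt propagates any error
}
```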
Designing for performance
Our users don't just expect camera uploads to work reliably. They also expect their photos to upload quickly, and without wasting system resources. We were able to make some big improvements here. For example, first-time uploads of large photo libraries now finish up to four times faster. There are a few ways our new implementation achieves this.
Parallel uploads
First, we substantially improved performance by adding support for parallel uploads. The C++ version uploaded only one file at a time. Early in the rewrite, we collaborated with our iOS and backend infrastructure colleagues to design an updated commit endpoint with support for parallel uploads.
Once the server constraint was gone, Kotlin coroutines made it easy to run uploads concurrently. Although Kotlin Flows are typically processed sequentially, the available operators are flexible enough to serve as building blocks for powerful custom operators that support concurrent processing. These operators can be chained declaratively to produce code that's much simpler, and has less overhead, than the manual thread management that would've been necessary in C++.
```kotlin
val uploadResults = mediaUploadStore
    .getPendingUploads()
    .unorderedConcurrentMap(concurrentUploadCount) {
        mediaUploader.upload(it)
    }
    .takeUntil { it != UploadTaskResult.SUCCESS }
    .toList()
```
A simple example of a concurrent upload pipeline. unorderedConcurrentMap is a custom operator that combines the built-in flatMapMerge and transform operators.
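One plausible shape for such an operator is sketched below: it maps each element concurrently, up to the given concurrency limit, emitting results as they complete rather than in input order. This is inferred from the description above and is not our exact implementation. Note that flatMapMerge may require opting in to a preview API, depending on your kotlinx.coroutines version.

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flatMapMerge
import kotlinx.coroutines.flow.flow

// Map each element with `transform`, running up to `concurrency` transforms
// at once and emitting results in completion order.
fun <T, R> Flow<T>.unorderedConcurrentMap(
    concurrency: Int,
    transform: suspend (T) -> R
): Flow<R> = flatMapMerge(concurrency) { item ->
    flow { emit(transform(item)) }
}
```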
Optimizing memory apply
After adding support for parallel uploads, we saw a big uptick in out-of-memory crashes from our early testers. A number of improvements were required to make parallel uploads stable enough for production.
First, we modified our uploader to dynamically vary the number of simultaneous uploads based on the amount of available system memory. This way, devices with lots of memory could enjoy the fastest possible uploads, while older devices would not be overwhelmed. However, we were still seeing much higher memory usage than we expected, so we used the memory profiler to take a closer look.
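A minimal sketch of choosing an upload concurrency level from available memory is shown below. The thresholds and counts are invented for illustration; the post doesn't specify the actual heuristic.

```kotlin
import android.app.ActivityManager
import android.content.Context

fun chooseConcurrentUploadCount(context: Context): Int {
    val activityManager =
        context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memoryInfo = ActivityManager.MemoryInfo()
    activityManager.getMemoryInfo(memoryInfo)
    val availableMb = memoryInfo.availMem / (1024 * 1024)
    return when {
        memoryInfo.lowMemory -> 1    // system is already under memory pressure
        availableMb > 1_024 -> 4     // plenty of headroom: fastest uploads
        availableMb > 512 -> 2
        else -> 1                    // older or low-memory devices
    }
}
```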
The first thing we noticed was that memory consumption wasn't returning to its pre-upload baseline after all uploads were done. It turned out this was due to an unfortunate behavior of the Java NIO API. It created an in-memory cache on every thread where we read a file, and once created, the cache could never be destroyed. Since we read files with the threadpool-backed IO dispatcher, we typically ended up with many of these caches, one for each dispatcher thread we used. We resolved this by switching to direct byte buffers, which don't allocate this cache.
The next thing we noticed were large spikes in memory usage when uploading, especially with larger files. During each upload, we read the file in blocks, copying each block into a ByteArray for further processing. We never created a new byte array until the previous one had gone out of scope, so we expected only one to be in memory at a time. However, it turned out that when we allocated a large number of byte arrays in a short time, the garbage collector could not free them quickly enough, causing a transient memory spike. We resolved this issue by re-using the same buffer for all block reads.
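The sketch below combines both fixes described above: a direct ByteBuffer, so NIO doesn't build its per-thread heap-buffer cache, allocated once and re-used for every block read. The processBlock callback is a placeholder for the hashing and upload work.

```kotlin
import java.io.File
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.StandardOpenOption

fun readInBlocks(file: File, blockSize: Int, processBlock: (ByteBuffer) -> Unit) {
    val buffer = ByteBuffer.allocateDirect(blockSize)  // allocated once, re-used per block
    FileChannel.open(file.toPath(), StandardOpenOption.READ).use { channel ->
        while (true) {
            buffer.clear()                 // reset position/limit before re-use
            val read = channel.read(buffer)
            if (read <= 0) break
            buffer.flip()                  // prepare the buffer for reading
            processBlock(buffer)           // hash/upload this block
        }
    }
}
```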
Parallel scanning and uploading
In the C++ implementation of camera uploads, uploading could not start until we finished scanning a user's photo library for changes. To avoid upload delays, each scan only looked at changes that were newer than what was seen in the previous scan.
This approach had downsides. There were some edge cases where photos with misleading timestamps could be skipped completely. If we ever missed photos due to a bug or OS change, shipping a fix wasn't enough to recover; we also had to clear affected users' saved scan timestamps to force a full re-scan. Plus, when camera uploads was first enabled, we still had to check everything before uploading anything. This wasn't a great first impression for new users.
In the rewrite, we ensured correctness by re-scanning the whole library after every change. We also parallelized uploading and scanning, so new photos can start uploading while we're still scanning older ones. This means that although re-scanning can take longer, the uploads themselves still start and finish promptly.
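A minimal sketch of this producer/consumer arrangement is shown below: the scanner queues photos as it finds them while the uploader drains the queue concurrently, so uploads begin before the full re-scan completes. The Photo type and the scanLibrary/upload callbacks are placeholders for illustration, not our actual code.

```kotlin
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.launch

data class Photo(val uri: String)

suspend fun scanAndUpload(
    scanLibrary: suspend (onFound: suspend (Photo) -> Unit) -> Unit,
    upload: suspend (Photo) -> Unit
) = coroutineScope {
    val queue = Channel<Photo>(capacity = Channel.UNLIMITED)
    launch {
        scanLibrary { photo -> queue.send(photo) }  // producer: the scanner
        queue.close()
    }
    launch {
        for (photo in queue) upload(photo)          // consumer: the uploader
    }
}
```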
Validation
A rewrite of this magnitude is risky to ship. It has dangerous failure modes that might only show up at scale, such as corrupting one out of every million uploads. Plus, as with most rewrites, we could not avoid introducing new bugs because we did not understand, or even know about, every edge case handled by the old system. We were reminded of this at the start of the project when we tried to remove some ancient camera uploads code that we thought was dead, and instead ended up DDOSing Dropbox's crash reporting service. 🙃
Hash validation in production
During early development, we validated many low-level components by running them in production alongside their C++ counterparts and then comparing the outputs. This let us confirm that the new components were working correctly before we started relying on their results.
One of those components was a Kotlin implementation of the hashing algorithms that we use to identify photos. Because these hashes are used for de-duplication, unexpected things could happen if the hashes change for even a tiny percentage of photos. For instance, we might re-upload old photos believing they are new. When we ran our Kotlin code alongside the C++ implementation, both implementations almost always returned matching hashes, but they differed about 0.005% of the time. Which implementation was wrong?
To answer this, we added some additional logging. In cases where Kotlin and C++ disagreed, we checked whether the server subsequently rejected the upload because of a hash mismatch, and if so, what hash it was expecting. We saw that the server was expecting the Kotlin hashes, giving us high confidence the C++ hashes were wrong. This was great news, since it meant we had fixed a rare bug we didn't even know we had.
Validating state transitions
Camera uploads uses a database to track each photo's upload state. Typically, the scanner adds photos in state NEW and then moves them to PENDING (or DONE if they don't need to be uploaded). The uploader tries to upload PENDING photos and then moves them to DONE or ERROR.
Since we parallelize so much work, it's normal for multiple parts of the system to read and write this state database simultaneously. Individual reads and writes are guaranteed to happen sequentially, but we're still vulnerable to subtle bugs where multiple workers try to change the state in redundant or contradictory ways. Since unit tests only cover single components in isolation, they won't catch these bugs. Even an integration test might miss rare race conditions.
In the rewritten version of camera uploads, we guard against this by validating every state update against a set of allowed state transitions. For example, we stipulate that a photo can never move from ERROR to DONE without passing back through PENDING. Unexpected state transitions could indicate a serious problem, so if we see one, we stop uploading and report an exception.
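A small sketch of this kind of transition validation is shown below. The state names come from the description above; the allowed-transition table and the exception type are illustrative guesses, not our actual rules.

```kotlin
enum class UploadState { NEW, PENDING, DONE, ERROR }

// Which target states each state is allowed to move to (illustrative).
private val allowedTransitions = mapOf(
    UploadState.NEW to setOf(UploadState.PENDING, UploadState.DONE),
    UploadState.PENDING to setOf(UploadState.DONE, UploadState.ERROR),
    UploadState.ERROR to setOf(UploadState.PENDING),  // never ERROR -> DONE directly
    UploadState.DONE to emptySet()
)

class InvalidStateTransition(from: UploadState, to: UploadState) :
    IllegalStateException("Invalid upload state transition: $from -> $to")

fun validateTransition(from: UploadState, to: UploadState) {
    if (to !in allowedTransitions.getValue(from)) {
        // In the real feature this would stop uploading and report the exception.
        throw InvalidStateTransition(from, to)
    }
}
```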
These checks helped us catch a nasty bug early in our rollout. We started to see a high volume of exceptions in our logs that were caused when camera uploads tried to transition photos from DONE to DONE. This made us realize we were uploading some photos multiple times! The root cause was a surprising behavior in WorkManager where unique workers can restart before the previous instance is fully cancelled. No duplicate files were being created because the server rejects them, but the redundant uploads were wasting bandwidth and time. Once we fixed the issue, upload throughput dramatically improved.
Rolling it out
Even after all this validation, we still had to be cautious during the rollout. The fully-integrated system was more complex than its parts, and we'd also need to contend with a long tail of rare device types that are not represented in our internal user testing pool. We also needed to continue to meet or surpass the high expectations of all our users who rely on camera uploads.
To reduce this risk preemptively, we made sure to support rollbacks from the new version to the C++ version. For instance, we ensured that all user preference changes made in the new version would apply to the old version too. In the end we never needed to roll back, but it was still worth the effort to have the option available in case of disaster.
We started our rollout with an opt-in pool of beta (Play Store early access) users who receive a new version of the Dropbox Android app every week. This pool of users was large enough to surface rare errors and collect key performance metrics such as upload success rate. We monitored these key metrics in this population for a number of months to gain confidence it was ready to ship widely. We discovered many issues during this time, but the fast beta release cadence allowed us to iterate and fix them quickly.
We also monitored many metrics that could hint at future problems. To make sure our uploader wasn't falling behind over time, we watched for signs of ever-growing backlogs of photos waiting to upload. We tracked retry success rates by error type, and used this to fine-tune our retry algorithm. Last but not least, we also paid close attention to feedback and support tickets we received from users, which helped surface bugs that our metrics had missed.
When we finally released the new version of camera uploads to all users, it was clear our months spent in beta had paid off. Our metrics held steady through the rollout and we had no major surprises, with improved reliability and low error rates right out of the gate. In fact, we ended up finishing the rollout ahead of schedule. Since we'd front-loaded so much quality improvement work into the beta period (with its weekly releases), we didn't have any multi-week delays waiting for critical bug fixes to roll out in the stable releases.
So, was it worth it?
Rewriting a big legacy feature isn't always the right decision. Rewrites are extremely time-consuming (the Android version alone took two people working for two full years) and can easily cause major regressions or outages. In order to be worthwhile, a rewrite needs to deliver tangible value by improving the user experience, saving engineering time and effort in the long term, or both.
What advice do we have for others who are starting a project like this?
- Define your goals and how you will measure them. At the start, this is important to make sure that the benefits will justify the effort. At the end, it will help you determine whether you got the results you wanted. Some goals (for example, future resilience against OS changes) may not be quantifiable, and that's OK, but it's good to spell out which ones are and aren't.
- De-risk it. Identify the components (or system-wide interactions) that would cause the biggest problems if they failed, and guard against those failures from the very start. Build critical components first, and try to test them in production without waiting for the whole system to be finished. It's also worth doing extra work up front to be able to roll back if something goes wrong.
- Don't rush. Shipping a rewrite is arguably riskier than shipping a new feature, since your audience is already relying on things to work as expected. Start by releasing to an audience that's just big enough to give you the data you need to evaluate success. Then, watch and wait (and fix stuff) until your data give you confidence to continue. Dealing with problems when the user base is small is much faster and less stressful in the long run.
- Limit your scope. When doing a rewrite, it's tempting to tackle new feature requests, UI cleanup, and other backlog work at the same time. Consider whether this will actually be faster or easier than shipping the rewrite first and fast-following with the rest. During this rewrite we addressed issues linked to the core architecture (such as crashes intrinsic to the underlying data model) and deferred all other improvements. If you change the feature too much, not only does it take longer to implement, but it's also harder to notice regressions or roll back.
In this case, we feel good about the decision to rewrite. We were able to improve reliability right away, and more importantly, we set ourselves up to stay reliable in the future. As the iOS and Android operating systems continue to evolve in separate directions, it was only a matter of time before the C++ library broke badly enough to require fundamental systemic changes. Now that the rewrite is complete, we're able to build and iterate on camera uploads much faster, and offer a better experience for our users, too.
Also: We're hiring!
Are you a mobile engineer who wants to make software that's reliable and maintainable for the long haul? If so, we'd love to have you at Dropbox! Visit our jobs page to see current openings.
Source: https://dropbox.tech/mobile/making-camera-uploads-for-android-faster-and-more-reliable