Migration complete - new scraper is in use and public backups will be made soon. RebeccaBlackTech merged into Desuarchive (details).
Donations to the archive would be appreciated to help fund our server hardware & storage drives. We are looking for developers to help build new software and archives, discuss here.
Please report any issues you notice with the new scraper on /desu/.


the server is stable but the software is at its final limits

## Developer No.3366
> TL;DR:
* EDIT: 2018/08/20: Frontend server provider's SSD inodes fucked up, but we restored from backup after weeks of painstaking reinstallation.
* We used the search server to rescrape images and posts from the intervening period; the import is still in progress. Unfortunately this means the search server cannot provide search for a few weeks until everything moves back.

* The (backend) hard drives and server are physically fine and did not face any issues.
* But FoolFuuka and Asagi have reached their utter limits. Not to mention that Cloudflare on 4chan itself exerts a limit on the number of requests that can be made, which makes scraping extremely difficult (the same problem as before).
* As such we have had to pause archival of /gif/ and /wsg/ images or face an inability to scrape all images.
* The main admin was incapacitated for the past 2 months and could only respond to issues intermittently in the last month. We need a successor.
* Demonstrate your contribution: help us fix Asagi or develop a replacement. As 4chan's volume grows and the number of posts held reaches Big Data levels, the entire 4chan archival community is reaching the limits of this decade-old software.

If you can help, contact us directly at our bridged channels: irc.rizon.net #bibanon, or our Matrix/Riot.im channel https://riot.im/app/#/room/#bibanon-chat:matrix.org .

## what is up

Since the primary admin is essentially incapacitated due to his brand new soul-crushing job and, recently, a burst eardrum, there was no consensus on making an announcement. But as the hardware provider, I feel it is time for the public to know.

Last year, due to our size, we were the first archiver to face and report the issues caused by Cloudflare's aggressive anti-bot protections on 4chan. You can read a full discussion about that in the thread below. We have shared a workaround with the other living 4chan archivers, but the fact remains that for every 4chan archiver (not just Desuarchive) the situation is not fully resolved, and it exerts a limit on how much can be archived from one node.


In the past month there were two incidents where the site went down without restarting. Due to the primary admin's brand new soul-crushing job, he was unable to respond to notifications for a week. When he returned, we worked out a method to recover the missed images from 4chan's archives.json, significantly mitigating the loss. But without this admin we remain understaffed.
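The recovery step above can be sketched roughly like this: diff 4chan's per-board archive listing (an array of archived thread numbers) against what the database already holds, and rescrape whatever is missing. The endpoint path and function names below are assumptions for illustration, not our actual tooling.

```python
import json
import urllib.request

def fetch_archived_ids(board):
    """Fetch the list of archived thread numbers for a board.

    The endpoint path is an assumption based on 4chan's public JSON API.
    """
    url = f"https://a.4cdn.org/{board}/archive.json"  # assumed endpoint
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def missed_threads(archived_ids, stored_ids):
    """Thread numbers present in 4chan's archive but absent locally."""
    return sorted(set(archived_ids) - set(stored_ids))
```

Anything `missed_threads` returns is still retrievable from 4chan before the thread ages out, which is what limits the damage from downtime.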

Without his expertise, we had to give up archiving /gif/ and /wsg/ images a few weeks ago to save enough requests to reduce dropped images on the other boards. As such, Desuarchive slogs on for now.

## know the stakes

We are not alone. Every archiver of this size will either meet this issue with Cloudflare, die under the strain of scaling up, or drop boards to keep running. Desuarchive, by virtue of being the largest, holding threads from Archive.moe and Foolz, and having the most high-volume boards that keep growing, is the canary in the coal mine for the whole community.

To those who look down on how things pull together here: look at our colleagues and predecessors. Every single time they met a scalability or cost issue like this, they quit and deleted their archiver. RebeccaBlackTech had already abandoned multiple boards, struggled with the Fuuka engine with no updates or optimization, and was about to delete the site until we gave them a hand. Archive.moe and Foolz did much worse and actually lost or even deleted previously archived data. Loveisover already died and deleted everything due to their failure to handle expansion. 4plebs pulls together alright, but that's because they choose not to expand to new boards. You can check out the page we wrote that is a literal graveyard for dead archivers.


The fact is, FoolFuuka and Asagi let them down with their poor design and horrendously inefficient resource usage. 4chan grew too much for them. We are one of the last living archivers because we invested $6000 of our own money in hardware, countless hours shoring up FoolFuuka and Asagi, and $200 a month in maintaining this site. I think we have received maybe $100 in donations to date. So who has a stake in our success? It matters not; one thing you can be sure of is our tenacity. We have done this for 3 years and more already, and we will continue onward with support or without it.

## do it yourself???

If you are not satisfied, we challenge you to run an archiver. Let us reiterate:

If you'd like to start a 4chan archiver of your own, just read our guide to set up FoolFuuka along with Asagi: https://wiki.bibanon.org/FoolFuuka

* FoolFuuka/Asagi is very RAM hungry. Unless it is optimized, archival will continue to be as expensive and unsustainable as it is today.
* For a server that supports a publicly viewable, thumbs-only archive of all boards, you will need 64GB of RAM, a decent post-Sandy Bridge CPU, and at least 500GB for thumbs (to hold all thumbs released on the Internet Archive from Archive.moe, 4plebs, and the like).
* For Desuarchive, 20-40TB of space is necessary just to hold its full images to date (not even counting those from the Archive.moe dump).

Don't be surprised if you face the same travails that we and our many predecessors have. But don't hesitate to ask us for assistance either, because we have long experience with setting up these systems. We are one of the last, but most dedicated support groups for this crumbling, aging piece of software.

Or maybe we can develop an alternative solution so that no more admins will have to suffer the horrors of Asagi again. Perhaps we will call the successor engine and framework Ayase.

## developers and sysadmins halp

Maybe we as anons are poor in money but rich in time.

This is why I call on all who can to help us improve or replace the Asagi scraper engine. This is not just for our own good: the entire 4chan archiver community is at stake. This is an ingenious way for you to show how much you care, and to prove your stake.

**Unfortunately, as the primary admin is incapacitated I lack the expertise in the software to fully explain the issues.**

But in short, there is effectively a limit on how many requests can be made to 4chan from one node. Asagi itself also has a resource limit whereby it spirals out of control and crashes during times of high load from 4chan. Finally, the moment Cloudflare starts returning 503 blocks, nothing can be done except restart the archiver or reduce the number of boards being scraped. From what I understand, Asagi is too inflexible to run more than one instance against one database, so whole new software will be necessary to support multiple scraping nodes.

A new engine for archiving 4chan at scale must be developed. It will require the ability to asynchronously scrape threads and images without consuming too much idle CPU and RAM. It will need to be able to run on multiple nodes while reporting to one MySQL database.
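To make those two requirements concrete (per-node request throttling, and multiple nodes sharing one database without stepping on each other), here is a minimal sketch. The names and numbers are purely illustrative, not a settled design for the successor engine.

```python
import hashlib
import time

def shard_boards(boards, node_id, total_nodes):
    """Deterministically assign each board to exactly one scraper node.

    Hashing the board name means every node computes the same assignment
    independently, so several nodes can feed one database without overlap.
    """
    def owner(board):
        digest = hashlib.sha1(board.encode()).digest()
        return digest[0] % total_nodes
    return [b for b in boards if owner(b) == node_id]

class RateLimiter:
    """Simple token bucket: at most `rate` requests per second,
    to stay under the per-node request ceiling Cloudflare enforces."""

    def __init__(self, rate):
        self.rate = rate
        self.tokens = rate
        self.last = time.monotonic()

    def acquire(self):
        # Refill tokens based on elapsed time, then spend or wait.
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            time.sleep((1 - self.tokens) / self.rate)
            self.tokens = 0
        else:
            self.tokens -= 1
```

A node would call `RateLimiter.acquire()` before every request and scrape only the boards `shard_boards` assigns it; adding capacity then means adding nodes and bumping `total_nodes`, with no coordination beyond the shared database.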

## why no public statement

Know that I have been reading your posts on this board; simply due to lack of consensus while the primary admin was incapacitated, I was not able to respond, apologies. But let's be real: for most server issues there is really not much that debating with the crowd can solve. I can count on my two hands the number of people in the world who are qualified to operate FoolFuuka and Asagi at scale.

But this moratorium ends now. If you are interested in assisting us, drop in a pull request or drop into our channel at these communication points.

If you are interested in becoming a volunteer sysadmin for Desuarchive, stop by the channel. You must prove that you 1. have at least 2 hours per day of free time to volunteer and 2. have comfortable experience with most of the following:

* Using command-line Linux
* Setting up virtual private servers, particularly LXC containers
* Setting up, partitioning, and recovering a ZFS RAID (without GUIs or wizards)
* Building Nginx webserver configurations
* Setting up a functioning instance of FoolFuuka and Asagi (**strongly recommended**)
* Tuning MySQL databases with Percona, TokuDB, and other optimizations
* Setting up Sphinxsearch at scale with multiple nodes
* Setting up node.js and Java instances and keeping their resource usage low

This post was modified by Desuarchive Administrator on 2018-08-21

No.3873
Every image is broken, what happened?

No.3863
Due to the KyoAni Animation Studio arson tragedy on July 18, 2019, 4chan mods added stickies on /a/, with one sticky reaching 8000 posts (supposedly going on to reach 32 thousand) and then a rollover sticky afterwards that automatically deleted posts. This seems to have created a fatal error in the /a/ archive for that particular stickied thread.

For example, when I try to view the thread, I get:

Whoops, looks like something went wrong.

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /var/www/foolfuuka/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php on line 183

The sticky is still active on /a/ with no end in sight, so would it be best to wait for that sticky to end before doing anything? The thread I have been referencing is thread number 190751690, but I am unsure if there was more than one sticky over the KyoAni arson tragedy.

API limitations

No.3859

I've been trying to download all the posts from a specific board within certain date ranges, but the API only allows me to limit ranges by date (i.e. YYYY-MM-DD) instead of unixtime.
This is a problem because the API limits me to 5000 posts for a given query, while some days in my query range have over 12,000 posts, making it impossible to retrieve a full day.

Is there a way for me to get a raw dump of the posts if I email the relevant date intervals (I don't think Desuarchive has uploaded to archive.org), or some other option I've missed for either increasing the maximum number of returned posts or querying a smaller time interval?

This is for a research project that's attempting to analyse the content and discussions over time on a bunch of online communities, and it would be a shame if I either had to scrape desuarchive via raw html or worse, not include 4chan at all.


No.3857


holy shit what a cancerous captcha, why the fuck do you think a 10 minute captcha is a good idea

the search bar doesn't work properly

No.3851
The thing opens for a split second when you click on the bar, but goes away the moment you release the mouse button, so it's essentially impossible to use the advanced options.

No.3838
Search just plain doesn't work. Clicking on it causes it to flash in and out. Holding down click allows you to type and enter a search term, but all the other search functions (image hash, name, tripcode, time range, etc.) are inaccessible.

It's been like this for almost a month.

Is there a way to download archived threads?

No.3830
I want to download whole threads (and the original-sized pictures, not screenshots of the threads). I am planning on making a mini organized archive for a general on /mlp/ that spans many years and many dozens of threads. Is there a way to download whole threads off this archive?

request for /mlp/

No.3834
The /mlp/ image archives are incomplete. I have noticed that the missing images are sometimes (likely) recoverable from Derpibooru (an MLP image site).
Example: https://desuarchive.org/mlp/search/image/64QchVBaJWLdqB_tG1Xmlg/

Filename is: 626820__safe_solo_humanized_trix...
The ID maps to this: https://derpibooru.org/626820
And that's how you can recover Derpibooru images that aren't on Desuarchive. Question: how hard would it be to automate a Derpibooru lookup option for posts on /mlp/, alongside the other options (View Same, Google, ImgOps, iqdb, SauceNAO)?

Like a regular expression for ^[0-9]+ that links you to https://derpibooru.org/idgoeshere .

It would have some false positives (4chan and similar boards also had numeric filenames) but could help recover lost images.

For files named like 1343457__tags_go_here , it's almost certain to be a Derpibooru image.
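The mapping described above can be sketched in a few lines. The function name and the example filenames are hypothetical; the two confidence tiers mirror the post's observation that the `id__tags` form is almost certainly Derpibooru, while a bare numeric prefix is only a candidate (4chan filenames are numeric too).

```python
import re

# High confidence: Derpibooru export filenames look like "626820__safe_solo..."
DERPI_EXPORT = re.compile(r"^(\d+)__")
# Low confidence: any numeric prefix (risks false positives on 4chan filenames)
NUMERIC_PREFIX = re.compile(r"^(\d+)")

def derpibooru_url(filename, strict=True):
    """Return a candidate Derpibooru URL for a filename, or None.

    With strict=True only the id__tags form matches; strict=False also
    accepts bare numeric prefixes, trading precision for recall.
    """
    pattern = DERPI_EXPORT if strict else NUMERIC_PREFIX
    m = pattern.match(filename)
    return f"https://derpibooru.org/{m.group(1)}" if m else None
```

A frontend could run the strict pattern automatically and only offer the loose one as a manual "try Derpibooru" option, keeping false positives out of the default view.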

No.3820
Is Nyafuu dead for anyone else? I get a message that says "The Sponsored Listings displayed above are served automatically by a third party. Neither Parkingcrew nor the domain owner maintain any relationship with the advertisers." when I try to access it.