2019-12-03: We regret to inform users that the FoolFuuka/Asagi archiver stack used by Rbt.asia, Desuarchive, Archived.moe, and others is suffering from reduced accuracy and missing posts when scraping 4chan. New, more efficient, and more accurate scrapers are in development, but they need assistance and testing from C# .NET and MySQL/Percona developers before they are ready for use. On behalf of all archivers: unless FoolFuuka/Asagi is replaced, we will be unable to scrape properly under the strain of deep software inefficiency and unsustainable costs (as happened to Fireden). More details here.
Donations to our site would help to ensure a normal lifecycle replacement of drives in our RAID.
Please refrain from spamming the ghostposting system or it may not be around for long.

## Developer No.4019
BTW if anyone knows any 4chan admins, please tell them to unblock or increase the Cloudflare limits on the desuarchive/rbt.asia and archived.moe scrapers. The method used to hold us over for the past year does not seem to be working well anymore and it is failing to archive threads periodically.

FYI, these are the relevant actions taken by the archived.moe admins; we are using the same temporary fix: https://archived.moe/talk/thread/14/#q141_200

This reveals that our time is up with the FoolFuuka/Asagi archiver stack. It has not had active development since the collapse of Archive.moe in 2016, and it is too inefficient with bandwidth to function any further. It is not worth feeding both FoolFuuka and Asagi additional RAM and resources when the stack is just as inefficient with requests; it is already consuming 168GB of RAM as it is. It is only a matter of time before the other archivers meet this fate too.

This is an unexpected development, and a fix could take weeks if not months. The C# .NET developer of Hayden is working hard on it, with huge improvements in efficiency and request limit compliance, such as reducing RAM use to 100MB. But his time is limited, and we wish we could have bought some more time.

To anyone who can help: call up all C# .NET developers and skilled MySQL/Percona DBAs to try to bring the Hayden code up to scratch as a suitable drop-in Asagi replacement. Hayden is already demonstrably more efficient and accurate than Asagi; however, it is not fully tested and often hits deadlock issues. Once the scraper is replaced, we can also work together with Python developers to build a new frontend replacement for FoolFuuka, as described in the previous thread. Using Hayden will allow us to consolidate our operations on s2.desuarchive.org instead of s1.desuarchive.org on a separate continent, immediately saving $90 a month. It also removes a Sword of Damocles by eliminating s1, which the previous admin (peace be upon his wrists) is likely to default on due to crippling medical bills.

There are some download reliability issues that need testing to be able to use in production, but we wish we had some more time before this happened.


In the future we would like to totally overhaul the FoolFuuka/Asagi stack with a total replacement for the sole use of Desuarchive, but a drop in Asagi replacement is crucial as the vast majority of other archiver admins such as archived.moe have lives to live and are unlikely to take up any SQL schema changes.

Feel free to drop by the chat to help brainstorm what can be done.



It is helpful to read our previous thread on the topic as well: https://desuarchive.org/desu/thread/3894/

EDIT: Modified to clarify that the FoolFuuka/Asagi frontend, backend, and inefficient MySQL schema are all to blame for the situation.

This post was modified by Desuarchive Administrator on 2019-12-07

## Developer No.3894
## TL;DR Regarding the Absolute State of 4chan Archival

At Desuarchive we have long struggled with many issues and unsolved mysteries inherited from the previous admin (peace be upon his wrists), but we have now put the archiver on a more stable footing, and there is at least some development going on with the scraper, so things are looking up, as you may have seen this past year.

It is imperative for the survival of all 4chan archivers that Java-based Asagi is replaced (especially given the downfall of Fireden) and significant efficiency improvements are made in both excessive HTTP requests and RAM usage, while providing the same reliability and accuracy. As 4chan grows all archivers will be in grave danger of dying under the strain of deep software inefficiency and unsustainable costs if this is not done.
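One common way to cut down on HTTP requests can be sketched as follows. This is only an illustration, not necessarily how Eve or Hayden work: it assumes the scraper polls a board's `threads.json` listing once per cycle and re-downloads only the threads whose `last_modified` value changed. The function name `diff_threads` and the `seen` dictionary are made up for this example.

```python
def diff_threads(threads_json, seen):
    """Compare a board's threads.json listing against the previous poll.

    threads_json: parsed threads.json body (a list of pages, each with a
                  "threads" list of {"no": ..., "last_modified": ...}).
    seen:         dict of thread number -> last_modified from the last poll
                  (hypothetical scraper state, not part of any real scraper).

    Returns (changed, current, gone):
      changed: thread numbers that are new or were modified, so need a fetch
      current: the state to carry into the next poll
      gone:    threads no longer listed (archived or pruned on 4chan)
    """
    current = {
        t["no"]: t["last_modified"]
        for page in threads_json
        for t in page["threads"]
    }
    changed = [no for no, ts in current.items() if seen.get(no) != ts]
    gone = sorted(no for no in seen if no not in current)
    return changed, current, gone
```

With this approach an idle board costs one listing request per poll instead of one request per live thread, which makes it much easier to stay under the API's limit of one request per second.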

Archives are not going to be sustainable, as seen with Fireden, if only one dude has to shoulder the weight of thousands of dollars of equipment and bandwidth usage. The next archiver on the deathwatch appears to be Warosu.

On behalf of all 4chan archives, we need your help with the two scrapers being developed. These scrapers are currently designed to be Asagi-compatible, as a future drop-in replacement for all other archivers. That is no small feat: Asagi regularly uses 40-60GB of RAM at full load, but these could use as little as 30-150MB.

https://github.com/bibanon/eve - Python based scraper. We currently actively test it in production with /wsg/ scraping.

https://github.com/bbepis/Hayden - C# .NET light scraper, still needs testing for evaluation. But it's doing real great.

As such, we hope to be able to build a brand new archival stack based on these that dispenses with the inefficiencies of the scrapers of the past, using PostgreSQL JSONB to store threads exactly as they come from the 4chan API (NoSQL style). While we are not frontend developers, we can sidestep this by building middleware to emit a 4chan-compatible API. That way 4chan-X can be used as the JavaScript webapp, and Android apps (Chanu, Clover) and iPhone apps could be modified with a few lines of code to work with the archive.
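A minimal sketch of the "store it exactly as the API gave it" idea is below. It is an assumption-laden illustration, not our actual schema: SQLite's `TEXT` column stands in for PostgreSQL JSONB so the example runs anywhere, and the table and function names (`threads`, `open_store`, `save_thread`, `load_thread`) are invented for this example.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory store. SQLite is a stand-in; in PostgreSQL the
    `body` column would be JSONB rather than TEXT."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE threads ("
        " board TEXT NOT NULL,"
        " no INTEGER NOT NULL,"
        " body TEXT NOT NULL,"
        " PRIMARY KEY (board, no))"
    )
    return db


def save_thread(db, board, thread):
    """Store a thread document as received, keyed by board and OP number.

    `thread` is the parsed 4chan API thread JSON: {"posts": [...]}, where
    the first post is the OP.
    """
    op_no = thread["posts"][0]["no"]
    db.execute(
        "INSERT OR REPLACE INTO threads (board, no, body) VALUES (?, ?, ?)",
        (board, op_no, json.dumps(thread)),
    )


def load_thread(db, board, no):
    """Return the stored document unchanged, or None if it is not archived."""
    row = db.execute(
        "SELECT body FROM threads WHERE board = ? AND no = ?", (board, no)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Because the stored document already has the 4chan API shape, a middleware endpoint could in principle return the output of `load_thread` as-is, with no per-column translation step.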

In support of research and onboarding for this effort, 4plebs has generously developed partial 4chan API compatibility for the FoolFuuka frontend, which is slowly being rolled out. This will allow Android and iPhone applications to view the FoolFuuka archivers (but not ghostpost yet). If you are a PHP developer, we need your help here.


They also developed 4plebs X, which uses 4chan-X to function as a webapp frontend and may be able to utilize this 4chan API to replace the user-facing part of the PHP HHVM FoolFuuka stack with a familiar alternative. It has flaws such as the lack of search and ghostposting, but hopefully developers could step in regarding that.


Demo: https://test.4plebs.org (to use it, disable 4chan-X to avoid conflicts).

If you know any third-party 4chan app devs, please refer them to us so we can direct them on how to set the proper configurations for their app to access FoolFuuka archives. (There was an old FoolFuuka API already, but it predates the 4chan API, so it is not directly compatible; best to move off it.)

We are willing to provide support and troubleshooting for better understanding of FoolFuuka/Asagi instances for the construction of new ones or development of replacement scrapers, or if anyone wants to pick up the boards of Fireden. We have institutional knowledge and experience running many major archival websites gathered over 2 years, so don't hesitate to drop by.



Our guide could use some work but it will guide you there with some hiccups.


## Regarding the Absolute State of Fireden and /v/ and /vg/ archival

Fireden is infamous in the community for never reaching out for help or advice, and never acting on anything other than abuse emails. I don't think they ever planned to operate for this long; they were set up on a whim in 2015 after archive.moe died, so they probably just had enough. It costs a lot to operate a site that can scrape /v/ and /vg/ images. But if the Fireden admin is reading this, be the prodigal son: we can provide any assistance or backup you need so that your hard work is not in vain.

The next archiver I expect to collapse under pressure is Warosu. As for us we are pretty stable after a $500 chassis upgrade and hot spare SSDs, but it really sucks to be one of the few people in the world who puts a large amount of capital into 4chan archival.

We refuse to pump more money in to bail out any more archivers for barely any returns. We have had to bail out 4 of them already, and have paid $7000 to date out of pocket plus $200 a month. Can't someone else pony up?

The best bet is for a large capital investment to be made on arch.b4k.co so it can be significantly upgraded to our standards to match the levels of Fireden, providing /vg/ scraping and full images for both. It will actually not cost too much to start out: maybe only 5x10TB drives for $700, $300 for a new case, maybe $600 for a new AMD Ryzen with 160GB of RAM for Asagi and MySQL, and $100 for colocation. We will probably never see the Fireden images ever again, so that saves a lot of space.

4plebs refuses to take on any more boards as they are barely able to handle the ones they have.

## Basic Details about the Maintenance Done

This weekend we managed to do a major case upgrade for $500 on our backend image server to allow it to host more services such as scrapers and frontend content. All SSDs were moved out of the internal bay and into hotswap bays, and a hot spare SSD for booting was added. Without those, the server was really difficult to service, which made it hard to consider using it to host databases safely. It may be possible to attach at least 6 more 3.5" drives, which will be necessary as only 10TB of storage is available.

This may make it possible to halve the costs of cloud servers and bandwidth that we currently use by consolidating service together into a single server.

1 drive with bad sectors was replaced safely for $150 and a ZFS resilver completed. The other drives do not appear to have issues, but we continue to monitor the situation.

Tests done with the bibanon/eve scraper for scraping /wsg/ have been extremely promising, though development is still ongoing to put it on par with the Asagi scraper. It is possible that any new deployment of the scraper will utilize either this or hayden, but proper testing will still be necessary.

## Admin No.3026
Welcome to /desu/. Use this board to report issues, request features, and for other discussions regarding desuarchive.org & rbt.asia. Other posts will be removed.

When reporting a technical issue, be sure to include the full URL of the page/image.

Do not use this board for removal requests, which must be emailed to [email protected] Other rule violations can be reported by clicking the "Report" button on the post.

Heads up admins

No.4172
Hey it's late and I'm tired. /m/'s resident schizo is spamming your ghostposting system. If this causes the site strain I would do something about it if I were you.

Post your board's booru shit

## Developer No.4157
Occasional-assistant-with-Desuarchive here; got bored a few weeks back and decided to look into the suggestions in >>3834. tl;dr: find missing images for /mlp/, and if the filenames look like the pony booru filename format, download them and add them to the archive.

It was a pain in the ass, but I did it, and /mlp/ now has ~80,000 more images in the archive than it used to. There are still some questions around how to regenerate thumbnails for webms and rough edges, but adding shit into the archive from external sources is now down to 'tedious and time-consuming' levels of difficulty.

If your board has a popular booru or other media site with distinctive filenames, we can look at scraping those as well. Post that shit, and we'll look into it with our usual focus and responsiveness.

Furry edits

No.4160
Edits about furry here (any type of edit allowed)

Bulk data?

No.4146
How can I make very large requests from Python or the API to gather a crap ton of posts spanning many years from a certain board (to feed it to GPT-2)? Or do I have to ask the administrator for that?

Unironically using jewgles recaptcha

No.4125
You kikes never fucking end. first they add recaptcha to 4chan, then they take down (((8chan))), then they disable no js, then you fuckers add it to this website. Consider suicide, the world would be a better place without """""""Developers""""""" who shove bloatware and botnets into every corner of the shiternet.

No.4116
why don't quotes work when searching for filenames?
example: https://desuarchive.org/a/search/filename/%22id%20invaded%2001%22/
this should only return screenshots of the first episode, but since the quotes are ignored it can return other episodes too if they have 01 somewhere in the filename (the timestamp and date often have numbers in them)

i found this on the internet:
>$query->match('comment', $input['text'], true);
>$query->match('media_filename', $input['filename']);
is there a reason the filename search can't have true as the third parameter too? from what i understand it's the part enabling special search syntax like quotes